Venture the Void - Data Work

Same/Distinct Duality

I think a lot about the duality between "same" and "distinct". In some ways it seems to be at the bottom of so much human reasoning.

For instance: in order to count things, there needs to be some concept of things being different from each other (the things we point to and go to 1, 2, 3 while counting them) but also the same as each other (the reason we have them together in the first place.)

This duality also presents itself as the "naming problem". When we name something, we are drawing a boundary around it, in effect applying a label to whatever is "inside" the boundary. This is the "distinct" part. The "same" part surfaces when we realize that we normally name things to differentiate them from other things that we would otherwise consider equivalent.

We do this intuitively, and the most obvious example is naming humans. We name a person precisely because just saying "person" would not be useful-- we tend to care a lot about which individual person we are dealing with.

How specific we get depends on what we're naming. In fact, when we name non-human things, it can get a bit weird.

For instance: we name types of plants but not normally individual plants, unless we become attached to them and start to anthropomorphise them. We generally name pets, but not other individual animals.

There is a lot to think about here.

It might seem the "same" part of this duality is suspect. After all, can't we create a set of any variety of things, and count them? In what way are they necessarily really the same?

But the process of grouping them, itself, depends on some form of sameness. In the language of set theory, inclusion in the set is defined by a well formed formula (or... something like this.) That means there is some function f where for a given input X, we can determine if f(X) is true or false (or... something like this.)

So the "well formed" part of the formula means it's logically sensible; i.e., to create a set of things to count, we need a logical way to determine that they are part of a set. So by some criteria (that is, whatever the well-formed formula is determining about them) they must be the same.
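As a rough sketch of that idea in code (the Thing struct and the in_set test are made up for illustration, and this is programming, not real set theory): membership is decided by a function that answers true or false, and counting only happens over the things that pass it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical "thing"; the fields are invented for this example.
struct Thing {
    std::string label;
    bool edible;
};

// The membership test f(X): true if X belongs in the set, false otherwise.
bool in_set(const Thing& x) { return x.edible; }

// Each member is distinct (each adds 1 to the count), but only things that
// are "the same" according to in_set get counted at all.
int count_members(const std::vector<Thing>& things) {
    return static_cast<int>(std::count_if(things.begin(), things.end(), in_set));
}

void counting_example() {
    std::vector<Thing> things = {{"apple", true}, {"rock", false}, {"pear", true}};
    assert(count_members(things) == 2);  // the rock is not in the set
}
```

The apple and the pear are different things, but as far as in_set is concerned they are the same, and that is what lets us count them together.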

I wonder how this relates to the Axiom of Choice, or if it does. I don't have any insight or ideas here. Also I took Set Theory a long time ago so I probably have the finicky details all wrong here as it pertains to well-formed formulas. Please read up if you are curious, it's very interesting stuff!

In a computer program, naming variables can be tricky.

int X = 2;

Seems straightforward enough, but there are already some questions-- does X stand for the number 2 itself (i.e., X is constant and unchanging, it is always 2) or is X some value we are calculating?

If the first case, is X another name for 2 (which we ordinarily call, well "2") and if so why do we need it? Can X be constant and unchanging, equal to 2, but still not mean the same thing as we mean when we say 2?

This is hard because it's not easy to draw boundaries around abstract concepts.
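One way to see the question in code is a small sketch where two constants hold the same value but mean different things. The identifiers here are invented for illustration.

```cpp
#include <cassert>

// Both constants are always 2, but each name draws a boundary around a
// different idea. If the rules of a duel change, only one of them changes.
constexpr int kWheelsPerBicycle = 2;
constexpr int kPlayersPerDuel = 2;

int wheels_needed(int bicycles) { return bicycles * kWheelsPerBicycle; }
int players_needed(int duels) { return duels * kPlayersPerDuel; }

void constants_example() {
    // Equal as numbers today, which is a coincidence rather than a rule.
    static_assert(kWheelsPerBicycle == kPlayersPerDuel, "same value today");
    assert(wheels_needed(3) == 6);
    assert(players_needed(3) == 6);
}
```

Writing a literal 2 in both functions would compile to the same thing, but it would lose the distinction the names are making.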

Object oriented programming is well-founded because an object (or instance) is something we would intuitively name, and a class defines how those objects are all similar. This is just the same-slash-distinct duality.

Put another way, for a given class, instances are both different to one another (we can name them and treat them separately) and also the same as one another (we can do the same sorts of things with them, and they behave the same.)

As we understand a problem better, we redefine what classes mean as we draw our abstract boundaries differently. Today in our client program we have a class for Server-- tomorrow it's ServerConnection, ServerRequest, ServerResponse, and so on.

The same-slash-distinct duality is at the heart of object oriented programming.

Functional Representation of Truth

When I started Venture the Void, it was going to be all hand-built. I would edit some files that described the world. These were text files in a special (and terrible) data language I called "info".

But the task of creating the world by hand became too time-consuming.

I decided the way around this was to use procedural generation. Since I already had a (partly) working game, I built another tool "universe maker" or "umake" that would generate the info files.

Umake works by building an abstract model of the world internally first, with some random variables and so on. It then spits out info files for the game to consume. These are then packaged up into "world files".

The game itself does not know what the original abstract model of the world was that umake created. Any info that is not adequately encoded in the info files cannot be accessed.

The abstract model that umake holds internally might know "this is a lava-type planet". The info files are much more specific and lack context, more along the lines of "use this texture, in this way."

As another example, umake's world model contains information about relationships between NPCs. But the info files just contain dialog and quest data. The quests themselves work because all the pieces are there to execute them, and the dialog conveys the relationships so the player can understand it, but from the perspective of the game while it is running, the underlying relationships are lost.

When hand-crafting a world, this isn't an obvious problem. The higher level concept only needs to exist in the creator's mind.

When dealing with procedural generation, though, there are a ton of problems. Two are:

  • bugs are hard to track down; we can track the bug back to the info file, but then we have to reverse engineer umake to tell where that info file came from; e.g., if quest dialog feels "off", we don't have access to any information in the world files to tell us why.
  • any small change we want to make to how the abstract world model is generated requires spitting out the entire collection of info files again; in practice this slows development to a crawl and is painful.

Ultimately, it can be seen as an unnecessary layer:

[umake] <=> [world model] => [info files] <=> [game]

Because the [world model] => [info files] part is a one way street, the game can't ever "reach back" to the world model and find out what was actually intended.

From a development workflow perspective, we can't reinterpret the world model in a better way by changing the game (e.g., to use a better lava texture); we have to go all the way back to umake. This simply slows and complicates the creative process of making things.

To get around this, I want to persist the world model, and rework the game engine to process the world model directly, not relying on intermediate info files at all. Anything that would have been in the info files will now just be derived as it's needed.

Umake will still exist, but it will spit out files that represent the world model itself. These can then be loaded back to recreate the generated world model. The translation into concrete data (what texture to use, what dialog, and so on) will happen directly in the game, as much as possible.

These specific choices (e.g., what texture to use) will, therefore, be defined as functions on the original world model. It becomes possible to change how the world model is interpreted much more quickly.
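A minimal sketch of what "a function on the world model" could look like. The types, field names, and texture file names are assumptions for illustration, not the actual game data.

```cpp
#include <cassert>
#include <string>

// Hypothetical slice of the world model.
enum class PlanetType { Lava, Ice, Desert };

struct Planet {
    std::string name;
    PlanetType type;
};

// The concrete choice is derived on demand instead of being baked into a
// generated file. Editing this function changes how every lava planet looks,
// with no regeneration step.
std::string texture_for(const Planet& planet) {
    switch (planet.type) {
        case PlanetType::Lava:   return "lava_cracked.png";
        case PlanetType::Ice:    return "ice_sheet.png";
        case PlanetType::Desert: return "dunes.png";
    }
    return "default.png";  // only reached if the enum holds an invalid value
}

void texture_example() {
    Planet p{"Vulcan", PlanetType::Lava};
    assert(texture_for(p) == "lava_cracked.png");
}
```

The planet still knows it is a lava-type planet, so the game can always ask a different question of it later.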

This will happen gradually, so I can upgrade pieces one by one.

The immediate motivation for that is to change how planets and other things are rendered. But eventually the whole picture will be much simpler and more flexible.

A side note, but this is one of the patterns in the refactoring book. Put another way, this is one widely-accepted way to simplify computer programs-- replace variables with functions where you can. Instead of storing the result of a computation, just re-compute it every time you need it.

Intuitively this seems bad because it feels like unnecessary work for the computer. In practice, and especially when working in a programming language that lets you blur the line syntactically between variable names and function calls (like Ruby), it's very effective at managing complexity.
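Here is a small sketch of that refactoring in C++, using a made-up CargoHold type. C++ doesn't blur the line the way Ruby does (you still write the parentheses), but the idea is the same.

```cpp
#include <cassert>
#include <vector>

// Hypothetical cargo hold, purely for illustration.
struct CargoHold {
    std::vector<int> crate_weights;

    // Before: an `int total_weight;` member that every add or remove had to
    // remember to keep in sync. After: recompute it whenever it is asked for.
    int total_weight() const {
        int total = 0;
        for (int w : crate_weights) total += w;
        return total;
    }
};

void cargo_example() {
    CargoHold hold;
    hold.crate_weights = {5, 7};
    assert(hold.total_weight() == 12);
    hold.crate_weights.push_back(3);    // no second variable to update
    assert(hold.total_weight() == 15);  // so it cannot be stale
}
```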

Side side note, but concerning yourself with efficiency is a good goal. It's even noble. Full stop. It's just that you need to work backwards, from a complete running program, and figure out where the wasted cycles (hence energy, hence time, and so on) are in the first place, NOT work forwards and try to predict where those will be beforehand.

At least in my experience, my intuition about where cycles are wasted is always wrong, wrong, wrong!

So, the above refactoring is NOT something to avoid in the interest of optimization. In fact, the opposite is the case-- if your program is already simple to manage overall, those few places you need to really optimize it (for instance, by storing the results of expensive computations) are not a problem. Moreover, you can make higher level optimizations more easily.
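As a sketch of what "storing the results of expensive computations" can look like once measurement says it's worth it: the class here is hypothetical, and the recomputation counter exists only to make the caching visible.

```cpp
#include <cassert>
#include <vector>

// Hypothetical ledger. Callers only ever call total(); whether the answer is
// cached is a private detail that can be added or removed later.
class Ledger {
public:
    void add(int amount) {
        entries_.push_back(amount);
        cache_valid_ = false;  // the stored result is now stale
    }

    int total() const {
        if (!cache_valid_) {
            cached_total_ = 0;
            for (int e : entries_) cached_total_ += e;
            cache_valid_ = true;
            ++recomputations_;
        }
        return cached_total_;
    }

    int recomputations() const { return recomputations_; }

private:
    std::vector<int> entries_;
    mutable bool cache_valid_ = false;
    mutable int cached_total_ = 0;
    mutable int recomputations_ = 0;
};

void ledger_example() {
    Ledger ledger;
    ledger.add(2);
    ledger.add(3);
    assert(ledger.total() == 5);
    assert(ledger.total() == 5);
    assert(ledger.recomputations() == 1);  // second call used the cache
}
```

Because total() was already a function, adding the cache didn't change any calling code.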

This is not a small thing but in practice it's a hard lesson.


Anyways, this brings us to serialization, which is the process of encoding objects in some way that their state can later be decoded. Generally this means creating a string to write to disk so that later it can be read back, and the original object re-created.

Serialization is hard. The reason why is that a computer program typically does not have a name for its objects. We think we have named objects. But we actually just name variables that we use to refer to the object. We name references to objects.

The objects themselves should better be thought of as numbered automatically by the memory model. An object's name in a typical computer program is a number defining its location in memory, not the name of the variable that holds it in a particular context. This makes sense when you consider that multiple variables might refer to the same object, but if something lives in the same place in memory, it can't (generally) be thought of as being more than one thing.

Put into code, you can think about it this way:

int* X = new int;
*X = 2;
int* Y = X;

The object here is a particular copy of the number 2. Its "name", or the closest it has to it, is the memory address it has been stored at. It's neither X nor Y.
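To make that concrete, here is a small extension of the snippet above. The same_object helper and the extra variable Z are additions for illustration.

```cpp
#include <cassert>

// Two variables refer to the same object exactly when they hold the same
// address, which is the closest thing the object has to a name.
bool same_object(const int* a, const int* b) { return a == b; }

void aliasing_example() {
    int* X = new int;
    *X = 2;
    int* Y = X;           // a second variable referring to the same object
    int* Z = new int(2);  // an equal value, but a different object

    assert(same_object(X, Y));
    assert(!same_object(X, Z));

    *Y = 3;               // a change made through Y is visible through X
    assert(*X == 3);
    assert(*Z == 2);

    delete X;             // one object, so one delete (Y now dangles)
    delete Z;
}
```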

Serialization is hard because when serializing or deserializing an object, it will move around in memory. Our names for objects are typically implicit (memory addresses) and change when deserializing. Renaming things randomly this way tends to break the entire logical model of the program, unsurprisingly since names are how we differentiate between things.

If some object "A" knows about another object "B", and it also has complete control over "B", then generally B can be wrapped into A's serialization. I.e., when A serializes itself, it also serializes B. Deserializing A, in turn, produces a new but equivalent B.

This falls apart though when another object "C" knows about "B" as well (for instance, it has a pointer to it.) If A creates a new B on deserialization, it also has to somehow tell C where B now lives.
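A stripped-down sketch of the problem, with A, B, and C as bare structs. The "serialization format" here is just the number as text, which is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct B { int value = 0; };

// A owns its B outright, so it can fold B into its own serialized form.
struct A {
    std::unique_ptr<B> b{new B};

    std::string serialize() const { return std::to_string(b->value); }

    static A deserialize(const std::string& text) {
        A a;
        a.b->value = std::stoi(text);  // a new but equivalent B
        return a;
    }
};

// C merely knows about B, through a raw pointer (an address).
struct C { const B* b = nullptr; };

// True if c still refers to the B that a owns.
bool points_into(const C& c, const A& a) { return c.b == a.b.get(); }

void shared_reference_example() {
    A original;
    original.b->value = 7;
    C c;
    c.b = original.b.get();
    assert(points_into(c, original));

    A restored = A::deserialize(original.serialize());
    assert(restored.b->value == 7);      // equivalent contents...
    assert(!points_into(c, restored));   // ...but C knows nothing about it
}
```

Nothing in the serialized text says that C cared about B, so the round trip has no way to repair C's pointer.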

This can get complicated and ugly.

The solution I use is to explicitly name my objects. I do this with strings. Every object created has a name and it can be looked up uniquely by that name. Things don't hold pointers to memory but look them up in the game engine when they need them.

I use strings because they are a bit easier to debug if something goes wrong. The names are typically some short description (e.g., "frog") followed by an increment (so the first is "frog-0", the next is "frog-1", and so on.) You could also use integers for efficiency.

Again, the names are globally unique. If frogs are being spawned by the game, we might well end up with "frog-10582" at some point. But numbers don't run out.
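A minimal sketch of this kind of naming scheme. The Registry and Entity types are hypothetical, and keeping a separate counter per description is an assumption, not necessarily how the game engine does it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical game object.
struct Entity {
    std::string kind;
    int hp = 10;
};

class Registry {
public:
    // Creates an entity and returns its globally unique name, e.g. "frog-0".
    std::string spawn(const std::string& kind) {
        // operator[] starts a missing counter at 0; ++ advances it for next time.
        std::string name = kind + "-" + std::to_string(next_id_[kind]++);
        Entity e;
        e.kind = kind;
        entities_[name] = e;
        return name;
    }

    // Looks an entity up by name; returns nullptr if no such name exists.
    Entity* find(const std::string& name) {
        auto it = entities_.find(name);
        return it == entities_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, int> next_id_;
    std::map<std::string, Entity> entities_;
};

void registry_example() {
    Registry registry;
    std::string first = registry.spawn("frog");
    std::string second = registry.spawn("frog");
    assert(first == "frog-0");
    assert(second == "frog-1");
    // Other code holds on to the name and looks the object up when needed.
    registry.find(second)->hp -= 3;
    assert(registry.find("frog-1")->hp == 7);
}
```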

In practice, this really solves a bunch of other problems besides serialization, because now there really is an explicit name for everything.

The name draws a boundary around the object. We know we can always just refer to it by its name. The name doesn't change no matter how many times we reserialize it.

What's interesting is to consider whether serialization actually destroys the object or not, now. I think it's fair to say it does not. From the perspective of human reasoning, the name has not changed. We can reason unambiguously about the object whether it is saved to disk as a string or living in memory in a running program.
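Here is a sketch of a round trip where references are names rather than pointers. The one-line-per-object text format is invented for this example, and it assumes names contain no spaces and that "-" stands for "no target".

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical record: `target` holds the name of another object, not a pointer.
struct Npc {
    std::string target;  // "-" means no target
};

using World = std::map<std::string, Npc>;  // name -> object

// Writes one line per object: "<name> <target>".
std::string serialize(const World& world) {
    std::ostringstream out;
    for (const auto& entry : world)
        out << entry.first << ' ' << entry.second.target << '\n';
    return out.str();
}

// Rebuilds the world; every object comes back under the name it had before.
World deserialize(const std::string& text) {
    World world;
    std::istringstream in(text);
    std::string name, target;
    while (in >> name >> target) world[name].target = target;
    return world;
}

void round_trip_example() {
    World world;
    world["frog-0"].target = "fly-3";
    world["fly-3"].target = "-";

    World restored = deserialize(serialize(world));
    // The frog still refers to "fly-3", and "fly-3" still exists, even though
    // every object now lives at a different address.
    assert(restored["frog-0"].target == "fly-3");
    assert(restored.count("fly-3") == 1);
}
```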

This is why database systems generally have a "key" or primary index. Each row is like an object, each table is like a class, and rows are uniquely named by their key.

Database systems are built to persist in the first place.

Naming objects is similar to storing them in a database like this. It makes serialization easy (well, easier), but it also lets you do other neat things you normally do with databases, e.g., global searches for objects matching certain criteria.

Apart from serialization, I find that lots of relatively hard design problems become quite simple once you follow this pattern. I think it's because ultimately once you start to use explicit names there is less implied logic throughout your program.

Put another way, a lot of the benefits of explicitly naming objects are unforeseeable, but fall out because your data model matches your intuitive mental model better.

May 6, 2021
