This article is the third of a four-part series on the Envision virtual machine’s inner workings: the software that runs Envision scripts. See part 1, part 2 and part 4. This series doesn’t cover the Envision compiler (maybe some other time), so let’s just assume that the script has somehow been converted to the bytecode that the Envision virtual machine takes as input.
During execution, thunks read input data and write output data, often in large quantities.
- A billion booleans (one bit per value) take 125MB.
- A billion floating-point numbers (32-bit precision) take 4GB.
- A billion minimal sales lines (date, location, EAN-13, quantity) take between 14GB and 33GB (or more!) depending on how the values are encoded.
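The figures above follow from simple arithmetic, reproduced in the Python sketch below. The per-field byte widths for a sales line are assumptions: one plausible compact encoding and one plausible naive encoding that happen to land on the 14GB and 33GB endpoints, not Envision's actual formats.

```python
BILLION = 10**9

def gigabytes(total_bytes):
    """Convert a byte count to decimal gigabytes (1GB = 10^9 bytes)."""
    return total_bytes / 10**9

# One bit per boolean, 32 bits (4 bytes) per floating-point number.
bool_bytes = BILLION // 8
float_bytes = BILLION * 4

# Hypothetical field widths in bytes (assumptions for illustration only).
# Compact: day count, small location index, 13 digits in 44 bits, 32-bit quantity.
compact_line = {"date": 2, "location": 2, "ean13": 6, "quantity": 4}
# Naive: 64-bit timestamp, 32-bit location, 13 ASCII digits, 64-bit quantity.
naive_line = {"date": 8, "location": 4, "ean13": 13, "quantity": 8}

compact_bytes = BILLION * sum(compact_line.values())
naive_bytes = BILLION * sum(naive_line.values())

print(gigabytes(bool_bytes))     # 0.125
print(gigabytes(float_bytes))    # 4.0
print(gigabytes(compact_bytes))  # 14.0
print(gigabytes(naive_bytes))    # 33.0
```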
This creates two challenges: how to preserve this data from the moment it is created until it is used (part of the answer is: on NVMe drives spread over several machines), and how to minimize the amount of data that goes through channels slower than RAM (network and persistent storage).
One part of the solution is to have two separate data layers, with data being pushed into one of the two layers based on its nature. The metadata layer contains information about the actual data, and about the scripts being executed:
- When a thunk has successfully returned data, the unique identifier of that data is kept in this layer.
- When a thunk has failed, the error messages produced by the thunk are kept in this layer.
- When a thunk has returned a new thunk (and its parents’ DAG), the serialized DAG is kept in this layer.
- A thunk can save checkpoints to the metadata layer (usually consisting of a block of data’s identifier); if a thunk is interrupted before it was completed, it can then load back its checkpoint from the metadata layer and resume work from that position.
In other words, the metadata layer can be seen as a dictionary mapping thunks to results, where the exact nature of the result depends on what the thunk actually returned.
The metadata layer can also contain additional information about the structure of the data being referenced. For instance, if a thunk has returned a pair of vectors, then the metadata will contain each vector’s unique identifier. This allows consumers to access one vector without having to load both.
There are two limits on values stored in the metadata layer: an entry may not exceed 10MB (so a serialized DAG is not allowed to exceed this amount, either!), and the total storage space for the metadata layer is 1.5GB. Usually, there are around one million values in this layer, for an average entry size of 1.5KB.
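The "dictionary mapping thunks to results" view, together with the two size limits, can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: the class and field names are hypothetical, checkpoints are left out, the entry size is supplied by the caller, and rejecting oversized writes with an exception is an assumption (the text does not say what happens when a limit is hit).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

MAX_ENTRY_BYTES = 10 * 10**6     # 10MB per entry
MAX_TOTAL_BYTES = 1500 * 10**6   # 1.5GB for the whole layer

@dataclass(frozen=True)
class Success:
    atom_ids: Tuple[str, ...]    # one identifier per returned vector

@dataclass(frozen=True)
class Failure:
    messages: Tuple[str, ...]    # error messages produced by the thunk

@dataclass(frozen=True)
class NewDag:
    serialized_dag: bytes        # the thunk returned more work to do

Result = Union[Success, Failure, NewDag]

class MetadataLayer:
    """Hypothetical in-memory map from thunk identifier to result."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[Result, int]] = {}
        self.total_bytes = 0

    def put(self, thunk_id: str, result: Result, size_bytes: int) -> None:
        if size_bytes > MAX_ENTRY_BYTES:
            raise ValueError("entry exceeds the 10MB limit")
        previous = self._entries.get(thunk_id)
        new_total = self.total_bytes - (previous[1] if previous else 0) + size_bytes
        if new_total > MAX_TOTAL_BYTES:
            raise ValueError("metadata layer exceeds its 1.5GB budget")
        self._entries[thunk_id] = (result, size_bytes)
        self.total_bytes = new_total

    def get(self, thunk_id: str) -> Optional[Result]:
        entry = self._entries.get(thunk_id)
        return entry[0] if entry else None

layer = MetadataLayer()
layer.put("thunk-1", Success(("vector-a", "vector-b")), size_bytes=64)
layer.put("thunk-2", Failure(("division by zero",)), size_bytes=32)
```

Because the result for `thunk-1` lists each vector's identifier separately, a consumer can fetch `vector-b` alone without touching `vector-a`.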
The metadata layer always lives in RAM to guarantee fast access. It acts as the source of truth for thunk execution: a thunk has been executed if, and only if, there is a result associated with that thunk in the metadata layer. This does not, however, guarantee that the data referenced by that result is still available.
Each worker in the cluster keeps its own copy of the metadata layer. The worker broadcasts every change to this layer (caused by the execution of local thunks) to all other workers in the cluster, and to the scheduler as well. This is done on a «best effort» basis: if a broadcast message does not reach its destination, it is dropped[1] without a retry.
Every second, the metadata layer is persisted to disk, incrementally. In case of crash or reboot, the worker will take a second or two to reload the entire layer from disk to remember what it was doing.
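The best-effort broadcast described above can be illustrated with a small Python sketch. Everything here is hypothetical (the `Worker` class, the `reachable` flag standing in for a network failure); the only point is that a failed delivery is dropped rather than retried, and that each worker's copy may therefore drift slightly from the others.

```python
class Worker:
    """Hypothetical worker holding its own copy of the metadata layer."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}     # thunk identifier -> result
        self.peers = []        # other workers (and the scheduler)
        self.reachable = True  # stand-in for network availability

    def receive(self, thunk_id, result):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.metadata[thunk_id] = result

    def record(self, thunk_id, result):
        """Store a local result, then broadcast it once to every peer."""
        self.metadata[thunk_id] = result
        delivered = 0
        for peer in self.peers:
            try:
                peer.receive(thunk_id, result)
                delivered += 1
            except ConnectionError:
                pass  # best effort: the message is dropped, no retry
        return delivered

a, b, c = Worker("a"), Worker("b"), Worker("c")
a.peers = [b, c]
c.reachable = False
delivered = a.record("thunk-1", "atom-42")
print(delivered)                # 1
print("thunk-1" in b.metadata)  # True
print("thunk-1" in c.metadata)  # False
```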
Keeping large databases in memory
As mentioned above, the metadata layer can contain a million entries. Each individual DAG can contain hundreds of thousands of nodes. All of these have long lifetimes—from minutes to hours. Keeping millions of long-lived objects in memory is quite hard on the .NET garbage collector.
Garbage collection in .NET is a complex topic (though there is an excellent series by Konrad Kokosa to dive into the low-level details), but the overall issue is a combination of three facts:
- The performance cost of a garbage collection pass is proportional to the number of live objects in the area of memory being garbage-collected. Traversing millions of objects, often with billions of references to follow between them, will take the garbage collector several seconds.
- To avoid paying this cost, the .NET garbage collector works with separate areas of memory, called generations, depending on the age of objects inside them. The youngest generation, Gen0, undergoes garbage collection frequently but only contains objects allocated since the last pass (so, only a few). The oldest generation, Gen2, is only collected if both Gen1 and Gen0 were collected but failed to yield enough free memory. This will be quite rare as long as most object allocations are small and short-lived.
- However, a normal thunk operation involves large arrays of values, which are allocated on the Large Object Heap, an area separate from Gen0, Gen1 and Gen2. When the Large Object Heap runs out of space, a full garbage collection is performed, which also collects Gen2.
And Gen2 is where the millions of objects from DAGs and the metadata layer are situated.
To avoid this, we have built both the DAGs and the metadata layer to use only very few objects.
Each DAG consists of only two allocations, an array of nodes and an array of edges, both of which are unmanaged value types, so that the GC does not even need to traverse their contents to follow any references they may contain. When a thunk needs to be executed, it is deserialized from the binary representation of the DAG[2], which is present in the metadata layer.
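To make the "two allocations" idea concrete, here is a Python sketch of a DAG stored as two flat integer arrays instead of one object per node. The layout (an offsets array indexing into a shared edge array) is an assumption chosen for illustration; the text only states that there is one array of nodes and one array of edges.

```python
from array import array

class FlatDag:
    """DAG stored as two flat arrays of integers (hypothetical layout)."""

    def __init__(self, parents_per_node):
        # nodes[i] is the offset in `edges` where the parents of node i begin;
        # a final sentinel entry holds the total number of edges.
        self.nodes = array("i")
        self.edges = array("i")
        for parents in parents_per_node:
            self.nodes.append(len(self.edges))
            self.edges.extend(parents)
        self.nodes.append(len(self.edges))

    def parents(self, node):
        """Return the parent indices of a node by slicing the edge array."""
        start, end = self.nodes[node], self.nodes[node + 1]
        return list(self.edges[start:end])

# Nodes 0 and 1 are inputs, node 2 depends on both, node 3 depends on node 2.
dag = FlatDag([[], [], [0, 1], [2]])
print(dag.parents(2))  # [0, 1]
print(dag.parents(3))  # [2]
```

However many nodes the DAG has, the structure holds only two arrays of plain integers, which is the property that keeps a garbage collector from having to follow per-node references.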
The metadata layer has variable-length contents, so it is built by carving chunks out of a large byte buffer, using `ref struct` and `MemoryMarshal.Cast` to manipulate the data without copying it.
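The same idea can be sketched in Python, where `memoryview` slices and `memoryview.cast` play a role loosely analogous to spans and `MemoryMarshal.Cast` in .NET. This is an analogy under stated assumptions, not the actual code: the `Arena` class is hypothetical, and it simply hands out consecutive slices of one preallocated buffer.

```python
import struct

class Arena:
    """Hypothetical bump allocator carving chunks out of one large buffer."""

    def __init__(self, capacity):
        self.buffer = bytearray(capacity)    # the single large allocation
        self._view = memoryview(self.buffer)
        self._used = 0

    def allocate(self, size):
        if self._used + size > len(self.buffer):
            raise MemoryError("arena is full")
        chunk = self._view[self._used:self._used + size]
        self._used += size
        return chunk  # a view into the shared buffer, not a copy

arena = Arena(1024)
chunk = arena.allocate(16)

# Reinterpret the raw bytes as native unsigned integers, without copying.
ints = chunk.cast("I")
ints[1] = 7

# The write is visible in the underlying buffer at the matching byte offset.
value = struct.unpack_from("I", arena.buffer, ints.itemsize)[0]
print(value)  # 7
```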
A cluster has between 512GiB and 1.5TiB of RAM, and between 15.36TB and 46.08TB of NVMe storage. Most of this space is dedicated to storing the intermediate results of thunk evaluation.
RAM is valuable real estate: it represents only 3% of available storage space, but is between 100× and 1000× faster to read and write. There is a significant benefit to ensuring that data that is about to be read by a thunk is already present in memory (or has never left memory in the first place).
In addition, it is nearly impossible to use 100% of available RAM in .NET: the operating system has variable memory needs, and has no reliable way of telling the .NET process that it should relinquish some memory, which results in the process being killed by the out-of-memory (OOM) killer.
Envision resolves this issue by delegating the management of RAM-to-NVMe transfers to the operating system. We have open sourced this code as Lokad.ScratchSpace. This library memory-maps all the storage space available on the NVMe drives, and exposes it as a blob store that the application can use to:
- write blocks of data (up to 2GB each) to the scratch space, either directly or by serializing from a managed object. This operation returns a block identifier.
- read blocks of data using their identifiers. This operation pins the block and exposes it to the application as a `ReadOnlySpan<byte>`, which the application should then copy (or deserialize) to managed memory.
Once the scratch space is full, the oldest blocks are discarded to make space for new data. This means that it’s possible for a read operation to fail, if the identifier points to a block that has been dropped, but this is a rare occurrence during the execution of an Envision script—rarely does a single execution produce tens of terabytes. On the other hand, this may prevent a new execution from reusing the results of a previous one.
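The contract described above (write returns an identifier, read may fail once the block has been discarded) can be illustrated with a toy in-memory model in Python. This is not the Lokad.ScratchSpace API and involves no memory mapping; the class name, method names and the use of `KeyError` for a dropped block are all assumptions made for the sketch.

```python
from collections import OrderedDict
from itertools import count

class ToyScratchSpace:
    """Toy blob store that discards its oldest blocks when full."""

    def __init__(self, capacity_bytes):
        self._capacity = capacity_bytes
        self._used = 0
        self._blocks = OrderedDict()  # insertion order doubles as block age
        self._next_id = count(1)

    def write(self, data):
        """Store a block and return its identifier."""
        if len(data) > self._capacity:
            raise ValueError("block is larger than the scratch space")
        # Discard the oldest blocks until the new one fits.
        while self._used + len(data) > self._capacity:
            _, dropped = self._blocks.popitem(last=False)
            self._used -= len(dropped)
        block_id = next(self._next_id)
        self._blocks[block_id] = bytes(data)
        self._used += len(data)
        return block_id

    def read(self, block_id):
        """Return the block, or raise KeyError if it has been discarded."""
        return self._blocks[block_id]

space = ToyScratchSpace(capacity_bytes=10)
first = space.write(b"aaaaaa")
second = space.write(b"bbbbbb")  # does not fit alongside the first block

print(space.read(second))        # b'bbbbbb'
try:
    space.read(first)
except KeyError:
    print("first block was discarded")
```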
The key to using a memory-mapped scratch space is that the available RAM is distributed among three kinds of pages[3]: memory that belongs to processes (such as Envision’s .NET process), memory that is an exact byte-for-byte copy of a portion of a file on disk, and memory that is intended to be written to a file on disk.
Memory that is a copy of a file on disk can, at any point in time, be released by the operating system, and used for another purpose—to be given to a process for its own use, or to become a copy of another portion of a file on disk. While not instantaneous, these pages act as a memory buffer that can be quickly re-assigned to another use. And until they are re-assigned, the operating system knows that they contain a copy of a specific region of persistent memory, and so any read requests for that region will be redirected to the existing page instead, thereby requiring no load from disk at all.
Memory that is intended to be written to disk will eventually be written out and become a copy of the region where it was written. This conversion is limited by the writing speed of the NVMe drives (on the order of 1GB/s).
Memory that is assigned to the process cannot be converted back to the two other types without being released by the process (which the .NET GC will sometimes do, after a collection has released a large amount of memory). All memory allocated through .NET, including all managed objects and everything that the GC oversees, must belong to this type of memory.
In a typical worker, 25% of the memory is assigned to the .NET process directly, 70% is a read-only copy of file regions, and 5% is in the process of being written out.
The general principle is that each thunk writes its output to the scratch space as one or more atoms, then stores the identifiers of those atoms in the metadata layer. Subsequent thunks then load these identifiers from the metadata layer, and use them to query the scratch space for the atoms that they need.
The name «Atom» was chosen because it is not possible to read only one portion of an atom: they can only be retrieved in their entirety. If a data structure needs to support requesting only part of its contents, we instead save it as multiple atoms, which can then be retrieved independently.
Some atoms are compressed; for instance, most boolean vectors are not represented as `bool`, which consumes one byte per element, but are instead compacted down to 1 bit per value, and then compressed to eliminate long sequences of identical values.
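The two ideas mentioned above can be sketched separately in Python: packing booleans at one bit per value, and collapsing long sequences of identical values. Run-length encoding is used here as a stand-in for the second step; the text does not name the compression scheme actually used, so this choice is an assumption.

```python
def pack_bits(values):
    """Pack booleans into bytes, 8 per byte, least significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def run_lengths(values):
    """Encode a boolean sequence as (first value, lengths of successive runs)."""
    if not values:
        return (False, [])
    runs = []
    current, length = values[0], 0
    for value in values:
        if value == current:
            length += 1
        else:
            runs.append(length)
            current, length = value, 1
    runs.append(length)
    return (bool(values[0]), runs)

def from_run_lengths(first, runs):
    """Rebuild the boolean sequence from its run-length encoding."""
    out, value = [], first
    for length in runs:
        out.extend([value] * length)
        value = not value
    return out

values = [True] * 1000 + [False] * 24
print(len(pack_bits(values)))  # 128 bytes instead of 1024
print(run_lengths(values))     # (True, [1000, 24])
```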
It is possible for atoms to disappear, though this is a rare occurrence. The two main situations where this can happen are when the metadata layer remembers a result from a previous run, but the corresponding atom was evicted from the scratch space in the meantime, and when the atom was stored on a different worker that no longer responds to requests. Less frequently, a checksum error reveals that the stored data is no longer valid and must be discarded.
When an atom disappears, the thunk that requested it is interrupted, and enters recovery mode:
- The system verifies the presence (but not the checksums) of all other atoms referenced by the thunk’s inputs. This is because atoms are likely to be generated at the same time and on the same worker, and the disappearance of an atom is correlated with the disappearance of other atoms from around the same time and place.
- The system scours the metadata layer for references to any of the atoms discovered as missing during the previous step. This will cause some thunks to revert from “executed” to “not executed yet” because their result was discarded. The kernel will then detect this, and schedule them again.
The re-executed thunks will then produce the atom again, and execution can resume.
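The two recovery steps can be sketched as a single Python function. The data shapes are hypothetical simplifications: the metadata layer is modelled as a dictionary from thunk identifier to the set of atom identifiers in its result, and atom presence is modelled as a plain set.

```python
def recover(metadata, available_atoms, input_atoms):
    """Revert every thunk whose result references a missing input atom.

    metadata:        dict of thunk identifier -> set of atom identifiers
    available_atoms: set of atom identifiers still present
    input_atoms:     atom identifiers referenced by the interrupted thunk
    Returns the sorted list of thunks reverted to "not executed yet".
    """
    # Step 1: check the presence (not the checksums) of all input atoms.
    missing = {atom for atom in input_atoms if atom not in available_atoms}

    # Step 2: scour the metadata for results that reference a missing atom.
    reverted = sorted(
        thunk for thunk, atoms in metadata.items() if atoms & missing
    )
    for thunk in reverted:
        del metadata[thunk]  # the thunk no longer counts as executed
    return reverted

metadata = {"t1": {"a1", "a2"}, "t2": {"a3"}, "t3": {"a4"}}
reverted = recover(metadata, available_atoms={"a3"}, input_atoms=["a1", "a3", "a4"])
print(reverted)          # ['t1', 't3']
print(sorted(metadata))  # ['t2']
```

Once the entries for `t1` and `t3` are gone, those thunks look unexecuted again and are eligible for rescheduling.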
A particular aspect of the atom layer is the way in which shuffles are performed: a first layer of $M$ thunks each produces several million lines of data, and then a second layer of $N$ thunks each reads the previous layer’s output to perform another operation (usually, some form of reduce), but every single line from the first layer is only ever read by one thunk from the second layer.
It would be very wasteful for every thunk in the second layer to read all the data from the first layer (every line would be read $N$ times, out of which $N-1$ were unnecessary), but this is exactly what would happen if every thunk from the first layer produced exactly one atom.
On the other hand, if every thunk in the first layer produces one atom for each thunk in the second layer, the shuffle operation will involve $M\cdot N$ atoms in total—a million atoms for $M = N = 1000$. While the overhead on atoms is not excessive, adding up an atom identifier, tenant identifier, atom data type, size, and a bit of bookkeeping, it can still reach a few hundred bytes per atom. While 100MB may seem like a small price to pay in order to shuffle around 4GB of actual data, that actual data lives in the atom layer (which is designed for large data), while 100MB represents a sizable chunk of the 1.5GB total budget of the metadata layer.
To work around this, Envision supports atom arrays:
- All the atoms in an atom array are written out at the same time, and are kept together both in memory and on the disk.
- Given the identifier of the atom array, it is easy to derive the identifier of the i-th atom in the array.
Thanks to this, an array of atoms has the same overhead as a single atom. In a shuffle, the first layer thunks would produce $M$ arrays of $N$ atoms each. The second layer thunks would each request $M$ atoms, one from each array, at the position corresponding to that thunk’s rank in the shuffle.
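A short Python sketch of the shuffle with atom arrays follows. The identifier scheme (`array-id[index]`) is hypothetical, chosen only to show that the i-th atom's identifier can be derived from the array's identifier. The 100-byte overhead per tracked entry is the round figure implied by the 100MB estimate for one million atoms.

```python
def atom_id(array_id, index):
    """Derive the identifier of the index-th atom (hypothetical scheme)."""
    return f"{array_id}[{index}]"

def shuffle_reads(array_ids, rank):
    """Atoms requested by the second-layer thunk of the given rank:
    one atom from each first-layer array, at position `rank`."""
    return [atom_id(array_id, rank) for array_id in array_ids]

M, N = 1000, 1000
OVERHEAD_BYTES = 100  # assumed bookkeeping cost per tracked entry

# One entry per atom versus one entry per atom array.
per_atom_overhead = M * N * OVERHEAD_BYTES
per_array_overhead = M * OVERHEAD_BYTES

array_ids = [f"array-{m}" for m in range(M)]
reads = shuffle_reads(array_ids, rank=3)

print(per_atom_overhead)   # 100000000 (100MB)
print(per_array_overhead)  # 100000 (100KB)
print(len(reads))          # 1000
print(reads[:2])           # ['array-0[3]', 'array-1[3]']
```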
In closing, a few production statistics! In an hour, a typical worker will execute 150 000 thunks and write 200 000 atoms (arrays of atoms are counted only once) representing 750GiB of intermediate data.
In the next and final article of this series, we will discuss the layers that enable the distributed execution to happen.
Shameless plug: we are hiring software engineers. Remote work is possible.
[1] Messages are only very rarely dropped, and although it is better for performance if no messages are dropped at all, it is not necessary for correctness. It is assumed that the metadata layer of each worker will be slightly out of sync with the others, and while this hampers their ability to cooperate on specific missions, each worker remains capable of finishing every mission on its own. This lets us avoid the complexity of setting up at-least-once delivery.
[2] This deserialization also involves a great deal of decompression, since we apply several complex techniques to keep the total size of a serialized DAG to a minimum.
[3] There are actually other kinds of pages, and this article provides only a very limited overview as it applies to Envision.