Explaining Serialization
The aim of this page 📝 is to explain the fundamental differences between in-memory objects and serialized data (data engineering 101), focusing on the core concepts of serialization, encoding, and the distinction between text and binary formats.
It grew out of a conversation with GeminiAI that started with this sentence from Designing Data-Intensive Applications, which I’m currently reading:
At the beginning of this chapter we said that whenever you want to send some data to another process with which you don’t share memory — for example, whenever you want to send data over the network or write it to a file — you need to encode it as a sequence of bytes. We then discussed a variety of different encodings for doing this.
— Chapter 4 (Encoding and Evolution > Modes of Dataflow)
NOTES
- When data needs to be transported between two processes that do not share memory, it must be encoded as a sequence of bytes.
- This process is known as serialization, which converts complex, language-specific in-memory objects into a portable, language-agnostic byte stream.
- In-memory objects are stored in a complex, often non-contiguous memory layout with internal pointers, unique to a specific programming language’s runtime (e.g., Python’s dict or JavaScript's object).
- A serialized byte sequence is a single, contiguous block of bytes designed for portability, not for direct in-memory access. It contains no internal pointers.
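- The raw bytes of an in-memory object are not directly usable by another process: their layout is specific to the language runtime, and the pointers they contain are memory addresses that only mean something inside the original process.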
- There are two main categories of serialization formats: text-based and binary.
- Text-based formats, like JSON, are human-readable and rely on a separate character encoding step (e.g., UTF-8) to convert the text characters into bytes.
- Binary formats, like Parquet, are not human-readable but are more compact and faster to parse because they directly map data types to a predefined binary layout, skipping the text-based layer.
- The choice between a text and a binary format is a trade-off between human readability and performance (speed and size).
- While serialization into a byte sequence is the most common method for inter-process communication, shared memory offers a faster alternative for processes on the same machine by allowing them to access the same region of RAM directly, avoiding the need for serialization.
- Advanced formats like Apache Arrow enable zero-copy deserialization, where the in-memory representation is identical to the on-wire format, allowing programs to access data without making a separate copy.
- Here is an example of a Python object being serialized to JSON, and the same JSON text being deserialized into a JavaScript object:
- Python (Serialization)
import json
data = {'id': 101, 'name': 'Alice'}
serialized_data = json.dumps(data)
print(type(data))
# <class 'dict'>
print(type(serialized_data))
# <class 'str'>
- JavaScript (Deserialization)
const serialized_data = '{"id": 101, "name": "Alice"}';
const data = JSON.parse(serialized_data);
console.log(typeof data);
// object
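
To make the text-versus-binary point from the notes more concrete, here is a minimal Python sketch that encodes the same record both ways. The text path goes through JSON and then a separate UTF-8 step to get bytes. The binary path uses a tiny fixed layout that I made up purely for illustration (a 4-byte id, a 1-byte name length, then the name bytes); it is not Parquet or any real format, and `decode_binary` is a hypothetical helper.

```python
import json
import struct

record = {"id": 101, "name": "Alice"}

# Text path: object -> JSON text (str) -> UTF-8 bytes.
# json.dumps returns a str; the character encoding is a separate step.
json_text = json.dumps(record)
json_bytes = json_text.encode("utf-8")

# Binary path: a made-up layout for this example only:
#   little-endian 4-byte unsigned id, 1-byte name length, then the UTF-8 name.
# The 1-byte length field limits the name to 255 bytes.
name_bytes = record["name"].encode("utf-8")
binary_bytes = struct.pack(
    f"<IB{len(name_bytes)}s", record["id"], len(name_bytes), name_bytes
)


def decode_binary(buf: bytes) -> dict:
    """Decode the made-up layout above back into a dict."""
    header_size = struct.calcsize("<IB")
    record_id, name_len = struct.unpack_from("<IB", buf, 0)
    (raw_name,) = struct.unpack_from(f"<{name_len}s", buf, header_size)
    return {"id": record_id, "name": raw_name.decode("utf-8")}


print(type(json_text).__name__, type(json_bytes).__name__)  # str bytes
print(len(json_bytes), len(binary_bytes))                   # 28 10
print(json.loads(json_bytes.decode("utf-8")) == record)     # True
print(decode_binary(binary_bytes) == record)                # True
```

The binary version is smaller because field names and punctuation are not stored at all; the reader has to know the layout in advance. That is the trade-off from the notes in miniature: the JSON bytes can be read by eye, while the packed bytes only make sense alongside their layout definition.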