Explaining Schema-on-Read vs Schema-on-Write Principles

This page explains schema-on-read vs schema-on-write using two concrete event-tracking examples, Mixpanel and Snowplow. It is written as notes on the recently published and insightful https://engineering.mixpanel.com/under-the-hood-of-mixpanels-infrastructure-0c7682125e9b

Pavol Kutaj
Apr 23, 2024
  • Schema-on-read infers the structure from the data itself, whereas schema-on-write defines the structure upfront using a schema language.
  • Schema-on-read is flexible but can lead to inconsistent data and errors if not implemented carefully.
  • Schema-on-write enforces structure and improves data quality but can be rigid and time-consuming to define initially.

Analogy to Programming Languages

Just like programming languages can be typed statically or dynamically, event tracking can follow a schema-on-read or schema-on-write approach.

  • Schema-on-read (similar to dynamically typed languages like Python): Offers flexibility to handle various data formats but can lead to runtime errors if data is unexpected.
  • Schema-on-write (similar to statically typed languages like Java): Enforces data structure upfront to prevent errors but can be inflexible and require more initial effort.
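The analogy can be made concrete with a small sketch. Everything here (the function names, the Purchase class, the two fields) is hypothetical and only illustrates where the failure surfaces: at read time in the first case, at write time in the second.

```python
from dataclasses import dataclass


def total_quantity_on_read(raw_events):
    """Schema-on-read: accept anything at ingestion, interpret when reading."""
    total = 0
    for event in raw_events:
        # Fails here, at read time, if a stored event has an unexpected shape.
        total += event["quantity"]
    return total


@dataclass(frozen=True)
class Purchase:
    """Schema-on-write: the structure is checked before an event is stored."""
    product_id: str
    quantity: int

    def __post_init__(self):
        if not isinstance(self.product_id, str):
            raise TypeError("product_id must be a string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.quantity, bool) or not isinstance(self.quantity, int):
            raise TypeError("quantity must be an integer")


def total_quantity_on_write(purchases):
    # Every Purchase was validated on the way in, so reading is safe.
    return sum(p.quantity for p in purchases)
```

With schema-on-read, a stray {"quantity": "3"} is stored without complaint and only breaks the aggregation later. With schema-on-write, constructing Purchase("SKU123", "3") is rejected immediately.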

Example: Tracking User Signups in Mixpanel (Schema-on-read)

Mixpanel uses a schema-on-read approach for event tracking. Here’s an example of a user signup event sent to Mixpanel (created_at is a Unix timestamp in seconds):

{
  "userId": "123",
  "email": "user@example.com",
  "created_at": 1655800000
}

In this scenario, Mixpanel doesn’t require a predefined schema for this event. It can ingest the data and infer the structure based on the properties included (userId, email, and created_at).
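To illustrate what inferring structure from the data can look like, here is a minimal sketch. It is not Mixpanel's actual implementation; infer_schema and the second event are hypothetical and exist only to show how inconsistencies can slip in when nothing is enforced at ingestion.

```python
from collections import defaultdict


def infer_schema(events):
    """Collect the observed Python type names for each property."""
    observed = defaultdict(set)
    for event in events:
        for name, value in event.items():
            observed[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in observed.items()}


events = [
    {"userId": "123", "email": "user@example.com", "created_at": 1655800000},
    # A second, hypothetical tracker sends userId as a number and omits email.
    {"userId": 456, "created_at": 1655800060},
]

schema = infer_schema(events)
```

Both events are accepted, but the inferred structure now reports two types for userId, which is the kind of inconsistency that schema-on-read leaves for the reader to handle.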

Example: Tracking Product Purchases with Snowplow Schema (Schema-on-write)

Snowplow, another popular event-tracking platform (and my wonderful employer), uses a schema-on-write approach. Here’s an example of a product purchase event written as self-describing JSON, which pairs the event data with a reference to its schema:

{
  "schema": "iglu:com.mycompany.ecommerce/purchase_event/jsonschema/1-0-0",
  "data": {
    "productId": "SKU123",
    "quantity": 1,
    "price": 19.99,
    "currency": "USD"
  }
}

This example showcases schema-on-write. The schema property identifies the event structure with an Iglu URI (vendor, name, format, and version) that refers to a schema held in a central schema registry. This registry stores and manages all the schemas used for event tracking. The data property contains the actual event data, which is validated against the predefined structure outlined in that schema.
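The schema reference follows the pattern iglu:vendor/name/format/version. As a rough sketch, it can be split into its parts like this. The regular expression is my own approximation rather than Snowplow's official one, and SchemaKey and parse_iglu_uri are hypothetical names.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


# Approximate pattern (an assumption): iglu:<vendor>/<name>/<format>/<version>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[a-zA-Z0-9._-]+)/(?P<name>[a-zA-Z0-9_-]+)/"
    r"(?P<format>[a-zA-Z0-9_-]+)/(?P<version>[0-9]+-[0-9]+-[0-9]+)$"
)


def parse_iglu_uri(uri):
    """Split an Iglu schema URI into vendor, name, format, and version."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())


key = parse_iglu_uri(
    "iglu:com.mycompany.ecommerce/purchase_event/jsonschema/1-0-0"
)
```

Here key.vendor is "com.mycompany.ecommerce" and key.version is "1-0-0"; a malformed reference raises ValueError instead of being passed along.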

Snowplow Schema Example (iglu:com.mycompany.ecommerce/purchase_event/jsonschema/1-0-0):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "productId": {
      "type": "string",
      "description": "The unique identifier of the purchased product"
    },
    "quantity": {
      "type": "integer",
      "description": "The number of units of the product purchased"
    },
    "price": {
      "type": "number",
      "description": "The price of the purchased product"
    },
    "currency": {
      "type": "string",
      "description": "The currency code of the price (e.g., USD, EUR)"
    }
  },
  "required": [
    "productId",
    "quantity",
    "price",
    "currency"
  ]
}

This schema defines the expected structure of the purchase_event data. It specifies the data type of each property (string, integer, or number) and includes descriptions for clarity. It also marks all four properties as required, so an event with a missing property fails validation.
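To show what checking an event against such a schema involves, here is a deliberately simplified sketch using only the standard library. It covers just the type and required keywords and is not a full JSON Schema validator, nor Snowplow's validation code; validate and JSON_TYPES are hypothetical names, and descriptions are left out of the schema for brevity.

```python
# Mapping from JSON Schema type names to Python types (simplified).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
}

PURCHASE_SCHEMA = {
    "type": "object",
    "properties": {
        "productId": {"type": "string"},
        "quantity": {"type": "integer"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["productId", "quantity", "price", "currency"],
}


def validate(data, schema):
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        expected = JSON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, but it is not a JSON number.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {rules['type']}")
    return errors


good = {"productId": "SKU123", "quantity": 1, "price": 19.99, "currency": "USD"}
bad = {"productId": "SKU123", "quantity": "1", "price": 19.99}
```

The good event yields no errors, while the bad one is reported for its missing currency and its string quantity before it could reach storage.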

Pavol Kutaj

Today I Learnt | Infrastructure Support Engineer at snowplow.io with a passion for cloud infrastructure/terraform/python/docs. More at https://pavol.kutaj.com