Data Encoding & Schema Evolution

Applications inevitably change over time, and so do the data structures they use. But in a large system, you cannot upgrade all services simultaneously -- old and new versions of your code coexist during rolling deployments, and data written by old code must remain readable by new code (and often vice versa). This chapter explores how different encoding formats handle schema evolution and what it means for building evolvable systems.

Why Encoding Matters

Data exists in two representations. In memory, it lives in objects, structs, hash tables, and trees optimized for CPU access. On disk or over the network, it must be encoded as a self-contained sequence of bytes. The translation between these two representations is called encoding (serialization, marshalling) and decoding (deserialization, unmarshalling).

The choice of encoding format affects storage size, parsing speed, human readability, and most critically, your ability to evolve schemas over time without breaking running systems.

Human-Readable Formats

JSON

JSON is the de facto standard for web APIs and configuration files. It is human-readable, widely supported, and simple to work with.

JSON

{
  "user_id": 42,
  "name": "Alice Chen",
  "email": "alice@example.com",
  "orders": [
    { "id": "ord_001", "total": 89.99, "currency": "USD" },
    { "id": "ord_002", "total": 245.00, "currency": "USD" }
  ]
}

Limitations of JSON:

  • Number precision: JSON does not distinguish between integers and floating-point numbers. Large integers (>2^53) lose precision when parsed by JavaScript. Twitter famously had to include tweet IDs as both numbers and strings in their API.
  • No binary support: Binary data must be Base64-encoded, increasing size by 33%.
  • No schema enforcement: JSON has no built-in way to specify which fields are required, what types they should be, or version constraints. JSON Schema exists but is not widely enforced at the transport layer.
  • Verbose: Field names are repeated in every record, wasting space compared to binary formats.
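The first two limitations are easy to demonstrate. A short Python sketch (standard library only) shows the 2^53 precision cliff that bites JavaScript-style parsers, which store every number as an IEEE 754 double, and the ~33% growth from Base64-encoding binary data:

```python
import base64
import json

# Integers above 2**53 cannot be represented exactly as IEEE 754 doubles,
# which is how JavaScript stores all JSON numbers.
big_id = 2**53 + 1                  # 9007199254740993
as_double = float(big_id)           # silently rounds to 9007199254740992.0
print(int(as_double) == big_id)     # False: off by one after the round trip

# This is why APIs such as Twitter's ship large IDs as strings as well:
payload = json.dumps({"id": big_id, "id_str": str(big_id)})

# Base64 maps every 3 input bytes to 4 output characters: ~33% overhead.
blob = bytes(300)                   # 300 bytes of binary data
encoded = base64.b64encode(blob)
print(len(encoded))                 # 400 characters
```

Python itself parses large JSON integers exactly (its ints are arbitrary precision); the rounding step above reproduces what happens in a double-based runtime.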

XML

XML was the dominant data interchange format before JSON. It supports namespaces and schemas (XSD) but is considerably more verbose than JSON. It remains common in enterprise systems (SOAP web services, configuration files such as Maven POMs), but has largely been replaced by JSON for new APIs.

CSV

CSV is the simplest text format: values separated by commas, one record per line. It has no schema, no type system, and ambiguous handling of special characters (commas in values, newlines, encoding). It remains popular for data exchange with spreadsheets and simple ETL pipelines, but is unsuitable for complex or evolving data structures.

Binary Encoding Formats

Binary formats encode data more compactly and parse faster than text formats. More importantly, they support schema definitions that enable forward and backward compatibility.

Protocol Buffers (Protobuf)

Developed by Google, Protocol Buffers is the most widely used binary encoding format. You define a schema in a .proto file, and a code generator produces classes for your programming language.

Protocol Buffers Schema

syntax = "proto3";

message User {
  int32 user_id = 1;  // field number 1
  string name = 2;    // field number 2
  string email = 3;   // field number 3

  message Order {
    string id = 1;
    double total = 2;
    string currency = 3;
  }

  repeated Order orders = 4;  // field number 4
}

// Encoded size for the JSON example above:
// JSON:     ~220 bytes
// Protobuf: ~85 bytes (60% smaller)

The critical design choice: fields are identified by field numbers (1, 2, 3...), not by name. The field name is only used in generated code -- it never appears in the encoded bytes. This means you can rename fields freely without breaking compatibility.
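To make the field-number mechanism concrete, here is a minimal Python sketch (standard library only, not the official protobuf library) of how a varint-typed field is encoded. The tag byte packs the field number and wire type together; the field name never appears on the wire:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_field(field_number: int, value: int) -> bytes:
    """Encode a varint field: tag = (field_number << 3) | wire_type."""
    WIRE_TYPE_VARINT = 0
    tag = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(tag) + encode_varint(value)

# The classic example from the Protobuf docs: field 1, value 150
# encodes to the three bytes 08 96 01.
print(encode_field(1, 150).hex())  # 089601
```

Renaming the field changes nothing here, because only the number 1 is encoded; renumbering it would change the bytes and break every existing reader.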

Apache Thrift

Developed at Facebook, Thrift is similar to Protobuf but offers two binary encodings: BinaryProtocol (straightforward) and CompactProtocol (uses variable-length integers and bit-packing for smaller size). Thrift also includes a built-in RPC framework.

Thrift Schema

struct User {
  1: required i32 user_id,
  2: required string name,
  3: optional string email,
  4: optional list<Order> orders,
}

struct Order {
  1: required string id,
  2: required double total,
  3: optional string currency,
}

Apache Avro

Avro takes a different approach: the schema is included (or referenced) alongside the data, and the encoding contains no field tags or type indicators. Values are simply concatenated in the order they appear in the schema. This makes Avro the most compact binary format, but it means the reader must use the exact same schema (or a compatible one) to decode the data.

Avro Schema

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "user_id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": ["null", "string"], "default": null },
    {
      "name": "orders",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Order",
          "fields": [
            { "name": "id", "type": "string" },
            { "name": "total", "type": "double" },
            { "name": "currency", "type": "string", "default": "USD" }
          ]
        }
      }
    }
  ]
}

Avro's key advantage is schema resolution: the writer's schema and the reader's schema do not need to be identical, only compatible. Avro resolves differences between them at read time, matching fields by name. This is especially powerful for Hadoop/Spark pipelines where data written months ago must be read by new code.

Comparing Encoding Formats

Feature              | JSON               | Protobuf         | Thrift            | Avro
---------------------|--------------------|------------------|-------------------|----------------------------
Human readable       | Yes                | No               | No                | No
Schema required      | No                 | Yes (.proto)     | Yes (.thrift)     | Yes (JSON schema)
Field identification | By name            | By field number  | By field number   | By name (schema resolution)
Encoding size        | Large (verbose)    | Small            | Small (compact)   | Smallest
Parse speed          | Slow               | Fast             | Fast              | Fast
Code generation      | Optional           | Required         | Required          | Optional
Dynamic typing       | Native             | Limited          | Limited           | Good (schema resolution)
RPC framework        | No (use REST/gRPC) | gRPC             | Built-in          | No
Primary ecosystem    | Web APIs           | Google, gRPC     | Facebook (legacy) | Hadoop, Kafka

Schema Evolution

In a distributed system, schema changes are inevitable. New features require new fields, old fields become irrelevant, and data types need to change. The question is: how do you evolve your schema without breaking the system?

Forward and Backward Compatibility

  • Backward compatibility: New code can read data written by old code. This is straightforward -- the new code knows about all old fields and can handle missing new fields with defaults.
  • Forward compatibility: Old code can read data written by new code. This is trickier -- the old code must gracefully ignore fields it does not recognize.

Both directions are necessary for rolling deployments, where old and new versions of a service run simultaneously.
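The two directions can be illustrated with plain JSON and Python dicts. The "old" reader ignores keys it does not know; the "new" reader supplies a default for keys that are absent. The function names here are illustrative, not from any library:

```python
import json

record_v1 = json.loads('{"user_id": 42, "name": "Alice Chen"}')
record_v2 = json.loads(
    '{"user_id": 42, "name": "Alice Chen", "phone": "+1-555-0100"}')

def read_user_old(record: dict) -> tuple:
    # Old code: extracts only the fields it knows about and silently
    # ignores anything else -> forward compatible.
    return (record["user_id"], record["name"])

def read_user_new(record: dict) -> tuple:
    # New code: supplies a default when the new field is missing
    # -> backward compatible.
    return (record["user_id"], record["name"], record.get("phone"))

print(read_user_old(record_v2))  # old code reads new data: phone ignored
print(read_user_new(record_v1))  # new code reads old data: phone -> None
```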

Warning: Breaking Changes

These schema changes break backward or forward compatibility and should be avoided in production systems: removing a required field, changing a field's type in an incompatible way (e.g., int32 to string), reusing a deleted field number (in Protobuf/Thrift), or renaming a field in Avro (since Avro matches fields by name). Always add new optional fields with defaults instead of modifying existing ones.

How Each Format Handles Evolution

Protobuf / Thrift Evolution Rules

Safe Schema Changes (Protobuf)

// Version 1:
message User {
  int32 user_id = 1;
  string name = 2;
  string email = 3;
}

// Version 2 (backward + forward compatible):
message User {
  int32 user_id = 1;
  string name = 2;
  string email = 3;
  string phone = 4;  // NEW optional field (old code ignores it)
  int32 age = 5;     // NEW optional field with implicit default 0
}

// UNSAFE changes (DO NOT DO):
// - Removing field 2 (name) breaks readers expecting it
// - Changing field 1 from int32 to string breaks encoding
// - Reusing field number 3 for a different field after deletion
// - Making a previously optional field required (in Thrift; proto3
//   dropped the required keyword for exactly this reason)

In Protobuf and Thrift, field numbers are permanent identifiers. Old code encountering an unknown field number simply skips it. New code encountering a missing optional field uses the default value. This is why you must never reuse a field number after deleting a field.
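A minimal decoder sketch (pure Python, handling only the varint and length-delimited wire types) shows why unknown field numbers are harmless: the wire type alone tells the reader how many bytes to consume, so a field can be parsed past without its number ever being understood:

```python
def decode_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def decode_message(buf: bytes, known_fields: set) -> dict:
    """Decode varint (0) and length-delimited (2) fields, skipping unknowns."""
    result, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:            # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (string/bytes)
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known_fields:   # unknown numbers are dropped
            result[field_number] = value
    return result

# A message carrying field 1 (varint 150) and field 4 (string "hi"):
msg = bytes([0x08, 0x96, 0x01, 0x22, 0x02]) + b"hi"
# Old code that only knows fields 1-3 parses past field 4 and drops it.
print(decode_message(msg, known_fields={1, 2, 3}))  # {1: 150}
```

If field number 4 were later reused for a different field, old payloads like this one would decode without error but with the wrong meaning, which is why deleted numbers must stay retired.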

Avro Evolution Rules

Avro resolves differences between the writer's schema and the reader's schema at read time. Fields are matched by name. If the reader's schema has a field that the writer's schema lacks, Avro uses the default value declared in the reader's schema. If the writer's schema has a field that the reader's schema lacks, the field is ignored.

Avro Schema Evolution

// Writer schema (v1):
{ "name": "user_id", "type": "int" }
{ "name": "name", "type": "string" }

// Reader schema (v2):
{ "name": "user_id", "type": "int" }
{ "name": "name", "type": "string" }
{ "name": "phone", "type": ["null", "string"], "default": null }

// Reading v1 data with v2 schema:
//   user_id: read from data
//   name:    read from data
//   phone:   not in data, use default (null)  ← backward compatible

// Reading v2 data with v1 schema:
//   user_id: read from data
//   name:    read from data
//   phone:   in data but not in schema, ignored  ← forward compatible
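The resolution rules can be sketched in a few lines of Python. This is a toy model of what an Avro library does at read time, not the real implementation: fields are matched by name, reader-only fields take their declared default, and writer-only fields are never copied:

```python
def resolve(writer_fields: list, reader_fields: list, values: list) -> dict:
    """Toy Avro-style schema resolution: match fields by name."""
    written = dict(zip((f["name"] for f in writer_fields), values))
    record = {}
    for field in reader_fields:
        if field["name"] in written:
            record[field["name"]] = written[field["name"]]  # present: read it
        elif "default" in field:
            record[field["name"]] = field["default"]        # missing: default
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return record  # writer-only fields are simply never copied over

v1 = [{"name": "user_id", "type": "int"},
      {"name": "name", "type": "string"}]
v2 = v1 + [{"name": "phone", "type": ["null", "string"], "default": None}]

# Backward compatible: v2 reader, v1 data -> phone filled with its default.
print(resolve(v1, v2, [42, "Alice Chen"]))
# Forward compatible: v1 reader, v2 data -> phone ignored.
print(resolve(v2, v1, [42, "Alice Chen", "+1-555-0100"]))
```

Note that a field without a default in the reader's schema makes the corresponding change backward incompatible, which is why registries reject such schemas.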

Dataflow: How Encoded Data Moves Through Systems

Data flows between processes in three main ways, each with different implications for schema evolution.

1. Through Databases

When a service writes data to a database, it encodes it. When it reads the data back (possibly years later, possibly by a different version of the service), it decodes it. The database stores data for a long time, so backward compatibility is essential: new code must be able to read old data. Forward compatibility matters during rolling upgrades, when an old version of the service might read data written by the new version.

2. Through Service Calls (REST and RPC)

In a microservices architecture, services communicate through APIs. The client and server may run different versions of the code, so the API encoding must support schema evolution.

  • REST / HTTP APIs: Typically use JSON. Schema evolution is managed through API versioning (URL path: /v1/users, /v2/users) or content negotiation headers. Adding optional fields to JSON responses is backward compatible; clients ignore unknown fields.
  • RPC (gRPC, Thrift RPC): Use binary encoding (Protobuf, Thrift) which has built-in schema evolution through field numbers. gRPC is the modern standard, offering streaming, deadlines, and code generation.
gRPC Service Definition

syntax = "proto3";

service UserService {
  rpc GetUser(GetUserRequest) returns (UserResponse);
  rpc ListUsers(ListUsersRequest) returns (stream UserResponse);
}

message GetUserRequest {
  int32 user_id = 1;
}

message UserResponse {
  int32 user_id = 1;
  string name = 2;
  string email = 3;
  // Adding field 4 in a future version is safe:
  // old clients will ignore it, new clients will use it
}

3. Through Asynchronous Message Passing

Message brokers (Kafka, RabbitMQ, Amazon SQS) decouple producers from consumers. The producer encodes a message and publishes it to a topic. Consumers (potentially many, running different code versions) decode it later. This is the most demanding scenario for schema evolution because:

  • Messages may be consumed days or weeks after they were produced.
  • Multiple consumer groups may run different schema versions simultaneously.
  • Messages cannot easily be "migrated" in place (unlike a database, you cannot run ALTER TABLE on a Kafka topic).

Schema Registries

In systems that use binary encoding (especially Avro with Kafka), a schema registry acts as a central repository for all schema versions. Producers register their schema before publishing, and consumers look up the schema when reading.

Step 1: Producer Registers Schema

Before publishing a message, the producer sends its schema to the registry. The registry assigns a schema ID and checks compatibility with previous versions. If the schema breaks compatibility rules, the registration is rejected.

Step 2: Encode with Schema ID

The producer encodes the message using the schema and prepends the schema ID (typically 4 bytes) to the encoded payload. The schema itself is not included in every message -- only the compact ID.

Step 3: Consumer Fetches Schema

The consumer reads the schema ID from the message, fetches the corresponding schema from the registry (cached locally after the first fetch), and uses it to decode the payload.

Step 4: Schema Resolution

If the consumer's expected schema differs from the writer's schema, the decoding library (e.g., Avro) performs schema resolution, mapping fields by name and filling in defaults for missing fields.

Confluent Schema Registry is the most widely used implementation, supporting Avro, Protobuf, and JSON Schema with configurable compatibility modes: BACKWARD, FORWARD, FULL (both directions), and NONE.
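The wire framing in steps 2 and 3 is simple. The sketch below follows the Confluent convention of a magic byte followed by a 4-byte big-endian schema ID, using only the standard library; the payload is an opaque stand-in for the Avro-encoded bytes:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: 1 magic byte, then 4-byte schema ID

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prepend the magic byte and big-endian schema ID to the payload."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes):
    """Split a message back into (schema_id, payload) on the consumer side."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a schema-registry framed message")
    return schema_id, message[5:]

avro_bytes = b"\x54\x0aAlice Chen"   # stand-in for an Avro-encoded record
msg = frame(schema_id=7, payload=avro_bytes)
print(len(msg) - len(avro_bytes))    # 5 bytes of framing overhead per message
print(unframe(msg)[0])               # 7: the consumer looks this ID up
```

Five bytes per message instead of a full embedded schema is what makes this scheme practical for high-volume topics.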

Contract-First Development

In a contract-first approach, the schema (the "contract") is defined before any code is written. This ensures that producers and consumers agree on the data format upfront, and changes go through a review process.

  • Define the schema in a shared repository (e.g., a proto file in a shared Git repo).
  • Generate code from the schema for each service's language (Java, Go, Python, TypeScript).
  • Review schema changes through pull requests, with automated compatibility checks via the schema registry.
  • Publish the schema to the registry on merge. Services pick up the new generated code on their next deployment.

Best Practice: Schema Evolution Checklist

When modifying a schema: (1) Only add new optional fields with sensible defaults. (2) Never remove or rename fields that are in use. (3) Never reuse deleted field numbers. (4) Run compatibility checks against the schema registry before deploying. (5) Version your APIs explicitly (URL or header-based). (6) Document the meaning of each field -- the schema is the contract between teams.

Practical Comparison: When to Use What

Use Case                   | Recommended Format              | Reason
---------------------------|---------------------------------|--------------------------------------------------------------------
Public-facing REST API     | JSON                            | Universally supported, human-readable, easy to debug
Internal microservice RPC  | Protobuf (gRPC)                 | Compact, fast, strong schema evolution, code generation
Kafka event streaming      | Avro + Schema Registry          | Most compact encoding, schema resolution, registry integration
Hadoop / data lake storage | Avro or Parquet                 | Avro for row-oriented, Parquet for columnar analytics
Configuration files        | JSON, YAML, or TOML             | Human-readable, editable, version-controlled
Browser-to-server          | JSON (or Protobuf via gRPC-Web) | Native browser JSON support; gRPC-Web for performance-critical apps

Key Takeaways

  • Encoding format choice affects storage size, parsing speed, and most critically, your ability to evolve schemas without downtime.
  • JSON is human-readable and universally supported but verbose and lacks built-in schema enforcement. Use it for public APIs and configuration.
  • Binary formats (Protobuf, Thrift, Avro) are compact, fast, and support schema evolution through field numbers (Protobuf/Thrift) or schema resolution (Avro).
  • Backward compatibility (new code reads old data) and forward compatibility (old code reads new data) are both required for safe rolling deployments.
  • Safe schema changes: add optional fields with defaults. Unsafe changes: remove fields, change types, reuse field numbers, rename fields in Avro.
  • A schema registry centralizes schema management, enforces compatibility rules, and enables efficient binary encoding by replacing schemas with compact IDs in each message.
  • Contract-first development -- defining the schema before writing code -- prevents breaking changes and aligns teams in a microservices architecture.
