Data Formats

Updated June 3, 2026
Magic Magnets Team

JSON (JavaScript Object Notation)

JSON is the dominant format for web APIs. Nearly every major programming language has a mature JSON parser, in many cases in its standard library, and because the format is schemaless you can add fields without a coordinated deployment, provided existing consumers ignore fields they do not recognise.

{ "id": 123, "username": "magic_user", "isActive": true }

The tradeoff is size and parse cost. JSON is a text format — every key name, every quote, every bracket is transmitted as characters. For high-frequency internal calls that exchange millions of messages per second, that overhead accumulates. JSON is best suited for public APIs, frontend-to-backend communication, and configuration files where debuggability matters more than raw throughput.
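The overhead is easy to measure. The following is a minimal Python sketch (the choice of Python and its standard json module is an assumption, not something the article prescribes) that serialises the record above in its most compact form and counts how many bytes are spent on key names.

```python
import json

# The record from the example above.
user = {"id": 123, "username": "magic_user", "isActive": True}

# Compact separators drop the optional whitespace after ',' and ':'.
payload = json.dumps(user, separators=(",", ":")).encode("utf-8")

# Bytes spent on the quoted key names alone, e.g. "username" with its quotes.
key_bytes = sum(len(json.dumps(key)) for key in user)

# Parsing the text back yields an equal dictionary.
decoded = json.loads(payload)

print(len(payload), key_bytes)  # 50 24
```

For this record, 24 of the 50 bytes are quoted key names, and they are sent again with every message.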

XML (eXtensible Markup Language)

Before JSON, XML was the standard interchange format. It uses paired open/close tags and supports complex schemas and XML namespaces, which made it attractive for enterprise systems that needed strict contract validation.

<User>
    <Id>123</Id>
    <Username>magic_user</Username>
    <IsActive>true</IsActive>
</User>

The cost is verbosity — every field is wrapped in two tags, which inflates payload sizes significantly compared to JSON. XML persists in legacy enterprise systems (SOAP APIs), RSS feeds, and Android UI layouts, but it is rarely chosen for new internal services.
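To put a number on the verbosity, here is a small Python sketch that builds the same user record as XML and as JSON and compares the byte counts. It assumes the standard library's xml.etree.ElementTree and json modules and serialises both without any optional whitespace.

```python
import json
import xml.etree.ElementTree as ET

# Build the <User> element from the example above.
user = ET.Element("User")
for tag, text in (("Id", "123"), ("Username", "magic_user"), ("IsActive", "true")):
    ET.SubElement(user, tag).text = text

# tostring() returns bytes with no whitespace between elements.
xml_payload = ET.tostring(user)

json_payload = json.dumps(
    {"id": 123, "username": "magic_user", "isActive": True},
    separators=(",", ":"),
).encode("utf-8")

print(len(xml_payload), len(json_payload))  # 81 50

# Note that XML has no native types: every value comes back as a string.
parsed = ET.fromstring(xml_payload)
print(parsed.findtext("Id"))  # 123 (a str, not an int)
```

The gap grows with longer tag names and deeper nesting, since every element name appears twice.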

Protocol Buffers (Protobuf)

Created by Google, Protobuf is a binary serialization format built around a strict schema defined upfront in a .proto file:

message User {
    int32 id = 1;
    string username = 2;
    bool is_active = 3;
}

The compiler (protoc) generates typed code for your target language (Python, Java, Go, etc.). On the wire, Protobuf drops the field name strings entirely: each field is written as a small tag holding the field number and a wire type, followed by the value in a compact binary encoding. Because both sides already share the .proto schema, the receiver knows that field number 2 means username, so there is no need to transmit the string "username" with every message. This is the primary source of Protobuf's size advantage. Payloads are often several times smaller than equivalent JSON, though the exact ratio depends on the data: messages dominated by long strings shrink far less than messages made of many small numeric fields.
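The wire layout can be illustrated by encoding the User message by hand. The sketch below is Python written for illustration only: real projects use the classes generated by the compiler, and the helper names (encode_varint, tag, encode_user) are hypothetical. It assumes non-negative integers and, unlike proto3, always writes every field even when it holds a default value.

```python
VARINT, LEN = 0, 2  # Protobuf wire types used by this message


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """A field's key: the field number and wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_user(user_id: int, username: str, is_active: bool) -> bytes:
    name = username.encode("utf-8")
    return (
        tag(1, VARINT) + encode_varint(user_id)            # int32 id = 1
        + tag(2, LEN) + encode_varint(len(name)) + name    # string username = 2
        + tag(3, VARINT) + encode_varint(int(is_active))   # bool is_active = 3
    )


wire = encode_user(123, "magic_user", True)
print(wire.hex(" "))  # 08 7b 12 0a 6d 61 67 69 63 5f 75 73 65 72 18 01
print(len(wire))      # 16
```

The result is 16 bytes, against 50 bytes for the same record as compact JSON. Ten of those 16 bytes are the username itself, which is why string-heavy messages see a smaller reduction.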

Parsing is fast because binary decoding skips the character-by-character tokenisation that text parsers require. The generated code is also strongly typed, so in statically typed languages, code that uses a field with the wrong name or type fails at compile time rather than at runtime. The compiler cannot catch mismatches between services running different schema versions; those are avoided by following Protobuf's compatibility rules, such as never reusing a field number. The downside is that binary data is not human-readable: without the .proto file, intercepted traffic is just field numbers and raw bytes. Protobuf (commonly paired with gRPC) is a standard choice for internal microservice-to-microservice communication, high-performance systems, and IoT devices where bandwidth is constrained.
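The readability point can be seen by decoding a message without its schema. The following Python sketch is illustrative only: it handles just the two wire types the User message needs, and the SCHEMA lookup table is a hypothetical stand-in for what the generated code knows from the .proto file. The input bytes are a hand-encoded User with id 123, username "magic_user", and is_active true.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_fields(buf: bytes) -> dict:
    """Decode varint (0) and length-delimited (2) fields, keyed by field number."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[number] = value
    return fields


wire = bytes.fromhex("087b120a6d616769635f757365721801")

# Without the schema: anonymous field numbers and untyped values.
raw = decode_fields(wire)
print(raw)  # {1: 123, 2: b'magic_user', 3: 1}

# With the schema (hypothetical table mirroring the .proto file): names and types.
SCHEMA = {
    1: ("id", int),
    2: ("username", lambda b: b.decode("utf-8")),
    3: ("is_active", bool),
}
user = {name: convert(raw[number]) for number, (name, convert) in SCHEMA.items()}
print(user)  # {'id': 123, 'username': 'magic_user', 'is_active': True}
```

Nothing in the bytes says that field 3 is a boolean rather than an integer, or that field 2 is text rather than a nested message. That information lives only in the schema.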

Avro and Parquet

JSON and Protobuf are designed for exchanging individual messages, not for storing and scanning millions of rows in a data warehouse or Hadoop cluster. Avro and Parquet target those bulk and analytical workloads.

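Apache Avro is a row-based binary format whose data is always read together with the schema it was written with. Avro data files embed that schema in the file header. In Apache Kafka streaming pipelines, where Avro is widely used, each message usually carries only a short schema ID and the full schema is looked up in a schema registry. Because a reader resolves the writer's schema against its own, producers and consumers can evolve their schemas independently, adding or removing fields that have default values, without breaking each other.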
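The field-matching rule behind schema evolution can be sketched in a few lines of Python. This is a simplified model that works on plain dictionaries, not on Avro's binary encoding, and it does not use an Avro library; the resolve function and both schemas are hypothetical examples. Fields are matched by name, fields the reader does not know are dropped, and fields the writer did not send are filled from the reader's default.

```python
# Schema the producer wrote with (older version).
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "username", "type": "string"},
        {"name": "legacy_flag", "type": "boolean"},
    ],
}

# Schema the consumer reads with: drops legacy_flag, adds is_active with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "username", "type": "string"},
        {"name": "is_active", "type": "boolean", "default": True},
    ],
}


def resolve(record: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = record[name]        # present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]    # new field: use the default
        else:
            raise ValueError(f"field {name!r} was not written and has no default")
    return resolved  # writer-only fields such as legacy_flag are ignored


old_record = {"id": 123, "username": "magic_user", "legacy_flag": False}
print(resolve(old_record, WRITER_SCHEMA, READER_SCHEMA))
# {'id': 123, 'username': 'magic_user', 'is_active': True}
```

The error branch shows why defaults matter: a field added without a default cannot be read from data written before the field existed.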

Apache Parquet is a column-oriented binary format. Rather than writing complete rows sequentially, Parquet stores the values of each column contiguously: all usernames together, all ages together. An analytical query that computes the average age of all users only needs to read the age column from disk; it skips every other column entirely. Values of the same type stored side by side also compress well. This makes Parquet significantly faster than row-oriented formats for aggregate and scan-heavy queries, which is why it is a de facto standard storage format for data lakes on AWS S3, Google Cloud Storage, and Azure Blob Storage.
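The difference between the two layouts can be modelled in memory with plain Python. This sketch does not read or write Parquet files; the sample rows are invented, and it only counts how many values each layout has to touch to compute the average age.

```python
from statistics import mean

# Invented sample data for illustration.
rows = [
    {"id": 1, "username": "ada", "age": 36},
    {"id": 2, "username": "grace", "age": 45},
    {"id": 3, "username": "linus", "age": 30},
]

# Row-oriented: each record is stored whole, so a scan passes over every field.
row_avg = mean(row["age"] for row in rows)
values_touched_row = sum(len(row) for row in rows)

# Column-oriented: pivot once at write time so each column is contiguous.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The query reads one column and never looks at id or username.
col_avg = mean(columns["age"])
values_touched_col = len(columns["age"])

print(row_avg, col_avg)                        # 37 37
print(values_touched_row, values_touched_col)  # 9 3
```

With three columns the saving is a factor of three; on a table with hundreds of columns, a query over one or two of them skips almost all of the data.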

Summary

Choosing a data format means trading human readability for machine efficiency. Use JSON for web-facing APIs where debuggability and broad client support matter. Use Protobuf with gRPC for high-speed internal microservice communication where payload size and parse latency are concerns. Use Avro for high-throughput event streaming pipelines such as Kafka, where schema evolution across producers and consumers is required. Use Parquet for storing large datasets in a data lake intended for analytical querying.
