Advantages
- It has a direct mapping to and from JSON.
- It has a very compact format. The bulk of JSON, repeating every field name with every single record, is what makes JSON inefficient for high-volume use; Avro's binary encoding avoids this repetition.
- It is very fast.
- It has great bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier. At the same time, it does not require code generation, so tools can be written generically for any data stream.
- It has a rich, extensible schema language defined in pure JSON.
- It has the best notion of compatibility for evolving your data over time.
Avro Specification
Avro Specification - Logical types
- Official logical types: decimal, uuid, date, time-millis, time-micros, timestamp-millis, timestamp-micros, local-timestamp-millis, local-timestamp-micros, duration
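For example, a logical type is declared as an attribute on an underlying primitive type (field name illustrative):

```json
{ "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-millis" } }
```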
Avro IDL (.avdl)
- Plain text, in Avro IDL syntax:

```
namespace default.namespace.for.named.schemata;
schema Message;

record Message {
    string? title = null;
    string message;
}
```
Avro Schema (.avsc)
- JSON format, for example:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "my.types",
  "doc": "User record",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" }
  ]
}
```
Avro Protocol (.avpr)
Avro protocols describe RPC interfaces. Like schemas, they are defined with JSON text.
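A minimal protocol sketch, modeled on the HelloWorld example in the Avro specification (namespace and names illustrative):

```json
{
  "protocol": "HelloWorld",
  "namespace": "com.example",
  "types": [
    {
      "type": "record",
      "name": "Greeting",
      "fields": [ { "name": "message", "type": "string" } ]
    }
  ],
  "messages": {
    "hello": {
      "request": [ { "name": "greeting", "type": "Greeting" } ],
      "response": "Greeting"
    }
  }
}
```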
Avro Data (.avro)
- Binary; an Avro data file (object container file) stores the serialized records together with the schema they were written with
Schema - Cheatsheet
Schema - Reference a shared type
Use `$namespace.$type` to reference a custom type:

```json
{
  "type": "record",
  "namespace": "data.add",
  "name": "Address",
  "fields": [
    { "name": "student", "type": "data.add.Student" }
  ]
}
```

Schema - Specify order of compilation
- Avro Maven plugin
Use `includes` to specify the order of compilation, so that schemas defining shared types are compiled before the schemas that reference them:

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <configuration>
    <stringType>String</stringType>
    <sourceDirectory>${project.basedir}/src/main/avro/schema</sourceDirectory>
    <outputDirectory>${project.build.directory}/generated-sources/avro</outputDirectory>
    <includes>
      <include>**/Message.avsc</include>
      <include>**/UserDtoAvro.avsc</include>
      <include>**/Metadata.avsc</include>
      <include>**/Template.avsc</include>
      <include>**/smsRequested.avsc</include>
    </includes>
  </configuration>
  <executions>
    <execution>
      <id>schema-to-java</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```
Avro Serialization
Avro data is always stored with its corresponding schema: a data file embeds the schema it was written with, meaning we can always read a serialized item regardless of whether we know the schema ahead of time. This allows us to perform serialization and deserialization without code generation.
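A minimal Java sketch of this round trip (class name and schema are illustrative): write records with the generic API, then read them back letting the reader pick the schema up from the file itself:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericRoundTrip {
    public static void main(String[] args) throws Exception {
        // Parse the schema at runtime; no generated classes involved.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"my.types\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("email", "ada@example.com");

        // The data file embeds the schema alongside the records.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read back without knowing the schema ahead of time:
        // the reader picks it up from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```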
Avro Serialization - Encoding
Avro specifies two serialization encodings: binary and JSON.
Binary
- Default encoding
- More performant: smaller payloads, faster to encode and decode
- Does not include field names, self-contained information about the types of individual bytes, nor field or record separators. Therefore readers are wholly reliant on the schema used when the data was encoded.
JSON
- Human readable, easy for debugging
- Useful for web applications
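A small Java sketch (schema and values illustrative) contrasting the two encodings on the same record:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class EncodingComparison {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("email", "ada@example.com");
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);

        // Binary encoding: no field names or separators, just the encoded values.
        ByteArrayOutputStream binary = new ByteArrayOutputStream();
        Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(binary, null);
        writer.write(user, binaryEncoder);
        binaryEncoder.flush();

        // JSON encoding: human-readable, spells out field names, larger.
        ByteArrayOutputStream json = new ByteArrayOutputStream();
        Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(schema, json);
        writer.write(user, jsonEncoder);
        jsonEncoder.flush();

        System.out.println("binary: " + binary.size() + " bytes");
        System.out.println("json:   " + json);
    }
}
```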
Cheatsheet
Avro IDL -> Avro schema

```sh
avro-tools idl2schemata ${protocol.avdl} .
```

Get Avro schema from Avro data

```sh
avro-tools getschema ${data.avro} > ${schema.avsc}
```

Avro schema + JSON data -> Avro data

```sh
avro-tools fromjson --schema-file ${schema.avsc} ${data.json} > ${data.avro}
```

Avro schema + JSON data -> Avro data (with compression)

```sh
avro-tools fromjson --codec deflate --schema-file ${schema.avsc} ${data.json} > ${data.avro}
```

Avro data -> JSON data

```sh
avro-tools tojson --pretty ${data.avro} > ${data.json}
```

Java source code -> Avro schema
- Jackson Binary Dataformats
- `avro-tools induce`

Avro schema -> Java source code
- `avro-tools compile`
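For example, a typical compile invocation (placeholders in the same style as the commands above):

```sh
avro-tools compile schema ${schema.avsc} ${output_dir}
```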
CLI
avro-tools
- Installation

```sh
# Homebrew
brew install avro-tools
```

- Usage
```
❯ avro-tools
Version 1.11.3 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (https://www.apache.org/).
----------------
Available tools:
      canonical  Converts an Avro Schema to its canonical form
            cat  Extracts samples from files
        compile  Generates Java code for the given schema.
         concat  Concatenates avro files without re-compressing.
          count  Counts the records in avro files or folders
    fingerprint  Returns the fingerprint for the schemas.
     fragtojson  Renders a binary-encoded Avro datum as JSON.
       fromjson  Reads JSON records and writes an Avro data file.
       fromtext  Imports a text file into an avro data file.
        getmeta  Prints out the metadata of an Avro data file.
      getschema  Prints out schema of an Avro data file.
            idl  Generates a JSON schema from an Avro IDL file
   idl2schemata  Extract JSON schemata of the types from an Avro IDL file
         induce  Induce schema/protocol from Java class/interface via reflection.
     jsontofrag  Renders a JSON-encoded Avro datum as binary.
         random  Creates a file with randomly generated instances of a schema.
        recodec  Alters the codec of a data file.
         repair  Recovers data from a corrupt Avro Data file
    rpcprotocol  Output the protocol of a RPC service
     rpcreceive  Opens an RPC Server and listens for one message.
        rpcsend  Sends a single RPC message.
         tether  Run a tethered mapreduce job.
         tojson  Dumps an Avro data file as JSON, record per line or pretty.
         totext  Converts an Avro data file to a text file.
       totrevni  Converts an Avro data file to a Trevni file.
    trevni_meta  Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a schema.
  trevni_tojson  Dumps a Trevni file as JSON.
```

- Resources
Schema Evolution
Backward compatibility
The new (reader) schema can read data written with the old schema.
Forward compatibility
The old (reader) schema can read data written with the new schema.
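A minimal Java sketch of the backward-compatible direction (class name and schemas illustrative): the reader supplies both the writer schema and its own newer schema, and Avro's schema resolution fills in the defaulted field:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaResolutionDemo {
    public static void main(String[] args) throws Exception {
        // Old (writer) schema: just a name.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");
        // New (reader) schema: adds "email" with a default value.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}]}");

        // Write a record with the old schema...
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("name", "Ada");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // ...and read it with both schemas: Avro resolves the difference
        // and fills "email" with the default from the reader schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord resolved =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
        System.out.println(resolved); // {"name": "Ada", "email": ""}
    }
}
```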
Examples
- Additions of fields
  Adding a field to the newer (writer) schema is forward-compatible: older readers whose schema does not include the field simply ignore it. For backward compatibility, when the reader schema does include the new field, the field must have a default value, which Avro fills in from the reader schema (this is exactly the case in the sketch above).
- Removals of fields
  Removing a field from the writer schema is forward-compatible if the removed field has a default in the older (reader) schema, or the reader schema does not require it. If an older reader expects a field that the writer no longer writes, reading fails unless the reader supplies a default.
- Changing field types
  Some type changes are safe because Avro can promote the writer's type to the reader's type (e.g., a writer's int can be read as long, float, or double). Compatibility holds whenever the type the writer produces is promotable to the type the reader expects; non-promotable or otherwise incompatible type changes break compatibility.
- Making a field optional/nullable
  Adding `null` to a field's union (e.g., `"string"` -> `["null","string"]`) can be forward-compatible when defaults are provided and reader expectations align. Care is needed: writer and reader unions must be compatible, and a union's default value must match its first branch, so a nullable field with default `null` must list `"null"` first.
- Moving fields into nested records or renaming
  Renaming a field without aliasing breaks compatibility. Use the `aliases` attribute to preserve compatibility when renaming (see the example below).
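For instance, if `name` were renamed to `full_name`, an alias in the reader schema lets it still match data written under the old name (schema illustrative):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "full_name", "aliases": ["name"], "type": "string" }
  ]
}
```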
Performance
Code
- avro/share/test/schemas at main · apache/avro
  IDL (.avdl), Protocol (.avpr), and Schema (.avsc) examples
- GitHub - linkedin/avro-util
  A collection of utilities and libraries to allow Java projects to better work with Avro.
Implementations
Python - fastavro
- Supported Features
- File Writer
- File Reader (iterating via records or blocks)
- Schemaless Writer
- Schemaless Reader
- JSON Writer
- JSON Reader
- Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)
- Schema resolution
- Aliases
- Logical Types
- Parsing schemas into the canonical form
- Schema fingerprinting