Avro

Advantages

  • It has a direct mapping to and from JSON.
  • It has a very compact format; unlike JSON, it does not repeat every field name with every single record, which is the bulk of what makes JSON inefficient for high-volume usage.
  • It is very fast.
  • It has great bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier; but it does not require code generation, so tools can be written generically for any data stream.
  • It has a rich, extensible schema language defined in pure JSON.
  • It has the best notion of compatibility for evolving your data over time.

Avro Specification

Avro Specification - Logical types

  • official logical types

    • decimal
    • uuid
    • date
    • time-millis
    • time-micros
    • timestamp-millis
    • timestamp-micros
    • local-timestamp-millis
    • local-timestamp-micros
    • duration
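
A logical type annotates an underlying primitive or complex type; the value on the wire is still the primitive. As a rough stdlib-only Python sketch (the helper names are my own, not part of any Avro library): timestamp-millis is a long counting milliseconds since the Unix epoch, and decimal is the two's-complement big-endian bytes of the unscaled integer value, with the scale recorded in the schema rather than the data:

```python
from datetime import datetime, timezone
from decimal import Decimal

def to_timestamp_millis(dt: datetime) -> int:
    # timestamp-millis annotates long: milliseconds since the Unix epoch (UTC)
    return int(dt.timestamp() * 1000)

def to_decimal_bytes(d: Decimal, scale: int) -> bytes:
    # decimal annotates bytes/fixed: two's-complement big-endian unscaled value;
    # the scale lives in the schema, not in the encoded data
    unscaled = int(d.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)  # room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)

dt = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(to_timestamp_millis(dt))                       # -> 1704067200000
print(to_decimal_bytes(Decimal("12.34"), 2).hex())   # -> 04d2 (i.e. 1234, scale 2)
```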

Avro IDL (.avdl)

  • Plain text, using Avro IDL syntax

    namespace default.namespace.for.named.schemata;
    schema Message;
    
    record Message {
        string? title = null;
        string message;
    }
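
Running avro-tools idl2schemata on the IDL above should produce roughly the following Message.avsc (my rendering, not captured tool output). Note how the optional string? field becomes a union with null listed first and a null default:

```json
{
  "type": "record",
  "name": "Message",
  "namespace": "default.namespace.for.named.schemata",
  "fields": [
    {
      "name": "title",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "message",
      "type": "string"
    }
  ]
}
```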

Avro Schema (.avsc)

  • JSON format

    // Example
    {
      "type": "record",
      "name": "User",
      "namespace": "my.types",
      "doc": "User record",
      "fields": [
        {
          "name": "name",
          "type": "string"
        },
        {
          "name": "email",
          "type": "string"
        }
      ]
    }

Avro Protocol (.avpr)

Avro protocols describe RPC interfaces. Like schemas, they are defined with JSON text.

Avro Data (.avro)

  • Binary object container file format; the writer schema is stored in the file header, followed by data blocks

Schema - Cheatsheet

Schema - Reference a shared type

Use the fully qualified name ($namespace.$type) to reference a custom type

{
  "type": "record",
  "namespace": "data.add",
  "name": "Address",
  "fields": [
    {
      "name": "student",
      "type": "data.add.Student"
    }
  ]
}

Schema - Specify order of compilation

  • Avro Maven plugin

    Use includes to specify the order of compilation, so that referenced types are generated before the schemas that depend on them

    <plugin>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-maven-plugin</artifactId>
        <version>${avro.version}</version>
        <configuration>
            <stringType>String</stringType>
            <sourceDirectory>${project.basedir}/src/main/avro/schema</sourceDirectory>
            <outputDirectory>${project.build.directory}/generated-sources/avro</outputDirectory>
            <includes>
                <include>**/Message.avsc</include>
                <include>**/UserDtoAvro.avsc</include>
                <include>**/Metadata.avsc</include>
                <include>**/Template.avsc</include>
                <include>**/smsRequested.avsc</include>
            </includes>
        </configuration>
        <executions>
            <execution>
                <id>schema-to-java</id>
                <phase>generate-sources</phase>
                <goals>
                    <goal>schema</goal>
                </goals>
            </execution>
        </executions>
    </plugin>

Avro Serialization

Avro data files always store the schema they were written with, so a serialized item can be read even when the schema is not known ahead of time. This allows serialization and deserialization without code generation.
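
To make the "schema travels with the data" point concrete, here is a stdlib-only Python sketch of an object container file, following the layout in the Avro spec (magic Obj\x01, a metadata map holding avro.schema and avro.codec, a 16-byte sync marker, then data blocks). All function names are my own, and the reader only recovers the header; a real application would use an Avro library rather than hand-rolling this:

```python
import json

def encode_long(n: int) -> bytes:
    # Avro variable-length zig-zag encoding for int/long
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_bytes(b: bytes) -> bytes:
    # Avro bytes (and strings): long length prefix, then the raw bytes
    return encode_long(len(b)) + b

def decode_long(buf: bytes, pos: int):
    acc, shift = 0, 0
    while True:
        byte = buf[pos]; pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo zig-zag

def write_container(schema: dict, encoded_records: bytes, count: int) -> bytes:
    sync = bytes(16)                   # fixed sync marker; fine for a sketch
    meta = encode_long(2)              # metadata map with two entries
    meta += encode_bytes(b"avro.schema") + encode_bytes(json.dumps(schema).encode())
    meta += encode_bytes(b"avro.codec") + encode_bytes(b"null")
    meta += encode_long(0)             # end of map
    block = encode_long(count) + encode_long(len(encoded_records)) + encoded_records + sync
    return b"Obj\x01" + meta + sync + block

def read_schema(data: bytes) -> dict:
    # Recover the writer schema from the file header alone
    # (simplified: assumes positive map block counts, as our writer emits)
    assert data[:4] == b"Obj\x01"
    pos, meta = 4, {}
    while True:
        n, pos = decode_long(data, pos)
        if n == 0:
            break
        for _ in range(n):
            klen, pos = decode_long(data, pos)
            key = data[pos:pos + klen].decode(); pos += klen
            vlen, pos = decode_long(data, pos)
            meta[key] = data[pos:pos + vlen]; pos += vlen
    return json.loads(meta["avro.schema"])

schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"}]}
name = "Alice".encode()
record = encode_long(len(name)) + name   # a string field is length + UTF-8 bytes
blob = write_container(schema, record, 1)
print(read_schema(blob)["name"])         # -> User
```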

Avro Serialization - Encoding

Avro specifies two serialization encodings: binary and JSON.

Binary

  • Default encoding
  • Smaller and faster than the JSON encoding
  • Does not include field names, self-contained type information, or field/record separators; readers are therefore wholly reliant on the schema used when the data was encoded.
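
A quick way to see why the binary encoding is compact: a record body is just its field values concatenated in schema order, with no field names or separators. A stdlib-only Python sketch (the helper names are my own) comparing it with JSON for a two-field User record:

```python
import json

def encode_long(n: int) -> bytes:
    # Avro variable-length zig-zag encoding for int/long
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_string(s: str) -> bytes:
    # Avro string: long length prefix, then UTF-8 bytes
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw

# A record body is its field values in schema order: no names, no separators
record = {"name": "Alice", "email": "alice@example.com"}
avro_body = encode_string(record["name"]) + encode_string(record["email"])
json_body = json.dumps(record).encode()

print(len(avro_body), len(json_body))  # -> 24 47
```

The gap widens with record count, since JSON re-spells the field names in every record while Avro states them once in the schema.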

JSON

  • Human readable, easy for debugging
  • Useful for web applications

Cheatsheet

Avro IDL -> Avro schema

avro-tools idl2schemata ${protocol.avdl} .

Get Avro schema from Avro data

avro-tools getschema ${data.avro} > ${schema.avsc}

Avro schema + JSON data -> Avro data

avro-tools fromjson --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro schema + JSON data -> Avro data (with compression)

avro-tools fromjson --codec deflate --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro data -> JSON data

avro-tools tojson --pretty ${data.avro} > ${data.json}

Java source code -> Avro schema

  • avro-tools induce

Avro schema -> Java source code

  • avro-tools compile schema ${schema.avsc} ${output_dir}

CLI

avro-tools

  • Installation

    # Homebrew
    brew install avro-tools
  • Usage

    ❯ avro-tools
      Version 1.11.3 of Apache Avro
      Copyright 2010-2015 The Apache Software Foundation
    
      This product includes software developed at
      The Apache Software Foundation (https://www.apache.org/).
      ----------------
      Available tools:
          canonical  Converts an Avro Schema to its canonical form
                cat  Extracts samples from files
            compile  Generates Java code for the given schema.
             concat  Concatenates avro files without re-compressing.
              count  Counts the records in avro files or folders
        fingerprint  Returns the fingerprint for the schemas.
         fragtojson  Renders a binary-encoded Avro datum as JSON.
           fromjson  Reads JSON records and writes an Avro data file.
           fromtext  Imports a text file into an avro data file.
            getmeta  Prints out the metadata of an Avro data file.
          getschema  Prints out schema of an Avro data file.
                idl  Generates a JSON schema from an Avro IDL file
       idl2schemata  Extract JSON schemata of the types from an Avro IDL file
             induce  Induce schema/protocol from Java class/interface via reflection.
         jsontofrag  Renders a JSON-encoded Avro datum as binary.
             random  Creates a file with randomly generated instances of a schema.
            recodec  Alters the codec of a data file.
             repair  Recovers data from a corrupt Avro Data file
        rpcprotocol  Output the protocol of a RPC service
         rpcreceive  Opens an RPC Server and listens for one message.
            rpcsend  Sends a single RPC message.
             tether  Run a tethered mapreduce job.
             tojson  Dumps an Avro data file as JSON, record per line or pretty.
             totext  Converts an Avro data file to a text file.
           totrevni  Converts an Avro data file to a Trevni file.
        trevni_meta  Dumps a Trevni file's metadata as JSON.
      trevni_random  Create a Trevni file filled with random instances of a schema.
      trevni_tojson  Dumps a Trevni file as JSON.

Schema Evolution

Backward compatibility

A new schema can read data generated with the old schema.

Forward compatibility

An old schema can read data generated with the new schema.

  • Examples

    • Additions of fields

      Adding a field is forward-compatible: older readers, whose schema does not include the new field, simply ignore it during schema resolution. It is also backward-compatible only if the new field has a default value, so that a reader using the new schema can fill it in when reading data written with the old schema.

    • Removals of fields

      Removing a field is backward-compatible: a reader using the new schema does not expect the field and ignores it in old data. It is forward-compatible only if the older (reader) schema declares a default for the removed field; if an old reader expects a field the writer no longer writes and has no default for it, reading fails.

    • Changing field types

      Some type changes are safe because Avro promotes types during schema resolution: int -> long, float, or double; long -> float or double; float -> double; string <-> bytes. If the writer produces a type the reader can accept via promotion, compatibility holds; other type changes break it.

    • Making a field optional/nullable

      Changing a field's type to a union that contains the original type (e.g., "string" -> ["null","string"]) is backward-compatible. It is only forward-compatible for values the old reader can resolve: an old reader whose schema says plain "string" fails on records where the writer actually wrote null. Note also that a union field's default value must correspond to the first branch of the union (null for ["null","string"]).

    • Moving fields into nested records or renaming

      Renaming a field without aliasing breaks compatibility. Use the "aliases" attribute to preserve compatibility when renaming.
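
The rules above can be sketched with a toy resolver for record fields (a stdlib-only simplification of Avro schema resolution; the function name and field layout are my own):

```python
def resolve_record(writer_fields, reader_fields, datum):
    """Toy version of Avro schema resolution for a record's fields.

    Fields present in both schemas are taken from the datum; fields only
    in the reader schema fall back to their default, or fail without one.
    Fields only in the writer schema are simply ignored.
    """
    writer_names = {f["name"] for f in writer_fields}
    out = {}
    for f in reader_fields:
        if f["name"] in writer_names:
            out[f["name"]] = datum[f["name"]]
        elif "default" in f:
            out[f["name"]] = f["default"]
        else:
            raise ValueError(f"field {f['name']!r} has no value and no default")
    return out

old = [{"name": "title", "type": "string"}]
new = [{"name": "title", "type": "string"},
       {"name": "lang", "type": "string", "default": "en"}]

# Backward compatible: a new reader fills the missing field from its default
print(resolve_record(old, new, {"title": "hi"}))
# Forward compatible: an old reader simply ignores the extra field
print(resolve_record(new, old, {"title": "hi", "lang": "fr"}))
```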

Implementations

Python - fastavro

  • Supported Features

    • File Writer
    • File Reader (iterating via records or blocks)
    • Schemaless Writer
    • Schemaless Reader
    • JSON Writer
    • JSON Reader
    • Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)
    • Schema resolution
    • Aliases
    • Logical Types
    • Parsing schemas into the canonical form
    • Schema fingerprinting