Avro

Advantages

  • It has a direct mapping to and from JSON.
  • It has a very compact format; unlike JSON, it does not repeat every field name with every single record, which is the bulk of what makes JSON inefficient for high-volume usage.
  • It is very fast.
  • It has great bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier; but it does not require code generation, so tools can be written generically for any data stream.
  • It has a rich, extensible schema language defined in pure JSON.
  • It has the best notion of compatibility for evolving your data over time.

Avro Specification

Avro Specification - Logical types

  • official logical types

    • decimal
    • uuid
    • date
    • time-millis
    • time-micros
    • timestamp-millis
    • timestamp-micros
    • local-timestamp-millis
    • local-timestamp-micros
    • duration
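
A logical type annotates an underlying primitive or complex type; the value on the wire is still the primitive. As a rough stdlib-only Python sketch (the helper names are my own, not part of any Avro library): timestamp-millis is a long counting milliseconds since the Unix epoch, and decimal is the two's-complement big-endian bytes of the unscaled integer value, with the scale recorded in the schema rather than the data:

```python
from datetime import datetime, timezone
from decimal import Decimal

def to_timestamp_millis(dt: datetime) -> int:
    # timestamp-millis annotates long: milliseconds since the Unix epoch (UTC)
    return int(dt.timestamp() * 1000)

def to_decimal_bytes(d: Decimal, scale: int) -> bytes:
    # decimal annotates bytes/fixed: two's-complement big-endian unscaled value;
    # the scale lives in the schema, not in the encoded data
    unscaled = int(d.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)  # room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)

dt = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(to_timestamp_millis(dt))                       # -> 1704067200000
print(to_decimal_bytes(Decimal("12.34"), 2).hex())   # -> 04d2 (i.e. 1234, scale 2)
```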

Avro IDL (.avdl)

  • Plain text, using Avro IDL syntax

    namespace default.namespace.for.named.schemata;
    schema Message;
    
    record Message {
        string? title = null;
        string message;
    }
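
Running avro-tools idl2schemata on the IDL above should produce roughly the following Message.avsc (my rendering, not captured tool output). Note how the optional string? field becomes a union with null listed first and a null default:

```json
{
  "type": "record",
  "name": "Message",
  "namespace": "default.namespace.for.named.schemata",
  "fields": [
    {
      "name": "title",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "message",
      "type": "string"
    }
  ]
}
```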

Avro Schema (.avsc)

  • JSON format

    // Example
    {
      "type": "record",
      "name": "User",
      "namespace": "my.types",
      "doc": "User record",
      "fields": [
        {
          "name": "name",
          "type": "string"
        },
        {
          "name": "email",
          "type": "string"
        }
      ]
    }

Avro Protocol (.avpr)

Avro protocols describe RPC interfaces. Like schemas, they are defined with JSON text.

Avro Data (.avro)

  • Binary object container file format; the writer schema is stored in the file header, followed by data blocks

Schema - Cheatsheet

Schema - Reference a shared type

Use the fully qualified name ($namespace.$type) to reference a custom type

{
  "type": "record",
  "namespace": "data.add",
  "name": "Address",
  "fields": [
    {
      "name": "student",
      "type": "data.add.Student"
    }
  ]
}

Schema - Specify order of compilation

  • Avro Maven plugin

    Use includes to specify the order of compilation, so that referenced types are generated before the schemas that depend on them

    <plugin>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-maven-plugin</artifactId>
        <version>${avro.version}</version>
        <configuration>
            <stringType>String</stringType>
            <sourceDirectory>${project.basedir}/src/main/avro/schema</sourceDirectory>
            <outputDirectory>${project.build.directory}/generated-sources/avro</outputDirectory>
            <includes>
                <include>**/Message.avsc</include>
                <include>**/UserDtoAvro.avsc</include>
                <include>**/Metadata.avsc</include>
                <include>**/Template.avsc</include>
                <include>**/smsRequested.avsc</include>
            </includes>
        </configuration>
        <executions>
            <execution>
                <id>schema-to-java</id>
                <phase>generate-sources</phase>
                <goals>
                    <goal>schema</goal>
                </goals>
            </execution>
        </executions>
    </plugin>

Avro Serialization

Avro data files always store the schema they were written with, so a serialized item can be read even when the schema is not known ahead of time. This allows serialization and deserialization without code generation.
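
To make the "schema travels with the data" point concrete, here is a stdlib-only Python sketch of an object container file, following the layout in the Avro spec (magic Obj\x01, a metadata map holding avro.schema and avro.codec, a 16-byte sync marker, then data blocks). All function names are my own, and the reader only recovers the header; a real application would use an Avro library rather than hand-rolling this:

```python
import json

def encode_long(n: int) -> bytes:
    # Avro variable-length zig-zag encoding for int/long
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_bytes(b: bytes) -> bytes:
    # Avro bytes (and strings): long length prefix, then the raw bytes
    return encode_long(len(b)) + b

def decode_long(buf: bytes, pos: int):
    acc, shift = 0, 0
    while True:
        byte = buf[pos]; pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo zig-zag

def write_container(schema: dict, encoded_records: bytes, count: int) -> bytes:
    sync = bytes(16)                   # fixed sync marker; fine for a sketch
    meta = encode_long(2)              # metadata map with two entries
    meta += encode_bytes(b"avro.schema") + encode_bytes(json.dumps(schema).encode())
    meta += encode_bytes(b"avro.codec") + encode_bytes(b"null")
    meta += encode_long(0)             # end of map
    block = encode_long(count) + encode_long(len(encoded_records)) + encoded_records + sync
    return b"Obj\x01" + meta + sync + block

def read_schema(data: bytes) -> dict:
    # Recover the writer schema from the file header alone
    # (simplified: assumes positive map block counts, as our writer emits)
    assert data[:4] == b"Obj\x01"
    pos, meta = 4, {}
    while True:
        n, pos = decode_long(data, pos)
        if n == 0:
            break
        for _ in range(n):
            klen, pos = decode_long(data, pos)
            key = data[pos:pos + klen].decode(); pos += klen
            vlen, pos = decode_long(data, pos)
            meta[key] = data[pos:pos + vlen]; pos += vlen
    return json.loads(meta["avro.schema"])

schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"}]}
name = "Alice".encode()
record = encode_long(len(name)) + name   # a string field is length + UTF-8 bytes
blob = write_container(schema, record, 1)
print(read_schema(blob)["name"])         # -> User
```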

Avro Serialization - Encoding

Avro specifies two serialization encodings: binary and JSON.

Binary

  • Default encoding
  • Smaller and faster than the JSON encoding
  • Does not include field names, self-contained type information, or field/record separators; readers are therefore wholly reliant on the schema used when the data was encoded.
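
A quick way to see why the binary encoding is compact: a record body is just its field values concatenated in schema order, with no field names or separators. A stdlib-only Python sketch (the helper names are my own) comparing it with JSON for a two-field User record:

```python
import json

def encode_long(n: int) -> bytes:
    # Avro variable-length zig-zag encoding for int/long
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_string(s: str) -> bytes:
    # Avro string: long length prefix, then UTF-8 bytes
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw

# A record body is its field values in schema order: no names, no separators
record = {"name": "Alice", "email": "alice@example.com"}
avro_body = encode_string(record["name"]) + encode_string(record["email"])
json_body = json.dumps(record).encode()

print(len(avro_body), len(json_body))  # -> 24 47
```

The gap widens with record count, since JSON re-spells the field names in every record while Avro states them once in the schema.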

JSON

  • Human readable, easy for debugging
  • Useful for web applications

Cheatsheet

Avro IDL -> Avro schema

avro-tools idl2schemata ${protocol.avdl} .

Get Avro schema from Avro data

avro-tools getschema ${data.avro} > ${schema.avsc}

Avro schema + JSON data -> Avro data

avro-tools fromjson --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro schema + JSON data -> Avro data (with compression)

avro-tools fromjson --codec deflate --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro data -> JSON data

avro-tools tojson --pretty ${data.avro} > ${data.json}

Java source code -> Avro schema

  • avro-tools induce

Avro schema -> Java source code

  • avro-tools compile schema ${schema.avsc} ${output_dir}

CLI

avro-tools

  • Installation

    # Homebrew
    brew install avro-tools
  • Usage

    ❯ avro-tools
      Version 1.11.3 of Apache Avro
      Copyright 2010-2015 The Apache Software Foundation
    
      This product includes software developed at
      The Apache Software Foundation (https://www.apache.org/).
      ----------------
      Available tools:
          canonical  Converts an Avro Schema to its canonical form
                cat  Extracts samples from files
            compile  Generates Java code for the given schema.
             concat  Concatenates avro files without re-compressing.
              count  Counts the records in avro files or folders
        fingerprint  Returns the fingerprint for the schemas.
         fragtojson  Renders a binary-encoded Avro datum as JSON.
           fromjson  Reads JSON records and writes an Avro data file.
           fromtext  Imports a text file into an avro data file.
            getmeta  Prints out the metadata of an Avro data file.
          getschema  Prints out schema of an Avro data file.
                idl  Generates a JSON schema from an Avro IDL file
       idl2schemata  Extract JSON schemata of the types from an Avro IDL file
             induce  Induce schema/protocol from Java class/interface via reflection.
         jsontofrag  Renders a JSON-encoded Avro datum as binary.
             random  Creates a file with randomly generated instances of a schema.
            recodec  Alters the codec of a data file.
             repair  Recovers data from a corrupt Avro Data file
        rpcprotocol  Output the protocol of a RPC service
         rpcreceive  Opens an RPC Server and listens for one message.
            rpcsend  Sends a single RPC message.
             tether  Run a tethered mapreduce job.
             tojson  Dumps an Avro data file as JSON, record per line or pretty.
             totext  Converts an Avro data file to a text file.
           totrevni  Converts an Avro data file to a Trevni file.
        trevni_meta  Dumps a Trevni file's metadata as JSON.
      trevni_random  Create a Trevni file filled with random instances of a schema.
      trevni_tojson  Dumps a Trevni file as JSON.

Schema Evolution

Backward compatibility

A new schema can read data generated with the old schema.

Forward compatibility

An old schema can read data generated with the new schema.

  • Examples

    • Additions of fields

      Adding a field is forward-compatible: older readers, whose schema does not include the new field, simply ignore it during schema resolution. It is also backward-compatible only if the new field has a default value, so that a reader using the new schema can fill it in when reading data written with the old schema.

    • Removals of fields

      Removing a field is backward-compatible: a reader using the new schema does not expect the field and ignores it in old data. It is forward-compatible only if the older (reader) schema declares a default for the removed field; if an old reader expects a field the writer no longer writes and has no default for it, reading fails.

    • Changing field types

      Some type changes are safe because Avro promotes types during schema resolution: int -> long, float, or double; long -> float or double; float -> double; string <-> bytes. If the writer produces a type the reader can accept via promotion, compatibility holds; other type changes break it.

    • Making a field optional/nullable

      Changing a field's type to a union that contains the original type (e.g., "string" -> ["null","string"]) is backward-compatible. It is only forward-compatible for values the old reader can resolve: an old reader whose schema says plain "string" fails on records where the writer actually wrote null. Note also that a union field's default value must correspond to the first branch of the union (null for ["null","string"]).

    • Moving fields into nested records or renaming

      Renaming a field without aliasing breaks compatibility. Use the "aliases" attribute to preserve compatibility when renaming.
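
The rules above can be sketched with a toy resolver for record fields (a stdlib-only simplification of Avro schema resolution; the function name and field layout are my own):

```python
def resolve_record(writer_fields, reader_fields, datum):
    """Toy version of Avro schema resolution for a record's fields.

    Fields present in both schemas are taken from the datum; fields only
    in the reader schema fall back to their default, or fail without one.
    Fields only in the writer schema are simply ignored.
    """
    writer_names = {f["name"] for f in writer_fields}
    out = {}
    for f in reader_fields:
        if f["name"] in writer_names:
            out[f["name"]] = datum[f["name"]]
        elif "default" in f:
            out[f["name"]] = f["default"]
        else:
            raise ValueError(f"field {f['name']!r} has no value and no default")
    return out

old = [{"name": "title", "type": "string"}]
new = [{"name": "title", "type": "string"},
       {"name": "lang", "type": "string", "default": "en"}]

# Backward compatible: a new reader fills the missing field from its default
print(resolve_record(old, new, {"title": "hi"}))
# Forward compatible: an old reader simply ignores the extra field
print(resolve_record(new, old, {"title": "hi", "lang": "fr"}))
```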

Implementations

Python - fastavro

  • Supported Features

    • File Writer
    • File Reader (iterating via records or blocks)
    • Schemaless Writer
    • Schemaless Reader
    • JSON Writer
    • JSON Reader
    • Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)
    • Schema resolution
    • Aliases
    • Logical Types
    • Parsing schemas into the canonical form
    • Schema fingerprinting