Avro

Concepts

Avro IDL (.avdl)

Avro Schema (.avsc)

  • JSON format

    // Example
    {
      "type": "record",
      "name": "User",
      "namespace": "my.types",
      "doc": "User record",
      "fields": [
        {
          "name": "name",
          "type": "string"
        },
        {
          "name": "email",
          "type": "string"
        }
      ]
    }

Schema - Cheatsheet

Schema - Reference a shared type

Use the fully qualified name ($namespace.$type) to reference a named type that is defined in another schema

{
  "type": "record",
  "namespace": "data.add",
  "name": "Address",
  "fields": [
    {
      "name": "student",
      "type": "data.add.Student"
    }
  ]
}

Schema - Specify order of compilation

  • Avro Maven plugin

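    Use includes to list the schemas to compile, with shared types placed before the schemas that reference them. The plugin also has an imports parameter for files that must be compiled before the rest.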

    <plugin>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-maven-plugin</artifactId>
        <version>${avro.version}</version>
        <configuration>
            <stringType>String</stringType>
            <sourceDirectory>${project.basedir}/src/main/avro/schema</sourceDirectory>
            <outputDirectory>${project.build.directory}/generated-sources/avro</outputDirectory>
            <includes>
                <include>**/Message.avsc</include>
                <include>**/UserDtoAvro.avsc</include>
                <include>**/Metadata.avsc</include>
                <include>**/Template.avsc</include>
                <include>**/smsRequested.avsc</include>
            </includes>
        </configuration>
        <executions>
            <execution>
                <id>schema-to-java</id>
                <phase>generate-sources</phase>
                <goals>
                    <goal>schema</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
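Why the order matters can be sketched in a few lines of Python. This is a simplified model, not the Avro parser: it assumes schemas are parsed one file at a time and that a named type must already be known when a field references it. The `Student` schema, `full_name`, and `check_order` are hypothetical names introduced for illustration, and only flat records with primitive or named field types are handled.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def full_name(schema):
    """Return the fully qualified name ($namespace.$type) of a named schema."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def check_order(schemas):
    """Parse schemas in the given order and list references that cannot be resolved."""
    known = set()
    unresolved = []
    for schema in schemas:
        namespace = schema.get("namespace")
        record = full_name(schema)
        known.add(record)  # a record may refer to itself
        for field in schema["fields"]:
            ftype = field["type"]
            if not isinstance(ftype, str) or ftype in PRIMITIVES:
                continue  # nested and complex types are out of scope here
            # An unqualified reference inherits the namespace of the enclosing record.
            ref = ftype if "." in ftype or not namespace else f"{namespace}.{ftype}"
            if ref not in known:
                unresolved.append((record, field["name"], ref))
    return unresolved


# Hypothetical shared type; the source only shows the record that references it.
STUDENT = {
    "type": "record",
    "namespace": "data.add",
    "name": "Student",
    "fields": [{"name": "name", "type": "string"}],
}
ADDRESS = {
    "type": "record",
    "namespace": "data.add",
    "name": "Address",
    "fields": [{"name": "student", "type": "data.add.Student"}],
}

print(check_order([STUDENT, ADDRESS]))  # shared type first: nothing unresolved
print(check_order([ADDRESS, STUDENT]))  # reference comes before the definition
```

With the shared type first, every reference resolves. With the order reversed, `data.add.Student` is reported as unknown, which is the situation the plugin configuration is meant to avoid.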

Avro Protocol (.avpr)

  • JSON format

Avro Data (.avro)

  • Binary

Avro Serialization

An Avro data file (.avro) always stores the schema it was written with in its header, so a reader can decode the file without knowing the schema ahead of time. This allows us to perform serialization and deserialization without code generation.
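How the schema travels with the data can be sketched with the standard library alone. The snippet below assumes the object container file layout from the Avro specification: the magic bytes `Obj\x01`, a metadata map holding `avro.schema` and `avro.codec`, then a 16-byte sync marker. It builds only the header, no data blocks, and `write_header` and `read_schema` are hypothetical helper names rather than part of any Avro library.

```python
import json
import os

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def encode_long(n):
    """Avro long: zigzag mapping followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Return (value, next position) for the long starting at buf[pos]."""
    z = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7


def read_bytes(buf, pos):
    """Read a length-prefixed byte string; return (bytes, next position)."""
    length, pos = decode_long(buf, pos)
    return buf[pos:pos + length], pos + length


def write_header(schema, sync):
    """Build a container file header that embeds the schema as JSON."""
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"), "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # a single map block with all entries
    for key, value in meta.items():
        raw_key = key.encode("utf-8")
        out += encode_long(len(raw_key)) + raw_key
        out += encode_long(len(value)) + value
    out += encode_long(0)  # a zero count ends the map
    out += sync            # 16-byte sync marker
    return bytes(out)


def read_schema(buf):
    """Recover the writer's schema from the header, with no prior knowledge of it."""
    if buf[:4] != MAGIC:
        raise ValueError("not an Avro object container file")
    pos, meta = 4, {}
    while True:
        count, pos = decode_long(buf, pos)
        if count == 0:
            break
        if count < 0:  # a negative count is followed by the block size in bytes
            _, pos = decode_long(buf, pos)
            count = -count
        for _ in range(count):
            key, pos = read_bytes(buf, pos)
            meta[key.decode("utf-8")], pos = read_bytes(buf, pos)
    return json.loads(meta["avro.schema"])


USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "namespace": "my.types",
    "fields": [{"name": "name", "type": "string"}, {"name": "email", "type": "string"}],
}
header = write_header(USER_SCHEMA, os.urandom(16))
print(read_schema(header) == USER_SCHEMA)
```

This is roughly what `avro-tools getschema` relies on: the schema is read back from the file itself.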

Avro Serialization - Encoding

  • Binary

    • Default encoding
    • More compact and faster to encode and decode than JSON
    • Does not include field names, self-contained information about the types of individual bytes, nor field or record separators. Therefore readers are wholly reliant on the schema used when the data was encoded.
  • JSON

    • Human readable, easy for debugging
    • Useful for web applications
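The reliance on the schema in binary encoding can be made concrete with a small standard-library sketch. It assumes the encoding rules from the Avro specification: a long is a zigzag varint, a string is its UTF-8 length as a long followed by the bytes, and a record is its fields written back to back in schema order. Only string fields are handled, and `encode_record` and `decode_record` are hypothetical helpers, not the Avro library API.

```python
def encode_long(n):
    """Avro long: zigzag mapping followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Return (value, next position) for the long starting at buf[pos]."""
    z = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7


def encode_record(schema, datum):
    """Write field values in schema order; no names, types, or separators."""
    out = bytearray()
    for field in schema["fields"]:
        if field["type"] != "string":
            raise NotImplementedError("this sketch only handles string fields")
        raw = datum[field["name"]].encode("utf-8")
        out += encode_long(len(raw)) + raw
    return bytes(out)


def decode_record(schema, buf):
    """Decoding is only possible by walking the same schema field by field."""
    pos, datum = 0, {}
    for field in schema["fields"]:
        length, pos = decode_long(buf, pos)
        datum[field["name"]] = buf[pos:pos + length].decode("utf-8")
        pos += length
    return datum


USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "namespace": "my.types",
    "fields": [{"name": "name", "type": "string"}, {"name": "email", "type": "string"}],
}
user = {"name": "Ann", "email": "ann@example.com"}

encoded = encode_record(USER_SCHEMA, user)
print(encoded)  # b'\x06Ann\x1eann@example.com'
print(decode_record(USER_SCHEMA, encoded) == user)
```

The 20 encoded bytes contain only two length prefixes and the two values. Neither `name` nor `email` appears in the output, so a reader without the schema cannot tell where one field ends or what it means.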

Cheatsheet

Avro IDL -> Avro schema

avro-tools idl2schemata ${protocol.avdl} .

Get Avro schema from Avro data

avro-tools getschema ${data.avro} > ${schema.avsc}

Avro schema + JSON data -> Avro data

avro-tools fromjson --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro schema + JSON data -> Avro data (with compression)

avro-tools fromjson --codec deflate --schema-file ${schema.avsc} ${data.json} > ${data.avro}

Avro data -> JSON data

avro-tools tojson --pretty ${data.avro} > ${data.json}

Java source code -> Avro schema

  • avro-tools induce

Avro schema -> Java source code

  • avro-tools compile

CLI

avro-tools

  • Installation

    # Homebrew
    brew install avro-tools
  • Usage

    ❯ avro-tools
      Version 1.11.3 of Apache Avro
      Copyright 2010-2015 The Apache Software Foundation
    
      This product includes software developed at
      The Apache Software Foundation (https://www.apache.org/).
      ----------------
      Available tools:
          canonical  Converts an Avro Schema to its canonical form
                cat  Extracts samples from files
            compile  Generates Java code for the given schema.
             concat  Concatenates avro files without re-compressing.
              count  Counts the records in avro files or folders
        fingerprint  Returns the fingerprint for the schemas.
         fragtojson  Renders a binary-encoded Avro datum as JSON.
           fromjson  Reads JSON records and writes an Avro data file.
           fromtext  Imports a text file into an avro data file.
            getmeta  Prints out the metadata of an Avro data file.
          getschema  Prints out schema of an Avro data file.
                idl  Generates a JSON schema from an Avro IDL file
       idl2schemata  Extract JSON schemata of the types from an Avro IDL file
             induce  Induce schema/protocol from Java class/interface via reflection.
         jsontofrag  Renders a JSON-encoded Avro datum as binary.
             random  Creates a file with randomly generated instances of a schema.
            recodec  Alters the codec of a data file.
             repair  Recovers data from a corrupt Avro Data file
        rpcprotocol  Output the protocol of a RPC service
         rpcreceive  Opens an RPC Server and listens for one message.
            rpcsend  Sends a single RPC message.
             tether  Run a tethered mapreduce job.
             tojson  Dumps an Avro data file as JSON, record per line or pretty.
             totext  Converts an Avro data file to a text file.
           totrevni  Converts an Avro data file to a Trevni file.
        trevni_meta  Dumps a Trevni file's metadata as JSON.
      trevni_random  Create a Trevni file filled with random instances of a schema.
      trevni_tojson  Dumps a Trevni file as JSON.
  • Resources

Article

Performance

Code