One of the first techniques taught in a programming languages course is how to inductively express a language using context-free grammars. Although this technique is familiar to many of us with a computer science background, I have found this technique useful for enumerating all of the possible types in a type system, e.g. Avro.

Avro is a serialization system that is common in the Hadoop ecosystem. If you have written MapReduce jobs or Spark jobs, there is a high likelihood that your data is serialized according to the Avro specification. Avro specifies how data should be serialized as binary data. In other words, it specifies a type system. With the existence of user-defined types, the set of all concrete Avro types is infinite in size! Is there a more tangible, finite representation of this set?

We explore the answer to this question by attempting to represent this set in a finite form using context-free grammars.

Avro specification

To begin our quest, we refer to the current [Schema Declaration][1] section of the Avro Specification for version 1.8.1.

A schema is a specification of how the data is formatted.

The schema declaration section is straightforward to read; however, it is not explicit about the the exhaustive set of types that are possible in Avro. To construct an inductive representation of types, we distill this specification into a context-free grammar that describes the types as they are specified.

EBNF grammar notation

For notation, we use a variant of EBNF specified by W3. To summarize our use of the notation:

  • NONTERMINAL symbols are uppercase.
  • terminal symbols are lowercase.
  • = declares a production.
  • | declares a choice in a production.
  • * declares that the preceding symbol may appear zero or more times.
  • (x) declares a grouping of the metavariable x.

As a simple example, the following grammar describes numeric values with a single literal 0.

NUM_VALUE
  = 0
  | succ NUM_VALUE

Using this notation, we may now proceed to describe our context-free grammar for Avro types.

Grammar for Avro types

Our goal is to recognize every possible concrete Avro type with this context-free grammar.

First, let AVRO be the start symbol that describes any Avro type. There are two classes of types according to the specification: primitive types and complex types. Let PRIMITIVE be the nonterminal that describes the set of primitive types and let COMPLEX be the nonterminal that describes the set of complex types. Our current grammar is as follows.

AVRO
  = PRIMITIVE
  | COMPLEX

From here, we enumerate the primitive and complex type classes.

Primitive types

Starting with the primitive type class, we know that a primitive type must be one of null, boolean, int, long, float, double, bytes, or string. We represent these concrete types as terminals.

PRIMITIVE
  = null
  | boolean
  | int
  | long
  | float
  | double
  | bytes
  | string

The class of primitive types is trivial because it only contains conrete types. The class of complex types, however, contains parametric types, i.e. they refer to other Avro types.

Complex types

There are six complex types: record, enum, array, map, union, and fixed. Of these, all but the fixed type are parametric. To represent the parametric types, we use a nonterminal in place of the type parameter. For example, we use the notation enum of AVRO where enum is a terminal and AVRO is a nonterminal representing any Avro type. We also use curly brace literals to avoid ambiguity in our grammar.

COMPLEX
  = record of { (AVRO,)* AVRO }
  | enum of AVRO
  | array of AVRO
  | map of AVRO
  | union of { (AVRO,)* AVRO }
  | fixed

Note that only the record and union types may be parameterized by multiple Avro types.

From here, we have inductively defined all conrete Avro types. The set of concrete Avro types is infinite, but the grammar is finite.

Example

As an example, consider a union of a string and an array of integers. This would be represented as the following expression in our grammar:

union of { string, array of int }

Consequences

There are a few interesting consequences of describing the set of Avro types with a context-free grammar. The immediate consequence is that any Avro type can be recognized by a Push Down Automata. We can also use the pumping lemma to show that the set of Avro types cannot be described by a regular grammar, but this is beyond the scope of this post.

Applications

You may be curious why we would even bother to represent all Avro types with a context-free grammar. If you design serialization libraries, e.g. Avro.jl, then this may be useful to designing schema parsers. After all, context-free grammars are typically used in the context of parsing.