# Express all Avro types with a context-free grammar

One of the first techniques taught in a programming languages course is how to inductively express a language using context-free grammars. Although this technique is familiar to many of us with a computer science background, I have found this technique useful for enumerating all of the possible types in a type system, e.g. Avro.

Avro is a serialization system that is common in the Hadoop ecosystem. If you have written MapReduce jobs or Spark jobs, there is a high likelihood that your data is serialized according to the Avro specification. Avro specifies how data should be serialized as binary data. In other words, it specifies a type system. With the existence of user-defined types, the set of all concrete Avro types is infinite in size! Is there a more tangible, finite representation of this set?

We explore the answer to this question by attempting to represent this set in a finite form using context-free grammars.

## Avro specification

To begin our quest, we refer to the current [Schema Declaration][1] section of the Avro Specification for version 1.8.1.

A

schemais a specification of how the data is formatted.

The schema declaration section is straightforward to read; however, it is not explicit about the the exhaustive set of types that are possible in Avro. To construct an inductive representation of types, we distill this specification into a context-free grammar that describes the types as they are specified.

### EBNF grammar notation

For notation, we use a variant of EBNF specified by W3. To summarize our use of the notation:

`NONTERMINAL`

symbols are uppercase.`terminal`

symbols are lowercase.`=`

declares a production.`|`

declares a choice in a production.`*`

declares that the preceding symbol may appear zero or more times.`(x)`

declares a grouping of the metavariable`x`

.

As a simple example, the following grammar describes numeric values with a
single literal `0`

.

```
NUM_VALUE
= 0
| succ NUM_VALUE
```

Using this notation, we may now proceed to describe our context-free grammar for Avro types.

## Grammar for Avro types

Our goal is to recognize every possible concrete Avro type with this context-free grammar.

First, let `AVRO`

be the start symbol that describes any Avro type. There are
two classes of types according to the specification: *primitive types* and
*complex types*. Let `PRIMITIVE`

be the nonterminal that describes the set of
primitive types and let `COMPLEX`

be the nonterminal that describes the set of
complex types. Our current grammar is as follows.

```
AVRO
= PRIMITIVE
| COMPLEX
```

From here, we enumerate the primitive and complex type classes.

### Primitive types

Starting with the primitive type class, we know that a primitive type must be
one of `null`

, `boolean`

, `int`

, `long`

, `float`

, `double`

, `bytes`

, or
`string`

. We represent these concrete types as terminals.

```
PRIMITIVE
= null
| boolean
| int
| long
| float
| double
| bytes
| string
```

The class of primitive types is trivial because it only contains conrete types. The class of complex types, however, contains parametric types, i.e. they refer to other Avro types.

### Complex types

There are six complex types: `record`

, `enum`

, `array`

, `map`

, `union`

, and
`fixed`

. Of these, all but the `fixed`

type are parametric. To represent the
parametric types, we use a nonterminal in place of the type parameter. For
example, we use the notation `enum of AVRO`

where `enum`

is a terminal
and `AVRO`

is a nonterminal representing any Avro type. We also use curly brace
literals to avoid ambiguity in our grammar.

```
COMPLEX
= record of { (AVRO,)* AVRO }
| enum of AVRO
| array of AVRO
| map of AVRO
| union of { (AVRO,)* AVRO }
| fixed
```

Note that only the `record`

and `union`

types may be parameterized by multiple
Avro types.

From here, we have inductively defined all conrete Avro types. The set of concrete Avro types is infinite, but the grammar is finite.

### Example

As an example, consider a union of a string and an array of integers. This would be represented as the following expression in our grammar:

```
union of { string, array of int }
```

## Consequences

There are a few interesting consequences of describing the set of Avro types with a context-free grammar. The immediate consequence is that any Avro type can be recognized by a Push Down Automata. We can also use the pumping lemma to show that the set of Avro types cannot be described by a regular grammar, but this is beyond the scope of this post.

### Applications

You may be curious why we would even bother to represent all Avro types with a context-free grammar. If you design serialization libraries, e.g. Avro.jl, then this may be useful to designing schema parsers. After all, context-free grammars are typically used in the context of parsing.