Writing a Lexer in Java 1.7 using Regex Named Capturing Groups
One of my favorite features in the new Java 1.7, aside from the try-with-resources statement, is named capturing groups in the regular expression API. Although capturing groups can be referenced numerically in the order in which they are declared from left to right, named capturing groups make the code more intuitive, as I will demonstrate by constructing a lexer.
Lexer, a Definition
To describe lexers, we must first describe tokenizers. A tokenizer simply breaks a string up into a set of tokens, which are, of course, more strings. A lexer is a type of tokenizer that attaches context to each token, such as the type of token extracted, e.g. `(NUMBER 1234)`, whereas a simple token would be `1234`. Lexers are important for parsing languages; however, that is a discussion beyond the scope of this tutorial.
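For contrast, here is a minimal sketch of a bare-bones tokenizer that attaches no context at all, using nothing but `String.split()` (the class name is just for illustration):

```java
public class SplitTokenizer {
    public static void main(String[] args) {
        // Split on runs of whitespace; each token is a plain, untyped string.
        for (String token : "11 + 22 - 33".split("\\s+"))
            System.out.println(token); // prints 11, +, 22, -, 33
    }
}
```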
For example, given the input `"11 + 22 - 33"`, we should receive the following tokens from a lexer:

```
(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)
```
Note that `BINARYOP` refers to a binary operator, which is any operator that accepts two arguments. The archetypal example is addition, which accepts two numbers, one to the left and the other to the right of the operator.
Setting Up the Program
The input is a sentence to be scanned. For this tutorial, we will scan a simple arithmetic grammar that includes addition, subtraction, multiplication, and division. We will lex the following input:
```java
public class Lexer {
    public static void main(String[] args) {
        String input = "11 + 22 - 33";
    }
}
```
Next, we must define the types of tokens that we are extracting and the regular expressions that they match; a quick sanity check of the patterns follows the list.
- Number
  - `-?[0-9]+` Matches any integer, optionally negative, with no decimal point.
- Binary Operator
  - `[*/+-]` Matches the standard arithmetic operators. (Inside a character class, `|` is a literal pipe rather than alternation, so a pattern like `[*|/|+|-]` would wrongly accept `|` as an operator.)
- Whitespace
  - `[ \t\f\r\n]+` Matches a sequence of spaces, tabs, form feeds, carriage returns, or newlines. These will be skipped.
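Before wiring these patterns into the lexer, here is a quick standalone sketch to sanity-check them (the class name is illustrative):

```java
public class PatternCheck {
    public static void main(String[] args) {
        // String.matches() tests whether the entire string matches the pattern.
        System.out.println("-42".matches("-?[0-9]+"));     // true
        System.out.println("+".matches("[*/+-]"));         // true
        System.out.println(" \t".matches("[ \t\f\r\n]+")); // true
        System.out.println("1.5".matches("-?[0-9]+"));     // false: no decimals
    }
}
```

With the patterns in hand, we can encode each token type and its pattern as an enum: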
```java
public class Lexer {
    public static enum TokenType {
        // Token types cannot have underscores: group names must be alphanumeric
        NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public static void main(String[] args) {
        String input = "11 + 22 - 33";
    }
}
```
Enumerations in Java can only have `private` constructors because only a finite set of instances is created at run-time. Consequently, their data fields are frequently declared as `final`.
Finally, we declare a data structure for holding the token data. Additionally, I will override the `toString()` method for printing out each token's contextual data at the end of this tutorial, in the format mentioned earlier: `(<TYPE> <DATA>)`.
```java
public class Lexer {
    public static enum TokenType {
        // Token types cannot have underscores: group names must be alphanumeric
        NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public static class Token {
        public TokenType type;
        public String data;

        public Token(TokenType type, String data) {
            this.type = type;
            this.data = data;
        }

        @Override
        public String toString() {
            return String.format("(%s %s)", type.name(), data);
        }
    }

    public static void main(String[] args) {
        String input = "11 + 22 - 33";
    }
}
```
Now that we have our input, token types, and a data structure for tokens, we may begin the lexical analysis that turns the input string into a list of tokens, each tagged with its token type.
Lexical Analysis with Regular Expressions
We begin by framing our lexical analysis method as `lex()`, a function which returns a list of `Token` objects. Additionally, we will need to import `ArrayList` in order to store the `Token` objects in the list.
```java
import java.util.ArrayList;

public class Lexer {
    public static enum TokenType {
        // Token types cannot have underscores: group names must be alphanumeric
        NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public static class Token {
        public TokenType type;
        public String data;

        public Token(TokenType type, String data) {
            this.type = type;
            this.data = data;
        }

        @Override
        public String toString() {
            return String.format("(%s %s)", type.name(), data);
        }
    }

    public static ArrayList<Token> lex(String input) {
        // The tokens to return
        ArrayList<Token> tokens = new ArrayList<Token>();

        // Lexer logic begins here
        return tokens;
    }

    public static void main(String[] args) {
        String input = "11 + 22 - 33";

        // Create tokens and print them
        ArrayList<Token> tokens = lex(input);
        for (Token token : tokens)
            System.out.println(token);
    }
}
```
Now, we need to combine the regular expression patterns for all of the token types into a single pattern, as shown in the method below. This is where named capturing groups come in: each token type's pattern is wrapped as `(?<TYPE>PATTERN)`, so that once a match is found, we can retrieve the matched text by its group name, the `TYPE`.
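As a quick standalone illustration of the syntax (the group name here is arbitrary):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    public static void main(String[] args) {
        // One named group: the match is retrieved by name instead of by index.
        Pattern pattern = Pattern.compile("(?<DIGITS>[0-9]+)");
        Matcher matcher = pattern.matcher("abc 42 def");
        if (matcher.find())
            System.out.println(matcher.group("DIGITS")); // prints 42
    }
}
```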
Additionally, we import the `Pattern` class to compile regular expression patterns.
```java
import java.util.regex.Pattern;

public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    // Lexer logic begins here
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values())
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

    return tokens;
}
```
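For our three token types, the assembled pattern (after the leading `|` is stripped by `substring(1)`) looks like this:

```
(?<NUMBER>-?[0-9]+)|(?<BINARYOP>[*/+-])|(?<WHITESPACE>[ \t\f\r\n]+)
```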
Next, we begin tokenizing by creating a `Matcher` object from the compiled pattern, `tokenPatterns`, from earlier. The matcher will find each substring that matches one of the token type patterns. Note that we must also import the `Matcher` class here.

For each match, we check the token types in turn and ask whether that type's group was matched. If it was, we add a token with the corresponding token type to our list and continue scanning the input.
```java
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    // Lexer logic begins here
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values())
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

    // Begin matching tokens
    Matcher matcher = tokenPatterns.matcher(input);
    while (matcher.find()) {
        if (matcher.group(TokenType.NUMBER.name()) != null) {
            tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
            continue;
        } else if (matcher.group(TokenType.BINARYOP.name()) != null) {
            tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
            continue;
        } else if (matcher.group(TokenType.WHITESPACE.name()) != null) {
            continue;
        }
    }

    return tokens;
}
```
And the algorithm is complete! The magic of named capturing groups happens as we check each of the token types. Note that instead of referencing groups by their numerical index, e.g. `matcher.group(1)`, we use the actual name, which is far more intuitive and much easier to maintain.
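Since every branch of that `if`/`else` chain has the same shape, the check can also be written as a loop over the enum values. Here is a sketch of that alternative, behaviorally equivalent for these token types:

```java
// Inside lex(), replacing the if/else chain:
while (matcher.find()) {
    for (TokenType tokenType : TokenType.values()) {
        String data = matcher.group(tokenType.name());
        if (data != null) {
            // Skip whitespace; record every other token type.
            if (tokenType != TokenType.WHITESPACE)
                tokens.add(new Token(tokenType, data));
            break;
        }
    }
}
```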
Here is the complete source code:

```java
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Lexer {
    public static enum TokenType {
        // Token types cannot have underscores: group names must be alphanumeric
        NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public static class Token {
        public TokenType type;
        public String data;

        public Token(TokenType type, String data) {
            this.type = type;
            this.data = data;
        }

        @Override
        public String toString() {
            return String.format("(%s %s)", type.name(), data);
        }
    }

    public static ArrayList<Token> lex(String input) {
        // The tokens to return
        ArrayList<Token> tokens = new ArrayList<Token>();

        // Lexer logic begins here
        StringBuffer tokenPatternsBuffer = new StringBuffer();
        for (TokenType tokenType : TokenType.values())
            tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
        Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

        // Begin matching tokens
        Matcher matcher = tokenPatterns.matcher(input);
        while (matcher.find()) {
            if (matcher.group(TokenType.NUMBER.name()) != null) {
                tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
                continue;
            } else if (matcher.group(TokenType.BINARYOP.name()) != null) {
                tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
                continue;
            } else if (matcher.group(TokenType.WHITESPACE.name()) != null) {
                continue;
            }
        }

        return tokens;
    }

    public static void main(String[] args) {
        String input = "11 + 22 - 33";

        // Create tokens and print them
        ArrayList<Token> tokens = lex(input);
        for (Token token : tokens)
            System.out.println(token);
    }
}
```
Running the Algorithm
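Assuming the class is saved as `Lexer.java`, it can be compiled and run from the command line:

```
javac Lexer.java
java Lexer
```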
For completeness, when we run the program, we should receive the following output:

```
(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)
```
Conclusion
Although lexical analysis is doable without them, named capturing groups in regular expressions certainly make the code more intuitive and easier to maintain. It's also nice that Java is beginning to provide features that act as syntactic sugar.