Introduction to Recursive Descent Parsers with C#

Parser? Aren’t parsers utterly boring?

Well no, quite the opposite. Lately i have been solving some of the programming challenges on talentbuddy and bumped into task to create parser and solver for simple arithmetic expressions in string format, something like this: “(2+2)-(3-(6-5))-4″.

On first thought this seems trivial, but only until the moment you start implementing it. If you are doing it the wrong way, for example by using regular expressions, you can bump into some interesting problems, and also develop a solution that works for most cases but then fails on edge cases etc.

The proper way

Fortunately there are proper ways to do this, and one of them is building a Recursive Descent Parser. As the name implies, this parser will use a Top-Down approach and start breaking the expression into smaller pieces but in recursive way. In order to be able to build a proper parser we first need to define a grammar of the expressions we want to parse.

For defining grammars we can use Backus-Naur Form or Extended Backus-Naur Form.

If we manage to define the grammar properly, then implementing the parser is almost a mechanical process, and this is what i really like about this approach.

In this post we will build very simple Recursive Descent Parser for basic aritmetic operations of addition and subtraction of integers. Later we will tackle some more interesting scenarios, but for this post we want to keep things simple.

So the examples of expressions we want to parse could be:

  • “1”
  • “1+100″
  • “1+2+3+44″
  • “100-99+1″ etc

Defining the grammar

So how would our grammar in EBNF would look like?

Expression := Number {Operator Number}
Operator   := "+" | "-"
Number     := Digit{Digit}
Digit      := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Grammar above tells us that we are defining expressions that can contain numbers and operators.

Each Expression can contain a single integer Number or this number can be followed by optional set of Operator and then another Number must follow. Note the curly brackets around the {Operator Number} which means that this part can repeat zero or more times.

Then we continue and define what Operator could be (+ or -) and then we define that Number is just a Digit followed by zero or more Digits (again we use curly braces to define optional repetition of digits in integer number).

At the end we define that Digit is number from 0 to 9.

With this we defined our grammar and its vocabulary – and creating a parser should be simple, right? Well almost :)

Tokens and Tokenizers

First there is another small issue we need to handle here. Our input expressions will be in string format, and we need to parse them into something that our Parser will understand in order to abstract the raw character processing – we don’t want to string processing inside of Parser this would be bad practice.

This is where idea of Tokens comes into play. We will define tokens for each part of our grammar. We can have abstract Token class, and other tokens will inherit it.

Then we will have base OperatorToken that will be inherited by PlusToken and MinusToken, and finally we will have NumberConstantToken for our numbers.

Token classes will just be a simple marker classes with almost no functionality inside except for the NumberConstantToken which has to hold the actual value.

 public abstract class Token
    {
    }

    public class OperatorToken : Token
    {

    }
    public class PlusToken : OperatorToken
    {
    }

    public class MinusToken : OperatorToken
    {
    }

    public class NumberConstantToken : Token
    {
        private readonly int _value;

        public NumberConstantToken(int value)
        {
            _value = value;
        }

        public int Value
        {
            get { return _value; }
        }
    }

Now that we defined our Tokens, lets build a PlusMinusTokenizer class that will transform the raw string of the expression into a sequence of Tokens that our Parser can then elegantly work with:

    public class PlusMinusTokenizer
    {
        private StringReader _reader;

        public IEnumerable<Token> Scan(string expression)
        {
            _reader = new StringReader(expression);

            var tokens = new List<Token>();
            while (_reader.Peek() != -1)
            {
                var c = (char) _reader.Peek();
                if (Char.IsWhiteSpace(c))
                {
                    _reader.Read();
                    continue;
                }

                if (Char.IsDigit(c))
                {
                    var nr = ParseNumber();
                    tokens.Add(new NumberConstantToken(nr));
                }
                else if (c == '-')
                {
                    tokens.Add(new MinusToken());
                    _reader.Read();
                }
                else if (c == '+')
                {
                    tokens.Add(new PlusToken());
                    _reader.Read();
                }
                else
                    throw new Exception("Unknown character in expression: " + c);
            }

            return tokens;
        }

        private int ParseNumber()
        {
            var digits = new List<int>();
            while (Char.IsDigit((char) _reader.Peek()))
            {
                var digit = (char) _reader.Read();
                int i;
                if (int.TryParse(Char.ToString(digit), out i))
                {
                    digits.Add(i);
                }
                else
                    throw new Exception("Could not parse integer number when parsing digit: " + digit);
            }

            var nr = 0;
            var mul = 1;
            digits.Reverse();
            digits.ForEach(d =>
            {
                nr += d*mul;
                mul *= 10;
            });

            return nr;
        }
    }

As you can see code is very simple and for given string expression it returns IEnumerable<Token>. There is also a part there that converts sequence of digits into into integer but we all know that so lets move on.

Parsing is Fun!

Now we finally come to the interesting part. Our parser should start from the top of our grammar and parse Expression, then inside of it parse the Number, then Operator then again Number etc.

So lets first examine the ParseExpression method:

        private int ParseExpression()
        {
            var number = ParseNumber();
            if (!_tokens.MoveNext())
                return number;

            if (_tokens.Current is OperatorToken)
            {
                var op = _tokens.Current;
                _tokens.MoveNext();

                var secondNumber = ParseNumber();
                if (op is PlusToken)
                {
                    return number + secondNumber;
                }
                if (op is MinusToken)
                    return number - secondNumber;

                throw new Exception("Unsupported operator: " + op);
            }

            throw new Exception("Expecting operator after number, but got " + _tokens.Current);
        }

Its barely 30 lines of code that first calls our ParseNumber method to get first number of the expression, and then it checks if next Token is OperatorToken, and if so, it calls the ParseNumber again to get second number and then based on the type of the Operator performs the addition or subtraction and returns the resulting integer.

Note that Exception is thrown if some unexpected token is encountered, so that we can know when we have invalid expression. We could have also done this inside of the Tokenizer but i wanted to have this logic just in one place.

Now let’s see the ParseNumber method:

        private int ParseNumber()
        {
            if (_tokens.Current is NumberConstantToken)
            {
                return (_tokens.Current as NumberConstantToken).Value;
            }

            throw new Exception("Expected a number constant but found " + _tokens.Current);
        }

As you see this method just returns the value of the number from our Token.

I told you this is gonna be simple :)

OK lets now see the main method of the Parser that uses the methods explained above:

        public int Parse()
        {
            var result = 0;
            while (_tokens.MoveNext())
            {
                result = ParseExpression();

                if (_tokens.MoveNext())
                {
                    if (_tokens.Current is OperatorToken)
                    {
                        var op = _tokens.Current;

                        var secondNumber = Parse();
                        if (op is PlusToken)
                        {
                            return result + secondNumber;
                        }
                        if (op is MinusToken)
                            return result - secondNumber;

                        throw new Exception("Unsupported operator: " + op);

                    }
                    throw new Exception("Expecting operator after expression, but got " + _tokens.Current);
                }
            }

            return result;
        }

This is where it all happens, the Parse method calls ParseExpression to get first Expression, and then – if there are more tokens and first one is OperatorToken, we call recursively again Parse and therefore parse next expression and get its result, and then we do the operation based on the Operator, and then inside of that call if needed we call Parse again recursively, and then the next one, and way we go down the rabbit hole…

I told you this was fun :)

Tests

Whole solution is built with Test-First because the idea of Parsers and Tokenizers fit very well with unit testing. If you are interested in this check out the project PlusMinusParser.Tests where there are tests for both the Tokenizer and the Parser. All of them pass an input expressions and expect certain results – integer as solution of the expression (for Parser) or list of Tokens (for Tokenizer).

Here are some tests for Parser (Note that im using NUnit’s TestCase attribute):

    [TestFixture]
    public class PlusMinusParserUnitTests
    {
        [TestCase("1", ExpectedResult = 1)]
        [TestCase("2", ExpectedResult = 2)]
        [TestCase("1+1", ExpectedResult = 2)]
        [TestCase("1+1+1", ExpectedResult = 3)]
        [TestCase("1-1", ExpectedResult = 0)]
        [TestCase("0-0", ExpectedResult = 0)]
        [TestCase("0-1", ExpectedResult = -1)]
        [TestCase("1+2+3+4+5+6", ExpectedResult = 21)]
        [TestCase("100-100", ExpectedResult = 0)]
        [TestCase("100-100+1", ExpectedResult = 1)]
        [TestCase("100-100+100", ExpectedResult = 100)]
        [TestCase("1 2", ExpectedException = typeof(Exception))]
        [TestCase("+", ExpectedException = typeof(Exception))]
        [TestCase("++", ExpectedException = typeof(Exception))]
        [TestCase("--", ExpectedException = typeof(Exception))]
        [TestCase("+-", ExpectedException = typeof(Exception))]
        [TestCase("-+", ExpectedException = typeof(Exception))]
        [TestCase("-", ExpectedException = typeof(Exception))]
        [TestCase("1-1 2", ExpectedException = typeof(Exception))]
        [TestCase("10-100", ExpectedResult = -90)]
        public int CanParseExpressions(string expression)
        {
            var tokens = new PlusMinusTokenizer().Scan(expression);
            var parser = new PlusMinusParser(tokens);
            var res = parser.Parse();
            return res;
        }

Where next?

Well now that we have a solid basic understanding on how to build simple parsers, we can go wild and invent different grammars and build parsers for them. Having this in your tool-belt can be very handy when you encounter tasks that need to parse text.

In one of the next posts i will implement some more complex parsers so stay tuned…

If you want to see the full source code check out my PlusMinusParser repository on GitHub.

3 thoughts on “Introduction to Recursive Descent Parsers with C#

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>