Tag Archives: EBNF

Recursive Descent Parser with C# – Boolean logic expressions

In previous post we gave brief introduction on Recursive Descent Parsers and we implemented parser that was able to parse and calculate simple arithmetic expressions with addition and subtraction.

To be (True) or !(To be True)?

This time we will try to tackle little bit more complex example that will parse and evaluate Boolean logic expressions that will include negation and parenthesis.

Examples of expressions we want to be able to parse and evaluate are:

  • True And True And False
  • True
  • !False
  • (!(False)) and (!(True) etc

Let’s assemble a EBNF grammar for this type of expressions:

Expression      := [ "!" ] <Boolean> { BooleanOperator Boolean }
Boolean         := BooleanConstant | Expression | "(" <Expression> ")"
BooleanOperator := "And" | "Or" 
BooleanConstant := "True" | "False"

You can see that our Terminal Symbols are “And”, “Or” (BooleanOperator) and “True”, “False” (BooleanConstant) and off course  “!” and parenthesis.

Expression can have optional negation symbol “!” and then Boolean (which can be BooleanConstant or Expression or Expression in parenthesis).

Every next Boolean expression is optional but if its there, it must be preceded by BooleanOperator so that we can parse the final value by combining it with previous Boolean value. Obviously we will have some recursion there, but more on that later when we start implementing the parser.

Always Tokenize everything!

Before looking into the parser, we have to implement the Tokenizer class that will parse the raw text of the expression, tokenize it and return IEnumerable<Token> so that our parser can have less to worry about.

Here is the Tokenizer class:

    public class Tokenizer
    {
        private readonly StringReader _reader;
        private string _text;

        public Tokenizer(string text)
        {
            _text = text;
            _reader = new StringReader(text);
        }

        public IEnumerable<Token> Tokenize()
        {
            var tokens = new List<Token>();
            while (_reader.Peek() != -1)
            {
                while (Char.IsWhiteSpace((char) _reader.Peek()))
                {
                    _reader.Read();
                }

                if (_reader.Peek() == -1)
                    break;

                var c = (char) _reader.Peek();
                switch (c)
                {
                    case '!':
                        tokens.Add(new NegationToken());
                        _reader.Read();
                        break;
                    case '(':
                        tokens.Add(new OpenParenthesisToken());
                        _reader.Read();
                        break;
                    case ')':
                        tokens.Add(new ClosedParenthesisToken());
                        _reader.Read();
                        break;
                    default:
                        if (Char.IsLetter(c))
                        {
                            var token = ParseKeyword();
                            tokens.Add(token);
                        }
                        else
                        {
                            var remainingText = _reader.ReadToEnd() ?? string.Empty;
                            throw new Exception(string.Format("Unknown grammar found at position {0} : '{1}'", _text.Length - remainingText.Length, remainingText));
                        }
                        break;
                }
            }
            return tokens;
        }

        private Token ParseKeyword()
        {
            var text = new StringBuilder();
            while (Char.IsLetter((char) _reader.Peek()))
            {
                text.Append((char) _reader.Read());
            }

            var potentialKeyword = text.ToString().ToLower();

            switch (potentialKeyword)
            {
                case "true":
                    return new TrueToken();
                case "false":
                    return new FalseToken();
                case "and":
                    return new AndToken();
                case "or":
                    return new OrToken();
                default:
                    throw new Exception("Expected keyword (True, False, And, Or) but found "+ potentialKeyword);
            }
        }
    }

Not much happening there really, we just go through the characters of the expression, and if its negation or parenthesis we return proper sub classes of Token and if we detect letters we try to parse one of our keywords (“True”, “False”, “And”, “Or”).
If we encounter unknown keyword we throw exception to be on the safe side. I deliberately did not do much validation of the expression in this class since this is done later in the Parser – but nothing would stop us from doing it here also – i will leave that exercise to the reader.

The Parser

Inside of our parser we have main Parse method that will start the process of parsing the tokens, handle the negation, and continue parsing sub-expressions while it encounters one of the OperandTokens (AndToken or OrToken).

        public bool Parse()
        {
            while (_tokens.Current != null)
            {
                var isNegated = _tokens.Current is NegationToken;
                if (isNegated)
                    _tokens.MoveNext();

                var boolean = ParseBoolean();
                if (isNegated)
                    boolean = !boolean;

                while (_tokens.Current is OperandToken)
                {
                    var operand = _tokens.Current;
                    if (!_tokens.MoveNext())
                    {
                        throw new Exception("Missing expression after operand");
                    }
                    var nextBoolean = ParseBoolean();

                    if (operand is AndToken)
                        boolean = boolean && nextBoolean;
                    else
                        boolean = boolean || nextBoolean;

                }

                return boolean;
            }

            throw new Exception("Empty expression");
        }

Parsing of the sub-expressions is handled in the ParseBoolean method:

       private bool ParseBoolean()
        {
            if (_tokens.Current is BooleanValueToken)
            {
                var current = _tokens.Current;
                _tokens.MoveNext();

                if (current is TrueToken)
                    return true;

                return false;
            }
            if (_tokens.Current is OpenParenthesisToken)
            {
                _tokens.MoveNext();

                var expInPars = Parse();

                if (!(_tokens.Current is ClosedParenthesisToken))
                    throw new Exception("Expecting Closing Parenthesis");

                _tokens.MoveNext(); 

                return expInPars;
            }
            if (_tokens.Current is ClosedParenthesisToken)
                throw new Exception("Unexpected Closed Parenthesis");

            // since its not a BooleanConstant or Expression in parenthesis, it must be a expression again
            var val = Parse();
            return val;
        }

This method tries to parse the simplest BooleanValueToken, then if it encounter OpenParenthesisToken it handles the Expressions in parenthesis by skipping the OpenParenthesisToken and then calling back the Parse to get the value of expressions and then again skipping the ClosedParenthesisToken once parsing of inner expression is done.

If it does not find BooleanValueToken or OpenParenthesisToken – method simply assumes that what follows is again an expression so it calls back Parse method to start the process of parsing again.

To be logical is to be simple

As you see, we implemented the parser in less then 90 lines of C# code. Maybe this code is not particularity useful but its good exercise on how to build parsing logic recursively.

It could be further improved by adding more logic to throw exceptions when unexpected Tokens are encountered but again – i leave that to the reader (for example expression like “true)” should throw exception, but in this version of code it will not do that, it will still parse the expression correctly by ignoring the closing parenthesis).

Tests

Here are some of the Unit Tests i built to test the parser:

        [TestCase("true", ExpectedResult = true)]
        [TestCase(")", ExpectedException = (typeof(Exception)))]
        [TestCase("az", ExpectedException = (typeof(Exception)))]
        [TestCase("", ExpectedException = (typeof(Exception)))]
        [TestCase("()", ExpectedException = typeof(Exception))]
        [TestCase("true and", ExpectedException = typeof(Exception))]
        [TestCase("false", ExpectedResult = false)]
        [TestCase("true ", ExpectedResult = true)]
        [TestCase("false ", ExpectedResult = false)]
        [TestCase(" true", ExpectedResult = true)]
        [TestCase(" false", ExpectedResult = false)]
        [TestCase(" true ", ExpectedResult = true)]
        [TestCase(" false ", ExpectedResult = false)]
        [TestCase("(false)", ExpectedResult = false)]
        [TestCase("(true)", ExpectedResult = true)]
        [TestCase("true and false", ExpectedResult = false)]
        [TestCase("false and true", ExpectedResult = false)]
        [TestCase("false and false", ExpectedResult = false)]
        [TestCase("true and true", ExpectedResult = true)]
        [TestCase("!true", ExpectedResult = false)]
        [TestCase("!(true)", ExpectedResult = false)]
        [TestCase("!(true", ExpectedException = typeof(Exception))]
        [TestCase("!(!(true))", ExpectedResult = true)]
        [TestCase("!false", ExpectedResult = true)]
        [TestCase("!(false)", ExpectedResult = true)]
        [TestCase("(!(false)) and (!(true))", ExpectedResult = false)]
        [TestCase("!((!(false)) and (!(true)))", ExpectedResult = true)]
        [TestCase("!false and !true", ExpectedResult = false)]
        [TestCase("false and true and true", ExpectedResult = false)]
        [TestCase("false or true or false", ExpectedResult = true)]
        public bool CanParseSingleToken(string expression)
        {
            var tokens = new Tokenizer(expression).Tokenize();
            var parser = new Parser(tokens);
            return parser.Parse();
        }

Full source code of the whole solution is available at my BooleanLogicExpressionParser GitHub repo.

Stay tuned because next time we will implement parser for more complex arithmetical expressions.

Introduction to Recursive Descent Parsers with C#

Parser? Aren’t parsers utterly boring?

Well no, quite the opposite. Lately i have been solving some of the programming challenges on talentbuddy and bumped into task to create parser and solver for simple arithmetic expressions in string format, something like this: “(2+2)-(3-(6-5))-4″.

On first thought this seems trivial, but only until the moment you start implementing it. If you are doing it the wrong way, for example by using regular expressions, you can bump into some interesting problems, and also develop a solution that works for most cases but then fails on edge cases etc.

The proper way

Fortunately there are proper ways to do this, and one of them is building a Recursive Descent Parser. As the name implies, this parser will use a Top-Down approach and start breaking the expression into smaller pieces but in recursive way. In order to be able to build a proper parser we first need to define a grammar of the expressions we want to parse.

For defining grammars we can use Backus-Naur Form or Extended Backus-Naur Form.

If we manage to define the grammar properly, then implementing the parser is almost a mechanical process, and this is what i really like about this approach.

In this post we will build very simple Recursive Descent Parser for basic aritmetic operations of addition and subtraction of integers. Later we will tackle some more interesting scenarios, but for this post we want to keep things simple.

So the examples of expressions we want to parse could be:

  • “1”
  • “1+100″
  • “1+2+3+44″
  • “100-99+1″ etc

Defining the grammar

So how would our grammar in EBNF would look like?

Expression := Number {Operator Number}
Operator   := "+" | "-"
Number     := Digit{Digit}
Digit      := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Grammar above tells us that we are defining expressions that can contain numbers and operators.

Each Expression can contain a single integer Number or this number can be followed by optional set of Operator and then another Number must follow. Note the curly brackets around the {Operator Number} which means that this part can repeat zero or more times.

Then we continue and define what Operator could be (+ or -) and then we define that Number is just a Digit followed by zero or more Digits (again we use curly braces to define optional repetition of digits in integer number).

At the end we define that Digit is number from 0 to 9.

With this we defined our grammar and its vocabulary – and creating a parser should be simple, right? Well almost :)

Tokens and Tokenizers

First there is another small issue we need to handle here. Our input expressions will be in string format, and we need to parse them into something that our Parser will understand in order to abstract the raw character processing – we don’t want to string processing inside of Parser this would be bad practice.

This is where idea of Tokens comes into play. We will define tokens for each part of our grammar. We can have abstract Token class, and other tokens will inherit it.

Then we will have base OperatorToken that will be inherited by PlusToken and MinusToken, and finally we will have NumberConstantToken for our numbers.

Token classes will just be a simple marker classes with almost no functionality inside except for the NumberConstantToken which has to hold the actual value.

 public abstract class Token
    {
    }

    public class OperatorToken : Token
    {

    }
    public class PlusToken : OperatorToken
    {
    }

    public class MinusToken : OperatorToken
    {
    }

    public class NumberConstantToken : Token
    {
        private readonly int _value;

        public NumberConstantToken(int value)
        {
            _value = value;
        }

        public int Value
        {
            get { return _value; }
        }
    }

Now that we defined our Tokens, lets build a PlusMinusTokenizer class that will transform the raw string of the expression into a sequence of Tokens that our Parser can then elegantly work with:

    public class PlusMinusTokenizer
    {
        private StringReader _reader;

        public IEnumerable<Token> Scan(string expression)
        {
            _reader = new StringReader(expression);

            var tokens = new List<Token>();
            while (_reader.Peek() != -1)
            {
                var c = (char) _reader.Peek();
                if (Char.IsWhiteSpace(c))
                {
                    _reader.Read();
                    continue;
                }

                if (Char.IsDigit(c))
                {
                    var nr = ParseNumber();
                    tokens.Add(new NumberConstantToken(nr));
                }
                else if (c == '-')
                {
                    tokens.Add(new MinusToken());
                    _reader.Read();
                }
                else if (c == '+')
                {
                    tokens.Add(new PlusToken());
                    _reader.Read();
                }
                else
                    throw new Exception("Unknown character in expression: " + c);
            }

            return tokens;
        }

        private int ParseNumber()
        {
            var digits = new List<int>();
            while (Char.IsDigit((char) _reader.Peek()))
            {
                var digit = (char) _reader.Read();
                int i;
                if (int.TryParse(Char.ToString(digit), out i))
                {
                    digits.Add(i);
                }
                else
                    throw new Exception("Could not parse integer number when parsing digit: " + digit);
            }

            var nr = 0;
            var mul = 1;
            digits.Reverse();
            digits.ForEach(d =>
            {
                nr += d*mul;
                mul *= 10;
            });

            return nr;
        }
    }

As you can see code is very simple and for given string expression it returns IEnumerable<Token>. There is also a part there that converts sequence of digits into into integer but we all know that so lets move on.

Parsing is Fun!

Now we finally come to the interesting part. Our parser should start from the top of our grammar and parse Expression, then inside of it parse the Number, then Operator then again Number etc.

So lets first examine the ParseExpression method:

        private int ParseExpression()
        {
            var number = ParseNumber();
            if (!_tokens.MoveNext())
                return number;

            if (_tokens.Current is OperatorToken)
            {
                var op = _tokens.Current;
                _tokens.MoveNext();

                var secondNumber = ParseNumber();
                if (op is PlusToken)
                {
                    return number + secondNumber;
                }
                if (op is MinusToken)
                    return number - secondNumber;

                throw new Exception("Unsupported operator: " + op);
            }

            throw new Exception("Expecting operator after number, but got " + _tokens.Current);
        }

Its barely 30 lines of code that first calls our ParseNumber method to get first number of the expression, and then it checks if next Token is OperatorToken, and if so, it calls the ParseNumber again to get second number and then based on the type of the Operator performs the addition or subtraction and returns the resulting integer.

Note that Exception is thrown if some unexpected token is encountered, so that we can know when we have invalid expression. We could have also done this inside of the Tokenizer but i wanted to have this logic just in one place.

Now let’s see the ParseNumber method:

        private int ParseNumber()
        {
            if (_tokens.Current is NumberConstantToken)
            {
                return (_tokens.Current as NumberConstantToken).Value;
            }

            throw new Exception("Expected a number constant but found " + _tokens.Current);
        }

As you see this method just returns the value of the number from our Token.

I told you this is gonna be simple :)

OK lets now see the main method of the Parser that uses the methods explained above:

        public int Parse()
        {
            var result = 0;
            while (_tokens.MoveNext())
            {
                result = ParseExpression();

                if (_tokens.MoveNext())
                {
                    if (_tokens.Current is OperatorToken)
                    {
                        var op = _tokens.Current;

                        var secondNumber = Parse();
                        if (op is PlusToken)
                        {
                            return result + secondNumber;
                        }
                        if (op is MinusToken)
                            return result - secondNumber;

                        throw new Exception("Unsupported operator: " + op);

                    }
                    throw new Exception("Expecting operator after expression, but got " + _tokens.Current);
                }
            }

            return result;
        }

This is where it all happens, the Parse method calls ParseExpression to get first Expression, and then – if there are more tokens and first one is OperatorToken, we call recursively again Parse and therefore parse next expression and get its result, and then we do the operation based on the Operator, and then inside of that call if needed we call Parse again recursively, and then the next one, and way we go down the rabbit hole…

I told you this was fun :)

Tests

Whole solution is built with Test-First because the idea of Parsers and Tokenizers fit very well with unit testing. If you are interested in this check out the project PlusMinusParser.Tests where there are tests for both the Tokenizer and the Parser. All of them pass an input expressions and expect certain results – integer as solution of the expression (for Parser) or list of Tokens (for Tokenizer).

Here are some tests for Parser (Note that im using NUnit’s TestCase attribute):

    [TestFixture]
    public class PlusMinusParserUnitTests
    {
        [TestCase("1", ExpectedResult = 1)]
        [TestCase("2", ExpectedResult = 2)]
        [TestCase("1+1", ExpectedResult = 2)]
        [TestCase("1+1+1", ExpectedResult = 3)]
        [TestCase("1-1", ExpectedResult = 0)]
        [TestCase("0-0", ExpectedResult = 0)]
        [TestCase("0-1", ExpectedResult = -1)]
        [TestCase("1+2+3+4+5+6", ExpectedResult = 21)]
        [TestCase("100-100", ExpectedResult = 0)]
        [TestCase("100-100+1", ExpectedResult = 1)]
        [TestCase("100-100+100", ExpectedResult = 100)]
        [TestCase("1 2", ExpectedException = typeof(Exception))]
        [TestCase("+", ExpectedException = typeof(Exception))]
        [TestCase("++", ExpectedException = typeof(Exception))]
        [TestCase("--", ExpectedException = typeof(Exception))]
        [TestCase("+-", ExpectedException = typeof(Exception))]
        [TestCase("-+", ExpectedException = typeof(Exception))]
        [TestCase("-", ExpectedException = typeof(Exception))]
        [TestCase("1-1 2", ExpectedException = typeof(Exception))]
        [TestCase("10-100", ExpectedResult = -90)]
        public int CanParseExpressions(string expression)
        {
            var tokens = new PlusMinusTokenizer().Scan(expression);
            var parser = new PlusMinusParser(tokens);
            var res = parser.Parse();
            return res;
        }

Where next?

Well now that we have a solid basic understanding on how to build simple parsers, we can go wild and invent different grammars and build parsers for them. Having this in your tool-belt can be very handy when you encounter tasks that need to parse text.

In one of the next posts i will implement some more complex parsers so stay tuned…

If you want to see the full source code check out my PlusMinusParser repository on GitHub.