Recursive Descent Parser for arithmetic expressions with real numbers

In previous post we were building Recursive Descent Parser for Boolean expressions and in the post before that we were parsing simple arithmetic expressions (with only addition and subtraction).

In this third, final post we will build more real world example – we will parse arithmetic expressions that include (besides addition and subtraction) multiplication and division. Note here that we also have to respect the priority of operations (multiplication and division are done before addition and subtraction) so our parser has to be “smart” enough to handle that. And just to make it more fun, this time we will also support real numbers instead of just boring integers. :)

Grammar

As usual, lets start with the grammar – i cannot stress how important this step is for the whole process:

Expression := [ "-" ] Term { ("+" | "-") Term }
Term       := Factor { ( "*" | "/" ) Factor }
Factor     := RealNumber | "(" Expression ")"
RealNumber := Digit{Digit} | [Digit] "." {Digit}
Digit      := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Since we need to support priority of operations, we are putting the addition and subtraction (and negation) on top level of our grammar and therefore they will be evaluated at the end after the recursive calls to lover level terminal symbols are parsed and evaluated – because we are building a “descent” parser that goes from top level concepts towards lower ones.

Lets analyze each row of our expression.

In Expression definition row we have introduced non-terminal symbol Term that represents the part of expression that is on top level and has addition or subtraction for operation:

Expression := [ "-" ] Term { ("+" | "-") Term }

Inside of Term definition we are introducing the Factor non-terminal symbol that represents part of expression that has multiplication or division for operation:

 Term       := Factor { ( "*" | "/" ) Factor }

In the Factor we are already expecting the ending symbol of our expression – the RealNumber, its the first one we can start directly evaluating. If not RealNumber then we expect another Expression in parenthesis:

 Factor     := RealNumber | "(" Expression ")"

After that we define what RealNumber is – its either a digit followed by more digits or its a digit with decimal point followed by more decimal places:

RealNumber := Digit{Digit} | [Digit] "." {Digit}

The Parser

Since we defined the grammar, writing the parser is almost mechanical process. We just start from the top level concepts in the grammar and create method for each element we have (Term, Factor, RealNumber, Digit etc) and call them one after the other or recursively.

Here is our entry method called Parse that starts the parsing and handles the first row of our grammar, the Expression row:

        // Expression := [ "-" ] Term { ("+" | "-") Term }
        public double Parse()
        {
            var isNegative = NextIsMinus();
            if (isNegative)
                GetNext();
            var valueOfExpression = TermValue();
            if (isNegative)
                valueOfExpression = -valueOfExpression;
            while (NextIsMinusOrPlus())
            {
                var op = GetTermOperand();
                var nextTermValue = TermValue();
                if (op is PlusToken)
                    valueOfExpression += nextTermValue;
                else
                    valueOfExpression -= nextTermValue;
            }
            return valueOfExpression;
        }

You can see that we are just following the grammar: we handle the negation and then call the TermValue method to get the next Term and then while there are plus or minus operators, we continue getting the next term again (again by calling TermValue) and then based on the extracted operator we decide should we add or subtract those two Term values. We repeat this as many as times as needed and accumulate the total result and then return it – that’s all.

Lets drill down to see the TermValue method:

        // Term       := Factor { ( "*" | "/" ) Factor }
        private double TermValue()
        {
            var totalVal = FactorValue();
            while (NextIsMultiplicationOrDivision())
            {
                var op = GetFactorOperand();
                var nextFactor = FactorValue();

                if (op is DivideToken)
                    totalVal /= nextFactor;
                else
                    totalVal *= nextFactor;
            }

            return totalVal;
        }

Here its again similar process: but we are handling the second row of our grammar. We call the FactorValue method to parse the next Factor and then while we encounter tokens for multiplication or division we again call FactorValue to get the second Factor and based on the operation we divide or multiply. And as many as times this happens we are accumulating the result in the totalVal variable, and returning it at the end.

Final important method we need to implement for parser is the FactorValue method:

        // Factor     := RealNumber | "(" Expression ")"
        private double FactorValue()
        {
            if (NextIsDigit())
            {
                var nr = GetNumber();
                return nr;
            }
            if (!NextIsOpeningBracket())
                throw new Exception("Expecting Real number or '(' in expression, instead got : " + (PeekNext() != null ? PeekNext().ToString()  : "End of expression"));
            GetNext();

            var val = Parse();

            if (!(NextIs(typeof(ClosedParenthesisToken))))
                throw new Exception("Expecting ')' in expression, instead got: " + (PeekNext() != null ? PeekNext().ToString() : "End of expression"));
            GetNext();
            return val;
        }

This one is even simpler. Based on our grammar row we know that we can expect either our RealNumber, or Expression in parenthesis.
So we check if next Token in the expression is token with number (NumberConstatToken), and if it is we just extract its value by calling GetNumber method and return it.

If next token is not number, then we know we expect the opening bracket (we throw exception if not) and here we recursively call the main Parse method (from which the execution started in the first place) to parse the next expression and so on, and on, and on until we unwind the expression completely ending up with the calculated result for the whole expression.

I wont go through all the helper methods that are defined and used there since they are trivial and not really interesting – you can always check out the full source on GitHub. Important takeaway is that it is really straightforward to implement the parser once you define the grammar properly – this opens up endless possibilities…

Looking ahead

You probably noticed that in the code i’m able to check if next symbol in the expression is of certain type – this is because i implemented simple TokensWalker class which allows me to peek the next token without consuming it:

    public class TokensWalker
    {
        private readonly List<Token> _tokens = new List<Token>();
        private int _currentIndex = -1;

        public bool ThereAreMoreTokens
        {
            get { return _currentIndex < _tokens.Count - 1; }
        }

        public TokensWalker(IEnumerable<Token> tokens)
        {
            _tokens = tokens.ToList();
        }

        public Token GetNext()
        {
            MakeSureWeDontGoPastTheEnd();
            return _tokens[++_currentIndex];
        }

        private void MakeSureWeDontGoPastTheEnd()
        {
            if (!ThereAreMoreTokens)
                throw new Exception("Cannot read pass the end of tokens list");
        }

        public Token PeekNext()
        {
            MakeSureWeDontPeekPastTheEnd();
            return _tokens[_currentIndex + 1];
        }

        private void MakeSureWeDontPeekPastTheEnd()
        {
            var weCanPeek = (_currentIndex + 1 < _tokens.Count);
            if (!weCanPeek)
                throw new Exception("Cannot peek pass the end of tokens list");
        }

        public bool IsNextOfType(Type type)
        {
            return PeekNext().GetType() == type;
        }
    }

May the source be with you

This concludes our not so long voyage through Recursive Descent Parsers. Hopefully i was able to interest some of you to start playing with these concepts. Now you can go and extend this and build a new calculator application, better than the built-in one that came with your beloved OS. :)

If you want to see the full source of this parser with all the unit tests, its available in my ArithmeticExpressionParser GitHub repo.

4 thoughts on “Recursive Descent Parser for arithmetic expressions with real numbers

    1. Hi Ricardo,

      yes using generators like ANTLR is valid option, but i prefer to build things by myself in order to understand them better. Once you know the internals, then you can use finished products like ANTLR in better way because you know what are the potential problems etc.

      So consider this article as way of learning by doing, also as a way of sharing the knowledge :)

      Thanks for your feedback!

  1. This is great!
    I plan to use your boolean expression parser in my project (but I will slightly modify it so that AND operator will take precedence over OR operator — already done this).
    However, please check what result will give your arithmetic parser to -3 * -3.
    It gives Exception instead of 9.
    Thank you very much for your work!

Leave a Reply to Sergey Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>