Transforming Syntax

Or: how to write the easy part of a compiler

This article has moved to my current blog. Please look there for the most up to date version. The new version includes many corrections and improvements.

This post will cover this stackoverflow question on how to translate an expression to a C-like language by user2516265. The other answers suggested regexes or treating it as Perl code *shudder*. This post aims to cover:

parsing with Marpa::R2,
AST manipulation,
optimization passes,
compilation, and
Perl OO.

It won't cover parser theory etc. or provide a copy&paste-ready solution. The estimated size of the final code will be around $0.2\,\mathrm{kloc}$ .

Note: All runnable code snippets should be prefixed by use strict; use warnings; use 5.010;.

Transforming Syntax

Original question

I am new to the perl scripting. I am writing script to read excel file and put in text file in C programing syntax.

So I excel sheet I have string like below:
If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1
Else a = 0;
This I have to convert into:
a = (((Myvalue.xyz == 1) || (Frmae_1.signal_1 == 1))?1:0)
How this can be handled in perl?

Short introduction to Marpa

Various parser exists for Perl. Some of these are ancient tech, some are solid but slow, some are awesome. My favourite Parsers are:

Regexp::Grammars: Provides many powerful and expressive tools. Retains the full power of Perl regexes.
Parser::MGC: Makes it trivial to write an object oriented top-down parser. However, the resulting code is unneccessarily verbose.
Marpa::R2: This is a wrapper around libmarpa. It is written in C, and uses Early parsing instead of top-down techniques. This makes it fairly easy to write arbitrary BNFm which Marpa will happily use. While it is strictly less powerful than regexes, Marpa provides excellent interfaces and good debugging help that makes it worth the extra code.

Here, we will use the Scanless Interface for Marpa, which adds lexing capabilities to the BNF. To create a new grammar, we have to

$grammar = Marpa::R2::Scanless::G->new({ %options })

one of these options is the BNF source, which has to be passed as a reference.

Writing the BNF

We want to write a grammar that is able to correctly parse the example data

If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1
Else a = 0;

We see that the program is made from statements that are separated by semicolons. We can express this with the BNF rule

StatementList ::= <Expression>+ separator => <op semicolon>
<op semicolon> ~ ';'  # Tokens are declared with '~'

An expression can be an If/Then/Else, a paranthesized expression, a binary operator, an identifier, or a numeric literal. We should choose a sensible precedence for all these possibilities. E.g. paranthesization and literals should have very high precedence. The conditional should be lower. For the binary operators, I have choosen the precedence

== equality test
= assignment
Or logical operator

Marpa Scanless allows us to write simple alternatives with |, and || to provide prioritized rules (this works roughly like | in regexes with respect to backtracking). The BNF for the expression would be:

Expression ::=
    ('(') Expression (')')  assoc => group  # tokens enclosed in parens are matched, but their values discarded
|   Number
||  Ident
||  Expression  '==' Expression
||  Expression  '='  Expression
||  Expression  'Or' Expression
||  Conditional

(This isn't exactly good style, but it is good enough)

The other referenced rules are:

Conditional ::=
    ('If') Expression ('Then') Expression
|   ('If') Expression ('Then') Expression ('Else') Expression

Ident   ~ ident
Number  ~ <number int> | <number rat>

word         ~ [\w]+  # Perl charclasses can be used for tokens
ident        ~ word | ident '.' word # allow complex identifiers
<number int> ~ [\d]+
<number rat> ~ <number int> '.' <number int>

We can indicate Marpa to start matching at the “top” with the pseudorule :start ::= StatementList, and can allow the lexer to skip whitespace between tokens via

:discard ~ ws
ws ~ [\s]+

In the next section, we take a short look at object oriented Programming with Perl and will then circle back to see how Marpa interacts with that.

Intermission: Perl OO

An object is a thingy which we can ask to do things. Depending on what kind of thing it is, it will either throw an error, or fulfill our request. This can simplify code because we don't care how they work on the inside, or where they know from how to perform the task.

Perl uses class-based OO. We can declare a class with the package keyword. All subs in that package can then be used as methods. We can inherit methods from another package via use parent 'Parent::Class'. When a sub is used as a method, the object it was invoked on is passed as the first argument. This invocant is usually called $self in proud Smalltalk tradition.
The invocant may also be a string that holds the name of the package. That is, Perl makes no distinction between class methods and instance methods.

We can create a new instance of a class by blessing a reference into that class. Any reference will do. Usually, hash references are used, but here we will focus on array references. For demonstration purposes, here is a small example to show blessing and inheritance. In this example, we create unneccessarily complex classes to calculate the statistical mean and variance from a data set.

# Call the `new` method on class `Data::Variance`:
my $object = Data::Variance->new(1.1, 1.9, 2.1);

# Call the `variance` method on the object:
say "Var: ", $object->variance;
# Call the inherited `mean` method on the object:
say "Mean:", $object->mean;

# Define the `Data::Mean` class which provides the `mean` method
package Data::Mean;

sub new {
    # The invocant for `new` should be the class name
    my ($class, @data) = @_;
    # bless a reference to the data into the provided class
    my $self = bless \@data => $class;
    # return the freshly baked object
    return $self;
}

# calculate the mean
sub mean {
    my ($self) = @_;
    my $sum = 0;
    $sum += $_ for @$self;
    return $sum / @$self;
}

# Define the `Data::Variance` class which inherits from `Data::Mean`:
package Data::Variance;
use parent -norequire, 'Data::Mean';  # specify inheritance.
                                      # The `-norequire` option is used because the parent is in the same file
# `new` is inherited
# `mean` is inherited

# calculate the variance, bootstrapping it with the `mean`
sub variance {
    my ($self) =    @_;
    my $mean = $self->mean;
    return Data::Mean->new(map { ($_ - $mean)**2 } @$self)->mean;  # nifty math
}

Output:

Var: 0.186666666666667
Mean:1.7

In that example, the reference is blessed in a new method. This is considered good style, but we can bless any reference anywhere into any class (the class doesn't even have to exist). This is what we'll let Marpa do in the next step.

Marpa's `bless` adverb

We can let Marpa take the array of matched values, and bless them into a class of our choosing. We can do so by augmenting a rule with the bless adverb, e.g.

StatementList ::= <Expression>+ separator => <op semicolon> bless => Block

But we also have to specify how these values are obtained. We can do so by setting default attributes like

:default ::= action => [values]

We could have also written an action ourselves that takes the matched values as arguments and returns some data structure. The above default action is similar to sub { return [@_] }.

We can specify a class root for the bless adverb. If we construct the grammar with the bless_package option, e.g. bless_package => 'Ast', then the StatementList would not be blessed into the Block class but into Ast::Block.

Similar classes are added to the other rules.

Obtaining a parse

This is the general workflow to obtain a parse from a Marpa::R2::Scannless::G Grammar object:

Create a recognizer for that grammar:

my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

Read the input string:
```
$recce->read(\$input);
```
Look at the value of the parse. The parse failed if it is undef. Otherwise, it is a reference to out AST:
```
my $value = $recce->value;
defined $value or die "Parse failed";
my $ast = $$value;
```

I have prepared a gist with fully runnable code that parses the example input, and prints out the resulting data structure. The output is shown below (although I improved the intendation):

$VAR1 = bless( [
                 [
                   bless( [
                            bless( [
                                     bless( [
                                              bless( ['Myvalue.xyz'], 'Ast::Var' ),
                                              '==',
                                              bless( ['1'], 'Ast::Literal' )
                                            ], 'Ast::Binop' ),
                                     'Or',
                                     bless( [
                                              bless( ['Frmae_1.signal_1'], 'Ast::Var' ),
                                              '==',
                                              bless( ['1'], 'Ast::Literal' )
                                            ], 'Ast::Binop' )
                                   ], 'Ast::Binop' ),
                            bless( [
                                     bless( ['a'], 'Ast::Var' ),
                                     '=',
                                     bless( ['1'], 'Ast::Literal' )
                                   ], 'Ast::Binop' ),
                            bless( [
                                     bless( ['a'], 'Ast::Var' ),
                                     '=',
                                     bless( ['0'], 'Ast::Literal' )
                                   ], 'Ast::Binop' )
                          ], 'Ast::Cond' )
                 ]
               ], 'Ast::Block' );

The above data structure is equivalent to this tree:

                                              Block
                                                |
                                               [ ]
                                                |
                      /----------------------- Cond ---------\
                     /                          |             \
                Binop                          Binop           Binop
             /   |      \                     / | \            / | \
        Binop   'Or'     \-- Binop         Var '=' Literal  Var '=' Literal
        / | \               /  |  \         |         |      |         |
    Var '==' Literal     Var  '==' Literal  'a'       '1'    'a'       '0'
     |          |         |            |
'Myvalue.xyz'  '1' 'Frmae_1.signal_1' '1'

We can see that the Data::Dumper output is much harder to read than the tree. This is mainly because it contains too much superfluous information.

Working with the AST

Before we can start working with the syntax tree Marpa creates for us, we should take a moment to specify inheritance, and write a few accessor method so that we don't have to hardcode indices of the underlying arrayrefs. This is fairly boring:

package Ast;  # a parent class to define common methods

    # a constructor for convenience.
    # Marpa won't use this but rather `bless` directly
    sub new {
        my ($class, @args) = @_;
        bless \@args => $class;
    }

package Ast::Binop;
    use parent -norequire, 'Ast';

    sub l   { shift()->[0] }
    sub op  { shift()->[1] }
    sub r   { shift()->[2] }

package Ast::Var;
    use parent -norequire, 'Ast';

    sub name { shift()->[0] }

package Ast::Cond;
    use parent -norequire, 'Ast';

    sub cond { shift()->[0] }
    sub then { shift()->[1] }
    sub else { shift()->[2] }

package Ast::Block;
    use parent -norequire, 'Ast';

    sub contents { shift()->[0] }

    # comes in handly later on
    sub new { 
        my ($class, @items) = @_;
        $class->SUPER::new(\@items);
    }

package Ast::Literal;
    use parent -norequire, 'Ast';

    sub val { shift()->[0] }

We will now look at a few ways to transform the AST.

Transformation: cloning

It can sometimes be useful to make a deep copy of a data structure. This is fairly easy for each AST node. To copy an AST node, it

clones all child nodes, or copies the values if they aren't a regular node (we call such nodes a terminal node).
blesses an array reference containing those copies into the correct class.

To make things a bit easier, we start by defining a childs method that just gives the children nodes:

package Ast;
    ...;
    sub childs { @{ shift() } }

...;

package Ast::Block;
    ...;
    sub childs { @{ shift()->contents } }

To determine whether a given child node is a terminal node, we ask it if it can perform the clone method. We can do so with the universal can method. I.e. $node->can("clone") does this check.

However, it is a fatal error to call a method on an unblessed reference (something that isn't an object). Therefore, we first do a check whether the $node is an object at all. The blessed method from Scalar::Util comes in handy here.

Therefore, we can perform cloning like

package Ast;
    use Scalar::Util qw/blessed/;
    ...;
    sub clone {
        my ($self) = @_;
        my @childs = map { blessed($_) && $_->can("clone") ? $_->clone : $_ } $self->childs;
        ref($self)->new(@childs);
        # ref $self is the class of $self.
        # Remember that we can call methods on strings
        # that contain the name of a class.
    }

Note: This technique assumes that the given data structure is a tree alright, and doesn't contain loops (that would be a cyclic graph, not a tree).

I have prepared a gist that runs this code on the above Data::Dumper output.

Transformation: AST dump (aka. compilation)

As mentioned above, the Data::Dumper output is hard to read. Therefore, we will create a method that prettyprints a part of the AST. The output will generally look like

Ast::Something(
  childnode_A(
    ...
  )
  'some literal'
  ${a.variable}
)

Our prettyprint method will take an optional argument that indicates the intendation level. First, we add the general implementation:

package Ast;
    ...;
    sub prettyprint {
        my ($self, $indent) = @_;
        $indent //= 0;  # initialize $indent if no value passed
        $indent++;      # increment indent level
        my $items = join "\n",         # concatenate items with newline
            map { "  "x$indent . $_ }  # pad with intendation
            map { blessed($_) && $_->can("prettyprint")
                    ? $_->prettyprint($indent)
                    : $_ 
                } $self->childs;
        my $type = ref $self;
        return "$type(\n" . $items . " )";
    }

For Ast::Literal and Ast::Var, we provide special implementations:

package Ast::Literal;
    ...;
    sub prettyprint {
        my $self = shift;
        my $val = $self->val // return "undef";
        return qq("$val");
    }

package Ast::Var;
    ...;
    sub prettyprint {
        my $self = shift;
        '${' . $self->name . '}';
    }

I have made a gist that adds prettyprinting. When the prettyprint method is called on the dumped AST, we get the output

Ast::Block(
  Ast::Cond(
    Ast::Binop(
      Ast::Binop(
        ${Myvalue.xyz}
        ==
        "1" )
      Or
      Ast::Binop(
        ${Frmae_1.signal_1}
        ==
        "1" ) )
    Ast::Binop(
      ${a}
      =
      "1" )
    Ast::Binop(
      ${a}
      =
      "0" ) ) )

What I have done here was defining a language, and then compiling the AST to that language. I can easily add further implementations that change how the AST is compiled. E.g, we could add an if-then-else for the conditional like

package Ast::Cond;
    ...;
    sub prettyprint {
        my ($self, $indent) = @_;
        $indent //= 0;
        my ($cond, $then, $else) =
            map { Scalar::Util::blessed($_) && $_->can("prettyprint")
                    ? $_->prettyprint($indent)
                    : $_
                } $self->childs;
        return "if $cond\n"
          . "  "x$indent . "then $then\n"
          . "  "x$indent . "else $else";
    }

This would produce the output

Ast::Block(
  if Ast::Binop(
    Ast::Binop(
      ${Myvalue.xyz}
      ==
      "1" )
    Or
    Ast::Binop(
      ${Frmae_1.signal_1}
      ==
      "1" ) )
  then Ast::Binop(
    ${a}
    =
    "1" )
  else Ast::Binop(
    ${a}
    =
    "0" ) )

(see the gist for the complete code)

When you compile the AST to the C-like target language, you will probably use similar transformations.

Transformation: AST modification (aka. optimization)

Arguably more difficult than flattening the AST to some representation is modifying the structure of the syntax tree. This is due mostly to Perl's type system which doesn't lend itself to these tasks. (Generally, strongly typed functional languages like Ocaml or Haskell tend to excel here due to very expressive pattern matching).

In your stackoverflow question, you asked how to factor out the common assignment from the two branches of the conditional. To do that, we have to assert

That both branches are binops.
That the operator of both binops is the assignment operator =.
That the lvalue of both assignments is a variable.
That this variable goes by the same name.

Then we can factor out the assignment, and will assign the result of the conditional. Expressed as Perl code, given a branch $branch, we want the condition

$branch->isa("Ast::Binop") && $branch->op eq "=" && $branch->l->isa("Ast::Var")

to be true. The 4th condition can be expressed as $then->l->name eq $else->l->name. The complete simplify method for $cond would then be:

package Ast::Cond;
    ...;
    sub simplify {
        my $self = shift;
        my ($cond, $then, $else) = @$self;

        if (
            not( grep not($_->isa('Ast::Binop') && $_->op eq '=' && $_->l->isa('Ast::Var')), $then, $else) and
            $then->l->name eq $else->l->name
        ) {
            return Ast::Binop->new(
                $then->l->clone,
                '=',
                Ast::Cond->new($cond->simplify, $then->r->simplify, $else->r->simplify),
            );
        }
        # else: just do what would have been done by default
        return $self->SUPER::simplify;
    }

The not grep not ... is an application of De Morgan's Law: $P \land Q \Leftrightarrow \overline{\overline P \lor \overline Q}$ .

All other nodes inherit a simplify from Ast that does

sub simplify {
    my $self = shift;
    my @childs = map {blessed($_) && $_->can("simplify") ? $_->simplify : $_}  $self->childs;
    ref($self)->new(@childs);
}

In other words, a simplify on an AST that does not contain a conditional where the common subexpression is found has the same result as a clone on the AST.

If the simplify method is invoked on our AST, we get the following output:

Ast::Block(
  Ast::Binop(
    ${a}
    =
    if Ast::Binop(
      Ast::Binop(
        ${Myvalue.xyz}
        ==
        "1" )
      Or
      Ast::Binop(
        ${Frmae_1.signal_1}
        ==
        "1" ) )
    then "1"
    else "0" ) )

Closing remarks

Creating an AST and compiling it down to some representation may seem somewhat hard, but it isn't rocket science either. In the above examples, simplicity and code size were a constraint. In a real-world application, it would be helpful to define more AST node types. There is no such thing as a Binop on a semantic level, instead I'd use Ast::Logical::Or, Ast::Assign, Ast::Numeric::Eq and so on. Also, numeric literals should have a different type than string literals etc.

This would also make the final compilation to the C-like language easier. To make that step, one would implement another method that recurses through the tree, as clone, simplify and pretyprint do. However, there is no general solution, so there isn't much inheritance going on here. Especially the implementation of compile for the Ast class should probably throw an error.

Because the AST is unambiguous, there is no concept of precedence present in a tree. Therefore, it might be helpful to surround each output subexpression with parens, even when a human programmer knows that they aren't needed.

The AST Marpa produces and the AST we actually want may sometimes be slightly different. To do the translation, custom actions can be helpful. The Marpa docs cover this to some extent.

The grammar shown was tailored to match the example input, but not much more. Most important of all, more operators would have to be implemented with the correct precedence and association rules. Many operators are left-associatove:

$x + $y + $z == (($x + $y) + $z)

But assignment is right-associative:

($x = $y = $z) == ($x = ($y = $z))

Associativity can be specified by the assoc adverb in the Marpa grammar, e.g.

Expression ::=
    ...
||  Expression ('=') Expression assoc => right bless => Assign
    ...

Some operators like comparision aren't associative in Perl, instead

$x <= $y <= $z

is a syntax error (Perl5 has no chaining comparision operators).