First steps with Tinyap

Author : Damien “bl0b” Leroux (damien dot leroux at gmail dot com)

This page explains how to fetch, install, run, and use tinyap.

Introduction
I. Installation
II. Meta-Grammar
III. Usage
IV. (Un)Serialization of ASTs
V. These apes are made for walking...
VI. Unparsing
VII. C API

Introduction

“One build to parse them all.”

Basically, tinyap is a recursive descent parser with backup. Thus, it's able to recognize any LL(k) language. When a parse is successful, tinyap outputs an Abstract Syntax Tree (AST) that represents the input text in a structured manner.

Unlike most parsers, its code isn't factory-calibrated for a particular grammar. Instead, the grammar is data. Tinyap uses an AST that represents a grammar to parse its input text. This grammar AST is structured according to tinyap's grammar description language.

The factory default for the grammar AST is the grammar description language itself. So one can parse a grammar description and directly use the parse output to parse some other text written in the described language. Tinyap also features a plugin mechanism for grammars, which allows for dynamic modular grammars. The ASTs tinyap produces can be serialized to and deserialized from buffers and files.

Finally, tinyap provides an interface to walk down the ASTs and makes it easy to write external plugins to visit the nodes, e.g. a compiler, an interpreter, a static evaluator, a renderer...

You can have a look at an extensive use of tinyap at http://code.google.com/p/tinyaml where tinyap is used to parse compile and extend a base meta-language.

I. Installation

First, get the source code.

download a tarball : (version 1.4-0 in used the example below, you can download the latest at http://code.google.com/p/tinyap/downloads/list).
```
 $ wget http://tinyap.googlecode.com/files/tinyap-1.4-0.tar.gz
 $ tar xzf tinyap-1.4-0.tar.gz
```

using SVN :

 $ svn checkout http://tinyap.googlecode.com/svn/trunk/ tinyap-read-only

Now you have your tinyap source distribution at hands, build it.

 $ cd tinyap
 $ CFLAGS=-O3 ./configure -C --prefix=/my/install/prefix/if/not/slash/usr
 $ make all
 $ make install

You might need root privileges to run make install.

II. Meta-Grammar

“The language is that of EBNF, which I will not utter here. In the common tongue it says...”

The way the parser produces its output is driven by the grammar it uses. A grammar is a set of production rules, each rule associates a symbolic name and one or more elements that the rule will produce. Each element successfully parsed is converted in an AST node and the full AST is built after the recursive descent. When a rule successfully produces its elements, it can evaluate to a new operator node and enclose its produced nodes (these rules are called operator rules), or evaluate to its produced nodes silently (these rules are called transient rules). Transient rules are useful to implement loops and split rules.

The parser distinguishes information from garbage. A special rule defines the garbage characters to strip from the input between two elements. Terminal strings don't contain information and are not included in the output.

Elements

<non-terminal> : the parser will try to produce the rule named "non-terminal", and append the result to output.
"terminal" : the parser will try to produce the string "terminal" at its current position, and discard it from output.
/reg-exp/ : the parser will search for a match at its current position, and append the match string to output.
//reg-exp/replacement/ : same as above, but the match will be replaced by the replacement string. Numbers from 0 to 9 prefixed by a backslash \ reference submatches (and \0 is the whole match).

Expressions

Sequence
```
 expression expression... 
```
produce all children from left to right
Selection
```
 ( expression | expression | ... ) 
```
try all children from left to right until successful. It is allowed to have only one sub-expression inside parenthesis.
Prefix
```
 [ expression ] <non-terminal> 
```
produce expression, produce non-terminal, insert expression at head of non-terminal's children, evaluate to non-terminal
Postfix (much useless)
```
 { expression } <non-terminal> 
```
same as prefix, but insert expression at tail of non-terminal's children

Repetition operators

 expression*   # match expression zero or more times (behaves like expr_loop = ( <expression> <expr_loop> | epsilon ). )
 expression?   # match expression zero or one time (behaves like opt_expr = ( <expression> | epsilon ). )
 expression+   # match expression one or more times (behaves like expr = <expression> <expr_loop>. )

Rules

Operator rules
```
 symbol ::= expression . 
```
produce an operator node labeled "symbol"
Transient rules
```
 symbol = expression . 
```
produce expression
_start : tinyap searches for this rule to start a parse.
_whitespace : tinyap searches for a string or a regexp in the right-hand side expression, and uses it as the garbage characters pattern.

Special symbols

epsilon : produce the empty string (never fails)
EOF : produce the end of file

Special constructs

Handling trivial left-recursive rules
```
 lefty = ( <lefty> <A> | ). 
```
(works for both operator and transient rules) tinyap recognizes such rules and handles them correctly. Use with care. Remember it's an LL parser, though. Tinyap doesn't handle ALL left-recursive rules. You're warned.
Grammar plugins
```
 plugin_rule = ( <A> | | ... ). 
```
tinyap can plug new alternatives in rules where the right-hand side expression is a selection (see VII. C API).

Here is the full grammar description language expressed with itself :

elem = /[_a-zA-Z][0-9a-zA-Z_]*/ .
T ::= //\"\(\([^\"\\]|[\\][\"\ trnb]\)*\)\"/\\1/ .
NT ::= "<" <elem> ">" .
re_re = /\([^\\/]|[\\][][\\/\ <>trnb\"]\)*/ .
RE ::= "/" <re_re> "/" .
RPL ::= "//" <re_re> "/" /\([^\r\n\\\\/]+|\\\\[\\\\/0-9]\)+/ "/" .
rule = ( <OperatorRule> | <TransientRule> ) .
OperatorRule ::= <elem> "::=" <rule_expr> "." .
TransientRule ::= <elem> "=" <rule_expr> "." .
rule_expr = ( <Alt> | <Seq> | <rule_elem> ) .
Prefix ::= "[" <rule_expr> "]" <rule_elem_atom> .
Postfix ::= "{" <rule_expr> "}" <rule_elem_atom> .
Seq ::= <rule_elem> <seq_expr> .
Alt ::= "\(" <alt_expr> "\)" .
seq_expr = ( <rule_elem> <seq_expr> | <rule_elem> ) .
alt_expr = ( <rule_elem> "|" <alt_expr> | <Seq> "|" <alt_expr> | <Seq> | <rule_elem> ) .
rule_elem = ( <Rep> | <rule_elem_atom> ) .
rule_elem_atom = ( <EOF> | <epsilon> | <T> | <NT> | <RPL> | <RE> | <Alt> | <Prefix> | <Postfix> ) .
_start = <Grammar> .
Grammar ::= <_loop> .
_loop = ( EOF | <rule> <_loop> ) .
EOF ::= "EOF" .
epsilon ::= "epsilon" .
_whitespace = /\([\ \r\n\t]|#[^\r\n]*[\r\n]+\)+/ .
Rep1N ::= <rule_elem_atom> "+" .
Rep0N ::= <rule_elem_atom> "*" .
Rep01 ::= <rule_elem_atom> "?" .
Rep = ( <Rep1N> | <Rep0N> | <Rep01> ) .

Tinyap features two other dialects to describe grammars which feature minor differences from the basic explicit dialect. You can look at their full description in explicit dialect by running

 tinyap -g dialectName -pg

, where dialectName is one of :

CamelCasing All rules use the "::=" notation. Rule names that obey CamelCasing are operator rules, other rules are transient rules.
short New and sexier dialect. Same as explicit but for non-terminals : the < and > symbols are stripped. The epsilon and EOF symbols both get a leading _ .
```
 i.e. the rule :
 foobar ::= "foo" epsilon <bar>+ EOF.
 becomes
 foobar ::= "foo" _epsilon bar+ _EOF. 
```

Below is a condensed grammar to parse arithmetic expressions. It will be used in the following examples :

#number ::= /[0-9]+(\.[0-9]*)?/.
number ::= /[0-9]+/.

m_expr ::= <expr4>.

m_minus ::= "-" <number>.
m_div ::= ( <m_div> "/" <expr0> | <expr0> "/" <expr0> ).
m_mul ::= ( <m_mul> "*" <expr1> | <expr1> "*" <expr1> ).
m_sub ::= ( <m_sub> "-" <expr2> | <expr2> "-" <expr2> ).
m_add ::= ( <m_add> "+" <expr3> | <expr3> "+" <expr3> ).

expr0 = ( <m_minus> | <number> | "(" <m_expr> ")" ).
expr1 = ( <m_div> | <expr0> ).
expr2 = ( <m_mul> | <expr1> ).
expr3 = ( <m_sub> | <expr2> ).
expr4 = ( <m_add> | <expr3> ).

_expr = <m_expr> ";".

_start = ( EOF
         | <_expr> <_start>
         ).

Notice that number, m_expr, m_mul, m_sub, m_add, m_div are operator rules. These are the labels we'll have in the nodes of the parse results.

III. Usage

“First shalt thou take out the Holy Command Line, then shalt thou count to three, no more, no less.
Three shall be the number thou shalt count, and the number of the counting shall be three.”

When invoked, tinyap executes each of its command-line arguments from left to right.

```
 --help, -h [name] 
```
Display version and usage.
```
 --grammar, -g [file name] 
```
Set the grammar to be used. If the name is explicit or CamelCasing, the according grammar dialect is used. Otherwise, the named file must contain a serialized "Grammar" AST.
```
 --pring-grammar, -pg 
```
Output the current grammar to stdout, using explicit dialect.
```
 --input, -i [file name] 
```
Set the file to be parsed. - sets input to stdin.
```
 --output, -o [file name] 
```
Output serialized AST to named file. - sets output to stdout.
```
 --parse, -p 
```
Parse the input file using the current grammar.
```
 --parse-as-grammar, -pag 
```
Parse the input file using the current grammar and set grammar to the newly output AST if parse succeeded. Parser output is NULL after this.
```
 --walk, -w [name] 
```
Walk over the current parser output with the named walker.

Examples : (assuming you're in ./examples/)

first use :
```
 $ tinyap -h 
```
display the default meta-grammar :
```
 $ tinyap -pg 
```
parse a grammar :
```
 $ tinyap -i math.gram -pag 
```

parse something and display the result tree :

 $ tinyap -i math.gram -p -w prettyprint

parse a grammar and a text file :

 $ tinyap -i math.gram -pag -i test.math -p

evaluate the operations in test2.math :

$ gcc -shared ape_tinycalc.o -o libape_tinycalc.so # you may need to build the shared object
$ LD_LIBRARY_PATH=. tinyap -i math.gram -pag -i test2.math -p -w tinycalc

IV. (Un)Serialization of ASTs

Tinyap makes it easy writing and reading ASTs to and from files and buffers. Serialized ASTs can be directly used as grammars :

 $ tinyap -i math.gram -p -o math.gram.srlz                 # parse and serialize math.gram
 $ tinyap -g math.gram.srlz -i test2.math -p -w tinycalc    # using math.gram, parse test2.math and evaluate the operations

The C API provides more means to serialize and unserialize ASTs to/from files and buffers. Such interactions are pointless in command line.

V. These apes are made for walking...

“...And that's just what they'll do : one of these days these apes are gonna walk 'ver the output.”

Tinyap provides an interface to easily write AST evaluators (hence the name, tinyap evaluator -> tinyape -> ape).

An evaluator has to provide four functions :

init : receives data under the form of a void* and has to initialize the ape.
term : terminate the ape.
result : has to return the result of the ape's evaluation.
default : this visit method is called whenever no specific visit method has been defined for an operator node.

Visit methods take the current node as an argument and return the direction for the next step in walk. Directions can be :

Up : to parent node
Down : walk on each child node (operands)
Next : walk to next sibling node (or is equivalent to "Up-Next" on the last child of a node)
Done : evaluation is over and successful.
Error : evaluation failed.

Each method of an evaluator must be named as follows :

ape_ evaluator_name _ method

method is init, term, result, default, or an operator label.

Tinyap handles the actual walking depending on evaluator's direction returns, and invokes the proper evaluator's visit method at each step.

Here is the code for the pretty-printer (invoke it in command line with "-w prettyprint") :

/* Tinya(J)P : this is not yet another (Java) parser.
 * Copyright (C) 2007 Damien Leroux
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */

#include "config.h"
#include "tinyap.h"
#include "pilot_manager.h"
#include "walker.h"


void* ape_prettyprint_init(void* init_data) { return NULL; }

void apt_prettyprint_free(void*data) {}

void prettyprint_node_header(wast_t father, wast_t node) {
        char c[4]={' ',' ',0,0};
        if(!father) {
                return;
        }
        if(!node) {
                c[1]='-';
                node=father;
                father=wa_father(node);
        }
        if(father) {
                prettyprint_node_header(wa_father(father), father);
                if(wa_opd(father,wa_opd_count(father)-1)!=node) {
                        c[0] = '|';
                } else {
                        if(c[1]=='-') {
                                c[0] = '`';
                        }
                }
                fputs(c,stdout);
        }
}



WalkDirection ape_prettyprint_default(wast_t node, void*___) {
        prettyprint_node_header(node,NULL);
        puts(wa_op(node));

        if(wa_opd_count(node)) {
                return Down;
        } else {
                return Next;
        }
}

And here is the code for the tiny calculator "tinycalc" (seen in the examples above) :

/* Tinya(J)P : this is not yet another (Java) parser.
 * Copyright (C) 2007 Damien Leroux
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */

#include "../src/tinyape.h"
#include <stdlib.h>
#include <stdio.h>

typedef struct _tc_struc {
        int result;
        int is_toplevel;
}* tc_t;

void* ape_tinycalc_init(void* init_data) {
        tc_t tc = (tc_t) malloc(sizeof(struct _tc_struc));

        tc->is_toplevel = init_data==NULL;

        return tc;
}

void* ape_tinycalc_result(tc_t i) {
        if(i->is_toplevel) {
                printf("result = %i\n",i->result);
        }
        return (void*)i->result;
}

void ape_tinycalc_free(tc_t data) {
        free(data);
}

WalkDirection ape_tinycalc_default(wast_t node, tc_t this) {
        printf("unknown node %s\n",wa_op(node));
        return Done;
}

WalkDirection ape_tinycalc_number(wast_t node, tc_t this) {
        this->result = atoi(wa_op(wa_opd(node,0)));
        return Done;
}

WalkDirection ape_tinycalc_m_add(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a + b;
        return Done;
}

WalkDirection ape_tinycalc_m_sub(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a - b;
        return Done;
}

WalkDirection ape_tinycalc_m_mul(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a * b;
        return Done;
}

WalkDirection ape_tinycalc_m_div(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a / b;
        return Done;
}

WalkDirection ape_tinycalc_m_minus(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        this->result = - a;
        return Done;
}

WalkDirection ape_tinycalc_m_expr(wast_t node, tc_t this) {
        return Down;
}

VI. Unparsing

With using a grammar and an AST, create back the text buffer that generated this AST when parsed with this grammar.

This can be useful for data (un)serialization, as well as syntax highlighting and source code reformatting.

Tinyap recognizes the special pseudo-elements Space, Indent, Dedent, and NewLine. These elements are ignored when parsing text, and used to perform indentation when unparsing an AST. A special grammar rule indent may define the indentation string. Tinyap defaults to the tab character if this rule is not defined. Space is used to force a white space between similar tokens, or to ensure formatting. Many Space elements in a row will insert only one space.

Note:: Note that the grammar used to unparse is not required to be the same grammar used for parsing. It is only required to define all the rules that match the AST labels.
Since tinyap doesn't recover from parse errors yet, the unparsing feature can't be trivially used to format source code while it is edited. This should be handled in the next release (Help wanted ! I haven't defined yet how to behave when encountering errors, since I'm busy by now it may take some time before tinyap is able to handle errors, unless people come to help with this feature).

Detailed pseudo-elements behaviour :

Space : insert one space into output text. More than one Space elements in a row will produce one single space character in the output text.
NewLine : insert a NewLine ('
') character into output text, followed by as many _indent strings as the current indent level.
Indent : increment indent level and insert one _indent string into the output text.
Dedent : decrement indent level and remove the last _indent string from output text.

Quick output formatting how-to :

It is quite straightforward to include formatting elements in a grammar, for instance to add color or typefaces to the output text, without interfering with parsing input text.

Using terminals and the ? operator, one can define optional elements that will be taken into account only at unparsing time. Example :

        _start = foobar.
        foobar ::= "<em>"? /(foo|bar|baz)+/ "</em>"?.

The text "foobarbazfoo" will be recognized by this rule, because the ? operators allow the terminals to be omitted. The corresponding AST (foobar foobarbazfoo) will be unparsed to "foobarbazfoo" because unparsing always tries to produce optional elements.

Moreover, it is easy to define configurable formatters :

        _start = foobar.
        foo_prefix = ( "<em>" ).
        foo_suffix = ( "</em>" ).
        foobar ::= foo_prefix? /(foo|bar|baz)+/ foo_suffix?.

Prefix and Suffix are defined as pluggable rules. Now, just plug a new formatting terminal in each rule. Unparsing will always use the last formatting terminal plugged in.

For instance :

Unparsing (foobar foobarbazfoo) gives "foobarbazfoo".
Plug "PFX" into foo_prefix.
Plug "SFX" into foo_suffix.
Unparsing (foobar foobarbazfoo) now gives "PFXfoobarbazfooSFX".

VII. C API

See also:: tinyap.h
tinyape.h

Generated on Sun May 11 15:04:49 2008 for TINYAP by

1.5.3