First steps with Tinyap

Author : Damien “bl0b” Leroux (damien dot leroux at gmail dot com)

This page explains how to fetch, install, run, and use tinyap.

Contents

Introduction
I. Installation
II. Meta-Grammar
III. Usage
IV. (Un)Serialization of ASTs
V. These apes are made for walking...
VI. Unparsing
VII. C API

Introduction

“One build to parse them all.”

Basically, tinyap is a recursive descent parser with backup. Thus, it's able to recognize any LL(k) language. When a parse is successful, tinyap outputs an Abstract Syntax Tree (AST) that represents the input text in a structured manner.

Unlike most parsers, its code isn't factory-calibrated for a particular grammar. Instead, the grammar is data. Tinyap uses an AST that represents a grammar to parse its input text. This grammar AST is structured according to tinyap's grammar description language.

The factory default for the grammar AST is the grammar description language itself. So one can parse a grammar description and directly use the parse output to parse some other text written in the described language. Tinyap also features a plugin mechanism for grammars, which allows for dynamic modular grammars. The ASTs tinyap produces can be serialized to and deserialized from buffers and files.

Finally, tinyap provides an interface to walk down the ASTs and makes it easy to write external plugins to visit the nodes, e.g. a compiler, an interpreter, a static evaluator, a renderer...

You can have a look at an extensive use of tinyap at http://code.google.com/p/tinyaml where tinyap is used to parse compile and extend a base meta-language.

I. Installation

First, get the source code.

Now you have your tinyap source distribution at hands, build it.

 $ cd tinyap
 $ CFLAGS=-O3 ./configure -C --prefix=/my/install/prefix/if/not/slash/usr
 $ make all
 $ make install
You might need root privileges to run make install.

II. Meta-Grammar

“The language is that of EBNF, which I will not utter here. In the common tongue it says...”

The way the parser produces its output is driven by the grammar it uses. A grammar is a set of production rules, each rule associates a symbolic name and one or more elements that the rule will produce. Each element successfully parsed is converted in an AST node and the full AST is built after the recursive descent. When a rule successfully produces its elements, it can evaluate to a new operator node and enclose its produced nodes (these rules are called operator rules), or evaluate to its produced nodes silently (these rules are called transient rules). Transient rules are useful to implement loops and split rules.

The parser distinguishes information from garbage. A special rule defines the garbage characters to strip from the input between two elements. Terminal strings don't contain information and are not included in the output.

Elements

Expressions

Rules

Special symbols

Special constructs

Here is the full grammar description language expressed with itself :

elem = /[_a-zA-Z][0-9a-zA-Z_]*/ .
T ::= //\"\(\([^\"\\]|[\\][\"\ trnb]\)*\)\"/\\1/ .
NT ::= "<" <elem> ">" .
re_re = /\([^\\/]|[\\][][\\/\ <>trnb\"]\)*/ .
RE ::= "/" <re_re> "/" .
RPL ::= "//" <re_re> "/" /\([^\r\n\\\\/]+|\\\\[\\\\/0-9]\)+/ "/" .
rule = ( <OperatorRule> | <TransientRule> ) .
OperatorRule ::= <elem> "::=" <rule_expr> "." .
TransientRule ::= <elem> "=" <rule_expr> "." .
rule_expr = ( <Alt> | <Seq> | <rule_elem> ) .
Prefix ::= "[" <rule_expr> "]" <rule_elem_atom> .
Postfix ::= "{" <rule_expr> "}" <rule_elem_atom> .
Seq ::= <rule_elem> <seq_expr> .
Alt ::= "\(" <alt_expr> "\)" .
seq_expr = ( <rule_elem> <seq_expr> | <rule_elem> ) .
alt_expr = ( <rule_elem> "|" <alt_expr> | <Seq> "|" <alt_expr> | <Seq> | <rule_elem> ) .
rule_elem = ( <Rep> | <rule_elem_atom> ) .
rule_elem_atom = ( <EOF> | <epsilon> | <T> | <NT> | <RPL> | <RE> | <Alt> | <Prefix> | <Postfix> ) .
_start = <Grammar> .
Grammar ::= <_loop> .
_loop = ( EOF | <rule> <_loop> ) .
EOF ::= "EOF" .
epsilon ::= "epsilon" .
_whitespace = /\([\ \r\n\t]|#[^\r\n]*[\r\n]+\)+/ .
Rep1N ::= <rule_elem_atom> "+" .
Rep0N ::= <rule_elem_atom> "*" .
Rep01 ::= <rule_elem_atom> "?" .
Rep = ( <Rep1N> | <Rep0N> | <Rep01> ) .

Tinyap features two other dialects to describe grammars which feature minor differences from the basic explicit dialect. You can look at their full description in explicit dialect by running

 tinyap -g dialectName -pg 
, where dialectName is one of :

Below is a condensed grammar to parse arithmetic expressions. It will be used in the following examples :

#number ::= /[0-9]+(\.[0-9]*)?/.
number ::= /[0-9]+/.

m_expr ::= <expr4>.

m_minus ::= "-" <number>.
m_div ::= ( <m_div> "/" <expr0> | <expr0> "/" <expr0> ).
m_mul ::= ( <m_mul> "*" <expr1> | <expr1> "*" <expr1> ).
m_sub ::= ( <m_sub> "-" <expr2> | <expr2> "-" <expr2> ).
m_add ::= ( <m_add> "+" <expr3> | <expr3> "+" <expr3> ).

expr0 = ( <m_minus> | <number> | "(" <m_expr> ")" ).
expr1 = ( <m_div> | <expr0> ).
expr2 = ( <m_mul> | <expr1> ).
expr3 = ( <m_sub> | <expr2> ).
expr4 = ( <m_add> | <expr3> ).

_expr = <m_expr> ";".

_start = ( EOF
         | <_expr> <_start>
         ).

Notice that number, m_expr, m_mul, m_sub, m_add, m_div are operator rules. These are the labels we'll have in the nodes of the parse results.

III. Usage

“First shalt thou take out the Holy Command Line, then shalt thou count to three, no more, no less.
Three shall be the number thou shalt count, and the number of the counting shall be three.”

When invoked, tinyap executes each of its command-line arguments from left to right.

Examples : (assuming you're in ./examples/)

IV. (Un)Serialization of ASTs

Tinyap makes it easy writing and reading ASTs to and from files and buffers. Serialized ASTs can be directly used as grammars :
 $ tinyap -i math.gram -p -o math.gram.srlz                 # parse and serialize math.gram
 $ tinyap -g math.gram.srlz -i test2.math -p -w tinycalc    # using math.gram, parse test2.math and evaluate the operations 

The C API provides more means to serialize and unserialize ASTs to/from files and buffers. Such interactions are pointless in command line.

V. These apes are made for walking...

“...And that's just what they'll do : one of these days these apes are gonna walk 'ver the output.”
Tinyap provides an interface to easily write AST evaluators (hence the name, tinyap evaluator -> tinyape -> ape).

An evaluator has to provide four functions :

Visit methods take the current node as an argument and return the direction for the next step in walk. Directions can be :

Each method of an evaluator must be named as follows :

ape_ evaluator_name _ method

method is init, term, result, default, or an operator label.

Tinyap handles the actual walking depending on evaluator's direction returns, and invokes the proper evaluator's visit method at each step.

Here is the code for the pretty-printer (invoke it in command line with "-w prettyprint") :

/* Tinya(J)P : this is not yet another (Java) parser.
 * Copyright (C) 2007 Damien Leroux
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */

#include "config.h"
#include "tinyap.h"
#include "pilot_manager.h"
#include "walker.h"


void* ape_prettyprint_init(void* init_data) { return NULL; }

void apt_prettyprint_free(void*data) {}

void prettyprint_node_header(wast_t father, wast_t node) {
        char c[4]={' ',' ',0,0};
        if(!father) {
                return;
        }
        if(!node) {
                c[1]='-';
                node=father;
                father=wa_father(node);
        }
        if(father) {
                prettyprint_node_header(wa_father(father), father);
                if(wa_opd(father,wa_opd_count(father)-1)!=node) {
                        c[0] = '|';
                } else {
                        if(c[1]=='-') {
                                c[0] = '`';
                        }
                }
                fputs(c,stdout);
        }
}



WalkDirection ape_prettyprint_default(wast_t node, void*___) {
        prettyprint_node_header(node,NULL);
        puts(wa_op(node));

        if(wa_opd_count(node)) {
                return Down;
        } else {
                return Next;
        }
}

And here is the code for the tiny calculator "tinycalc" (seen in the examples above) :

/* Tinya(J)P : this is not yet another (Java) parser.
 * Copyright (C) 2007 Damien Leroux
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */

#include "../src/tinyape.h"
#include <stdlib.h>
#include <stdio.h>

typedef struct _tc_struc {
        int result;
        int is_toplevel;
}* tc_t;

void* ape_tinycalc_init(void* init_data) {
        tc_t tc = (tc_t) malloc(sizeof(struct _tc_struc));

        tc->is_toplevel = init_data==NULL;

        return tc;
}

void* ape_tinycalc_result(tc_t i) {
        if(i->is_toplevel) {
                printf("result = %i\n",i->result);
        }
        return (void*)i->result;
}

void ape_tinycalc_free(tc_t data) {
        free(data);
}

WalkDirection ape_tinycalc_default(wast_t node, tc_t this) {
        printf("unknown node %s\n",wa_op(node));
        return Done;
}

WalkDirection ape_tinycalc_number(wast_t node, tc_t this) {
        this->result = atoi(wa_op(wa_opd(node,0)));
        return Done;
}

WalkDirection ape_tinycalc_m_add(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a + b;
        return Done;
}

WalkDirection ape_tinycalc_m_sub(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a - b;
        return Done;
}

WalkDirection ape_tinycalc_m_mul(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a * b;
        return Done;
}

WalkDirection ape_tinycalc_m_div(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this);
        this->result = a / b;
        return Done;
}

WalkDirection ape_tinycalc_m_minus(wast_t node, tc_t this) {
        int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this);
        this->result = - a;
        return Done;
}

WalkDirection ape_tinycalc_m_expr(wast_t node, tc_t this) {
        return Down;
}

VI. Unparsing

With using a grammar and an AST, create back the text buffer that generated this AST when parsed with this grammar.

This can be useful for data (un)serialization, as well as syntax highlighting and source code reformatting.

Tinyap recognizes the special pseudo-elements Space, Indent, Dedent, and NewLine. These elements are ignored when parsing text, and used to perform indentation when unparsing an AST. A special grammar rule indent may define the indentation string. Tinyap defaults to the tab character if this rule is not defined. Space is used to force a white space between similar tokens, or to ensure formatting. Many Space elements in a row will insert only one space.

Note:
Note that the grammar used to unparse is not required to be the same grammar used for parsing. It is only required to define all the rules that match the AST labels.

Since tinyap doesn't recover from parse errors yet, the unparsing feature can't be trivially used to format source code while it is edited. This should be handled in the next release (Help wanted ! I haven't defined yet how to behave when encountering errors, since I'm busy by now it may take some time before tinyap is able to handle errors, unless people come to help with this feature).

Detailed pseudo-elements behaviour :

Quick output formatting how-to :

It is quite straightforward to include formatting elements in a grammar, for instance to add color or typefaces to the output text, without interfering with parsing input text.

Using terminals and the ? operator, one can define optional elements that will be taken into account only at unparsing time. Example :

        _start = foobar.
        foobar ::= "<em>"? /(foo|bar|baz)+/ "</em>"?.
The text "foobarbazfoo" will be recognized by this rule, because the ? operators allow the terminals to be omitted. The corresponding AST (foobar foobarbazfoo) will be unparsed to "<em>foobarbazfoo</em>" because unparsing always tries to produce optional elements.

Moreover, it is easy to define configurable formatters :

        _start = foobar.
        foo_prefix = ( "<em>" ).
        foo_suffix = ( "</em>" ).
        foobar ::= foo_prefix? /(foo|bar|baz)+/ foo_suffix?.

Prefix and Suffix are defined as pluggable rules. Now, just plug a new formatting terminal in each rule. Unparsing will always use the last formatting terminal plugged in.

For instance :

VII. C API

See also:
tinyap.h

tinyape.h


Generated on Sun May 11 15:04:49 2008 for TINYAP by  doxygen 1.5.3