This page explains how to fetch, install, run, and use tinyap.
Basically, tinyap is a recursive descent parser with backup. Thus, it's able to recognize any LL(k) language. When a parse is successful, tinyap outputs an Abstract Syntax Tree (AST) that represents the input text in a structured manner.
Unlike most parsers, its code isn't factory-calibrated for a particular grammar. Instead, the grammar is data. Tinyap uses an AST that represents a grammar to parse its input text. This grammar AST is structured according to tinyap's grammar description language.
The factory default for the grammar AST is the grammar description language itself. So one can parse a grammar description and directly use the parse output to parse some other text written in the described language. Tinyap also features a plugin mechanism for grammars, which allows for dynamic modular grammars. The ASTs tinyap produces can be serialized to and deserialized from buffers and files.
Finally, tinyap provides an interface to walk down the ASTs and makes it easy to write external plugins to visit the nodes, e.g. a compiler, an interpreter, a static evaluator, a renderer...
You can have a look at an extensive use of tinyap at http://code.google.com/p/tinyaml where tinyap is used to parse compile and extend a base meta-language.
$ wget http://tinyap.googlecode.com/files/tinyap-1.4-0.tar.gz
$ tar xzf tinyap-1.4-0.tar.gz
$ svn checkout http://tinyap.googlecode.com/svn/trunk/ tinyap-read-only
Now you have your tinyap source distribution at hands, build it.
$ cd tinyap
$ CFLAGS=-O3 ./configure -C --prefix=/my/install/prefix/if/not/slash/usr
$ make all
$ make install
The way the parser produces its output is driven by the grammar it uses. A grammar is a set of production rules, each rule associates a symbolic name and one or more elements that the rule will produce. Each element successfully parsed is converted in an AST node and the full AST is built after the recursive descent. When a rule successfully produces its elements, it can evaluate to a new operator node and enclose its produced nodes (these rules are called operator rules), or evaluate to its produced nodes silently (these rules are called transient rules). Transient rules are useful to implement loops and split rules.
The parser distinguishes information from garbage. A special rule defines the garbage characters to strip from the input between two elements. Terminal strings don't contain information and are not included in the output.
Elements
<non-terminal>
: the parser will try to produce the rule named "non-terminal", and append the result to output."terminal"
: the parser will try to produce the string "terminal" at its current position, and discard it from output./reg-exp/
: the parser will search for a match at its current position, and append the match string to output.//reg-exp/replacement/
: same as above, but the match will be replaced by the replacement
string. Numbers from 0 to 9 prefixed by a backslash \ reference submatches (and \0 is the whole match).Expressions
expression expression...
( expression | expression | ... )
[ expression ] <non-terminal>
{ expression } <non-terminal>
expression* # match expression zero or more times (behaves like expr_loop = ( <expression> <expr_loop> | epsilon ). ) expression? # match expression zero or one time (behaves like opt_expr = ( <expression> | epsilon ). ) expression+ # match expression one or more times (behaves like expr = <expression> <expr_loop>. )
Rules
symbol ::= expression .
symbol = expression .
Special symbols
epsilon
: produce the empty string (never fails)EOF
: produce the end of fileSpecial constructs
lefty = ( <lefty> <A> | <B> ).
plugin_rule = ( <A> | <B> | ... ).
Here is the full grammar description language expressed with itself :
elem = /[_a-zA-Z][0-9a-zA-Z_]*/ . T ::= //\"\(\([^\"\\]|[\\][\"\ trnb]\)*\)\"/\\1/ . NT ::= "<" <elem> ">" . re_re = /\([^\\/]|[\\][][\\/\ <>trnb\"]\)*/ . RE ::= "/" <re_re> "/" . RPL ::= "//" <re_re> "/" /\([^\r\n\\\\/]+|\\\\[\\\\/0-9]\)+/ "/" . rule = ( <OperatorRule> | <TransientRule> ) . OperatorRule ::= <elem> "::=" <rule_expr> "." . TransientRule ::= <elem> "=" <rule_expr> "." . rule_expr = ( <Alt> | <Seq> | <rule_elem> ) . Prefix ::= "[" <rule_expr> "]" <rule_elem_atom> . Postfix ::= "{" <rule_expr> "}" <rule_elem_atom> . Seq ::= <rule_elem> <seq_expr> . Alt ::= "\(" <alt_expr> "\)" . seq_expr = ( <rule_elem> <seq_expr> | <rule_elem> ) . alt_expr = ( <rule_elem> "|" <alt_expr> | <Seq> "|" <alt_expr> | <Seq> | <rule_elem> ) . rule_elem = ( <Rep> | <rule_elem_atom> ) . rule_elem_atom = ( <EOF> | <epsilon> | <T> | <NT> | <RPL> | <RE> | <Alt> | <Prefix> | <Postfix> ) . _start = <Grammar> . Grammar ::= <_loop> . _loop = ( EOF | <rule> <_loop> ) . EOF ::= "EOF" . epsilon ::= "epsilon" . _whitespace = /\([\ \r\n\t]|#[^\r\n]*[\r\n]+\)+/ . Rep1N ::= <rule_elem_atom> "+" . Rep0N ::= <rule_elem_atom> "*" . Rep01 ::= <rule_elem_atom> "?" . Rep = ( <Rep1N> | <Rep0N> | <Rep01> ) .
Tinyap features two other dialects to describe grammars which feature minor differences from the basic explicit
dialect. You can look at their full description in explicit
dialect by running
tinyap -g dialectName -pg
dialectName
is one of :epsilon
and EOF
symbols both get a leading _
. i.e. the rule : foobar ::= "foo" epsilon <bar>+ EOF. becomes foobar ::= "foo" _epsilon bar+ _EOF.
Below is a condensed grammar to parse arithmetic expressions. It will be used in the following examples :
#number ::= /[0-9]+(\.[0-9]*)?/. number ::= /[0-9]+/. m_expr ::= <expr4>. m_minus ::= "-" <number>. m_div ::= ( <m_div> "/" <expr0> | <expr0> "/" <expr0> ). m_mul ::= ( <m_mul> "*" <expr1> | <expr1> "*" <expr1> ). m_sub ::= ( <m_sub> "-" <expr2> | <expr2> "-" <expr2> ). m_add ::= ( <m_add> "+" <expr3> | <expr3> "+" <expr3> ). expr0 = ( <m_minus> | <number> | "(" <m_expr> ")" ). expr1 = ( <m_div> | <expr0> ). expr2 = ( <m_mul> | <expr1> ). expr3 = ( <m_sub> | <expr2> ). expr4 = ( <m_add> | <expr3> ). _expr = <m_expr> ";". _start = ( EOF | <_expr> <_start> ).
When invoked, tinyap executes each of its command-line arguments from left to right.
--help, -h [name]
--grammar, -g [file name]
explicit
or CamelCasing
, the according grammar dialect is used. Otherwise, the named file must contain a serialized "Grammar" AST.--pring-grammar, -pg
--input, -i [file name]
-
sets input to stdin.--output, -o [file name]
-
sets output to stdout.--parse, -p
--parse-as-grammar, -pag
--walk, -w [name]
Examples : (assuming you're in ./examples/)
$ tinyap -h
$ tinyap -pg
$ tinyap -i math.gram -pag
$ tinyap -i math.gram -p -w prettyprint
$ tinyap -i math.gram -pag -i test.math -p
$ gcc -shared ape_tinycalc.o -o libape_tinycalc.so # you may need to build the shared object $ LD_LIBRARY_PATH=. tinyap -i math.gram -pag -i test2.math -p -w tinycalc
$ tinyap -i math.gram -p -o math.gram.srlz # parse and serialize math.gram $ tinyap -g math.gram.srlz -i test2.math -p -w tinycalc # using math.gram, parse test2.math and evaluate the operations
The C API provides more means to serialize and unserialize ASTs to/from files and buffers. Such interactions are pointless in command line.
An evaluator has to provide four functions :
void*
and has to initialize the ape.Visit methods take the current node as an argument and return the direction for the next step in walk. Directions can be :
Each method of an evaluator must be named as follows :
ape_ evaluator_name _ method
method is init
, term
, result
, default
, or an operator label.
Tinyap handles the actual walking depending on evaluator's direction returns, and invokes the proper evaluator's visit method at each step.
Here is the code for the pretty-printer (invoke it in command line with "-w prettyprint") :
/* Tinya(J)P : this is not yet another (Java) parser. * Copyright (C) 2007 Damien Leroux * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ #include "config.h" #include "tinyap.h" #include "pilot_manager.h" #include "walker.h" void* ape_prettyprint_init(void* init_data) { return NULL; } void apt_prettyprint_free(void*data) {} void prettyprint_node_header(wast_t father, wast_t node) { char c[4]={' ',' ',0,0}; if(!father) { return; } if(!node) { c[1]='-'; node=father; father=wa_father(node); } if(father) { prettyprint_node_header(wa_father(father), father); if(wa_opd(father,wa_opd_count(father)-1)!=node) { c[0] = '|'; } else { if(c[1]=='-') { c[0] = '`'; } } fputs(c,stdout); } } WalkDirection ape_prettyprint_default(wast_t node, void*___) { prettyprint_node_header(node,NULL); puts(wa_op(node)); if(wa_opd_count(node)) { return Down; } else { return Next; } }
And here is the code for the tiny calculator "tinycalc" (seen in the examples above) :
/* Tinya(J)P : this is not yet another (Java) parser. * Copyright (C) 2007 Damien Leroux * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ #include "../src/tinyape.h" #include <stdlib.h> #include <stdio.h> typedef struct _tc_struc { int result; int is_toplevel; }* tc_t; void* ape_tinycalc_init(void* init_data) { tc_t tc = (tc_t) malloc(sizeof(struct _tc_struc)); tc->is_toplevel = init_data==NULL; return tc; } void* ape_tinycalc_result(tc_t i) { if(i->is_toplevel) { printf("result = %i\n",i->result); } return (void*)i->result; } void ape_tinycalc_free(tc_t data) { free(data); } WalkDirection ape_tinycalc_default(wast_t node, tc_t this) { printf("unknown node %s\n",wa_op(node)); return Done; } WalkDirection ape_tinycalc_number(wast_t node, tc_t this) { this->result = atoi(wa_op(wa_opd(node,0))); return Done; } WalkDirection ape_tinycalc_m_add(wast_t node, tc_t this) { int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this); int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this); this->result = a + b; return Done; } WalkDirection ape_tinycalc_m_sub(wast_t node, tc_t this) { int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this); int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this); this->result = a - b; return Done; } WalkDirection ape_tinycalc_m_mul(wast_t node, tc_t this) { int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this); int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this); this->result = a * b; return Done; } WalkDirection ape_tinycalc_m_div(wast_t node, tc_t this) { int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this); int b = (int)tinyap_walk(wa_opd(node,1),"tinycalc",this); this->result = a / b; return Done; } WalkDirection ape_tinycalc_m_minus(wast_t node, tc_t this) { int a = (int)tinyap_walk(wa_opd(node,0),"tinycalc",this); this->result = - a; return Done; } WalkDirection ape_tinycalc_m_expr(wast_t node, tc_t this) { return Down; }
This can be useful for data (un)serialization, as well as syntax highlighting and source code reformatting.
Tinyap recognizes the special pseudo-elements Space
, Indent
, Dedent
, and NewLine
. These elements are ignored when parsing text, and used to perform indentation when unparsing an AST. A special grammar rule indent may define the indentation string. Tinyap defaults to the tab character if this rule is not defined. Space is used to force a white space between similar tokens, or to ensure formatting. Many Space elements in a row will insert only one space.
Since tinyap doesn't recover from parse errors yet, the unparsing feature can't be trivially used to format source code while it is edited. This should be handled in the next release (Help wanted ! I haven't defined yet how to behave when encountering errors, since I'm busy by now it may take some time before tinyap is able to handle errors, unless people come to help with this feature).
Space
: insert one space into output text. More than one Space
elements in a row will produce one single space character in the output text.NewLine
: insert a NewLine ('_indent
strings as the current indent level.Indent
: increment indent level and insert one _indent
string into the output text.Dedent
: decrement indent level and remove the last _indent
string from output text.Quick output formatting how-to :
It is quite straightforward to include formatting elements in a grammar, for instance to add color or typefaces to the output text, without interfering with parsing input text.
Using terminals and the ? operator, one can define optional elements that will be taken into account only at unparsing time. Example :
_start = foobar. foobar ::= "<em>"? /(foo|bar|baz)+/ "</em>"?.
? operators allow the terminals to be omitted. The corresponding AST (foobar foobarbazfoo) will be unparsed to "<em>foobarbazfoo</em>" because unparsing always tries to produce optional elements.Moreover, it is easy to define configurable formatters :
_start = foobar. foo_prefix = ( "<em>" ). foo_suffix = ( "</em>" ). foobar ::= foo_prefix? /(foo|bar|baz)+/ foo_suffix?.
Prefix and Suffix are defined as pluggable rules. Now, just plug a new formatting terminal in each rule. Unparsing will always use the last formatting terminal plugged in.
For instance :
foo_prefix
.foo_suffix
.