grammar

**TODO:** Document this code. The repository is located at https://github.com/asaparov/grammar.

This code implements the generative model of grammar as described in this paper. Given a logical form, the grammar generates a derivation tree top-down, by selecting production rules probabilistically conditioned on the logical form. The leaves of the derivation tree form the tokens of the output utterance.

If you use this code in your research, please cite:Abulhair Saparov, Vijay Saraswat, and Tom M. Mitchell. 2017. A Probabilistic Generative Grammar for Semantic Parsing. In

Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017).

To use the code, simply download the files into a folder named "grammar". Add this folder to the include path for the compiler, for example by using the `-I`

flag.

This library depends on core, math, and hdp. The code makes use of `C++11`

and is regularly tested with `gcc 9`

but I have previously compiled it with `gcc 4.8`

, `clang 4.0`

, and `Microsoft Visual C++ 14.0 (2015)`

. The code is intended to be platform-independent, so please create an issue if there are any compilation bugs.

`grammar.h`

contains the definitions of the basic data structures, such as production rules in `struct rule`

, nonterminals in `struct nonterminal`

, the grammar in `struct grammar`

, and the nodes of derivation trees in `struct syntax_node`

. These data structures are defined as template types, where the logical form type is given by the `Semantics`

typename parameter, and the distribution over production rules at each nonterminal is given by the `RuleDistribution`

template parameter.

`struct null_semantics`

in `grammar.h`

is an example of a `Semantics`

type where the logical forms are empty. `pcfg_grammar.h`

implements a `RuleDistribution`

type where the distribution over production rules does not depend on the logical form. The test program in `pcfg_induction.cpp`

uses this rule distribution along with `null_semantics`

to effectively perform grammar induction on a probabilistic context-free grammar.

`hdp_grammar.h`

defines `struct hdp_rule_distribution`

which provides another example of a `RuleDistribution`

type, where the distribution over production rules is given by a hierarchical Dirichlet process (HDP).

The parser and Metropolis-Hastings sampler (for training) are implemented in `parser.h`

. Since the structure of the two algorithms are so similar, they are implemented as a single algorithm, where the `Mode`

template parameter determines whether the algorithm is parsing or sampling. The function `parse_result parse(...)`

contains the main loop of this algorithm, where at every iteration, the search state is popped from the priority queue (agenda) and processed according to its type. This processing may add new search states to the priority queue.

`morphology.h`

implements a simple model of word morphology, where each word is modeled as an inflected root. The root may be inflected according to grammatical number, aspect, and tense.

`pcfg_induction`

creates an instance of the grammar where the logical forms are empty. So the grammar is a probabilistic context-free grammar and is purely syntactic. The program then trains the grammar on a handful of examples created from the reference grammar:`E -> 0 E 1`

,`E -> 0 1`

,`E -> E E`

. The program is defined in`pcfg_induction.cpp`

and can be compiled using`make pcfg_induction`

.`math_induction`

learns a grammar where the logical forms are simple arithmetic expressions such as`1 + (1 * 0) / 1`

. These logical expressions are modeled as simple binary trees, as defined in`tree_semantics.h`

. This program uses an HDP to model the distribution over production rules at each nonterminal (`struct hdp_rule_distribution`

in`hdp_grammar.h`

). This program is located in`math_induction.cpp`

and can be compiled using`make math_induction`

.