UCLA Prize Pack release of a parser & associated tools for Minimalist Grammars author: John T. Hale version: "timonium" (summer 2003) requirements: should run on anything Ocaml runs on. Tested with Ocaml 3.06 on Mac OS X, linux and AIX. This is the README file for a parsing system for Minimalist Grammars (MGs) I developed in connection with my dissertation "Grammar, Uncertainty and Sentence Processing." To follow up a particular psycholinguistic hypothesis, I needed to be able to find parses for strings generated by MGs. I also wanted to weight these parses, and calculate the entropy of probabilistic sublanguages constrained to have a given initial string. Since my sentences were long, I put in some effort into making the parser run fast. I am very grateful for help on this project from Ed Stabler (UCLA) and Scott Smith (JHU). Maxime Amblard contributed a tree-drawing function that uses the AT&T dot program as a backend. At all stages, I was guided by Henk Harkema's excellent dissertation, "Parsing Minimalist Grammars" that lays out the design of the CKY parser for MGs. At present, it is my intention to distribute this software only to colleagues at UCLA. If you use this software in your work, please let me know, and include an acknowledgement of my dissertation. I would also be grateful to hear about any bugs found! FILES Makefile used by make(1) to build bytecode modules and native "runme" program utilities.ml useful goodies amblard.ml Maxime Amblard's contributed tree-drawing function feature.ml data type for MG features item.ml functor for parser items; C185 is the item type that uses MG features grammar.ml functor for grammars; C185 is the grammar type that uses MG features unify.ml unification in ML from Baader and Nipkow "Term Rewriting and all that" mg-rules.ml source file for MG parsing rules used in unify chart.ml data structure for recognized constituents indexQueue.ml data structure for unfinished items lexerprolog.mll ocamllex source file to read in prolog-syntax MGs parserprolog.mly ocamlyacc source file to read in prolog-syntax MGs main.ml parser entry point, usable in the toplevel runme.ml main file for a compiled version that estimates a derivation-PCFG from weighted training data and computes entropy values for prefixes of test sentences thesis-graphs.nb Mathematica notebook with graphs of values computed by the parser INSTALLATION 1. Obtain and install the latest distribution of Objective Caml from http://caml.inria.fr mindful of caveats like the ones at http://caml.inria.fr/caml-macosx-howto/index.html 2. unpack the distribution using the tar command tar xzf prize-pack.tar.gz 3. If you'd like to use the parser from the toplevel, link in the Unix module as directed in chapter 21 of the Ocaml manual, e.g: ocamlmktop -o mytop unix.cma ./mytop USE: interactive parsing in the toplevel 1. ensure that the bytecode files (.cmo) exist by issuing % make bcodes from the shell in the directory you unpacked in. 2. run ocaml (perhaps using emacs inferior caml mode: meta-X run-caml) 3. issue #use "main.ml" at the Caml hash mark prompt. you should see a long list of values defined including the function val parse : string -> string -> Chart.DerivationSet.t = if, instead, you see a bunch of "Cannot find file blah.cmo" and "Unbound constructor" make sure step 1 was successful and the bytecode files were actually compiled. now you can parse a sentence -- # parse "tests/larsonian.pl" "John love -s a woman";; accepted as category c in 0.32 seconds with 0.07 seconds for initialization, chart size: 106 - : Chart.DerivationSet.t = the two arguments to the parse function are the filename of the MG to be used, and a space-delimited string of morphemes. the result type is a set of derivation trees for each analysis of the given string. these can be viewed using Amblard's tree-drawing function -- # #use "amblard.ml";; val ajout2 : Chart.DerivationSet.t -> string = val fichier : string -> Chart.DerivationSet.t -> unit = # let result = parse "tests/larsonian.pl" "John love -s a woman";; accepted as category c in 0.30 seconds with 0.09 seconds for initialization, chart size: 106 val result : Chart.DerivationSet.t = # fichier "johnloves.dot" result;; - : unit = () fichier means "file" in French; it writes out a dot script which can be converted to postscript by running, e.g. % dot -Tps johnloves.dot -o johnloves.ps the Larsonian grammar discussed in the thesis is included in the "tests" subdirectory. These grammars use the same prolog term syntax as the prolog MG parses do. # let result = parse "tests/larsonian.pl" "John love -s a child";; rejected as category c in 0.07 seconds with 0.10 seconds for initialization, chart size: 38 val result : Chart.DerivationSet.t = # Chart.DerivationSet.elements result;; - : Chart.DerivationSet.elt list = [] The grammar as provided doesn't have any lexical entries to cover the word "child." To add one, write its phonological features in square brackets, followed by two colons, followed by a square-bracketed, comma separated list of syntactic features e.g. [child]::[n]. a period ends each lexical entry. With this modification, "child" is now part of the Larsonian grammar, and strings including it are recognized. # let result = parse "tests/larsonian.pl" "John love -s a child";; accepted as category c in 0.17 seconds with 0.11 seconds for initialization, chart size: 69 val result : Chart.DerivationSet.t = # Chart.DerivationSet.elements result;; - : Chart.DerivationSet.elt list = [Chart.Binary ("c", "::=t c", "t", Chart.Lexical "::=t c", Chart.Unary ("t", "+case t,-case", Chart.Binary ("+case t,-case", "::=>little_v +case t", "little_v,-case", Chart.Lexical "::=>little_v +case t", Chart.Binary ("little_v,-case", "=d little_v", "::d -case", Chart.Binary ("=d little_v", "::=>v =d little_v", "v", Chart.Lexical "::=>v =d little_v", Chart.Unary ("v", "+case v,-case", Chart.Binary ("+case v,-case", "::=d +case v", "d -case", Chart.Lexical "::=d +case v", Chart.Binary ("d -case", "::=Num d -case", "Num", Chart.Lexical "::=Num d -case", Chart.Binary ("Num", "::=n Num", "::n", Chart.Lexical "::=n Num", Chart.Lexical "::n"))))), Chart.Lexical "::d -case"))))] USE: batch processing with a compiled parser The "runme" file is intended to be compiled and used as a standalone executable; it doesn't have a user interface, relying on data values listed in the file itself. Because of the generality of the parsing inferences, the parser uses unification, and failed unifications constitute the major impediment to greater speed. To head off this problem, I implemented a two pass system whereby the derivation tree branches used to cover a certain corpus restrict the parser's search analyzing new sentences. This is the "training" phrase. These branches constitute a probabilistic context-free phrase structure grammar that can associate a probability with each analysis of a string. 1. edit the file runme.ml, say, by uncommenting the lines at the end that set "obs" (derivation tree branch observations) to come from the training set ah_train, and call the testing function that prints out the log probability of each derivation. 1003: let obs = train "tests/larsonian.pl" ah_train in 1016: test_logprob bumpy "tests/larsonian.pl" ah_test;; 2. build the executable by issuing % make runme at the shell prompt. this should invoke the Ocaml native code compiler. 3. run it! % ./runme > runme.out It could take a while to parse the entire training set and come up with a PCFG. Progress is marked by accept/reject messages for each sentence. Once the observations are obtained, they can be serialized using the Marshal module (* result has type ((string * string list) * float) list *) let oc = open_out_bin "chomskyan-obs.marshal" in Marshal.to_channel oc (get_observations "tests/chomskyan.pl" weightedah) []; close_out oc;; and read back in let get_obs () = let ic = open_in_bin "chomskyan-obs.marshal" in let result = ((Marshal.from_channel ic) : (((string * string list) * float) list) ) in close_in ic; result;; and in fact there's a utility function in Utilities called printout_guards which, given some observations of this kind, will spit out ML source code for the "guard" functions observedbinary and observedunary that save the parser from considering unattested derivation tree branches. In main.ml these are just set to "true" -- considering everything.