MorphologicalAnalysisTutorial  
A self-contained tutorial for building morphological analyzers.

Author: Mans Hulden (2011)
Rescued from now-defunct Google code page

Morphological analysis with FSTs

The following is a brief and basic tutorial on how to construct a morphological analyzer for a language using finite-state techniques. A small toy grammar of English noun and verb inflection is built step-by-step to illustrate overall design issues. While the grammar is small, much larger grammars can be built using the same design principles. Basic familiarity with regular expressions and foma is assumed, such as is outlined in the Getting started with foma page.

Definition

Since a "morphological analyzer" could mean any number of things, let's first settle on a task description and define what the morphological analyzer is supposed to accomplish. In this implementation, a morphological analyzer is taken to be a black box (as in Fig. 1), which happens to be implemented as a finite-state transducer, that translates word forms (such as runs) into a string that represents its morphological makeup, such as run+V+3p+Sg: a verb in the third person singular present tense.

Naturally, if the word form is ambiguous (as runs is), the job of the analyzer is to output all tag sequences consistent with the grammar and the input word. In the above example, the transducer should perhaps also output run+N+Pl, or some similar sequence to convey the possibility of a noun reading of runs. Since finite-state transducers are inherently bidirectional devices, i.e. we can run a transducer in the inverse direction as well as the forward direction, the same FST, once we've built it, can serve both as a generator and an analyzer. The standard practice is to build morphological transducers so that the input (or domain) side is the analysis side, and the output (or range) side contains the word forms.
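
As a minimal sketch of this bidirectionality, consider a toy one-word transducer defined directly as a foma regular expression (illustrative only, not part of the grammar built below):

# a toy transducer: analysis string on the input side, word form on the output side
regex {run} "+V" "+3P" "+Sg" .x. {runs} ;
# apply down run+V+3P+Sg   now yields runs           (generation)
# apply up   runs          now yields run+V+3P+Sg    (analysis)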

In real life, morphological analyzers tend to provide much more detailed information than this. Figure 2 shows a more elaborate analyzer's output for Basque with the input word maiatzaren, together with an illustration of how a feature matrix can be recovered from the string output of the analyzer.

The goal, then, is to build a finite-state transducer that accomplishes this string-to-string mapping of analyses to surface forms and vice versa.

Design

The construction of the final transducer will be broken down into two large components:

  • A lexicon/morphotactics part
  • A phonological/morphophonological/orthographic part

The lexicon

The first component, which we call the lexicon component, will be a transducer that:

  • Accepts as input only the valid stems/lemmas of the language, each followed only by a legal sequence of tags.
  • Produces as output an intermediate form, where the tags are replaced by the morphemes they correspond to.
  • May produce additional symbols in the output, such as special symbols that serve to mark the presence of morpheme boundaries.

For example, in the analyzer about to be constructed, the lexicon component FST will perform the following mappings:

c a t +N +Pl      w a t c h +N +Pl      w a t c h +V +3P +Sg     (input side)
c a t ^  s        w a t c h ^  s        w a t c h ^  s           (output side)

There are two things to note here. The first is that we are using the symbol ^ to mark a morpheme boundary. The second is that while each letter in the stem is represented by its own symbol (w, a, t, c, h, etc.), each complete tag is one separate symbol, a multicharacter symbol (+N, +Pl, etc.). The spaces above show the symbol boundaries to illustrate this. Figure 3 shows what a lexicon transducer that only encoded these three words would look like. Naturally, we will have some more features and a larger lexicon in what is described below.

The part that accomplishes this, the lexicon transducer, will be written in a formalism called lexc. While it is possible to construct the lexicon transducer through regular expressions in foma, the lexc-formalism is more suited for lexicon construction and expressing morphotactics.
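
For comparison, the three mappings above could in principle be written directly as a regular expression in foma; the following sketch (with the illustrative name ToyLexicon) encodes exactly those three pairs:

define ToyLexicon [ {cat} | {watch} ] [ "+N" "+Pl" .x. "^" s ]
                | {watch} [ "+V" "+3P" "+Sg" .x. "^" s ] ;

Even at this size, the regular expression is harder to read and extend than the corresponding lexc declarations, which is why the lexicon is written in lexc below.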

Alternation rules

The role of the alternation rules is to modify the output of the lexicon transducer according to orthographic, phonological, and morphophonological rules and conventions. So far, for example, we've assumed that English nouns can be pluralized by concatenating the morpheme s with the stem (cat → cats). However, some nouns, those where the stem ends in a sibilant, take the allomorph es (watch → watches). A way to describe the process of forming correct nouns is to always represent the plural as the morpheme s, as we have, and then subject these word forms to alternation rules that insert an e only if the stem ends in a sibilant. This, among other things, is the task of the alternation rules component: to produce the valid surface forms from the intermediate forms output by the lexicon transducer.

Since alternation rule FSTs that are conditioned by their environment are very difficult to construct by hand, we use the replacement rules formalism in foma to compile the necessary rules into FSTs.

After the necessary replacement rules have been built, we combine the lexicon FST and the various FSTs that encode such alternations as e-insertion into one large transducer that acts like a cascade: a single transducer that has the same effect as feeding an input word to the lexicon transducer, taking its output and feeding it into the first rule transducer, taking that output and feeding it into the next rule transducer, and so on.

This is accomplished by the regular expression composition operator (.o.). Suppose we have the lexicon transducer in an FST we've named Lexicon, and the various alternation rules as FSTs named Rule1, ..., RuleN (in reality we might want to use more perspicuous names). Now, we can issue the regular expression:

Lexicon .o. Rule1 .o. Rule2 .o. ... .o. RuleN ;

and produce a single transducer that is the composite of the different rule transducers and the lexicon transducer.

This idea is illustrated in figure 4.

This is indeed one of the main reasons to use finite-state transducers for morphological analysis: the fact that we can describe the word-formation rules in the direction of generation, but in fact use the final FST in the other direction. It would be trivial to write little programs (in one's programming language of choice) to perform the lexicon mapping and the alternation rules mappings, each little program's output being fed as the next program's input, and so provide a model for generation. But with such a model, we generally have no way to perform analysis. The fact that we can compose the individual transducers into a single one is what makes this possible.

Although we write the grammar in the direction of generation, we at the same time produce an analyzer, or parser. This is because the final product is represented as one single transducer, and transducers operate just as easily in the inverse direction. All the software has to do when parsing (applying the transducer in the inverse direction) is to follow the output labels on each transition that correspond to the word to be parsed, and output the matching input labels.

The lexicon script

As mentioned, the lexicon component, a transducer that maps analysis forms to intermediate forms, will be written in the lexc-formalism and stored in a text file, which we call english.lexc. This file can then be compiled into a transducer from the main foma interface.

The facts to be modeled

Before moving further, let's narrow down the details we want to capture in our analyzer.

We'll want to include the following noun/verb features in the grammar.

  • Nouns: singular (cat) vs. plural (cats)
  • Verbs: infinitive (watch), 3rd person singular (watches), past tense (watched), past participle (watched), and present participle (watching)

Tags

We'll also want to use the following tags for marking the above grammatical information.

  • +N for nouns
  • +V for verbs
  • +3P for third person
  • +Sg for singular forms
  • +Pl for plural forms
  • +Past for past tense
  • +PastPart for past participle
  • +PresPart for present participle

The lexc-script

The lexc-formalism operates under the simple notion of continuation classes. One declares a set of labeled lexicons, the content of those lexicons, and rules which dictate how lexicon entries are to be concatenated.

The entire lexc-script is given further below; here, we provide a line-by-line analysis of it.

First, in the lexc-formalism, we need to declare those symbols that are to be treated as single multicharacter symbols:

Multichar_Symbols +N +V +PastPart +Past +PresPart +3P +Sg +Pl

Then, we must declare a Root lexicon. The Root lexicon is where we start building a word:

LEXICON Root

Noun ;
Verb ;

Here we declared a lexicon called Root, and provided it with two entries. Both entries are empty (there is no left-hand side), and by choosing one of these, we can move to either the Noun or the Verb lexicon.

The Noun lexicon looks as follows:

LEXICON Noun

cat   Ninf;
city  Ninf;
fox   Ninf;
panic Ninf;
try   Ninf;
watch Ninf;

Here, we have six entries. All of these entries continue to the Ninf lexicon.

LEXICON Verb

beg   Vinf;
fox   Vinf;
make  Vinf;
panic Vinf;
try   Vinf;
watch Vinf;

The Verb lexicon also has six entries, and their continuation lexicon is the Vinf-lexicon for all of them.

LEXICON Ninf

+N+Sg:0   #;
+N+Pl:^s  #;

Now, the Ninf-lexicon has two entries, and unlike in the previous lexicons we've seen, both entries specify the input string and output string separately. For example, the entry +N+Sg:0 specifies that the input side contains +N+Sg and the output side the empty string. The continuation lexicon for both is #, meaning end-of-word. So, for example, if we chose from the Root lexicon to enter the Noun lexicon, and from there chose the cat entry, and in this lexicon chose the +N+Sg:0 entry, we would have constructed the input-output pairing:

c a t +N +Sg
c a t

In the Vinf lexicon we have likewise:

LEXICON Vinf

+V:0             #;
+V+3P+Sg:^s      #;
+V+Past:^ed      #;
+V+PastPart:^ed  #;
+V+PresPart:^ing #;

Here, we have a few more entries, one for each verb form. We also use the morpheme boundary symbol ^, which the replacement rules will take advantage of in their triggering contexts.

In foma we can compile the english.lexc script into a transducer by issuing the read lexc-command.

foma[0]: read lexc english.lexc
Root...2, Noun...6, Verb...6, Ninf...2, Vinf...5
Building lexicon...Determinizing...Minimizing...Done!
1.7 kB. 32 states, 46 arcs, 42 paths.
foma[1]:

Now, we've added the lexicon transducer on the stack, and can label it for subsequent use:

foma[1]: define Lexicon;
defined Lexicon: 1.7 kB. 32 states, 46 arcs, 42 paths.
foma[0]:

Figure 5 shows what this lexicon transducer actually looks like.

We can also test it against some words to see that it gives the desired intermediate forms:

  foma[0]: regex Lexicon;
  1.7 kB. 32 states, 46 arcs, 42 paths.
  foma[1]: down
  apply down> watch+V+PastPart
  watch^ed
  apply down> city+N+Pl
  city^s
  apply down>

Here is the complete lexc-file in one chunk:

!!!english.lexc!!!

Multichar_Symbols +N +V +PastPart +Past +PresPart +3P +Sg +Pl

LEXICON Root

Noun ;
Verb ;

LEXICON Noun

cat   Ninf;
city  Ninf;
fox   Ninf;
panic Ninf;
try   Ninf;
watch Ninf;

LEXICON Verb

beg   Vinf;
fox   Vinf;
make  Vinf;
panic Vinf;
try   Vinf;
watch Vinf;

LEXICON Ninf

+N+Sg:0   #;
+N+Pl:^s  #;

LEXICON Vinf

+V:0             #;
+V+3P+Sg:^s      #;
+V+Past:^ed      #;
+V+PastPart:^ed  #;
+V+PresPart:^ing #;

The alternation rules component

Having accomplished the first part of the grammar construction, we now turn to the alternation rules component. The idea is to construct a set of ordered rule transducers that modify the intermediate forms output by the lexicon component. At the very least we will need to remove the ^-symbol, which is used to mark morpheme boundaries, before we produce valid surface forms. This will in fact be the last rule transducer in the cascade. Before that rule applies, however, there are a number of rules needed for orthographic and phonological adjustment.

The facts to be modeled

e-deletion

Stems that end in a silent e drop the e with certain suffixes (ing and ed in our case). For example, make → making. This can be captured by:

define EDeletion e -> 0 || _ "^" [ i n g | e d ] ;

In other words, applying the rule to the output of the lexicon component, we get changes like:

m a k e +V +PresPart        (lexicon input)
m a k e ^  i         n g    (lexicon output)
m a k   ^  i         n g    (after e-deletion)

e-insertion

We also have an e-insertion rule: an e is inserted whenever a stem ends in a sibilant and is followed by the morpheme s (watch → watches). Sibilants can be defined orthographically by s, z, x, c h, and s h. The rule then looks like:

define EInsertion [..] -> e || s | z | x | c h | s h _ "^" s ;

Note that the above rule will work for the words we've defined in our lexicon. However, the sibilant representation in English orthography is not entirely consistent. For example, a word such as monarch would trigger the rule and incorrectly yield *monarches in the plural. One way to handle such exceptions is to rewrite the c h sequence in the lexicon component with some symbol that will not trigger e-insertion (say K), which is then rewritten back to c h after all the rules have applied.
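
A sketch of that workaround (the entry and rule name here are purely illustrative and not part of the grammar built below): in english.lexc the stem would be listed with the placeholder symbol, as in

monarK Ninf;

and a rule such as

define RestoreK K -> c h ;

would then be composed into the cascade after EInsertion to restore the ordinary spelling.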

y-replacement

The y-replacement rule, which acts in constructions such as try → tries and try → tried, is perhaps best broken into two parts. Consider the corresponding input-intermediate form pairings:

    (1)                   (2)
t r y   +V +3P +Sg    t r y +V +PastPart      (lex in)
t r y   ^  s          t r y ^  e          d   (lex out)
t r i e ^  s          t r i ^  e          d   (desired rule output)

Now, in the event that the stem ends in y and is followed by the morpheme boundary and s (1), we want to change the y into i e, whereas case (2) requires only changing the y into i. We can accomplish this by having two separate rules combined with composition, or, as is done below, as two parallel rules:

define YReplacement y -> i e || _ "^" s    ,,
                    y -> i   || _ "^" e d  ;

Note the ,, which separates two distinct parallel rules. These rules are not joined by composition, rather, they operate simultaneously and independently. For this particular case it is of no importance whether the two rules apply simultaneously or one after the other, but for other cases the output may well be different.
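
To see how parallel and composed rules can differ in general, consider the following abstract sketch (the names Composed and Parallel are illustrative only):

define Composed [ a -> b ] .o. [ b -> c ] ;
define Parallel a -> b ,, b -> c ;

Applied in the downward direction to the string ab, Composed yields cc, because the b produced by the first rule feeds the second, whereas Parallel yields bc, since both parallel rules only see the original input.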

k-insertion

Verbs that end in a c (corresponding to a phonological k) add a k before the morpheme boundary if followed by an affix beginning with a vowel, for example panic → panicking, panicked. We can identify quite accurately when the k needs to be inserted, as it is preceded by a vowel in those cases (as opposed to, say, the verb arc → arcing). The rule can then be captured by:

define KInsertion [..] -> k || C V c _ "^" V;

Of course, this rule relies on us having defined V as our set of vowel symbols (and C as our set of consonant symbols).

Note the [..]-construction on the LHS. Here, using an epsilon (0) as the left hand side will not do. The rule formalism is such that if a 0 were used on the LHS, we would insert a potentially infinite number of k-symbols. The special construction [..] forces epsilon insertions to restrict themselves to maximally one insertion per potential insertion site.
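
To make the contrast concrete, compare the following two sketch rules (the names are illustrative; V is the vowel set assumed to be defined as before):

define InsertOneK  [..] -> k || V c _ "^" ;
define InsertManyK    0 -> k || V c _ "^" ;

Applied to panic^ed, the first rule yields only panick^ed, while the second also licenses panickk^ed, panickkk^ed, and so on.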

Consonant doubling (gemination)

Consonant doubling is really the same phenomenon as the k-insertion. We double final consonants in the stem in certain environments: beg → begging, run → runnable, etc. Our lexicon contains only one such word (beg), so we can make do with a simple rule:

define ConsonantDoubling g -> g g || C V _ "^" V;

For a more thorough treatment, we'd also need to add rules for the other consonants. This, again, is a phenomenon that is not perfectly capturable without additional knowledge of the pronunciation of the word: the consonant doubling only happens when the final syllable of the stem is stressed. Compare commit → committing and edit → editing. The latter word fulfills the structural requirements of the rewrite rule (a C V sequence before the consonant to be doubled), yet the gemination isn't permitted.
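
For illustration, a more general version of the doubling rule might cover several consonants at once with parallel replacements sharing one context (a sketch only, assuming C has been defined as the set of consonants; being purely orthographic, it would still overapply to words like edit):

define ConsonantDoubling b -> b b , d -> d d , g -> g g , m -> m m ,
                         n -> n n , p -> p p , t -> t t || C V _ "^" V ;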

Auxiliary symbol removal

Naturally, as the last rule, we need to remove the auxiliary ^, whose presence has been necessary for the correct definition of the alternation rules. Once the boundary symbols have been removed, we've produced valid surface forms:

define Cleanup "^" -> 0;

Compiling the grammar

In the same script that compiles the replacement rules, we also read in the lexc-file, and label it Lexicon, so we can use that label in subsequent regular expressions.

read lexc english.lexc
define Lexicon;

Having defined and labeled all the necessary rules and the lexicon, these can be composed by:

regex Lexicon           .o.
      ConsonantDoubling .o.
      EDeletion         .o.
      EInsertion        .o.
      YReplacement      .o.
      KInsertion        .o.
      Cleanup;

Testing the grammar

Having at our disposal the two files english.lexc and english.foma, we can simply run the .foma-file from the interface, which will compile the lexicon and the replacement rules and join them all by composition.

foma[0]: source english.foma
Opening file 'english.foma'.
defined V: 405 bytes. 2 states, 5 arcs, 5 paths.
Root...2, Noun...6, Verb...6, Ninf...2, Vinf...5
Building lexicon...Determinizing...Minimizing...Done!
1.7 kB. 32 states, 46 arcs, 42 paths.
defined Lexicon: 1.7 kB. 32 states, 46 arcs, 42 paths.
defined ConsonantDoubling: 1.1 kB. 11 states, 47 arcs, Cyclic.
defined EDeletion: 1.2 kB. 11 states, 52 arcs, Cyclic.
defined EInsertion: 1.1 kB. 7 states, 43 arcs, Cyclic.
defined YReplacement: 998 bytes. 9 states, 36 arcs, Cyclic.
defined KInsertion: 1.9 kB. 12 states, 89 arcs, Cyclic.
defined Cleanup: 324 bytes. 1 states, 2 arcs, Cyclic.
2.2 kB. 47 states, 70 arcs, 42 paths.
foma[1]:

Now we can test the resulting transducer in the analysis direction:

foma[1]: up
apply up> tries
try+V+3P+Sg
try+N+Pl
apply up> panicking
panic+V+PresPart
apply up> [CTRL-D]

and the generation direction:

foma[1]: down
apply down> beg+V+PresPart
begging
apply down> watch+N+Pl
watches
apply down>

Here is the complete english.foma-file:

### english.foma ###

# Vowels
define V [ a | e | i | o | u ];

read lexc english.lexc
define Lexicon;

# Consonant doubling: 1-letter consonant doubled before -ing/-ed (beg/begging)
define ConsonantDoubling g -> g g || _ "^" [i n g | e d ];

# E deletion: silent e dropped before -ing and -ed (make/making)
define EDeletion e -> 0 || _ "^" [ i n g | e d ] ;

# E insertion: e added after -s, -z, -x, -ch, -sh before s (watch/watches)
define EInsertion [..] -> e || s | z | x | c h | s h _ "^" s ;

# Y replacement: -y changes to -ie before -s, -i before -ed (try/tries)
define YReplacement y -> i e || _ "^" s    ,,
                    y -> i   || _ "^" e d  ;

# K insertion: verbs ending with vowel + -c add -k (panic/panicked)
define KInsertion [..] -> k || V c _ "^" [e d | i n g];

# Cleanup: remove morpheme boundaries
define Cleanup "^" -> 0;

define Grammar Lexicon           .o.
               ConsonantDoubling .o.
               EDeletion         .o.
               EInsertion        .o.
               YReplacement      .o.
               KInsertion        .o.
               Cleanup;

regex Grammar;

Additional testing

We can also list all the input/output pairs of the words with the pairs-command:

  foma[1]: pairs
  cat+N+Sg	cat
  cat+N+Pl	cats
  city+N+Pl	cities
  city+N+Sg	city
  try+V+PastPart	tried
  try+V+Past	tried
  try+V+3P+Sg	tries
  try+N+Pl	tries
  try+V	try
  try+V+PresPart	trying
  try+N+Sg	try
  panic+V	panic
  panic+V+3P+Sg	panics
  panic+N+Sg	panic
  panic+N+Pl	panics
  panic+V+PresPart	panicking
  panic+V+PastPart	panicked
  panic+V+Past	panicked
  watch+V	watch
  watch+V+PresPart	watching
  watch+V+PastPart	watched
  watch+V+Past	watched
  watch+V+3P+Sg	watches
  watch+N+Pl	watches
  watch+N+Sg	watch
  make+V	make
  make+V+3P+Sg	makes
  make+V+PresPart	making
  make+V+PastPart	maked
  make+V+Past	maked
  beg+V	beg
  beg+V+3P+Sg	begs
  beg+V+PresPart	begging
  beg+V+PastPart	begged
  beg+V+Past	begged
  fox+V	fox
  fox+V+PresPart	foxing
  fox+V+PastPart	foxed
  fox+V+Past	foxed
  fox+V+3P+Sg	foxes
  fox+N+Pl	foxes
  fox+N+Sg	fox
  foma[1]:

Here, eyeballing the results, we see that we've made a slight overgeneralization: the surface form maked is incorrect. How to solve this, and how to handle other exceptional forms in the grammar, will be addressed below.

Saving the resulting transducer

Having compiled the analyzer/generator transducer, we can also save it into a file with the save stack-command:

foma[1]: save stack english.bin
Writing to file english.bin.
foma[1]:

For large grammars, loading the binary file format will in general be much faster than recompiling the grammar. However, if we need to make changes to the grammar, recompilation becomes necessary. The binary transducer format is also needed to use the analyzer/generator with the external utility flookup. This is a simple way (outside of the foma API) to connect a morphological analyzer to other applications.

Running the transducer with the flookup utility

Assuming we've stored the transducer in english.bin, we can use the generic flookup utility to apply it. The flookup utility reads a transducer from a given file, and then reads words from stdin, yielding the transducer output to stdout:

$ echo "begging" | flookup english.bin
begging	beg+V+PresPart
$

By default, flookup applies words in the inverse (analysis) direction of the transducer (like apply up in the foma interface). If we want to generate, we can run flookup with the -i flag:

$ echo "beg+V+PresPart" | flookup -i english.bin
beg+V+PresPart	begging
$

Grammar tweaking: Adding exceptions (irregular forms)

At this point, let's return to the concern about generating the incorrect form maked from make+V+PastPart and make+V+Past. This is clearly not a phonological/orthographic issue that we would want to mend by adding a rewrite rule. Such a rule would only target those particular word forms (which are really exceptions), so we might as well treat it as an exception by listing it as such. There are a couple of options regarding how to model exceptional forms in foma. One straightforward way is to simply list the exceptional forms in the lexicon so that they bypass the replacement rules. Another way is to write a separate mini-grammar for exceptions which is joined together with the main grammar by the union or priority union operations.

Exceptions in LEXC

The way to go about this in lexc is to bypass word formation at a certain point. For example, make is listed as:

make  Vinf;

where Vinf is the continuation class that handles all the verb inflection. Now, we could replace this line by:

make+V:make            #;
make+V+PresPart:making #;
make+V+Past:made       #;
make+V+PastPart:made   #;

In effect, this lists all the pairings of the analyses of make with its surface forms separately. Note that the continuation lexicon is now # for all these forms, since we need to bypass Vinf altogether; these words no longer need to be affixed with inflectional affixes.

For single words, or a few entries, this method will work fine. For larger projects, it is advisable to write small subgrammars that are combined with the regular grammars through priority union, as is explained below.

Exceptions by priority union

Priority union (.P.) is a special operation that works exactly like regular union (|) in most cases. That is, if we have two mappings A and B, A .P. B will perform the mapping that either A or B would. However, the priority union of A and B, as opposed to regular union, additionally discards those relations from B that have a corresponding input in A. This idea can be used to combine two grammars, an 'exception' grammar and a 'regular' grammar. Fig. 6 illustrates the approach.

In other words, if we had an 'exceptions' transducer that only contained the irregular form mappings, we could combine this with the regular transducer as follows.

define Exceptions [[{make} "+V" "+Past" .x. {made}] | [{make} "+V" "+PastPart" .x. {made}]] ;
regex Exceptions .P. Grammar ;

Grammar tweaking: adding exceptions (parallel forms)

Although the grammar above does not include words that have parallel or alternate forms, the need to accommodate such words is typical. In English, for example, we may need to be able to parse Latinate or Greek plural forms alongside the regular anglicized forms in words such as cactuses/cacti, indexes/indices, formulae/formulas.

Parallel forms in LEXC

If we had an entry for cactus in the above Noun lexicon, the regular form would of course be produced correctly by the grammar as-is:

...
cactus            Ninf;
...

To also produce the alternate form cacti, we can again hard-wire this into the lexc definition; however, this time without removing the regular entry:

...
cactus            Ninf;
cactus+N+Pl:cacti #;

Now, the regular entry cactus with the continuation class Ninf handles the regular pluralization, while we've also listed an irregular one, which is not followed by any continuation class but by end-of-word (#).

Parallel forms with union

Similar to the exceptional forms, we can also construct a separate grammar containing alternate forms, and then union this grammar with the main, regular grammar. Assuming we've defined our irregular mappings in ParallelForms, we can easily produce the union by ParallelForms | Grammar. For the simple example above of including cacti as a parallel form, we could issue:

define ParallelForms [c a c t u s %+N %+Pl .x. c a c t i ];
regex ParallelForms | Grammar ;

assuming of course we've already defined Grammar (as is done in english.foma).

Including both parallel and exceptional forms

In many cases, we will build separate subgrammars for the exceptional forms and the parallel forms. These can be combined with the main grammar using both union and priority union. Assume we've defined Grammar, ParallelForms, and Exceptions. In the example below, we provide simple definitions of these for the sake of illustration, but in actual applications these subgrammars can be fairly elaborate and contain their own lexc-code and replacement rules, just like the main grammar:

define ParallelForms [c a c t u s %+N %+Pl .x. c a c t i ];
define Exceptions [[{make} "+V" "+Past" .x. {made}] | [{make} "+V" "+PastPart" .x. {made}]] ;

Now, we can combine the two with the main grammar as follows:

regex Exceptions .P. [Grammar | ParallelForms] ;

Advanced: long-distance dependencies and flag diacritics

In some cases, we need to model long-distance dependencies between morphemes - something that is not easy to do in the lexc-formalism. Flag diacritics offer a way to perform long-distance dependency checks during word formation.

English morphology has few (if any) truly general long-distance dependencies, but to have a concrete example to illustrate the technique with, let's consider the following verbs and deverbal adjectives:

undrinkable+Adj
unpayable+Adj
undoable+Adj
drinkable+Adj
payable+Adj
doable+Adj
*undrink+V+Inf
*unpay+V+Inf
undo+V+Inf

To write a lexc-grammar that captures everything (except the two forms we have deemed ungrammatical), we could presumably do something like the following skeleton:

Multichar_Symbols +V +Inf +Adj +Pl UN+

LEXICON Root
Prefix;

LEXICON Prefix
UN+:un Verb;
       Verb;

LEXICON Verb

do    Vinf;
drink Vinf;
pay   Vinf;


LEXICON Vinf

able    Adjinf;
+V+Inf:0  #;

LEXICON Adjinf
+Adj:0      #;
+Adj+Pl:s   #;

The script now generates all 3*6 = 18 combinations of:

VERB
VERB+able
VERB+ables
un+VERB
un+VERB+able
un+VERB+ables

To address the two word forms that shouldn't be allowed (*undrink and *unpay), the first problem is that, somehow, the Vinf-lexicon should be aware of whether the UN+ prefix has been selected from the prefix lexicon, and in that case force the selection of able, so as to disallow *undrink, for example. This is tricky to manage in a clean way in the continuation-class formalism of lexc. The second problem is that we don't want this restriction to apply in the case of undo. Both of these restrictions can be accomplished by the use of flag diacritic symbols placed into the grammar, which give more fine-grained control over co-occurrence.

Flag diacritics to control co-occurrence

Flag diacritics are "invisible" symbols that allow one to express constraints between disjoint parts of words. All flag symbols in a grammar are of the following format:

  • @FLAGTYPE.FEATURE.VALUE@

The flag features and values are arbitrary strings, and up to the user to decide. During application of a transducer, the foma runtime will decide whether or not a word is to be accepted depending on which flags co-occur in the same word. A good way to handle co-occurrence problems is generally to have certain morphemes carry suitable flags, which then prevents incompatible morphemes from combining, while the flags remain invisible to the user.

The different flag types are

  • U unify features @U.FEATURE.VALUE@
  • P positive set @P.FEATURE.VALUE@
  • N negate @N.FEATURE.VALUE@
  • R require feature/value @R.FEATURE.VALUE@ or @R.FEATURE@
  • D disallow feature/value @D.FEATURE.VALUE@ or @D.FEATURE@
  • C clear feature @C.FEATURE@
  • E require equal feature/values @E.FEATURE.VALUE@

We will not use all of the flag types here, restricting ourselves to a few, namely positive set (P), clear (C), and disallow (D), which set a feature to a value, clear a feature, and disallow a feature from having certain (or any) values, respectively.

The functioning of flags is best illustrated through an example. Let's add some flags to the previous lexc-script:

Multichar_Symbols +V +Inf +Adj +Pl UN+ @P.UN.ON@ @D.UN@ @C.UN@

LEXICON Root
Prefix;

LEXICON Prefix
@P.UN.ON@UN+:@P.UN.ON@un Verb;
       Verb;

LEXICON Verb

@C.UN@do    Vinf;
drink Vinf;
pay   Vinf;


LEXICON Vinf

able    Adjinf;
@D.UN@+V+Inf:@D.UN@ #;

LEXICON Adjinf
+Adj:0      #;
+Adj+Pl:s   #;

Note the following things:

  • We declared the flags together with the other multi-character symbols
  • We put a @P.UN.ON@-flag on the un-prefix
  • We put a @D.UN@-flag on the infinitive
  • We put a @C.UN@-flag on the do-lemma
  • We put the flags on both sides (the upper side, and lower side), and made sure they lined up by putting them first in the string whenever we had upper:lower entry pairs.

The flags in this grammar interact as follows to capture the phenomena we require:

  • the un-prefix gives the value ON to the feature named UN (@P.UN.ON@) (positive set)
  • the bare infinitive disallows the feature UN from having any value (@D.UN@) (disallow)

Now, this means that even though lexc generates the combination

@P.UN.ON@unpay@D.UN@

the apply runtime sees that the feature UN was first set, and later in the same string disallowed from having any value, thus blocking this word. (We could have used @D.UN.ON@ just as well: this would have disallowed UN from having the specific value ON, whereas @D.UN@ disallows any value at all.) This gets us almost everything, with the exception that we want to allow undo, unlike the two other un-VERB combinations. That is handled by the @C.UN@-flag, which clears any value the feature UN may have. In other words, for that particular combination, lexc generates:

@P.UN.ON@un@C.UN@do@D.UN@

And in this case, the runtime flag-watcher has no problem accepting it because the feature UN was cleared before it was disallowed.

If you load this file in foma, it will actually still say it accepts 18 paths:

Root...1, Prefix...2, Verb...3, Vinf...2, Adjinf...2
Building lexicon...Determinizing...Minimizing...Done!
1.2 kB. 21 states, 27 arcs, 18 paths.

However, doing a lower-words will only print out 16 - those words that are accepted.

Flag commands

If you want to see the flags in the output, you can issue:

set show-flags ON

And now, lower-words or upper-words gives something like:

@P.UN.ON@un@C.UN@do@D.UN@

Foma can also calculate the equivalent transducer without the flags present with the eliminate flags command:

foma[1]: eliminate flags
1.3 kB. 26 states, 32 arcs, 16 paths.

And now we see that after the flags have been purged and an equivalent FST calculated, the number of paths has been reduced to 16, since the disallowed combinations are no longer part of the FST at all.

If you want the foma apply commands to disregard flag constraints, say for debugging purposes, you can also issue the command:

set obey-flags OFF

Flags and replacement rules

A common trap with using flag diacritics is to forget that they are actual symbols in the transducers. This often manifests itself in the design of replacement rules. Consider a rule like:

b -> p || _ .#. ;

This rule changes b to p at the end of a word. Now, if the word comes from a flag-bearing grammar, it may actually look like:

blab@U.SOMEFEATURE.SOMEVALUE@

although the flag is never seen when the strings are printed. In this case the rule will not apply. There are two easy remedies for this:

  • Make rules flag-aware:
b -> p || _ ("@U.SOMEFEATURE.SOMEVALUE@") .#. ;

or

  • Issue the command set flag-is-epsilon ON in the foma interface before composing the lexicon with the replacement rules.

The following general guidelines should help to steer clear of trouble with flag diacritics:

  • Declare all flags as Multichar_Symbols in lexc.
  • Make sure the flags are both on the upper and lower side in lexc, aligned. Put the flag first (before the morpheme) in a lexc entry, and on both sides of the colon if the entry is two-level.
  • Use set show-flags ON when debugging.
  • Make sure the replacement rules take the flag symbols into account.