Chapter 1

Introduction


1.1    Encircling layers

An utterance of English is like an onion with encircling layers around internal elements, as pictured in Figure 1.1.

Figure 1.1: Layers around an innermost element


The action of parsing amounts to:

An innermost element is either the head word of a phrase layer or the verb of a clause layer.

    While placing everything back together again, we can add labels called annotations. This preserves a record of what was uncovered while peeling back the layers. Subsequently, these labels act as handles that we can query in order to access the structural analysis revealed by parsing.

    Furthering the onion analogy, the kind of layering seen in Figure 1.2 occurs when there is coordination, with each conjunct having to conform to what is possible at that particular layer.

Figure 1.2: Parallel layers


    In addition to structures that arise from multiple layers, we will observe ripple effects, where what happens at one layer goes on to have consequences for what is possible in other layers.


1.2    Phrase structure rules

Phrase structure rules are a way to build labelled layers around embedded elements. A phrase structure rule has:

    For example, the five rules of (1.1) involve labelled layers: vp, np_sbj, and ip; and terminal elements: [smiles], [and], and [he].

(1.1)
1.  vp --> [smiles].
2.  vp --> vp, [and], vp.
3.  np_sbj --> [he].
4.  ip --> np_sbj, vp.
5.  ip --> ip, [and], ip.

By following the rules of (1.1), we can assemble the structure of (1.2). We can start with rule 4, by which an ip layer consists of an np_sbj layer followed by a vp layer. With rule 3, we reach a complete np_sbj with the terminal word [he]. With rule 2, we reach a vp that is itself made up of a vp, followed by the word [and], followed by another vp. Finally, with rule 1, we complete both vp layers with the terminal word [smiles].

(1.2)
ip
  np_sbj
    [he]
  vp
    vp
      [smiles]
    [and]
    vp
      [smiles]

    As an alternative to (1.2), we might have started with rule 5 of (1.1), and thereafter assembled (1.3), and so on.

(1.3)
ip
  ip
    np_sbj
      [he]
    vp
      [smiles]
  [and]
  ip
    np_sbj
      [he]
    vp
      [smiles]
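Trees like (1.2) and (1.3) can be tried out mechanically by encoding each layer as a pair of a label and a list of child layers, with terminal words as childless pairs. The following Python sketch (the tuple encoding is our own, chosen for illustration) collects the terminal words of a tree, making visible that the two layerings cover different word strings:

```python
def leaves(tree):
    """Collect the terminal words of a tree, left to right."""
    label, children = tree
    if not children:              # a terminal word has no children
        return [label]
    return [word for child in children for word in leaves(child)]

# (1.2): an ip over np_sbj plus a coordinated vp
t_1_2 = ("ip", [("np_sbj", [("he", [])]),
                ("vp", [("vp", [("smiles", [])]),
                        ("and", []),
                        ("vp", [("smiles", [])])])])

# (1.3): coordination of two complete ip layers
t_1_3 = ("ip", [("ip", [("np_sbj", [("he", [])]),
                        ("vp", [("smiles", [])])]),
                ("and", []),
                ("ip", [("np_sbj", [("he", [])]),
                        ("vp", [("smiles", [])])])])
```

Here leaves(t_1_2) gives ['he', 'smiles', 'and', 'smiles'], while leaves(t_1_3) gives ['he', 'smiles', 'and', 'he', 'smiles']: coordination at the vp layer and coordination at the ip layer yield different word strings.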

Question:

Applying the phrase structure rules of (1.1), how many different ip structures can be made?


1.3    Computer parsing in this book

Computer parsing in this book will use Definite Clause Grammar notation (DCG; Pereira and Warren 1980). This is a notation for writing phrase structure grammar rules in which labels for layers can take attributes with values unified by Prolog-style term unification. DCG notation is a feature of virtually all Prolog systems. With DCG notation, we can write phrase structure rules (like (1.1) above) and then we have an executable Prolog program.
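The matching of attribute values by Prolog-style term unification can be pictured with a small Python sketch. The encoding below is our own toy model (variables as a Var class, compound terms as tuples of a functor and arguments), not the DCG machinery itself:

```python
class Var:
    """A logic variable (toy stand-in for a Prolog variable)."""
    def __init__(self, name):
        self.name = name

def walk(term, subst):
    # Follow variable bindings to the most instantiated form.
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst):
    """Return a substitution extending subst that unifies a and b, or None."""
    a, b = walk(a, subst), walk(b, subst)
    if a is b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):          # unify argument by argument
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return subst if a == b else None    # atoms must match exactly
```

For instance, unifying a hypothetical term ('np', Num) with ('np', 'sg') binds the variable to 'sg', while ('np', 'sg') against ('np', 'pl') fails. It is this two-way flow of bindings that lets DCG attributes both constrain and record structure.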

    Prolog and DCGs are discussed in introductory textbooks by Clocksin and Mellish (1981), Pereira and Shieber (1987), Covington (1994), Matthews (1998), Blackburn, Bos and Striegnitz (2006), among others. Shieber (1986) contains discussion of how DCGs relate to other formalisms for encoding natural language grammars, including Categorial Grammar, Lexical Functional Grammar (LFG), Generalized Phrase Structure Grammar (GPSG), and Head-driven Phrase Structure Grammar (HPSG).

    We will want to include phrase structure grammar rules with left recursion, like rules 2 and 5 of (1.1) above. When there is left recursion, avoiding infinite loops from rule calls requires remembering what has already been evaluated. This is possible with a technique called tabling introduced by Swift and Warren (1994). Tabling is a feature of some Prolog systems, notably, the XSB Tabling Prolog system (Swift and Warren 2022). Christiansen and Dahl (2018) discuss the evolution of natural language processing as it relates to Logic Programming, with particular focus on DCGs and tabling.
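The effect that tabling has on left-recursive rules can be imitated with memoisation. The sketch below recognises word lists against the rules of (1.1); the grammar encoding and the use of word spans in place of difference lists are our own simplifications. Because every right-hand-side symbol must cover at least one word, the left-recursive call always receives a strictly smaller span, so recursion terminates where a naive top-down evaluation would loop:

```python
from functools import lru_cache

# Grammar of (1.1); terminals are written ('w', word), nonterminals as strings.
RULES = {
    "vp":     [[("w", "smiles")],
               ["vp", ("w", "and"), "vp"]],
    "np_sbj": [[("w", "he")]],
    "ip":     [["np_sbj", "vp"],
               ["ip", ("w", "and"), "ip"]],
}

def recognises(symbol, words):
    """True iff symbol derives the whole word list under RULES."""
    words = tuple(words)

    @lru_cache(maxsize=None)        # the memo table plays the role of tabling
    def spans(sym, i, j):
        # Can sym derive words[i:j]?
        if isinstance(sym, tuple):  # terminal: must match exactly one word
            return j == i + 1 and words[i] == sym[1]
        return any(seq_spans(tuple(rhs), i, j) for rhs in RULES[sym])

    @lru_cache(maxsize=None)
    def seq_spans(rhs, i, j):
        # Can the symbol sequence rhs derive words[i:j]?
        if not rhs:
            return i == j
        # The first symbol takes at least one word, leaving enough for the rest.
        return any(spans(rhs[0], i, k) and seq_spans(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return spans(symbol, 0, len(words))
```

With this, recognises("ip", ["he", "smiles", "and", "smiles"]) succeeds via rules 4, 3, 2 and 1, and recognises("ip", ["he", "smiles", "and", "he", "smiles"]) succeeds via rule 5, while a bare ["smiles"] is rejected as an ip.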


1.4    Structural analysis and labelling in this book

The structural analysis and labelling in this book follows the annotation scheme of the Treebank Semantics Parsed Corpus (TSPC; Butler 2023). The TSPC is a corpus of English for general use, with hand-worked tree analyses for half a million words.

    The annotation approach is an attempt to consolidate alternative annotation schemes for English:

    From the SUSANNE scheme, there is adoption of form and function information, such that the TSPC scheme can be linked most closely to the SUSANNE scheme. The SUSANNE scheme is closely related to the English grammars of Quirk et al. (1972, 1985).

    The ICE Parsing Scheme similarly follows the Quirk et al. grammars. In addition, ICE is notable for its rich range of features. The TSPC annotation supports the ability for many of these features to be automatically derived.

    The Penn Historical Corpora scheme, which itself draws on the bracketed approach of the Penn Treebank scheme, has informed the ‘look’ of the annotation. This includes:

However, it should be noted that, labels aside, the tag set of the TSPC scheme is most compatible with the SUSANNE scheme, especially with regard to function marking. Moreover, word class tags closely overlap with the Lancaster word class tagging systems, especially the UCREL CLAWS5 tag set used for the British National Corpus (BNC Consortium 2005).

    The TSPC scheme also contains plenty that is innovative. Most notably, there is normalisation of structure, achieved with intermediate layers at:

    Another area of innovation is verb code integration. There are codes to classify catenative verbs; cf. Huddleston and Pullum (2002, p. 1220). These are the verbs of a verb sequence that come before the main verb of a clause, and this type of verb is further supported in the annotation by (IP-PPL-CAT) intermediate clause structure. The verb codes for main verbs are from the mnemonic system of the fourth edition of the Oxford Advanced Learner's Dictionary (OALD4; Cowie 1989). Additional codes with very particular distributions are included from Hornby (1975).

    The most innovative aspect of the annotation gives the TSPC its name: It can be fed to the Treebank Semantics evaluation system (Butler 2021). Treebank Semantics processes constituency tree annotations and returns logic-based meaning representations. While the annotation seldom includes indexing, results calculated with Treebank Semantics resolve both inter- and intra-clause dependencies, including cross-sentential anaphoric dependencies.


1.5    Representing layered structure

When parsing, we will accumulate layers with node structures. These structures have the form:

node(Label,NodeList)

where Label is a label for the layer and NodeList is a list of node structures that are the content for the layer.

    In Prolog, lists are written with square bracket notation:

[a1,a2,...,aN]

where a1 through to aN are the list items. The list that contains no items ([]) is the empty list. A terminal layer of structure has the form:

node(Word,[])

where Word is a terminal word.

    With this background, consider (1.4) as an example of encircling layers gathered with node structures.

(1.4)
node('IP-MAT',
     [node('NP-SBJ',
           [node('PRO',
                 [node('He',[])])]),
      node('ILYR',
           [node('ILYR',
                 [node('VBP;~I',
                       [node('smiles',[])])]),
            node('CONJP',
                 [node('CONJ',
                       [node('and',[])]),
                  node('ILYR',
                       [node('VBP;~I',
                             [node('smiles',[])])])])]),
      node('PUNC',
           [node('.',[])])])

We will more typically present node structure as labelled bracketed structure, as in (1.5).

(1.5)
(IP-MAT (NP-SBJ (PRO He))
        (ILYR (ILYR (VBP;~I smiles))
              (CONJP (CONJ and)
                     (ILYR (VBP;~I smiles))))
        (PUNC .))
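The move from node terms like (1.4) to the labelled bracketing of (1.5) is a mechanical tree walk. A Python sketch of a single-line variant, with node(Label,NodeList) rendered as (label, children) tuples for illustration (the real presentation also pretty-prints indentation):

```python
def bracket(node):
    """Render a node term as a labelled bracketing, as in (1.5)."""
    label, children = node
    if len(children) == 1 and not children[0][1]:
        # A preterminal over a terminal word prints as (label word),
        # e.g. (PRO He).  (Terminal words are assumed to always sit
        # under a preterminal, as in (1.4).)
        return "(%s %s)" % (label, children[0][0])
    return "(%s %s)" % (label, " ".join(bracket(c) for c in children))
```

For example, bracket applied to ('NP-SBJ', [('PRO', [('He', [])])]) yields the string (NP-SBJ (PRO He)).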

1.6    Orientation

We will proceed as follows. In the remainder of this chapter, section 1.7 introduces rules for identifying sentences, and section 1.8 introduces rules for sentence fragments and utterances. This will demonstrate writing DCG rules with attributes for collecting layered structure and passing on ripple information. This will also demonstrate using the Prolog system to query English word input.

    Subsequent chapters define rules to build up a wide coverage of consequences from word input. This starts in chapter 2 with an inventory of the kinds of words there are in English. Once we have a handle on the range of words, we will want to start placing words together to create phrases, and this is the goal of chapter 3. Chapter 4 establishes what verbs require to be present in the clause based on their given verb code. With access to phrases and verb words with their selection requirements, we will have content for clause layers. Establishing what is possible at the clause layer is the goal of chapter 5. Chapter 6 builds on all the preceding chapters by integrating clause subordination. Chapter 7 gives closing remarks.


1.7    Sentences

The Prolog code of (1.6) gives phrase structure rules that initiate the search for layered content making up a sentence.

(1.6)
sentence([node('IP-MAT',IL)|L],L) -->
  clause_top_layer(statement_order,[],IL,IL1),
  punc(final,IL1,[]).
sentence([node('IP-IMP',IL)|L],L) -->
  clause_top_layer(imperative_clause,[],IL,IL1),
  punc(final,IL1,[]).
sentence([node('CP-QUE',[node('IP-SUB',IL),PU])|L],L) -->
  clause_top_layer(matrix_interrogative,[],IL,[]),
  punc(final_question,[PU],[]).
sentence([node('IP-MAT',IL)|L],L) -->
  clause_top_layer(statement_order,[],IL,IL2),
  punc(non_final,IL2,[node('PRN',[node('CP-QUE',[node('IP-SUB',TL)])])|IL1]),
  clause_top_layer(tag_question,[],TL,[]),
  punc(final_question,IL1,[]).

Rule 1 of (1.6) will create a node structure with a list IL for the layer content of the node and 'IP-MAT' (statement matrix clause) as the node label. IP is an abbreviation for Inflectional Phrase, which is another way to say ‘clause’. To succeed, rule 1 needs content to parse that consists of a list of items where all but the last item satisfy one of the clause_top_layer phrase structure rules (see section 5.2), and where the last list item is an instance of final punctuation (see section 2.12).
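The paired list arguments in rule 1 — [node('IP-MAT',IL)|L] and L — form a difference list: the nodes contributed by the rule are whatever sits in the first list before the tail L, so contributions from successive rules chain together without list copying. A toy Python model of the same idea (a difference list as a function awaiting its tail; the encoding is our own):

```python
def dlist(items):
    """A difference list: a list 'missing' its tail."""
    return lambda tail: list(items) + tail

def compose(d1, d2):
    # Appending two difference lists is just function composition.
    return lambda tail: d1(d2(tail))

# Each rule contributes its nodes in front of whatever follows:
subj = dlist(["NP-SBJ"])
verb = dlist(["VBP"])
clause = compose(subj, verb)
```

Here clause([]) gives ['NP-SBJ', 'VBP'], and clause(['PUNC']) gives ['NP-SBJ', 'VBP', 'PUNC']: supplying different tails closes off the same accumulated front in different ways, which is how IL and IL1 thread content through rule 1.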

    In (1.7), two Prolog calls are made. The first call, with tphrase_set_string, stores a list of word items to parse; let's call this the word list. The second call, with parse, asks whether the stored word list has content to satisfy a call of sentence. Note that the parameter of sentence for accumulating parse structure is hidden internally to the parse call. If parse succeeds, then all parse results are presented as bracketed tree output.

(1.7)
| ?- tphrase_set_string([w('PRO','He'), w('VBP',';~I','smiles'), w('PUNC','.')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO He))
        (VBP;~I smiles)
        (PUNC .))

yes

The output from (1.7) is structure with IP-MAT as the topmost node, so rule 1 of (1.6) completed successfully.

    Now consider (1.8).

(1.8)
| ?- tphrase_set_string([w('VB',';~I','Smile'), w('PUNC','.')]), parse(sentence).

(IP-IMP (VB;~I Smile)
        (PUNC .))

yes

The output from (1.8) is structure with IP-IMP (imperative clause) as the topmost node: Rule 1 of (1.6) fails, but rule 2 succeeds. Rules 1 and 2 are similar. Both rules call clause_top_layer, but they do so with different clause type settings:

    As (1.8) illustrates, an imperative clause should have no subject and an initial verb with imperative form (a word with VB, DO, HV or BE word class information). Now consider (1.9), which is unsuitable as a statement sentence because it lacks content for a subject, and also unsuitable as an imperative sentence because its verb has VBP (present tense) word class. Consequently, there can be no parse result.

(1.9)
| ?- tphrase_set_string([w('VBP',';~I','Smiles'), w('PUNC','.')]), parse(sentence).

no

    As a further example of checking if there is word list content for a sentence, consider (1.10).

(1.10)
| ?- tphrase_set_string([w('DOD','','Did'), w('PRO','he'), w('VB',';~I','smile'), w('PUNC','?')]), parse(sentence).

(CP-QUE (IP-SUB (DOD Did)
                (NP-SBJ (PRO he))
                (VB;~I smile))
        (PUNC ?))

yes

The output from (1.10) shows the return of a structure with CP-QUE (interrogative clause) as the topmost node. This is achieved with rule 3 of (1.6), which expects the word list to provide content for the success of a clause_top_layer rule with the clause type parameter set to matrix_interrogative.

    As yet another example of checking word list content for being a sentence, consider (1.11).

(1.11)
| ?- tphrase_set_string([w('PRO','He'), w('MD',';~cat_Vi','might'), w('VB',';~I','smile'), w('PUNC',','), w('MD',';~cat_Vi','might'), w('NEG;_clitic_','n<apos>t'), w('PRO','he'), w('PUNC','?')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO He))
        (MD;~cat_Vi might)
        (IP-INF-CAT (VB;~I smile))
        (PUNC ,)
        (PRN (CP-QUE (IP-SUB (MD;~cat_Vi might)
                             (NEG;_clitic_ n<apos>t)
                             (NP-SBJ (PRO he)))))
        (PUNC ?))

yes

The output from (1.11) shows the return of a sentence ending in a tag question, which is achieved with rule 4 of (1.6).

    To sum up, this section has shown how to query whether a word list contains content for an English sentence. With success, returned structure tells us about the kind of sentence:


Question:

Why do the failure and success results of (1.12)–(1.14) obtain? (Also, note the result of (1.10) above.)

(1.12)
| ?- tphrase_set_string([w('DOD','','Did'), w('PRO','he'), w('VB',';~I','smile'), w('PUNC','.')]), parse(sentence).

no
(1.13)
| ?- tphrase_set_string([w('PRO','he'), w('DOD','','did'), w('VB',';~I','smile'), w('PUNC','.')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO he))
        (DOD did)
        (VB;~I smile)
        (PUNC .))

yes
(1.14)
| ?- tphrase_set_string([w('PRO','he'), w('DOD','','did'), w('VB',';~I','smile'), w('PUNC','?')]), parse(sentence).

no

1.8    Sentence fragments and utterances

An overall utterance can consist of a single sentence or sentence fragment, or multiple instances of sentences and/or sentence fragments, and might itself form part of an embedding as reported speech.

    Rule (1.15) identifies a sentence fragment as content for a fragment layer followed by sentence final punctuation.

(1.15)
fragment([node('FRAG',FL)|L],L) -->
  fragment_layer(FL,FL1),
  punc(final,FL1,[]).

Fragment layer content is found with calls of (1.16) that detect:

(1.16)
fragment_layer(L,L0) -->
  noun_phrase('',_,L,L0).
fragment_layer(L,L0) -->
  adjective_phrase('',_,L,L0).
fragment_layer(L,L0) -->
  adverbial(L,L0).
fragment_layer(L,L0) -->
  ip_to_inf('',[],L,L0).
fragment_layer([node('IP-PPL2;IP-PPL',IL)|L],L) -->
  ip_ppl_adverbial_layer(filled_subject,IL,[]).
fragment_layer([node('IP-PPL3',IL)|L],L) -->
  ip_ppl_adverbial_layer(unfilled_subject,IL,[]).

For rule 3 of (1.16), an adverbial (see section 2.10 for the adverbial rules) can be:

    With the rules to identify sentence fragments, and the sentence rules from section 1.7, rule (1.17) calls the rules of (1.18) to recursively find the content of sentences and/or sentence fragments [rules 3 and 4 of (1.18)]. The recursion ends by picking up a final sentence or sentence fragment (which could be the only utterance element) [rules 1 and 2 of (1.18)].

(1.17)
utterance(Ext,[node(Label,UL)|L],L) -->
  utterance_collect(UL,[]),
  {
    atom_concat('utterance',Ext,Label)
  }.
(1.18)
utterance_collect(L,L0) -->
  sentence(L,L0).
utterance_collect(L,L0) -->
  fragment(L,L0).
utterance_collect(L,L0) -->
  sentence(L,L1),
  utterance_collect(L1,L0).
utterance_collect(L,L0) -->
  fragment(L,L1),
  utterance_collect(L1,L0).

    Example (1.19) illustrates the parse of an utterance that has an initial imperative sentence and ends with a sentence fragment that has a noun phrase as content for its fragment layer.

(1.19)
| ?- tphrase_set_string([w('VB',';~I','Beware'), w('PUNC','!'), w('PRO;_genm_','His'), w('ADJ','flashing'), w('NS','eyes'), w('PUNC','!')]), parse(utterance('')).

(utterance (IP-IMP (VB;~I Beware)
                   (PUNC !))
           (FRAG (NP (NP-GENV (PRO;_genm_ His))
                     (ADJP (ADJ flashing))
                     (NS eyes))
                 (PUNC !)))

yes