Chapter 1

Introduction


1.1    Encircling layers

An utterance of English is like an onion with encircling layers around internal elements, as pictured in Figure 1.1.

Figure 1.1: Layers around an innermost element


The action of parsing amounts to:

An innermost element is either the head word of a phrase layer or the verb of a clause layer.

    While placing everything back together again, we can add labels called annotations. This preserves a record of what was uncovered while peeling back the layers. Subsequently, these labels act as handles that we can query in order to access the structural analysis revealed by parsing.

    Furthering the onion analogy, the kind of layering seen in Figure 1.2 occurs when there is coordination, with each conjunct having to conform to what is possible at that particular layer.

Figure 1.2: Parallel layers


    In addition to structures that arise from multiple layers, we will observe ripple effects, where what happens at one layer goes on to have consequences for what is possible in other layers.


1.2    Phrase structure rules

Phrase structure rules are a way to build labelled layers around embedded elements. A phrase structure rule has:

    For example, the five rules of (1.1) involve labelled layers: vp, np_sbj, and ip; and terminal elements: [smiles], [and], and [he].

(1.1)
1.  vp --> [smiles].
2.  vp --> vp, [and], vp.
3.  np_sbj --> [he].
4.  ip --> np_sbj, vp.
5.  ip --> ip, [and], ip.

By following the rules of (1.1), we can assemble the structure of (1.2). We can start with rule 4, by which an ip layer consists of an np_sbj layer followed by a vp layer. With rule 3, we reach a complete np_sbj with the terminal word [he]. With rule 2, we reach a vp that is itself made up of a vp, followed by the word [and], followed by another vp. Finally, with rule 1, we complete both vp layers with the terminal word [smiles].

(1.2)
ip
  np_sbj
    [he]
  vp
    vp
      [smiles]
    [and]
    vp
      [smiles]

    As an alternative to (1.2), we might have started with rule 5 of (1.1), and thereafter assembled (1.3), and so on.

(1.3)
ip
  ip
    np_sbj
      [he]
    vp
      [smiles]
  [and]
  ip
    np_sbj
      [he]
    vp
      [smiles]
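Trees like (1.2) and (1.3) can be tried out mechanically by encoding each layer as a pair of a label and a list of child layers, with terminal words as childless pairs. The following Python sketch (the tuple encoding is our own, chosen for illustration) collects the terminal words of a tree, making visible that the two layerings cover different word strings:

```python
def leaves(tree):
    """Collect the terminal words of a tree, left to right."""
    label, children = tree
    if not children:              # a terminal word has no children
        return [label]
    return [word for child in children for word in leaves(child)]

# (1.2): an ip over np_sbj plus a coordinated vp
t_1_2 = ("ip", [("np_sbj", [("he", [])]),
                ("vp", [("vp", [("smiles", [])]),
                        ("and", []),
                        ("vp", [("smiles", [])])])])

# (1.3): coordination of two complete ip layers
t_1_3 = ("ip", [("ip", [("np_sbj", [("he", [])]),
                        ("vp", [("smiles", [])])]),
                ("and", []),
                ("ip", [("np_sbj", [("he", [])]),
                        ("vp", [("smiles", [])])])])
```

Here leaves(t_1_2) gives ['he', 'smiles', 'and', 'smiles'], while leaves(t_1_3) gives ['he', 'smiles', 'and', 'he', 'smiles']: coordination at the vp layer and coordination at the ip layer yield different word strings.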

Question:

Applying the phrase structure rules of (1.1), how many different ip structures can be made?


1.3    Computer parsing in this book

Computer parsing in this book will use Definite Clause Grammar notation (DCG; Pereira and Warren 1980). This is a notation for writing phrase structure grammar rules in which labels for layers can take attributes with values unified by Prolog-style term unification. DCG notation is a feature of virtually all Prolog systems. With DCG notation, we can write phrase structure rules (like (1.1) above) and then we have an executable Prolog program.
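The matching of attribute values by Prolog-style term unification can be pictured with a small Python sketch. The encoding below is our own toy model (variables as a Var class, compound terms as tuples of a functor and arguments), not the DCG machinery itself:

```python
class Var:
    """A logic variable (toy stand-in for a Prolog variable)."""
    def __init__(self, name):
        self.name = name

def walk(term, subst):
    # Follow variable bindings to the most instantiated form.
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst):
    """Return a substitution extending subst that unifies a and b, or None."""
    a, b = walk(a, subst), walk(b, subst)
    if a is b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):          # unify argument by argument
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return subst if a == b else None    # atoms must match exactly
```

For instance, unifying a hypothetical term ('np', Num) with ('np', 'sg') binds the variable to 'sg', while ('np', 'sg') against ('np', 'pl') fails. It is this two-way flow of bindings that lets DCG attributes both constrain and record structure.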

    Prolog and DCGs are discussed in introductory textbooks by Clocksin and Mellish (1981), Pereira and Shieber (1987), Covington (1994), Matthews (1998), Blackburn, Bos and Striegnitz (2006), among others. Shieber (1986) contains discussion of how DCGs relate to other formalisms for encoding natural language grammars, including Categorial Grammar, Lexical Functional Grammar (LFG), Generalized Phrase Structure Grammar (GPSG), and Head-driven Phrase Structure Grammar (HPSG).

    We will want to include phrase structure grammar rules with left recursion, like rules 2 and 5 of (1.1) above. When there is left recursion, avoiding infinite loops from rule calls requires remembering what has already been evaluated. This is possible with a technique called tabling introduced by Swift and Warren (1994). Tabling is a feature of some Prolog systems, notably, the XSB Tabling Prolog system (Swift and Warren 2022). Christiansen and Dahl (2018) discuss the evolution of natural language processing as it relates to Logic Programming, with particular focus on DCGs and tabling.
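The effect that tabling has on left-recursive rules can be imitated with memoisation. The sketch below recognises word lists against the rules of (1.1); the grammar encoding and the use of word spans in place of difference lists are our own simplifications. Because every right-hand-side symbol must cover at least one word, the left-recursive call always receives a strictly smaller span, so recursion terminates where a naive top-down evaluation would loop:

```python
from functools import lru_cache

# Grammar of (1.1); terminals are written ('w', word), nonterminals as strings.
RULES = {
    "vp":     [[("w", "smiles")],
               ["vp", ("w", "and"), "vp"]],
    "np_sbj": [[("w", "he")]],
    "ip":     [["np_sbj", "vp"],
               ["ip", ("w", "and"), "ip"]],
}

def recognises(symbol, words):
    """True iff symbol derives the whole word list under RULES."""
    words = tuple(words)

    @lru_cache(maxsize=None)        # the memo table plays the role of tabling
    def spans(sym, i, j):
        # Can sym derive words[i:j]?
        if isinstance(sym, tuple):  # terminal: must match exactly one word
            return j == i + 1 and words[i] == sym[1]
        return any(seq_spans(tuple(rhs), i, j) for rhs in RULES[sym])

    @lru_cache(maxsize=None)
    def seq_spans(rhs, i, j):
        # Can the symbol sequence rhs derive words[i:j]?
        if not rhs:
            return i == j
        # The first symbol takes at least one word, leaving enough for the rest.
        return any(spans(rhs[0], i, k) and seq_spans(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return spans(symbol, 0, len(words))
```

With this, recognises("ip", ["he", "smiles", "and", "smiles"]) succeeds via rules 4, 3, 2 and 1, and recognises("ip", ["he", "smiles", "and", "he", "smiles"]) succeeds via rule 5, while a bare ["smiles"] is rejected as an ip.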


1.4    Structural analysis and labelling in this book

The structural analysis and labelling in this book follows the annotation scheme of the Treebank Semantics Parsed Corpus (TSPC; Butler 2023). The TSPC is a corpus of English for general use, with hand-worked tree analyses for half a million words.

    The annotation approach is an attempt to consolidate alternative annotation schemes for English:

    From the SUSANNE scheme, there is adoption of form and function information, such that the TSPC scheme can be linked most closely to the SUSANNE scheme. The SUSANNE scheme is closely related to the English grammars of Quirk et al. (1972, 1985).

    The ICE Parsing Scheme similarly follows the Quirk et al. grammars. In addition, ICE is notable for its rich range of features. The TSPC annotation supports the ability for many of these features to be automatically derived.

    The Penn Historical Corpora scheme, which itself draws on the bracketed approach of the Penn Treebank scheme, has informed the ‘look’ of the annotation. This includes:

However, it should be noted that, labels aside, the tag set of the TSPC scheme is most compatible with the SUSANNE scheme, especially with regard to function marking. Moreover, word class tags closely overlap with the Lancaster word class tagging systems, especially the UCREL CLAWS5 tag set used for the British National Corpus (BNC Consortium 2005).

    The TSPC scheme also contains plenty that is innovative. Most notably, there is normalisation of structure, achieved with intermediate layers at:

    Another area of innovation is verb code integration. There are codes to classify catenative verbs; cf. Huddleston and Pullum (2002, p. 1220). These are the verbs of a verb sequence that come before the main verb of a clause, and this type of verb is further supported in the annotation by (IP-PPL-CAT) intermediate clause structure. The verb codes for main verbs are from the mnemonic system of the fourth edition of the Oxford Advanced Learner's Dictionary (OALD4; Cowie 1989). Additional codes with very particular distributions are included from Hornby (1975).

    The most innovative aspect of the annotation gives the TSPC its name: It can be fed to the Treebank Semantics evaluation system (Butler 2021). Treebank Semantics processes constituency tree annotations and returns logic-based meaning representations. While the annotation seldom includes indexing, results calculated with Treebank Semantics resolve both inter- and intra-clause dependencies, including cross-sentential anaphoric dependencies.


1.5    Representing layered structure

When parsing, we will accumulate layers with node structures. These structures have the form:

node(Label,NodeList)

where Label is a label for the layer and NodeList is a list of node structures that are the content for the layer.

    In Prolog, lists are written with square bracket notation:

[a1,a2,...,aN]

where a1 through to aN are the list items. The list that contains no items ([]) is the empty list. A terminal layer of structure has the form:

node(Word,[])

where Word is a terminal word.

    With this background, consider (1.4) as an example of encircling layers gathered with node structures.

(1.4)
node('IP-MAT',
     [node('NP-SBJ',
           [node('PRO',
                 [node('He',[])])]),
      node('ILYR',
           [node('ILYR',
                 [node('VBP;~I',
                       [node('smiles',[])])]),
            node('CONJP',
                 [node('CONJ',
                       [node('and',[])]),
                  node('ILYR',
                       [node('VBP;~I',
                             [node('smiles',[])])])])]),
      node('PUNC',
           [node('.',[])])])

We will more typically present node structure as labelled bracketed structure, as in (1.5).

(1.5)
(IP-MAT (NP-SBJ (PRO He))
        (ILYR (ILYR (VBP;~I smiles))
              (CONJP (CONJ and)
                     (ILYR (VBP;~I smiles))))
        (PUNC .))
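The move from node terms like (1.4) to the labelled bracketing of (1.5) is a mechanical tree walk. A Python sketch of a single-line variant, with node(Label,NodeList) rendered as (label, children) tuples for illustration (the real presentation also pretty-prints indentation):

```python
def bracket(node):
    """Render a node term as a labelled bracketing, as in (1.5)."""
    label, children = node
    if len(children) == 1 and not children[0][1]:
        # A preterminal over a terminal word prints as (label word),
        # e.g. (PRO He).  (Terminal words are assumed to always sit
        # under a preterminal, as in (1.4).)
        return "(%s %s)" % (label, children[0][0])
    return "(%s %s)" % (label, " ".join(bracket(c) for c in children))
```

For example, bracket applied to ('NP-SBJ', [('PRO', [('He', [])])]) yields the string (NP-SBJ (PRO He)).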

1.6    Orientation

We will proceed as follows. In the remainder of this chapter, section 1.7 introduces rules for identifying sentences, and section 1.8 introduces rules for sentence fragments and utterances. This will demonstrate writing DCG rules with attributes for collecting layered structure and passing on ripple information. This will also demonstrate using the Prolog system to query English word input.

    Subsequent chapters define rules to build up a wide coverage of consequences from word input. This starts in chapter 2 with an inventory of the kinds of words there are in English. Once we have a handle on the range of words, we will want to start placing words together to create phrases, and this is the goal of chapter 3. Chapter 4 establishes what verbs require to be present in the clause based on their given verb code. With access to phrases and verb words with their selection requirements, we will have content for clause layers. Establishing what is possible at the clause layer is the goal of chapter 5. Chapter 6 builds on all the preceding chapters by integrating clause subordination. Chapter 7 gives closing remarks.


1.7    Sentences

The Prolog code of (1.6) gives phrase structure rules that initiate the search for layered content making up a sentence.

(1.6)
sentence([node('IP-MAT',IL)|L],L) -->
  clause_top_layer(statement_order,[],IL,IL1),
  punc(final,IL1,[]).
sentence([node('IP-IMP',IL)|L],L) -->
  clause_top_layer(imperative_clause,[],IL,IL1),
  punc(final,IL1,[]).
sentence([node('CP-QUE',[node('IP-SUB',IL),PU])|L],L) -->
  clause_top_layer(matrix_interrogative,[],IL,[]),
  punc(final_question,[PU],[]).
sentence([node('IP-MAT',IL)|L],L) -->
  clause_top_layer(statement_order,[],IL,IL2),
  punc(non_final,IL2,[node('PRN',[node('CP-QUE',[node('IP-SUB',TL)])])|IL1]),
  clause_top_layer(tag_question,[],TL,[]),
  punc(final_question,IL1,[]).

Rule 1 of (1.6) will create a node structure with a list IL for the layer content of the node and 'IP-MAT' (statement matrix clause) as the node label. IP is an abbreviation for Inflectional Phrase, which is another way to say ‘clause’. To succeed, rule 1 needs content to parse that consists of a list of items where all but the last item satisfy one of the clause_top_layer phrase structure rules (see section 5.2), and where the last list item is an instance of final punctuation (see section 2.12).
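The paired list arguments in rule 1 — [node('IP-MAT',IL)|L] and L — form a difference list: the nodes contributed by the rule are whatever sits in the first list before the tail L, so contributions from successive rules chain together without list copying. A toy Python model of the same idea (a difference list as a function awaiting its tail; the encoding is our own):

```python
def dlist(items):
    """A difference list: a list 'missing' its tail."""
    return lambda tail: list(items) + tail

def compose(d1, d2):
    # Appending two difference lists is just function composition.
    return lambda tail: d1(d2(tail))

# Each rule contributes its nodes in front of whatever follows:
subj = dlist(["NP-SBJ"])
verb = dlist(["VBP"])
clause = compose(subj, verb)
```

Here clause([]) gives ['NP-SBJ', 'VBP'], and clause(['PUNC']) gives ['NP-SBJ', 'VBP', 'PUNC']: supplying different tails closes off the same accumulated front in different ways, which is how IL and IL1 thread content through rule 1.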

    In (1.7), two Prolog calls are made. The first call, with tphrase_set_string, stores a list of word items to parse; let's call this the word list. The second call, with parse, asks whether the stored word list has content to satisfy a call of sentence. Note that the parameter of sentence for accumulating parse structure is hidden internally to the parse call. If parse succeeds, then all parse results are presented as bracketed tree output.

(1.7)
| ?- tphrase_set_string([w('PRO','He'), w('VBP',';~I','smiles'), w('PUNC','.')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO He))
        (VBP;~I smiles)
        (PUNC .))

yes

The output from (1.7) is structure with IP-MAT as the topmost node, so rule 1 of (1.6) completed successfully.

    Now consider (1.8).

(1.8)
| ?- tphrase_set_string([w('VB',';~I','Smile'), w('PUNC','.')]), parse(sentence).

(IP-IMP (VB;~I Smile)
        (PUNC .))

yes

The output from (1.8) is structure with IP-IMP (imperative clause) as the topmost node: Rule 1 of (1.6) fails, but rule 2 succeeds. Rules 1 and 2 are similar. Both rules call clause_top_layer, but they do so with different clause type settings:

    As (1.8) illustrates, an imperative clause should have no subject and an initial verb with imperative form (a word with VB, DO, HV or BE word class information). Now consider (1.9), which is unsuitable as a statement sentence because it lacks content for a subject, and also unsuitable as an imperative sentence because its verb has VBP (present tense) word class. Consequently, there can be no parse result.

(1.9)
| ?- tphrase_set_string([w('VBP',';~I','Smiles'), w('PUNC','.')]), parse(sentence).

no

    As a further example of checking if there is word list content for a sentence, consider (1.10).

(1.10)
| ?- tphrase_set_string([w('DOD','','Did'), w('PRO','he'), w('VB',';~I','smile'), w('PUNC','?')]), parse(sentence).

(CP-QUE (IP-SUB (DOD Did)
                (NP-SBJ (PRO he))
                (VB;~I smile))
        (PUNC ?))

yes

The output from (1.10) shows the return of a structure with CP-QUE (interrogative clause) as the topmost node. This is achieved with rule 3 of (1.6), which expects the word list to provide content for the success of a clause_top_layer rule with the clause type parameter set to matrix_interrogative.

    As yet another example of checking word list content for being a sentence, consider (1.11).

(1.11)
| ?- tphrase_set_string([w('PRO','He'), w('MD',';~cat_Vi','might'), w('VB',';~I','smile'), w('PUNC',','), w('MD',';~cat_Vi','might'), w('NEG;_clitic_','n<apos>t'), w('PRO','he'), w('PUNC','?')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO He))
        (MD;~cat_Vi might)
        (IP-INF-CAT (VB;~I smile))
        (PUNC ,)
        (PRN (CP-QUE (IP-SUB (MD;~cat_Vi might)
                             (NEG;_clitic_ n<apos>t)
                             (NP-SBJ (PRO he)))))
        (PUNC ?))

yes

The output from (1.11) shows the return of a sentence ending in a tag question, which is achieved with rule 4 of (1.6).

    To sum up, this section has shown how to query whether a word list contains content for an English sentence. With success, returned structure tells us about the kind of sentence:


Question:

Why do the failure and success results of (1.12)–(1.14) obtain? (Also, note the result of (1.10) above.)

(1.12)
| ?- tphrase_set_string([w('DOD','','Did'), w('PRO','he'), w('VB',';~I','smile'), w('PUNC','.')]), parse(sentence).

no
(1.13)
| ?- tphrase_set_string([w('PRO','he'), w('DOD','','did'), w('VB',';~I','smile'), w('PUNC','.')]), parse(sentence).

(IP-MAT (NP-SBJ (PRO he))
        (DOD did)
        (VB;~I smile)
        (PUNC .))

yes
(1.14)
| ?- tphrase_set_string([w('PRO','he'), w('DOD','','did'), w('VB',';~I','smile'), w('PUNC','?')]), parse(sentence).

no

1.8    Sentence fragments and utterances

An overall utterance can consist of a single sentence or sentence fragment, or multiple instances of sentences and/or sentence fragments, and might itself form part of an embedding as reported speech.

    Rule (1.15) identifies a sentence fragment as content for a fragment layer followed by sentence final punctuation.

(1.15)
fragment([node('FRAG',FL)|L],L) -->
  fragment_layer(FL,FL1),
  punc(final,FL1,[]).

Fragment layer content is found with calls of (1.16) that detect:

(1.16)
fragment_layer(L,L0) -->
  noun_phrase('',_,L,L0).
fragment_layer(L,L0) -->
  adjective_phrase('',_,L,L0).
fragment_layer(L,L0) -->
  adverbial(L,L0).
fragment_layer(L,L0) -->
  ip_to_inf('',[],L,L0).
fragment_layer([node('IP-PPL2;IP-PPL',IL)|L],L) -->
  ip_ppl_adverbial_layer(filled_subject,IL,[]).
fragment_layer([node('IP-PPL3',IL)|L],L) -->
  ip_ppl_adverbial_layer(unfilled_subject,IL,[]).

For rule 3 of (1.16), an adverbial (see section 2.10 for the adverbial rules) can be:

    With the rules to identify sentence fragments, and the sentence rules from section 1.7, rule (1.17) calls the rules of (1.18) to recursively find the content of sentences and/or sentence fragments [rules 3 and 4 of (1.18)]. The recursion ends by picking up a final sentence or sentence fragment (which could be the only utterance element) [rules 1 and 2 of (1.18)].

(1.17)
utterance(Ext,[node(Label,UL)|L],L) -->
  utterance_collect(UL,[]),
  {
    atom_concat('utterance',Ext,Label)
  }.
(1.18)
utterance_collect(L,L0) -->
  sentence(L,L0).
utterance_collect(L,L0) -->
  fragment(L,L0).
utterance_collect(L,L0) -->
  sentence(L,L1),
  utterance_collect(L1,L0).
utterance_collect(L,L0) -->
  fragment(L,L1),
  utterance_collect(L1,L0).

    Example (1.19) illustrates the parse of an utterance that has an initial imperative sentence and ends with a sentence fragment that has a noun phrase as content for its fragment layer.

(1.19)
| ?- tphrase_set_string([w('VB',';~I','Beware'), w('PUNC','!'), w('PRO;_genm_','His'), w('ADJ','flashing'), w('NS','eyes'), w('PUNC','!')]), parse(utterance('')).

(utterance (IP-IMP (VB;~I Beware)
                   (PUNC !))
           (FRAG (NP (NP-GENV (PRO;_genm_ His))
                     (ADJP (ADJ flashing))
                     (NS eyes))
                 (PUNC !)))

yes