TGrep2 and Tregex queries with the online interface

Alastair Butler

Hirosaki University



Abstract

TGrep2 and Tregex are tools for searching syntax trees. This guide introduces their shared query language which is available when searching for data with the online search interface.


1    Introduction

TGrep (Pito 1994) was the original tree-matching command-line program that came with the Penn Treebank. TGrep2 (Rohde 2005) is an implementation that adds extensions to the query language. Tregex (Levy and Andrew 2006) is yet another implementation that replicates the extended search functionality of TGrep2 while also offering a desktop application for directly browsing source trees, and even manipulating syntax trees in traditional bracketed (Penn) treebank notation. This guide provides an introduction to the extended query language which is shared by both TGrep2 and Tregex and which is available when searching for data with the online search interface.

    Search queries are expressed as patterns that consist of node descriptions to match nodes and relationships establishing links or negated links to other nodes. Notably:

    Typically, spaces are optional in patterns. But for readability, it is usually best to place spaces between relationships and node expressions.


2    Node descriptions

Single tree nodes are matched by node descriptions:

    If a simple character string or a regular expression begins with an exclamation mark (!), the matching process will be complemented. That is, matches will turn into non-matches, and vice-versa. For example, !VB matches any node that is not VB, and !/^I/ can match any node that does not start with I.


2.1    Tips for using node descriptions that are regular expressions

Specified as a string with surrounding slashes (/), a regular expression matches a node if there is a part of the node that is matched. For example, (1) matches よう, ようやく, そのような, おはよう, etc.

(1)
/よう/

The caret symbol (^) anchors the regular expression to the beginning of a matched node. So, (2) cannot match そのような or おはよう but can match よう and ようやく.

(2)
/^よう/

A dollar sign ($) as the last character will anchor the regular expression to the end of a matched node. For example, the caret and dollar-sign in (3) restricts matching to あの, この and その.

(3)
/^(あ|こ|そ)の$/

Note how (3) contains round brackets ((, )) which are used to mark a subexpression called a block or capturing group. Within a capturing group the pipe symbol (|) can be used to separate subexpressions with the meaning to match at least one of the subexpressions.

    Note, care is needed when writing regular expressions that end with dollar signs. This is because you are forced to predict what the end content of a node could be. For example, suppose you wished for a node description that would match all bare noun phrases, that is, all noun phrase nodes that lack functional extensions. In addition to matching nodes that are simply NP, you will also want to match nodes like NP;{BOY}. However, you will not want to match either NP-SBJ or NP-SBJ;{BOY}, since these have subject (-SBJ) functional marking. A regular expression able to match all bare noun phrases is (4).

(4)
/^NP(;.+)?$/

Note how (4) contains a capturing group that ends with a question mark symbol (?) saying to match its preceding element (the capturing group) one or zero times. Also note how the capturing group contains a dot symbol (.) that is a character wildcard to match any character and a plus symbol (+) saying to match its preceding element (any character) one or more times. In contrast, the semi-colon character (;) will only match itself.

    A regular expression can also be constrained with ‘\b’ to mark a word boundary. Thus, (5) can match NP and NP-PRD but will not match NPR.

(5)
/^NP\b/

Word boundary markers are typically used to constrain matching nodes at the constituent level of structure (e.g., /^NP\b/, /^ADVP\b/, /^IP-MAT\b/) and this is generally a good policy to follow. However, we do need /^IP-ADV/ with no word boundary marker to match both IP-ADV and IP-ADV2 nodes. Similarly, we need /^IP-EMB/ with no word boundary marker to match both IP-EMB and IP-EMB2 nodes.

    Regular expressions are case sensitive and so (5) and (6) match different nodes.

(6)
/^np\b/

    The star symbol (*) is a metacharacter saying to match its preceding element (any character) zero or more times. To match an instance within a node label of this character or any other metacharacter (“^”, “$”, “.”, “+”), add a backslash (\) as an “escape symbol”. For example, (7) searches for nodes that begin with “*”.

(7)
/^\*/

3    Relationships between nodes

Relationships establish connections between nodes matched by node descriptions. There is a complete pairing of relationships with counterparts, allowing for flexibility when constructing complex relationships (see section 4). Notable relationships can be pictured as in Figure 1.

Figure 1: Notable relationships between nodes

Notable relationships between nodes

The relationships pictured in Figure 1 are described in (8). The instances of A and A' in (8) are node descriptions able to match the node labelled A in Figure 1. The instances of B in (8) are node descriptions able to match nodes labelled B in Figure 1.

(8)
A >> B     A is dominated by (is a descendant of) B
A ,, B     A follows B
A , B      A immediately follows B
A > B      A is immediately dominated by (is the child of) B
A . B      A immediately precedes B
A .. B     A precedes B
A $,, B    A is a sister of and follows B
A $, B     A is a sister of and immediately follows B
A == A'    A and A' match the same node
A $ B      A is a sister of and not equal to B
A $. B     A is a sister of and immediately precedes B
A $.. B    A is a sister of and precedes B
A <i B     B is the ith child of A (i > 0); also expressed as B >i A
A <-i B    B is the ith-to-last child of A (i > 0); also expressed as A >-i B
A < B      A immediately dominates (is the parent of) B
A << B     A dominates (is an ancestor of) B

Any relationship can be negated by immediately preceding it with an exclamation mark (!). Examples of patterns with negated relations are given below in sections 3.11, 4 and 5.


3.1    The immediately-dominates relation

As a demonstration of the immediately dominates relation, the wildcard node descriptions (__) in (9) are constrained to match any node that happens to be immediately dominated by a node labelled C.

(9)
C < __                (immediately dominates / is the parent of)
__ > C                (is immediately dominated by / is the child of)

As examples of search matching, consider the tree in (10) which has a node labelled C.

(10)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (9) can match any of the marked nodes in (10) (one match for each search hit) that are not C, both of which are situated immediately below C, but not any lower.


3.2    The dominates relation

As a demonstration of the dominates relation, the wildcard node descriptions (__) in (11) are constrained to match any of the nodes that happen to be dominated by a node labelled C.

(11)
C << __               (dominates / is an ancestor of)
__ >> C               (is dominated by / is a descendant of)

As examples of search matching, consider the tree in (12) which has a node labelled C.

(12)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (11) can match any of the marked nodes in (12) (one match for each search hit) that are not C, all of which are situated below C.


3.3    The is-dominated-by relation

As a demonstration of the is-dominated-by relation, the wildcard node descriptions (__) in (13) are constrained to match any node that happens to dominate a node labelled E.

(13)
E >> __               (is dominated by / is a descendant of)
__ << E               (dominates / is an ancestor of)

As examples of search matching, consider the tree in (14) which has a node labelled E.

(14)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (13) can match any of the marked nodes in (14) (one match for each search hit) that are not E, all of which are situated above E.


3.4    The precedes relation

As a demonstration of the precedes relation, the wildcard node descriptions (__) in (15) are constrained to match any node that happens to be preceded by a node labelled I.

(15)
I .. __               (precedes)
__ ,, I               (follows)

As examples of search matching, consider the tree in (16) which has a node labelled I.

(16)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (15) can match any of the marked nodes in (16) (one match for each search hit) that are not I, all of which occur after I.


3.5    The follows relation

As a demonstration of the follows relation, the wildcard node descriptions (__) in (17) are constrained to match any node that happens to be followed by a node labelled I.

(17)
I ,, __               (follows)
__ .. I               (precedes)

As examples of search matching, consider the tree in (18) which has a node labelled I.

(18)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (17) can match any of the marked nodes in (18) (one match for each search hit) that are not I, all of which occur before I.


3.6    The immediately-precedes relation

As a demonstration of the immediately-precedes relation, the wildcard node descriptions (__) in (19) are constrained to match any node that happens to be immediately preceded by a node labelled I.

(19)
I . __                (immediately precedes)
__ , I                (immediately follows)

As examples of search matching, consider the tree in (20) which has a node labelled I.

(20)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (19) can match any of the marked nodes in (20) (one match for each search hit) that are not I, all of which occur immediately after I and no later.


3.7    The immediately-follows relation

As a demonstration of the immediately-follows relation, the wildcard node descriptions (__) in (21) are constrained to match any node that happens to be immediately followed by a node labelled I.

(21)
I , __                (immediately follows)
__ . I                (immediately precedes)

As examples of search matching, consider the tree in (22) which has a node labelled I.

(22)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (21) can match any of the marked nodes in (22) (one match for each search hit) that are not I, all of which occur immediately before I and no earlier.


3.8    The sister relation

As a demonstration of the sister relation, the wildcard node descriptions (__) in (23) are constrained to match any node that happens to be a sister of a node labelled F.

(23)
F $ __                (sister)
__ $ F                (sister)

As examples of search matching, consider the tree in (24) which has a node labelled F.

(24)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (23) can match any of the marked nodes in (24) (one match for each search hit) that are not F, both of which occur at the same level as F in the tree structure.


3.9    The sister-and-precedes or sister-and-follows relations

As a demonstration of the sister-and-precedes relation, the wildcard node descriptions (__) in (25) are constrained to match any node where a node labelled E is a preceding sister.

(25)
E $.. __              (sister and precedes)
__ $,, E              (sister and follows)

As examples of search matching, consider the tree in (26) which has a node labelled E.

(26)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (25) can match any of the marked nodes in (26) (one match for each search hit) that are not E, both of which occur at the same level in the tree structure as E and occur after E.


3.10    The sister-and-immediately-precedes and sister-and-immediately-follows relations

As a demonstration of the sister-and-immediately-precedes relation, the wildcard node descriptions (__) in (27) are constrained to match the node where a node labelled E is the immediately preceding sister.

(27)
E $. __               (sister and immediately precedes)
__ $, E               (sister and immediately follows)

As examples of search matching, consider the tree in (28) which has a node labelled E.

(28)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

The wildcards of (27) can match the marked F node in (28), which occurs at the same level in the tree structure as E and comes immediately after E.


3.11    The i-th child relationships

To specify the exact child position which a node must (or must not) occupy, use the i-th child relationships (<i, <-i, !<i, and !<-i; also expressed with >i, >-i, !>i, and !>-i). In these relationships, i represents a number indicating which child ‘slot’ is to be considered. Positive numbers are counts from the left edge, while negative numbers are counts from the right edge. For example, (29) finds instances of NP where the second child is a PP.

(29)
/^NP\b/ <2 /^PP\b/

Pattern (30) finds instances of NP where the last child is a PRN.

(30)
/^NP\b/ <-1 /^PRN\b/

Pattern (31) finds instances of NP which have less than three children. That is, every NP which has three or more children will have something as its third child to make matching with (31) fail.

(31)
/^NP\b/ !<3 __

To select nodes with specific numbers of children, you can use a positive and negative i-th child relationship. For example, to find NP nodes which dominate either twelve or thirteen children you would use pattern (32).

(32)
/^NP\b/ <12 __ !<14 __

3.12    Other relationships

Other available relationships between nodes, additional to those given above in (8), are given in (33):

(33)
A <<, B   B is a leftmost descendant of A
A <<- B   B is a rightmost descendant of A
A >>, B   A is a leftmost descendant of B
A >>- B   A is a rightmost descendant of B
A <, B    B is the first child of A (synonymous with A <1 B)
A >, B    A is the first child of B (synonymous with A >1 B)
A <- B    B is the last child of A (synonymous with A <-1 B)
A >- B    A is the last child of B (synonymous with A >-1 B)
A <: B    B is the only child of A
A >: B    A is the only child of B
A <<: B   A dominates B via an unbroken chain (length > 0) of unary local trees
A >>: B   A is dominated by B via an unbroken chain (length > 0) of unary local trees

4    Boolean expressions for building complex patterns

Complex Tregex patterns can be constructed by specifying Boolean expressions of relationships that connect to a node. If no operator is placed between two relationships, they are assumed to be connected by an and, which can be specified for clarity with an ampersand (&). Thus, patterns (34) and (35) both match an /^IP\b/ node which immediately dominates a /^PP\b/ node and which dominates an /^IP\b/ node.

(34)
/^IP\b/ < /^PP\b/ << /^IP\b/
(35)
/^IP\b/ < /^PP\b/ & << /^IP\b/

The second relationship ‘<< /^IP\b/’ refers to the first /^IP\b/ and not to the /^PP\b/, and is equivalent to (36).

(36)
(/^IP\b/ < /^PP\b/) << /^IP\b/

The placement of parenthesis can be used to group a node with its relationship(s), so that (37) can match an /^IP\b/ which immediately dominates a /^PP\b/ which in turn dominates some /^IP\b/.

(37)
/^IP\b/ < (/^PP\b/ << /^IP\b/)

    If a pipe (|) is placed between two relationships, only one of the relationships must be satisfied. Pattern (38) will find an /^IP\b/ node which immediately dominates a /^PP\b/ node or dominates an /^IP\b/ node.

(38)
/^IP\b/ < /^PP\b/ | << /^IP\b/

The (possibly implicit) and binds tighter than the or. So pattern (39) will find an /^IP\b/ node that has VB and VB2 children or that has /^NP-PRD\b/ and AX children.

(39)
/^IP\b/ < VB < VB2 | < /^NP-PRD\b/ < AX

    To create more complex expressions, square brackets ([, ]) can be used to group relationships to form a complex relationship. Thus, pattern (40) finds an /^IP\b/ that has a VB or /^NP-PRD\b/ child, and that precedes a subject sister or follows an /^ADVP\b/ sister.

(40)
/^IP\b/ [ < VB | < /^NP-PRD\b/ ] [ $.. /SBJ/ | $,, /^ADVP\b/ ]

Because a square bracketed expression itself acts as a complex relationship connecting to a node, an open square bracket must immediately precede a relationship or another open square bracket that itself starts a complex relationship.

    Any relationship, including a square bracketed one, can be negated by immediately preceding it with an exclamation mark (!). Thus, pattern (41) will find /^IP\b/ nodes that do not simultaneously have /^NP\b/ and /^PP\b/ children.

(41)
/^IP\b/ ![ < /^NP\b/ < /^PP\b/ ]

5    Labeled nodes and back links

Search patterns allow for a matched node to be given a unique label with an equal sign (=). Once labelled, a node can thereafter be referred back to elsewhere within the search pattern. Such a ‘back link’ is accomplished when the node name only consists of an equal sign and a label, with the label serving to refer to the node that has been assigned to that name elsewhere in the pattern. It is only possible to express patterns in which the labelled node comes prior in the connected structure of the pattern that leads to a back link.

    As an example of how back links are useful, consider the task of finding relative clauses with object traces. An object trace might be embedded arbitrarily deeply inside its relative clause, which is captured with the dominates relation (‘<< (/OB1/ < /\*T\*/)’). However, without further qualification, this is too liberal, as it will also allow the return of a relative clause that itself has no object trace but contains a relative clause with an object trace. We can add the extra qualification to prohibit the unwelcome scenario with a back link, resulting in (42).

(42)
/REL/=p << (/OB1/ < /\*T\*/ !>> (/REL/ >> =p))

The added back link statement (‘!>> (/REL/ >> =p)’) requires there to be no intervening relative clause (/REL/) between the object trace and the relative clause that the pattern serves to find.


6    Segmented patterns

Search patterns can be split into multiple segments, which are separated by colons (:). Segments following the first have to include a reference to a labelled node that is introduced in a previous segment. Thus, pattern (43) will find trees where an NP immediately dominates a CP-THT.

(43)
/^NP\b/=x : =x < /^CP-THT\b/

    With labeling of complex nodes, segments can entirely replace patterns that otherwise require embedded relationships. Pattern (44) without segments and pattern (45) with segments match identical structure. For example, they can match the tree fragment (46).

(44)
/^NP\b/ < N < (/^CP-THT\b/ < (/^IP-SUB\b/ < (AX < う|よう)))
(45)
/^NP\b/=x : (=x < N) : (=x < /^CP-THT\b/=y)
: (=y < /^IP-SUB\b/=i) : (=i < AX=c) : (=c < う|よう)
(46)
NP CP-THT IP-SUB NP-SBJ *pro* PP-OB1 NP N 言葉 P-ROLE VB 改め AX よう P-COMP という N 努力

    Segments are helpful for constraining data to be sourced from somewhere particular. For example (47) finds examples of relative clauses, but only when the data is from a file with a name (content under an ID node) that includes fiction.

(47)
/REL/=x < __ : =x .. (ID < /fiction/)



References

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structure. In Calzolari, Nicoletta et al., Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy. Available at https://aclanthology.org/L06-1311/.

Pito, Richard. 1994. tgrepdoc–documentation for tgrep. University of Pennsylvania.

Rohde, Douglas. 2005. TGrep2 User Manual version 1.15. Available at: https://github.com/andreasvc/tgrep2.