TGrep2 and Tregex queries with the online interface

Alastair Butler

Hirosaki University



Abstract

TGrep2 and Tregex are tools for searching syntax trees. This guide introduces their shared query language which is available when searching for data with the online search interface.


1    Introduction

TGrep (Pito 1994) was the original tree-matching command-line program that came with the Penn Treebank. TGrep2 (Rohde 2005) is an implementation that adds extensions to the query language. Tregex (Levy and Andrew 2006) is yet another implementation that replicates the extended search functionality of TGrep2 while also offering a desktop application for directly browsing source trees, and even manipulating syntax trees in traditional bracketed (Penn) treebank notation. This guide provides an introduction to the extended query language shared by TGrep2 and Tregex which is available when searching for data with the online search interface.

    Search queries are expressed as patterns that consist of expressions to match nodes and relationships defining links or negated links to other nodes. Notably:

    Typically, spaces are optional in patterns. But for readability, it is usually best to place spaces between relationships and node expressions.


2    Specifying nodes in a search pattern

Single tree nodes are matched as follows:


2.1    Tips for specifying nodes with regular expressions

Specified as a string with surrounding slashes (/), a regular expression matches a node if there is a part of the node that is matched. For example, (1) matches よう, ようやく, そのような, おはよう, etc.

(1)
/よう/

The caret symbol (^) anchors the regular expression to the beginning of a matched node. So, (2) will not match そのような or おはよう but will match よう and ようやく.

(2)
/^よう/

A dollar sign ($) as the last character will anchor the regular expression to the end of a matched node. For example, the caret and dollar-sign in (3) restricts matching to あの, この and その.

(3)
/^(あ|こ|そ)の$/

Note how (3) contains round brackets ((, )) which are used to mark a subexpression called a block or capturing group. Within a capturing group the pipe symbol (|) can be used to separate subexpressions with the meaning to match at least one of the subexpressions.

    Note, care is needed when writing regular expressions that end with dollar signs. This is because you are forced to predict what the end content of a node could be. For example, suppose you wished for a node description that would match all bare noun phrases, that is, all noun phrase nodes that lack functional extensions. In addition to matching nodes that are simply NP, you will also want to match nodes like NP;{BOY}. However, you will not want to match either NP-SBJ or NP-SBJ;{BOY}, since these have subject (-SBJ) functional marking. A regular expression able to match all bare noun phrases is (4).

(4)
/^NP(;.+)?$/

Note how (4) contains a capturing group that ends with a question mark symbol (?) saying to match its preceding element (the capturing group) one or zero times. Also note how the capturing group contains a dot symbol (.) that is a character wildcard to match any character and a plus symbol (+) saying to match its preceding element (any character) one or more times. In contrast, the semi-colon character (;) will only match itself.

    A regular expression can also be constrained with ‘\b’ to mark a word boundary. Thus, (5) will match NP and NP-PRD but will not match NPR.

(5)
/^NP\b/

Word boundary markers are typically used to constrain matching nodes at the constituent level of structure (e.g., /^NP\b/, /^ADVP\b/, /^IP-MAT\b/) and this is generally a good policy to follow. However, we do need /^IP-ADV/ with no word boundary marker to match both IP-ADV and IP-ADV2 nodes. Similarly, we need /^IP-EMB/ with no word boundary marker to match both IP-EMB and IP-EMB2 nodes.

    Regular expressions are case sensitive and so (5) and (6) match different nodes.

(6)
/^np\b/

3    Relationships between nodes

Relationships define connections between nodes. There is a complete pairing of relationships with counterparts, allowing for flexibility when constructing complex relationships (see section 4). Notable relationships can be pictured as in Figure 1.

Figure 1: Notable relationships between nodes

Notable relationships between nodes

The relationships pictured in Figure 1 are described in (7).

(7)
A >> B     A is dominated by (is a descendant of) B
A ,, B     A follows B
A , B      A immediately follows B
A > B      A is immediately dominated by (is the child of) B
A . B      A immediately precedes B
A .. B     A precedes B
A $,, B    A is a sister of and follows B
A $, B     A is a sister of and immediately follows B
A == B     A and B are the same node
A $ B      A is a sister of and not equal to B
A $. B     A is a sister of and immediately precedes B
A $.. B    A is a sister of and precedes B
A <i B     B is the ith child of A (i > 0); also expressed as B >i A
A <-i B    B is the ith-to-last child of A (i > 0); also expressed as A >-i B
A < B      A immediately dominates (is the parent of) B
A << B     A dominates (is an ancestor of) B

Any relationship can be negated by immediately preceding it with an exclamation mark (!).


3.1    The dominates relation

As a demonstration of the dominates relation, the wildcard node descriptions (__) in (8) are constrained to match any of the nodes that happen to be dominated by C.

(8)
C << __               (dominates / is an ancestor of)
__ >> C               (is dominated by / is a descendant of)

That is, the wildcards will match any of the marked nodes in (9) that are not C, all of which are situated below C.

(9)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.2    The is-dominated-by relation

As a demonstration of the is-dominated-by relation, the wildcard node descriptions (__) in (10) are constrained to match any node that happens to dominate E.

(10)
E >> __               (is dominated by / is a descendant of)
__ << E               (dominates / is an ancestor of)

That is, the wildcards will match any of the marked nodes in (11) that are not E, all of which are situated above E.

(11)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.3    The immediately-dominates relation

As a demonstration of the immediately dominates relation, the wildcard node descriptions (__) in (12) are constrained to match any node that happens to be immediately dominated by C.

(12)
C < __                (immediately dominates / is the parent of)
__ > C                (is immediately dominated by / is the child of)

That is, the wildcards will match any of the marked nodes in (13) that are not C, all of which are situated immediately below C, but not any lower.

(13)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.4    The precedes relation

As a demonstration of the precedes relation, the wildcard node descriptions (__) in (14) are constrained to match any node that happens to be preceded by I.

(14)
I .. __               (precedes)
__ ,, I               (follows)

That is, the wildcards will match any of the marked nodes in (15) that are not I, all of which occur after I.

(15)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.5    The follows relation

As a demonstration of the follows relation, the wildcard node descriptions (__) in (16) are constrained to match any node that happens to be followed by I.

(16)
I ,, __               (follows)
__ .. I               (precedes)

That is, the wildcards will match any of the marked nodes in (17) that are not I, all of which occur before I.

(17)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.6    The immediately-precedes relation

As a demonstration of the immediately-precedes relation, the wildcard node descriptions (__) in (18) are constrained to match any node that happens to be immediately preceded by I.

(18)
I . __                (immediately precedes)
__ , I                (immediately follows)

That is, the wildcards will match any of the marked nodes in (19) that are not I, all of which occur immediately after I and no later.

(19)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.7    The immediately-follows relation

As a demonstration of the immediately-follows relation, the wildcard node descriptions (__) in (20) are constrained to match any node that happens to be immediately followed by I.

(20)
I , __                (immediately follows)
__ . I                (immediately precedes)

That is, the wildcards will match any of the marked nodes in (21) that are not I, all of which occur immediately before I and no earlier.

(21)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.8    The sister relation

As a demonstration of the sister relation, the wildcard node descriptions (__) in (22) are constrained to match any node that happens to be a sister of F.

(22)
F $ __                (sister)
__ $ F                (sister)

That is, the wildcards will match any of the marked nodes in (23) that are not F, both of which occur at the same level as F in the tree structure.

(23)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.9    The sister-and-precedes or sister-and-follows relations

As a demonstration of the sister-and-precedes relation, the wildcard node descriptions (__) in (24) are constrained to match any node where E is a preceding sister.

(24)
E $.. __              (sister and precedes)
__ $,, E              (sister and follows)

That is, the wildcards will match any of the marked nodes in (25) that are not E, both of which occur at the same level in the tree structure as E and occur after E.

(25)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.10    The sister-and-immediately-precedes and sister-and-immediately-follows relations

As a demonstration of the sister-and-immediately-precedes relation, the wildcard node descriptions (__) in (26) are constrained to match the node where E is the immediately preceding sister.

(26)
E $. __               (sister and immediately precedes)
__ $, E               (sister and immediately follows)

That is, the wildcards will match the marked F node in (27), which occurs at the same level in the tree structure as E and comes immediately after E.

(27)
A B C D E w0 F w1 G w2 H w3 I J w4 K L w5 M N w6 O P w7 Q R w8 S w9

3.11    The i-th child relationships

To specify the exact child position which a node must (or must not) occupy, use the i-th child relationships (<i, <-i, !<i, and !<-i; also expressed with >i, >-i, !>i, and !>-i). In these relationships, i represents a number indicating which child ‘slot’ is to be considered. Positive numbers are counts from the left edge, while negative numbers are counts from the right edge. For example, (28) finds instances of NP where the second child is a PP.

(28)
/^NP\b/ <2 /^PP\b/

Pattern (29) finds instances of NP where the last child is a PRN.

(29)
/^NP\b/ <-1 /^PRN\b/

Pattern (30) finds instances of NP which have less than three children. That is, every NP which has three or more children will have something as its third child to make matching with (30) fail.

(30)
/^NP\b/ !<3 __

To select nodes with specific numbers of children, you can use a positive and negative i-th child relationship. For example, to find NP nodes which dominate either twelve or thirteen children you would use pattern (31).

(31)
/^NP\b/ <12 __ !<14 __

3.12    Other relationships

Other available relationships between nodes, additional to those given above in (7), are given in (32):

(32)
A <<, B   B is a leftmost descendant of A
A <<- B   B is a rightmost descendant of A
A >>, B   A is a leftmost descendant of B
A >>- B   A is a rightmost descendant of B
A <, B    B is the first child of A (synonymous with A <1 B)
A >, B    A is the first child of B (synonymous with A >1 B)
A <- B    B is the last child of A (synonymous with A <-1 B)
A >- B    A is the last child of B (synonymous with A >-1 B)
A <: B    B is the only child of A
A >: B    A is the only child of B
A <<: B   A dominates B via an unbroken chain (length > 0) of unary local trees
A >>: B   A is dominated by B via an unbroken chain (length > 0) of unary local trees

4    Boolean expressions for building complex patterns

Complex Tregex patterns can be constructed by specifying Boolean expressions of relationships that connect to a node. If no operator is placed between two relationships, they are assumed to be connected by an and, which can be specified for clarity with an ampersand (&). Thus, patterns (33) and (34) both match an /^IP\b/ node which immediately dominates a /^PP\b/ node and which dominates an /^IP\b/ node.

(33)
/^IP\b/ < /^PP\b/ << /^IP\b/
(34)
/^IP\b/ < /^PP\b/ & << /^IP\b/

The second relationship ‘<< /^IP\b/’ refers to the first /^IP\b/ and not to the /^PP\b/, and is equivalent to (35).

(35)
(/^IP\b/ < /^PP\b/) << /^IP\b/

The placement of parenthesis can be used to group a node with its relationship(s), so that (36) will match an /^IP\b/ which immediately dominates a /^PP\b/ which in turn dominates some /^IP\b/.

(36)
/^IP\b/ < (/^PP\b/ << /^IP\b/)

    If a pipe (|) is placed between two relationships, only one of the relationships must be satisfied. Pattern (37) will find an /^IP\b/ node which immediately dominates a /^PP\b/ node or dominates an /^IP\b/ node.

(37)
/^IP\b/ < /^PP\b/ | << /^IP\b/

The (possibly implicit) and binds tighter than the or. So pattern (38) will find an /^IP\b/ node that has VB and VB2 children or that has /^NP-PRD\b/ and AX children.

(38)
/^IP\b/ < VB < VB2 | < /^NP-PRD\b/ < AX

    To create more complex expressions, square brackets ([, ]) can be used to group relationships to form a complex relationship. Thus, pattern (39) finds an /^IP\b/ that has a VB or /^NP-PRD\b/ child, and that precedes a subject sister or follows an /^ADVP\b/ sister.

(39)
/^IP\b/ [ < VB | < /^NP-PRD\b/ ] [ $.. /SBJ/ | $,, /^ADVP\b/ ]

Because a square bracketed expression itself acts as a complex relationship connecting to a node, an open square bracket must immediately precede a relationship or another open square bracket that itself starts a complex relationship.

    Any relationship, including a square bracketed one, can be negated by immediately preceding it with an exclamation mark (!). Thus, pattern (40) will find /^IP\b/ nodes that do not simultaneously have /^NP\b/ and /^PP\b/ children.

(40)
/^IP\b/ ![ < /^NP\b/ < /^PP\b/ ]

5    Labeled nodes and back links

Search patterns allow for a matched node to be given a unique label with an equal sign (=). Once labelled, a node can thereafter be referred back to elsewhere within the search pattern. Such a ‘back link’ is accomplished when the node name only consists of an equal sign and a label, with the label serving to refer to the node that has been assigned to that name elsewhere in the pattern. It is only possible to express patterns in which the labeled node comes prior in the connected structure of the pattern that leads to a back link.

    As an example of how back links are useful, consider the task of finding relative clauses with object traces. An object trace might be embedded arbitrarily deeply inside its relative clause, which is captured with the dominates relation (‘<< (/OB1/ < /\*T\*/)’). However, without further qualification, this is too liberal, as it will also allow the return of a relative clause that itself has no object trace but contains a relative clause with an object trace. We can add the extra qualification to prohibit the unwelcome scenario with a back link, resulting in (41).

(41)
/REL/=p << (/OB1/ < /\*T\*/ !>> (/REL/ >> =p))

The added back link statement (‘!>> (/REL/ >> =p)’) requires there to be no intervening relative clause (/REL/) between the object trace and the relative clause that the pattern serves to find.


6    Segmented patterns

Search patterns can be split into multiple segments, which are separated by colons (:). Segments following the first have to include a reference to a labeled node that is defined in a previous segment. Thus, pattern (42) will find trees where an NP immediately dominates a CP-THT.

(42)
/^NP\b/=x : =x < /^CP-THT\b/

    With labeling of complex nodes, segments can entirely replace patterns that otherwise require embedded relationships. Pattern (43) without segments and pattern (44) with segments match identical structure. For example, they will match the tree fragment (45).

(43)
/^NP\b/ < N < (/^CP-THT\b/ < (/^IP-SUB\b/ < (AX < う|よう)))
(44)
/^NP\b/=x : (=x < N) : (=x < /^CP-THT\b/=y)
: (=y < /^IP-SUB\b/=i) : (=i < AX=c) : (=c < う|よう)
(45)
NP CP-THT IP-SUB NP-SBJ *pro* PP-OB1 NP N 言葉 P-ROLE VB 改め AX よう P-COMP という N 努力

    Segments are helpful for constraining data to be sourced from somewhere particular. For example (46) finds examples of relative clauses, but only when the data is from a file with a name (content under an ID node) that includes fiction.

(46)
/REL/=x < __ : =x .. (ID < /fiction/)



References

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structure. In Calzolari, Nicoletta et al., Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy. Available at https://aclanthology.org/L06-1311/.

Pito, Richard. 1994. tgrepdoc–documentation for tgrep. University of Pennsylvania.

Rohde, Douglas. 2005. TGrep2 User Manual version 1.15. Available at: https://github.com/andreasvc/tgrep2.