Tregex Search Guide

This guide provides an introduction to the Tregex query language. The content will generally apply to the search of all bracketed (Penn) style parsed corpora. This guide sources explanatory detail from README-tregex.txt, which is part of the Tregex distribution. Explanation is also taken from the manual that accompanies TGrep2, another search program with a closely related query language.

1 Introduction

Tregex by Roger Levy and Galen Andrew (Levy and Andrew 2006) is a tool for browsing, searching, and even manipulating syntax trees in traditional bracketed (Penn) treebank notation. The tool is available at http://nlp.stanford.edu/software/tregex.shtml.

Tregex can be invoked either with a browser interface by Anna Rafferty that works as a desktop application, or as a command line tool. This guide provides an introduction to the query language that Tregex uses when searching data.

The Tregex query language closely resembles the query languages of TGrep by Richard Pito (Pito 1994) and TGrep2 by Douglas Rohde (Rohde 2005). TGrep was the original tree-matching program that came with the Penn Treebank. TGrep2 is a reimplementation that also added extensions, available at https://github.com/andreasvc/tgrep2.

Following TGrep and TGrep2, Tregex queries are expressed as patterns that consist of expressions to match nodes and relationships defining links or negated links to other nodes. Nodes of searched trees are matched either with simple character strings, or OR'd character strings, or regular expressions (see section 2). Possible node relationships are illustrated and detailed in section 3.

A complex node expression consists of a node expression followed by Boolean expressions of relationships to other nodes, as explained in Section 4. Also, nodes can be assigned labels and may be referred to elsewhere in the pattern by those labels (see section 5). If you write a node description using a regular expression, you can assign its matching groups to variable names (see section 6). Finally, patterns can be broken into multiple segments (see section 7).

Typically, spaces are optional in patterns. But for readability, it is usually best to place spaces between relationships, node expressions, and other operators.

2 Specifying nodes in a Tregex pattern

Single tree nodes are matched in Tregex as follows:

A wildcard, specified by two underscores (__), matches any node.
A simple character string, such as IP-ADV, NP, ADV, Taro, etc., matches if it is the full content of the matched node.
A pattern can be built from simple character strings separated by pipes (|) with no spaces. Such a pattern will match a node if any of its components match the node's full content. For example, NP|PP will identify NP or PP nodes.
A regular expression, indicated by surrounding slashes (/), such as /^NP\b/, can be used as a description to match nodes which possibly differ in their full content.
If a simple character string or regular expression begins with !, the matching process will be complemented. That is, matches will turn into non-matches, and vice-versa. For example, !VB will match all nodes that are not VB, and !/^I/ will match any node that does not start with I.

2.1 Tregex regular expressions

Tregex regular expressions follow Java conventions for expressing regular expressions. For details, see e.g.: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.

Specified as a string, a regular expression matches a node if there is a part of the node that is matched. For example, /[PCV]P/ matches PP, CP-THT, ADVP, etc.

The caret (^) anchors the regular expression to the beginning of a matched node. So, /^[PCV]P/ will not match ADVP but will match PP and CP-THT. A dollar sign ($) as the last character will anchor the regular expression to the end of a matched node. For example, /PRD$/ will not match NP-PRD2 but will match NP-PRD. Use of both the caret and dollar-sign in /^[NP]P$/ constrains the match to only NP and PP. A regular expression can also be constrained with \b to mark a word boundary. Thus, /^NP\b/ will match NP and NP-PRD but will not match NPR.

By default, regular expressions are case sensitive and so /^The$/ and /^the$/ match different nodes. Case insensitivity can be turned on by including the flag (?i) and then everything to the right will be case-insensitive, e.g., /^(?i)the$/ matches: THE The, the, tHE, etc. Having turned case insensitivity on, it is possible to turn case insensitivity off with the flag (?-i) to become case sensitive again, e.g., /^t(?i)h(?-i)e$/ matches: the, tHe. It is also possible to limit the region of case insensitivity to a given variable group (see section 6), e.g., while /^t(h)e$/ matches: the, /^t(?i:h)e$/ matches: the, tHe.

3 Relationships between nodes

Relationships define connections between nodes. There is a complete pairing of relationships with counterparts, allowing for flexibility when constructing complex relationships (see section 4). Notable relationships are given in (1):

(1)

A << B     A dominates (is an ancestor of) B
A >> B     A is dominated by (is a descendant of) B
A < B      A immediately dominates (is the parent of) B
A > B      A is immediately dominated by (is the child of) B
A .. B     A precedes B
A ,, B     A follows B
A . B      A immediately precedes B
A , B      A immediately follows B
A $ B      A is a sister of and not equal to B
A $.. B    A is a sister of and precedes B
A $,, B    A is a sister of and follows B
A $. B     A is a sister of and immediately precedes B
A $, B     A is a sister of and immediately follows B

The following presents pictures grouping the relationships of (1) as forward and backward links:

(2)

C << __ (dominates, is an ancestor of)
__ >> C (is dominated by, is a descendant of)

(3)

E >> __ (is dominated by, is a descendant of)
__ << E (dominates, is an ancestor of)

(4)

C < __ (immediately dominates, is the parent of)
__ > C (is immediately dominated by, is the child of)

(5)

I .. __ (precedes)
__ ,, I (follows)

(6)

I ,, __ (follows)
__ .. I (precedes)

(7)

I . __ (immediately precedes)
__ , I (immediately follows)

(8)

I , __ (immediately follows)
__ . I (immediately precedes)

(9)

F $ __ (sister)
__ $ F (sister)

(10)

E $.. __ (sister and precedes)
__ $,, E (sister and follows)

(11)

E $. __ (sister and immediately precedes)
__ $, E (sister and immediately follows)

3.1 The `i`-th child relationships

To specify the exact child position which a node must occupy, use the i-th child relationships defined in (12):

(12)

A <i B     B is the ith child of A (i > 0) (the first child is <1)
A >i B     A is the ith child of B (i > 0) (the first child is >1)
A <-i B    B is the ith-to-last child of A (i > 0) (the last child is <-1)
A >-i B    A is the ith-to-last child of B (i > 0) (the last child is >-1)

In these relationships, i represents a number indicating which child slot is to be considered. Positive numbers are counts from the left edge, while negative numbers are counts from the right edge. For example, (13) finds NP nodes where the second child is a PP.

(13)

NP <2 PP

Pattern (14) finds NP nodes where the last child is a PP.

(14)

NP <-1 PP

Pattern (15) finds NP nodes which have less than three children. That is, every NP which has three or more children will have something as its third child to make matching with (15) fail.

(15)

NP !<3 __

To select nodes with specific numbers of children you can use a positive and negative i-th child relationship. For example, to find NP nodes which dominate either two or three children you would use pattern (16):

(16)

NP <2 __ !<4 __

17 Other relationships

Other available relationships between nodes, additional to those given above in (1) and (12), are given in (17):

(17)

A == B     A and B are the same node
A <<, B    B is a leftmost descendant of A
A <<- B    B is a rightmost descendant of A
A >>, B    A is a leftmost descendant of B
A >>- B    A is a rightmost descendant of B
A <, B     B is the first child of A (synonymous with A <1 B)
A >, B     A is the first child of B (synonymous with A >1 B)
A <- B     B is the last child of A (synonymous with A <-1 B)
A >- B     A is the last child of B (synonymous with A >-1 B)
A <: B     B is the only child of A
A >: B     A is the only child of B
A <<: B    A dominates B via an unbroken chain (length > 0) of unary local trees
A >>: B    A is dominated by B via an unbroken chain (length > 0) of unary local trees

4 Boolean expressions for building complex patterns

Complex Tregex patterns can be constructed by specifying Boolean expressions of relationships that connect to a node. If no operator is placed between two relationships, they are assumed to be connected by an and, which can be specified for clarity with an ampersand (&). Thus, patterns (18) and (19) both match an IP node which immediately dominates a PP node and which dominates an IP node.

(18)

/^IP\b/ < /^PP\b/ << /^IP\b/

(19)

/^IP\b/ < /^PP\b/ & << /^IP\b/

The second relationship << /^IP\b/ refers to the first IP and not to the PP, and is equivalent to (20).

(20)

(/^IP\b/ < /^PP\b/) << /^IP\b/

The placement of parenthesis can be used to group a node with its relationship(s), so that (21) will match an IP which immediately dominates a PP which in turn dominates some IP.

(21)

/^IP\b/ < (/^PP\b/ << /^IP\b/)

If a pipe (|) is placed between two relationships, only one of the relationships must be satisfied. Pattern (22) will find an IP node which immediately dominates a PP node or dominates an IP node:

(22)

/^IP\b/ < /^PP\b/ | << /^IP\b/

The (typically implicit) and binds tighter than the or. So pattern (23) will find an IP node that has VB and VB2 children or that has NP-PRD and AX children.

(23)

/^IP\b/ < VB < VB2 | < NP-PRD < AX

To create more complex expressions, square brackets ([]) can be used to group relationships to form a complex relationship. Thus, pattern (24) finds an IP that has a VB or NP-PRD child, and that precedes a subject sister or follows an ADVP sister.

(24)

/^IP\b/ [ < VB | < NP-PRD ] [ $.. /SBJ/ | $,, ADVP ]

Because a square bracketed expression itself acts as a complex relationship connecting to a node, an open square bracket must immediately precede a relationship or another open square bracket that itself starts a complex relationship.

Any relationship, including a square bracketed one, can be negated by immediately preceding it with an exclamation mark (!). Thus, pattern (25) will find IP nodes that do not simultaneously have NP and PP children.

(25)

/^IP\b/ ! [ < /^NP\b/ < /^PP\b/ ]

5 Labeled nodes and back links

Like TGrep2, Tregex allows for a node to be given a unique label and then referred to elsewhere by that label. If a node name is followed by an equal sign (=) and a label, the label will be assigned to the node. If the node name only consists of an equal sign and a label, it serves as a back link that refers to the node that has been assigned to that name elsewhere in the pattern. In deviance to TGrep2, it is only possible to express patterns in which the labeled node comes prior in the connected structure of the pattern that leads to a back link.

Having back links leads to the creation of patterns with cycles in the link structure. For example, imagine we wanted to find a clause node, /^IP\b/, that dominates a clausal complement, /THT/, and a /^PP\b/, such that the /THT/ precedes the /^PP\b/. This requires stating relationships between the /^IP\b/ and the /THT/, between the /^IP\b/ and the /^PP\b/, and between the /THT/ and the /^PP\b/. With access to labels, the query can be written as (26).

(26)

/^IP\b/=foo << (/THT/ .. (/^PP\b/ >> =foo))

The /^IP/=foo matches any tree node whose name begins with IP. Furthermore, when a matching tree node is found, it is given the label foo. Later, /^PP\b/ >> =foo indicates that the PP must be dominated by that very same node, not just any IP.

As another example of how back links are useful, consider the task of finding relative clauses with object traces. An object trace might be embedded arbitrarily deeply inside its relative clause, which is captured with the dominates relation (<< (/OB1/ < /\*T\*/)). However, without further qualification, this is too liberal, as it will also allow the return of a relative clause that itself has no object trace but contains a relative clause with an object trace. We can add the extra qualification to prohibit the unwelcome scenario with a back link, resulting in (27).

(27)

/REL/=p << (/OB1/ < /\*T\*/ !>> (/REL/ >> =p))

The added back link statement (!>> (/REL/ >> =p)) requires there to be no intervening relative clause (/REL/) between the object trace and the relative clause that the pattern serves to find.

6 Variable groups

If you write a node description using a regular expression, you can assign its matching groups, established with round brackets internal to the regular expression, to variable names. The syntax is:

(28)

/regular-expression/#group-number%variable-name

If more than one node has a group assigned to the same variable name, then matching will only occur when all such groups capture the same string. For example, pattern (29) will find identical adjacent terminal nodes (reduplicated words).

(29)

/^(.+)$/#1%word !< __ . (/^(.+)$/#1%word !< __)

Matched nodes have to be terminal because of the !< __ condition that states how there can be no children, and adjacent because of .. With #1, word takes its value to be the first group in the regular expression, which is the word that needs to have occurred twice in a row.

7 Segmented patterns

Similar to TGrep2, Tregex provides a way to split patterns into multiple segments, which are separated by colons (:). Segments following the first can begin with a reference to a labeled node that is defined in a previous segment. Thus, pattern (30) will find trees where an NP immediately dominates a CP-THT.

(30)

NP=x : =x < CP-THT

With labeling of complex nodes, segments can entirely replace patterns that otherwise require embedded relationships. For example, tree fragment (31) is matched by pattern (32) without segments as well as pattern (33) with segments.

(31)

(32)

NP < (N < idea) < (CP-THT < (C < that)) > PP

(33)

NP=x : (=x < N=n) : (=n < idea) : (=x < CP-THT=y) : (=y < C=c) : (=c < that) : (=x > PP)

Segments can make patterns easier to read and debug. Segments can also assist with pattern reuse. Patterns can frequently be mostly the same while differing in only one region. For such cases, the region that changes can be placed in its own segment, so that full patterns are built when a unique segment is appended to a common base, or vice versa.

Unlike with TGrep2, segments need not be related. Thus pattern (34) finds trees with an IP-ADV and IP-SUB, without other requirements for how these nodes relate to one another.

(34)

IP-ADV : IP-SUB

Pattern (35) will match both an extraposed element and its site of interpretation, wheresoever these nodes happen to be in the tree, using variable groups (see section 6) assigned to the same variable to ensure there is a shared index when nodes are matched.

(35)

/^.+-(.*)$/#1%index : /^\*ICH\*-(.*)$/#1%index

Note that the colon has a low precedence, so that (36) matches trees with an IP-SUB that does not immediately dominate a VBP and an IP-ADV that does immediately dominate a VBP.

(36)

IP-ADV : (IP-SUB !< VBP) < VBP

References

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structure. In 5th International conference on Language Resources and Evaluation.

Pito, Richard. 1994. tgrepdoc–documentation for tgrep. University of Pennsylvania.

Rohde, Douglas. 2005. TGrep2 User Manual version 1.15.