An Unofficial Tsurgeon Guide

Alastair Butler

Hirosaki University




Abstract

The Tsurgeon program (Levy and Andrew 2006) is a software tool for manipulating bracketed (Penn) style trees. This guide provides an introduction to the tool with examples of basic and advanced use. This guide sources explanatory detail from README-tsurgeon.txt, which is part of the Tregex distribution that includes Tsurgeon. In discussing Tregex patterns, explanatory detail is also sourced from the TGrep2 manual that accompanies the TGrep2 (Rohde 2005) search tool.




1   Introduction

The Tsurgeon program (Levy and Andrew 2006) is a software tool for manipulating bracketed (Penn) style trees. The tool is available from http://nlp.stanford.edu/software/tregex.shtml. It works by taking trees specified with round bracketed notation as input, and with each input tree it tries to match one or more Tregex patterns to the tree, and for each successful pattern match it continues to apply surgical actions to the tree until the activating match no longer holds true. After all pairings of a Tregex pattern with Tsurgeon actions have been applied to each input tree, resulting trees are printed.

    This guide is structured as follows. Section 2 details the required format for Tsurgeon scripts that consist of one or perhaps many pairings of a Tregex pattern with Tsurgeon action(s). Section 3 sketches the workings of a full example. Section 4 details Tregex patterns. Section 5 demonstrates in turn each of the available Tsurgeon actions. Section 6 adds the possibility of using macros with the macro processor m4. Section 7 presents an extended example. Section 8 notes a way to run Tsurgeon scripts from the command prompt.


2   Tsurgeon script format

Tsurgeon scripts are a combination of a Tregex pattern to match input trees and a series of Tsurgeon actions to perform on matched trees. The Tsurgeon program is unforgiving about the required line formatting of a script, because this is how Tregex patterns are distinguished from Tsurgeon actions. In particular:

    The following is an example of a correctly formatted script:

Example 1


  !/^(WORD$|\*)/=x !< __ !> /^PU/

  relabel x WORD


  WORD=x

  delete x


3   Pattern-action based programming

When Tsurgeon runs your script, it will read in each tree of the input, one after the other. Each time it reads in a tree, it will see if your Tregex pattern finds a match. If it does, then it will keep performing the Tsurgeon action(s) for the active Tregex pattern, stopping only when the Tregex pattern no longer matches. You can define as many pairings of a Tregex pattern with Tsurgeon action(s) as you like. If more than one pairing occurs, each pairing is performed in the order they appear in your script. So Tsurgeon is, in effect, three nested loops: from the input go over the trees; for each tree go over the patterns and, if a Tregex pattern matches, keep going over the Tsurgeon action(s), until there is no longer a match for the Tregex pattern.

3.1   Program behaviour

To see program behaviour, consider Example 1 of section 2. The first line, repeated as (1), specifies with a Tregex pattern the initial tree structure looked for.

(1)
!/^(WORD$|\*)/=x !< __ !> /^PU/

See section 4 for details about Tregex patterns. Here it is enough to know that (1) finds trees when there is a node such that it:

1.
is not WORD,
2.
does not start with ‘*’,
3.
does not dominate any other node, and
4.
is not dominated by a node that starts with PU.

For example, tree (2) is matched:

(2)
(IP-IMP (NP-SBJ *speaker+hearer*)
        (VB 考え)
        (MD よう)
        (PU 。))

    Nodes of a found tree are named when ‘=’ is appended to the expression for finding a node label. That name can be referred to by subsequent Tsurgeon actions triggered by the same match. With the first line pattern, found nodes can be referred to with ‘x’.

    The Tsurgeon action for the first Tregex pattern, repeated as (3), is given on the third line of the script, which follows an empty line.

(3)
relabel x WORD

This tree transforming action brings about a relabelling to WORD of the node referred to with ‘x’ following the match with pattern (1). The combination of this initial Tregex pattern and Tsurgeon action is to transform the example matched tree (2) into (4).

(4)
(IP-IMP (NP-SBJ *speaker+hearer*)
        (VB WORD)
        (MD WORD)
        (PU 。))

Note how (4) is reached by action (3) being applied twice, after which pattern (1) is no longer able to find a match.

    Following action (3), there are then two empty lines in Example 1, making line 6 that follows a new pattern to find trees that contain a node name that matches the simple string pattern WORD which can thereafter be referred to with ‘x’. There is then another empty line, making line 8 that follows a Tsurgeon action to remove the node referred to with ‘x’, resulting in (5) as the overall output from running the program of Example 1 on the input tree (2).

(5)
(IP-IMP (NP-SBJ *speaker+hearer*)
        VB
        MD
        (PU 。))

3.2   Avoiding infinite loops

By first changing all words to WORD and then removing all WORD instances, Example 1 demonstrates how the match of a Tregex pattern with Tsurgeon action(s) will apply repeatedly until no further match is possible. Attractive consequences of this runtime behaviour are that trees may match Tregex patterns multiple times, and earlier Tsurgeon actions may enable subsequent matches of the active Tregex pattern for further Tsurgeon actions. But with such power there has to come responsibility, with extra special care needing to be made to avoid accidentally starting an infinite loop. For example, consider missing the initial ‘!’ (negation) sign from the first pattern of Example 1, to give:

Example 2


  /^(WORD$|\*)/=x !< __ !> /^PU/

  relabel x WORD

The changed pattern will find terminal nodes that are not dominated by nodes beginning with PU, and that are either WORD or character strings that begin with ‘*’. This pattern will match the same tree (2) as seen with the discussion of Example 1, while now it is the occurrence of *speaker+hearer* that is changed to WORD, resulting in (6).

(6)
(IP-IMP (NP-SBJ WORD)
        (VB 考え)
        (MD よう)
        (PU 。))

The transformed tree of (6) continues to match the triggering pattern because of the presence of WORD resulting in the replacement of WORD for WORD, which continues to match, and so the loop of changing WORD into WORD continues with no termination prospect.


4   Tregex patterns

Patterns are expressed with the Tregex pattern language. This closely resembles the query language of TGrep (Pito 1994), and includes extensions from TGrep2 (Rohde 2005). TGrep was the original tree-matching program that came with the Penn Treebank. TGrep2 is a reimplementation with extensions, available at http://tedlab.mit.edu/~dr/TGrep2/. Notably:

4.1   Specifying nodes in a Tregex pattern

Single tree nodes are matched as follows:

4.1.1   Tregex regular expressions

Tregex regular expressions follow Java conventions for expressing regular expressions. For details, see e.g.: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.

    Specified as a string, a regular expression matches a node if there is a part of the node that is matched. For example, /[PCV]P/ matches PP, CP-THT, ADVP, etc.

    The caret (‘^’) anchors the regular expression to the beginning of a matched node. So, /^[PCV]P/ will not match ADVP but will match PP and CP-THT. A dollar sign (‘$’) as the last character will anchor the regular expression to the end of a matched node. For example, /PRD$/ will not match NP-PRD2 but will match NP-PRD. Use of both the caret and dollar-sign in /^[NP]P$/ constrains the match to only NP and PP. A regular expression can also be constrained with ‘\b’ to mark a word boundary. Thus, /^NP\b/ will match NP and NP-PRD but will not match NPR.

    By default, regular expressions are case sensitive and so /^The$/ and /^the$/ match different nodes. Case insensitivity can be turned on by including the flag (?i) and then everything to the right will be case-insensitive, e.g., /^(?i)the$/ matches: THE, The, the, tHE, etc. Having turned case insensitivity on, it is possible to turn case insensitivity off with the flag (?-i) to become case sensitive again, e.g., /^t(?i)h(?-i)e$/ matches: the, tHe. It is also possible to limit the region of case insensitivity to a given variable group (see section 4.5), e.g., while /^t(h)e$/ matches: the, /^t(?i:h)e$/ matches: the, tHe.

4.2   Relationships between nodes

Relationships define connections between nodes. There is a complete pairing of relationships with counterparts, allowing for flexibility when constructing complex relationships (see section 4.3). Notable relationships can be pictured as in Figure 1.

Figure 1: Notable relationships between nodes

Notable relationships between nodes

The relationships pictured in Figure 1 are described in (7).

(7)
  A >> B     A is dominated by (is a descendant of) B
  A ,, B     A follows B
  A , B      A immediately follows B
  A > B      A is immediately dominated by (is the child of) B
  A . B      A immediately precedes B
  A .. B     A precedes B
  A $,, B    A is a sister of and follows B
  A $, B     A is a sister of and immediately follows B
  A == B     A and B are the same node
  A $ B      A is a sister of and not equal to B
  A $. B     A is a sister of and immediately precedes B
  A $.. B    A is a sister of and precedes B
  A <i B     B is the ith child of A (i > 0); also expressed as B >i A
  A <-i B    B is the ith-to-last child of A (i > 0); also expressed as A >-i B
  A < B      A immediately dominates (is the parent of) B
  A << B     A dominates (is an ancestor of) B

4.2.1   The i-th child relationships

To specify the exact child position which a node must (or must not) occupy, use the i-th child relationships (<i, <-i, !<i, and !<-i; also expressed with >i, >-i, !>i, and !>-i). In these relationships, i represents a number indicating which child ‘slot’ is to be considered. Positive numbers are counts from the left edge, while negative numbers are counts from the right edge. For example, (8) finds instances of NP where the second child is a PP.

(8)
NP <2 PP

Pattern (9) finds instances of NP where the last child is a PP.

(9)
NP <-1 PP

Pattern (10) finds instances of NP which have less than three children. That is, every NP which has three or more children will have something as its third child to make matching with (10) fail.

(10)
NP !<3 __

    To select nodes with specific numbers of children you can use a positive and negative i-th child relationship. For example, to find NP nodes which dominate either two or three children you would use pattern (11):

(11)
NP <2 __ !<4 __

12   Other relationships

Other available relationships between nodes, additional to those given above in (7), are given in (12):

(12)
  A <<, B    B is a leftmost descendant of A
  A <<- B    B is a rightmost descendant of A
  A >>, B    A is a leftmost descendant of B
  A >>- B    A is a rightmost descendant of B
  A <, B     B is the first child of A (synonymous with A <1 B)
  A >, B     A is the first child of B (synonymous with A >1 B)
  A <- B     B is the last child of A (synonymous with A <-1 B)
  A >- B     A is the last child of B (synonymous with A >-1 B)
  A <: B     B is the only child of A
  A >: B     A is the only child of B
  A <<: B    A dominates B via an unbroken chain (length > 0) of unary local trees
  A >>: B    A is dominated by B via an unbroken chain (length > 0) of unary local trees

4.3   Boolean expressions for building complex patterns

Complex Tregex patterns can be constructed by specifying Boolean expressions of relationships that connect to a node. If no operator is placed between two relationships, they are assumed to be connected by an and, which can be specified for clarity with an ampersand (‘&’). Thus, patterns (13) and (14) both match an IP node which immediately dominates a PP node and which dominates an IP node.

(13)
/^IP\b/ < /^PP\b/ << /^IP\b/
(14)
/^IP\b/ < /^PP\b/ & << /^IP\b/

The second relationship ‘<< /^IP\b/’ refers to the first IP and not to the PP, and is equivalent to (15).

(15)
(/^IP\b/ < /^PP\b/) << /^IP\b/

The placement of parenthesis can be used to group a node with its relationship(s), so that (16) will match an IP which immediately dominates a PP which in turn dominates some IP.

(16)
/^IP\b/ < (/^PP\b/ << /^IP\b/)

    If a pipe (‘|’) is placed between two relationships, only one of the relationships must be satisfied. Pattern (17) will find an IP node which immediately dominates a PP node or dominates an IP node:

(17)
/^IP\b/ < /^PP\b/ | << /^IP\b/

The (possibly implicit) and binds tighter than the or. So pattern (18) will find an IP node that has VB and VB2 children or that has NP-PRD and AX children.

(18)
/^IP\b/ < VB < VB2 | < NP-PRD < AX

    To create more complex expressions, square brackets (‘[]’) can be used to group relationships to form a complex relationship. Thus, pattern (19) finds an IP that has a VB or NP-PRD child, and that precedes a subject sister or follows an ADVP sister.

(19)
/^IP\b/ [ < VB | < NP-PRD ] [ $.. /SBJ/ | $,, ADVP ]

Because a square bracketed expression itself acts as a complex relationship connecting to a node, an open square bracket must immediately precede a relationship or another open square bracket that itself starts a complex relationship.

    Any relationship, including a square bracketed one, can be negated by immediately preceding it with an exclamation mark (‘!’). Thus, pattern (20) will find IP nodes that do not simultaneously have NP and PP children.

(20)
/^IP\b/ ! [ < /^NP\b/ < /^PP\b/ ]

4.4   Labeled nodes and back links

As seen with the description in section 3 of Example 1, Tregex patterns allow for a matched node to be given a unique label with an equal sign (‘=’) and then referred to by that label when Tsurgeon actions are triggered. It is also possible to use labels to refer to other nodes from within the Tregex pattern. Such a ‘back link’ is accomplished when the node name only consists of an equal sign and a label, with the label serving to refer to the node that has been assigned to that name elsewhere in the pattern. It is only possible to express patterns in which the labeled node comes prior in the connected structure of the pattern that leads to a back link.

    Having back links leads to the creation of patterns with cycles in the link structure. For example, imagine we wanted to find a clause node, /^IP\b/, that dominates a clausal complement, /THT/, and a /^PP\b/, such that the /THT/ precedes the /^PP\b/. This requires stating relationships between the /^IP\b/ and the /THT/, between the /^IP\b/ and the /^PP\b/, and between the /THT/ and the /^PP\b/. With access to labels, the query can be written as (21).

(21)
/^IP\b/=foo << (/THT/ .. (/^PP\b/ >> =foo))

The /^IP/=foo matches any tree node whose name begins with IP. Furthermore, when a matching tree node is found, it is given the label foo. Later, ‘/^PP\b/ >> =foo’ indicates that the PP must be dominated by that very same node, not just any IP.

    As another example of how back links are useful, consider the task of finding relative clauses with object traces. An object trace might be embedded arbitrarily deeply inside its relative clause, which is captured with the dominates relation (‘<< (/OB1/ < /\*T\*/)’). However, without further qualification, this is too liberal, as it will also allow the return of a relative clause that itself has no object trace but contains a relative clause with an object trace. We can add the extra qualification to prohibit the unwelcome scenario with a back link, resulting in (22).

(22)
/REL/=p << (/OB1/ < /\*T\*/ !>> (/REL/ >> =p))

The added back link statement (‘!>> (/REL/ >> =p)’) requires there to be no intervening relative clause (/REL/) between the object trace and the relative clause that the pattern serves to find.

4.5   Variable groups

If you write a node description using a regular expression, you can assign its matching groups, established with round brackets internal to the regular expression, to variable names. The syntax is:

(23)
/regular-expression/#group-number\%variable-name

    If more than one node has a group assigned to the same variable name, then matching will only occur when all such groups capture the same string. For example, pattern (24) will find identical adjacent terminal nodes (reduplicated words).

(24)
/^(.+)$/#1\%word !< __ . (/^(.+)$/#1\%word !< __)

Nodes matched with (24) have to be terminal because of the ‘!< __’ conditions that state how there can be no children, and adjacent because of the dot (‘.’). With ‘#1’, word takes its value to be the first group in the regular expression, which is the word that needs to have occurred twice in a row.

    A major use of naming variable groups is to carry over named content into the relabel Tsurgeon action that is used to relabel node content (see section 5.4).

4.6   Segmented patterns

Tregex patterns can be split into multiple segments, which are separated by colons (‘:’). Segments following the first can begin with a reference to a labeled node that is defined in a previous segment. Thus, pattern (25) will find trees where an NP immediately dominates a CP-THT.

(25)
NP=x : =x < CP-THT

    With labeling of complex nodes, segments can entirely replace patterns that otherwise require embedded relationships. For example, tree fragment (26) is matched by pattern (27) without segments as well as pattern (28) with segments.

(26)
PP NP N idea CP-THT C that
(27)
NP < (N < idea) < (CP-THT < (C < that)) > PP
(28)
NP=x : (=x < N=n) : (=n < idea) : (=x < CP-THT=y) : (=y < C=c) : (=c < that) : (=x > PP)

    Segments can make patterns easier to read and debug. Segments can also assist with pattern reuse, especially when combined with macros (see section 6). Patterns can frequently be mostly the same while differing in only one region. For such cases, the region that changes can be placed in its own segment, so that full patterns are built when a unique segment is appended to a common base, or vice versa.

    Segments need not be related. Thus pattern (29) finds trees with an IP-ADV and IP-SUB, without other requirements for how these nodes relate to one another.

(29)
IP-ADV : IP-SUB

Note that the colon has a low precedence, so that (30) matches trees with an IP-SUB that does not immediately dominate a VBP and an IP-ADV that does immediately dominate a VBP.

(30)
IP-ADV : (IP-SUB !< VBP) < VBP

5   Tsurgeon actions

With the previous section describing Tregex patterns, the current section turns to considering the range of Tsurgeon actions that Tregex patterns can invoke. Legal Tsurgeon actions are as follows:

  delete
  prune
  excise
  relabel
  move
  replace
  insert
  createSubtree
  adjoin
  adjoinH
  adjoinF
  coindex

We will look at these Tsurgeon actions in turn, with a description and example(s) for each.

5.1   delete

  delete <name_1> <name_2> ... <name_m>

for each <name_i>, deletes the node it names and everything below it.

Example 3

The following script consists of a pattern to find nodes whose label starts with ‘*’ and that do not dominate any other node. This pattern is paired with an action that deletes any matched node.


  /^\*/=x !< __

  delete x

Input:

  (IP-IMP (NP-SBJ *speaker+hearer*)
          (VB 考え)
          (MD よう)
          (PU 。))

Output:

  (IP-IMP NP-SBJ
          (VB 考え)
          (MD よう)
          (PU 。))

Note that only the matched node, *speaker+hearer*, is removed, to leave NP-SBJ with no child.

5.2   prune

  prune <name_1> <name_2> ... <name_m>

for each <name_i>, prunes out the named node. Pruning differs from deletion in that if pruning a node causes its parent to have no children, then the parent is in turn pruned too.

Example 4

The following script has the same pattern as seen with Example 3, but differs in terms of the action being prune.


  /^\*/=x !< __

  prune x

Input:

  (IP-IMP (NP-SBJ *speaker+hearer*)
          (VB 考え)
          (MD よう)
          (PU 。))

Output:

  (IP-IMP (VB 考え)
          (MD よう)
          (PU 。))

The *speaker+hearer* node is removed, and since this results in the NP-SBJ above it having no terminal children, the NP-SBJ node is deleted as well.

5.3   excise

  excise <name1> <name2>

The <name1> node should either dominate or be the same as the <name2> node. This excises out everything from <name1> to <name2>. All the children of <name2> go into the parent of <name1>, where <name1> was.

Example 5

The following script begins with the wild card pattern to identify a node that will dominate another node that is picked out with the x name and that will itself dominate at least one node that is dominating another node. The pattern is paired with an excise action that will remove nodes that can be named x from the tree.


  __ < (__=x < (__ < __))

  excise x x

Input:

  (FRAG (NP (PP (NP (PP (NP (N 日本語))
                        (P の))
                    (N 勉強))
                (P の))
            (N 時間)))

Output:

  (FRAG (N 日本語)
        (P の)
        (N 勉強)
        (P の)
        (N 時間))

This example has illustrated returning a flattened version of the input tree, with all non-root non-preterminal and non-terminal level nodes removed.

Example 6

With Example 5, an action to excise a single node was repeatedly applied. The following example illustrates excision of multiple nodes, with all inbetween material also being removed.


  PP=x < NP=y

  excise x y

Input:

  (FRAG (NP (PP (NP (PP (NP (N 日本語))
                        (P の))
                    (N 勉強))
                (P の))
            (N 時間)))

Output:

  (FRAG (NP (N 日本語)
            (N 勉強)
            (N 時間)))

From the input, there is repeated elimination of both a PP node and its dominated NP node, as well as all inbetween material, namely, the instances of:

  (PP NP
      (P の))

5.4   relabel

  relabel <name> <new-label>

relabels the node <name> to have a new label <new-label>. There are three possible forms for a relabelling instruction to take. The first form, illustrated with (31), is used for changing a node label (namely, nodeX) to an alphanumeric string (namely, NP).

(31)
relabel nodeX NP

The second form, illustrated with (32), is used for relabeling a node to something that isn't a valid identifier without quoting.

(32)
relabel nodeX /''/

The third form, illustrated with (33), is used for regular expression based relabeling.

(33)
relabel nodeX /^VB(.*)$/verb\/$1/

The third form lets you make a new label that is an arbitrary String function of the original label and additional characters that you supply. In this last case, all matches of the regular expression against the node label are replaced with the replacement String. This has the semantics of Java/Perl's replaceAll: you may use capturing groups and put them in replacements with $n. For example, if the pattern is /foo/bar/ and the node matched is foofoo, the replaceAll semantics results in barbar. If the pattern is /^foo(.*)$/bar$1/ and the node matched is foofoo, relabel will result in barfoo.

    When using the regex replacement method, the form of the regex is limited to literal character strings, the wild card character (‘.’), the Kleene star (‘*’), the string initial marker (‘^’), and the string final marker (‘$’). In order to utilise a more complex regex for use as replacement content, you can use the sequence \%{var} in the replacement string to supply the content of a matched variable group (see section 4.5). You can also use the sequence ={node} in the replacement string to use captured nodes in the replacement string (see Example 8 below). To get a ‘%’ or an ‘=’ or a slash in the replacement, escaping with ‘\’ is necessary, so: ‘\%’, ‘\=’, ‘\/’, and ‘\’.

Example 7

The following pattern will look for a non-terminal node that contains a non-empty substring that starts with ‘;{’ and ends with ‘}’ without containing ‘}’ and that should be preceded by a non-empty substring whose content is captured by the variable group name main and that is followed by a possibly empty substring whose content is captured by the variable group name rest. There follows an action that changes the label of the found node so that only the content of main and rest remain. Note that ‘^.*$’ of the relabel action is necessary to make sure the regex pattern only matches and replaces once on the entire node label.


  /^(.+);\{[^}]+\}(.*)$/#1\%main#2\%rest=x < __

  relabel x /^.*$/\%{main}\%{rest}/

Input:

  (IP-MAT (NP *ICH*-1)
          (NP-SBJ (Q Everyone))
          (VBD;~I ate)
          (PP-LOC (P-ROLE at)
                  (NP-LOC;{BETTYS}-1 (D a)
                                     (N cafe)))
          (PU .))

Output:

  (IP-MAT (NP *ICH*-1)
          (NP-SBJ (Q Everyone))
          (VBD;~I ate)
          (PP-LOC (P-ROLE at)
                  (NP-LOC-1 (D a)
                            (N cafe)))
          (PU .))

This example illustrates the content from two variable groups (main and rest) being carried over from the matched node to serve as the replacement content.

Example 8

The following pattern will look for a terminal node immediately dominated by an ID node and add the content of this node to every non-terminal node with content that is not ID and that does not already contain the to-be-added content.


  ID < (__=label !< __) : (/^./=x < __ !== ID !== /;(.+)/#1\%label)

  relabel x /^.*$/={x};={label}/

Input:

  ( (IP-MAT (NP-SBJ (PRO He))
            (VBP laughs)
            (PU .))
    (ID 8_example))

Output:

  ( (IP-MAT;8_example (NP-SBJ;8_example (PRO;8_example He))
                      (VBP;8_example laughs)
                      (PU;8_example .))
    (ID 8_example))

5.5   move

  move <name> <position>

moves the named node into the specified position. To be precise, it deletes (*NOT* prunes) the node from the tree, and re-inserts it into the specified position. See section 5.7 for how to specify position.

Example 9

This example demonstrates the promotion of punctuation (PU) to a higher constituent layer when on the left or right edge of a constituent layer (identified as ‘y’) that is dominated by at least one more node, that is, when ‘y’ is not the root node of the tree. The first pattern-action pair covers the left edge scenario, while the second pattern-action pair covers the right edge scenario.


  PU=x >1 (__=y > __)

  move x $+ y


  PU=x >-1 (__=y > __)

  move x $- y

Input:

  (IP-MAT (NP-SBJ (NPR Ben)
                  (NPR Affleck))
          (BEP is)
          (NP-PRD (D an)
                  (N actor))
          (ADVP-MOD (PU ,)
                    (ADV too)
                    (PU .)))

Output:

  (IP-MAT (NP-SBJ (NPR Ben)
                  (NPR Affleck))
          (BEP is)
          (NP-PRD (D an)
                  (N actor))
          (PU ,)
          (ADVP-MOD (ADV too))
          (PU .))

Example 10

The pattern contains a disjunction with two parts. The first part of the disjunction captures quantifier movement that occurs within the same (clause) layer. Here, the pattern involves a relation of sisterhood and precedence (‘$..’) between a target position (‘x’) that starts out as an immediate right sister of ‘a’ and the constituent to move (‘b’) that share its index with ‘a’. The second part of the disjunction captures quantifier raising that involves a move to a higher layer of structure, with the pattern involving a relation of domination (‘<<’) from a sister of the target position (‘x’). The action involves relabelling the ‘a’ node with the main content from the ‘b’ node (that is, the ‘b’ node content minus the index and any material that either comes after the index or before the index but after a semi-colon (;)), relabelling the ‘b’ node as NP together with the content of the ‘b’ node that includes the index and any extra content that either comes after the index or comes before the index but after a semi-colon, followed by movement of ‘a’ to be an immediate left sister of ‘b’, followed by movement of ‘b’ to be an immediate left sister of ‘x’.


  __=x $, (NP=a < /^\*ICH\*-(\p{Digit}+)$/#1\%index)
    [ $.. /^(NP\b[^;]*)(.*)-(\p{Digit}+)(.*)$/#1\%main#2\%extra#3\%index#4\%rest=b |
      $ (__ << /^(NP\b[^;]*)(.*)-(\p{Digit}+)(.*)$/#1\%main#2\%extra#3\%index#4\%rest=b)
    ]

  relabel a /^.*$/\%{main}/
  relabel b /^.*$/NP\%{extra}-\%{index}\%{rest}/
  move a $+ b
  move b $+ x

Input:

  (IP-MAT (NP *ICH*-1)
          (NP-SBJ (Q Everyone))
          (VBD;~I ate)
          (PP-LOC (P-ROLE at)
                  (NP-LOC;{BETTYS}-1 (D a)
                                     (N cafe)))
          (PU .))

Output:

  (IP-MAT (NP;{BETTYS}-1 (D a)
                         (N cafe))
          (NP-SBJ (Q Everyone))
          (VBD;~I ate)
          (PP-LOC (P-ROLE at)
                  (NP-LOC *ICH*-1))
          (PU .))

5.6   replace

  replace <name1> <name2>

deletes <name1> and inserts a copy of <name2> in its place.

  replace <name> <tree> <tree2>...

deletes <name> and inserts the new tree(s) in its place. If more than one replacement tree is given, each of the new subtrees will be added in order where the old tree was. Multiple subtrees at the root is an illegal Tsurgeon action and will throw an exception.

Example 11

In the following example, replacement is made specific to matching (VB-VOL 考えよう).


  VB-VOL=x < 考えよう

  replace x (VB 考え) (MD よう)

Input:

  (IP-IMP (VB-VOL 考えよう)
          (PU 。))

Output:

  (IP-IMP (VB 考え)
          (MD よう)
          (PU 。))

5.7   insert

  insert <name> <position>
  insert <tree> <position>

inserts the named node or tree into the position specified. The ways to specify position are:

  $+ <name>   the left sister of the named node
  $- <name>   the right sister of the named node
  >i <name>   the i_th daughter of the named node.
  >-i <name>  the i_th daughter, counting from the right, of the named node.

Example 12

The transformation seen with Example 11 can be achieved more generally with:


  VB-VOL=x < /^(.+)よう$/#1\%stem

  insert (VB =y) $+ x
  relabel y /^.*$/\%{stem}/
  replace x (MD よう)

Note that the insert and relabel combination occurs before replace. This ordering is important because following replace the content of the stem variable is lost.

5.8   createSubtree

  createSubtree <auxiliary-tree-or-label> <name1> [<name2>]

creates a subtree out of all the nodes from name1 through name2. The subtree is moved to the foot of the given auxiliary tree, and the tree is inserted where the nodes of the subtree used to reside. If a simple label is provided as the first argument, the subtree is given a single parent with a name corresponding to the label. To limit the Tsurgeon action to just one node, elide name2.

    Auxiliary trees must have exactly one frontier node ending in the character ‘@’, which marks it as the ‘foot’ node for adjunction. Final instances of the character ‘@’ in terminal node labels will be removed from the actual label of the tree.

    All other instances of ‘@’ in terminal nodes must be escaped (i.e., appear as \@); this escaping will be removed by Tsurgeon.

Example 13

The first pattern-action pair creates a right binarization of the input tree with the creation of ‘artificial’ non-terminal nodes (ILYR) that span between the second node (‘<2 __=x’) and the end node (‘<-1 __=y’) of a constituent with at least three nodes (‘<3 __’) and that contains a main verb that has a preceding sister (‘< (/^V/ $, __)’).

    The second pattern-action pair creates a left binarization of the input tree with the creation of ILYR nodes that span between the first node (‘<1 __=x’) and the second to last node (‘<-2 __=y’) of a constituent with at least three nodes (‘<3 __’).


  __ <2 __=x <-1 __=y <3 __ < (/^V/ $, __)

  createSubtree ILYR x y


  __ <1 __=x <-2 __=y <3 __

  createSubtree ILYR x y

Input:

  ( (IP-MAT (ADVP-TMP (ADV Yesterday))
            (PU ,)
            (NP-SBJ (N Mr.)
                    (NPR Ito))
            (VBD ate)
            (NP-OB1 (N pizza))
            (PP-COM (P-ROLE with)
                    (NP (N Ms.)
                        (NPR Yamada)))
            (PP-LOC (P-ROLE in)
                    (NP (NPR Shibuya)))
            (PU .))
    (ID 8_texts_purple_basic))

Output:

  ( (IP-MAT (ADVP-TMP (ADV Yesterday))
            (ILYR (PU ,)
                  (ILYR (NP-SBJ (N Mr.)
                                (NPR Ito))
                        (ILYR (ILYR (ILYR (ILYR (VBD ate)
                                                (NP-OB1 (N pizza)))
                                          (PP-COM (P-ROLE with)
                                                  (NP (N Ms.)
                                                      (NPR Yamada))))
                                    (PP-LOC (P-ROLE in)
                                            (NP (NPR Shibuya))))
                              (PU .)))))
    (ID 8_texts_purple_basic))

Notably, the output gives the main verb as the most embedded element surrounded by the left binarization of following elements, which in turn is followed by the right binarization of preceding elements.

5.9   adjoin

  adjoin <auxiliary-tree> <target-node>

adjoins the specified <auxiliary-tree> into the specified <target-node>. The daughters of the target node will become the daughters of the foot of the auxiliary tree.

Example 14

The following example has a pattern to find nodes with labels that start with NLYR and do not dominate nodes that start with *ICH* or CONJP. This pattern is paired with an action that replaces the matched node with an NP node that furthermore has a PP projection that contains an initial (P-ROLE #) element.


  /^NLYR\b/=x !< /^(\*ICH\*|CONJP\b)/

  adjoin (PP (P-ROLE #) NP@) x

Input:

  (NP (NLYR (NLYR (ADJP (ADJ liquid))
                (N phase))
           (ADJP (ADJ thermal))
           (N reaction))
      (NS studies))

Output:

  (NP (PP (P-ROLE #)
          (NP (PP (P-ROLE #)
                  (NP (ADJP (ADJ liquid))
                      (N phase)))
              (ADJP (ADJ thermal))
              (N reaction)))
      (NS studies))

5.10   adjoinH

  adjoinH <auxiliary-tree> <target-node>

is similar to adjoin, but preserves the <target-node> and makes it the root of <auxiliary-tree>. The root of the <auxiliary-tree> is ignored.

Example 15

This illustrates the addition of an NLYR layer into a noun phrase that dominates a CONJP node.


  /^NP\b/=x < (CONJP !> /^NLYR/)

  adjoinH ( NLYR@) x

Input:

  (NP (NP (NP Pat))
      (CONJP (CONJ and)
             (NP (NPR Tony))))

Output:

  (NP (NLYR (NP (NP Pat))
            (CONJP (CONJ and)
                   (NP (NPR Tony)))))

5.11   adjoinF

  adjoinF <auxiliary-tree> <target-node>

is similar to adjoin, but preserves the <target-node> and makes it the foot of <auxiliary-tree>. The foot of the <auxiliary-tree> is ignored.

Example 16

The first pattern-action pair adds a ZERO node above terminal nodes that start with ‘*’ and are dominated by an NP node. The second pattern-action pair adds a WORD node above terminal nodes that are not dominated by WORD (thus, ensuring termination), ZERO, or a node that begins with PU.


  /^\*/=x !< __ > /^NP\b/

  adjoinF (ZERO @) x


  __=x !< __ !> /^(WORD$|ZERO$|PU)/

  adjoinF (WORD @) x

Input:

  (IP-IMP (NP-SBJ *speaker+hearer*)
          (VB 考え)
          (MD よう)
          (PU 。))

Output:

  (IP-IMP (NP-SBJ (ZERO *speaker+hearer*))
          (VB (WORD 考え))
          (MD (WORD よう))
          (PU 。))

5.12   coindex

  coindex <name_1> <name_2> ... <name_m>

puts a (Penn Treebank style) coindexation suffix of the form ‘-n’ on each of nodes <name_1> through <name_m>. The value of n will be generated in reference to the existing coindexations in the tree, thus avoiding an accidental clash of indices across things that are not meant to be coindexed.

Example 17

The following pattern contains two pattern segments (see section 4.6). The first pattern segment matches a node (identified as ‘x’) that: (i) contains a non-empty substring that starts with ‘;{’ and ends with ‘}’ and that is referenced with the variable group name index, (ii) doesn't dominate a PRO node, (iii) is contained within a node (identified as ‘y’) that has CND as a substring (a constituent layer that will be the antecedent of a conditional), and (iv) has a leftmost terminal descendant (identified as ‘a’) that doesn't end within an index string. The second pattern segment matches a node that: (i) contains a non-empty substring that starts with ‘;{’ and ends with ‘}’ and is the same string referenced with the variable group name index of the first pattern segment, (ii) dominates a PRO node, (iii) precedes the node referenced as ‘x’ by the first pattern segment, (iv) is not contained within the node referenced as ‘y’ by the first pattern segment, and (v) has a leftmost terminal descendant (identified as ‘b’) that doesn't end within an index string. When the overall pattern is satisfied an action is triggered to coindex the nodes identified with ‘a’ and ‘b’.


  /;(\{[^}]+\})/#1\%index=x !< /^PRO\b/ >> /\bCND\b/=y <<, (!/-\p{Digit}+$/=a !< __) :
  (/;(\{[^}]+\})/#1\%index < /^PRO\b/ ,, =x !>> =y <<, (!/-\p{Digit}+$/=b !< __))

  coindex a b

Input:

  (IP-MAT (PP-SCON-CNT-CND (P-CONN If)
                           (IP-ADV (NP-SBJ (D a)
                                           (N farmer))
                                   (VBP owns)
                                   (NP-OB1;{DONKEY} (D a)
                                                    (N donkey))))
          (NP-SBJ (PRO he))
          (VBP beats)
          (NP-OB1;{DONKEY} (PRO it))
          (PU .))

Output:

  (IP-MAT (PP-SCON-CNT-CND (P-CONN If)
                           (IP-ADV (NP-SBJ (D a)
                                           (N farmer))
                                   (VBP owns)
                                   (NP-OB1;{DONKEY} (D a-1)
                                                    (N donkey))))
          (NP-SBJ (PRO he))
          (VBP beats)
          (NP-OB1;{DONKEY} (PRO it-1))
          (PU .))

While a yield of the input tree gives (34):

(34)
If a farmer owns a donkey he beats it .

a yield of the output tree gives (35):

(35)
If a farmer owns a-1 donkey he beats it-1 .

That is, the change serves to retain the anaphoric link information following a yield.


6   Adding macros with m4

m4 is a general-purpose macro processor that can be used to preprocess Tsurgeon scripts. The basic operation of m4 is to read input and determine if the input contains any macro names. If a macro name is found, the name is replaced by its defining text, with and the resulting changed input is rescanned. As a consequence, text replacements can themselves contain macro calls. Macros can be called with arguments. The arguments are collected and substituted into the right places in the defining text before the defining text is rescanned.

    Note that if m4 is used as a preprocessor, instances of ‘#’ used to mark variable groups in patterns (see section 4.5) should be placed between a backquote character (`) and a single quote character ('). For example, (36),

(36)
/^([^*]+)$/#1\%word

should become (37):

(37)
/^([^*]+)$/`#'1\%word

Example 18

This example contains actions that will move PP constituents (‘x’) with IP-CTL that have no potential controllers of the containing clause in preceding positions to a position that makes the subject binding of the clause (‘y’) its immediately preceding sister. This is accomplished with reference to two macros. The first macro defines _pp which can receive either no arguments or a single argument that is a node name to create a search pattern for finding PP nodes. The second macro defines _controllers_all_following, which includes calls to the _pp macro (with argument and without). This second macro is used to name the subject that is following in the containing clause, and prohibit the possibility of there being preceding arguments with ROLE dominating a node prefixed ARG — potential controllers.


  define(`_pp', `ifelse(`$#', `0', `/^PP\b/', `/^PP\b/=$1')')dnl
  define(`_controllers_all_following', `$.. (_pp($1) < (/^ROLE/ < /^ARG0/) !$,, (_pp < (/^ROLE/ < /^ARG/))) !$,, (_pp < (/^ROLE/ < /^ARG/))')dnl

  /^IP-CTL/=x _controllers_all_following(y)

  move x $- y


  _pp(x) < /^IP-CTL/ _controllers_all_following(y)

  move x $- y


  _pp(x) < (/^CP-THT/ < (/^IP-CTL/)) _controllers_all_following(y)

  move x $- y

Input:

  (IP-MAT (PP (CONN If)
              (IP-CTL (VB laughing)))
          (PP (ROLE ARG0)
              (NP (N boy)))
          (VB COPULA_is)
          (ADXP (ADX happy)))

Output:

  (IP-MAT (PP (ROLE ARG0)
              (NP (N boy)))
          (PP (CONN If)
              (IP-CTL (VB laughing)))
          (VB COPULA_is)
          (ADXP (ADX happy)))

7   An extended example

Clause-level coordination in the NINJAL Parsed Corpus of Modern Japanese (NPCMJ) is represented by placing the content of clausal conjuncts that are prior to the last conjunct under stacked subordinate IP-ADV nodes that contain a -CONJ extension or that are followed by a conjunction marker (CONJ) (possibly with intervening punctuation) (or both). Example 19 demonstrates how it is possible to transform this coding of clause-level coordination with subordinate structure into a coding that has coordinate structure with CONJP and ILYR (clause intermediate level) nodes.

Example 19


  /^IP-ADV-CONJ/=x $. (PU $. CONJ)

  relabel x /^IP-ADV-CONJ(.*)$/IP-ADV$1/


  /^IP-ADV/ $. (PU=y $. CONJ=x)

  adjoinF (ILYR CONJP @ ILYR) x
  move y $+ x


  /^IP-ADV-CONJ/=x !$. CONJ

  relabel x /^IP-ADV-CONJ(.*)$/IP-ADV$1/
  insert (CONJ *) $- x


  /^IP-ADV/ $. CONJ=x

  adjoinF (ILYR CONJP @ ILYR) x


  /^IP-ADV(.*)$/#1\%extra=x $. (ILYR <1 (CONJP=y !< __))

  move x >1 y
  relabel x /^.*$/ILYR\%{extra}/


  ILYR < CONJ <-1 ILYR=y $. __=x

  move x >-1 y


  CONJP $. (CONJ=x < /^\*$/)

  delete x


  CONJP=z $. (PU=y $. CONJ=x)

  move y >-1 z
  move x >-1 z


  CONJP=z $. CONJ=x

  move x >-1 z

Input:

  ( (IP-MAT (NP-SBJ *speaker*)
            (IP-ADV-CONJ (VB わかり)
                         (AX まし)
                         (AXD た))
            (PU 、)
            (CONJ では)
            (PU 、)
            (PP-OB1 (NP (PRO それ))
                    (P-ROLE を))
            (VB 確認)
            (VB0 し)
            (P-CONN て))
    (ID 42_spoken_JF2))

Output:

  ( (IP-MAT (NP-SBJ *speaker*)
            (ILYR (CONJP (ILYR (VB わかり)
                               (AX まし)
                               (AXD た))
                         (PU 、)
                         (CONJ では))
                  (ILYR (PU 、)
                       (PP-OB1 (NP (PRO それ))
                               (P-ROLE を))
                       (VB 確認)
                       (VB0 し)
                       (P-CONN て))))
    (ID 42_spoken_JF2))

8   Running Tsurgeon

This section notes how to run Tsurgeon from the command prompt with the tsurgeon_script wrapper script. This wrapper script is available from: http://www.compling.jp/ajb129.

NAME
      tsurgeon_script - pipeline tsurgeon

SYNOPSIS
      tsurgeon_script script

DESCRIPTION
      This wrapper script runs stanford-tregex.jar in the tsurgeon mode
      as a filter changing stdin.  At least one tsurgeon script must be
      supplied.

      A tsurgeon script is a file containing a list of pattern and
      transformation operation list pairs.  That is, it is a sequence of
      pairs of a Tregex pattern on one or more lines, then a blank line
      (empty or whitespace), then a list of transformation operations
      one per line to apply when the pattern is matched, and then
      another blank line (empty or whitespace).  Note the need for blank
      lines: The code crashes if they are not present as separators.

      The character % introduces a comment that extends to the end of the
      line.  All other intended uses of % must be escaped as \% .

      Before tsurgeon gets to see any script content, the script content
      is passed through the m4 macro processor.

      Instances of # used to mark variable groups in patterns should be
      placed between a backquote character and a single quote character.

      Also lines that begin .R and end .E are commented out. It is not
      necessary to close .R, in which case all lines are commented out
      from the instance of .R to the end of the given script.

OPTIONS
      --debug)   # see modified script
      -*)        # show this help message
      *)         # filename to source the tsurgeon script content



References

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structure. In 5th International conference on Language Resources and Evaluation.

Pito, Richard. 1994. tgrepdoc–documentation for tgrep. University of Pennsylvania.

Rohde, Douglas. 2005. TGrep2 User Manual version 1.15.