delphin.derivation

Classes and functions related to derivation trees.

Derivation trees represent a unique analysis of an input using an implemented grammar. They are a kind of syntax tree, but as they use the actual grammar entities (e.g., rules or lexical entries) as node labels, they are more specific than trees using general category labels (e.g., “N” or “VP”). As such, they are more likely to change across grammar versions.

See also

More information about derivation trees is found at http://moin.delph-in.net/ItsdbDerivations

For the following Japanese example…

遠く    に  銃声    が  聞こえ た 。
tooku   ni  juusei  ga  kikoe-ta
distant LOC gunshot NOM can.hear-PFV
"Shots were heard in the distance."

… here is the derivation tree of a parse from Jacy in the Unified Derivation Format (UDF):

(utterance-root
 (564 utterance_rule-decl-finite 1.02132 0 6
  (563 hf-adj-i-rule 1.04014 0 6
   (557 hf-complement-rule -0.27164 0 2
    (556 quantify-n-rule 0.311511 0 1
     (23 tooku_1 0.152496 0 1
      ("遠く" 0 1)))
    (42 ni-narg 0.478407 1 2
     ("に" 1 2)))
   (562 head_subj_rule 1.512 2 6
    (559 hf-complement-rule -0.378462 2 4
     (558 quantify-n-rule 0.159015 2 3
      (55 juusei_1 0 2 3
       ("銃声" 2 3)))
     (56 ga 0.462257 3 4
      ("が" 3 4)))
    (561 vstem-vend-rule 1.34202 4 6
     (560 i-lexeme-v-stem-infl-rule 0.365568 4 5
      (65 kikoeru-stem 0 4 5
       ("聞こえ" 4 5)))
     (81 ta-end 0.0227589 5 6
      ("た" 5 6)))))))

In addition to the UDF format, there is also the UDF export format “UDX”, which adds lexical type information and indicates which daughter node is the head, and a dictionary representation, which is useful for JSON serialization. All three are supported by PyDelphin.

Derivation trees have 3 types of nodes:

  • root nodes, with only an entity name and a single child
  • normal nodes, with 5 fields (below) and a list of children
    • id – an integer id given by the producer of the derivation
    • entity – rule or type name
    • score – a (MaxEnt) score for the current node’s subtree
    • start – the start position (in inter-word, or chart, indices) of the left-most side of the tree
    • end – the end position (in inter-word, or chart, indices) of the right-most side of the tree
  • terminal/leaf/lexical nodes, which contain the input tokens processed by that subtree

This module uses the UdfNode class for capturing root and normal nodes. Root nodes are expressed as a UdfNode whose id is None. For root nodes, all fields except entity and the list of daughters are expected to be None. Leaf nodes are simply an iterable of token information.
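
The three node kinds can be illustrated with a small standard-library sketch that reads a plain UDF string into nested dictionaries. This is not PyDelphin's parser: the function name read_udf and the dictionary layout are assumptions made for illustration, and the sketch ignores UDX annotations, token lists, and escaped quotes.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare atoms.
_TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def read_udf(s):
    """Read a plain UDF string into nested dicts (illustrative only)."""
    tokens = _TOKEN.findall(s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        # Collect the fields that precede any daughter nodes.
        fields = []
        while tokens[pos] not in ('(', ')'):
            fields.append(tokens[pos])
            pos += 1
        daughters = []
        while tokens[pos] == '(':
            daughters.append(node())
        assert tokens[pos] == ')'
        pos += 1
        if fields[0].startswith('"'):
            # Terminal node: keep only the quoted surface form.
            return {'form': fields[0][1:-1]}
        if len(fields) == 1:
            # Root node: an entity name only, so the id is None.
            return {'id': None, 'entity': fields[0], 'daughters': daughters}
        # Normal node: id, entity, score, start, end.
        id_, entity, score, start, end = fields
        return {'id': int(id_), 'entity': entity, 'score': float(score),
                'start': int(start), 'end': int(end), 'daughters': daughters}

    return node()


tree = read_udf('(root (1 entity-name 1 0 1 ("token")))')
node = tree['daughters'][0]
print(tree['id'], node['entity'], node['daughters'][0]['form'])
```

In this sketch a root is recognized only by having a single field, which mirrors the convention that root nodes carry an entity name and nothing else.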

The Derivation class (itself a UdfNode) has some tree-level operations defined, in particular the Derivation.from_string() method, which reads a serialized derivation into a Python object.

Loading Derivation Data

For loading a full derivation structure from either the UDF/UDX string representations or the dictionary representation, the Derivation class provides class methods to help with the decoding.

>>> from delphin import derivation
>>> d1 = derivation.Derivation.from_string(
...     '(1 entity-name 1 0 1 ("token"))')
...
>>> d2 = derivation.Derivation.from_dict(
...     {'id': 1, 'entity': 'entity-name', 'score': 1,
...      'start': 0, 'end': 1, 'form': 'token'})
...
>>> d1 == d2
True

class delphin.derivation.Derivation(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)

Bases: delphin.derivation.UdfNode

A [incr tsdb()] derivation.

This class exists to facilitate the reading of UDF string serializations and dictionary representations (e.g., decoded from JSON). The resulting structure is otherwise equivalent to a UdfNode, and inherits all its methods.

classmethod from_dict(d)

Instantiate a Derivation from a dictionary representation.

The dictionary representation may come from the HTTP interface (see the ErgApi wiki) or from the UdfNode.to_dict() method. Note that in the former case, the JSON response should have already been decoded into a Python dictionary.

Parameters: d (dict) – dictionary representation of a derivation
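
The decoding step mentioned above can be sketched with the standard json module. The payload below is made up for illustration and simply reuses the field names from the loading example; the call to Derivation.from_dict() is shown only as a comment so that the sketch runs without PyDelphin installed.

```python
import json

# A JSON payload shaped like the dictionary representation (values are made up).
payload = ('{"id": 1, "entity": "entity-name", "score": 1, '
           '"start": 0, "end": 1, "form": "token"}')

# Decode the JSON text into a Python dictionary first.
d = json.loads(payload)
print(type(d).__name__, d['entity'], d['form'])

# With PyDelphin available, the decoded dict (not the raw string) would be
# passed on, e.g.:
#     derivation.Derivation.from_dict(d)
```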

classmethod from_string(s)

Instantiate a Derivation from a UDF or UDX string representation.

The UDF/UDX representations are as output by a processor like the LKB or ACE, or from the UdfNode.to_udf() or UdfNode.to_udx() methods.

Parameters: s (str) – UDF or UDX serialization

UDF/UDX Node Types

UDF/UDX derivations are built from three classes: UdfNode, UdfTerminal, and UdfToken.

class delphin.derivation.UdfNode

Normal (non-leaf) nodes in the Unified Derivation Format.

Root nodes are just UdfNodes whose id, by convention, is None. The daughters list can be composed of either UdfNodes or other objects (generally it should be uniformly one or the other). In the latter case, the UdfNode is a preterminal, and the daughters are terminal nodes.

Parameters:
  • id (int) – unique node identifier
  • entity (str) – grammar entity represented by the node
  • score (float, optional) – probability or weight of the node
  • start (int, optional) – start position of tokens encompassed by the node
  • end (int, optional) – end position of tokens encompassed by the node
  • daughters (list, optional) – iterable of daughter nodes
  • head (bool, optional) – True if the node is a syntactic head node
  • type (str, optional) – grammar type name
  • parent (UdfNode, optional) – parent node in derivation
id

the unique node identifier

entity

the grammar entity represented by the node

score

the probability or weight of the node; for many processors, this will be the unnormalized MaxEnt score assigned to the whole subtree rooted by this node

start

the start position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters

end

the end position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters

type

the lexical type (available on preterminal UDX nodes)
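
As a rough illustration of the start and end attributes, the sketch below treats them as positions between tokens, so that a node spanning 2 to 4 covers the third and fourth terminals. The token list is taken from the terminals of the Japanese example tree; the helper name covered_tokens is hypothetical and not part of PyDelphin.

```python
# Terminals of the example tree, in order.
tokens = ["遠く", "に", "銃声", "が", "聞こえ", "た"]


def covered_tokens(start, end):
    """Return the tokens lying between chart positions start and end."""
    return tokens[start:end]


# Node 559 (hf-complement-rule) has start=2 and end=4 in the example tree.
print(covered_tokens(2, 4))
```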

is_root()

Return True if the node is a root node.

Note

This is not simply the top node; by convention, a node is a root if its id is None.

to_udf(indent=1)

Encode the node and its descendants in the UDF format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string

to_udx(indent=1)

Encode the node and its descendants in the UDF export format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string

to_dict(fields=('score', 'head', 'end', 'daughters', 'start', 'type', 'id', 'form', 'tokens', 'entity'), labels=None)

Encode the node as a dictionary suitable for JSON serialization.

Parameters:
  • fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
  • labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])
Returns:

dict – the dictionary representation of the structure
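
The effect of a field whitelist can be sketched on a plain dictionary. This is only an approximation of the behavior described above, not PyDelphin's code: the helper filter_fields and the constant ALWAYS_SHOWN are hypothetical names, and the input dictionary reuses the values from the loading example.

```python
import json

# Keys kept regardless of the whitelist, per the description of `fields`.
ALWAYS_SHOWN = ('daughters', 'form')


def filter_fields(node, fields):
    """Keep only whitelisted keys, recursing into daughters."""
    keep = set(fields) | set(ALWAYS_SHOWN)
    out = {}
    for key, value in node.items():
        if key not in keep:
            continue
        if key == 'daughters':
            value = [filter_fields(dtr, fields) for dtr in value]
        out[key] = value
    return out


node = {'id': 1, 'entity': 'entity-name', 'score': 1,
        'start': 0, 'end': 1, 'form': 'token'}
slim = filter_fields(node, fields=('id', 'entity'))
print(json.dumps(slim, sort_keys=True))
```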

basic_entity()

Return the entity without the lexical type information.

In the export (UDX) format, lexical types follow entities of preterminal nodes, joined by an at-sign (@). In regular UDF or non-preterminal nodes, this will just return the entity string.

Deprecated since version 0.5.1: Use entity

is_head()

Return True if the node is a head.

A node is a head if it is marked as a head in the UDX format or it has no siblings. False is returned if the node is known to not be a head (has a sibling that is a head). Otherwise it is indeterminate whether the node is a head, and None is returned.
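
The three possible results can be summarized in a short sketch. The function head_status and its two arguments are hypothetical stand-ins for the information a node would have about itself and its siblings; it restates the rule above rather than reproducing PyDelphin's implementation.

```python
def head_status(marked, sibling_marks):
    """
    Return True, False, or None following the rule described above.

    marked        -- True if this node is marked as a head in UDX
    sibling_marks -- one boolean per sibling, True if that sibling is marked
    """
    if marked or not sibling_marks:
        return True       # marked as a head, or an only child
    if any(sibling_marks):
        return False      # some sibling is the head instead
    return None           # no head marking anywhere: indeterminate


print(head_status(False, []))        # only child
print(head_status(False, [True]))    # a sibling is the head
print(head_status(False, [False]))   # nothing is marked
```

Returning None rather than False in the last case keeps "known not to be a head" distinct from "unknown", which matters for plain UDF input where no head marking exists.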

lexical_type()

Return the lexical type of a preterminal node.

In export (UDX) format, lexical types follow entities of preterminal nodes, joined by an at-sign (@). In regular UDF or non-preterminal nodes, this will return None.

Deprecated since version 0.5.1: Use type
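
The at-sign convention described for basic_entity() and lexical_type() amounts to a simple string split. The sketch below uses a hypothetical helper split_entity and a made-up lexical type name; it is not taken from PyDelphin.

```python
def split_entity(label):
    """Split 'entity@type' into (entity, type); type is None if absent."""
    entity, sep, lextype = label.partition('@')
    return entity, (lextype if sep else None)


# 'hypothetical-lexical-type' is a placeholder, not a real Jacy type.
print(split_entity('tooku_1@hypothetical-lexical-type'))
print(split_entity('hf-complement-rule'))
```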

preterminals()

Return the list of preterminals (i.e. lexical grammar-entities).

terminals()

Return the list of terminals (i.e. lexical units).
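
The difference between preterminals and terminals can be shown with a short traversal over nested dictionaries. The dictionary layout (terminals as dicts with a form key) and both function bodies are assumptions for illustration; the tree is a fragment of the Japanese example with ids, scores, and spans omitted.

```python
def terminals(node):
    """Collect terminal dicts (those with a 'form') in left-to-right order."""
    if 'form' in node:
        return [node]
    found = []
    for dtr in node.get('daughters', []):
        found.extend(terminals(dtr))
    return found


def preterminals(node):
    """Collect nodes whose daughters are all terminals."""
    dtrs = node.get('daughters', [])
    if dtrs and all('form' in dtr for dtr in dtrs):
        return [node]
    found = []
    for dtr in dtrs:
        found.extend(preterminals(dtr))
    return found


# Fragment of the example tree: node 557 and its descendants.
tree = {
    'entity': 'hf-complement-rule',
    'daughters': [
        {'entity': 'quantify-n-rule',
         'daughters': [
             {'entity': 'tooku_1', 'daughters': [{'form': '遠く'}]}]},
        {'entity': 'ni-narg', 'daughters': [{'form': 'に'}]},
    ],
}

print([p['entity'] for p in preterminals(tree)])
print([t['form'] for t in terminals(tree)])
```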

class delphin.derivation.UdfTerminal

Terminal nodes in the Unified Derivation Format.

The form field is always set, but tokens may be None.

See: http://moin.delph-in.net/ItsdbDerivations

Parameters:
  • form (str) – surface form of the terminal
  • tokens (list, optional) – iterable of tokens
  • parent (UdfNode, optional) – parent node in derivation
form

the surface form of the terminal

tokens

the list of tokens

is_root()

Return False (as a UdfTerminal is never a root).

This function is provided for convenience, so one does not need to check if isinstance(n, UdfNode) before testing if the node is a root.

to_udf(indent=1)

Encode the node and its descendants in the UDF format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string

to_udx(indent=1)

Encode the node and its descendants in the UDF export format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string

to_dict(fields=('score', 'head', 'end', 'daughters', 'start', 'type', 'id', 'form', 'tokens', 'entity'), labels=None)

Encode the node as a dictionary suitable for JSON serialization.

Parameters:
  • fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
  • labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])
Returns:

dict – the dictionary representation of the structure

class delphin.derivation.UdfToken

A token representation in derivations.

Token data are not formally nodes, but do have an id. Most UdfTerminal nodes will only have one UdfToken, but multi-word entities (e.g. “ad hoc”) will have more than one.

Parameters:
  • id (int) – token identifier
  • tfs (str) – the feature structure for the token
id

the token identifier

tfs

the feature structure for the token