delphin.derivation

Classes and functions related to derivation trees.

Derivation trees represent a unique analysis of an input using an implemented grammar. They are a kind of syntax tree, but as they use the actual grammar entities (e.g., rules or lexical entries) as node labels, they are more specific than trees using general category labels (e.g., “N” or “VP”). As such, they are more likely to change across grammar versions.

See also

More information about derivation trees is found at http://moin.delph-in.net/ItsdbDerivations

For the following Japanese example…

遠く    に  銃声    が  聞こえ た 。
tooku   ni  juusei  ga  kikoe-ta
distant LOC gunshot NOM can.hear-PFV
"Shots were heard in the distance."

… here is the derivation tree of a parse from Jacy in the Unified Derivation Format (UDF):

(utterance-root
 (564 utterance_rule-decl-finite 1.02132 0 6
  (563 hf-adj-i-rule 1.04014 0 6
   (557 hf-complement-rule -0.27164 0 2
    (556 quantify-n-rule 0.311511 0 1
     (23 tooku_1 0.152496 0 1
      ("遠く" 0 1)))
    (42 ni-narg 0.478407 1 2
     ("に" 1 2)))
   (562 head_subj_rule 1.512 2 6
    (559 hf-complement-rule -0.378462 2 4
     (558 quantify-n-rule 0.159015 2 3
      (55 juusei_1 0 2 3
       ("銃声" 2 3)))
     (56 ga 0.462257 3 4
      ("が" 3 4)))
    (561 vstem-vend-rule 1.34202 4 6
     (560 i-lexeme-v-stem-infl-rule 0.365568 4 5
      (65 kikoeru-stem 0 4 5
       ("聞こえ" 4 5)))
     (81 ta-end 0.0227589 5 6
      ("た" 5 6)))))))

In addition to the UDF format, there is also the UDF export format “UDX”, which adds lexical type information and indicates which daughter node is the head, and a dictionary representation, which is useful for JSON serialization. All three are supported by PyDelphin.

Derivation trees have 3 types of nodes:

- root nodes, with only an entity name and a single child
- normal nodes, with 5 fields (below) and a list of children
  - id – an integer id given by the producer of the derivation
  - entity – rule or type name
  - score – a (MaxEnt) score for the current node’s subtree
  - start – the chart (inter-word) index of the left-most side of the subtree
  - end – the chart (inter-word) index of the right-most side of the subtree
- terminal/leaf/lexical nodes, which contain the input tokens processed by that subtree

This module uses the UDFNode class for capturing root and normal nodes. Root nodes are expressed as a UDFNode whose id is None; for these, all fields except entity and the list of daughters are expected to be None. Leaf nodes are captured by the UDFTerminal class, which holds the surface form and, optionally, a list of tokens.
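To make the three node types concrete, here is a minimal pure-Python sketch that reads a UDF string into nested tuples. It is not PyDelphin's parser: `parse_udf` and the tuple layouts are hypothetical names chosen for illustration, and only the simple terminal form shown in the example above (a quoted form optionally followed by start and end) is handled.

```python
import re

# Tokens: parentheses, double-quoted strings, or runs of other characters.
_TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_udf(s):
    """Parse a UDF string into nested tuples (illustrative sketch only).

    Root:     ('root', entity, [child])
    Normal:   ('node', id, entity, score, start, end, [children])
    Terminal: ('terminal', form, start, end)
    """
    tokens = _TOKEN.findall(s)
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise ValueError('expected {!r} at token {}'.format(tok, pos))
        pos += 1

    def read():
        nonlocal pos
        expect('(')
        fields = []
        # Plain fields come first, then zero or more parenthesized children.
        while pos < len(tokens) and tokens[pos] not in ('(', ')'):
            fields.append(tokens[pos])
            pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] == '(':
            children.append(read())
        expect(')')
        if fields and fields[0].startswith('"'):
            span = [int(x) for x in fields[1:3]]
            start, end = span if len(span) == 2 else (None, None)
            return ('terminal', fields[0][1:-1], start, end)
        if len(fields) == 1:
            return ('root', fields[0], children)
        nid, entity, score, start, end = fields
        return ('node', int(nid), entity, float(score),
                int(start), int(end), children)

    return read()


tree = parse_udf(
    '(root (2 rule 1.5 0 2'
    ' (1 lex_a 0.5 0 1 ("a" 0 1))'
    ' (3 lex_b -0.25 1 2 ("b" 1 2))))')
print(tree[0], tree[1], len(tree[2]))
```

The first field decides the node type: a quoted string marks a terminal, a lone entity name marks a root, and five fields mark a normal node.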

Loading Derivation Data

Two functions load derivations: from_string() reads the UDF/UDX string representation, and from_dict() reads the dictionary representation.

>>> from delphin import derivation
>>> d1 = derivation.from_string(
...     '(1 entity-name 1 0 1 ("token"))')
>>> d2 = derivation.from_dict(
...     {'id': 1, 'entity': 'entity-name', 'score': 1,
...      'start': 0, 'end': 1, 'form': 'token'})
>>> d1 == d2
True

delphin.derivation.from_string(s)

Instantiate a Derivation from a UDF or UDX string representation.

The UDF/UDX representations are as output by a processor like the LKB or ACE, or from the UDFNode.to_udf() or UDFNode.to_udx() methods.

Parameters: s (str) – UDF or UDX serialization

delphin.derivation.from_dict(d)

Instantiate a Derivation from a dictionary representation.

The dictionary representation may come from the HTTP interface (see the ErgApi wiki) or from the UDFNode.to_dict() method. Note that in the former case, the JSON response should have already been decoded into a Python dictionary.

Parameters: d (dict) – dictionary representation of a derivation
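As a small illustration of the decoding step mentioned above, the following sketch turns a JSON string into a Python dictionary with the standard `json` module. The payload is a made-up example shaped like the dictionary in the loading example; it is not a real HTTP response.

```python
import json

# Hypothetical JSON text, shaped like the dictionary in the loading example.
payload = ('{"id": 1, "entity": "entity-name", "score": 1, '
           '"start": 0, "end": 1, "form": "token"}')

# Decode first: from_dict() expects a dict, not the raw JSON string.
d = json.loads(payload)
print(type(d).__name__, d['entity'], d['form'])
```

The resulting `d` is the kind of object that would then be passed to from_dict().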

UDF/UDX Classes

There are four classes for representing derivation trees. The Derivation class is used to contain the entire tree, while UDFNode, UDFTerminal, and UDFToken represent individual nodes.

class delphin.derivation.Derivation(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)

Bases: delphin.derivation.UDFNode

A [incr tsdb()] derivation.

A Derivation object is simply a UDFNode, but because it is intended to represent an entire derivation tree, it performs additional checks on instantiation. If the top node is a root node, it must have only the entity attribute set and exactly one node in its daughters list.
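The checks described above can be sketched as a standalone function. `check_root` is a hypothetical helper written for illustration, not PyDelphin's code, and it inspects only score, start, and end as the "other" attributes.

```python
def check_root(id, entity, score=None, start=None, end=None, daughters=None):
    """Raise ValueError if a root node (id is None) is malformed.

    Hypothetical helper mirroring the checks described in the text.
    """
    daughters = list(daughters or [])
    if id is not None:
        return  # not a root node, so there is nothing extra to check
    if any(x is not None for x in (score, start, end)):
        raise ValueError('root nodes may only set the entity attribute')
    if len(daughters) != 1:
        raise ValueError('root nodes must have exactly one daughter')


# A well-formed root passes silently; a root with a score is rejected.
check_root(None, 'root', daughters=['child'])
try:
    check_root(None, 'root', score=1.0, daughters=['child'])
except ValueError as exc:
    print('rejected:', exc)
```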

class delphin.derivation.UDFNode(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)

Normal (non-leaf) nodes in the Unified Derivation Format.

Root nodes are just UDFNodes whose id, by convention, is None. The daughters list can be composed of either UDFNodes or other objects (generally it should be uniformly one or the other). In the latter case, the UDFNode is a preterminal, and the daughters are terminal nodes.

Parameters

id (int) – unique node identifier
entity (str) – grammar entity represented by the node
score (float, optional) – probability or weight of the node
start (int, optional) – start position of tokens encompassed by the node
end (int, optional) – end position of tokens encompassed by the node
daughters (list, optional) – iterable of daughter nodes
head (bool, optional) – True if the node is a syntactic head node
type (str, optional) – grammar type name
parent (UDFNode, optional) – parent node in derivation

id: The unique node identifier.

entity: The grammar entity represented by the node.

score: The probability or weight of the node; for many processors, this will be the unnormalized MaxEnt score assigned to the whole subtree rooted by this node.

start: The start position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters.

end: The end position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters.

type: The lexical type (available on preterminal UDX nodes).

is_root(): Return True if the node is a root node.

Note

This is not simply the top node; by convention, a node is a root if its id is None.

to_udf(indent=1)

Encode the node and its descendants in the UDF format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string
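To show what the indent parameter controls, here is a rough sketch of an indented serializer over plain tuples. It is an assumption-laden stand-in rather than the library's method: the standalone `to_udf` function, its `level` parameter, and the `(fields, daughters)` tuple layout are hypothetical, and the output is not guaranteed to match PyDelphin's byte for byte.

```python
def to_udf(node, indent=1, level=0):
    """Serialize a nested-tuple tree in a UDF-like layout (sketch).

    A node is either a str (terminal surface form) or a pair
    (fields, daughters), where fields is a tuple of printable values.
    """
    pad = ' ' * (indent * level)
    if isinstance(node, str):
        return '{}("{}")'.format(pad, node)
    fields, daughters = node
    head = ' '.join(str(f) for f in fields)
    if not daughters:
        return '{}({})'.format(pad, head)
    # Each daughter goes on its own line, one level deeper.
    parts = [to_udf(d, indent, level + 1) for d in daughters]
    return '{}({}\n{})'.format(pad, head, '\n'.join(parts))


tree = (('root',), [((1, 'entity-name', 1, 0, 1), ['token'])])
print(to_udf(tree))
print(to_udf(tree, indent=2))
```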

to_udx(indent=1)

Encode the node and its descendants in the UDF export format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string

to_dict(fields=('end', 'id', 'tokens', 'start', 'daughters', 'entity', 'type', 'score', 'form', 'head'), labels=None)

Encode the node as a dictionary suitable for JSON serialization.

Parameters

fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])

Returns

dict – the dictionary representation of the structure
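The effect of a fields whitelist can be sketched on plain dictionaries. `filter_fields` is a hypothetical helper, and the nested shape of `full` (a root dict holding a daughters list) is an assumption made for illustration; the only behavior taken from the description above is that daughters and form are always kept.

```python
import json

ALWAYS_SHOWN = ('daughters', 'form')


def filter_fields(d, fields):
    """Keep only whitelisted keys, recursing into daughters (sketch)."""
    keep = set(fields) | set(ALWAYS_SHOWN)
    out = {k: v for k, v in d.items() if k in keep and k != 'daughters'}
    if 'daughters' in d:
        out['daughters'] = [filter_fields(x, fields) for x in d['daughters']]
    return out


# Assumed dictionary shape for a root node with one preterminal daughter.
full = {
    'entity': 'root',
    'daughters': [
        {'id': 1, 'entity': 'entity-name', 'score': 1,
         'start': 0, 'end': 1, 'form': 'token'},
    ],
}
slim = filter_fields(full, fields=('id', 'entity'))
print(json.dumps(slim, sort_keys=True))
```

Because the result contains only dicts, lists, strings, and numbers, it can be handed to `json.dumps` directly.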

is_head()

Return True if the node is a head.

A node is a head if it is marked as a head in the UDX format or it has no siblings. False is returned if the node is known to not be a head (has a sibling that is a head). Otherwise it is indeterminate whether the node is a head, and None is returned.
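The three possible return values can be illustrated with a tiny stand-in class. `Node` below is hypothetical and much simpler than UDFNode; it only encodes the rule as stated above.

```python
class Node:
    """Tiny stand-in for a derivation node (hypothetical, not UDFNode)."""

    def __init__(self, name, head=None, daughters=()):
        self.name = name
        self.head = head      # True if marked as head in UDX, else None
        self.parent = None
        self.daughters = list(daughters)
        for d in self.daughters:
            d.parent = self

    def is_head(self):
        if self.head:
            return True       # explicitly marked as the head
        if self.parent is None:
            siblings = []
        else:
            siblings = [d for d in self.parent.daughters if d is not self]
        if not siblings:
            return True       # a node without siblings counts as a head
        if any(s.head for s in siblings):
            return False      # a sibling is the marked head
        return None           # no marking available: indeterminate


marked, other = Node('a', head=True), Node('b')
Node('p', daughters=[marked, other])
left, right = Node('c'), Node('d')
Node('q', daughters=[left, right])
print(marked.is_head(), other.is_head(), left.is_head())
```

Callers therefore need to distinguish False from None (for example with `is False`) rather than relying on truthiness.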

preterminals(): Return the list of preterminals (i.e. lexical grammar-entities).

terminals(): Return the list of terminals (i.e. lexical units).
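The difference between the two can be sketched with a left-to-right traversal over plain tuples. The `(entity, daughters)` layout and both functions are illustrative assumptions, not the library's implementation; the entity names come from the Japanese example above.

```python
def preterminals(node):
    """Yield nodes whose daughters are all surface strings (sketch)."""
    entity, daughters = node
    if all(isinstance(d, str) for d in daughters):
        yield node
    else:
        for d in daughters:
            yield from preterminals(d)


def terminals(node):
    """Yield the surface strings under each preterminal, in order."""
    for _, forms in preterminals(node):
        yield from forms


# A fragment of the example tree, reduced to (entity, daughters) pairs.
tree = ('hf-complement-rule', [
    ('quantify-n-rule', [('tooku_1', ['遠く'])]),
    ('ni-narg', ['に']),
])
print([entity for entity, _ in preterminals(tree)])
print(list(terminals(tree)))
```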

class delphin.derivation.UDFTerminal(form, tokens=None, parent=None)

Terminal nodes in the Unified Derivation Format.

The form field is always set, but tokens may be None.

See: http://moin.delph-in.net/ItsdbDerivations

Parameters

form (str) – surface form of the terminal
tokens (list, optional) – iterable of tokens
parent (UDFNode, optional) – parent node in derivation

form: The surface form of the terminal.

tokens: The list of tokens.

is_root()

Return False (as a UDFTerminal is never a root).

This function is provided for convenience, so one does not need to check if isinstance(n, UDFNode) before testing if the node is a root.

to_udf(indent=1)

Encode the node and its descendants in the UDF format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string

to_udx(indent=1)

Encode the node and its descendants in the UDF export format.

Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string

to_dict(fields=('end', 'id', 'tokens', 'start', 'daughters', 'entity', 'type', 'score', 'form', 'head'), labels=None)

Encode the node as a dictionary suitable for JSON serialization.

Parameters

fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])

Returns

dict – the dictionary representation of the structure

class delphin.derivation.UDFToken(id, tfs)

A token representation in derivations.

Token data are not formally nodes, but do have an id. Most UDFTerminal nodes will only have one UDFToken, but multi-word entities (e.g. “ad hoc”) will have more than one.

Parameters

id (int) – token identifier
tfs (str) – the feature structure for the token

id: The token identifier.

tfs: The feature structure for the token.