delphin.codecs

Serialization Codecs for Semantic Representations

The delphin.codecs package is a namespace package for modules used in the serialization and deserialization of semantic representations. All modules included in this namespace must follow the common API (based on Python’s pickle and json modules) in order to work correctly with PyDelphin. This document describes that API.

Included Codecs

MRS:

DMRS:

EDS:

Codec API

Module Constants

There is one required module constant for codecs: CODEC_INFO. Its primary purpose is to specify which representation (MRS, DMRS, EDS) the codec serializes. A codec without CODEC_INFO will work for programmatic usage, but it will not work with the delphin.commands.convert() function or at the command line with the delphin convert command, which use the representation key in CODEC_INFO to determine when and how to convert representations.

CODEC_INFO

A dictionary containing information about the codec. While codec authors may put arbitrary data here, there are two keys used by PyDelphin’s conversion features: representation and description. Only representation is required, and should be set to one of mrs, dmrs, or eds. For example, the mrsjson codec uses the following:

CODEC_INFO = {
    'representation': 'mrs',
    'description': 'JSON-serialized MRS for the Web API'
}
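As a rough illustration of why the representation key matters, the sketch below shows one way a caller could inspect a codec module before attempting a conversion. The helper get_representation() and the two stand-in modules are hypothetical and are not part of PyDelphin; the actual logic in delphin.commands.convert() may differ.

```python
import types


def get_representation(codec):
    """Return the representation a codec declares, or None if undeclared.

    Hypothetical helper for illustration; not a PyDelphin function.
    """
    info = getattr(codec, 'CODEC_INFO', None)
    if not isinstance(info, dict):
        return None
    return info.get('representation')


# Stand-in codec modules (hypothetical), one with CODEC_INFO and one without.
with_info = types.ModuleType('toy_codec')
with_info.CODEC_INFO = {
    'representation': 'mrs',
    'description': 'JSON-serialized MRS for the Web API',
}
without_info = types.ModuleType('bare_codec')

declared = get_representation(with_info)
undeclared = get_representation(without_info)
```

A codec like the second one can still be called directly, but a caller that relies on the representation key has nothing to dispatch on.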

The following module constants are optional and describe strings that must appear in valid documents when serializing multiple semantic representations at a time, as with dump() and dumps(). They are used by delphin.commands.convert() to provide a streaming serialization rather than dumping the entire file at once. If the values are not defined in the codec module, default values will be used.

HEADER

The string to output before any semantic representations are serialized. For example, in delphin.codecs.mrx, the value of HEADER is <mrs-list>, and in delphin.codecs.dmrstikz it is an entire LaTeX preamble followed by \begin{document}.

JOINER

The string used to join multiple serialized semantic representations. For example, in delphin.codecs.mrsjson, it is a comma (,) following JSON’s syntax. Normally it is either an empty string, a space, or a newline, depending on the conventions of the format and on whether the indent argument is set.

FOOTER

The string to output after all semantic representations have been serialized. For example, in delphin.codecs.mrx, it is </mrs-list>, and in delphin.codecs.dmrstikz it is \end{document}.
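The following is a minimal sketch of how the three strings could support streaming output: each piece is yielded as soon as it is ready, so the whole document never has to be held in memory. The function stream_document() is hypothetical and is not PyDelphin's implementation; the HEADER and FOOTER values mirror the mrx values quoted above, while the empty JOINER is an assumption made for the example.

```python
from typing import Iterable, Iterator

# Values modeled on the mrx examples in the text; JOINER is assumed.
HEADER = '<mrs-list>'
JOINER = ''
FOOTER = '</mrs-list>'


def stream_document(encoded: Iterable[str],
                    header: str = HEADER,
                    joiner: str = JOINER,
                    footer: str = FOOTER) -> Iterator[str]:
    """Yield the pieces of a valid document one at a time.

    Hypothetical helper: `encoded` is an iterable of already
    serialized single representations.
    """
    yield header
    for i, item in enumerate(encoded):
        if i > 0:
            # The joiner goes between items, never before the first one.
            yield joiner
        yield item
    yield footer


xml_doc = ''.join(stream_document(['<mrs>a</mrs>', '<mrs>b</mrs>']))
json_doc = ''.join(stream_document(['1', '2'], '[', ',', ']'))
```

In practice each yielded piece would be written to the output stream immediately instead of being joined into one string.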

Deserialization Functions

The deserialization functions load(), loads(), and decode() accept textual serializations and return the interpreted semantic representation. Both load() and loads() expect full documents (including headers and footers, such as <mrs-list> and </mrs-list> around an mrx serialization) and return lists of semantic structure objects. The decode() function expects a single representation (without headers and footers) and returns a single semantic structure object.

Reading from a file or stream

load(source)

Deserialize and return semantic representations from source.

Parameters:

source – path-like object or file handle of a source containing serialized semantic representations

Return type:

list

Reading from a string

loads(s)

Deserialize and return semantic representations from string s.

Parameters:

s – string containing serialized semantic representations

Return type:

list

Decoding from a string

decode(s)

Deserialize and return the semantic representation from string s.

Parameters:

s – string containing a serialized semantic representation

Return type:

subclass of delphin.sembase.SemanticStructure
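To make the division of labor among the three functions concrete, here is a sketch of a codec for a made-up "toy" format. Everything in it is an assumption for illustration: the format, the delimiters, and the use of a plain dict in place of a delphin.sembase.SemanticStructure subclass. It shows only the shape of the API, with load() and loads() handling full documents and decode() handling one representation.

```python
import io
from pathlib import Path

# Hypothetical document delimiters for a made-up "toy" format.
HEADER = '<toy-list>'
FOOTER = '</toy-list>'
_OPEN, _CLOSE = '<toy>', '</toy>'


def decode(s):
    """Deserialize one representation (no header or footer)."""
    s = s.strip()
    if not (s.startswith(_OPEN) and s.endswith(_CLOSE)):
        raise ValueError('not a single toy representation')
    # A plain dict stands in for a SemanticStructure subclass.
    return {'top': s[len(_OPEN):-len(_CLOSE)]}


def loads(s):
    """Deserialize a full document string into a list."""
    s = s.strip()
    if not (s.startswith(HEADER) and s.endswith(FOOTER)):
        raise ValueError('missing document header or footer')
    body = s[len(HEADER):-len(FOOTER)]
    chunks = [c for c in body.split(_CLOSE) if c.strip()]
    return [decode(c + _CLOSE) for c in chunks]


def load(source):
    """Deserialize from a file handle or a path-like object."""
    if hasattr(source, 'read'):
        return loads(source.read())
    return loads(Path(source).read_text(encoding='utf-8'))


doc = '<toy-list><toy>h0</toy><toy>h1</toy></toy-list>'
structures = loads(doc)                # a list of two structures
single = decode('<toy>h0</toy>')       # one structure
from_stream = load(io.StringIO(doc))   # same list, read from a stream
```

Implementing load() in terms of loads(), and loads() in terms of decode(), keeps the parsing logic in one place.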

Serialization Functions

The serialization functions dump(), dumps(), and encode() take semantic representations as input and either return a string or write to a file or stream. Both dump() and dumps() will provide the appropriate HEADER, JOINER, and FOOTER values to make the result a valid document. The encode() function serializes only a single semantic representation, which is generally useful when working with single representations, but is also useful when headers and footers are not desired (e.g., if you want the dmrx representation of a DMRS without <dmrs-list> and </dmrs-list> surrounding it).

Writing to a file or stream

dump(xs, destination, properties=True, lnk=True, indent=False, encoding='utf-8')

Serialize semantic representations in xs to destination.

Parameters:
  • xs – iterable of SemanticStructure objects to serialize

  • destination – path-like object or file object where data will be written to

  • properties (bool) – if False, suppress morphosemantic properties

  • lnk (bool) – if False, suppress surface alignments and strings

  • indent – if True or an integer value, add newlines and indentation; some codecs may support an integer value for indent, which specifies how many columns to indent

  • encoding (str) – if destination is a filename, write to the file with the given encoding; otherwise it is ignored

Writing to a string

dumps(xs, properties=True, lnk=True, indent=False)

Serialize semantic representations in xs and return the string.

The arguments are interpreted as in dump().

Return type:

str

Encoding to a string

encode(x, properties=True, lnk=True, indent=False)

Serialize a single semantic representation x and return the string.

The arguments are interpreted as in dump().

Return type:

str
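The sketch below shows one way the three serialization functions could fit together, using the signatures documented above. The JSON-like toy format, the dict stand-in for a SemanticStructure, and the constant values are all assumptions for illustration and do not reproduce any codec shipped with PyDelphin.

```python
import io
import json
from pathlib import Path

# Hypothetical constants for a JSON-array document.
HEADER = '['
JOINER = ','
FOOTER = ']'


def encode(x, properties=True, lnk=True, indent=False):
    """Serialize one structure; x is a dict standing in for a
    SemanticStructure in this toy example."""
    data = {'top': x['top']}
    if properties:
        data['properties'] = x.get('properties', {})
    if lnk:
        data['lnk'] = x.get('lnk')
    if indent is True:
        indent = 2       # assumed default width for this toy codec
    elif indent is False:
        indent = None    # compact, single-line output
    return json.dumps(data, indent=indent)


def dumps(xs, properties=True, lnk=True, indent=False):
    """Serialize all structures into one valid document string."""
    parts = (encode(x, properties=properties, lnk=lnk, indent=indent)
             for x in xs)
    return HEADER + JOINER.join(parts) + FOOTER


def dump(xs, destination, properties=True, lnk=True, indent=False,
         encoding='utf-8'):
    """Write the document to a file object or a path-like object."""
    text = dumps(xs, properties=properties, lnk=lnk, indent=indent)
    if hasattr(destination, 'write'):
        print(text, file=destination)  # encoding is ignored here
    else:
        Path(destination).write_text(text + '\n', encoding=encoding)


x = {'top': 'h0', 'properties': {'TENSE': 'past'}, 'lnk': [0, 3]}
bare = encode(x, properties=False, lnk=False)
document = dumps([x, x])
buffer = io.StringIO()
dump([x], buffer, properties=False, lnk=False)
```

Note that encode() produces no header or footer, so its output is a fragment, while the output of dumps() parses as a complete JSON array.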

Variations

All serialization codecs should use the function signatures above, but some variations are possible. Codecs should not remove any positional or keyword arguments from functions, but they may ignore them. If any new positional arguments are added, they should appear after the last positional argument of the function, before the keyword arguments. New keyword arguments may be added in any order. Finally, a codec may omit some functions entirely, such as for export-only codecs that do not provide load(), loads(), or decode(). The module constants HEADER, JOINER, and FOOTER are also optional. Here are some examples of variations in PyDelphin: