delphin.itsdb

See also

See Working with [incr tsdb()] Testsuites for a more user-friendly introduction.

Classes and functions for working with [incr tsdb()] profiles.

The itsdb module provides classes and functions for working with [incr tsdb()] profiles (or, more generally, testsuites; see http://moin.delph-in.net/ItsdbTop). It handles the technical details of encoding and decoding records in tables, including escaping and unescaping reserved characters, pairing columns with their relational descriptions, casting types (such as :integer), and transparently handling gzipped tables, so that the user has a natural way of working with the data. Capabilities include:

  • Reading and writing testsuites:

    >>> from delphin import itsdb
    >>> ts = itsdb.TestSuite('jacy/tsdb/gold/mrs')
    >>> ts.write(path='mrs-copy')
    
  • Selecting data by table name, record index, and column name or index:

    >>> items = ts['item']           # get the items table
    >>> rec = items[0]               # get the first record
    >>> rec['i-input']               # input sentence of the first item
    '雨 が 降っ た .'
    >>> rec[0]                       # values are cast on index retrieval
    11
    >>> rec.get('i-id')              # and on key retrieval
    11
    >>> rec.get('i-id', cast=False)  # unless cast=False
    '11'
    
  • Selecting data as a query (note that types are cast by default):

    >>> next(ts.select('item:i-id@i-input@i-date'))  # query testsuite
    [11, '雨 が 降っ た .', datetime.datetime(2006, 5, 28, 0, 0)]
    >>> next(items.select('i-id@i-input@i-date'))    # query table
    [11, '雨 が 降っ た .', datetime.datetime(2006, 5, 28, 0, 0)]
    
  • In-memory modification of testsuite data:

    >>> # desegment each sentence
    >>> for record in ts['item']:
    ...     record['i-input'] = ''.join(record['i-input'].split())
    ...
    >>> ts['item'][0]['i-input']
    '雨が降った.'
    
  • Joining tables:

    >>> joined = itsdb.join(ts['parse'], ts['result'])
    >>> next(joined.select('i-id@mrs'))
    [11, '[ LTOP: h1 INDEX: e2 [ e TENSE: PAST ...']
    
  • Processing data with ACE (results are stored in memory):

    >>> from delphin.interfaces import ace
    >>> with ace.AceParser('jacy.dat') as cpu:
    ...     ts.process(cpu)
    ...
    NOTE: parsed 126 / 135 sentences, avg 3167k, time 1.87536s
    >>> ts.write('new-profile')
    

This module covers all aspects of [incr tsdb()] data, from Relations files and Field descriptions to Record, Table, and full TestSuite classes. TestSuite is the most user-facing interface, and it makes it easy to load the tables of a testsuite into memory, inspect its contents, modify or create data, and write the data to disk.

By default, the itsdb module expects testsuites to use the standard [incr tsdb()] schema. Testsuites are always read and written according to the associated or specified relations file, but other things, such as default field values and the list of “core” tables, are defined for the standard schema. It is, however, possible to define non-standard schemata for particular applications, and most functions will continue to work. One notable exception is the TestSuite.process() method, for which a new FieldMapper class must be defined.

Overview of [incr tsdb()] Testsuites

[incr tsdb()] testsuites are directories containing a relations file (see Relations Files and Field Descriptions) and a file for each table in the database. The typical testsuite contains these files:

testsuite/
  analysis  fold             item-set   parse       relations  run    tree
  decision  item             output     phenomenon  result     score  update
  edge      item-phenomenon  parameter  preference  rule       set

PyDelphin has three classes for working with [incr tsdb()] testsuite databases:

  • TestSuite – The entire testsuite (or directory)
  • Table – A table (or file) in a testsuite
  • Record – A row (or line) in a table
class delphin.itsdb.TestSuite(path=None, relations=None, encoding='utf-8')

A [incr tsdb()] testsuite database.

Parameters:
  • path – the path to the testsuite’s directory
  • relations (Relations, str) – the database schema; either a Relations object or a path to a relations file; if not given, the relations file under path will be used
  • encoding – the character encoding of the files in the testsuite
encoding

character encoding used when reading and writing tables

Type:str
relations

database schema

Type:Relations
exists(table=None)

Return True if the testsuite or a table exists on disk.

If table is None, this method returns True if the TestSuite.path is specified and points to an existing directory containing a valid relations file. If table is given, the function returns True if, in addition to the above conditions, the table exists as a file (even if empty). Otherwise it returns False.

process(cpu, selector=None, source=None, fieldmapper=None, gzip=None, buffer_size=1000)

Process each item in a [incr tsdb()] testsuite.

If the testsuite is attached to files on disk, the output records will be flushed to disk when the number of new records in a table is buffer_size. If the testsuite is not attached to files or buffer_size is set to None, records are kept in memory and not flushed to disk.

Parameters:
  • cpu (Processor) – processor interface (e.g., AceParser)
  • selector (str) – data specifier to select a single table and column as processor input (e.g., “item:i-input”)
  • source (TestSuite, Table) – testsuite or table from which inputs are taken; if None, use self
  • fieldmapper (FieldMapper) – object for mapping response fields to [incr tsdb()] fields; if None, use a default mapper for the standard schema
  • gzip – compress non-empty tables with gzip
  • buffer_size (int) – number of output records to hold in memory before flushing to disk; ignored if the testsuite is all in-memory; if None, do not flush to disk

Examples

>>> ts.process(ace_parser)
>>> ts.process(ace_generator, 'result:mrs', source=ts2)
reload()

Discard temporary changes and reload the database from disk.

select(arg, cols=None, mode='list')

Select columns from each row in the table.

The first parameter, arg, may either be a table name or a data specifier. If the former, the cols parameter selects the columns from the table. If the latter, cols is left unspecified and both the table and columns are taken from the data specifier; e.g., select('item:i-id@i-input') is equivalent to select('item', ('i-id', 'i-input')).

See select_rows() for a description of how to use the mode parameter.

Parameters:
  • arg – a table name, if cols is specified, otherwise a data specifier
  • cols – an iterable of Field (column) names
  • mode – how to return the data
size(table=None)

Return the size, in bytes, of the testsuite or table.

If table is None, return the size of the whole testsuite (i.e., the sum of the table sizes). Otherwise, return the size of table.

Notes

  • If the file is gzipped, it returns the compressed size.
  • Only tables on disk are included.
write(tables=None, path=None, relations=None, append=False, gzip=None)

Write the testsuite to disk.

Parameters:
  • tables – a name or iterable of names of tables to write, or a Mapping of table names to table data; if None, all tables will be written
  • path – the destination directory; if None use the path assigned to the TestSuite
  • relations – a Relations object or path to a relations file to be used when writing the tables
  • append – if True, append to rather than overwrite tables
  • gzip – compress non-empty tables with gzip

Examples

>>> ts.write(path='new/path')
>>> ts.write('item')
>>> ts.write(['item', 'parse', 'result'])
>>> ts.write({'item': item_rows})
class delphin.itsdb.Table(fields, records=None)

A [incr tsdb()] table.

Instances of this class contain a collection of rows with the data stored in the database. Generally a Table will be created by a TestSuite object for a database, but a Table can also be instantiated individually by the Table.from_file() class method, and the relations file in the same directory is used to get the schema. Tables can also be constructed entirely in-memory and separate from a testsuite via the standard Table() constructor.

Tables have two modes: attached and detached. Attached tables are backed by a file on disk (whether as part of a testsuite or not) and only store modified records in memory; all unmodified records are retrieved from disk. Therefore, iterating over an attached table is more efficient than random access. Attached tables use significantly less memory than detached tables but also require more processing time. Detached tables are entirely stored in memory and are not backed by a file. They are useful for the programmatic construction of testsuites (including for unit tests) and other operations where high-speed random access is required. See the attach() and detach() methods for more information. The is_attached() method is useful for determining the mode of a table.

Parameters:
  • fields – the Relation schema for this table
  • records – the collection of Record objects containing the table data
name

table name

Type:str
fields

table schema

Type:Relation
path

if attached, the path to the file containing the table data; if detached it is None

Type:str
encoding

the character encoding of the attached table file; if detached it is None

Type:str
classmethod from_file(path, fields=None, encoding='utf-8')

Instantiate a Table from a database file.

This method instantiates a table attached to the file at path. The file will be opened and traversed to determine the number of records, but the contents will not be stored in memory unless they are modified.

Parameters:
  • path – the path to the table file
  • fields – the Relation schema for the table (loaded from the relations file in the same directory if not given)
  • encoding – the character encoding of the file at path
write(records=None, path=None, fields=None, append=False, gzip=None)

Write the table to disk.

The basic usage has no arguments and writes the table’s data to the attached file. The parameters accommodate a variety of use cases, such as using fields to refresh a table to a new schema, or using records and append to build a table incrementally.

Parameters:
  • records – an iterable of Record objects to write; if None the table’s existing data is used
  • path – the destination file path; if None use the path of the file attached to the table
  • fields (Relation) – table schema to use for writing, otherwise use the current one
  • append – if True, append rather than overwrite
  • gzip – compress with gzip if non-empty

Examples

>>> table.write()
>>> table.write(results, path='new/path/result')
commit()

Commit changes to disk if attached.

This method helps normalize the interface for detached and attached tables and makes writing attached tables a bit more efficient. For detached tables nothing is done, as there is no notion of changes, but neither is an error raised (unlike with write()). For attached tables, if all changes are new records, the changes are appended to the existing file, and otherwise the whole file is rewritten.
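
The append-or-rewrite decision described above can be sketched as a small function. This is only an illustration and not PyDelphin's implementation: sketch_commit_plan and its arguments are hypothetical names, and the sketch assumes that pending changes arrive as (row_index, record) pairs like those returned by list_changes().

```python
def sketch_commit_plan(rows_on_disk, changes):
    """Decide how an attached table's pending changes would reach its file.

    rows_on_disk: number of records already stored in the file
    changes: list of (row_index, record) pairs (hypothetical input,
             modelled on the return value of list_changes())
    """
    if not changes:
        return 'nothing'   # no pending changes, so nothing to write
    if all(index >= rows_on_disk for index, _ in changes):
        return 'append'    # every change is a new record past the end
    return 'rewrite'       # at least one existing record was modified
```

For example, with three records on disk, changes at row indices 3 and 4 would be appended, while a change at row index 1 would force a rewrite of the whole file.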

attach(path, encoding='utf-8')

Attach the Table to the file at path.

Attaching a table to a file means that only changed records are stored in memory, which greatly reduces the memory footprint of large profiles at some cost in performance. Tables created from Table.from_file() or from an attached TestSuite are automatically attached. Attaching a file does not immediately flush the contents to disk; after attaching, the table must be written separately to commit the in-memory data.

A non-empty table will fail to attach to a non-empty file to avoid data loss when merging the contents. In this case, you may delete or clear the file, clear the table, or attach to another file.

Parameters:
  • path – the path to the table file
  • encoding – the character encoding of the file at path
detach()

Detach the table from a file.

Detaching a table reads all data from the file and places it in memory. This is useful when constructing or significantly manipulating table data, or when more speed is needed. Tables created by the default constructor are detached.

When detaching, only unmodified records are loaded from the file; any uncommitted changes in the Table are left as-is.

Warning

Very large tables may consume all available RAM when detached. Expect the in-memory table to take up about twice the space of an uncompressed table on disk, although this may vary by system.

is_attached()

Return True if the table is attached to a file.

list_changes()

Return a list of modified records.

This is only applicable for attached tables.

Returns:A list of (row_index, record) tuples of modified records
Raises:delphin.exceptions.ItsdbError – when called on a detached table
append(record)

Add record to the end of the table.

Parameters:record – a Record or other iterable containing column values
extend(records)

Add each record in records to the end of the table.

Parameters:records – an iterable of Record or other iterables containing column values
select(cols, mode='list')

Select columns from each row in the table.

See select_rows() for a description of how to use the mode parameter.

Parameters:
  • cols – an iterable of Field (column) names
  • mode – how to return the data
class delphin.itsdb.Record(fields, iterable)

A row in a [incr tsdb()] table.

Parameters:
  • fields – the Relation schema for the table of this record
  • iterable – an iterable containing the data for the record
fields

table schema

Type:Relation
classmethod from_dict(fields, mapping)

Create a Record from a dictionary of field mappings.

The fields object is used to determine the column indices of fields in the mapping.

Parameters:
  • fields – the Relation schema for the table of this record
  • mapping – a dictionary or other mapping from field names to column values
Returns:

a Record object

get(key, default=None, cast=True)

Return the field data given by field name key.

Parameters:
  • key – the field name of the data to return
  • default – the value to return if key is not in the row
  • cast – if True, cast the value to the datatype of its field; if False, return the uncast string

Relations Files and Field Descriptions

A “relations file” is a required file in [incr tsdb()] testsuites that describes the schema of the database. The file contains descriptions of each table and each field within the table. The first nine lines of the run table description are as follows:

run:
  run-id :integer :key                  # unique test run identifier
  run-comment :string                   # descriptive narrative
  platform :string                      # implementation platform (version)
  protocol :integer                     # [incr tsdb()] protocol version
  tsdb :string                          # tsdb(1) (version) used
  application :string                   # application (version) used
  environment :string                   # application-specific information
  grammar :string                       # grammar (version) used
  ...
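
As a rough illustration of how one field line in the listing above breaks down, the following sketch splits a line into a name, a datatype, flags, and a comment. It is not PyDelphin's parser: SketchField and sketch_parse_field are hypothetical names, and the sketch assumes that a line always has a name followed by a datatype, that :key and :partial are the only flags, and that the comment begins at the first # character.

```python
from collections import namedtuple

# Hypothetical stand-in for a field description (not the library's Field class)
SketchField = namedtuple('SketchField', 'name datatype key partial comment')


def sketch_parse_field(line):
    """Parse one field line of a relations file into a SketchField."""
    # everything after the first '#' is treated as the comment
    spec, _, comment = line.partition('#')
    tokens = spec.split()
    name, datatype = tokens[0], tokens[1]
    flags = tokens[2:]  # e.g. [':key'] or [':key', ':partial']
    return SketchField(name, datatype,
                       ':key' in flags, ':partial' in flags,
                       comment.strip())


field = sketch_parse_field(
    '  run-id :integer :key                  # unique test run identifier')
```

Here field.name is 'run-id', field.datatype is ':integer', and field.key is True.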

In PyDelphin, there are three classes for modeling this information:

  • Relations – the entire relations file schema
  • Relation – the schema for a single table
  • Field – a single field description
class delphin.itsdb.Relations(tables)

A [incr tsdb()] database schema.

Note

Use from_file() or from_string() for instantiating a Relations object.

Parameters:tables – a list of (table, Relation) tuples
find(fieldname)

Return the list of tables that define the field fieldname.

classmethod from_file(source)

Instantiate Relations from a relations file.

classmethod from_string(s)

Instantiate Relations from a relations string.

items()

Return a list of (table, Relation) for each table.

path(source, target)

Find the path of id fields connecting two tables.

This is just a basic breadth-first-search. The relations file should be small enough to not be a problem.

Returns:a list of (table, fieldname) pairs describing the path from the source to the target table
Raises:delphin.exceptions.ItsdbError – when no path is found

Example

>>> relations.path('item', 'result')
[('parse', 'i-id'), ('result', 'parse-id')]
>>> relations.path('parse', 'item')
[('item', 'i-id')]
>>> relations.path('item', 'item')
[]
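
The breadth-first search mentioned above can be sketched over a toy schema. This is an illustration under stated assumptions rather than the library's code: SKETCH_KEYS is a hypothetical, heavily reduced mapping from table names to key field names (real tables declare more fields), sketch_path is a hypothetical name, and ValueError stands in for delphin.exceptions.ItsdbError.

```python
from collections import deque

# Hypothetical toy schema: table name -> names of its key fields
SKETCH_KEYS = {
    'item': ['i-id'],
    'parse': ['parse-id', 'i-id'],
    'result': ['parse-id', 'result-id'],
}


def sketch_path(keys, source, target):
    """Breadth-first search for a chain of shared key fields."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        table, path = queue.popleft()
        if table == target:
            return path
        for other, other_keys in keys.items():
            if other in visited:
                continue
            # two tables are connected if they share a key field name
            shared = [k for k in keys[table] if k in other_keys]
            if shared:
                visited.add(other)
                queue.append((other, path + [(other, shared[0])]))
    raise ValueError('no path from {} to {}'.format(source, target))
```

With this toy schema, sketch_path(SKETCH_KEYS, 'item', 'result') follows i-id into parse and then parse-id into result.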
class delphin.itsdb.Relation

A [incr tsdb()] table schema.

Parameters:
  • name – the table name
  • fields – a list of Field objects
index(fieldname)

Return the Field index given by fieldname.

keys()

Return the tuple of field names of key fields.

class delphin.itsdb.Field

A tuple describing a column in an [incr tsdb()] profile.

Parameters:
  • name (str) – the column name
  • datatype (str) – “:string”, “:integer”, “:date”, or “:float”
  • key (bool) – True if the column is a key in the database
  • partial (bool) – True if the column is a partial key
  • comment (str) – a description of the column
default_value()

Get the default value of the field.

Utility Functions

delphin.itsdb.join(table1, table2, on=None, how='inner', name=None)

Join two tables and return the resulting Table object.

Fields in the resulting table have their names prefixed with their corresponding table name. For example, when joining item and parse tables, the i-input field of the item table will be named item:i-input in the resulting Table. Pivot fields (those in on) are only stored once without the prefix.

Both inner and left joins are possible by setting the how parameter to inner and left, respectively.

Warning

Both table2 and the resulting joined table will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.

Parameters:
  • table1 (Table) – the left table to join
  • table2 (Table) – the right table to join
  • on (str) – the shared key to use for joining; if None, find shared keys using the schemata of the tables
  • how (str) – the method used for joining (“inner” or “left”)
  • name (str) – the name assigned to the resulting table
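
The naming convention and the difference between inner and left joins can be illustrated with plain dictionaries. The sketch below is not the library's join(): sketch_join is a hypothetical name, rows are assumed to be dicts rather than Record objects, and unmatched left rows in a left join simply carry no right-hand columns (how the library fills those columns is not covered here).

```python
def sketch_join(name1, rows1, name2, rows2, on, how='inner'):
    """Join two lists of dict rows on the shared column named by on."""
    # index the right-hand rows by their value for the pivot column
    index = {}
    for row in rows2:
        index.setdefault(row[on], []).append(row)
    for left in rows1:
        matches = index.get(left[on], [])
        if not matches and how == 'left':
            matches = [{}]  # keep the unmatched left row, with no right columns
        for right in matches:
            joined = {on: left[on]}  # the pivot is stored once, without a prefix
            for name, row in ((name1, left), (name2, right)):
                for col, value in row.items():
                    if col != on:
                        joined[name + ':' + col] = value
            yield joined


items = [{'i-id': 1, 'i-input': 'a'}, {'i-id': 2, 'i-input': 'b'}]
parses = [{'i-id': 1, 'readings': 2}]
inner = list(sketch_join('item', items, 'parse', parses, on='i-id'))
left = list(sketch_join('item', items, 'parse', parses, on='i-id', how='left'))
```

The inner join keeps only item 1, which has a matching parse row; the left join also keeps item 2.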
delphin.itsdb.match_rows(rows1, rows2, key, sort_keys=True)

Yield triples of (value, left_rows, right_rows) where left_rows and right_rows are lists of rows that share the same column value for key. This means that both rows1 and rows2 must have a column with the same name key.

Warning

Both rows1 and rows2 will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.

Parameters:
  • rows1 – a Table or list of Record objects
  • rows2 – a Table or list of Record objects
  • key (str) – the column name on which to match
  • sort_keys (bool) – if True, yield matching rows sorted by the matched key instead of the original order
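
A minimal sketch of this grouping, assuming rows are plain dicts and key values are directly comparable, might look as follows. The name sketch_match_rows is hypothetical and the code is illustrative only; in particular, it yields values found on only one side with an empty list for the other side, which is a choice made for this sketch.

```python
def sketch_match_rows(rows1, rows2, key, sort_keys=True):
    """Yield (value, left_rows, right_rows) for each value of key."""
    matched = {}  # value -> (left_rows, right_rows), in first-seen order
    for side, rows in enumerate((rows1, rows2)):
        for row in rows:
            matched.setdefault(row[key], ([], []))[side].append(row)
    values = sorted(matched) if sort_keys else list(matched)
    for value in values:
        left_rows, right_rows = matched[value]
        yield (value, left_rows, right_rows)


rows1 = [{'i-id': 2, 'x': 'a'}, {'i-id': 1, 'x': 'b'}]
rows2 = [{'i-id': 1, 'y': 'c'}, {'i-id': 3, 'y': 'd'}]
by_key = list(sketch_match_rows(rows1, rows2, 'i-id'))
in_order = [v for v, _, _ in sketch_match_rows(rows1, rows2, 'i-id', sort_keys=False)]
```

Because both inputs are fully indexed before anything is yielded, the sketch shares the memory caveat given in the warning above.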
delphin.itsdb.select_rows(cols, rows, mode='list', cast=True)

Yield data selected from rows.

It is sometimes useful to select a subset of data from a profile. This function selects the data in cols from rows and yields it in a form specified by mode. Possible values of mode are:

mode              description        example ['i-id', 'i-wf']
'list' (default)  a list of values   [10, 1]
'dict'            col to value map   {'i-id': 10, 'i-wf': 1}
'row'             [incr tsdb()] row  '10@1'
Parameters:
  • cols – an iterable of column names to select data for
  • rows – the rows to select column data from
  • mode – the form yielded data should take
  • cast – if True, cast column values to their datatype (requires rows to be Record objects)
Yields:

Selected data in the form specified by mode.
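
The three modes can be mimicked with a few lines of plain Python. This sketch assumes rows are dicts of already-cast values, omits the cast parameter, and does not escape reserved characters in 'row' mode; sketch_select_rows is a hypothetical name, not the library function.

```python
def sketch_select_rows(cols, rows, mode='list'):
    """Yield the values of cols from each row in the form given by mode."""
    for row in rows:
        values = [row[col] for col in cols]
        if mode == 'list':
            yield values
        elif mode == 'dict':
            yield dict(zip(cols, values))
        elif mode == 'row':
            # escaping of reserved characters is omitted in this sketch
            yield '@'.join(str(value) for value in values)
        else:
            raise ValueError('unknown mode: {!r}'.format(mode))


rows = [{'i-id': 10, 'i-wf': 1, 'i-input': 'x'}]
cols = ['i-id', 'i-wf']
as_list = list(sketch_select_rows(cols, rows))
as_dict = list(sketch_select_rows(cols, rows, mode='dict'))
as_row = list(sketch_select_rows(cols, rows, mode='row'))
```

The three results correspond to the example column of the table above.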

delphin.itsdb.make_row(row, fields)

Encode a mapping of column name to values into a [incr tsdb()] profile line. The fields parameter determines what columns are used, and default values are provided if a column is missing from the mapping.

Parameters:
  • row – a mapping of column names to values
  • fields – an iterable of Field objects
Returns:

A [incr tsdb()]-encoded string

delphin.itsdb.escape(string)

Replace any special characters with their [incr tsdb()] escape sequences. The characters and their escape sequences are:

@         -> \s
(newline) -> \n
\         -> \\

Also see unescape()

Parameters:string – the string to escape
Returns:The escaped string
delphin.itsdb.unescape(string)

Replace [incr tsdb()] escape sequences with the regular equivalents. Also see escape().

Parameters:string (str) – the escaped string
Returns:The string with escape sequences replaced
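
The escape table above can be reproduced in a short, self-contained sketch. The names sketch_escape and sketch_unescape are hypothetical and this is not necessarily how the library implements the two functions. One detail worth noting is that unescaping is done in a single pass, so that an escaped backslash followed by a literal n is not mistaken for an escaped newline.

```python
import re

# escape sequence -> original character, following the table above
_SKETCH_UNESCAPES = {'\\\\': '\\', '\\n': '\n', '\\s': '@'}


def sketch_escape(string):
    """Replace reserved characters with their escape sequences."""
    # backslashes go first so the ones introduced below are not re-escaped
    return (string.replace('\\', '\\\\')
                  .replace('\n', '\\n')
                  .replace('@', '\\s'))


def sketch_unescape(string):
    """Replace escape sequences with the characters they stand for."""
    # a single left-to-right pass keeps sequences from being misread
    return re.sub(r'\\(\\|n|s)',
                  lambda m: _SKETCH_UNESCAPES[m.group(0)],
                  string)


original = 'a@b\\c\nd'
escaped = sketch_escape(original)
restored = sketch_unescape(escaped)
```

Escaping and then unescaping a string gives back the original string.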
delphin.itsdb.decode_row(line, fields=None)

Decode a raw line from a profile into a list of column values.

Decoding involves splitting the line by the field delimiter (“@” by default) and unescaping special characters. If fields is given, cast the values into the datatype given by their respective Field object.

Parameters:
  • line – a raw line from a [incr tsdb()] profile.
  • fields – a list or Relation object of Fields for the row
Returns:

A list of column values.

delphin.itsdb.encode_row(fields)

Encode a list of column values into a [incr tsdb()] profile line.

Encoding involves escaping special characters for each value, then joining the values into a single string with the field delimiter (“@” by default). It does not fill in default values (see make_row()).

Parameters:fields – a list of column values
Returns:A [incr tsdb()]-encoded string
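
To make the round trip between a profile line and a list of values concrete, here is a simplified sketch. It is not the library code: sketch_decode_row and sketch_encode_row are hypothetical names, datatypes are passed as a plain list of strings instead of Field objects, only :integer and :float are cast, and empty values are left uncast (a choice made for this sketch).

```python
import re

SKETCH_DELIMITER = '@'
_SKETCH_UNESCAPES = {'\\\\': '\\', '\\n': '\n', '\\s': '@'}
_SKETCH_CASTS = {':integer': int, ':float': float}


def sketch_decode_row(line, datatypes=None):
    """Split a raw line into column values, optionally casting them."""
    values = [
        re.sub(r'\\(\\|n|s)', lambda m: _SKETCH_UNESCAPES[m.group(0)], value)
        for value in line.rstrip('\n').split(SKETCH_DELIMITER)
    ]
    if datatypes is not None:
        values = [
            _SKETCH_CASTS.get(datatype, str)(value) if value != '' else value
            for value, datatype in zip(values, datatypes)
        ]
    return values


def sketch_encode_row(values):
    """Escape each value and join the results with the field delimiter."""
    escaped = [
        str(value).replace('\\', '\\\\').replace('\n', '\\n').replace('@', '\\s')
        for value in values
    ]
    return SKETCH_DELIMITER.join(escaped)


decoded = sketch_decode_row('11@a\\sb@1', [':integer', ':string', ':integer'])
encoded = sketch_encode_row(decoded)
```

Splitting on the delimiter is safe because an @ inside a value is stored as the escape sequence \s, so a raw @ only ever separates fields.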
delphin.itsdb.get_data_specifier(string)

Return a tuple (table, cols) for some [incr tsdb()] data specifier, where cols is a list of column names or None. For example:

item              -> ('item', None)
item:i-input      -> ('item', ['i-input'])
item:i-input@i-wf -> ('item', ['i-input', 'i-wf'])
:i-input          -> (None, ['i-input'])
(otherwise)       -> (None, None)
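
The mapping above can be approximated with str.partition. The following sketch is illustrative only: sketch_data_specifier is a hypothetical name, and it treats an empty string, or a specifier with an @ before any colon, as the "otherwise" case.

```python
def sketch_data_specifier(string):
    """Split a data specifier into (table, cols)."""
    table, colon, cols = string.partition(':')
    table = table.strip() or None
    if table is not None and '@' in table:
        return (None, None)  # column list without a colon: not a specifier
    if colon and cols.strip():
        cols = [col.strip() for col in cols.split('@')]
    else:
        cols = None
    return (table, cols)
```

For instance, sketch_data_specifier('item:i-input@i-wf') gives the table name 'item' and the column list ['i-input', 'i-wf'].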

Deprecated

The following are remnants of the old functionality that will be removed in a future version, but remain for now to aid in the transition.

class delphin.itsdb.ItsdbProfile(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')

A [incr tsdb()] profile, analyzed and ready for reading or writing.

Parameters:
  • path – The path of the directory containing the profile
  • filters – A list of tuples [(table, cols, condition)] such that only rows in table where condition(row, row[col]) evaluates to a non-false value are returned; filters are tested in order for a table.
  • applicators – A list of tuples [(table, cols, function)] which will be used when reading rows from a table—the function will be applied to the contents of the column cell in the table. For each table, each column-function pair will be applied in order. Applicators apply after the filters.
  • index – If True, indices are created based on the keys of each table.
  • cast – if True, automatically cast data into the type defined by its relation field (e.g., :integer)

Deprecated since version v0.7.0.

add_applicator(table, cols, function)

Add an applicator. When reading table, rows in table will be modified by apply_rows().

Parameters:
  • table – The table to apply the function to.
  • cols – The columns in table to apply the function on.
  • function – The applicator function.
add_filter(table, cols, condition)

Add a filter. When reading table, rows in table will be filtered by filter_rows().

Parameters:
  • table – The table the filter applies to.
  • cols – The columns in table to filter on.
  • condition – The filter function.
exists(table=None)

Return True if the profile or a table exists.

If table is None, this function returns True if the root directory exists and contains a valid relations file. If table is given, the function returns True if the table exists as a file (even if empty). Otherwise it returns False.

join(table1, table2, key_filter=True)

Yield rows from a table built by joining table1 and table2. The column names in the rows have the original table name prepended and separated by a colon. For example, joining tables ‘item’ and ‘parse’ will result in column names like ‘item:i-input’ and ‘parse:parse-id’.

read_raw_table(table)

Yield rows in the [incr tsdb()] table. A row is a dictionary mapping column names to values. Data from a profile is decoded by decode_row(). No filters or applicators are used.

read_table(table, key_filter=True)

Yield rows in the [incr tsdb()] table that pass any defined filters, and with values changed by any applicators. If no filters or applicators are defined, the result is the same as from ItsdbProfile.read_raw_table().

select(table, cols, mode='list', key_filter=True)

Yield selected rows from table. This method just calls select_rows() on the rows read from table.

size(table=None)

Return the size, in bytes, of the profile or table.

If table is None, this function returns the size of the whole profile (i.e. the sum of the table sizes). Otherwise, it returns the size of table.

Note: if the file is gzipped, it returns the compressed size.

write_profile(profile_directory, relations_filename=None, key_filter=True, append=False, gzip=None)

Write all tables (as specified by the relations) to a profile.

Parameters:
  • profile_directory – The directory of the output profile
  • relations_filename – If given, read and use the relations at this path instead of the current profile’s relations
  • key_filter – If True, filter the rows by keys in the index
  • append – If True, append profile data to existing tables in the output profile directory
  • gzip – If True, compress tables using gzip. Table filenames will have .gz appended. If False, only write out text files. If None, use whatever the original file was.
write_table(table, rows, append=False, gzip=False)

Encode and write out table to the profile directory.

Parameters:
  • table – The name of the table to write
  • rows – The rows to write to the table
  • append – If True, append the encoded rows to any existing data.
  • gzip – If True, compress the resulting table with gzip. The table’s filename will have .gz appended.
class delphin.itsdb.ItsdbSkeleton(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')

A [incr tsdb()] skeleton, analyzed and ready for reading or writing.

See ItsdbProfile for initialization parameters.

Deprecated since version v0.7.0.

delphin.itsdb.get_relations(path)

Parse the relations file and return a Relations object that describes the database structure.

Note: for backward-compatibility only; use Relations.from_file()

Parameters:path – The path of the relations file.
Returns:A Relations object describing the database structure.

Deprecated since version v0.7.0.

delphin.itsdb.default_value(fieldname, datatype)

Return the default value for a column.

If the column name (e.g. i-wf) is defined to have an idiosyncratic value, that value is returned. Otherwise the default value for the column’s datatype is returned.

Parameters:
  • fieldname – the column name (e.g. i-wf)
  • datatype – the datatype of the column (e.g. :integer)
Returns:

The default value for the column.

Deprecated since version v0.7.0.

delphin.itsdb.make_skeleton(path, relations, item_rows, gzip=False)

Instantiate a new profile skeleton (only the relations file and item file) from an existing relations file and a list of rows for the item table. For standard relations files, it is suggested to have, as a minimum, the i-id and i-input fields in the item rows.

Parameters:
  • path – the destination directory of the skeleton—must not already exist, as it will be created
  • relations – the path to the relations file
  • item_rows – the rows to use for the item file
  • gzip – if True, the item file will be compressed
Returns:

An ItsdbProfile containing the skeleton data (but the profile data will already have been written to disk).

Raises:

delphin.exceptions.ItsdbError – if the destination directory could not be created.

Deprecated since version v0.7.0.

delphin.itsdb.filter_rows(filters, rows)

Yield rows matching all applicable filters.

Filter functions have binary arity (e.g. filter(row, col)) where the first parameter is the dictionary of row data, and the second parameter is the data at one particular column.

Parameters:
  • filters – a tuple of (cols, filter_func) where filter_func will be tested (filter_func(row, col)) for each col in cols where col exists in the row
  • rows – an iterable of rows to filter
Yields:

Rows matching all applicable filters

Deprecated since version v0.7.0.

delphin.itsdb.apply_rows(applicators, rows)

Yield rows after applying the applicator functions to them.

Applicators are simple unary functions that return a value, and that value is stored in the yielded row. E.g. row[col] = applicator(row[col]). These are useful to, e.g., cast strings to numeric datatypes, to convert formats stored in a cell, extract features for machine learning, and so on.

Parameters:
  • applicators – a tuple of (cols, applicator) where the applicator will be applied to each col in cols
  • rows – an iterable of rows for applicators to be called on
Yields:

Rows with specified column values replaced with the results of the applicators

Deprecated since version v0.7.0.