delphin.itsdb

See also

See Working with [incr tsdb()] Test Suites for a more user-friendly introduction.

[incr tsdb()] Test Suites

Note

This module implements high-level structures and operations on top of TSDB test suites. For the basic, low-level functionality, see delphin.tsdb. For complex queries of the databases, see delphin.tsql.

[incr tsdb()] is a tool built on top of TSDB databases for profiling and comparing grammar versions using test suites. This module is named after that tool because it also builds higher-level operations on top of TSDB test suites, although its scope is much narrower. Its aim is to help users create, process, and manipulate test suites.

The typical test suite contains these files:

testsuite/
  analysis  fold             item-set   parse       relations  run    tree
  decision  item             output     phenomenon  result     score  update
  edge      item-phenomenon  parameter  preference  rule       set

Test Suite Classes

PyDelphin has three classes for working with [incr tsdb()] test suite databases:

class delphin.itsdb.TestSuite(path=None, schema=None, encoding='utf-8')[source]

Bases: Database

A [incr tsdb()] test suite database.

Parameters:
  • path – the path to the test suite’s directory

  • schema (dict, str) – the database schema; either a mapping of table names to lists of Fields or a path to a relations file; if not given, the relations file under path will be used

  • encoding – the character encoding of the files in the test suite

schema

database schema as a mapping of table names to lists of Field objects

Type:

dict

encoding

character encoding used when reading and writing tables

Type:

str

commit()[source]

Commit the current changes to disk.

This method writes the current state of the test suite to disk. The effect is similar to using tsdb.write_database(), except that it also updates the test suite’s internal bookkeeping so that it is aware that the current transaction is complete. It also may be more efficient if the only changes are adding new rows to existing tables.

property in_transaction

Return True if there are uncommitted changes.

property path

The database directory’s path.

process(cpu, selector=None, source=None, fieldmapper=None, gzip=False, buffer_size=1000, callback=None)[source]

Process each item in a [incr tsdb()] test suite.

Output rows are flushed to disk whenever the number of new rows in a table reaches buffer_size.

The callback parameter can be used, for example, to update a progress indicator.
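As a minimal sketch of the progress-indicator use case, the callback can be any callable that accepts one response and whose return value is discarded. The ProgressCounter class below is a hypothetical helper, not part of PyDelphin, and the loop at the end only simulates the calls that process() would make; empty dicts stand in for real responses.

```python
import sys


class ProgressCounter:
    """Hypothetical helper: count processed items and report progress."""

    def __init__(self, total, stream=sys.stderr):
        self.total = total      # expected number of items
        self.count = 0          # items seen so far
        self.stream = stream

    def __call__(self, response):
        # The response is not inspected here; it is only counted.
        self.count += 1
        self.stream.write(f'\rprocessed {self.count}/{self.total}')
        self.stream.flush()


# Simulate three callback invocations with placeholder responses.
progress = ProgressCounter(total=3)
for fake_response in ({}, {}, {}):
    progress(fake_response)
```

With a real test suite and processor, the same object would be passed as the callback argument, e.g. ts.process(cpu, callback=progress).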

Parameters:
  • cpu (Processor) – processor interface (e.g., ACEParser)

  • selector – a pair of (table_name, column_name) that specify the table and column used for processor input (e.g., ('item', 'i-input'))

  • source (Database) – test suite from which inputs are taken; if None, use the current test suite

  • fieldmapper (FieldMapper) – object for mapping response fields to [incr tsdb()] fields; if None, use a default mapper for the standard schema

  • gzip – if True, compress non-empty tables with gzip

  • buffer_size (int) – number of output rows to hold in memory before flushing to disk; ignored if the test suite is all in-memory; if None, do not flush to disk

  • callback – a function that is called with the response for each item processed; the return value is ignored

Examples

>>> ts.process(ace_parser)
>>> ts.process(ace_generator, ('result', 'mrs'), source=ts2)

processed_items(fieldmapper=None)[source]

Iterate over the data as Response objects.

reload()[source]

Discard temporary changes and reload the database from disk.

select_from(name, columns=None, cast=True)[source]

Select the fields named in columns from each row of the table name.

If no field names are given, all fields are returned.

If cast is False, simple tuples of raw data are returned instead of Row objects.

Yields:

Row

Examples

>>> next(ts.select_from('item'))
Row(10, 'unknown', 'formal', 'none', 1, 'S', 'It rained.', ...)
>>> next(ts.select_from('item', ('i-id',)))
Row(10)
>>> next(ts.select_from('item', ('i-id', 'i-input')))
Row(10, 'It rained.')
>>> next(ts.select_from('item', ('i-id', 'i-input'), cast=False))
('10', 'It rained.')

class delphin.itsdb.Table(dir, name, fields, encoding='utf-8')[source]

Bases: Relation

A [incr tsdb()] table.

Parameters:
  • dir – path to the database directory

  • name – name of the table

  • fields – the table schema; an iterable of tsdb.Field objects

  • encoding – character encoding of the table file

dir

The path to the database directory.

name

The name of the table.

fields

The table’s schema.

encoding

The character encoding of table files.

append(row)[source]

Add row to the end of the table.

Parameters:

row – a Row or other iterable containing column values

clear()[source]

Clear the table of all rows.

close()[source]

Close the table file being iterated over, if open.

column_index(name)[source]

Return the tuple index of the column with name name.

extend(rows)[source]

Add each row in rows to the end of the table.

Parameters:

rows – an iterable of Row or other iterables containing column values

get_field(name)[source]

Return the tsdb.Field object with column name name.

select(*names, cast=True)[source]

Select fields given by names from each row in the table.

If no field names are given, all fields are returned.

If cast is False, simple tuples of raw data are returned instead of Row objects.

Yields:

Row

Examples

>>> next(table.select())
Row(10, 'unknown', 'formal', 'none', 1, 'S', 'It rained.', ...)
>>> next(table.select('i-id'))
Row(10)
>>> next(table.select('i-id', 'i-input'))
Row(10, 'It rained.')
>>> next(table.select('i-id', 'i-input', cast=False))
('10', 'It rained.')

update(index, data)[source]

Update the row at index with data.

Parameters:
  • index – the 0-based index of the row in the table

  • data – a mapping of column names to values for replacement

Examples

>>> table.update(0, {'i-input': '...'})

class delphin.itsdb.Row(fields, data, field_index=None)[source]

A row in a [incr tsdb()] table.

The third argument, field_index, is optional. Its purpose is to reduce memory usage because the same field index can be shared by all rows for a table, but using an incompatible index can yield unexpected results for value retrieval by field names (row[field_name]).

Parameters:
  • fields – column descriptions; an iterable of tsdb.Field objects

  • data – raw column values

  • field_index – mapping of field name to its index in fields; if not given, it will be computed from fields
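The role of the field index can be illustrated without the library. The sketch below uses plain tuples and dicts rather than Row objects, and the get() helper is hypothetical; it only shows why one index can serve every row of a table and what goes wrong when an index built for a different column order is used.

```python
field_names = ['i-id', 'i-input', 'i-wf']

# One index, computed once, can be shared by every row of the table.
field_index = {name: i for i, name in enumerate(field_names)}

rows = [
    ('10', 'It rained.', '1'),
    ('20', 'Rained it.', '0'),
]


def get(row, name, index):
    """Look up a column value by field name using the given index."""
    return row[index[name]]


first_input = get(rows[0], 'i-input', field_index)

# An index built for a different column order returns the wrong column
# without raising any error.
wrong_index = {'i-input': 0, 'i-id': 1, 'i-wf': 2}
mismatched = get(rows[0], 'i-input', wrong_index)
```

Here first_input is the expected input string, while mismatched holds the value of the i-id column instead.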

fields

The fields of the row.

data

The raw column values.

keys()[source]

Return the list of field names for the row.

Note that this returns the names of all fields, not just those with the :key flag.

Processing Test Suites

The TestSuite.process() method takes an optional FieldMapper object which manages the mapping of data in Response objects from a Processor to the tables and columns of a test suite. In most cases the user will not need to customize or instantiate these objects as the default works with standard [incr tsdb()] schemas, but FieldMapper can be subclassed in order to handle non-standard schemas, e.g., for machine translation workflows.

class delphin.itsdb.FieldMapper(source=None)[source]

A class for mapping between response objects and test suites.

If source is given, it is the test suite providing the inputs used to create the responses, and it is used to provide some contextual information that may not be present in the response.

This class provides two methods for mapping responses to fields:

  • map() – takes a response and returns a list of (table, data) tuples for the data in the response, as well as aggregating any necessary information

  • cleanup() – returns any (table, data) tuples resulting from aggregated data over all runs, then clears this data

And one method for mapping test suites to responses:

  • collect() – yield Response objects by collecting the relevant data from the test suite

In addition, the affected_tables attribute should list the names of tables that become invalidated by using this FieldMapper to process a profile. Generally this is the list of tables that map() and cleanup() create rows for, but it may also include tables that depend on those (e.g., treebanking preferences).

Alternative [incr tsdb()] schemas can be handled by overriding these three methods and the __init__() method. Note that overriding collect() is only necessary for mapping back from test suites to responses.
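The division of labor between map() and cleanup() can be sketched with a stand-in class. CountingMapper below is hypothetical and does not subclass the real FieldMapper; the response keys, the column names, and the use of dicts for row data are all assumptions made for illustration.

```python
class CountingMapper:
    """Hypothetical stand-in that follows the map()/cleanup() protocol."""

    # Tables whose existing rows would be invalidated by processing.
    affected_tables = ['parse', 'run']

    def __init__(self, source=None):
        self.source = source
        self._n_items = 0  # state aggregated across map() calls

    def map(self, response):
        # One 'parse' row per response; row data is assumed to be a dict.
        self._n_items += 1
        data = {'i-id': response['i-id'],
                'readings': len(response.get('results', []))}
        return [('parse', data)]

    def cleanup(self):
        # Emit rows built from the aggregated state, then reset that state.
        rows = [('run', {'run-id': 0, 'items': self._n_items})]
        self._n_items = 0
        return rows


mapper = CountingMapper()
rows = []
for response in ({'i-id': 10, 'results': [{}, {}]}, {'i-id': 20}):
    rows.extend(mapper.map(response))
rows.extend(mapper.cleanup())
```

Per-item rows come from map() as each response arrives, while rows that depend on the whole run are held back until cleanup().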

affected_tables

list of tables that are affected by the processing

map(response)[source]

Process response and return a list of (table, rowdata) tuples.

cleanup()[source]

Return aggregated (table, rowdata) tuples and clear the state.

collect(ts)[source]

Map from test suites to response objects.

The data in the test suite must be ordered.

Note

This method stores the ‘item’, ‘parse’, and ‘result’ tables in memory during operation, so it is not recommended when a test suite is very large as it may exhaust the system’s available memory.

Utility Functions

delphin.itsdb.match_rows(rows1, rows2, key, sort_keys=True)[source]

Yield triples of (value, left_rows, right_rows) where left_rows and right_rows are lists of rows that share the same column value for key. This means that both rows1 and rows2 must have a column with the same name key.

Warning

Both rows1 and rows2 will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.

Parameters:
  • rows1 – a Table or list of Row objects

  • rows2 – a Table or list of Row objects

  • key (str, int) – the column name or index on which to match

  • sort_keys (bool) – if True, yield matching rows sorted by the matched key instead of the original order

Yields:

tuple

a triple containing the matched value for key, the list of any matching rows from rows1, and the list of any matching rows from rows2
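The grouping behavior described above can be sketched in plain Python. The function match_rows_sketch below is an illustration of the documented semantics written for this example, not the library's implementation, and it uses dicts in place of Row objects.

```python
def match_rows_sketch(rows1, rows2, key, sort_keys=True):
    """Group rows from two sequences by their shared value for key."""
    matched = {}  # value -> (left_rows, right_rows), in first-seen order
    for side, rows in enumerate((rows1, rows2)):
        for row in rows:
            matched.setdefault(row[key], ([], []))[side].append(row)
    values = sorted(matched) if sort_keys else list(matched)
    for value in values:
        left_rows, right_rows = matched[value]
        yield value, left_rows, right_rows


items = [{'i-id': 20, 'i-input': 'It snowed.'},
         {'i-id': 10, 'i-input': 'It rained.'}]
parses = [{'i-id': 10, 'readings': 2}]

triples = list(match_rows_sketch(items, parses, 'i-id'))
```

In this sketch a value that appears on only one side still yields a triple, with an empty list for the other side, and both inputs are fully consumed into memory before anything is yielded.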

Exceptions

exception delphin.itsdb.ITSDBError(*args, **kwargs)[source]

Bases: TSDBError

Raised when there is an error processing a [incr tsdb()] profile.