delphin.repp

Regular Expression Preprocessor (REPP)

A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.

[REPP]

Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
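
To make the idea of "retaining character indices" concrete, here is a minimal sketch that uses only the standard-library re module. It is not PyDelphin's implementation: the function rewrite_with_spans and its per-character span list are hypothetical, and the mapping is deliberately coarse (every character of a replacement is attributed to the whole matched region).

```python
import re


def rewrite_with_spans(text, rules):
    """Apply (pattern, replacement) rules in order, tracking original offsets.

    Returns the rewritten string and, for each of its characters, the
    (start, end) span of the original input that it derives from.
    """
    # Initially every character maps to itself in the original string.
    spans = [(i, i + 1) for i in range(len(text))]
    for pattern, replacement in rules:
        pieces, new_spans, pos = [], [], 0
        for m in re.finditer(pattern, text):
            if m.start() == m.end():
                continue  # ignore empty matches in this sketch
            # Copy the unmatched text (and its spans) unchanged.
            pieces.append(text[pos:m.start()])
            new_spans.extend(spans[pos:m.start()])
            # Every replacement character covers the whole matched region.
            covered = (spans[m.start()][0], spans[m.end() - 1][1])
            new_text = m.expand(replacement)
            pieces.append(new_text)
            new_spans.extend([covered] * len(new_text))
            pos = m.end()
        # Copy whatever follows the last match.
        pieces.append(text[pos:])
        new_spans.extend(spans[pos:])
        text, spans = "".join(pieces), new_spans
    return text, spans


rewritten, spans = rewrite_with_spans("I don't know", [(r"don't", "do n't")])
print(rewritten)  # I do n't know
print(spans[2])   # (2, 7): the "d" of "do" derives from the original "don't"
```

Because the spans are carried through every rule, later rules can rewrite the output of earlier ones and still point back into the original input.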

Note

Requires regex (https://bitbucket.org/mrabarnett/mrab-regex/) for advanced regular expression features such as group-local inline flags. Without it, PyDelphin falls back to the re module in the standard library, which may give unexpected results. The regex library, however, will not parse unescaped brackets in character classes unless it resorts to a compatibility mode (as reported in an issue for the ERG), and PyDelphin will warn if this happens. The regex dependency is satisfied if you install PyDelphin with the [repp] extra (see Requirements, Installation, and Testing).
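
The fallback described in this note follows a common optional-dependency pattern, sketched below. This is an illustration only and not PyDelphin's actual import code; the flag name HAVE_REGEX is an assumption made for the example.

```python
# Prefer the third-party regex library when it is installed; otherwise
# fall back to the standard-library re module, which has the same basic API.
try:
    import regex as re
    HAVE_REGEX = True
except ImportError:
    import re
    HAVE_REGEX = False

# Simple patterns behave the same with either library.
pattern = re.compile(r"[ \t]+")
parts = pattern.split("a b\tc")
print(parts)  # ['a', 'b', 'c']
```

Checking a flag such as HAVE_REGEX before relying on advanced features is one way to decide whether a warning is appropriate.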

Module Constants

delphin.repp.DEFAULT_TOKENIZER = '[ \\t]+'

The tokenization pattern used if none is given in a REPP module.
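
As a quick illustration of what this pattern matches, the sketch below splits a string on runs of spaces and tabs with the standard-library re module. The constant is redefined locally with the documented value rather than imported, so the example runs without PyDelphin installed.

```python
import re

# Same value as delphin.repp.DEFAULT_TOKENIZER, redefined here for illustration.
DEFAULT_TOKENIZER = '[ \\t]+'

# One or more spaces or tabs act as a single token boundary.
tokens = re.split(DEFAULT_TOKENIZER, "The dog\t barked")
print(tokens)  # ['The', 'dog', 'barked']
```

Note that a plain re.split() yields empty strings for leading or trailing whitespace and discards character offsets, both of which a real tokenizer has to handle.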

Classes

class delphin.repp.REPP(operations=None, name=None, modules=None, active=None)[source]

A Regular Expression Pre-Processor (REPP).

The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, the from_string() classmethod parses the same definitions without requiring file I/O. Both classmethods, like the class’s __init__() method, accept pre-loaded, named external modules, which make external group calls possible (see also from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.

A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.

Parameters:
  • name (str, optional) – the name assigned to this module

  • modules (dict, optional) – a mapping from identifiers to REPP modules

  • active (iterable, optional) – an iterable of default module activations

activate(mod)[source]

Set external module mod to active.

apply(s, active=None)[source]

Apply the REPP’s rewrite rules to the input string s.

Parameters:
  • s (str) – the input string to process

  • active (optional) – a collection of external module names that may be applied if called

Returns:

a REPPResult object containing the processed string and characterization maps

deactivate(mod)[source]

Set external module mod to inactive.

classmethod from_config(path, directory=None)[source]

Instantiate a REPP from a PET-style .set configuration file.

The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

Parameters:
  • path (str) – the path to the REPP configuration file

  • directory (str, optional) – the directory in which to search for submodules

classmethod from_file(path, directory=None, modules=None, active=None)[source]

Instantiate a REPP from a .rpp file.

The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

A REPP module may utilize external submodules, which may be defined in two ways. The first method is to map a module name to an instantiated REPP instance in modules. The second method assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method only happens if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.

Parameters:
  • path (str) – the path to the base REPP file to load

  • directory (str, optional) – the directory in which to search for submodules

  • modules (dict, optional) – a mapping from identifiers to REPP modules

  • active (iterable, optional) – an iterable of default module activations
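
The two ways of resolving an external group call can be sketched as follows. The helper resolve_submodule is hypothetical and only mirrors the rules described above (pre-loaded modules take precedence, otherwise a file named after the call is looked up in directory, which defaults to the directory part of path); it does not load or parse anything.

```python
import os


def resolve_submodule(name, path, directory=None, modules=None):
    """Decide where the external group call >name would come from.

    Returns ('preloaded', module) if name is in modules, otherwise
    ('file', filename) for the .rpp file that would be loaded.
    """
    modules = modules or {}
    # First method: an already-instantiated module mapped by name.
    if name in modules:
        return ("preloaded", modules[name])
    # Second method: look for <name>.rpp next to the top-level module
    # unless a search directory was given explicitly.
    if directory is None:
        directory = os.path.dirname(path)
    return ("file", os.path.join(directory, name + ".rpp"))


top_level = os.path.join("grammar", "rpp", "tokenizer.rpp")
kind, target = resolve_submodule("abc", top_level)
print(kind, target)
```

Passing modules={"abc": some_module} to the same helper would return the pre-loaded object and never consult the file system.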

classmethod from_string(s, name=None, modules=None, active=None)[source]

Instantiate a REPP from a string.

Parameters:
  • s (str) – the string containing the REPP definitions

  • name (str, optional) – the name of the REPP module

  • modules (dict, optional) – a mapping from identifiers to REPP modules

  • active (iterable, optional) – an iterable of default module activations

tokenize(s, pattern=None, active=None)[source]

Rewrite and tokenize the input string s.

Parameters:
  • s (str) – the input string to process

  • pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+

  • active (optional) – a collection of external module names that may be applied if called

Returns:

a YYTokenLattice containing the tokens and their characterization information

tokenize_result(result, pattern='[ \\t]+')[source]

Tokenize the result of rule application.

Parameters:
  • result – a REPPResult object

  • pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+

Returns:

a YYTokenLattice containing the tokens and their characterization information
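
The following sketch shows the general idea of tokenizing an already rewritten string while keeping offsets into the original input. It is an assumption-laden illustration, not the library's code: tokenize_with_spans is a hypothetical helper, it takes absolute per-character (start, end) spans rather than the startmap and endmap arrays of a REPPResult, and it returns plain tuples instead of a YYTokenLattice.

```python
import re


def tokenize_with_spans(string, spans, pattern=r"[ \t]+"):
    """Split string on pattern; return (form, start, end) per token.

    spans[i] is the (start, end) range of the original input that
    character i of the rewritten string derives from.
    """
    tokens, pos = [], 0
    # Separator matches, plus a sentinel so the final token is emitted.
    boundaries = [m.span() for m in re.finditer(pattern, string)]
    boundaries.append((len(string), len(string)))
    for sep_start, sep_end in boundaries:
        if sep_start > pos:  # skip empty tokens
            start = spans[pos][0]          # start of the first character
            end = spans[sep_start - 1][1]  # end of the last character
            tokens.append((string[pos:sep_start], start, end))
        pos = sep_end
    return tokens


# Suppose the original input "don't" was rewritten to "do n't" and each
# output character is attributed to the whole original word (0 to 5).
rewritten = "do n't"
spans = [(0, 5)] * len(rewritten)
tokens = tokenize_with_spans(rewritten, spans)
print(tokens)  # [('do', 0, 5), ("n't", 0, 5)]
```

Both tokens point back to the same original region because the rewrite inserted the space; a finer-grained mapping could assign them narrower spans.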

trace(s, active=None, verbose=False)[source]

Rewrite string s like apply(), but yield each rewrite step.

Parameters:
  • s (str) – the input string to process

  • active (optional) – a collection of external module names that may be applied if called

  • verbose (bool, optional) – if False, only output rules or groups that matched the input

Yields:
a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
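
A generator that yields intermediate steps followed by one final result can be consumed as in the sketch below. Everything here is a stand-in: Step, Result, and trace_rules are hypothetical, and plain str.replace() substitutes for real REPP rules. Only the yield-steps-then-result shape and the effect of verbose are meant to mirror the description above.

```python
from collections import namedtuple

# Hypothetical stand-ins for the step and result objects.
Step = namedtuple("Step", "input output applied")
Result = namedtuple("Result", "string")


def trace_rules(s, rules, verbose=False):
    """Yield a Step per rule (only applied ones unless verbose), then a Result."""
    for old, new in rules:
        out = s.replace(old, new)
        applied = out != s
        if applied or verbose:
            yield Step(s, out, applied)
        s = out
    yield Result(s)


rules = [("cannot", "can not"), ("don't", "do n't")]
events = list(trace_rules("I don't know", rules))

# Separate the intermediate steps from the final result by type.
steps = [e for e in events if isinstance(e, Step)]
final = events[-1]
print(len(steps), final.string)  # 1 I do n't know
```

With verbose=True the rule that did not match is reported as well, with applied set to False.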

class delphin.repp.REPPResult(string, startmap, endmap)[source]

The final result of REPP application.

string

resulting string after all rules have applied

Type:

str

startmap

integer array of start offsets

Type:

array

endmap

integer array of end offsets

Type:

array

endmap

Alias for field number 2

startmap

Alias for field number 1

string

Alias for field number 0
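
The "Alias for field number" entries indicate that the fields are also available by position, as with a named tuple. The sketch below uses a hypothetical stand-in class with the same field names to show both access styles; the array contents are placeholders, not real REPP output.

```python
from array import array
from collections import namedtuple

# Stand-in with the same field order as documented above.
Result = namedtuple("Result", ("string", "startmap", "endmap"))

# Placeholder integer arrays; real values would come from rule application.
result = Result("abc", array("i", [0, 0, 0]), array("i", [0, 0, 0]))

# Access by name, by position, or by unpacking.
by_name = result.string
by_index = result[0]
string, startmap, endmap = result
print(by_name == by_index == string)  # True
```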

class delphin.repp.REPPStep(input, output, operation, applied, startmap, endmap)[source]

A single rule application in REPP.

input

input string (prior to application)

Type:

str

output

output string (after application)

Type:

str

operation

operation performed

Type:

delphin.repp._REPPOperation

applied

True if the rule was applied

Type:

bool

startmap

integer array of start offsets

Type:

array

endmap

integer array of end offsets

Type:

array

mask

integer array of mask indicators

Type:

array

applied

Alias for field number 3

endmap

Alias for field number 5

input

Alias for field number 0

mask

Alias for field number 6

operation

Alias for field number 2

output

Alias for field number 1

startmap

Alias for field number 4

Exceptions

exception delphin.repp.REPPError(*args, **kwargs)[source]

Bases: PyDelphinException

Raised when there is an error in tokenizing with REPP.

exception delphin.repp.REPPWarning(*args, **kwargs)[source]

Bases: PyDelphinWarning

Issued when REPP may not behave as expected.