delphin.repp

Regular Expression Preprocessor (REPP)

A Regular Expression Preprocessor [REPP] applies a system of regular-expression rewrite rules to transform and tokenize text while retaining the character indices of the original input string.

[REPP] Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
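
To make "retaining character indices" concrete, here is a minimal sketch in plain Python. It is not PyDelphin's implementation. It applies one regular-expression substitution while recording, for each character of the rewritten string, the span of the original input it derives from. The function name rewrite_with_spans and the list-of-spans representation are assumptions made for illustration only.

```python
import re


def rewrite_with_spans(text, pattern, replacement, spans=None):
    """Apply one substitution and track original character spans.

    spans[i] is the (start, end) span of the ORIGINAL input that
    character i of `text` derives from. (Hypothetical helper.)
    """
    if spans is None:
        # Before any rewriting, each character maps to itself.
        spans = [(i, i + 1) for i in range(len(text))]
    out_chars, out_spans = [], []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # simplification: ignore empty matches
        # Copy the unmatched stretch unchanged, with its spans.
        out_chars.extend(text[pos:m.start()])
        out_spans.extend(spans[pos:m.start()])
        # Every replacement character inherits the span of the whole match.
        new = m.expand(replacement)
        orig = (spans[m.start()][0], spans[m.end() - 1][1])
        out_chars.extend(new)
        out_spans.extend([orig] * len(new))
        pos = m.end()
    out_chars.extend(text[pos:])
    out_spans.extend(spans[pos:])
    return "".join(out_chars), out_spans


# Split off the clitic "n't" by inserting a space before it.
text, spans = rewrite_with_spans("don't stop", r"n't", " n't")
print(text)      # do n't stop
print(spans[3])  # (2, 5): the "n" of "n't" comes from original span 2-5
```

Passing the returned spans into a further call chains rewrites, so the final string can still be related to offsets in the original input.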
class delphin.repp.REPP(name=None, modules=None, active=None)

A Regular Expression Pre-Processor (REPP).

The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both classmethods, like the class’s __init__() method, accept pre-loaded, named external modules via modules, which makes external group calls to those modules possible (also see from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.

A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.

Parameters:
  • name (str, optional) – the name assigned to this module
  • modules (dict, optional) – a mapping from identifiers to REPP modules
  • active (iterable, optional) – an iterable of default module activations
activate(mod)

Set external module mod to active.

apply(s, active=None)

Apply the REPP’s rewrite rules to the input string s.

Parameters:
  • s (str) – the input string to process
  • active (optional) – a collection of external module names that may be applied if called
Returns:

a REPPResult object containing the processed string and characterization maps

deactivate(mod)

Set external module mod to inactive.

classmethod from_config(path, directory=None)

Instantiate a REPP from a PET-style .set configuration file.

The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

Parameters:
  • path (str) – the path to the REPP configuration file
  • directory (str, optional) – the directory in which to search for submodules
classmethod from_file(path, directory=None, modules=None, active=None)

Instantiate a REPP from a .rpp file.

The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

A REPP module may use external submodules, which can be supplied in two ways. The first is to map a module name to an instantiated REPP instance in modules. The second assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file; it is used only if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
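
The lookup order described above can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the library's code: resolve_submodule is a hypothetical helper, and the paths in the usage lines are made up.

```python
import os


def resolve_submodule(name, path, directory=None, modules=None):
    """Decide where the module for an external group call >name comes from.

    Returns ("preloaded", module) if `name` is in `modules`; otherwise
    ("file", filepath) pointing at <directory>/<name>.rpp.
    (Hypothetical helper for illustration.)
    """
    modules = modules or {}
    if directory is None:
        # Default: the directory part of the top-level module's path.
        directory = os.path.dirname(path)
    if name in modules:
        # First method: an explicitly provided module takes precedence.
        return ("preloaded", modules[name])
    # Second method: fall back to a file named after the module.
    return ("file", os.path.join(directory, name + ".rpp"))


top = os.path.join("grammar", "rpp", "tokenizer.rpp")  # made-up path
print(resolve_submodule("abc", top))
print(resolve_submodule("abc", top, modules={"abc": "<a REPP instance>"}))
```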

Parameters:
  • path (str) – the path to the base REPP file to load
  • directory (str, optional) – the directory in which to search for submodules
  • modules (dict, optional) – a mapping from identifiers to REPP modules
  • active (iterable, optional) – an iterable of default module activations
classmethod from_string(s, name=None, modules=None, active=None)

Instantiate a REPP from a string.

Parameters:
  • s (str) – the REPP definitions to parse
  • name (str, optional) – the name of the REPP module
  • modules (dict, optional) – a mapping from identifiers to REPP modules
  • active (iterable, optional) – an iterable of default module activations
tokenize(s, pattern=None, active=None)

Rewrite and tokenize the input string s.

Parameters:
  • s (str) – the input string to process
  • pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+ (one or more spaces or tabs)
  • active (optional) – a collection of external module names that may be applied if called
Returns:

a YyTokenLattice containing the tokens and their characterization information
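
As a rough illustration of the splitting step only, the sketch below divides a string on a separator pattern and keeps each token's character span. It is plain Python, not the library's tokenizer: split_with_spans is a hypothetical name, the default pattern of spaces and tabs is an assumption, and the spans here refer to the string being split, whereas tokenize() reports characterization relative to the original input.

```python
import re


def split_with_spans(s, pattern=r"[ \t]+"):
    """Split s on `pattern`, returning (form, start, end) triples.

    (Hypothetical helper; spans are offsets into `s` itself.)
    """
    tokens = []
    pos = 0
    for m in re.finditer(pattern, s):
        if m.start() > pos:
            # Text between two separators is one token.
            tokens.append((s[pos:m.start()], pos, m.start()))
        pos = m.end()
    if pos < len(s):
        # Trailing token after the last separator.
        tokens.append((s[pos:], pos, len(s)))
    return tokens


for form, start, end in split_with_spans("The dog  barks"):
    print(form, start, end)
# The 0 3
# dog 4 7
# barks 9 14
```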

trace(s, active=None, verbose=False)

Rewrite string s like apply(), but yield each rewrite step.

Parameters:
  • s (str) – the input string to process
  • active (optional) – a collection of external module names that may be applied if called
  • verbose (bool, optional) – if False, only output rules or groups that matched the input
Yields:

a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
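
The "steps first, result last" shape of such a generator can be sketched in plain Python. The names Step, Result, and trace_rules are hypothetical stand-ins, not the library's classes; the rules are made up, and the start and end maps are omitted to keep the sketch short.

```python
import re
from collections import namedtuple

# Simplified stand-ins for the step and result records (no offset maps).
Step = namedtuple("Step", "input output operation applied")
Result = namedtuple("Result", "string")


def trace_rules(s, rules, verbose=False):
    """Yield a Step per rule, then a final Result.

    With verbose=False, rules that did not match are not reported.
    """
    for pattern, replacement in rules:
        out, count = re.subn(pattern, replacement, s)
        applied = count > 0
        if applied or verbose:
            yield Step(s, out, (pattern, replacement), applied)
        s = out
    yield Result(s)


rules = [(r"\s+", " "), (r"colour", "color"), (r"!+", "!")]

# Everything before the last item is a step; the last item is the result.
*steps, result = trace_rules("a  colour", rules)
for step in steps:
    print(repr(step.input), "->", repr(step.output))
print(result.string)  # a color
```

A consumer that only wants the final string can ignore everything but the last yielded item, which is what the starred assignment does here.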

class delphin.repp.REPPResult

The final result of REPP application.

string

resulting string after all rules have applied

Type: str
startmap

integer array of start offsets

Type: array
endmap

integer array of end offsets

Type: array
class delphin.repp.REPPStep

A single rule application in REPP.

input

input string (prior to application)

Type: str
output

output string (after application)

Type: str
operation

operation performed

applied

True if the rule was applied

Type: bool
startmap

integer array of start offsets

Type: array
endmap

integer array of end offsets

Type: array