delphin.repp

Regular Expression Preprocessor (REPP)

A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.

[REPP]

Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
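To make "retaining character indices" concrete, here is a minimal standard-library sketch of a single regex rewrite that records, for each output character, the span of the original string it came from. It does not use delphin.repp and is not its implementation; the function name rewrite_with_offsets and the span representation are assumptions made for illustration.

```python
import re


def rewrite_with_offsets(text, pattern, replacement):
    """Apply one regex substitution and track original character spans.

    Returns (new_text, spans), where spans[i] is the (start, end) range
    in the original text that character i of new_text is attributed to.
    """
    out = []
    spans = []
    pos = 0
    for m in re.finditer(pattern, text):
        # Characters outside a match are copied one-to-one.
        for i in range(pos, m.start()):
            out.append(text[i])
            spans.append((i, i + 1))
        # Every replacement character is attributed to the whole match.
        for ch in m.expand(replacement):
            out.append(ch)
            spans.append((m.start(), m.end()))
        pos = m.end()
    # Copy whatever follows the last match.
    for i in range(pos, len(text)):
        out.append(text[i])
        spans.append((i, i + 1))
    return "".join(out), spans


new_text, spans = rewrite_with_offsets("I don't go", r"n't", " not")
```

Here the four characters of " not" all point back to the original span (4, 7), so a token built from the rewritten string can still be located in the input.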

class delphin.repp.REPP(name=None, modules=None, active=None)

A Regular Expression Pre-Processor (REPP).

The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both classmethods, like the class’s __init__() method, accept pre-loaded, named external modules, which make external group calls possible (see also from_file() for implicit module loading). By default, all external submodules are deactivated; they can be activated by adding their names to active or, later, via the activate() method.

A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.

Parameters:

name (str, optional) – the name assigned to this module
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations

activate(mod)

Set external module mod to active.

apply(s, active=None)

Apply the REPP’s rewrite rules to the input string s.

Parameters:

s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called

Returns:

a REPPResult object containing the processed string and characterization maps

deactivate(mod)

Set external module mod to inactive.

classmethod from_config(path, directory=None)

Instantiate a REPP from a PET-style .set configuration file.

The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

Parameters:

path (str) – the path to the REPP configuration file
directory (str, optional) – the directory in which to search for submodules

classmethod from_file(path, directory=None, modules=None, active=None)

Instantiate a REPP from a .rpp file.

The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

A REPP module may use external submodules, which can be provided in two ways. The first is to map a module name to a REPP instance in modules. The second assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file; it is used only if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.

Parameters:

path (str) – the path to the base REPP file to load
directory (str, optional) – the directory in which to search for submodules
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations
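The lookup order described for from_file() can be sketched with the standard library alone. This is an illustration of the documented behaviour, not the library's code: the helper names default_directory and resolve_external and the returned tuples are hypothetical.

```python
from pathlib import Path


def default_directory(path, directory=None):
    """Use the given directory, else the directory part of path."""
    return Path(directory) if directory is not None else Path(path).parent


def resolve_external(name, modules, directory):
    """Decide where the module behind an external group call comes from.

    A name found in the modules mapping wins; otherwise the call >name
    is assumed to refer to the file name.rpp inside directory.
    """
    if name in modules:
        return ("preloaded", modules[name])
    return ("file", Path(directory) / f"{name}.rpp")


# Hypothetical paths and module names, for illustration only.
directory = default_directory("erg/rpp/tokenizer.rpp")
from_mapping = resolve_external("abc", {"abc": "<REPP instance>"}, directory)
from_disk = resolve_external("abc", {}, directory)
```

Checking the mapping first means a pre-loaded module always takes precedence over a file of the same name on disk.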

classmethod from_string(s, name=None, modules=None, active=None)

Instantiate a REPP from a string.

Parameters:

name (str, optional) – the name of the REPP module
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations

tokenize(s, pattern=None, active=None)

Rewrite and tokenize the input string s.

Parameters:

s (str) – the input string to process
pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to `[ ]+`
active (optional) – a collection of external module names that may be applied if called

Returns:

a `YyTokenLattice` containing the tokens and their characterization information
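As a rough picture of what splitting on a pattern such as `[ ]+` involves, the following standard-library sketch splits a string on the pattern and keeps each token's character span. It is not how tokenize() is implemented and does not produce a `YyTokenLattice`; the function name split_with_spans and the tuple layout are assumptions. The spans refer to the string being split, so in a REPP setting they would still need to be mapped back to the original input.

```python
import re


def split_with_spans(text, pattern=r"[ ]+"):
    """Split text on pattern, returning (start, end, token) triples."""
    tokens = []
    pos = 0
    for m in re.finditer(pattern, text):
        # Skip empty tokens produced by leading or adjacent separators.
        if m.start() > pos:
            tokens.append((pos, m.start(), text[pos:m.start()]))
        pos = m.end()
    # Keep the final token if the text does not end in a separator.
    if pos < len(text):
        tokens.append((pos, len(text), text[pos:]))
    return tokens


tokens = split_with_spans("Dogs  bark .")
```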

trace(s, active=None, verbose=False)

Rewrite string s like apply(), but yield each rewrite step.

Parameters:

s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called
verbose (bool, optional) – if `False`, only output rules or groups that matched the input

Yields:

a `REPPStep` object for each intermediate rewrite step, and finally a `REPPResult` object after the last rewrite
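The yield-each-step-then-the-result shape described for trace() can be mimicked with a small generator. This is a simplified standard-library sketch, not the library's code: Step and Result are hypothetical stand-ins that omit the characterization maps, and rules are assumed to be plain (pattern, replacement) pairs.

```python
import re
from typing import NamedTuple


class Step(NamedTuple):
    input: str       # string before the rule
    output: str      # string after the rule
    operation: str   # label for the rule (format is illustrative)
    applied: bool    # True if the rule matched


class Result(NamedTuple):
    string: str      # string after all rules


def trace_rules(s, rules, verbose=False):
    """Yield a Step per rule, then a final Result.

    With verbose=False, rules that did not match are not reported.
    """
    for pattern, replacement in rules:
        out, count = re.subn(pattern, replacement, s)
        applied = count > 0
        if applied or verbose:
            yield Step(s, out, f"!{pattern} -> {replacement}", applied)
        s = out
    yield Result(s)


rules = [(r"\s+", " "), (r"xyz", "q"), (r"n't", " not")]
steps = list(trace_rules("I  don't", rules))
```

A caller can tell the final item apart from the intermediate ones by type, which mirrors the distinction between `REPPStep` and `REPPResult` in the description above.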

class delphin.repp.REPPResult

The final result of REPP application.

string

resulting string after all rules have applied

Type:	str

startmap

integer array of start offsets

Type:	`array`

endmap

integer array of end offsets

Type:	`array`

class delphin.repp.REPPStep

A single rule application in REPP.

input

input string (prior to application)

Type:	str

output

output string (after application)

Type:	str

operation

operation performed

applied

True if the rule was applied

Type:	bool

startmap

integer array of start offsets

Type:	`array`

endmap

integer array of end offsets

Type:	`array`