delphin.repp

Regular Expression Preprocessor (REPP)

A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.

[REPP]

Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.

Note

Requires the regex library (https://bitbucket.org/mrabarnett/mrab-regex/) for advanced regular expression features such as group-local inline flags. Without it, PyDelphin falls back to the re module in the standard library, which may give unexpected results. The regex library, however, will not parse unescaped brackets in character classes without resorting to a compatibility mode (see this issue for the ERG), and PyDelphin will warn if this happens. The regex dependency is satisfied if you install PyDelphin with the [repp] extra (see Requirements, Installation, and Testing).
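The fallback described in this note follows a common optional-dependency pattern. The sketch below only illustrates that pattern and is not PyDelphin's actual import logic; the names re_engine and HAVE_REGEX are hypothetical.

```python
# Prefer the third-party regex library; fall back to the standard library.
# `re_engine` and `HAVE_REGEX` are illustrative names, not PyDelphin's.
try:
    import regex as re_engine
    HAVE_REGEX = True
except ImportError:
    import re as re_engine
    HAVE_REGEX = False

# Both modules expose the same basic compile/split interface.
splitter = re_engine.compile(r'[ \t]+')
parts = splitter.split('one  two\tthree')
```

Code written against the shared subset of the two modules behaves the same either way; only patterns that rely on regex-specific features are affected by the fallback.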

Module Constants

delphin.repp.DEFAULT_TOKENIZER = '[ \\t]+': The tokenization pattern used if none is given in a REPP module.
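To show what splitting on this pattern looks like, here is a minimal sketch using only the standard re module. The helper split_with_spans is hypothetical and is not part of delphin.repp; it simply treats runs of spaces and tabs as separators and records each token's character span.

```python
import re

# Same value as the documented constant (a raw string for readability).
DEFAULT_TOKENIZER = r'[ \t]+'

def split_with_spans(text, pattern=DEFAULT_TOKENIZER):
    """Return (form, start, end) for each stretch of text between separators."""
    tokens = []
    pos = 0
    for m in re.finditer(pattern, text):
        if pos < m.start():  # skip empty tokens at the edges
            tokens.append((text[pos:m.start()], pos, m.start()))
        pos = m.end()
    if pos < len(text):  # trailing token after the last separator
        tokens.append((text[pos:], pos, len(text)))
    return tokens

tokens = split_with_spans("The dog\t barked.")
```

Note that punctuation stays attached to its word ("barked."), since the pattern only matches spaces and tabs.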

Classes

class delphin.repp.REPP(operations: list[_REPPOperation] | None = None, name: str | None = None, modules: dict[str, REPP] | None = None, active: Iterable[str] | None = None)

A Regular Expression Pre-Processor (REPP).

The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both classmethods, like the class’s __init__() method, accept pre-loaded, named external modules, which enable external group calls (also see from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.

A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.

Parameters:

name (str, optional) – the name assigned to this module
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations

activate(mod: str) → None: Set external module mod to active.

apply(s: str, active: Iterable[str] | None = None) → REPPResult

Apply the REPP’s rewrite rules to the input string s.

Parameters:

s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called

Returns:

a REPPResult object containing the processed string and characterization maps
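The idea of rewriting a string while keeping track of the original character positions can be sketched with the standard library alone. The function rewrite_with_offsets below is hypothetical, and it assumes the maps are plain lists of absolute input indices; the actual encoding of startmap and endmap in REPPResult may differ.

```python
import re

def rewrite_with_offsets(pattern, replacement, text):
    """Apply one substitution and track where each output character came from.

    Returns (new_text, starts, ends), where text[starts[i]:ends[i]] is the
    input span that produced character i of new_text.
    """
    out, starts, ends = [], [], []

    def copy_unchanged(lo, hi):
        # Unmatched characters map one-to-one onto the input.
        for i in range(lo, hi):
            out.append(text[i])
            starts.append(i)
            ends.append(i + 1)

    pos = 0
    for m in re.finditer(pattern, text):
        copy_unchanged(pos, m.start())
        # Simplification: every replacement character points at the whole match.
        for ch in m.expand(replacement):
            out.append(ch)
            starts.append(m.start())
            ends.append(m.end())
        pos = m.end()
    copy_unchanged(pos, len(text))
    return ''.join(out), starts, ends

text = "I don't know"
new_text, starts, ends = rewrite_with_offsets(r"n't\b", " n't", text)
```

Here the rule inserts a space before the clitic, yet the span recovered for the rewritten "n't" still points at characters 4 to 7 of the input.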

deactivate(mod: str) → None: Set external module mod to inactive.

classmethod from_config(path: str | Path, directory=None)

Instantiate a REPP from a PET-style .set configuration file.

The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

Parameters:

path (str) – the path to the REPP configuration file
directory (str, optional) – the directory in which to search for submodules

classmethod from_file(path, directory=None, modules=None, active=None)

Instantiate a REPP from a .rpp file.

The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

A REPP module may use external submodules, which may be defined in two ways. The first method is to map a module name to an instantiated REPP instance in modules. The second method assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method is used only if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
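The lookup order described above can be sketched as follows. This is an illustration of the documented rules under stated assumptions, not PyDelphin's code: the helpers default_directory and resolve_module and the paths used are hypothetical.

```python
from pathlib import Path

def default_directory(path, directory=None):
    """Use the given directory, else the directory part of the module path."""
    return Path(directory) if directory is not None else Path(path).parent

def resolve_module(name, modules, directory):
    """Decide where the external group call >name would come from."""
    if name in modules:
        # First method: a pre-loaded module takes precedence.
        return ('preloaded', modules[name])
    # Second method: assume a file named <name>.rpp in the directory.
    return ('file', Path(directory) / f'{name}.rpp')

# Hypothetical layout; 'lgt' stands in for an already instantiated module.
directory = default_directory('grammar/rpp/tokenizer.rpp')
modules = {'lgt': object()}
from_modules = resolve_module('lgt', modules, directory)
from_disk = resolve_module('abc', modules, directory)
```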

Parameters:

path (str) – the path to the base REPP file to load
directory (str, optional) – the directory in which to search for submodules
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations

classmethod from_string(s, name=None, modules=None, active=None)

Instantiate a REPP from a string.

Parameters:

s (str) – the string containing the REPP definitions
name (str, optional) – the name of the REPP module
modules (dict, optional) – a mapping from identifiers to REPP modules
active (iterable, optional) – an iterable of default module activations

tokenize(s: str, pattern: str | None = None, active: Iterable[str] | None = None) → YYTokenLattice

Rewrite and tokenize the input string s.

Parameters:

s (str) – the input string to process
pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+
active (optional) – a collection of external module names that may be applied if called

Returns:

a YYTokenLattice containing the tokens and their characterization information

tokenize_result(result: REPPResult, pattern: str = '[ \\t]+') → YYTokenLattice

Tokenize the result of rule application.

Parameters:

result – a REPPResult object
pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+

Returns:

a YYTokenLattice containing the tokens and their characterization information
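As a rough picture of what tokenizing an already rewritten string involves, the sketch below splits the rewritten string and projects each token back onto the original input. It assumes the maps are lists of absolute input indices, which may not match the encoding PyDelphin uses internally, and project_tokens is a hypothetical helper that returns plain tuples rather than a YYTokenLattice.

```python
import re

def project_tokens(string, starts, ends, pattern=r'[ \t]+'):
    """Split *string* on *pattern* and report each token's original span."""
    tokens = []
    pos = 0
    boundaries = [m.span() for m in re.finditer(pattern, string)]
    # A sentinel boundary at the end lets the loop emit the last token too.
    boundaries.append((len(string), len(string)))
    for sep_start, sep_end in boundaries:
        if pos < sep_start:
            form = string[pos:sep_start]
            # Original span: start of first character, end of last character.
            tokens.append((form, starts[pos], ends[sep_start - 1]))
        pos = sep_end
    return tokens

original = "I don't know"
rewritten = "I do n't know"   # after a rule inserted a space before "n't"
starts = [0, 1, 2, 3, 4, 4, 4, 4, 7, 8, 9, 10, 11]
ends = [1, 2, 3, 4, 7, 7, 7, 7, 8, 9, 10, 11, 12]
tokens = project_tokens(rewritten, starts, ends)
```

Each reported span indexes the original string, so original[4:7] recovers "n't" even though that token sits at a different position in the rewritten string.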

trace(s: str, active: Iterable[str] | None = None, verbose: bool = False) → Iterator[REPPStep | REPPResult]

Rewrite string s like apply(), but yield each rewrite step.

Parameters:

s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called
verbose (bool, optional) – if False, only output rules or groups that matched the input

Yields:

a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
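The yield-each-step-then-the-result shape of this method can be mimicked with a small generator. The sketch below uses stand-in Step and Result tuples and a list of (pattern, replacement) pairs, all hypothetical; it also treats a rule as applied only when it changes the string, which is a simplification.

```python
import re
from collections import namedtuple

# Stand-ins for illustration; these are not the delphin.repp classes.
Step = namedtuple('Step', 'input output operation applied')
Result = namedtuple('Result', 'string')

def trace(text, rules, verbose=False):
    """Yield a Step per rule (only changed ones unless verbose), then a Result."""
    for pattern, replacement in rules:
        new_text = re.sub(pattern, replacement, text)
        applied = new_text != text
        if applied or verbose:
            yield Step(text, new_text, (pattern, replacement), applied)
        text = new_text
    yield Result(text)

rules = [(r'colour', 'color'), (r'xyz', 'abc'), (r'\s+', ' ')]
events = list(trace('my  colour', rules))
events_verbose = list(trace('my  colour', rules, verbose=True))
```

A caller can tell the two kinds of item apart by type, for example with isinstance(event, Result) on the last item.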

class delphin.repp.REPPResult(string, startmap, endmap)

The final result of REPP application.

string

resulting string after all rules have applied

Type: str

startmap

integer array of start offsets

Type: array

endmap

integer array of end offsets

Type: array

endmap: array: Alias for field number 2

startmap: array: Alias for field number 1

string: str: Alias for field number 0
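The "alias for field number" entries reflect named-tuple behavior: each field is reachable by name or by position. The sketch below uses a stand-in tuple with the same field order rather than the real class, and the map values are placeholders with no particular meaning.

```python
from array import array
from collections import namedtuple

# Stand-in with the same field order as the documented class.
Result = namedtuple('Result', ('string', 'startmap', 'endmap'))

# Placeholder maps; real values come from rule application.
r = Result('abc', array('i', [0, 0, 0]), array('i', [0, 0, 0]))

by_name = r.string
by_index = r[0]
string, startmap, endmap = r   # tuple unpacking follows the field order
```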

class delphin.repp.REPPStep(input, output, operation, applied, startmap, endmap)

A single rule application in REPP.

input

input string (prior to application)

Type: str

output

output string (after application)

Type: str

operation

operation performed

Type: delphin.repp._REPPOperation

applied

True if the rule was applied

Type: bool

startmap

integer array of start offsets

Type: array

endmap

integer array of end offsets

Type: array

mask

integer array of mask indicators

Type: array

applied: bool: Alias for field number 3

endmap: array: Alias for field number 5

input: str: Alias for field number 0

mask: array: Alias for field number 6

operation: _REPPOperation: Alias for field number 2

output: str: Alias for field number 1

startmap: array: Alias for field number 4

Exceptions

exception delphin.repp.REPPError(*args, **kwargs)

Bases: PyDelphinException

Raised when there is an error in tokenizing with REPP.

exception delphin.repp.REPPWarning(*args, **kwargs)

Bases: PyDelphinWarning

Issued when REPP may not behave as expected.