delphin.repp
Regular Expression Preprocessor (REPP)
A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.
Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
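To make the idea of "retaining character indices" concrete, here is a minimal sketch that uses only the standard library. It is not PyDelphin's implementation. The helpers `rewrite_with_offsets` and `tokenize_with_offsets` are hypothetical names, and the bookkeeping is deliberately coarse: every character produced by a replacement inherits the span of the whole match.

```python
import re


def rewrite_with_offsets(text, pattern, replacement, spans=None):
    """Apply one substitution and track original offsets.

    Returns the rewritten text and, for every output character, a
    (start, end) span into the original input.
    """
    if spans is None:
        spans = [(i, i + 1) for i in range(len(text))]
    out, out_spans, pos = [], [], 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # simplification: ignore zero-width matches
        # Copy the unchanged stretch before the match, spans included.
        out.append(text[pos:m.start()])
        out_spans.extend(spans[pos:m.start()])
        # Every replacement character maps back to the matched span.
        new = m.expand(replacement)
        span = (spans[m.start()][0], spans[m.end() - 1][1])
        out.append(new)
        out_spans.extend([span] * len(new))
        pos = m.end()
    out.append(text[pos:])
    out_spans.extend(spans[pos:])
    return "".join(out), out_spans


def tokenize_with_offsets(text, spans, pattern=r"[ \t]+"):
    """Split on *pattern*; report each token with its original span."""
    tokens, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            tokens.append(
                (text[pos:m.start()], spans[pos][0], spans[m.start() - 1][1])
            )
        pos = m.end()
    if pos < len(text):
        tokens.append((text[pos:], spans[pos][0], spans[-1][1]))
    return tokens


original = "Stop!  Now"
rewritten, spans = rewrite_with_offsets(original, r"!", " !")
tokens = tokenize_with_offsets(rewritten, spans)
for form, start, end in tokens:
    # The offsets index the original string, not the rewritten one.
    print(form, start, end, repr(original[start:end]))
```

Although the rewrite inserts a space before `!`, each token's offsets still select the matching text in the original input.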
Note
Requires regex (https://bitbucket.org/mrabarnett/mrab-regex/)
for advanced regular expression features such as group-local inline
flags. Without it, PyDelphin will fall back to the re
module in the standard library, which may give some unexpected
results. The regex library, however, will not parse unescaped
brackets in character classes without resorting to a compatibility
mode (see this issue for the ERG), and PyDelphin will warn if
this happens. The regex dependency is satisfied if you install
PyDelphin with the [repp] extra (see Requirements, Installation, and Testing).
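The fallback described in this note follows a common optional-dependency pattern. The sketch below illustrates that pattern and is not necessarily PyDelphin's exact code. The `USING_REGEX` flag is a hypothetical name introduced here. The sample pattern uses a group-local inline flag; this particular construct is also accepted by the standard `re` module in Python 3.6 and later, so it runs with either library.

```python
try:
    import regex as re  # third-party library, if installed
    USING_REGEX = True
except ImportError:
    import re  # standard-library fallback
    USING_REGEX = False

# A group-local (scoped) inline flag: only "ab" is case-insensitive.
pattern = re.compile(r"(?i:ab)c")

print(USING_REGEX)
print(bool(pattern.fullmatch("ABc")))
print(bool(pattern.fullmatch("ABC")))
```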
Module Constants
- delphin.repp.DEFAULT_TOKENIZER = '[ \\t]+'
The tokenization pattern used if none is given in a REPP module.
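As a small illustration, the value shown above is the Python repr of the pattern `[ \t]+`, that is, one or more spaces or tabs. The sketch below redefines the constant locally rather than importing it, so it runs without PyDelphin, and uses plain `re.split`, which is a simplification of what a REPP does.

```python
import re

# Local copy of the documented value (an assumption made so the
# example is self-contained); identical to the raw string r'[ \t]+'.
DEFAULT_TOKENIZER = '[ \\t]+'
same_pattern = DEFAULT_TOKENIZER == r'[ \t]+'
print(same_pattern)

# Runs of spaces and tabs separate tokens.
tokens = re.split(DEFAULT_TOKENIZER, "the  quick\tbrown fox")
print(tokens)

# Newlines are not in the character class, so they stay inside a token.
with_newline = re.split(DEFAULT_TOKENIZER, "one\ntwo three")
print(with_newline)
```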
Classes
- class delphin.repp.REPP(operations: list[_REPPOperation] | None = None, name: str | None = None, modules: dict[str, REPP] | None = None, active: Iterable[str] | None = None)
A Regular Expression Pre-Processor (REPP).
The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both methods, like the class’s __init__() method, accept pre-loaded, named external modules, which make external group calls possible (see also from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.
A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.
- Parameters:
- apply(s: str, active: Iterable[str] | None = None) -> REPPResult
Apply the REPP’s rewrite rules to the input string s.
- Parameters:
s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called
- Returns:
a REPPResult object containing the processed string and characterization maps
- classmethod from_config(path: str | Path, directory=None)
Instantiate a REPP from a PET-style .set configuration file.
The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.
- classmethod from_file(path, directory=None, modules=None, active=None)
Instantiate a REPP from a .rpp file.
The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.
A REPP module may utilize external submodules, which may be defined in two ways. The first method is to map a module name to an instantiated REPP instance in modules. The second method assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method is used only if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
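The lookup order for external submodules can be sketched as follows. This is an illustration of the documented behavior, not PyDelphin's code: `resolve_module` is a hypothetical helper, and a plain string stands in for a real REPP instance.

```python
from pathlib import Path


def resolve_module(name, modules, path, directory=None):
    """Return a pre-loaded module, or the path of the file to load.

    Hypothetical helper mirroring the documented lookup order.
    """
    if directory is None:
        directory = Path(path).parent  # the directory part of path
    if name in modules:
        return modules[name]  # method 1: instance supplied in modules
    return Path(directory) / f"{name}.rpp"  # method 2: implicit file


# Stand-in for an instantiated REPP object (assumption for the demo).
preloaded = {"abc": "<REPP instance for abc>"}

from_modules = resolve_module("abc", preloaded, "grammar/rpp/top.rpp")
from_default_dir = resolve_module("xyz", preloaded, "grammar/rpp/top.rpp")
from_given_dir = resolve_module(
    "xyz", preloaded, "grammar/rpp/top.rpp", directory="mods"
)
print(from_modules)
print(from_default_dir)
print(from_given_dir)
```

A name found in `modules` never triggers a file lookup, which is why pre-loading a module overrides any same-named file in the directory.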
- classmethod from_string(s, name=None, modules=None, active=None)
Instantiate a REPP from a string.
- tokenize(s: str, pattern: str | None = None, active: Iterable[str] | None = None) -> YYTokenLattice
Rewrite and tokenize the input string s.
- Parameters:
- Returns:
a YYTokenLattice containing the tokens and their characterization information
- tokenize_result(result: REPPResult, pattern: str = '[ \\t]+') -> YYTokenLattice
Tokenize the result of rule application.
- Parameters:
result – a REPPResult object
pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+
- Returns:
a YYTokenLattice containing the tokens and their characterization information
- trace(s: str, active: Iterable[str] | None = None, verbose: bool = False) -> Iterator[REPPStep | REPPResult]
Rewrite string s like apply(), but yield each rewrite step.
- Parameters:
- Yields:
a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
- class delphin.repp.REPPResult(string, startmap, endmap)
The final result of REPP application.
- startmap
integer array of start offsets
- Type:
array
- endmap
integer array of end offsets
- Type:
array
- class delphin.repp.REPPStep(input, output, operation, applied, startmap, endmap)
A single rule application in REPP.
- operation
operation performed
- Type:
delphin.repp._REPPOperation
- startmap
integer array of start offsets
- Type:
array
- endmap
integer array of end offsets
- Type:
array
- mask
integer array of mask indicators
- Type:
array
- operation: _REPPOperation
Alias for field number 2
Exceptions
- exception delphin.repp.REPPError(*args, **kwargs)
Bases: PyDelphinException
Raised when there is an error in tokenizing with REPP.
- exception delphin.repp.REPPWarning(*args, **kwargs)
Bases: PyDelphinWarning
Issued when REPP may not behave as expected.