delphin.repp
Regular Expression Preprocessor (REPP)
A Regular Expression Preprocessor (REPP) is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.
Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
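The central trick of characterization can be illustrated with nothing but the standard library. The sketch below is not PyDelphin's implementation (the function name and map representation are invented for illustration): it applies one substitution while recording, for each output character, the input offset it came from, so spans in the rewritten string can always be traced back to the original input.

```python
import re

def rewrite_with_map(text, pattern, replacement):
    """Apply one substitution, mapping each output character back to
    the input offset it came from. Characters inserted by the
    replacement all map to the start of the matched span."""
    out_chars, offsets, pos = [], [], 0
    for m in re.finditer(pattern, text):
        for i in range(pos, m.start()):   # unchanged text before the match
            out_chars.append(text[i])
            offsets.append(i)
        for ch in m.expand(replacement):  # replacement chars -> match start
            out_chars.append(ch)
            offsets.append(m.start())
        pos = m.end()
    for i in range(pos, len(text)):       # trailing unchanged text
        out_chars.append(text[i])
        offsets.append(i)
    return "".join(out_chars), offsets

new, offmap = rewrite_with_map("rock & roll", "&", "and")
# the three characters of "and" all map back to the offset of "&"
```

This toy keeps a per-character origin map; REPP instead encodes the same origin information in start- and end-offset arrays (see REPPResult below), but the purpose is identical: recovering original character indices after rewriting.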
Note
Requires the regex library (https://bitbucket.org/mrabarnett/mrab-regex/) for advanced regular expression features such as group-local inline flags. Without it, PyDelphin will fall back to the re module in the standard library, which may give some unexpected results. The regex library, however, will not parse unescaped brackets in character classes without resorting to a compatibility mode (see this issue for the ERG), and PyDelphin will warn if this happens. The regex dependency is satisfied if you install PyDelphin with the [repp] extra (see Requirements, Installation, and Testing).
Module Constants
- delphin.repp.DEFAULT_TOKENIZER = '[ \\t]+'
The tokenization pattern used if none is given in a REPP module.
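To see what the default pattern does, here is a small stdlib-only sketch (the helper name is invented for illustration) that splits a string on [ \t]+ while recording each token's character span, which is the information a tokenizer must preserve for characterization:

```python
import re

DEFAULT_TOKENIZER = r'[ \t]+'  # same pattern as delphin.repp.DEFAULT_TOKENIZER

def split_with_spans(text, pattern=DEFAULT_TOKENIZER):
    """Split on the tokenization pattern, keeping each token's
    (token, start, end) character span in the input string."""
    tokens, pos = [], 0
    for m in re.finditer(pattern, text):
        if pos < m.start():
            tokens.append((text[pos:m.start()], pos, m.start()))
        pos = m.end()
    if pos < len(text):
        tokens.append((text[pos:], pos, len(text)))
    return tokens

split_with_spans("a  b\tc")
```

Runs of spaces and tabs are consumed as a single separator, so tokens carry no surrounding whitespace but their spans still index into the original string.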
Classes
- class delphin.repp.REPP(operations=None, name=None, modules=None, active=None)[source]
A Regular Expression Pre-Processor (REPP).
The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both methods, as does the class's __init__() method, allow for pre-loaded and named external modules to be provided, which allow for external group calls (also see from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.
A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.
- apply(s, active=None)[source]
Apply the REPP’s rewrite rules to the input string s.
- Parameters:
s (str) – the input string to process
active (optional) – a collection of external module names that may be applied if called
- Returns:
a REPPResult object containing the processed string and characterization maps
- classmethod from_config(path, directory=None)[source]
Instantiate a REPP from a PET-style .set configuration file.
The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.
- classmethod from_file(path, directory=None, modules=None, active=None)[source]
Instantiate a REPP from a .rpp file.
The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.
A REPP module may utilize external submodules, which may be defined in two ways. The first method is to map a module name to an instantiated REPP instance in modules. The second method assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method only happens if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
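As a point of reference, a top-level module using an external group call might look like the following hypothetical main.rpp (file and module names are invented; in REPP syntax, lines beginning with ; are comments, ! introduces a rewrite rule whose pattern and replacement are separated by a tab, : declares the tokenization pattern, and >quotes calls the external module quotes.rpp):

```
; hypothetical top-level module: main.rpp
:[ \t]+
!’	'
>quotes
```

Loaded with from_file('main.rpp'), the call >quotes would be resolved either from the modules mapping or, failing that, from a quotes.rpp file in the same directory.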
- classmethod from_string(s, name=None, modules=None, active=None)[source]
Instantiate a REPP from a string.
- tokenize(s, pattern=None, active=None)[source]
Rewrite and tokenize the input string s.
- Returns:
a YYTokenLattice containing the tokens and their characterization information
- tokenize_result(result, pattern='[ \\t]+')[source]
Tokenize the result of rule application.
- Parameters:
result – a REPPResult object
pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+
- Returns:
a YYTokenLattice containing the tokens and their characterization information
- trace(s, active=None, verbose=False)[source]
Rewrite string s like apply(), but yield each rewrite step.
- Yields:
a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
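The shape of such a trace can be mimicked with plain re (an illustrative toy, not REPP's machinery; note that comparing input and output only approximates REPP's applied flag, since a rule can match yet leave the string unchanged):

```python
import re

def trace_rewrites(text, rules):
    """Yield one step record per rule, mimicking the idea of
    REPP.trace(): each step reports the rule, its input, its
    output, and whether the string changed."""
    for pattern, replacement in rules:
        new = re.sub(pattern, replacement, text)
        yield {"operation": pattern, "input": text,
               "output": new, "applied": new != text}
        text = new

steps = list(trace_rewrites("don’t", [("’", "'"), ("x", "y")]))
# first rule rewrites the curly apostrophe; second rule never matches
```

Inspecting steps one at a time like this is the typical use of a trace: finding which rule in a long cascade produced an unexpected rewrite.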
- class delphin.repp.REPPResult(string, startmap, endmap)[source]
The final result of REPP application.
- string
the processed string
- Type:
str
- startmap
integer array of start offsets
- Type:
array
- endmap
integer array of end offsets
- Type:
array
- class delphin.repp.REPPStep(input, output, operation, applied, startmap, endmap, mask)[source]
A single rule application in REPP.
- input
the input string of the step
- Type:
str
- output
the output string of the step
- Type:
str
- operation
operation performed
- Type:
delphin.repp._REPPOperation
- applied
whether the operation was applied
- Type:
bool
- startmap
integer array of start offsets
- Type:
array
- endmap
integer array of end offsets
- Type:
array
- mask
integer array of mask indicators
- Type:
array
Exceptions
- exception delphin.repp.REPPError(*args, **kwargs)[source]
Bases:
PyDelphinException
Raised when there is an error in tokenizing with REPP.
- exception delphin.repp.REPPWarning(*args, **kwargs)[source]
Bases:
PyDelphinWarning
Issued when REPP may not behave as expected.