delphin.repp¶
Regular Expression Preprocessor (REPP)
A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.
[REPP] Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
Note

Requires regex (https://bitbucket.org/mrabarnett/mrab-regex/) for advanced regular expression features such as group-local inline flags. Without it, PyDelphin will fall back to the re module in the standard library, which may give some unexpected results. The regex library, however, will not parse unescaped brackets in character classes without resorting to a compatibility mode (see this issue for the ERG), and PyDelphin will warn if this happens. The regex dependency is satisfied if you install PyDelphin with the [repp] extra (see Requirements, Installation, and Testing).
Module Constants¶
delphin.repp.DEFAULT_TOKENIZER = '[ \\t]+'¶
    The tokenization pattern used if none is given in a REPP module.
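As an illustration of what this pattern matches, splitting on it with the standard library's re module breaks a string at runs of spaces and tabs (REPP itself additionally records character offsets, which a plain split discards):

```python
import re

# the default REPP tokenization pattern: one or more spaces or tabs
DEFAULT_TOKENIZER = r'[ \t]+'  # the same string as '[ \\t]+'

tokens = re.split(DEFAULT_TOKENIZER, 'the  dog\tbarks')
print(tokens)  # ['the', 'dog', 'barks']
```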
Classes¶
class delphin.repp.REPP(name=None, modules=None, active=None)[source]¶
    A Regular Expression Pre-Processor (REPP).

    The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both methods, like the class's __init__() method, allow pre-loaded and named external modules to be provided, which permit external group calls (also see from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.

    A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.
apply(s, active=None)[source]¶
    Apply the REPP's rewrite rules to the input string s.

    Parameters
        s (str) – the input string to process
        active (optional) – a collection of external module names that may be applied if called

    Returns
        a REPPResult object containing the processed string and characterization maps
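The characterization maps are what set REPP apart from a plain chain of substitutions: every position in the rewritten string can be traced back to an offset in the original input. The following toy function is not part of PyDelphin (its name and behaviour are illustrative only); it applies a single rewrite while maintaining such a start-offset map:

```python
import re

def rewrite_with_map(s, pattern, replacement):
    """Apply one rewrite rule to s, tracking for each output character
    the offset of the corresponding character in the original input
    (a rough analogue of REPP's characterization maps)."""
    out, mapping, pos = [], [], 0
    for m in re.finditer(pattern, s):
        for i in range(pos, m.start()):   # unchanged text maps one-to-one
            out.append(s[i])
            mapping.append(i)
        for ch in replacement:            # inserted text maps back to the
            out.append(ch)                # start of the matched input span
            mapping.append(m.start())
        pos = m.end()
    for i in range(pos, len(s)):          # trailing unchanged text
        out.append(s[i])
        mapping.append(i)
    return ''.join(out), mapping

rewritten, startmap = rewrite_with_map("don't stop", r"n't", " not")
print(rewritten)  # do not stop
print(startmap)   # [0, 1, 2, 2, 2, 2, 5, 6, 7, 8, 9]
```

A real REPPResult stores two such arrays, startmap and endmap, alongside the processed string.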
-
classmethod from_config(path, directory=None)[source]¶
    Instantiate a REPP from a PET-style .set configuration file.

    The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it is the directory part of path.
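For orientation, such a configuration file might look roughly like the following. The option names are modeled on the ERG's repp.set and are an assumption, not something this API documents:

```
;; hypothetical repp.set -- option names and values are illustrative
repp-modules := tokenizer wiki ascii quotes.
repp-tokenizer := tokenizer.
repp-calls := wiki quotes.
```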
-
classmethod from_file(path, directory=None, modules=None, active=None)[source]¶
    Instantiate a REPP from a .rpp file.

    The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it is the directory part of path.

    A REPP module may utilize external submodules, which may be defined in two ways. The first method is to map a module name to an instantiated REPP instance in modules. The second method assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method only happens if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
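As a sketch of this two-file layout, assuming standard REPP rule syntax (; begins a comment, : sets the tokenization pattern, ! introduces a rewrite rule whose search and replacement parts are separated by a tab, and >name calls an external group):

```
; main.rpp -- hypothetical top-level module
:[ \t]+
!wonna	want to
>quotes
```

```
; quotes.rpp -- hypothetical submodule found in the same directory
!``	"
```

Here >quotes is resolved either from an entry named quotes in modules or, failing that, from quotes.rpp in directory.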
-
classmethod from_string(s, name=None, modules=None, active=None)[source]¶
    Instantiate a REPP from a string.
-
tokenize(s, pattern=None, active=None)[source]¶
    Rewrite and tokenize the input string s.

    Returns
        a YYTokenLattice containing the tokens and their characterization information
-
tokenize_result(result, pattern='[ \\t]+')[source]¶
    Tokenize the result of rule application.

    Parameters
        result – a REPPResult object
        pattern (str, optional) – the regular expression pattern on which to split tokens; defaults to [ \t]+

    Returns
        a YYTokenLattice containing the tokens and their characterization information
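A simplified, standard-library-only picture of this splitting step (the function below is made up for illustration; a real YYTokenLattice carries more information than these (form, start, end) triples):

```python
import re

def tokenize_with_spans(s, pattern=r'[ \t]+'):
    """Split s on pattern, keeping each token's (start, end) offsets."""
    tokens = []
    pos = 0
    for m in re.finditer(pattern, s):
        if m.start() > pos:
            tokens.append((s[pos:m.start()], pos, m.start()))
        pos = m.end()
    if pos < len(s):
        tokens.append((s[pos:], pos, len(s)))
    return tokens

print(tokenize_with_spans('the  dog barks'))
# [('the', 0, 3), ('dog', 5, 8), ('barks', 9, 14)]
```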
-
trace(s, active=None, verbose=False)[source]¶
    Rewrite string s like apply(), but yield each rewrite step.

    Yields
        a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
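The yielding behaviour can be pictured with a toy generator (not PyDelphin code): it applies a list of (pattern, replacement) rules in order and yields each intermediate step, much as trace() yields REPPStep objects before the final REPPResult:

```python
import re

def trace_rules(s, rules):
    """Yield (pattern, before, after) for each rewrite step in order."""
    for pattern, replacement in rules:
        out = re.sub(pattern, replacement, s)
        yield pattern, s, out
        s = out

steps = list(trace_rules('a  b', [(r'[ \t]+', ' '), ('b', 'c')]))
print(steps[-1][2])  # a c
```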
-
class delphin.repp.REPPResult(string, startmap, endmap)[source]¶
    The final result of REPP application.

    startmap¶
        integer array of start offsets

        Type
            array

    endmap¶
        integer array of end offsets

        Type
            array

    property endmap
        Alias for field number 2

    property startmap
        Alias for field number 1

    property string
        Alias for field number 0
-
class delphin.repp.REPPStep(input, output, operation, applied, startmap, endmap)[source]¶
    A single rule application in REPP.

    operation¶
        operation performed

    startmap¶
        integer array of start offsets

        Type
            array

    endmap¶
        integer array of end offsets

        Type
            array

    property applied
        Alias for field number 3

    property endmap
        Alias for field number 5

    property input
        Alias for field number 0

    property operation
        Alias for field number 2

    property output
        Alias for field number 1

    property startmap
        Alias for field number 4
Exceptions¶

exception delphin.repp.REPPError(*args, **kwargs)[source]¶
    Bases: delphin.exceptions.PyDelphinException

    Raised when there is an error in tokenizing with REPP.

exception delphin.repp.REPPWarning(*args, **kwargs)[source]¶
    Bases: delphin.exceptions.PyDelphinWarning

    Issued when REPP may not behave as expected.