Package net.loomchild.segment.srx
Class SrxTextIterator
java.lang.Object
net.loomchild.segment.AbstractTextIterator
net.loomchild.segment.srx.SrxTextIterator
All Implemented Interfaces: Iterator<String>, TextIterator
Represents a text iterator that splits text according to the rules in an SRX file.
The idea of the algorithm is as follows:
1. A rule matcher list is created based on the SRX file and language. Each rule matcher is responsible for matching the before-break and after-break regular expressions of one break rule.
2. Each rule matcher is matched against the text. If the rule is not found, the rule matcher is removed from the list.
3. The rule matcher with the earliest break position in the text is selected.
4. The list of exception rules corresponding to the break rule is retrieved.
5. If none of the exception rules matches at the break position, the text is marked as split and a new segment is created. In addition, all rule matchers are moved so that they start after the end of the new segment (which is the same as the break position of the matched rule).
6. All rule matchers whose break position lies before the break position of the last matched rule are moved until they pass it.
7. If no segment was found, the whole process is repeated.
In the streaming version of this algorithm, a character buffer is searched. When its end is reached, or the break position is in the margin (break position > buffer size - margin), and there is more text, the buffer is moved in the text until it starts after the last found segment. When this happens, the rule matchers are reinitialized and the text is searched again. The streaming version has the limitation that the read buffer must be at least as long as any segment in the text.
This algorithm uses lookbehind extensively, but Java does not permit unbounded regular expressions in lookbehind, so some patterns are finitized. For example, the pattern a* will be changed to something like a{0,100}.
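The break-rule and exception-rule interplay described above can be sketched with plain java.util.regex. This is a minimal, non-streaming illustration and not the class's actual implementation: the names RuleSplitSketch, Rule, find, matchesAt and split are hypothetical, each rule is assumed to be expressible as a single zero-width pattern built from its before-break and after-break expressions, and no matcher list is kept between iterations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleSplitSketch {

    /** Hypothetical rule type: a zero-width pattern matching at a candidate position. */
    static final class Rule {
        final Pattern pattern;

        Rule(String beforeBreak, String afterBreak) {
            // Assumption: beforeBreak is bounded, so it is safe inside a lookbehind.
            this.pattern = Pattern.compile("(?<=" + beforeBreak + ")(?=" + afterBreak + ")");
        }

        /** Returns the first position >= from at which the rule matches, or -1. */
        int find(CharSequence text, int from) {
            Matcher m = pattern.matcher(text);
            return (from <= text.length() && m.find(from)) ? m.start() : -1;
        }

        /** Returns true if the rule matches exactly at the given position. */
        boolean matchesAt(CharSequence text, int position) {
            return find(text, position) == position;
        }
    }

    static List<String> split(String text, List<Rule> breakRules, List<Rule> exceptionRules) {
        List<String> segments = new ArrayList<>();
        int start = 0;      // start of the current segment
        int searchFrom = 1; // a break at position 0 would produce an empty segment
        while (searchFrom <= text.length()) {
            // Select the break rule with the earliest break position (step 3).
            int breakPosition = -1;
            for (Rule rule : breakRules) {
                int p = rule.find(text, searchFrom);
                if (p >= 0 && (breakPosition < 0 || p < breakPosition)) {
                    breakPosition = p;
                }
            }
            if (breakPosition < 0) {
                break; // no break rule matches any more
            }
            // Break only if no exception rule matches at that position (steps 4 and 5).
            boolean exception = false;
            for (Rule rule : exceptionRules) {
                if (rule.matchesAt(text, breakPosition)) {
                    exception = true;
                    break;
                }
            }
            if (!exception) {
                segments.add(text.substring(start, breakPosition));
                start = breakPosition;
            }
            // Continue the search past the position just examined (steps 6 and 7).
            searchFrom = breakPosition + 1;
        }
        if (start < text.length()) {
            segments.add(text.substring(start)); // remaining text is the last segment
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Rule> breakRules = Arrays.asList(new Rule("\\.", "\\s"));
        List<Rule> exceptionRules = Arrays.asList(new Rule("Mr\\.", "\\s"));
        System.out.println(split("Mr. Smith left. He was late.", breakRules, exceptionRules));
    }
}
```

In this sketch the candidate break after "Mr." is suppressed by the exception rule, so the first segment ends after "left.".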
-
Field Summary
Fields (Modifier and Type, Field, Description)
static final String BUFFER_LENGTH_PARAMETER - Reader buffer size.
static final int DEFAULT_BUFFER_LENGTH - Default size of read buffer when using streaming version of this class.
static final int DEFAULT_MARGIN - Default margin size.
static final int DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH - Default max lookbehind construct length parameter.
private SrxDocument document
private int end
private int margin
static final String MARGIN_PARAMETER - Margin size.
static final String MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER - Maximum length of a regular expression construct that occurs in lookbehind.
private RuleManager ruleManager
private List<RuleMatcher> ruleMatcherList
private String segment
private int start
private TextManager textManager
-
Constructor Summary
Constructors (Constructor, Description)
SrxTextIterator(SrxDocument document, String languageCode, Reader reader) - Creates streaming text iterator with no additional parameters.
SrxTextIterator(SrxDocument document, String languageCode, Reader reader, Map<String, Object> parameterMap) - Creates text iterator that obtains language rules from given document using given language code.
SrxTextIterator(SrxDocument document, String languageCode, CharSequence text) - Creates text iterator with no additional parameters.
SrxTextIterator(SrxDocument document, String languageCode, CharSequence text, Map<String, Object> parameterMap) - Creates text iterator that obtains language rules from given document using given language code.
-
Method Summary
Methods (Modifier and Type, Method, Description)
private void cutMatchers() - Moves matchers that start before the previous segment end.
private RuleMatcher getMinMatcher()
boolean hasNext()
private void init(SrxDocument document, String languageCode, TextManager textManager, Map<String, Object> parameterMap) - Initializes splitter.
private void initMatchers() - Initializes matcher list according to rules from ruleManager and text from textManager.
private boolean isException(RuleMatcher ruleMatcher) - Returns true if there are no exception rules preventing given rule matcher from breaking the text.
private void moveMatchers() - Moves all matchers to the next position if their break position is smaller than last segment end position.
String next() - Finds the next segment in the text and returns it.
Methods inherited from class net.loomchild.segment.AbstractTextIterator
remove, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Iterator
forEachRemaining
-
Field Details
-
MARGIN_PARAMETER
Margin size. Used in the streaming splitter. If a rule is matched but its position is in the margin (position > bufferLength - margin), the match is ignored, more text is read, and the rule is matched again.
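The margin condition can be written as a one-line check. The sketch below is only an illustration of the inequality stated above: the class and method names are hypothetical, the hasMoreText flag reflects the class description's remark that the margin only matters while there is more text to read, and the numbers in main are example values rather than the library's defaults.

```java
class MarginSketch {

    /**
     * Returns true when a break position falls inside the margin at the end of
     * the buffer and more text is available, meaning the match should be
     * ignored until the buffer has been refilled.
     */
    static boolean inMargin(int position, int bufferLength, int margin, boolean hasMoreText) {
        return hasMoreText && position > bufferLength - margin;
    }

    public static void main(String[] args) {
        // Example values only: buffer of 1024 characters with a margin of 128.
        System.out.println(inMargin(1000, 1024, 128, true));
        System.out.println(inMargin(500, 1024, 128, true));
    }
}
```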
-
BUFFER_LENGTH_PARAMETER
Reader buffer size. Segments cannot be longer than this value.
-
MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER
Maximum length of a regular expression construct that occurs in lookbehind.
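To illustrate what such a maximum length is used for, here is a simplified sketch of finitizing a pattern, along the lines of the a* to a{0,100} example in the class description. It is not the library's implementation: the names FinitizeSketch and finitize are hypothetical, and the sketch only rewrites the * and + quantifiers, copying escaped characters and character classes unchanged. Possessive quantifiers, open-ended {n,} ranges and \Q...\E quoting are not handled.

```java
import java.util.regex.Pattern;

class FinitizeSketch {

    /** Replaces the unbounded quantifiers * and + with bounded ones of length max. */
    static String finitize(String pattern, int max) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while inside a [...] character class
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                // Copy an escaped character verbatim, e.g. \* or \s.
                out.append(c).append(pattern.charAt(++i));
            } else if (inClass) {
                if (c == ']') {
                    inClass = false;
                }
                out.append(c);
            } else if (c == '[') {
                inClass = true;
                out.append(c);
            } else if (c == '*') {
                out.append("{0,").append(max).append('}');
            } else if (c == '+') {
                out.append("{1,").append(max).append('}');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String finite = finitize("a*", 100);
        System.out.println(finite);
        // The bounded form can be placed inside a lookbehind.
        Pattern p = Pattern.compile("(?<=" + finite + ")b");
        System.out.println(p.matcher("aab").find());
    }
}
```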
-
DEFAULT_MARGIN
public static final int DEFAULT_MARGIN
Default margin size.
-
DEFAULT_BUFFER_LENGTH
public static final int DEFAULT_BUFFER_LENGTH
Default size of the read buffer when using the streaming version of this class. No segment can be longer than the buffer size.
-
DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH
public static final int DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH
Default max lookbehind construct length parameter.
-
document
-
segment
-
start
private int start
-
end
private int end
-
textManager
-
ruleManager
-
ruleMatcherList
-
margin
private int margin
-
-
Constructor Details
-
SrxTextIterator
public SrxTextIterator(SrxDocument document, String languageCode, CharSequence text, Map<String, Object> parameterMap)
Creates a text iterator that obtains language rules from the given document using the given language code. This constructor version is not streaming because it receives the whole text as a string. Supported parameters: MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
text - text to segment
parameterMap - additional segmentation parameters
-
SrxTextIterator
Creates a text iterator with no additional parameters.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
text - text to segment
-
SrxTextIterator
public SrxTextIterator(SrxDocument document, String languageCode, Reader reader, Map<String, Object> parameterMap)
Creates a text iterator that obtains language rules from the given document using the given language code. This is the streaming constructor: it reads text from the reader using a buffer with the given size and margin. A single segment cannot be longer than the buffer size. If a rule is matched but its position is in the margin (position > bufferLength - margin), the match is ignored, more text is read, and the rule is matched again. This is needed because an incomplete rule could otherwise be located at the end of the buffer and never matched. Supported parameters: BUFFER_LENGTH_PARAMETER, MARGIN_PARAMETER, MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
reader - reader from which the text is read
parameterMap - additional segmentation parameters
-
SrxTextIterator
Creates a streaming text iterator with no additional parameters.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
reader - reader from which the text is read
-
-
Method Details
-
next
public String next()
Finds the next segment in the text and returns it.
Returns:
next segment, or null if it doesn't exist
Throws:
IllegalStateException - if the buffer is too small to hold the segment
IORuntimeException - if an IO error occurs when reading the text
-
hasNext
public boolean hasNext()
Returns:
true if there are more segments
-
init
private void init(SrxDocument document, String languageCode, TextManager textManager, Map<String, Object> parameterMap)
Initializes the splitter.
Parameters:
document - SRX document
languageCode - text language code
textManager - text manager containing the text
parameterMap - additional segmentation parameters
-
initMatchers
private void initMatchers()
Initializes the matcher list according to rules from ruleManager and text from textManager.
-
moveMatchers
private void moveMatchers()
Moves all matchers to the next position if their break position is smaller than the last segment end position.
-
cutMatchers
private void cutMatchers()
Moves matchers that start before the previous segment end.
-
getMinMatcher
Returns:
first matcher in the text, or null if there are no matchers
-
isException
Returns true if there are no exception rules preventing the given rule matcher from breaking the text.
Parameters:
ruleMatcher - rule matcher
Returns:
true if the rule matcher breaks the text
-