Package com.optimaize.langdetect.text
Class RemoveMinorityScriptsTextFilter
java.lang.Object
com.optimaize.langdetect.text.RemoveMinorityScriptsTextFilter
- All Implemented Interfaces:
TextFilter
Removes text written in scripts that are not the dominant script of the text.
TODO: There is no special handling for Japanese (3 scripts) and Korean (2 scripts). The scripts of each of these languages should be
counted together and kept.
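The approach described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's actual implementation: the class name MinorityScriptSketch is hypothetical, and skipping the COMMON, INHERITED and UNKNOWN scripts while counting and removing is an assumption made for this sketch.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of minority-script removal; names are illustrative only.
final class MinorityScriptSketch {

    // Assumption: punctuation, digits and combining marks belong to no
    // particular language script, so they are neither counted nor removed.
    static boolean isNeutral(Character.UnicodeScript script) {
        return script == Character.UnicodeScript.COMMON
                || script == Character.UnicodeScript.INHERITED
                || script == Character.UnicodeScript.UNKNOWN;
    }

    // Counts code points per Unicode script, ignoring neutral scripts.
    static Map<Character.UnicodeScript, Long> countByScript(CharSequence text) {
        Map<Character.UnicodeScript, Long> counts = new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (!isNeutral(script)) {
                counts.merge(script, 1L, Long::sum);
            }
        });
        return counts;
    }

    // Drops every code point whose script has a share below the threshold,
    // measured relative to the most used script.
    static String filter(CharSequence text, double threshold) {
        Map<Character.UnicodeScript, Long> counts = countByScript(text);
        if (counts.size() <= 1) {
            return text.toString(); // zero or one script: nothing to remove
        }
        long most = Collections.max(counts.values());
        Set<Character.UnicodeScript> toRemove = EnumSet.noneOf(Character.UnicodeScript.class);
        for (Map.Entry<Character.UnicodeScript, Long> entry : counts.entrySet()) {
            double ratio = (double) entry.getValue() / most;
            if (ratio < threshold) { // smaller is removed, equal remains
                toRemove.add(entry.getKey());
            }
        }
        StringBuilder kept = new StringBuilder();
        text.codePoints()
                .filter(cp -> !toRemove.contains(Character.UnicodeScript.of(cp)))
                .forEach(kept::appendCodePoint);
        return kept.toString();
    }

    public static void main(String[] args) {
        // 9 Cyrillic letters versus 2 Latin letters: 2/9 is below 0.3.
        System.out.println(filter("Привет мир ok", 0.3));
    }
}
```

In this sketch the Latin letters are dropped while the spaces stay, because spaces belong to the COMMON script and are never added to the removal set.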
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type    Method    Description
private Map<Character.UnicodeScript, Long>    countByScript(CharSequence text)
filter(CharSequence text)
private long    findMost(Map<Character.UnicodeScript, Long> counts)
forThreshold(double threshold)    If a script has less than this fraction of content compared to the most used one, its text is removed.
private void    increment(Map<Character.UnicodeScript, Long> counter, Character.UnicodeScript unicodeScript)
private String    remove(CharSequence text, Set<Character.UnicodeScript> toRemove)
-
Field Details
-
threshold
private final double threshold
-
-
Constructor Details
-
RemoveMinorityScriptsTextFilter
private RemoveMinorityScriptsTextFilter(double threshold)
-
-
Method Details
-
forThreshold
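If a script has less than this fraction of content compared to the most used one, its text is removed. Example: Latin 10%, Cyrillic 80%, Common 10% (punctuation and the like). Latin's 10% is compared with Cyrillic's 80%, giving a ratio of 10/80 = 0.125. That is below the suggested threshold of 0.3, so the Latin text is removed.
- Parameters:
threshold - A value between 0 and 1; the suggested value is 0.3. A script whose ratio is smaller than the threshold is removed; a script whose ratio equals the threshold remains.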
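The threshold rule can be checked with a few lines of arithmetic. The following is a minimal sketch of the comparison as documented here (smaller is removed, equal remains). The class ThresholdSketch and the method isRemoved are hypothetical names chosen for illustration and are not part of the library.

```java
// Hypothetical sketch of the documented threshold comparison.
final class ThresholdSketch {

    // Returns true when a script's count, relative to the most used script,
    // falls strictly below the threshold. An equal ratio is kept.
    static boolean isRemoved(long count, long most, double threshold) {
        double ratio = (double) count / most;
        return ratio < threshold;
    }

    public static void main(String[] args) {
        // Latin 10 versus Cyrillic 80: ratio 0.125, below 0.3.
        System.out.println(isRemoved(10, 80, 0.3));
        // The dominant script compared with itself: ratio 1.0, always kept.
        System.out.println(isRemoved(80, 80, 0.3));
    }
}
```

Note that the ratio is taken against the most used script, not against the total text length, so a script holding 10% of the text next to one holding 20% has a ratio of 0.5 and would be kept at a threshold of 0.3.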
-
filter
- Specified by:
filter
in interface TextFilter
-
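remove
private String remove(CharSequence text, Set<Character.UnicodeScript> toRemove)
-
findMost
private long findMost(Map<Character.UnicodeScript, Long> counts)
-
countByScript
private Map<Character.UnicodeScript, Long> countByScript(CharSequence text)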
-
increment
private void increment(Map<Character.UnicodeScript, Long> counter, Character.UnicodeScript unicodeScript)
-