public final class StandardTokenizerImpl extends Object
Tokens produced are of the following types:
Modifier and Type | Field and Description |
---|---|
static int | EMOJI_TYPE: Emoji token type |
static int | HANGUL_TYPE: Hangul token type |
static int | HIRAGANA_TYPE: Hiragana token type |
static int | IDEOGRAPHIC_TYPE: Ideographic token type |
static int | KATAKANA_TYPE: Katakana token type |
static int | NUMERIC_TYPE: Numbers |
static int | SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). |
static int | WORD_TYPE: Alphanumeric sequences |
static int | YYEOF: This character denotes the end of file |
static int | YYINITIAL: lexical states |
Constructor and Description |
---|
StandardTokenizerImpl(Reader in): Creates a new scanner. |
Modifier and Type | Method and Description |
---|---|
int | getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs. |
void | getText(CharTermAttribute t): Fills CharTermAttribute with the current token text. |
void | setBufferSize(int numChars): Sets the scanner buffer size in chars. |
void | yybegin(int newState): Enters a new lexical state. |
int | yychar(): Character count processed so far. |
char | yycharat(int pos): Returns the character at position pos from the matched text. |
void | yyclose(): Closes the input stream. |
int | yylength(): Returns the length of the matched text region. |
void | yypushback(int number): Pushes the specified number of characters back into the input stream. |
void | yyreset(Reader reader): Resets the scanner to read from a new input stream. |
int | yystate(): Returns the current lexical state. |
String | yytext(): Returns the text matched by the current regular expression. |
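The methods above suggest a calling pattern: call getNextToken() repeatedly until it returns YYEOF, and read the matched text after each call. The sketch below illustrates only that loop. ToyScanner is a hypothetical stand-in written so the example runs without Lucene on the classpath; its constant values and its letter-or-digit matching rule are assumptions for illustration and are much simpler than what the real generated scanner does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the shape of the scanner API
// (getNextToken, yytext, yylength). It is NOT StandardTokenizerImpl.
class ToyScanner {
    // Assumed values, chosen for this sketch only.
    static final int YYEOF = -1;
    static final int WORD_TYPE = 0;
    static final int NUMERIC_TYPE = 1;

    private final Reader in;
    private final StringBuilder matched = new StringBuilder();

    ToyScanner(Reader in) {
        this.in = in;
    }

    // Skips separators, then collects one run of letters/digits.
    // Returns a token type, or YYEOF when the input is exhausted.
    int getNextToken() throws IOException {
        matched.setLength(0);
        int c = in.read();
        while (c != -1 && !Character.isLetterOrDigit(c)) {
            c = in.read();
        }
        if (c == -1) {
            return YYEOF;
        }
        boolean allDigits = true;
        while (c != -1 && Character.isLetterOrDigit(c)) {
            allDigits &= Character.isDigit(c);
            matched.append((char) c);
            c = in.read();
        }
        return allDigits ? NUMERIC_TYPE : WORD_TYPE;
    }

    String yytext() {
        return matched.toString();
    }

    int yylength() {
        return matched.length();
    }

    // The scan loop: keep asking for tokens until YYEOF comes back.
    static List<String> scanAll(String input) {
        ToyScanner scanner = new ToyScanner(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            for (int type = scanner.getNextToken(); type != YYEOF; type = scanner.getNextToken()) {
                tokens.add(type + ":" + scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("abc 42 x9"));
    }
}
```

With the real class, the same loop would presumably compare the return value against the type constants listed in the field summary and copy the text out with getText(CharTermAttribute) rather than yytext().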
public static final int YYEOF
public static final int YYINITIAL
public static final int WORD_TYPE
public static final int NUMERIC_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
public static final int IDEOGRAPHIC_TYPE
public static final int HIRAGANA_TYPE
public static final int KATAKANA_TYPE
public static final int HANGUL_TYPE
public static final int EMOJI_TYPE
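To give a rough feel for which text SOUTH_EAST_ASIAN_TYPE is about, the sketch below checks whether a string consists only of characters from the four scripts named in the field description. This is an approximation under stated assumptions: it tests the Unicode Script property through Character.UnicodeScript, not the Line_Break = Complex_Context class the scanner is documented to use, and the two differ (the real class covers more scripts and excludes some characters, such as digits). SeaScriptCheck and looksSouthEastAsian are hypothetical names.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class SeaScriptCheck {
    // Assumption: these four scripts stand in for Line_Break=Complex_Context.
    // The actual Unicode property is broader and not purely script-based.
    static final Set<Character.UnicodeScript> SEA_SCRIPTS = EnumSet.of(
            Character.UnicodeScript.THAI,
            Character.UnicodeScript.LAO,
            Character.UnicodeScript.MYANMAR,
            Character.UnicodeScript.KHMER);

    // True when the string is non-empty and every code point belongs
    // to one of the scripts listed above.
    static boolean looksSouthEastAsian(String s) {
        if (s.isEmpty()) {
            return false;
        }
        return s.codePoints()
                .allMatch(cp -> SEA_SCRIPTS.contains(Character.UnicodeScript.of(cp)));
    }

    public static void main(String[] args) {
        // Thai greeting written with escapes to avoid source-encoding issues.
        System.out.println(looksSouthEastAsian("\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35"));
        System.out.println(looksSouthEastAsian("hello"));
    }
}
```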
public StandardTokenizerImpl(Reader in)
in - the java.io.Reader to read input from.
public final int yychar()
public final void getText(CharTermAttribute t)
public final void setBufferSize(int numChars)
public final void yyclose() throws IOException
IOException
public final void yyreset(Reader reader)
reader - the new input stream
public final int yystate()
public final void yybegin(int newState)
newState - the new lexical state
public final String yytext()
public final char yycharat(int pos)
pos - the position of the character to fetch. A value from 0 to yylength()-1.
public final int yylength()
public void yypushback(int number)
number - the number of characters to be read again. This number must not be greater than yylength()!
public int getNextToken() throws IOException
IOException - if any I/O error occurs
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.