andyr.jtokeniser
Class SentenceTokeniser
java.lang.Object
andyr.jtokeniser.Tokeniser
andyr.jtokeniser.SentenceTokeniser
public class SentenceTokeniser
- extends Tokeniser
The SentenceTokeniser class uses a BreakIterator to find each sentence according to a specified locale.
The following is one example of the use of the tokeniser. The code:
SentenceTokeniser st = new SentenceTokeniser("Dr Pearson wasn't a physician. She has a PhD. instead.");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
prints the following output:
Dr Pearson wasn't a physician.
She has a PhD. instead.
The tokeniser isn't perfect: there are cases (albeit relatively rare) that will confuse the algorithms used. For example, if we alter the above example and change "Dr" to "Dr." (with the period signifying the abbreviation), the code:
SentenceTokeniser st = new SentenceTokeniser("Dr. Pearson wasn't a physician. She has a PhD. instead.");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
prints the following output:
Dr.
Pearson wasn't a physician.
She has a PhD. instead.
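As a rough illustration of the BreakIterator behaviour described above, the following sketch splits text into sentences using only the JDK. It is not the source of SentenceTokeniser; the class name SentenceSplitSketch and its split method are hypothetical, and it assumes the tokeniser relies on BreakIterator.getSentenceInstance(Locale), which this page does not state explicitly.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of andyr.jtokeniser.
public class SentenceSplitSketch {

    // Returns the sentences found in the input for the given locale.
    public static List<String> split(String input, Locale locale) {
        // Assumption: a sentence-level BreakIterator is what drives the tokeniser.
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(input);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the offset of the following boundary,
        // or BreakIterator.DONE once the end of the text has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = input.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Pearson wasn't a physician. She has a PhD. instead.";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Because the boundaries come from general-purpose rules rather than a list of known abbreviations, a period followed by a space and a capital letter (as in "Dr. Pearson") can be treated as the end of a sentence.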
- Version:
- 1.2 (01-Aug-2005)
- Author:
- Andrew Roberts
Constructor Summary
SentenceTokeniser(java.lang.String input)
Creates a SentenceTokeniser that tokenises the input into sentences.
SentenceTokeniser(java.lang.String input, java.util.Locale locale)
Creates a SentenceTokeniser that tokenises the input into sentences according to a given locale.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SentenceTokeniser
public SentenceTokeniser(java.lang.String input,
java.util.Locale locale)
- Creates a SentenceTokeniser that tokenises the input into sentences according to a given locale. The segmentation algorithm is dependent on the language being processed, so it is important to use the correct locale.
The "tokens" in this instance are actually the sentences themselves. So, a call to nextToken() will return the next sentence, and not the next word.
- Parameters:
input - a string from which the tokens will be extracted.
locale - the locale that the BreakIterator will use for finding sentence boundaries.
SentenceTokeniser
public SentenceTokeniser(java.lang.String input)
- Creates a SentenceTokeniser that tokenises the input into sentences. The tokeniser will use the default locale as returned by Locale.getDefault().
- Parameters:
input - a string from which the tokens will be extracted.
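The relationship between the two constructors, and the way nextToken() hands back whole sentences, can be sketched with a small stand-alone class. This is only an approximation under stated assumptions: SentenceCursorSketch is a hypothetical name, the real class extends Tokeniser (whose internals are not shown on this page), and the delegation from the one-argument constructor to Locale.getDefault() is inferred from the descriptions above.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.NoSuchElementException;

// Hypothetical stand-in that mimics the documented hasMoreTokens()/nextToken() usage.
public class SentenceCursorSketch {
    private final String input;
    private final BreakIterator breaks;
    private int start; // offset where the next sentence begins
    private int end;   // offset of the next boundary, or BreakIterator.DONE

    // Mirrors the one-argument constructor: falls back to the default locale.
    public SentenceCursorSketch(String input) {
        this(input, Locale.getDefault());
    }

    // Mirrors the two-argument constructor: the locale selects the boundary rules.
    public SentenceCursorSketch(String input, Locale locale) {
        this.input = input;
        this.breaks = BreakIterator.getSentenceInstance(locale);
        this.breaks.setText(input);
        this.start = breaks.first();
        this.end = breaks.next();
    }

    public boolean hasMoreTokens() {
        return end != BreakIterator.DONE;
    }

    // Each "token" is a whole sentence, with surrounding whitespace trimmed.
    public String nextToken() {
        if (!hasMoreTokens()) {
            throw new NoSuchElementException("no more sentences");
        }
        String sentence = input.substring(start, end).trim();
        start = end;
        end = breaks.next();
        return sentence;
    }

    public static void main(String[] args) {
        SentenceCursorSketch cursor =
                new SentenceCursorSketch("Hello world. This is a test.", Locale.ENGLISH);
        while (cursor.hasMoreTokens()) {
            System.out.println(cursor.nextToken());
        }
    }
}
```

Passing the locale explicitly, rather than relying on the default, keeps the results the same across machines with different regional settings.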