andyr.jtokeniser
Class SentenceTokeniser

java.lang.Object
  extended by andyr.jtokeniser.Tokeniser
      extended by andyr.jtokeniser.SentenceTokeniser

public class SentenceTokeniser
extends Tokeniser

The SentenceTokeniser class uses a BreakIterator to find each sentence according to a specified locale.

The following is one example of the use of the tokeniser. The code:

     SentenceTokeniser st = new SentenceTokeniser("Dr Pearson wasn't a physician. She has a PhD. instead.");
     while (st.hasMoreTokens()) {
         System.out.println(st.nextToken());
     }
 

prints the following output:

     Dr Pearson wasn't a physician.
     She has a PhD. instead.
 
The tokeniser is not perfect: there are cases, albeit relatively rare, that confuse the algorithms used. For example, if we alter the above example by changing "Dr" to "Dr." (with the period signifying the abbreviation), the code:

     SentenceTokeniser st = new SentenceTokeniser("Dr. Pearson wasn't a physician. She has a PhD. instead.");
     while (st.hasMoreTokens()) {
         System.out.println(st.nextToken());
     }
 

prints the following output:

     Dr.
     Pearson wasn't a physician.
     She has a PhD. instead.
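
The behaviour described above can be explored with the standard java.text.BreakIterator directly. The following is a minimal sketch, not the SentenceTokeniser source: the class name SentenceBreakSketch and the method splitSentences are hypothetical, and trimming the whitespace that trails each sentence is an assumption made so the output resembles the listings above. Exact boundaries depend on the JDK's break rules, so no output is claimed for the "Dr." input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of andyr.jtokeniser.
class SentenceBreakSketch {

    // Returns the sentences found in text by a locale-specific BreakIterator.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator boundary = BreakIterator.getSentenceInstance(locale);
        boundary.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            // Assumption: strip the whitespace the iterator leaves after a sentence.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "Dr. Pearson wasn't a physician. She has a PhD. instead.";
        for (String sentence : splitSentences(input, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Running the sketch with and without the period after "Dr" is a quick way to see where the underlying rules place a boundary after an abbreviation.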
 

Version:
1.2 (01-Aug-2005)
Author:
Andrew Roberts

Field Summary
 
Fields inherited from class andyr.jtokeniser.Tokeniser
currentTokenPosition, tokens
 
Constructor Summary
SentenceTokeniser(java.lang.String input)
          Creates a SentenceTokeniser that tokenises the input into sentences.
SentenceTokeniser(java.lang.String input, java.util.Locale locale)
          Creates a SentenceTokeniser that tokenises the input into sentences according to a given locale.
 
Method Summary
 
Methods inherited from class andyr.jtokeniser.Tokeniser
countTokens, getTokens, hasMoreTokens, nextToken, numberOfTokens
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SentenceTokeniser

public SentenceTokeniser(java.lang.String input,
                         java.util.Locale locale)
Creates a SentenceTokeniser that tokenises the input into sentences according to a given locale. The segmentation algorithm is dependent on the language being processed, so it is important to use the correct locale. The "tokens" in this instance are actually the sentences themselves. So, a call to nextToken() will actually return the next sentence, and not the next word.

Parameters:
input - a string from which the tokens will be extracted.
locale - the locale that the BreakIterator will use for finding sentence boundaries.
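
To make the "tokens are sentences" contract concrete, here is a small stand-in with the same constructor shape and the hasMoreTokens()/nextToken() pair. It is a sketch under assumptions: MiniSentenceTokeniser is a hypothetical class, the field names merely echo those listed as inherited from Tokeniser (their real types are not documented here), and throwing NoSuchElementException when the tokens run out is a guess rather than documented behaviour.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;

// Hypothetical stand-in for illustration; not the andyr.jtokeniser implementation.
class MiniSentenceTokeniser {
    // Assumed representation: sentences are extracted eagerly into a list.
    private final List<String> tokens = new ArrayList<>();
    private int currentTokenPosition = 0;

    MiniSentenceTokeniser(String input, Locale locale) {
        // The locale selects the sentence-boundary rules.
        BreakIterator boundary = BreakIterator.getSentenceInstance(locale);
        boundary.setText(input);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            String sentence = input.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                tokens.add(sentence);
            }
        }
    }

    boolean hasMoreTokens() {
        return currentTokenPosition < tokens.size();
    }

    // Returns the next sentence, not the next word.
    String nextToken() {
        if (!hasMoreTokens()) {
            throw new NoSuchElementException("no more sentences");
        }
        return tokens.get(currentTokenPosition++);
    }

    public static void main(String[] args) {
        MiniSentenceTokeniser st =
                new MiniSentenceTokeniser("It rained. We stayed in.", Locale.UK);
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}
```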

SentenceTokeniser

public SentenceTokeniser(java.lang.String input)
Creates a SentenceTokeniser that tokenises the input into sentences. The tokeniser will use the default locale as returned by Locale.getDefault().

Parameters:
input - a string from which the tokens will be extracted.