Stopword support in Weka has always been a bit poor, to say the least. Initially, there was only a hard-coded list, based on the Rainbow tool. However, having stopwords only for the English language was rather limiting. Being able to supply your own list of stopwords in the StringToWordVector filter made the whole thing a bit more flexible, but you still couldn’t supply your own stopword algorithm. Yesterday, I sat down and implemented a new class hierarchy centered around the weka.core.stopwords.StopwordsHandler interface. I added the following algorithms:
- Null – never flags a word as a stopword
- Rainbow – the previous hard-coded list of stopwords
- WordsFromList – loads words from a file
- RegExpFromList – applies regular expressions loaded from a file
- MultiStopwords – applies multiple stopwords algorithms sequentially
Eibe reworked the StringToWordVector filter today to make use of the new class hierarchy.
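To give a feel for how the algorithms listed above fit together, here is a minimal, self-contained sketch in plain Java. It is not the Weka source: the interface name StopwordsSketch and the factory methods none, fromWords, fromPatterns and anyOf are made up for illustration, and the words and patterns are passed in directly rather than loaded from a file. I am also assuming that list lookups ignore case and that a combined handler flags a word as soon as any of its members does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for a stopwords handler; not Weka's actual interface.
interface StopwordsSketch {
    boolean isStopword(String word);

    // "Null" idea: never flags a word as a stopword.
    static StopwordsSketch none() {
        return word -> false;
    }

    // List idea: flags words contained in a fixed set (case-insensitive by assumption).
    static StopwordsSketch fromWords(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(w.toLowerCase(Locale.ROOT));
        }
        return word -> set.contains(word.toLowerCase(Locale.ROOT));
    }

    // Regular expression idea: flags words that fully match any of the patterns.
    static StopwordsSketch fromPatterns(String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String r : regexes) {
            patterns.add(Pattern.compile(r));
        }
        return word -> patterns.stream().anyMatch(p -> p.matcher(word).matches());
    }

    // Multi idea: asks each handler in order and stops at the first one that flags the word.
    static StopwordsSketch anyOf(StopwordsSketch... handlers) {
        return word -> {
            for (StopwordsSketch h : handlers) {
                if (h.isStopword(word)) {
                    return true;
                }
            }
            return false;
        };
    }
}

class StopwordsSketchDemo {
    public static void main(String[] args) {
        StopwordsSketch handler = StopwordsSketch.anyOf(
            StopwordsSketch.fromWords("the", "a", "of"),
            StopwordsSketch.fromPatterns("\\d+"));

        System.out.println(handler.isStopword("The"));   // true (list match, case ignored)
        System.out.println(handler.isStopword("2024"));  // true (matches \d+)
        System.out.println(handler.isStopword("weka"));  // false
    }
}
```

The point of hiding everything behind a single isStopword check is that a filter only needs to hold one handler, no matter whether that handler is a plain list, a set of regular expressions, or a combination of both.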