Stopwords for Weka

Stopword support in Weka has always been a bit poor, to say the least. Initially, there was only a hard-coded list, based on the Rainbow tool. However, having stopwords only for the English language was rather limiting. Being able to supply your own list of stopwords in the StringToWordVector filter made things somewhat more flexible, but you still couldn’t supply your own stopwords algorithm. Yesterday, I sat down and implemented a new class hierarchy centered around the weka.core.stopwords.StopwordsHandler interface. I added the following algorithms:

  • Null – never flags a word as a stopword
  • Rainbow – the previous hard-coded list of stopwords
  • WordsFromList – loads words from a file
  • RegExpFromList – applies regular expressions loaded from a file
  • MultiStopwords – applies multiple stopwords algorithms sequentially

Eibe reworked the StringToWordVector filter today to make use of the new class hierarchy.
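
To give a feel for how such a hierarchy fits together, here is a minimal, self-contained sketch in plain Java. It does not use Weka itself: the StopwordsHandler interface below is a local stand-in with a single isStopword(String) method, which is an assumption about the shape of the real interface. The class names (NullHandler, WordList, RegExpList, Multi), the filterTokens helper, and the choice to compare words case-insensitively are all hypothetical and only mirror the ideas from the list above; words and patterns are passed in directly rather than loaded from a file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class StopwordsSketch {

  /** Local stand-in for the handler interface (assumed shape, not Weka's class). */
  interface StopwordsHandler {
    boolean isStopword(String word);
  }

  /** Never flags a word as a stopword. */
  static class NullHandler implements StopwordsHandler {
    public boolean isStopword(String word) {
      return false;
    }
  }

  /** Flags words contained in a fixed list (case-insensitive by assumption). */
  static class WordList implements StopwordsHandler {
    private final Set<String> words = new HashSet<>();

    WordList(String... words) {
      for (String w : words) {
        this.words.add(w.toLowerCase());
      }
    }

    public boolean isStopword(String word) {
      return words.contains(word.toLowerCase());
    }
  }

  /** Flags words that fully match any of the given regular expressions. */
  static class RegExpList implements StopwordsHandler {
    private final List<Pattern> patterns = new ArrayList<>();

    RegExpList(String... regexps) {
      for (String r : regexps) {
        patterns.add(Pattern.compile(r));
      }
    }

    public boolean isStopword(String word) {
      for (Pattern p : patterns) {
        if (p.matcher(word).matches()) {
          return true;
        }
      }
      return false;
    }
  }

  /** Applies several handlers in order; a word is a stopword if any one flags it. */
  static class Multi implements StopwordsHandler {
    private final StopwordsHandler[] handlers;

    Multi(StopwordsHandler... handlers) {
      this.handlers = handlers;
    }

    public boolean isStopword(String word) {
      for (StopwordsHandler h : handlers) {
        if (h.isStopword(word)) {
          return true;
        }
      }
      return false;
    }
  }

  /** Splits text on whitespace and keeps only tokens the handler does not flag. */
  static List<String> filterTokens(String text, StopwordsHandler handler) {
    List<String> kept = new ArrayList<>();
    for (String token : text.split("\\s+")) {
      if (!token.isEmpty() && !handler.isStopword(token)) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    StopwordsHandler handler =
        new Multi(new WordList("the", "ran"), new RegExpList("\\d+"));
    // prints [quick, fox, laps]
    System.out.println(filterTokens("The quick fox ran 42 laps", handler));
  }
}
```

The point of the design is that a consumer such as a word-vector filter only needs to call isStopword on whatever handler it is given, so swapping a fixed list for a regular-expression or combined handler requires no change on the consumer side.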