Stopwords for Weka – FracPete's Projects

fracpete July 21, 2014

Stopword support in Weka has always been a bit poor, to say the least. Initially, there was only a hard-coded list, based on the Rainbow tool. However, having stopwords only for the English language was rather limiting. Being able to supply your own list of stopwords in the StringToWordVector filter made things a bit more flexible, but you still couldn’t supply your own stopwords algorithm. Yesterday, I sat down and implemented a new class hierarchy centered around the weka.core.stopwords.StopwordsHandler interface. I added the following algorithms:

Null – never flags a word as a stopword
Rainbow – the previous hard-coded list of stopwords
WordsFromList – loads words from a file
RegExpFromList – applies regular expressions loaded from a file
MultiStopwords – applies multiple stopwords algorithms sequentially
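
To give a rough idea of how the handlers listed above fit together, here is a minimal, self-contained sketch. It does not use the actual Weka classes: the StopwordsSketch interface and all class names below are hypothetical stand-ins, the words and patterns are passed in memory instead of being loaded from a file, and the assumptions that a regular expression must match the whole word and that the multi handler flags a word as soon as any of its handlers does are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for a stopwords handler; NOT Weka's actual interface.
interface StopwordsSketch {
    boolean isStopword(String word);
}

// Counterpart of "Null": never flags a word as a stopword.
class NullSketch implements StopwordsSketch {
    @Override
    public boolean isStopword(String word) {
        return false;
    }
}

// Counterpart of "WordsFromList", with the words given in memory (assumption:
// case-insensitive comparison).
class WordListSketch implements StopwordsSketch {
    private final Set<String> words = new HashSet<>();

    WordListSketch(String... stopwords) {
        for (String w : stopwords) {
            words.add(w.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean isStopword(String word) {
        return words.contains(word.toLowerCase(Locale.ROOT));
    }
}

// Counterpart of "RegExpFromList" (assumption: a pattern must match the whole word).
class RegExpSketch implements StopwordsSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    RegExpSketch(String... regexes) {
        for (String r : regexes) {
            patterns.add(Pattern.compile(r));
        }
    }

    @Override
    public boolean isStopword(String word) {
        for (Pattern p : patterns) {
            if (p.matcher(word).matches()) {
                return true;
            }
        }
        return false;
    }
}

// Counterpart of "MultiStopwords": asks each handler in order and stops at
// the first one that flags the word (assumption).
class MultiSketch implements StopwordsSketch {
    private final List<StopwordsSketch> handlers;

    MultiSketch(StopwordsSketch... handlers) {
        this.handlers = Arrays.asList(handlers);
    }

    @Override
    public boolean isStopword(String word) {
        for (StopwordsSketch h : handlers) {
            if (h.isStopword(word)) {
                return true;
            }
        }
        return false;
    }
}

class StopwordsSketchDemo {
    public static void main(String[] args) {
        StopwordsSketch multi =
            new MultiSketch(new WordListSketch("the", "a"), new RegExpSketch("\\d+"));
        for (String token : new String[] {"The", "2014", "weka"}) {
            System.out.println(token + " -> " + multi.isStopword(token));
        }
    }
}
```

The point of the single-method interface is that a filter only ever needs to ask "is this word a stopword?", so a word list, a set of regular expressions, or any combination of them can be swapped in without the filter knowing the difference.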

Eibe reworked the StringToWordVector filter today to make use of the new class hierarchy.
