Text mining with MEKA

Until now, MEKA had no proper support for textual data: unlike Weka with its TextDirectoryLoader, it could not import text files. Prompted by a recent post on the MEKA mailing list, I spent my lunch break today putting together similar functionality for MEKA. The result is the meka.core.converters.MultiLabelTextDirectoryLoader.

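Example: a directory with one sub-directory per label, each containing one sub-directory per label value (0 or 1) that holds the text files with that value: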

/text-dataset
   /class1
     /0
       3.txt
       5.txt
     /1
       1.txt
       2.txt
       4.txt
   /class2
     /0
       1.txt
       4.txt
     /1
       2.txt
       3.txt
       5.txt

The loader turns this layout into a dataset like the following:

@relation 'example: -C 2'

@attribute @@class-class1@@ {0,1}
@attribute @@class-class2@@ {0,1}
@attribute file-ID string
@attribute text string

@data
1,0,1.txt,'file 1\n'
1,1,2.txt,'file 2\n'
0,1,3.txt,'file 3\n'
1,0,4.txt,'file 4\n'
0,1,5.txt,'file 5\n'
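
The mapping from directory layout to label columns can be illustrated with a small stand-alone sketch. This is not the MultiLabelTextDirectoryLoader code and it does not use MEKA or Weka at all; the class name LabelLayoutSketch and its methods are hypothetical, and filling in "?" (the ARFF missing-value marker) for a file that is absent under some label is an assumption of this sketch, not necessarily what MEKA does. It only walks a label/value/file hierarchy and prints one row of label values per file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, NOT part of MEKA: derives label rows from a label/value/file layout. */
public class LabelLayoutSketch {

    /** Lists the sub-directories (dirs == true) or regular files (dirs == false) of dir, sorted by path. */
    static List<Path> children(Path dir, boolean dirs) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(p -> dirs ? Files.isDirectory(p) : Files.isRegularFile(p))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /**
     * Returns file name -> label values, with one value per label directory
     * (in sorted order). Files are matched across labels by name only.
     */
    static Map<String, String[]> collectLabels(Path root) throws IOException {
        List<Path> labelDirs = children(root, true);
        final int numLabels = labelDirs.size();
        Map<String, String[]> rows = new TreeMap<>();
        for (int i = 0; i < numLabels; i++) {
            for (Path valueDir : children(labelDirs.get(i), true)) {
                String value = valueDir.getFileName().toString();
                for (Path file : children(valueDir, false)) {
                    String name = file.getFileName().toString();
                    String[] row = rows.computeIfAbsent(name, k -> {
                        String[] r = new String[numLabels];
                        Arrays.fill(r, "?"); // assumption: unknown label value
                        return r;
                    });
                    row[i] = value;
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java LabelLayoutSketch <dataset-dir>");
            return;
        }
        // Print "<label values>,<file name>" for every file found.
        for (Map.Entry<String, String[]> e : collectLabels(Paths.get(args[0])).entrySet()) {
            System.out.println(String.join(",", e.getValue()) + "," + e.getKey());
        }
    }
}
```

Run against the example layout above, this sketch yields the same label values and file IDs as the first three columns of the @data section; the text column is left out because reading the file contents adds nothing to the illustration.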

To make this user-friendly, I also added an Import text… menu item to the Explorer’s File menu.