python - How to build an IMS Open Corpus Workbench (CWB) corpus and an NLTK-readable corpus?


I currently have a bunch of .txt files. Within each .txt file, each sentence is separated by a newline. How do I convert them to the IMS CWB format so that they are readable by CWB? And how do I convert them to an NLTK-readable format?

Can someone point me to a how-to page for that, or is there a guide for it? I've tried reading through the manual, but I don't really understand it: www.cwb.sourceforge.net/files/cwb_encoding_tutorial.pdf

Does it mean that I create a data directory and a registry directory, then run the cwb-encode command, and everything gets converted to a vrt file? Does it convert one file at a time? How do I script it to run through multiple files in a directory?

It's easy to produce CWB's "verticalized" format from an NLTK-readable corpus:

    from nltk.corpus import brown

    # Write one token per line and wrap each sentence in <s>...</s> tags.
    with open('corpus.vrt', 'w', encoding='utf-8') as out:
        for sentence in brown.sents():
            out.write('<s>\n')
            for word in sentence:
                out.write(word + '\n')
            out.write('</s>\n')
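
The same idea carries over to a directory of plain-text files like the ones in the question. The following is only a sketch under stated assumptions: every file ends in .txt, holds one sentence per line, and can be tokenized by splitting on whitespace. The function name `txt_dir_to_vrt` and its parameters are made up for illustration and are not part of NLTK or CWB.

```python
from pathlib import Path


def txt_dir_to_vrt(txt_dir, vrt_path):
    """Write every .txt file in txt_dir into one verticalized (.vrt) file.

    Assumes one sentence per line and whitespace-separated tokens.
    Returns the number of sentences written.
    """
    n_sentences = 0
    with open(vrt_path, "w", encoding="utf-8") as out:
        # Sort the file names so the output order is reproducible.
        for txt_file in sorted(Path(txt_dir).glob("*.txt")):
            with open(txt_file, encoding="utf-8") as src:
                for line in src:
                    tokens = line.split()
                    if not tokens:
                        continue  # skip blank lines
                    out.write("<s>\n")
                    out.write("\n".join(tokens) + "\n")
                    out.write("</s>\n")
                    n_sentences += 1
    return n_sentences
```

Calling `txt_dir_to_vrt("my_texts", "corpus.vrt")` (both paths are placeholders) would give a single .vrt file covering all input files. Tokens are written as-is, so a real tokenizer can replace `line.split()` if punctuation should become separate tokens.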

From there, you can follow the instructions on the CWB website.
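
As a rough illustration of what those instructions amount to, the sketch below builds the `cwb-encode` and `cwb-makeall` command lines from Python. The flags follow my reading of the CWB encoding tutorial and should be checked against `cwb-encode -h` for the installed version. The helper names and all paths are hypothetical, and the data and registry directories are assumed to exist already.

```python
import os
import subprocess


def build_cwb_commands(vrt_path, data_dir, registry_dir, corpus_name):
    """Return the cwb-encode and cwb-makeall argument lists for one corpus."""
    # The registry file is named after the corpus, in lowercase.
    registry_file = os.path.join(registry_dir, corpus_name.lower())
    encode = [
        "cwb-encode",
        "-d", data_dir,       # directory for the binary corpus data
        "-f", vrt_path,       # verticalized input file
        "-R", registry_file,  # registry entry to create
        "-c", "utf8",         # character encoding of the input
        "-S", "s",            # declare <s> as a structural attribute
    ]
    makeall = [
        "cwb-makeall",
        "-r", registry_dir,          # registry directory
        "-V", corpus_name.upper(),   # validate; corpus ID in uppercase
    ]
    return encode, makeall


def run_cwb(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_cwb(build_cwb_commands("corpus.vrt", "/corpora/data/mycorpus", "/corpora/registry", "mycorpus"))` would encode the file and then build the indexes, provided the CWB tools are on the PATH.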

