python - How to build an IMS Open Corpus Workbench (CWB) and NLTK readable corpus?
I currently have a bunch of .txt files. Within each .txt file, each sentence is separated by a newline. How do I convert them to the IMS CWB format so that they are readable by CWB? And also to an NLTK-readable format?
Can someone point me to a how-to page for that? Or is there a guide for it? I've tried reading through the manual but I don't really understand it. www.cwb.sourceforge.net/files/cwb_encoding_tutorial.pdf
Does it mean I create a data directory and a registry directory, then run the cwb-encode command, and it will all be converted to a vrt file? Does it convert one file at a time? How do I script it to run through multiple files in a directory?
It's easy to produce CWB's "verticalized" format from an NLTK-readable corpus:
    from nltk.corpus import brown

    # Write one token per line, wrapping each sentence in <s> ... </s>
    with open('corpus.vrt', 'w', encoding='utf-8') as out:
        for sentence in brown.sents():
            out.write('<s>\n')
            for word in sentence:
                out.write(word + '\n')
            out.write('</s>\n')
From there, you can follow the instructions on the CWB website.
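For the original setup (a directory of .txt files with one sentence per line), the same idea can be scripted without going through NLTK at all. The following is only a rough sketch under some assumptions that are not in the question: the function name `txt_dir_to_vrt` is made up for illustration, tokens are obtained by plain whitespace splitting, the files are UTF-8, and each file is wrapped in a `<text id="...">` region named after the file. Any structural attributes written here (`text`, `s`) would still have to be declared when encoding, as described in the CWB encoding tutorial.

```python
from pathlib import Path


def txt_dir_to_vrt(txt_dir, vrt_path):
    """Merge every .txt file in txt_dir (one sentence per line) into one VRT file.

    Hypothetical helper, not part of CWB or NLTK. Returns the number of
    sentences written.
    """
    n_sentences = 0
    with open(vrt_path, "w", encoding="utf-8") as out:
        # sorted() keeps the file order stable between runs
        for txt_file in sorted(Path(txt_dir).glob("*.txt")):
            # Assumption: one <text> region per input file, id = file name stem
            out.write(f'<text id="{txt_file.stem}">\n')
            with open(txt_file, encoding="utf-8") as src:
                for line in src:
                    tokens = line.split()  # assumption: whitespace tokenization
                    if not tokens:
                        continue  # skip blank lines
                    out.write("<s>\n")
                    out.write("\n".join(tokens) + "\n")  # one token per line
                    out.write("</s>\n")
                    n_sentences += 1
            out.write("</text>\n")
    return n_sentences
```

Calling `txt_dir_to_vrt("my_texts", "corpus.vrt")` would produce a single file to feed to cwb-encode, so there is no need to encode one file at a time. For the NLTK side, the same directory can probably be read directly with `nltk.corpus.reader.PlaintextCorpusReader`, passing `nltk.tokenize.LineTokenizer()` as `sent_tokenizer` so that each line is treated as a sentence.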