regex - PHP sentence boundary detection
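I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives about 95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but I didn't see anything that fits PHP. Do you know of such a tool?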
An enhanced regex solution
Assuming you do care about handling abbreviations such as Mr., Mrs., etc., the following single regex solution works pretty well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                  # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script. Because the pattern does not split on trailing whitespace, the last sentence keeps the space at the end of the input:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project! ]
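Since the abbreviation list is the part most likely to change, one possible way to keep it maintainable is to generate the negative lookbehind from an array. The sketch below is not part of the original answer: the function name build_sentence_splitter and the sample abbreviation list are hypothetical. It also adds a \b word boundary before each abbreviation, which is an assumption on my part, so that a case-insensitive "Ms." does not also suppress the split after a word such as "items.".

```php
<?php
// Hypothetical helper (not from the original answer): builds the split
// pattern from a caller-supplied list of abbreviations.
function build_sentence_splitter(array $abbreviations): string
{
    $alternatives = [];
    foreach ($abbreviations as $abbr) {
        // preg_quote() escapes the dots; '%' is the pattern delimiter.
        // The leading \b is an addition: it keeps "Ms." from matching
        // the tail of an ordinary word such as "items.".
        $alternatives[] = '\b' . preg_quote($abbr, '%');
    }

    // Each alternative has a fixed length, which PCRE requires
    // inside a lookbehind.
    $negative = $alternatives
        ? '(?<!' . implode('|', $alternatives) . ')'
        : '';

    return '%(?<=[.!?]|[.!?][\'"])' . $negative . '\s+(?=\S)%i';
}

$pattern = build_sentence_splitter(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'T.V.A.']);

$sentences = preg_split(
    $pattern,
    'Dr. Jones collects rare items. The T.V.A. is a big project!',
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($sentences);
```

With this input the text should split into two sentences: one ending in "items." and one starting with "The T.V.A.".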
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
Note that both solutions correctly identify sentences that end with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
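To make the difference between the two short patterns concrete, here is a small comparison. The sample text and the variable names are made up for illustration; only the two patterns come from the answer.

```php
<?php
// Illustrative input: the first sentence ends with a closing quote
// after its terminating punctuation.
$text = 'She shouted "Stop!" Then he left. Nobody spoke.';

$quote_aware = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$simplified  = '/(?<=[.!?])\s+(?=\S)/';

$with_quotes    = preg_split($quote_aware, $text, -1, PREG_SPLIT_NO_EMPTY);
$without_quotes = preg_split($simplified, $text, -1, PREG_SPLIT_NO_EMPTY);

// The quote-aware pattern splits after the closing quote.
print_r($with_quotes);

// The simplified pattern sees '"' rather than punctuation before the
// whitespace, so the first two sentences stay joined.
print_r($without_quotes);
```

The quote-aware pattern yields three sentences here, while the simplified one yields two, because it only splits where the whitespace directly follows ., ! or ?.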
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to the regex and test string. (To answer papyref's comment question.)
Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regexes to prevent splitting text on trailing whitespace.