regex - PHP sentence boundaries detection

June 15, 2012

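I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but didn't see anything that fits PHP. Do you know of such a tool?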

An enhanced regex solution

Assuming you do care about handling abbreviations such as Mr., Mrs., etc., the following single-regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                  # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "Four". Sentence "Five"! Sentence "'.
        'Six"? Sentence "Seven." Sentence \'Eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "Four". Sentence "Five"! Sentence "Six"? Sentence "Seven." Sentence 'Eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "Four".]
Sentence[5] = [Sentence "Five"!]
Sentence[6] = [Sentence "Six"?]
Sentence[7] = [Sentence "Seven."]
Sentence[8] = [Sentence 'Eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project! ]

The last element keeps the trailing whitespace from $text, because the final lookahead prevents a split at the end of the string; apply trim() if that matters.
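
Since the abbreviation list is the part most likely to change, here is a minimal sketch of building the same kind of pattern from a plain array. The helper name build_sentence_splitter() is hypothetical (it is not part of the original answer), and the sketch assumes PHP 7 or later for the type declarations. Each abbreviation is passed through preg_quote() so its dots are matched literally:

```php
<?php
// Hypothetical helper, not from the original answer: builds a
// sentence-splitting regex from a list of abbreviations.
function build_sentence_splitter(array $abbreviations): string
{
    $escaped = [];
    foreach ($abbreviations as $abbr) {
        // Escape regex metacharacters (and the % delimiter) in each entry.
        $escaped[] = preg_quote($abbr, '%');
    }
    // An empty negative lookbehind would never allow a split, so omit it
    // entirely when no abbreviations are given.
    $not_after = $escaped ? '(?<!' . implode('|', $escaped) . ')' : '';

    return '%(?<=[.!?]|[.!?][\'"])' // after end punctuation, optionally quoted
         . $not_after               // but not right after an abbreviation
         . '\s+(?=\S)%i';           // whitespace followed by more text
}

$re = build_sentence_splitter(['Mr.', 'Mrs.', 'Dr.', 'T.V.A.']);
$parts = preg_split($re, 'Dr. Jones met Mrs. Smith. They talked.', -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);
```

With this input the text should split only after "Smith.", since "Dr." and "Mrs." are both in the list.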

The essential regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
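
As a quick illustration of what the essential version gives up, here is a small sketch (the sample sentence is made up for this example). Without the negative lookbehind there is no abbreviation handling, so "Dr." is treated as the end of a sentence:

```php
<?php
// Essential pattern: no negative lookbehind, so no abbreviation handling.
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, 'Dr. Jones arrived. He sat down.', -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences); // three pieces: 'Dr.', 'Jones arrived.', 'He sat down.'
```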

Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
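
To make the difference concrete, here is a short sketch comparing the two variants on a made-up sample string. The variable names are only for illustration:

```php
<?php
$text = 'He said "Stop!" Then he left. Done.';

// Variant that also accepts a closing quote after the punctuation.
$with_quotes = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
// Simplified variant: only bare end-of-sentence punctuation.
$without_quotes = '/(?<=[.!?])\s+(?=\S)/';

$a = preg_split($with_quotes, $text, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($without_quotes, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // three pieces: 'He said "Stop!"', 'Then he left.', 'Done.'
print_r($b); // two pieces: 'He said "Stop!" Then he left.', 'Done.'
```

The simplified pattern misses the boundary after the quoted exclamation because the whitespace there is preceded by a quote character rather than by the punctuation itself.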

Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to the regex and test string (to answer papyref's comment question).

Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regexes to prevent splitting text on trailing whitespace.
