Stop words and "protected phrases" in Solr
A customer of mine is a photo agency specialized in photojournalism (well, gossip), so many of their customers' searches revolve around specific people.
We index about 1.5M documents, with full-text search on headline and caption, and full-text search without stemming on tags. We have a decent list of stop words, and we provide a list of protected words that we feel are not stemmed correctly. We're using a dismax search on headline, caption and tags (with different boosts), and it's working pretty nicely.
However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query `al gore` (without quotes) becomes:
    +((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
That will return hits for the ex-VP, but of course also for "Lesley Gore" and "Tipper Gore"; and, because of stemming, hits for "Gori" and more. Leaving aside sorting for a second, these clutter the results, and I'd like to do better.
Wrapping the search terms in quotes doesn't help, since "al" gets stripped away anyway. Marking "gore" as a protected word gets me halfway there, limiting the number of false positives. I tried playing with SynonymFilterFactory too, but didn't get far: I have SynonymFilterFactory as the first filter, yet "al" gets removed anyway.
What I think I need is a way of tokenizing "al gore" as a single token. Is there something that would allow me to do that, with a set of configurable "phrases"? Is there another approach I'm overlooking? solr.CommonGramsFilterFactory, perhaps?
Some more background info: I'm using Solr 1.4.0. Here are the relevant portions of schema.xml:
    <!-- used for headline and caption -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="tagstext" class="solr.TextField" sortMissingLast="true" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
Have you looked at CommonGramsFilterFactory? It:
- combines multiple tokens into a single token
- is usually used when searching for a phrase that contains stop words
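As a rough illustration, here is a minimal sketch of how that could be wired into schema.xml. The field type name `text_grams` is hypothetical, and reusing `stopwords.it.txt` as the common-words list is an assumption. The sketch deliberately drops StopFilterFactory (the grams filter needs to see "al" in order to pair it with its neighbour) and leaves out the word delimiter and stemming filters, since filters that split or rewrite tokens after the grams filter may break up the combined token. Treat it as a starting point to test, not a drop-in replacement for the existing field types.

```xml
<!-- Hypothetical field type: name and filter order are assumptions, not taken from the question -->
<fieldType name="text_grams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index side: keeps the single words and adds a combined token
         for each common word and its neighbour, e.g. al, al_gore, gore -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query side: where a combined token exists, the single words it covers are dropped -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Note that the combined token is only produced when both words reach the analyzer together, so this is likely to help for the quoted phrase `"al gore"` (or a dismax phrase boost) rather than for the unquoted query, which the query parser splits on whitespace before analysis.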