Mihirkumar Joshi February 2016

Solr wild card search with space in the middle

Folks,

We want to do a Solr wildcard search with a space in the middle.

For example, if we search for "Please\ Help*", it should retrieve all documents that contain "Please Help", followed by documents that contain the individual words "Please" and "Help".

What we see is that a search for "Please\ Help*" only returns documents that contain "Please Help"; it does not return matches for the individual tokens "Please" and "Help".

Given below is the field definition we are using for indexing and search:

<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>

Answers


MatsLindh February 2016

When you're using a wildcard search, the regular analysis chain for the query is not invoked. This means that "Please help*" does not go through the ShingleFilter and the other filters in your query analyzer, and therefore doesn't give the hits you expect.

As mentioned in the comments to your question, use an EdgeNGramFilter in the indexing phase instead, and then just submit your query as "Please help" (without the wildcard). This will retrieve all documents where the field starts with "Please help", because the filter indexes every prefix of the token (such as "P", "Pl", "Ple", "Plea", "Pleas", "Please", "Please ", "Please H", etc.).

You'll have to adjust the sequence of the filters to match what you need.
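
A minimal sketch of what such a field type could look like. It assumes a KeywordTokenizer so that the prefixes span the space; the name "text_prefix" and the gram sizes are placeholders I picked for illustration, not something from your schema:

```xml
<!-- Hypothetical field type: name and gram sizes are assumptions -->
<fieldType name="text_prefix" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Keep the whole value as one token so prefixes can include spaces -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every prefix of the value: "p", "pl", ..., "please h", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the typed text is matched as one prefix -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type you would query for the phrase "Please help" without any wildcard. Values longer than maxGramSize will not match on prefixes beyond that length, so pick a size that fits your data.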

You can also use a KeywordTokenizer to get the complete input indexed as a single token (with a LowerCaseFilter if you want case-insensitive matching), and then match your wildcard search against that single token, since no other analysis needs to take place.
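
Roughly, that second option could be set up like this. Again, this is only a sketch, and the field type name "string_lower" is made up for the example:

```xml
<!-- Hypothetical field type: the name is an assumption -->
<fieldType name="string_lower" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- The entire field value becomes one token, spaces included -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase it so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You would then keep the escaped space and the trailing wildcard in the query, as in your original "Please\ Help*". If in doubt about how your Solr version treats the wildcard term, lowercase it yourself before sending it.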

Post Status

Asked in February 2016
Viewed 2,706 times
Voted 6
Answered 1 times
