Key: CODEBASE-124
Type: Improvement
Status: Open
Priority: Major
Assignee: Matt Mitchell
Reporter: Naomi Dushay
Votes: 0
Watchers: 0

Blacklight Plugin

better suggestions for did-you-mean

Created: 18/May/09 01:51 PM   Updated: 08/Jan/10 02:15 PM
Component/s: None
Affects Version/s: 2.4
Fix Version/s: 2.5


Description
Suggestions are currently returned in order of shortest edit distance first. It would probably be better to request more than one suggestion and pick the one with the highest frequency.

kvetsh --> kvetch.

Also experiment with n-gram vs. Jaro-Winkler distance algorithms.

Need more test data (Stanford or elsewhere)

Naomi Dushay added a comment - 25/Aug/09 04:20 PM
Matt,

Well, that explains a lot. I thought your code looked like it would return multiple suggestions, and I couldn't figure out why I was only getting one. This is one of those times when I am frustrated with Ruby's limited data structures.

Could we do something kludgy, like:

if ruby response has multiple spell check suggestions
 then
   make a second solr request just for the spell check (no rows, no facets), returned as xml
   use existing gems or very simple parsing code to get what we want from the xml spell check response

OR

modify the spellcheckcomponent's ruby mapping so it returns the suggestions as an array? This could be in a jar to throw in the solr/lib directory.

- Naomi


On Aug 25, 2009, at 10:58 AM, Mitchell, Matthew (mwm4n) wrote:

Hi Naomi,

Funny, I was just working on rsolr-ext. There is one big problem: the spellcheck handler returns "bad" Ruby. To be more precise, there are hashes that have duplicate keys ("suggestion"), and when evaled in Ruby only one of them survives (the last one, since a later duplicate key overwrites the earlier ones). This is a known issue and a fix is being worked on. I recently sent email to the list about this. Hopefully it'll be updated soon! Here is the issue:

https://issues.apache.org/jira/browse/SOLR-1071

We might have to wait and see what they come up with as a final response format. Kind of a bummer. The only other solution would be to throw an xml parser in rsolr to handle xml solr responses. But that'd be a lot of work and the performance would not be nearly as good as plain ruby.

Thoughts?

Matt

On 8/25/09 1:28 PM, "Naomi Dushay" <ndushay@stanford.edu> wrote:

Matt,

(I was going to enter all this in Jira and assign the ticket to you, THEN email you, but projectblacklight.org is down right now.)

When I analyzed our spelling suggestions, I learned the following:

the suggestions are given in the order of edit distance, rather than frequency of occurrence. So if you get only a single suggestion, you get the closest edit distance, which may not be a very good suggestion.

I was planning on making the code improvements for this ... but the code now lives in the rsolr gem. Does it make more sense for you to do the improvements? Or should I change the rsolr-ext gem code in blacklight and let you port the changes back to the home of that gem?

My plans were:
1. get 5 suggestions in solr response (for each term)
2. take the 2 suggestions with the highest frequency out of the 5 provided for each search term.
If there is a tie, take the one with the closer edit distance.

for 2. I think the number of suggestions desired per term should be a parameter passed into the words() method, with a default of 1. This also implies that
a. the number of desired suggestions should be configurable in blacklight code (with a default of 1 or 2)
b. the request handler in solrconfig should have spellcheck.count set to 5 or something.
c. there is no need for collation to be turned on.
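
A minimal sketch of the selection step in the plan above, assuming the suggestions for one term are already available as an array of hashes in the order Solr returned them (closest edit distance first). The method name best_suggestions is hypothetical and is not part of rsolr-ext; the tie-break simply reuses the position in Solr's ordering as a stand-in for edit distance.

```ruby
# Hypothetical helper, not part of rsolr-ext or Blacklight.
# `suggestions` is assumed to be an Array of
#   { 'word' => String, 'frequency' => Integer }
# in the order Solr returned them (closest edit distance first).
# `limit` is the number of suggestions wanted per term (default 1).
def best_suggestions(suggestions, limit = 1)
  return [] if suggestions.nil? || suggestions.empty?

  suggestions.each_with_index
             # Highest frequency first; ties keep Solr's edit-distance order.
             .sort_by { |s, position| [-s['frequency'], position] }
             .first(limit)
             .map { |s, _position| s['word'] }
end

# Data taken from the 'chin' example in this ticket, reshaped as an array.
chin = [
  { 'frequency' => 4, 'word' => 'china' },
  { 'frequency' => 1, 'word' => 'chen' },
  { 'frequency' => 2, 'word' => 'chos' },
  { 'frequency' => 2, 'word' => 'khan' },
  { 'frequency' => 1, 'word' => 'lhan' }
]

top_two = best_suggestions(chin, 2) # => ["china", "chos"]
```

With the default limit of 1, a term with no suggestions yields an empty array rather than raising, which matches the "behave well" requirement.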

Here are some test cases:

1. single term should have multiple suggestions when params are set and there are multiple suggestions
2. suggestions should be most popular first
3. suggestion ties should be broken by edit distance
4. if there are no suggestions, the code should behave well.
5. multi term queries should have suggestions for each term, when they exist.

For our 30 record demo index, here are some good examples (if you're not using mock data):

'histo',{'numFound'=>4,'startOffset'=>4,'endOffset'=>9,'origFreq'=>0,
'suggestion'=>{'frequency'=>1,'word'=>'hist'},
'suggestion'=>{'frequency'=>11,'word'=>'history'},
'suggestion'=>{'frequency'=>1,'word'=>'kista'},
'suggestion'=>{'frequency'=>1,'word'=>'isti'}},

'politica',{'numFound'=>3,'startOffset'=>0,'endOffset'=>8,'origFreq'=>0,
'suggestion'=>{'frequency'=>3,'word'=>'politics'},
'suggestion'=>{'frequency'=>3,'word'=>'political'},
'suggestion'=>{'frequency'=>1,'word'=>'policy'}},

'chin',{'numFound'=>5,'startOffset'=>0,'endOffset'=>4,'origFreq'=>0,
'suggestion'=>{'frequency'=>4,'word'=>'china'},
'suggestion'=>{'frequency'=>1,'word'=>'chen'},
'suggestion'=>{'frequency'=>2,'word'=>'chos'},
'suggestion'=>{'frequency'=>2,'word'=>'khan'},
'suggestion'=>{'frequency'=>1,'word'=>'lhan'}},
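
The examples above repeat the 'suggestion' key inside one hash, which is why only a single suggestion comes back once the response is evaled. A small sketch of that effect, using an abbreviated copy of the 'histo' data (the variable names are only for illustration):

```ruby
# Abbreviated from the 'histo' example: two entries under the same key.
response_text = "{'suggestion'=>{'frequency'=>1,'word'=>'hist'}," \
                "'suggestion'=>{'frequency'=>11,'word'=>'history'}}"

# Ruby may print a duplicated-key warning here, depending on the version.
parsed = eval(response_text)

# Only one 'suggestion' entry is left; the later key overwrote the earlier one.
surviving_count = parsed.size
surviving_word  = parsed['suggestion']['word']
```

If the suggestions were returned as an array of hashes instead, nothing would be lost on eval, which is the shape the array-mapping idea in this thread is aiming for.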


- Naomi


Bess Sadler added a comment - 11/Sep/09 01:05 PM
This is a problem with the way Solr is returning suggestions, and they are going to fix it. Once it's fixed in Solr, it should be easy for us to fix it.

Matt Mitchell added a comment - 08/Jan/10 02:15 PM
Someone (Matt?) submit a bug-report/patch to the Solr guys?