
Description
Suggestions are returned in order of edit distance, shortest first. It would probably be better to request more than one suggestion and pick the one with the highest frequency.
Example: kvetsh --> kvetch.
Also experiment with n-gram vs. Jaro-Winkler distance algorithms.
Need more test data (Stanford or elsewhere).
Well, that explains a lot. I thought your code looked like it would return multiple suggestions, and I couldn't figure out why I was only getting one. This is one of those times when I am frustrated with Ruby's limited data structures.
Could we do something kludgy, like:
  if the Ruby response has multiple spell check suggestions, then
    make a second Solr request just for the spell check (no rows, no facets), returned as XML
    use existing gems or very simple parsing code to get what we want from the XML spell check response
OR
  modify the SpellCheckComponent's Ruby mapping so it returns the suggestions as an array? This could be in a jar to throw in the solr/lib directory.
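The first option might look roughly like the sketch below. It assumes the XML spell check section repeats a `<lst name="suggestion">` element for each suggestion (mirroring the repeated keys in the Ruby output); the sample document is made up to match that assumed layout, and `parse_spellcheck_xml` is a hypothetical helper, not part of rsolr.

```ruby
require 'rexml/document'

# Hypothetical sample of the spell check part of an XML response.
# The element layout is an assumption, not copied from a real Solr reply.
SAMPLE_XML = <<~XML
  <response>
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="histo">
          <int name="numFound">2</int>
          <lst name="suggestion">
            <int name="frequency">1</int>
            <str name="word">hist</str>
          </lst>
          <lst name="suggestion">
            <int name="frequency">11</int>
            <str name="word">history</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </response>
XML

# Returns { term => [ { 'word' => ..., 'frequency' => ... }, ... ] },
# keeping every suggestion in the order it appears in the XML.
def parse_spellcheck_xml(xml)
  doc = REXML::Document.new(xml)
  result = {}
  REXML::XPath.match(doc, "//lst[@name='suggestions']/lst").each do |term|
    result[term.attributes['name']] =
      REXML::XPath.match(term, "lst[@name='suggestion']").map do |s|
        {
          'word'      => s.elements["str[@name='word']"].text,
          'frequency' => s.elements["int[@name='frequency']"].text.to_i
        }
      end
  end
  result
end

suggestions = parse_spellcheck_xml(SAMPLE_XML)
```

Because XML elements with the same name are kept as separate nodes, both suggestions for 'histo' survive here, at the cost of a second request and a parsing step.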
- Naomi
On Aug 25, 2009, at 10:58 AM, Mitchell, Matthew (mwm4n) wrote:
Hi Naomi,
Funny, I was just working on rsolr-ext. There is one big problem: the spellcheck handler returns "bad" Ruby. To be more precise, there are hashes that have duplicate keys ("suggestion"), and when evaled in Ruby only one of them survives (Ruby keeps a single value per key, and the last one wins). This is a known issue and a fix is being worked on. I recently sent an email to the list about this. Hopefully it'll be updated soon! Here is the issue:
https://issues.apache.org/jira/browse/SOLR-1071
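The duplicate-key problem can be reproduced in plain Ruby without Solr. The sketch below evals a small hash literal in the same shape as the spell check output; the string is a shortened, hand-written stand-in for a real response. On current Ruby versions the hash ends up with a single 'suggestion' entry holding the last value.

```ruby
# A hash literal in the shape of the Ruby-format spell check output:
# the 'suggestion' key appears twice inside one hash.
raw = "{'numFound'=>2," \
      "'suggestion'=>{'frequency'=>1,'word'=>'hist'}," \
      "'suggestion'=>{'frequency'=>11,'word'=>'history'}}"

# Newer Rubies may warn about the duplicated key; silence that for the demo.
original_verbose = $VERBOSE
$VERBOSE = nil
parsed = eval(raw)
$VERBOSE = original_verbose

# Only one 'suggestion' entry is left after evaluation.
remaining_keys = parsed.keys                      # ['numFound', 'suggestion']
surviving_word = parsed['suggestion']['word']     # 'history'
```

However many suggestions Solr sends for a term, code that evals this format can only ever see one of them.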
We might have to wait and see what they come up with as a final response format. Kind of a bummer. The only other solution would be to throw an xml parser in rsolr to handle xml solr responses. But that'd be a lot of work and the performance would not be nearly as good as plain ruby.
Thoughts?
Matt
On 8/25/09 1:28 PM, "Naomi Dushay" <ndushay@stanford.edu> wrote:
Matt,
(I was going to enter all this in Jira and assign the ticket to you, THEN email you, but projectblacklight.org is down right now.)
When I analyzed our spelling suggestions, I learned the following:
the suggestions are given in the order of edit distance, rather than frequency of occurrence. So if you get only a single suggestion, you get the closest edit distance, which may not be a very good suggestion.
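To make the ordering concrete, here is a minimal sketch of plain Levenshtein edit distance applied to the 'histo' suggestions quoted later in this message. Which distance measure Solr's spellchecker actually uses depends on its configuration, so treat the numbers as illustrative; `levenshtein` is a hypothetical helper.

```ruby
# Classic dynamic-programming Levenshtein distance: the minimum number of
# single-character insertions, deletions, and substitutions turning a into b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1,            # deletion
               curr[j - 1] + 1,        # insertion
               prev[j - 1] + cost].min # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

distances = %w[hist history kista isti].map { |word| [word, levenshtein('histo', word)] }
# 'hist' is 1 edit away; 'history', 'kista', and 'isti' are 2 edits away.
```

Under this measure 'hist' (frequency 1) is closer to 'histo' than 'history' (frequency 11), which is why a single closest-distance suggestion can be a poor one.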
I was planning on making the code improvements for this ... but the code now lives in the rsolr gem. Does it make more sense for you to do the improvements? Or should I change the rsolr-ext gem code in blacklight and let you port the changes back to the home of that gem?
My plans were:
1. get 5 suggestions in solr response (for each term)
2. take the 2 suggestions with the highest frequency out of the 5 provided for each search term.
If there is a tie, take the one with the closer edit distance.
For step 2, I think the number of suggestions desired per term should be a parameter passed into the words() method, with a default of 1. This also implies that
a. the number of desired suggestions should be configurable in blacklight code (with a default of 1 or 2)
b. the request handler in solrconfig should have spellcheck.count set to 5 or something.
c. there is no need for collation to be turned on.
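As a rough illustration of points b and c, the settings could be written out as request parameters. The email proposes putting them in the request handler in solrconfig; passing them per request, as below, is an alternative shown only to make the values concrete. The parameter names are standard SpellCheckComponent parameters, while `spellcheck_query_string` is a hypothetical helper.

```ruby
require 'uri'

# Spell check parameters matching the plan above.
SPELLCHECK_PARAMS = {
  'spellcheck'                 => 'true',
  'spellcheck.count'           => 5,       # ask for 5 suggestions per term
  'spellcheck.extendedResults' => 'true',  # include a frequency with each suggestion
  'spellcheck.collate'         => 'false'  # collation is not needed
}.freeze

# Hypothetical helper: combine the user's query with the spell check
# parameters and encode everything as a URL query string.
def spellcheck_query_string(q, extra = {})
  params = { 'q' => q, 'wt' => 'ruby' }.merge(SPELLCHECK_PARAMS).merge(extra)
  URI.encode_www_form(params)
end

query_string = spellcheck_query_string('histo')
```

The number of suggestions the application keeps per term (1 or 2) stays a separate, application-side setting; `spellcheck.count` only controls how many candidates Solr returns to choose from.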
Here are some test cases:
1. single term should have multiple suggestions when params are set and there are multiple suggestions
2. suggestions should be most popular first
3. suggestion ties should be broken by edit distance
4. if there are no suggestions, the code should behave well.
5. multi term queries should have suggestions for each term, when they exist.
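A minimal sketch of the selection logic these cases describe follows. It assumes the suggestions for a term have already been recovered as an array of hashes in the order Solr returned them (edit-distance order), so the original position can serve as the tie-breaker. `best_suggestions` and `best_suggestions_by_term` are hypothetical helpers, not the rsolr-ext words() method.

```ruby
# suggestions: array of { 'word' => ..., 'frequency' => ... } hashes,
# in the order Solr returned them (closest edit distance first).
# Returns up to `count` words, most frequent first; ties keep Solr's order.
def best_suggestions(suggestions, count = 1)
  return [] if suggestions.nil? || suggestions.empty?

  suggestions.each_with_index
             .sort_by { |s, i| [-s['frequency'], i] }
             .first(count)
             .map { |s, _i| s['word'] }
end

# Multi-term queries: { term => [suggestions] } in, { term => [words] } out.
def best_suggestions_by_term(by_term, count = 1)
  by_term.each_with_object({}) do |(term, list), out|
    out[term] = best_suggestions(list, count)
  end
end

# Frequencies taken from the demo-index examples in this message.
histo = [
  { 'word' => 'hist',    'frequency' => 1 },
  { 'word' => 'history', 'frequency' => 11 },
  { 'word' => 'kista',   'frequency' => 1 },
  { 'word' => 'isti',    'frequency' => 1 }
]
chin = [
  { 'word' => 'china', 'frequency' => 4 },
  { 'word' => 'chen',  'frequency' => 1 },
  { 'word' => 'chos',  'frequency' => 2 },
  { 'word' => 'khan',  'frequency' => 2 },
  { 'word' => 'lhan',  'frequency' => 1 }
]

top_histo = best_suggestions(histo, 2)  # ['history', 'hist']
top_chin  = best_suggestions(chin, 2)   # ['china', 'chos']
per_term  = best_suggestions_by_term({ 'histo' => histo, 'chin' => chin, 'zzz' => [] })
```

With the default count of 1, 'histo' yields 'history' rather than the closer but rarer 'hist', and a term with no suggestions yields an empty array instead of an error.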
For our 30 record demo index, here are some good examples (if you're not using mock data):
'histo',{'numFound'=>4,'startOffset'=>4,'endOffset'=>9,'origFreq'=>0,
'suggestion'=>{'frequency'=>1,'word'=>'hist'},
'suggestion'=>{'frequency'=>11,'word'=>'history'},
'suggestion'=>{'frequency'=>1,'word'=>'kista'},
'suggestion'=>{'frequency'=>1,'word'=>'isti'}},
'politica',{'numFound'=>3,'startOffset'=>0,'endOffset'=>8,'origFreq'=>0,
'suggestion'=>{'frequency'=>3,'word'=>'politics'},
'suggestion'=>{'frequency'=>3,'word'=>'political'},
'suggestion'=>{'frequency'=>1,'word'=>'policy'}},
'chin',{'numFound'=>5,'startOffset'=>0,'endOffset'=>4,'origFreq'=>0,
'suggestion'=>{'frequency'=>4,'word'=>'china'},
'suggestion'=>{'frequency'=>1,'word'=>'chen'},
'suggestion'=>{'frequency'=>2,'word'=>'chos'},
'suggestion'=>{'frequency'=>2,'word'=>'khan'},
'suggestion'=>{'frequency'=>1,'word'=>'lhan'}},
- Naomi