Previously, in Part 5 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.
If you read Part 4 of this series, you’ll see that I talk about upgrading to Solr 1.4.1. Well, for me, that wasn’t enough. Allow me to explain in a story I like to call: 2 JARs and a WAR. 🙂
You’ll note that CF supports two request handlers: standard and dismax. In the CFSearch tag, you specify “type”. Typically, all you ever need is ‘standard’, which supports wildcards and all the good stuff Solr comes with out of the box. But there is ONE BIG caveat: boosting is busted.
What is boosting? Let’s say you are indexing three fields: author, summary, and keywords. You definitely want matches against keywords to score higher, that is, to carry more weight than matches against author, for example. According to the Solr book (the only one on 1.4!), you need to specify the weights during initial indexing, in a feature called index-time boosts. Solr seems to be sly about not telling you this outright:
“At index-time, you have the option to boost a particular document (entirely or just a field). This is internally stored as part of the norms number, which must be enabled for this to work. It’s uncommon to perform index-time boosting.
At query-time, we have described earlier how to boost a particular clause of a query higher or lower if needed. Later the powerful Disjunction-Max (dismax for short) query will be demonstrated, which can apply searches to multiple fields with different boosting levels automatically.”
Basically, for some odd reason, you cannot boost upon search. In fact, one of the features of dismax is “Searches across multiple fields with different boosts through Lucene’s DisjunctionMaxQuery.” What’s odd is that dismax is somewhat deprecated and everything seems to reference ‘standard.’
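To make the per-field weights concrete, here is a minimal sketch of how a field-to-weight map turns into the `field^weight` string that dismax reads from its `qf` parameter. Python is used purely for illustration; it is not part of the CF setup. The helper name `build_qf` and the weights are made up.

```python
def build_qf(boosts):
    """Render a {field: weight} mapping as a dismax-style qf string."""
    # Each entry becomes "field^weight"; entries are separated by spaces.
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


# Hypothetical weights for the three fields from the example above:
# keyword matches count most, author matches least.
example_boosts = {"keywords": 10.0, "summary": 2.0, "author": 1.0}
print(build_qf(example_boosts))  # keywords^10.0 summary^2.0 author^1.0
```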
You would think at this point: no big deal, right? I’ll just switch from standard to dismax and that’s it. Wrong. What the docs also don’t clearly spell out is that with dismax, wildcards are NOT supported. That’s correct. NOT!
So you have some features in standard that are critical for free-text searching, but you also need proper scoring and boosting, which is available in dismax. To make matters worse, if you search for “polymer*”, for example, the trailing wildcard in dismax throws a CF error.
And if you use the Solr Admin interface, it doesn’t throw an error, so it’s almost impossible to figure out that the wildcard doesn’t work in dismax, because Solr by itself doesn’t complain!
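One way to see the difference outside of CF is to send the same wildcard query to Solr directly, once through the default Lucene parser and once through dismax, and compare the hit counts. The sketch below (Python, for illustration only) just builds the two URLs. The host, port, path, and field names are assumptions, so adjust them for your install; `defType` is the Solr request parameter that selects the query parser.

```python
from urllib.parse import urlencode

# Assumed location of the Solr select endpoint; change to match your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(query, parser, qf=None):
    """Build a select URL that runs `query` through the given query parser."""
    params = {"q": query, "defType": parser, "rows": 0}
    if qf:
        # Per-field boosts, only meaningful for dismax-style parsers.
        params["qf"] = qf
    return SOLR_SELECT + "?" + urlencode(params)


# Same trailing-wildcard query, two parsers. Open each URL in a browser
# and compare numFound in the two responses.
print(select_url("polymer*", "lucene"))
print(select_url("polymer*", "dismax", qf="keywords^10.0 summary^2.0 author^1.0"))
```

Setting `rows` to 0 keeps the responses small, since only the hit count matters for this comparison.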
Needless to say, at this point, things looked pretty hopeless. I felt like Solr wasn’t going to make it. I had a simple use case, and Solr couldn’t handle it without some big compromises.
So I decided to take a leap. Heck, I had already upgraded from Solr 1.4 to 1.4.1, why not something more recent? I knew from my digging around that Solr 1.5 is currently in development, and so are Solr 3.0 and Solr 4.0! More on that later.
I decided to upgrade to Solr 1.5 Dev. Why? Well, in reading the change log, something caught my eye. There was a new request handler submitted by Lucid (more on that later as well!). It was called “edismax”, and it had everything standard has and everything dismax has put together, along with a ton of improvements. Could this be? Could this work!?
So I decided to give it a try. Here are some steps.
- Stop Solr Service
- Build Solr 1.5 Dev from Maven (thanks to Joseph Lamoree for help there)
- Like in Part 4, copy the WAR over
- Stop CF Instance / Service
- Since this is a major version upgrade, you also need to copy over the Solr Core and SolrJ JARs to <instance>\cfusion-ear\cfusion-war\WEB-INF\cfusion\lib, replacing apache-solr-core.jar and apache-solr-solrj.jar respectively
Now start up CF and Solr. You are running the Solr 1.5 Dev WAR, and CF is using the newer JAR files to communicate with Solr 1.5.
Now, in your SolrConfig.xml file (see earlier notes), add a new request handler. I am pasting a bare-bones version below.
<requestHandler name="edismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">custom1^0.1 custom2^8.0 custom3^2.0 custom4^10.0 key^1.5 uid^1.5 title^8.0 contents^2.0</str>
    <str name="pf">custom1^0.1 custom2^8.0 custom3^2.0 custom4^10.0 key^1.5 uid^1.5 title^8.0 contents^2.0</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <!-- example highlighter config, enable per-query with hl=true
    <str name="hl.fl">summary title</str> -->
    <!-- omp = Only More Popular -->
    <str name="spellcheck.onlyMorePopular">false</str>
    <!-- exr = Extended Results -->
    <str name="spellcheck.extendedResults">false</str>
    <!-- The number of suggestions to return -->
  </lst>
</requestHandler>
You’ll note a couple of things. First, defType is set to edismax and it is also using a new class. Second, qf and pf (you can read in the Solr docs what they mean) are where I specify which fields get boosted at search time. And VERY importantly, all spellcheck options have to be “false”. I learned this the hard way. You can still do spell checking and get suggestions back from CFSearch, but the defaults in SolrConfig.xml must be false for wildcards to work.
Next, inside CFSearch, set type to edismax. This may be a side note, but I was still getting errors, so I wrapped the criteria I was passing to CFSearch in URLDecode(criteria), and voila!
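Why the decode helped is not verified here. A plausible explanation is that the criteria string was arriving percent-encoded, so the wildcard reached Solr as `%2A` rather than `*`. The Python sketch below illustrates that effect with the standard library’s `unquote`, which plays roughly the role that CF’s `URLDecode` does; the sample strings are invented.

```python
from urllib.parse import unquote

# Invented examples of criteria as they might arrive from a URL or form post.
raw_criteria = ["polymer%2A", "%2Amer", "carbon%20fiber"]

for raw in raw_criteria:
    decoded = unquote(raw)  # turns %2A back into *, %20 into a space
    print(f"{raw!r} -> {decoded!r}")
```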
Now I was running Solr 1.5 Dev 64-bit, CF was communicating with no errors, I was getting all the right scores, I could do trailing and prefixed wildcards, and I no longer needed N-gramming, so my collection size shrank. Woo-hoo!
To be continued…