Web Software Architecture and Engineering – Life on the Bleeding Edge

Archive for September, 2010

Lessons Learned: Moving from Verity to Solr (Part 6)

Previously, in Part 5 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.
If you note Part 4 in this series, you’ll see that I speak about upgrading to Solr 1.4.1. Well, for me, that wasn’t enough. Allow me to explain in a story I like to call: 2 JARs and a WAR. :)
You’ll note that CF supports two request handlers – standard and dismax. In the CFSearch tag, you specify “type”. Typically, all you ever need is ‘standard’, which supports wildcards and all the good stuff Solr comes with out of the box. But there is ONE BIG caveat: boosting is busted.
What is boosting? Let’s say you are indexing three fields: author, summary, and keywords. You definitely want matches against keywords to give a higher score – basically, for its weight to be greater than author’s, for example. According to the Solr book (the only one on 1.4!), you need to specify what the weights are during initial indexing – in a feature called index-time boosts. Solr seems to be sly about not telling you this outright:
“At index-time, you have the option to boost a particular document (entirely or just a field). This is internally stored as part of the norms number, which must be enabled for this to work. It’s uncommon to perform index-time boosting.
At query-time, we have described earlier how to boost a particular clause of a query higher or lower if needed. Later the powerful Disjunction-Max (dismax for short) query will be demonstrated, which can apply searches to multiple fields with different boosting levels automatically.”
Basically, for some odd reason, you cannot boost at search time with the standard handler. In fact, one of the features of dismax is “Searches across multiple fields with different boosts through Lucene’s DisjunctionMaxQuery.” What’s odd is that dismax is somewhat deprecated and everything seems to reference ‘standard.’
You would think at this point, no big deal, right? I’ll just switch from standard to dismax and that’s it. Wrong. What the docs also don’t clearly spell out is that with dismax, wildcards are NOT supported. That’s correct. NOT!
So you have some features in standard that are critical for free-text searching, while proper scoring and boosting is only available in dismax. To make matters worse, if you search with “polymer*”, for example, the trailing wildcard in dismax throws a CF error:
“String index out of range: -1 (java.lang.StringIndexOutOfBoundsException)
at java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:797)
at java.lang.StringBuilder.replace(StringBuilder.java:271)
at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:458)
at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:142)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
…”
And the Solr Admin interface doesn’t throw an error either, so it’s almost impossible to figure out that wildcards don’t work in dismax, because Solr by itself stays silent!
Needless to say, at this point, things looked pretty hopeless. I felt like Solr wasn’t going to make it. I had a simple use case, and Solr couldn’t handle it without some big compromises.
So I decided to take a leap. Heck, I had already upgraded from Solr 1.4 to 1.4.1, why not something more recent? I knew from my digging around that Solr 1.5 is currently in development, as are Solr 3.0 and Solr 4.0! More on that later.
I decided to upgrade to Solr 1.5 Dev. Why? Well, in reading the change log, something caught my eye. There was a new request handler submitted by Lucid (more on that later as well!). It was called “edismax”, and it combined everything standard has with everything dismax has, along with a ton of improvements. Could this be? Could this work!?
So I decided to give it a try. Here are some steps.

  1. Stop Solr Service
  2. Build Solr 1.5 Dev from Maven (thanks to Joseph Lamoree for help there)
  3. As in Part 4, copy the WAR over
  4. Stop CF Instance / Service.
  5. Since this is a major version upgrade, you also need to copy over the Solr Core and SolrJ JARs to <instance>\cfusion-ear\cfusion-war\WEB-INF\cfusion\lib, replacing apache-solr-core.jar and apache-solr-solrj.jar respectively.

Now start up CF and Solr. You are running the Solr 1.5 Dev WAR, and CF is using the newer JAR files to communicate with Solr 1.5.
Now, in your solrconfig.xml file (see earlier notes), add a new requestHandler. I am pasting a bare-bones version below.
  <requestHandler name="edismax" class="solr.SearchHandler" >
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">custom1^0.1 custom2^8.0 custom3^2.0 custom4^10.0 key^1.5 uid^1.5 title^8.0 contents^2.0</str>
        <str name="pf">custom1^0.1 custom2^8.0 custom3^2.0 custom4^10.0 key^1.5 uid^1.5 title^8.0 contents^2.0</str>
        <str name="fl">*,score</str>
        <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
        <!-- example highlighter config, enable per-query with hl=true
        <str name="hl.fl">summary title</str> -->
        <str name="spellcheck">false</str>
        <!-- omp = Only More Popular -->
        <str name="spellcheck.onlyMorePopular">false</str>
        <!-- exr = Extended Results -->
        <str name="spellcheck.extendedResults">false</str>
        <!-- The number of suggestions to return -->
        <str name="spellcheck.count">5</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
  </requestHandler>
You’ll note a couple of things. First, defType is set to edismax and it is also using a new class. qf and pf (you can read in the Solr docs what they mean) are where I am specifying which fields get boosted at search time. And VERY importantly, all spellcheck options have to be “false”. I learned this the hard way. You can still do spell checking and return the suggestions from CFSearch, but they must be false for wildcards to work, at least as the defaults in solrconfig.xml.
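As a sanity check outside CF, the same defaults can be exercised straight against Solr’s HTTP API. Here’s a minimal Python sketch of building such a query string – the helper is hypothetical and the boosts are just the title/contents weights from the qf line above, not part of CF’s integration:

```python
from urllib.parse import urlencode

def build_edismax_query(criteria, boosts, spellcheck=False):
    """Build a Solr query string for the edismax handler.

    Boosts are applied at query time via qf ("field^boost" pairs),
    so changing weights needs no re-index.
    """
    params = {
        "q": criteria,
        "defType": "edismax",
        "qf": " ".join(f"{field}^{boost}" for field, boost in boosts.items()),
        "fl": "*,score",
        # spellcheck must stay off for wildcards to work (see above)
        "spellcheck": str(spellcheck).lower(),
    }
    return urlencode(params)

# e.g. build_edismax_query("polymer*", {"title": 8.0, "contents": 2.0})
```

Append the result to http://yourhost:8983/solr/yourcore/select? (host and core names are placeholders) to try a boosted wildcard search by hand.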
Next, inside CFSearch, set type to edismax. As a side note, I was still getting errors, so I added URLDecode(criteria) where I was passing criteria to CFSearch, and voila!
Now I was running Solr 1.5 Dev 64-bit, CF was communicating with no errors, I was getting all the right scores, I could do trailing and leading wildcards, and I no longer needed to do NGramming, so my collection size shrank. Woo-hoo!
To be continued…

Lessons Learned: Moving from Verity to Solr (Part 5)

Previously, in Part 4 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.
In Part 3 of this series, I spoke about my frustration with wild cards. Solr, out of the box, will support wild cards like ? or * anywhere in the criteria, except as the first character. In Part 3, I detailed a known work-around.
Well, to my amazement, in poring over Solr docs as I have been doing, I found that with Solr 1.4 there is a way to do this natively! Woo-hoo!
In my schema.xml, for your fieldType (in my case fieldType name="text"…, under analyzer type="index"), you can add a new filter called: ReversedWildcardFilterFactory. More details @ http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory.
So I added: <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
Basically, what this does is take the fields of type text when indexing and store a reversed copy of them in the index. You can find what the additional parameters mean online, but basically, setting withOriginal="true" says to keep the original token as well when indexing.
So let’s give an example. Let’s say you are indexing the word “bookshelf”. Before this setting, you could search for “book*” and this record would show up just fine. But you could not search for “*shelf”. Now when you index, Solr will store both “bookshelf” and “flehskoob”. When you search for “*shelf”, it takes note of the leading *, reverses the term to “flehs*”, which matches “flehskoob”, and bingo, you have a match!
Note: Since you are storing additional data, obviously this will cause your collection size to grow, but it’s worth it!
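For the curious, the filter’s behavior can be sketched in a few lines of Python. This is just an illustration of the idea using fnmatch, not Solr’s actual implementation:

```python
from fnmatch import fnmatchcase

def index_tokens(word):
    # withOriginal=true: store the token both forward and reversed
    return {word, word[::-1]}

def matches(pattern, tokens):
    # a pattern with a leading wildcard is reversed so it can run
    # against the reversed copies stored at index time
    if pattern.startswith(("*", "?")):
        pattern = pattern[::-1]
    return any(fnmatchcase(token, pattern) for token in tokens)

tokens = index_tokens("bookshelf")  # {"bookshelf", "flehskoob"}
matches("book*", tokens)   # True, via the original token
matches("*shelf", tokens)  # True: "*shelf" -> "flehs*" matches "flehskoob"
```

The real filter also uses the maxPosAsterisk/maxPosQuestion settings to decide which patterns get the reversed treatment; this sketch skips that.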

Lessons Learned: Moving from Verity to Solr (Part 4)

Previously, in Part 3 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.
This is a big one. Upgrading to the latest Solr. The Solr that comes with Adobe ColdFusion 9.0.1 is a slightly customized (from what I can tell) version of a Solr pre-1.4.0 release. It’s almost a year old!
I was having some trouble with some of my custom enhancements, so I decided to upgrade to Solr 1.4.1.
Before I dive into details, I have to give a shout to Vinu Kumar and Kunal Saini (Adobe Engineers) who confirmed some details and pointed me in the right direction.

  1. Shut down the Solr Search Service (obvious)
  2. You’ll notice in your $coldfusionSolrInstalldir\webapps dir, there is a WAR file called solr.war. Back this file up somewhere outside this directory.
  3. Download the latest Solr zip. Go to apache-solr-1.4.1\dist. Notice that apache-solr-1.4.1.war file? Copy it to the above directory and rename it to solr.war.
  4. Delete all files in the $coldfusionSolrInstalldir\work directory. I believe this is where the WAR’s files are expanded.
  5. Start the Solr Search service.

That’s it! You should see new files in the $coldfusionSolrInstalldir\work directory now. But there has to be a catch, right?
Yes there is, but it’s a small one. Creating new Solr collections via CFAdmin will fail. Why? Because it’s looking for those Adobe tweaks. Is there a workaround? Yes!
And I believe the following workaround is reasonable, as most people will not be creating collections all the time. Usually they are created once, and that’s it.
So, all you must do is the following:

  1. Think of a new collection name. Easy. :)
  2. When you create a Solr collection (or a Verity one), CFAdmin asks for a “path”. Note that path.
  3. Copy $coldfusionSolrInstalldir\multicore\template\conf\*.* to <path>\<new collection name>\conf\*.*
  4. Now go to CFAdmin, point it to path=<path> with name=<new collection name>, and voila!

Essentially what happens here is that once it notices a conf directory exists (which has the CF customizations), it no longer freaks out.
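Under the hood, steps 2–4 amount to a simple directory copy. A hypothetical Python sketch (the helper name and argument layout are mine, not CF’s):

```python
import shutil
from pathlib import Path

def prepare_collection(solr_install_dir, path, name):
    """Copy the CF-customized conf template into a new collection
    directory so CFAdmin finds it and doesn't try to create its own."""
    template = Path(solr_install_dir) / "multicore" / "template" / "conf"
    target = Path(path) / name / "conf"
    shutil.copytree(template, target)  # creates parent dirs as needed
    return target
```

After this, point CFAdmin at path=<path> with name=<new collection name> as described above.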
That’s it. I’m running Solr 1.4.1 on Windows 2008 R2 64-bit! This is so awesome since I didn’t need to make any other changes – JVM settings and other tweaks all carried over!

WTH?! Bloglines to Shutdown!

Ugh. Bloglines, which I use for reading 1000+ blogs, is shutting down in 3 weeks. They say you can export the OPML file, but what about all my saved/marked content? This sucks.

Lessons Learned: Moving from Verity to Solr (Part 3)

Previously, in Part 2 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.
This one deals with wildcards. If you look on page 359 of the 2nd WACK book, it states: “A search for ?ar?et would find both Carpet and Target, but not Learjet.”
Thanks to Ray Camden for confirming this would actually NOT work. I believe he said it would go in the Errata. Why?
Well, starting the criteria with a wildcard, either * (star) or ? (question mark), will fail with Solr. You will get this nice error: “Error executing query: org.apache.lucene.queryParser.ParseException: Cannot parse 'XXX': '*' or '?' not allowed as first character in WildcardQuery”.
So although it would be nice to search for “*ing”, it would be impractical according to the Solr folks. Is there a way around this? Well, theoretically yes.
Let’s say in column1 you wanted to search for ?ar?et, just like in the example. Do the following:

  1. When you build your SQL query, add a column and do a REVERSE. For example: SELECT column1, column2, REVERSE(column1) AS reverseColumn1.
  2. Index the results.
  3. Then, when searching, reverse the term if it starts with a wildcard. In this example: “te?ra?”.

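The search-side reversal in step 3 is trivial to sketch in Python (the helper is hypothetical – your code would run the result against the reverseColumn1 field indexed in step 1):

```python
def to_reverse_term(term):
    """Reverse a search term that starts with a wildcard, so it can be
    matched against the REVERSE()d copy of the column in the index."""
    if term.startswith(("*", "?")):
        return term[::-1]
    return term

to_reverse_term("?ar?et")  # "te?ra?" – search this against reverseColumn1
to_reverse_term("carpet")  # unchanged, search column1 as usual
```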

Lessons Learned: Moving from Verity to Solr (Part 2)

Previously, in Part 1 of this series, I blogged about some difficulties in working with Solr. I am following up with some more lessons learned.

  • In order to index with more than one category, Adobe suggested that instead of category="column1,column2" – which places the literal value "column1,column2" in the category instead of the respective values – I try: category="#queryName.column1#,#queryName.column2#". When I did this, it transformed the values all right, but only for the first record. So every record in the index had the same category value as the first record. My hack?
    1. Run the query as usual.
    2. Do a query of query: SELECT *, column1 + ',' + column2 AS indexCategory FROM queryName.
    3. Instead of category="column1,column2", use category="indexCategory". This will put the appropriate comma-delimited category in place.
    4. Basically, since cfindex works correctly with a single category, I used the query of queries to create a special column containing the values I wanted.
  • Escape Special Characters. One of the categories I was using was a lookup of state codes. So it has values like CA, VA, NY, and OR. Notice something odd? That’s right: Solr didn’t like ‘OR’, since it is a reserved operator word. In fact, you can see a list of reserved words and a note on escaping characters for Lucene and Solr @ http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters. NOTE: This escaping was needed only when searching using CFSearch, not when indexing. Maybe someone should make a UDF for this, but my simple fix was: replace(VALUE,'OR','\OR'). Note the backslash.
  • More to come…
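The escaping fix for reserved operator words generalizes easily. A Python sketch mirroring the replace() approach above (AND/OR/NOT are Lucene’s operator words; the helper itself is mine, and a real UDF would handle the other special characters from the page linked above too):

```python
# Lucene/Solr operator words; a bare occurrence in a data value, like
# the state code "OR", must be escaped before it hits the query parser.
RESERVED = {"AND", "OR", "NOT"}

def escape_reserved(value):
    return " ".join(
        "\\" + word if word in RESERVED else word
        for word in value.split()
    )

escape_reserved("CA OR NY")  # the OR becomes \OR, CA and NY pass through
```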

Lessons Learned: Moving from Verity to Solr (Part 1)

More than anything, this series of posts is a set of notes and tidbits I’ve learned as we move our large Verity collection over to Solr. These notes apply to CF 9.0.1.

  1. Rule #1 – Adobe Docs suck. Be creative in your searches. I found answers to questions in the following places:
    • Blogs: including but not limited to Ray Camden’s blog and various Adobe Engineer blogs.
    • Adobe Press Releases, Release Notes, and Change Logs – details about whether a feature is enabled are often hidden away in these.
    • Apache Solr’s docs
    • Google – Sorry, I mean Scroogle.org.
  2. Rule #2 – Tune your Solr install. Just like your CF instance, modify the solr.lax file under the solr-install root directory. Look for two lines:
    • lax.nl.java.option.additional – this line contains the JVM args. We upped memory to 1024 from 256.
    • lax.nl.current.vm – we pointed this to the latest bin\javaw.exe file under a 64-bit JDK. 64-bit Solr? You bet!
  3. Rule #3 – Increase Buffer Size – In CF Administrator, go to Solr Server -> Show Advanced Settings. Change Solr Buffer Limit from 40 to 80. For the why on this, use Scroogle.
  4. Rule #4 – Default Operator – When we used Verity, searching for ‘fire water’ would in effect search for ‘fire and water’. With Solr, ‘fire water’ searches for ‘fire OR water’. If you need to change the default operator between words in keyword searches, don’t despair. Go to where your Solr data is located (the root directory of it), and open conf\schema.xml. Around line 528 you should see <solrQueryParser defaultOperator="OR"/>; change it to <solrQueryParser defaultOperator="AND"/> if need be.
  5. Rule #5 – Support for Categories Seems Broken (as of 9/7/2010) – I am seeking more data on this. Let’s take an example: if you index a query with columns keyA, columnB, columnC and in your cfindex you set category="columnB", it works OK. But if you set category="columnB,columnC", it takes the literal value inside the quotes without transforming it and sets it as the category!
  6. Rule #6 – Support for Categories Sucks – Whoa, again? Yes. This time when searching. Let’s say you did index with columnB above, which can have two values: valueA and valueB. And in your cfsearch, per docs, category takes a comma-delimited list of categories. Wrong! After much trial and error, I figured out that for a single category, you can use category="valueA", but for multiple categories, you have to use not commas, but search operators. So for either category, use "valueA OR valueB". If you want both, use AND.
  7. Rule #7 – Operators are CASE-SENSITIVE! Be Warned. We used to allow users to enter keywords like ‘fire and water’. Now they must use ‘fire AND water’. The lower-case ‘and’ does not count! I had to build a custom UDF to get this to work, part of a larger “solrClean” UDF (as opposed to the famous verityClean UDF). I will release this code soon. This is NOT user friendly at all.
  8. Rule #8 – Custom fields are broken. Oh wait, they are not! With Verity, your cfsearch criteria would include something like " AND CF_CUSTOM2 <MATCHES> xyz". With Solr, you must re-write this as " AND custom2:xyz". Note the dropping of "CF_".
  9. Rule #9 – Don’t use custom fields in the search criteria like in #8 when returning suggestive results. It will say “Did you mean: custom” instead of suggesting against custom2. Ugh.
  10. To be continued…
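A naive sketch of the operator-uppercasing part of such a “solrClean” UDF, in Python for illustration. This is not the actual UDF (a real one would also need to skip quoted phrases, handle Rule #8’s field rewrites, and so on):

```python
import re

# lower-case operator words that Verity tolerated but Solr ignores;
# word boundaries keep words like "sand" or "doctor" untouched
_OPERATORS = re.compile(r"\b(and|or|not)\b")

def solr_clean(criteria):
    """Upper-case bare and/or/not so Solr treats them as operators."""
    return _OPERATORS.sub(lambda m: m.group(1).upper(), criteria)

solr_clean("fire and water")  # "fire AND water"
```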