Web Software Architecture and Engineering – Life on the Bleeding Edge

More than anything, this series of posts collects notes and tidbits I’ve learned as we move our large Verity collection over to Solr. These notes apply to CF 9.0.1.

  1. Rule #1 – Adobe Docs suck. Be creative in your searches. I found answers to my questions in the following places:
    • Blogs: including but not limited to Ray Camden’s blog and various Adobe Engineer blogs.
    • Adobe Press Releases, Release Notes, and Change Logs – Whether a feature is enabled is often hidden away in these.
    • Apache Solr’s docs
    • Google – Sorry, I mean Scroogle.org.
  2. Rule #2 – Tune your Solr Install. Just as you would for your CF instance, modify the solr.lax file under the Solr install root directory. Look for two lines:
    • lax.nl.java.option.additional – this line contains the JVM args. We raised memory from 256 MB to 1024 MB.
    • lax.nl.current.vm – we pointed this to the latest bin\javaw.exe file under a 64-bit JDK. 64-bit Solr? You bet!
  3. Rule #3 – Increase Buffer Size – In CF Administrator, go to Solr Server -> Show Advanced Settings. Change Solr Buffer Limit from 40 to 80. For the why on this, use Scroogle.
  4. Rule #4 – Default Operator – When we used Verity, searching for ‘fire water’ would in effect search for ‘fire and water’. With Solr, ‘fire water’ searches for ‘fire OR water’. If you need to change the default operator between words in keyword searches, don’t despair. Go to the root directory of your Solr data and open conf\schema.xml. Around line 528 you should see <solrQueryParser defaultOperator="OR"/>; change it to <solrQueryParser defaultOperator="AND"/> if need be.
  5. Rule #5 – Support for Categories Seems Broken (as of 9/7/2010) – I am seeking more data on this. Take an example: you index a query with columns keyA, columnB, and columnC. If you set category="columnB" in your cfindex, it works OK. But if you set category="columnB,columnC", it takes the literal value inside the quotes, without transforming it, and sets that as the category!
  6. Rule #6 – Support for Categories Sucks – Whoa, again? Yes, this time when searching. Say you indexed with columnB above, which can have two values: valueA and valueB. Per the docs, the category attribute of cfsearch takes a comma-delimited list of categories. Wrong! After much trial and error, I figured out that for valueA alone you can use category="valueA", but for multiple categories you have to use search operators, not commas. So to match either category, use "valueA OR valueB". If you want both, use AND.
  7. Rule #7 – Operators are CASE-SENSITIVE! Be warned. We used to allow users to enter keywords such as ‘fire and water’. Now they must enter ‘fire AND water’. The lowercase ‘and’ does not count! I had to build a custom UDF to get this to work, part of a larger “solrClean” UDF (as opposed to the famous verityClean UDF). I will release this code soon. This is NOT user friendly at all.
  8. Rule #8 – Custom fields are broken. Oh wait, they are not! With Verity, your cfsearch criteria might include " and CF_CUSTOM2 <MATCHES> xyz". With Solr, you must rewrite this as " AND custom2:xyz". Note the dropped “CF_” prefix.
  9. Rule #9 – Don’t use custom fields in a search as in #8 when returning suggestions. It will say “Did you mean: custom” instead of custom2. Ugh.
  10. To be continued…
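
Rules #6 and #7 both come down to rewriting strings before they reach Solr. The solrClean UDF mentioned above is not shown in this post, so what follows is only a minimal sketch of the idea, written in Python rather than CFML purely for illustration. The function names uppercase_operators and category_expression are hypothetical, and the sketch assumes that only and, or, and not need upper-casing and that double-quoted phrases should be left untouched.

```python
import re

# Boolean operators that must be upper case to count as operators (Rule #7).
_OPERATORS = re.compile(r"\b(and|or|not)\b", re.IGNORECASE)
# Capturing group so re.split keeps the double-quoted phrases in the result.
_PHRASES = re.compile(r'("[^"]*")')


def uppercase_operators(keywords):
    """Upper-case and/or/not everywhere except inside double-quoted phrases."""
    parts = _PHRASES.split(keywords)
    # With one capturing group, the quoted phrases land at the odd indexes.
    return "".join(
        part if i % 2 else _OPERATORS.sub(lambda m: m.group(1).upper(), part)
        for i, part in enumerate(parts)
    )


def category_expression(categories, operator="OR"):
    """Join category values with a search operator instead of commas (Rule #6)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    cleaned = [c.strip() for c in categories if c.strip()]
    return f" {operator} ".join(cleaned)


if __name__ == "__main__":
    print(uppercase_operators("fire and water"))
    print(category_expression(["valueA", "valueB"]))
```

The same approach should carry over to a CFML UDF: clean the user's keywords first, then build the category string separately and pass each to cfsearch.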

Comments on: "Lessons Learned: Moving from Verity to Solr (Part 1)" (22)

  1. I take _strong_ issue with your assertion that the Adobe Docs suck. Between the Reference and Dev guide you have close to 5000 pages of free documentation. It is certainly not perfect, but neither is the CFWACK, or _any_ documentation that I know of. The Adobe Docs (specifically the web version) even allow for live commenting to help point out mistakes and make improvements. To simply say they suck is to do a great disservice to a _huge_ resource.

    I’ll also point out that in terms of Solr, the docs really should cover the basics, and after a certain point you (you being the “Developer” you, not just you personally) should realize when it is appropriate to transition to the actual Solr docs. It doesn’t make sense for Adobe to completely rewrite what Solr already has.

    Just my 2 cents.

    Also: 4) I think actually X Y in Verity is a phrase search, not an AND search. But it’s been a while so I could definitely be wrong.

  2. @Ray,

    We’re gonna have to disagree on the Adobe Docs issue. The docs in many places are flat out wrong, outdated, or incomplete. Being the official source on how things work, it’s a shame, and it’s always been that way. I don’t know a single person in all my years who has given it a compliment. I’ve commented using their system numerous times, only to have the comments go into a black hole.

    Yes, I believe #4 is a phrase search… I guess I meant both are included for sure.

  3. Thanks for reporting these issues, will save me some time as I’m currently working on some Solr search functions that these will come into play for. Look forward to seeing your cleanSolr UDF!

    P.S. That captcha you are using is driving me crazy!! Not everyone can so easily read these… how about cfformprotect instead?!

  4. Hhm, I just tried doing a search with multiple categories, comma-delimited list and it seems to work fine for me, it seems to use an “OR” for the categories listed. You don’t mention what the problem was, did you want it to do an AND on the category list?

  5. Mary Jo,

    Are you sure? Can you share some code?

  6. While Sami’s language may be a bit strong, I have to agree with the sentiment: the Adobe docs leave a lot to be desired, at least in relation to the Search tags.

    I puzzled for a long time over references in the Verity section to enabling Field searches which failed to specify where exactly you were supposed to do it. Googling the phrase turned up some original Verity documentation parts of which I realised had just been copied and pasted out of context into the CF docs. From the original I worked out which file I was supposed to edit. I added a comment to the CF8 docs to help others, but for CF9 all comments seem to have been reset without any improvements being made to the content.

    Solr itself though seems to be miles better than Verity overall, but I’m discovering more differences in the way it operates than the docs suggest. For example, the “maxrows” attribute of cfsearch limits the number of rows returned by Verity searches. With Solr, if you specify more than 1 collection, the maxrows applies to each collection, so you may well get more results than you expect. I have filed this and had it verified as bug 84081.

    Finally, Sami, I’ve found that you can avoid having to change the schema/solrconfig settings to your desired defaults for each collection you create, by editing the files in the multicore/template/conf before creating them. The template settings are then applied to each new collection. I’ve done that to enable the full content highlighting, rather than just summary highlighting, plus a few other settings I would rather be on by default.

    At least Solr allows this flexibility and it all seems to work. With Verity it was hard to know what could be tweaked and hit and miss whether it would have any effect.

  7. FYI: The CF team has confirmed both bugs. They have offered a work-around for one, and I am going to try it out.

  8. FYI: The work-around does NOT work.

  9. Had an email from Adobe today to say that the Maxrows bug 84068 (not 84081 as I wrongly said) has been fixed, presumably in the next Hot Fix.

  10. Sami,

    Did you release your “solrClean” UDF?

    Thanks,

    Aaron

  11. Aaron,

    I plan to in the coming weeks after we launch the new version of our product. Thanks for the reminder!

  12. First, have to agree that the Adobe Docs suck. Am I glad they are there and at least have some good information? Certainly. But as stated, they leave a lot to be desired and are very often not well updated with new releases and often simply do not have all the information and examples that they should. I have to very often figure things out by trial-and-error that simply should be in the docs in the first place.

    Look forward to your UDF Sami. I’m still struggling with some Solr issues myself. I have a weird issue with my application where if I run a purge on the index, it doesn’t properly index all my data (I have about 5-6 separate data queries and each gets added to the index with a different category for searching on). It’s very repeatable, if I run a purge, at least 1-2 categories don’t show up in the index. If I don’t run a purge, they do. Very odd, and the only solution I found was to index my docs twice after a purge.

    (PS…arggh, that captcha is driving me crazy!! On my third attempt at it… use of captcha is bad enough, but case-sensitive captcha is just cruel!)

  13. Mary,

    I feel your pain.

    Many people have complained about the captcha, so I’ll indeed tweak it very soon. I didn’t realize it was case-sensitive!

    Again, more to come. We’re completing a one year re-write of our product, and I have tons to blog about.

  14. Well, I’m assuming it’s case-sensitive as there’s been a number of times that I was 99% sure that I at least got the letters right and it still rejected it. A lot of times though I can’t even figure the letters out, it has them so obscured (I’ve reloaded this form 3 times just trying to get one I can read). I assume it’s more a MangoBlog issue so I guess you can pass my gripe on to them. 😉 For those of us with less-than-perfect eyesight, I always dread seeing them on a site, and will often just pass on posting comments when I see them.

  15. A lot of this here is flat out wrong.

    Category support in Solr is wayyyyyyyyy better than in Verity. You just have to know how to use it. Verity collections were garbage that always

  16. @DanB: Could you perhaps be more explicit? Where exactly is he wrong?

  17. Here’s what I take issue with:

    Rule #6 – Support for Categories Sucks – couldn’t be more wrong.
    Verity collections with categories always got corrupted, and the searches always seemed kind of wonky to me, forcing you to do very awkward things like locking or disabling your search efforts while your index is optimized or updated.
    The fact that you have to use a different syntax rather than a comma separated list of categories does not mean that category support sux. I have used the category features of Solr in depth, and the category support is impressive to say the least.
    On top of that, Solr supports “facet counts” for the categories. You can issue a search and have included (as a side XML item) the number of results that apply to each of the categories contained in that collection.
    Verity is completely owned by SOLR in category support.

    Rule #5 – Support for Categories Seems Broken – Couldn’t be more wrong
    But I guess it is qualified with a date, but to my knowledge CF 9.0.1 uses the same version of Solr that 9.0 did, so again, using the wrong syntax to submit your query doesn’t mean that feature is broken.

    Rule #2 -Tune your Solr Install – this one leaves out one of the most important tweaks.
    I used Solr right out of the game on a fairly large collection (20K+ documents) and never had to do much to tweak on performance, and found that indexing took place _hundreds_ of times faster. The most
    A far more important tweak is updating the configuration XML to include context passages – which has a slight impact on collection size, no noticeable impact on performance, and will retain the context passages that most Verity developers are used to (this one I know you know of):
    http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSe9cbe5cf462523a0-5bf1c839123792503fa-8000.html

    And as far as #1 goes: I somewhat agree here. Adobe docs have tons of good info, but I cringe every time I click a link that leads to Adobe docs because of that stupid left hand navigation panel, and the fact that the page has to load two or three times before you can start browsing the contents makes me cringe, too.
    I know there is a simpler version out there, but my Google results always seem to hit the over-done version.

    Rule #4: default operator –
    I am betting a lot of people parse the user’s search input before passing it into the search engine, so I don’t see how this should be changed. Two great tools to learn the new query syntax: Solr’s wiki, and accessing the Solr service directly to issue queries:
    http://localhost:8983/solr/

  18. Hmm. So for 6 – he isn’t wrong. You just take issue with him saying it sucks. 🙂 I’d agree with that. The docs should be updated.

    #5 There were CF/SOLR fixes after 9.0.

  19. Andrius said:

    After we moved from Verity to Solr we started getting a lot of “java.io.IOException: FULL” errors on very large search queries. Fixed that by increasing maximum request header size. In file /solr/etc/jetty.xml added 65535 within the section … . More info on this: http://drupal.org/node/443980

  20. Andrius said:

    Also, our exact search results have become fuzzy which was not OK. Fixed that by modifying data type of the indexed data. In Solr collection configuration file /collections/collection_name/schema.xml changed ‘<field name="contents" type="text" ' to '<field name="contents" type="text_exact" '.
