Web Software Architecture and Engineering – Life on the Bleeding Edge

Ok, so you’ve got Seeker installed, and you’re itchin’ to get it
working. If you remember, we ignored the “demos” subfolder and the
“docs” subfolder.
The docs subfolder contains two files, which provide basic
information on what’s going on with Seeker and on how to install it.
The install documentation is a “bit light”, so hopefully you have been
following my instructions and you should be fine.
Now for the demos subfolder. Make sure you place it somewhere
accessible via a URL on the server where Seeker was installed. For me
that URL is http://dev01/temp/seeker/demos/. I’ll be referencing
everything off this URL.
The first file you want to hit is lucene_test.cfm. This file should
spit something like this out to you: “Let’s see if we can run Lucene,
and what version we have… package org.apache.lucene, Lucene Search
Engine: core, version 2.3.2”. This means you have Seeker installed
properly and that ColdFusion can interact with the JAR file properly.
So far so good. If you hit an error, go over the install instructions
again.
Next, let’s test Lucene’s indexing of static text and HTML files.
Inside the demos folder, you’ll note two subfolders: “files” and
“filesIndex”. Files is where the text files to be indexed are stored,
and filesIndex is where Lucene creates its own index files for the files
folder. If you look at the code Ray put together, you can see this is
all customizable, and of course this is just for testing purposes at
this point.
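To get a feel for what ends up in an index folder, here is a toy in-memory inverted index in Java. It is only a sketch of the general idea (a map from each word to the files that contain it). Lucene’s real on-disk format is far more involved, and the class name `ToyIndex` and the file names are made up for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Seeker's or Lucene's code.
public class ToyIndex {
    // term -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Split the text on non-alphanumeric characters and record each lowercased word.
    public void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Exact-term lookup; returns an empty set when the term was never indexed.
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a.txt", "This is a test file");
        index.add("b.txt", "Another file, without that word");
        System.out.println(index.search("test")); // documents containing "test"
        System.out.println(index.search("file")); // documents containing "file"
    }
}
```

The point is that a search never rereads the original files; it only looks up terms in the index, which is why a reindex is needed after the files change.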
So let’s do that indexing of the files. Run file_test.cfm. If this
is the first time, you should see something similar to this at the top:
“Doing index of D:\wwwroot\TEMP\seeker\demos\files to
D:\wwwroot\TEMP\seeker\demos\filesindex”. This means that Lucene has
gone through and created the index files inside the filesIndex folder.
If you need to reindex for any reason, simply add this to the URL:
file_test.cfm?reindex and it will force a reindex. You should also see
a text input box that will allow you to search against the newly
created index. Give it a whirl!
If you want to test further, you can add words, or files, to the files
folder, force a reindex, and do a search again to see if Lucene picked
it up. Feel free to try partials like “tes*” without the quotes.
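If you’d rather not make users type the asterisk themselves, one option (suggested in the comments below) is to append it to the search string before passing it to Seeker. Here is a minimal sketch in plain Java. `WildcardHelper` and `appendWildcards` are hypothetical names, not part of Seeker or Lucene, and the same idea could be written in CFML.

```java
import java.util.StringJoiner;

// Hypothetical helper; not part of Seeker or Lucene.
public class WildcardHelper {
    // Appends '*' to every whitespace-separated term that doesn't already end with one.
    public static String appendWildcards(String input) {
        StringJoiner out = new StringJoiner(" ");
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only for blank input
            }
            out.add(term.endsWith("*") ? term : term + "*");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(appendWildcards("tes"));
        System.out.println(appendWildcards("cold fusion*"));
    }
}
```

This sketch is deliberately naive: it does not handle quoted phrases or boolean operators, and it only adds a trailing wildcard, so it matches words that begin with the typed text, not words that merely contain it.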
Next, let’s take a look at indexing database entries. New ColdFusion installs come with a datasource called “cfartgallery”, and that is what Ray uses for his example, though it should be pretty easy to point the example at a different datasource. So browse to query_test.cfm. You should see a message similar to: “Doing index of db query to D:\wwwroot\TEMP\seeker\demos\dbindex”. The query “select * from art” runs, and its results are stored in Lucene format in the “dbindex” subfolder. Again, you should also see a text input box that will allow you to search against the newly created index. Give it a whirl!
You can also force the same reindex after adding or editing records in the DB. And the same goes for partial searches.
Next, I’ll probably run a larger test and compare with Verity, and also look at the inside of Seeker to see how it’s functioning. In the meantime, enjoy!
 

Comments on: "ColdFusion & Lucene: Running the Demos" (20)

  1. Nathan Miller said:

    I got Lucene and Seeker installed no problem – thanks for the easy instructions.

    One thing I think is missing is a context of the search result. I quickly mocked this up by using cffile to open each file in the search result and get the context manually, then add it to the result query, but that seems to be an inefficient way to do it. Isn’t the context stored in the search result somewhere?

    Spidering is the other thing that I’d like to see, so I took a look at Nutch, and my eyes quickly glazed over when I read about cygwin…

    Overall, the engine seems fast and easy to use. I look forward to more updates.

  2. @Nathan,
    I will be looking a bit later at some of the advanced settings of Lucene. Some of the fuzzy searching in Ray’s code doesn’t seem to be working like it did for me pre-Seeker. Stay tuned! And yeah, spidering always gets crazy.

  3. Sami, thanks for giving Seeker a good run through. Please be sure to email me your findings. I’m following your blog, but I want to be sure I don’t miss anything.

    Context isn’t supported by Lucene… I think. But I keep getting surprised by what is in there.

  4. @Ray,

    Will do. I have some of my own custom code which I’ll compare with Seeker. More to come!

  5. You can do context pretty easily with Lucene (see http://swem.wm.edu/beta/flathat/?q=classes). That sample uses a SimpleHTMLFormatter and a Highlighter to generate the spans around the word. They’ve moved some things around, but check out the jars in the contrib folder for a lot of this additional functionality.

    One of the other really cool things is that you can generate a spelling index, not of the words in a dictionary, but in your target pages (which significantly improves finding ‘things’).

  6. Thanks Wayne. Will definitely be looking at that.

  7. Hi,
    thanks for the great post! I’m getting an error though when i try to run file_test.cfm. You can see it here:
    http://dimitritiomkin.com/demos/file_test.cfm.

    i’m able to run the lucene test no problem, so everything would seem to be fine: http://dimitritiomkin.com/demos/lucene_test.cfm. I’m on a UNIX box running CFMX7 with Lucene 2.4.1 and Seeker.

    Any thoughts?

  8. David,

    I can only imagine that the issue has to do with the platform, both in terms of OS and CF version. The examples I ran were on Windows and CF8. Is there another box you can try the code in?

  9. Unfortunately, my server is Unix and CF7, so doing it on anything else wouldn’t help me out. I’ll see if i can dig anything else up. Perhaps i can send Raymond a line to get his thoughts.

    Thanks again Sami!
    David

  10. So i got it working. Apparently with CFMX7 the PDF reader will not work, so simply deleting it makes it work. I still have yet to get the admin to work properly, but i’m glad i got this far.

    In regards to context searching and highlighting like Wayne had suggested, can anyone provide a sample of code to produce such a result? Also, is there a list of variables available from the result object? Currently, all the examples just use CFDump tags, but obviously that’s not the greatest look. 🙂

    Thanks Again,
    David

  11. Also, is there no way to take wildcards into consideration automatically? I know that i can go in and enter a * after any word to search, but i’d prefer it to be automatic. For example, if a user types in “ma” (minus quotes), then i would hope that any words in the search containing “ma” would show up and be listed according to relevancy. however, using seeker only brings up this result if i type in the actual wild card “ma*”, which actually only brings up results that begin with “ma”. IE – grandma would never come up, but maria would.

    Any thoughts on this?

    Thanks again,
    David

  12. David,

    I believe that is part of the config files, if I am not mistaken. You should be able to control default behavior from there.

    It’s been months since I’ve looked at it, so pardon me if I’m wrong, but I think that is how you set it up.

    Sami

  13. David, you could also just auto add * to the search string before passing it to Seeker.

  14. @Sami
    Thanks i’ll check into it.

    @Raymond

    Any thoughts on highlighting?

    Thanks on the wildcard tip, but how do i deal with items that end with ma rather than start with ma? I can’t do this *ma*. Is there another way around it?

    Thanks again
    David

  15. The highlighting is really a context feature (show words around search match). That is not yet supported in the official Lucene server (I believe) but is a work in progress currently in beta.

    As to your second thing – you’ve lost me. If you want a search for X to be wild on both sides, you add * to both ends. If you don’t, you add it just to the end.

  16. @Raymond,
    Thanks again for the quick response. As to the second part, when i try to add leading and ending wildcards (ie *ark* – as if i was searching for “marks”), i get the following error on the leading wildcard:
    Cannot parse ‘*mar*’: ‘*’ or ‘?’ not allowed as first character in WildcardQuery

    The error occurred in C:\ColdFusion8\wwwroot\tiomkin\assets\customtags\search.cfm: line 51

    Hope that makes a bit more sense.

    Thanks again,
    David

  17. Ah, well, that may be a Lucene thing then. I’d have to figure out how it supports ‘match in the middle’ type searches.

  18. Where can i have more info on this ?

    Regards

  19. Hi

    Maybe a stupid question but how do I search two collections at the same time, one files the other a DB index?

    Thanks

    Peter
