Lessons Learned in Solr
We recently worked with a client that wanted to improve the speed of their site's search functionality. At Cloudspace we know that one way to get truly fast searches is with Apache Solr. Solr does full-text searching that maintains its speed even with hundreds of thousands of records. There are other options, one of which is Elasticsearch (also in the Lucene family). If you're having trouble deciding between the two, read Tim's post from a few weeks ago. Being new to Cloudspace, I wasn't as familiar with Solr as many of my coworkers, so here's my take on some of the things I learned while working with it. This isn't a technical overview, a guide, or even an introduction. It's just a few pitfalls I came across and some of the lessons I took away from the experience.
Sorting in Solr is case-sensitive
Coming from a MySQL background where ORDER BY ignores case, this one was confusing at first. I was setting Solr's sort parameter to "name desc", which should produce a Z-to-A sort on the user's first name. But I was noticing some strange behavior: "malcolm graves" was sorted before "Xin Zhao". The last time I checked, 'M' came before 'X' in the alphabet, so in a descending sort "Xin Zhao" should have come first. One of my co-workers pointed out the case discrepancy. Lowercase a-z sorts after uppercase A-Z (which makes sense, just look at an ASCII table), so in descending order the lowercase names are shown first. That may make sense to Solr, but it won't make sense to the user, so we needed to fix it.
Our solution was to create a separate case-normalized field in the Solr index for users' names. We created a new field type called text_sort that uses a LowerCaseFilterFactory, then added a copyField directive that automatically duplicates the name field into a new field called name_sort, which uses our text_sort type. When we wanted to sort, we simply sorted documents on this new field ("name_sort desc").
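The effect is easy to reproduce outside Solr. The short Python sketch below is not Solr code. It only assumes that an untreated string sort compares raw character codes, and it uses a lowercased sort key as a stand-in for what the name_sort field does. The variable names are made up for illustration.

```python
# The two names from the example above.
names = ["malcolm graves", "Xin Zhao"]

# A plain descending sort compares character codes, so every lowercase
# letter ranks above every uppercase one ('m' is 109, 'X' is 88).
case_sensitive = sorted(names, reverse=True)

# Sorting on a lowercased key plays the role of a case-normalized sort
# field: the original values are kept, only the comparison changes.
case_insensitive = sorted(names, key=str.lower, reverse=True)

print(case_sensitive)    # ['malcolm graves', 'Xin Zhao']
print(case_insensitive)  # ['Xin Zhao', 'malcolm graves']
```

As with the copied field in Solr, the displayed value keeps its original capitalization. Only the value used for ordering is lowercased.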
Most of your time is spent re-indexing
As you work with Solr more, you'll realize that a lot of your time is spent simply waiting for your data to re-index. If you're lucky, you can work with a subset of the data and don't have to re-index the entire table every time. In fact, you should strive for this. If you can get away with indexing only some of your data until you have to do a full test, do it. It will save you countless hours of waiting for your index script to finish. The long wait for re-indexing is also a great motivator to double-check that you're indexing each field the right way. Are you sure you have the filters you want? Are you tokenizing in the correct way? Have you thought of every edge case? If not, you'll be waiting on another re-index when you discover the issue.
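One way to work with a subset is to pick a small, repeatable sample of records before handing them to your indexing script. The sketch below is only an illustration: sample_for_indexing is a hypothetical helper, the records are stand-in data rather than rows from a real database, and it stops short of actually sending anything to Solr.

```python
import json
import random


def sample_for_indexing(records, size, seed=0):
    """Return a repeatable random subset of records (hypothetical helper)."""
    if size >= len(records):
        return list(records)
    # A fixed seed picks the same documents on every run, so results stay
    # comparable between one re-index and the next.
    return random.Random(seed).sample(records, size)


# Stand-in data; a real project would read these rows from its database.
records = [{"id": i, "name": f"user {i}"} for i in range(1000)]

subset = sample_for_indexing(records, 50)

# Serialize the subset as a JSON array of documents for the indexing step.
payload = json.dumps(subset)
```

Fixing the seed is a deliberate choice here. If the sample changed on every run, a difference in search results could come from the new sample instead of from the schema change you are testing.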
Websolr can save you a lot of headaches
Websolr is great, and we recommend it to most of our clients. Unless you have a ridiculously large dataset, Websolr will likely work for you. It saves you tons of start-up and management time. You don't have to worry about managing another server or making sure your Solr instance is configured correctly. Websolr handles all of that for you, and in my experience does it extremely well.
Final thoughts
Once you have figured out how to navigate the pitfalls in Solr (of which there are many), you can greatly improve your users' experience, and in the end I think it's worth it.
Have any questions about Solr you didn't see covered in this post? Let us know. We'd love to share some more of our insight!