Lucene Facet calculation performance optimizations

Ok, so I did put my MultiFacetLucene library to the test. Downloaded a  pretty large database containing 140000 posts. Two columns of which I wanted to do facetting on. One of them containing around 20000 distinct values. The other just about 100.

So this means – we are gonna do 20000 facet bitset queries and calculations. I.e with the current codebase.

I did try it out and well, no- 600 ms (against a physical disk directory – not RAM) is nothing too be too proud of. Searching without facets is 0 ms…

So what to do to speed things up? First of all – let me tell you I did speed it up. 600 ms down to 35ms. Now, since I (also) changes the API a bit I will in a few days time release a version 2. Containing the optimizations.

Basically it’s all about not calculating everything. For starters it doesn’t makes sense to calculate 20000 facet values and counts for a specific search. In reality – how are you gonna show that in a webpage to a user? 20000 checkboxes?

What you typically want to do is offer checkboxes on the most popular facets. So I modified the api to let you (for each facet) specify a MaxToFetchExcludingSelections parameter. Cause note that facets the user has selected will always be calculated.

Anyway; now with that parameter in palce, how can we use that to end our loop as soon as possible premature?  Well, remember we have two bitsets for each facetvalue. One saying, GREEN exists 14 times in ALL documents while BLUE exists 22.

Then generate a new bitset based on current user search and facets ecluding the current one. This might say GREEN exists in 12 documents and BLUE in 5. And those are ANDED to get the final result.

So the optimization is: sort the first (ALL document) bitsets based on count – descending. This will mean we will start calculating facetcount for the most overall values. We add all calculated facet counts to a list. When the list is full (we asked for say 8 facet values) – we from now on compare the “ALL document bitset count” with the least value in our 8 objects list.

I.E

If the 8th facetc ount says “BLUE” 34 . And we know that YELLOW exists in only 33 documents overall, we can bail out. Since we calculate facets based on the count in the overall bitsets.

I will post some code, describing it better, when I have published the new version of the library.