New version of MultiLiceneNet library, both at Github and Nuget

Just a headsup – a new release is available.

NUGET

GITHUB

– Nuget now supporting Framework 4.5 as well

– nasty memory leak fixed, cussing used memory to half

There are two notable things to comment on when it comes to the memory leak.

a)  facet calculation IS memory intense. It has to be for performance to be acceptable. Basically we need our full dataset bitsets saved in memory for each possible facet value.

b) I did notice the leak cause I am using my library for real. I’m using it on sites with million of documents, And it’s really cool to see acceptable search and facetting performance, despite the large datasets.

I will however think a bit about optimizing it even further. I have some ideas about a more LRU like approach where not all bitsets are stored in memory. However I still need the term and count, even though the actual bitsets for less used facet values might be flushed out from cache and read again at demand. My second idea is simply writing the whole structure to disk. And using a memory mapped file to access it. In a sense I would then leave it to the operating system to do the optimization of memory usage.  Since I already sort the list by most popular values it would give me a sort of clustered access – thereby minimizing paging.

I’ll get back on this one.

Lucene Facet calculation performance optimizations

Ok, so I did put my MultiFacetLucene library to the test. Downloaded a  pretty large database containing 140000 posts. Two columns of which I wanted to do facetting on. One of them containing around 20000 distinct values. The other just about 100.

So this means – we are gonna do 20000 facet bitset queries and calculations. I.e with the current codebase.

I did try it out and well, no- 600 ms (against a physical disk directory – not RAM) is nothing too be too proud of. Searching without facets is 0 ms…

So what to do to speed things up? First of all – let me tell you I did speed it up. 600 ms down to 35ms. Now, since I (also) changes the API a bit I will in a few days time release a version 2. Containing the optimizations.

Basically it’s all about not calculating everything. For starters it doesn’t makes sense to calculate 20000 facet values and counts for a specific search. In reality – how are you gonna show that in a webpage to a user? 20000 checkboxes?

What you typically want to do is offer checkboxes on the most popular facets. So I modified the api to let you (for each facet) specify a MaxToFetchExcludingSelections parameter. Cause note that facets the user has selected will always be calculated.

Anyway; now with that parameter in palce, how can we use that to end our loop as soon as possible premature?  Well, remember we have two bitsets for each facetvalue. One saying, GREEN exists 14 times in ALL documents while BLUE exists 22.

Then generate a new bitset based on current user search and facets ecluding the current one. This might say GREEN exists in 12 documents and BLUE in 5. And those are ANDED to get the final result.

So the optimization is: sort the first (ALL document) bitsets based on count – descending. This will mean we will start calculating facetcount for the most overall values. We add all calculated facet counts to a list. When the list is full (we asked for say 8 facet values) – we from now on compare the “ALL document bitset count” with the least value in our 8 objects list.

I.E

If the 8th facetc ount says “BLUE” 34 . And we know that YELLOW exists in only 33 documents overall, we can bail out. Since we calculate facets based on the count in the overall bitsets.

I will post some code, describing it better, when I have published the new version of the library.

Facets in Lucene.net – the theory

Ok, so lets talk a a little about the actual facetting theory. Or to be more exact: lets talk about how it’s implemented in Lucene.NET. Cause thanks to Lucene, facet calculation boils down to pretty simple bitmapping arithmetics.

What we do is, when facet calculation for a specific field for the first time is being requested, we read (and cache) bitmaps from the Lucene index. Each possible value is given a bitmap of its own. So for value “BLUE” a bitmap could be 01001. That would mean that for the specific facet field, document number 2 and 5 indeed contains the value “BLUE”. As a side note, fetching all term values is supported from core Lucene.net and it even offers an bitmap abstraction, through the OpenBitSetDISI class

private FacetValues ReadBitSetsForValues(string facetAttributeFieldName)
        {
            var facetValues = new FacetValues();
            facetValues.Term = facetAttributeFieldName;

            facetValues.FacetValueBitSetList.AddRange(
                GetFacetValueTerms(facetAttributeFieldName).Select(fvt => new FacetValues.FacetValueBitSet
                {
                    Value = fvt.Term,
                    Filter = fvt.Filter,
                    OpenBitSetDISI =
                        new OpenBitSetDISI(fvt.Filter.GetDocIdSet(IndexReader).Iterator(), IndexReader.MaxDoc)
                }));

            return facetValues;
        }

        private IEnumerable<FacetValueTermFilter> GetFacetValueTerms(string facetAttributeFieldName)
        {
            var termReader = IndexReader.Terms(new Term(facetAttributeFieldName, String.Empty));
            do
            {
                if (termReader.Term.Field != facetAttributeFieldName)
                    yield break;

                var facetQuery = new TermQuery(termReader.Term.CreateTerm(termReader.Term.Text));
                var facetQueryFilter = new CachingWrapperFilter(new QueryWrapperFilter(facetQuery));
                yield return new FacetValueTermFilter {Term = termReader.Term.Text, Filter = facetQueryFilter};
            } while (termReader.Next());
        }


Not too many rows of code to do that, right?

So the FacetSearcher class in our library offers a SearchWithFacets function. Apart from running the regular query to fetch the actual documents (that query could be “searchword:’hello’ – we shold also fetch all facet values matching that original query. Now the bitmap aritmetics comes into play.

If we generate a new bitmap for the ‘searchword:hello’ filter we might get a 11000 result. Only document 1 and 2 matches that search filter.

Now, Simply ANDing the 01001 with 11000 will tell us that only document 2 matches BLUE and searchword:hello

That’s the basics of Lucene facetting! However, the beauty of my library is how it takes into consideration other facet selection. The “If we generate a new bitmap for the ‘searchword:hello’ filter” is not entirely true, that’s my point.

Lets say we have selected COLOR:BLUE and SIZE:MEDIUM. Other lucene.net facet implementations I’ve seen uses both those filters when calculating facets. Which will completely rule out the possibility to do multi value selection. I.e COLOR:RED OR COLOR:BLUE. Anyway, I’ve talked about that way too much already in my earlier posts, so what I wanna show you is a demo site I whipped together using the MultiFacetLucene library.

Lets go to the search page of  Aviation questions – searching for the word aviation gave us 69 results.

To the right you see facet values. Click one of them and search result is filtered. Facet results are also updated, but one facet value dopes not rule out the other options in that facet field.

Multi facets and lucene – behind the scenes

So how does the code from the previous article on Lucene.net and the MultiFacetLuceneNet library I have published work?

Well the secret lies in how to run the facet queries. Basically when calculating facets from field “X” we need to dynamically create a query – the original MINUS the X term selection. And same thing when calculating facets for field “Y” we need to remove the Y selection from the original query.

Have a look at the CreateFacetedQuery function.

protected Query CreateFacetedQuery(Query baseQueryWithoutFacetDrilldown, IEnumerable<SelectedFacet> selectedFacets, string facetAttributeFieldName)
{
    var facetsToAdd = selectedFacets.Where(x => x.FieldName != facetAttributeFieldName).ToList();
    if (!facetsToAdd.Any()) return baseQueryWithoutFacetDrilldown;
    var booleanQuery = new BooleanQuery {{baseQueryWithoutFacetDrilldown, Occur.MUST}};
    foreach (var selectedFacet in facetsToAdd)
	{
        if (selectedFacet.SelectedValues.Count == 1)
            booleanQuery.Add(new TermQuery(new Term(selectedFacet.FieldName, selectedFacet.SelectedValues[0])), Occur.MUST);
        else
        {
            var valuesQuery = new BooleanQuery();
			foreach (var value in selectedFacet.SelectedValues)
			{
				valuesQuery.Add(new TermQuery(new Term(selectedFacet.FieldName, value)), Occur.SHOULD);
			}
			booleanQuery.Add(valuesQuery, Occur.MUST);
		}
	}
	return booleanQuery;
}

As you can see we feed it with the selected facets (could be “COLOR:blue+green”, “SIZE:Medium” ) but exclude the facet matching the current one we are calculating (facetAttributeFieldName)

Which means when we calculate the COLOR facets we well generate a query saying {original query} AND SIZE:’Medium’

Likewise when we calculate the CATEGORY facets the query will be {original query} AND (COLOR:’blue’ or COLOR:’green’)

Next post will describe the actual facet calculation algoritm, along with the stupid mistake I did which (when found out) gave me 100x times better performance…(the “100x times faster” checkin at  https://github.com/aspcodenet/MultiFacetLuceneNet )

Also: dont miss the NUGET package for MultiFacet Lucene.NET

Lucene.net and multiple facets

Consider a site with a set of “checkbox navigation” groups on a search result page. To let the user drilldown the search result.

Say we were searching for the phrase Programming.

Checkbox navigation would the be, for example

CATEGORIES

====================

Blog (12) []

MVC (4) []

Messaging (3) []

WRITTEN

====================

April 2014(2) []

May 2014(9) []

Jun 2014 (14) []

The whole idea is, as I said, to be able to drill down the search result to say articles matching “Programming”, tagged in category MVC – AND written in April 2014 or May 2014.

The checkboxes inside each group are OR:ed together and AND:ed with other groups. Combined with original search phrase it limits the actual documents returned.

However – when the user clicks on say MVC – then of couse the DATE options – and numbers – update so it matches the correct number. I.e the number is to be though of as a expectation. If i click this checkbox as well I will be given X more documents (hrmm, not entirely true as one document could have multiple categories etc). But my main point is that calculation of facets in each respective group should not include itself in the filter .

This is (better) described at the SOLR site: http://wiki.apache.org/solr/SimpleFacetParameters – scroll down to Tagging and excluding Filters

ANyway, since I have yet to fiond an extension library to Lucene.net offering this kind of facet behaviour I decided to implement it myself.

So here it goes: MultiFacetLueceneNet on github