Lucene.net facets - optimizing memory utilization

by Stefan Holmberg, 30. June 2014 12:32

As I stated in my previous post - memory consumption is inevitable if you want to do facet calculation against a larger dataset AND still meet somewhat acceptable performance requirements.

After experiencing that myself - seeing memory usage really spike on one of my bigger sites using the MultiFacetLucene library - I spent some days thinking and experimenting on how it could be optimized without sacrificing too much on the performance side.

Let me start by saying a little about my reference index. I have 240000 documents, with two columns meant to be faceted: one containing around 20000 unique values, the other only a couple of hundred. On my desktop machine (where I don't have room for the index on the SSD, so it lives on my regular mechanical disk) I have a test suite running 100 consecutive queries (with facets), and I get somewhere around 42 ms response time on average. On this particular box I think 50 ms is acceptable.

Now, when it comes to memory - the actual structure storing the facet information and bitsets weighs in at around 111 MB... That is NOT good...

 

So - I've been playing with an LRU cache for starters, thinking I should only store X percent of the bitsets in memory and calculate the rest when asked for. But I found that in such a performance-critical library, actually handling the cache could very well turn into a bottleneck and become a resource hog itself.

 

My second shot was to simplify the actual caching logic. I turned it into a "most likely to be used" cache. Because, remember, I already do some optimization when calculating the facet values - shortcutting the loop as early as possible. That means I have some clues about which facet values will probably be calculated the most: the most frequent ones! They will be needed on most webpages, based on pure logic.

 

So a really simple implementation, where I just store 50 percent of the bitsets in memory, gave 50 percent less memory utilization - and performance was only hit by 4-5 ms. I think this might be the way to go.

I still have some code refactoring to do before I release it, but I hope to be able to try it on my own sites any day now. My idea is to make the optimization extendable - like being able to feed the FacetSearcher an IMemoryOptimizer instance, so it would be easy enough to override/provide your own strategy.

        bool ShouldLazyLoad(string fieldAttributeName, FacetSearcher.FacetValues.FacetValueBitSet bitSet, int index, IEnumerable<FacetSearcher.FacetValues.FacetValueBitSet> allFacetValueBitSets, int totalCount);

 

I mean, some attributes you might really want to keep in memory entirely. Or you might want a strategy that only lazy-loads attributes where the total number of values is more than 15000, or something like that.
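To make that concrete, here is a minimal sketch of what such a strategy could look like, assuming IMemoryOptimizer exposes the ShouldLazyLoad method above. The class name and the threshold logic are illustrative, not part of the library:

public class ThresholdMemoryOptimizer : IMemoryOptimizer
{
    // Illustrative threshold: fields with more values than this get lazy loading.
    private const int LargeFieldThreshold = 15000;

    public bool ShouldLazyLoad(string fieldAttributeName, FacetSearcher.FacetValues.FacetValueBitSet bitSet, int index, IEnumerable<FacetSearcher.FacetValues.FacetValueBitSet> allFacetValueBitSets, int totalCount)
    {
        // Small fields: keep every bitset in memory.
        if (totalCount <= LargeFieldThreshold)
            return false;

        // Large fields: the list is sorted by overall count descending,
        // so keep the most frequent half in memory and lazy-load the rest.
        return index > totalCount / 2;
    }
}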

Tags:

Lucene.net

New version of MultiFacetLuceneNet library, both at GitHub and NuGet

by Stefan Holmberg, 28. June 2014 16:44

Just a heads-up - a new release is available.

 

NUGET

GITHUB

 

- NuGet now supporting Framework 4.5 as well

- nasty memory leak fixed, cutting memory usage in half

 

There are two notable things to comment on when it comes to the memory leak.

a) facet calculation IS memory intensive. It has to be for performance to be acceptable. Basically we need full-dataset bitsets kept in memory for each possible facet value.

b) I noticed the leak because I am using my library for real. I'm using it on sites with millions of documents, and it's really cool to see acceptable search and faceting performance despite the large datasets.

 

I will however think a bit about optimizing it even further. I have some ideas about a more LRU-like approach where not all bitsets are stored in memory. I would still need the term and count, even though the actual bitsets for less used facet values might be flushed out of the cache and read again on demand. My second idea is simply writing the whole structure to disk and using a memory mapped file to access it. In a sense I would then leave it to the operating system to optimize memory usage. Since I already sort the list by most popular values it would give me a sort of clustered access - thereby minimizing paging.
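The memory mapped file idea could be as simple as the following sketch. This is just me thinking out loud - the file name, layout and offsets are made up; only the System.IO.MemoryMappedFiles calls are the real .NET API:

using System.IO;
using System.IO.MemoryMappedFiles;

// Persist the packed 64-bit bitset words to disk once, then let the OS
// page them in on demand instead of keeping ~111 MB resident.
using (var mmf = MemoryMappedFile.CreateFromFile("facetbitsets.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    long bitsetOffset = 0; // byte offset of one facet value's bitset (made up)
    int doc = 42;          // document number to test

    // Read the 64-bit word containing the document's bit and mask it out.
    long word = accessor.ReadInt64(bitsetOffset + (doc / 64) * 8);
    bool docMatches = (word & (1L << (doc % 64))) != 0;
}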

 

I'll get back on this one.

Tags:

Lucene.net

Lucene Facet calculation performance optimizations

by Stefan Holmberg, 18. June 2014 16:08

Ok, so I did put my MultiFacetLucene library to the test. I downloaded a pretty large database containing 140000 posts, with two columns I wanted to do faceting on: one containing around 20000 distinct values, the other just about 100.

So this means - with the current codebase - we are going to do 20000 facet bitset queries and calculations.

I did try it out and, well, no - 600 ms (against a physical disk directory, not RAM) is nothing to be too proud of. Searching without facets is 0 ms...

 

So what to do to speed things up? First of all, let me tell you I did speed it up: 600 ms down to 35 ms. Since I (also) changed the API a bit I will release a version 2 in a few days' time, containing the optimizations.

Basically it's all about not calculating everything. For starters it doesn't make sense to calculate 20000 facet values and counts for a specific search. In reality - how are you going to show that on a webpage? 20000 checkboxes?

What you typically want to do is offer checkboxes on the most popular facets. So I modified the API to let you specify (for each facet) a MaxToFetchExcludingSelections parameter. Note that facets the user has selected will always be calculated.

 

Anyway; now with that parameter in place, how can we use it to end our loop as early as possible? Well, remember we have two bitsets in play for each facet value. One says, for example, that GREEN exists in 14 of ALL documents while BLUE exists in 22.

Then we generate a new bitset based on the current user search and facet selections (excluding the current field). This might say GREEN exists in 12 documents and BLUE in 5. The two bitsets are ANDed to get the final result.

 

So the optimization is: sort the first (ALL documents) bitsets by count, descending. This means we start calculating facet counts for the overall most frequent values. We add each calculated facet count to a list. When the list is full (say we asked for 8 facet values), we from then on compare the "ALL documents bitset count" with the smallest value in our list of 8.

 

I.e. if the 8th facet count says BLUE: 34, and we know that YELLOW exists in only 33 documents overall, we can bail out - since a value's filtered count can never exceed its overall count, YELLOW (and everything sorted after it) can never make it into the top 8.

 

I will post some code, describing it better, when I have published the new version of the library.
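In the meantime, here is a rough sketch of the idea. None of these names come from the library - FacetMatch, CalculateFacetCount, maxToFetch and the rest are placeholders - but the early-exit logic is the one described above:

var top = new List<FacetMatch>(); // running top-N, e.g. N = maxToFetch = 8
foreach (var fv in bitSetsSortedByOverallCountDescending)
{
    // Once the list is full: if this value's OVERALL count cannot beat the
    // smallest count collected so far, nothing later can either (the input
    // is sorted descending), so we can bail out.
    if (top.Count == maxToFetch && fv.OverallCount <= top.Min(m => m.Count))
        break;

    long count = CalculateFacetCount(fv, queryBitSet); // AND the bitsets, count the bits
    if (count == 0) continue;

    if (top.Count < maxToFetch)
    {
        top.Add(new FacetMatch { Value = fv.Value, Count = count });
    }
    else
    {
        // Replace the current smallest entry if this value beats it.
        var smallest = top.OrderBy(m => m.Count).First();
        if (count > smallest.Count)
        {
            top.Remove(smallest);
            top.Add(new FacetMatch { Value = fv.Value, Count = count });
        }
    }
}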

Tags:

Lucene.net

Facets in Lucene.net - the theory

by Stefan Holmberg, 14. June 2014 15:34

Ok, so let's talk a little about the actual faceting theory. Or to be more exact: let's talk about how it's implemented in Lucene.NET. Because thanks to Lucene, facet calculation boils down to pretty simple bitmap arithmetic.

 

What we do is: when facet calculation for a specific field is requested for the first time, we read (and cache) bitmaps from the Lucene index. Each possible value is given a bitmap of its own. So for the value "BLUE" a bitmap could be 01001, which would mean that for this specific facet field, documents number 2 and 5 contain the value "BLUE". As a side note, fetching all term values is supported by core Lucene.net, and it even offers a bitmap abstraction through the OpenBitSetDISI class.

private FacetValues ReadBitSetsForValues(string facetAttributeFieldName)
{
    // Build one bitset per distinct term in the facet field and cache them all.
    var facetValues = new FacetValues();
    facetValues.Term = facetAttributeFieldName;
    facetValues.FacetValueBitSetList.AddRange(
        GetFacetValueTerms(facetAttributeFieldName).Select(fvt => new FacetValues.FacetValueBitSet
        {
            Value = fvt.Term,
            Filter = fvt.Filter,
            OpenBitSetDISI =
                new OpenBitSetDISI(fvt.Filter.GetDocIdSet(IndexReader).Iterator(), IndexReader.MaxDoc)
        }));
    return facetValues;
}

private IEnumerable<FacetValueTermFilter> GetFacetValueTerms(string facetAttributeFieldName)
{
    // Enumerate all terms of the field; each term gets a cached query filter.
    var termReader = IndexReader.Terms(new Term(facetAttributeFieldName, String.Empty));
    do
    {
        if (termReader.Term.Field != facetAttributeFieldName)
            yield break; // we have passed the last term of our field
        var facetQuery = new TermQuery(termReader.Term.CreateTerm(termReader.Term.Text));
        var facetQueryFilter = new CachingWrapperFilter(new QueryWrapperFilter(facetQuery));
        yield return new FacetValueTermFilter { Term = termReader.Term.Text, Filter = facetQueryFilter };
    } while (termReader.Next());
}


Not too many lines of code to do that, right?

 

So the FacetSearcher class in our library offers a SearchWithFacets function. Apart from running the regular query to fetch the actual documents (that query could be searchword:'hello'), we should also fetch all facet values matching that original query. Now the bitmap arithmetic comes into play.

If we generate a new bitmap for the searchword:'hello' filter we might get 11000 as the result - only documents 1 and 2 match that search filter.

Now, simply ANDing 01001 with 11000 tells us that only document 2 matches both BLUE and searchword:hello.
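In code, that AND is just a couple of lines. A minimal sketch - the variable names are mine, but And and Cardinality are the real OpenBitSet calls (OpenBitSetDISI derives from OpenBitSet):

// Clone the facet value's bitset so the cached one stays intact,
// AND it with the current query's bitset, and count the remaining bits.
var bits = (OpenBitSet)facetValueBitSet.OpenBitSetDISI.Clone();
bits.And(queryBitSet);                 // 01001 AND 11000 => 01000
long facetCount = bits.Cardinality();  // => 1 (only document 2)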

That's the basics of Lucene faceting! However, the beauty of my library is how it takes other facet selections into consideration. The "if we generate a new bitmap for the searchword:'hello' filter" part is not entirely true - that's my point.

Let's say we have selected COLOR:BLUE and SIZE:MEDIUM. Other Lucene.net facet implementations I've seen use both those filters when calculating facets, which completely rules out the possibility of multi-value selection, i.e. COLOR:RED OR COLOR:BLUE. Anyway, I've talked about that way too much already in my earlier posts, so what I want to show you is a demo site I whipped together using the MultiFacetLucene library.

Let's go to the search page of Aviation questions - searching for the word aviation gave us 69 results.

To the right you see facet values. Click one of them and the search result is filtered. Facet results are also updated, but one facet value does not rule out the other options in that facet field.

Tags:

Lucene.net

Multi facets and lucene - behind the scenes

by Stefan Holmberg, 8. June 2014 14:39

So how does the code from the previous article - on Lucene.net and the MultiFacetLuceneNet library I have published - actually work?

 

Well, the secret lies in how we run the facet queries. Basically, when calculating facets for field "X" we need to dynamically create a query: the original MINUS the X term selection. And the same thing when calculating facets for field "Y" - we need to remove the Y selection from the original query.

 

Have a look at the CreateFacetedQuery function.

 

protected Query CreateFacetedQuery(Query baseQueryWithoutFacetDrilldown, IEnumerable<SelectedFacet> selectedFacets, string facetAttributeFieldName)
{
    // Keep every selected facet EXCEPT the field we are currently calculating.
    var facetsToAdd = selectedFacets.Where(x => x.FieldName != facetAttributeFieldName).ToList();
    if (!facetsToAdd.Any()) return baseQueryWithoutFacetDrilldown;

    var booleanQuery = new BooleanQuery {{baseQueryWithoutFacetDrilldown, Occur.MUST}};
    foreach (var selectedFacet in facetsToAdd)
    {
        if (selectedFacet.SelectedValues.Count == 1)
        {
            // A single selected value is ANDed in directly.
            booleanQuery.Add(new TermQuery(new Term(selectedFacet.FieldName, selectedFacet.SelectedValues[0])), Occur.MUST);
        }
        else
        {
            // Multiple values within one field are ORed together,
            // and the whole group is then ANDed with the rest.
            var valuesQuery = new BooleanQuery();
            foreach (var value in selectedFacet.SelectedValues)
            {
                valuesQuery.Add(new TermQuery(new Term(selectedFacet.FieldName, value)), Occur.SHOULD);
            }
            booleanQuery.Add(valuesQuery, Occur.MUST);
        }
    }
    return booleanQuery;
}

 

As you can see we feed it the selected facets (which could be "COLOR:blue+green", "SIZE:Medium") but exclude the facet matching the one we are currently calculating (facetAttributeFieldName).

This means that when we calculate the COLOR facets we will generate a query saying {original query} AND SIZE:'Medium'.

Likewise, when we calculate the CATEGORY facets the query will be {original query} AND (COLOR:'blue' OR COLOR:'green').
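To illustrate the call with the selections above (the SelectedFacet initialization is just a sketch; the exact class layout may differ):

var selectedFacets = new List<SelectedFacet>
{
    new SelectedFacet { FieldName = "COLOR", SelectedValues = new List<string> { "blue", "green" } },
    new SelectedFacet { FieldName = "SIZE",  SelectedValues = new List<string> { "Medium" } }
};

// When calculating the COLOR facets, the COLOR selection is dropped:
// => {original query} AND SIZE:Medium
Query colorQuery = CreateFacetedQuery(originalQuery, selectedFacets, "COLOR");

// When calculating the SIZE facets, the SIZE selection is dropped:
// => {original query} AND (COLOR:blue OR COLOR:green)
Query sizeQuery = CreateFacetedQuery(originalQuery, selectedFacets, "SIZE");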

 

The next post will describe the actual facet calculation algorithm, along with the stupid mistake I made which (once found) gave me 100x better performance... (the "100x times faster" checkin at https://github.com/aspcodenet/MultiFacetLuceneNet )

 

Also: don't miss the NuGet package for MultiFacet Lucene.NET

Tags:

Lucene.net and multiple facets

by Stefan Holmberg, 5. June 2014 14:13

Consider a site with a set of "checkbox navigation" groups on a search result page, to let the user drill down the search result.

Say we were searching for the phrase Programming.

 

Checkbox navigation would then be, for example:

 

CATEGORIES

====================

Blog (12) []

MVC (4) []

Messaging (3) []

 

WRITTEN

====================

April 2014(2) []

May 2014(9) []

Jun 2014 (14) []

 

The whole idea is, as I said, to be able to drill down the search result to, say, articles matching "Programming", tagged with category MVC, AND written in April 2014 or May 2014.

The checkboxes inside each group are OR:ed together and AND:ed with the other groups. Combined with the original search phrase, this limits the actual documents returned.
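In Lucene.net terms, the drill-down query for the example above would have roughly this shape (the field names are made up for the illustration):

// "Programming" AND category:MVC AND (written:2014-04 OR written:2014-05)
var query = new BooleanQuery
{
    { new TermQuery(new Term("text", "programming")), Occur.MUST },
    { new TermQuery(new Term("category", "MVC")), Occur.MUST }
};
var writtenGroup = new BooleanQuery
{
    { new TermQuery(new Term("written", "2014-04")), Occur.SHOULD },
    { new TermQuery(new Term("written", "2014-05")), Occur.SHOULD }
};
query.Add(writtenGroup, Occur.MUST);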

However - when the user clicks on, say, MVC - then of course the DATE options - and numbers - update to match. I.e. the number is to be thought of as an expectation: if I click this checkbox as well I will be given X more documents (hmm, not entirely true as one document could have multiple categories etc). But my main point is that the calculation of facets in each respective group should not include that group itself in the filter.

 

This is described (better) on the SOLR site: http://wiki.apache.org/solr/SimpleFacetParameters - scroll down to Tagging and excluding Filters.

 

Anyway, since I have yet to find an extension library for Lucene.net offering this kind of facet behaviour, I decided to implement it myself.

So here it goes: MultiFacetLuceneNet on github

 

Tags:

Lucene.net

SignalR in practice 1

by Stefan Holmberg, 1. January 2013 16:18

Ok, the purpose of this article is to give you some insights on how to really make use of SignalR and the server side push functionality it offers. Now, to be honest, yes there ARE many code examples out there, but most of them build on the "let's write a chat application!" approach, which is not very useful in everyday programming.

Where I have found use for SignalR is instead in lengthy processes, rather than as a way of having multiple web browser sessions communicate with each other. Very much the same concepts, yes - but I hope this example still eases the understanding of SignalR.

Pushing data from server to client is the key - and I certainly appreciate the elegance in how SignalR (or other similar techniques like pubnub.com) lets us accomplish that. One of the patterns I love the most (both as a software and integration architect) is publish/subscribe. And the way SignalR is implemented, to us application developers it's nothing more than pubsub! The cool - and inventive - bit of course being that the web browser session is the subscriber to events.

So this very first example is this: a SignalR server website (hosted in IIS). It contains a single webpage - hockeyresults. The HockeyResultsController contains two actions, Index and ShowGame.

Index: lets the user connect to a certain hockey game (a list of fake live games)

ShowGame: displays live game results and events, coming from some sort of third party (or maybe entered manually by someone watching/commenting the game live).
Now some lines of JavaScript set up the subscription (I want to subscribe to all events concerning this particular game ID - yes, this is the SignalR GROUP), and when there are new events for the game they are added to the webpage.
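On the server side, that group subscription maps to a hub. A minimal sketch - the hub and method names are mine, not necessarily the demo's, but Groups.Add is the standard SignalR call:

using System.Threading.Tasks;
using Microsoft.AspNet.SignalR;

public class GameHub : Hub
{
    // Called from the page's JavaScript: join the group for one game ID,
    // so this browser session only receives events for that game.
    public Task Subscribe(string gameId)
    {
        return Groups.Add(Context.ConnectionId, gameId);
    }
}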

Ok, server side: there is a small console application, PushResults.exe.
Call it like this, with two parameters - first a game number, then the news text to be published:
>PushResults.exe 12345 "Malkin scored his third goal against Rangers..."

The console app will simply call a WebAPI function (downloading http://localhost:17872/PushGameData/SendMessage?gameid={0}&message={1} ) on the SignalR server website, thereby handing it over to the website - since that's where the SignalR server is hosted, and where the subscription database is.
Note: I implemented it as a regular controller - my point is just to show an EASY way of interacting with the SignalR server...
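That controller action boils down to a few lines. Another sketch - the client-side callback name addGameEvent is made up, but GlobalHost.ConnectionManager.GetHubContext is the standard SignalR way to push from outside a hub:

using System.Web.Mvc;
using Microsoft.AspNet.SignalR;

public class PushGameDataController : Controller
{
    public ActionResult SendMessage(string gameid, string message)
    {
        // Resolve the hub context and broadcast to everyone subscribed
        // to this game's group; addGameEvent is the JavaScript callback.
        var hub = GlobalHost.ConnectionManager.GetHubContext<GameHub>();
        hub.Clients.Group(gameid).addGameEvent(message);
        return new EmptyResult();
    }
}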

 

As you can see, no magic at all! The point being: input (published data) can come from completely different sources than another web browser session.

What really makes this cool is when you, as I said, have for example lengthy third party integrations. Initiate from the browser client (which also means setting up a subscription to get the result whenever it's available). Put the request into a queue/bus, whatever. Persist it. When data comes back we both store it in persistent storage and send it back to the client browser. In case the user has closed his browser, the result can this way be made available to him next time.

 

I will make more posts on this theme, as I would like to show you some real examples of how to tie state (lengthy process results) to persistent storage, and how to utilize queues to fork actual work off to dedicated worker processes.

   

GET THE SOURCECODE (GitHub)

https://github.com/aspcodenet/SignalR1

 

Another restart of this blog

by Stefan Holmberg, 31. December 2012 16:00

Ok, back again... New year, new expectations, new goals... And on a new server as well...

 

Looking for aspcode.net content? Sorry... I have removed everything and redirected the domain to this site instead... Why?

About 13 years ago I created the site ASPCode.net and while I had a blast with it, I felt the domain name "aspcode" put some constraints on me and the content (also, the site structure turned into a mess... I mean, mixing articles like "Read a textfile with classic ASP" with newer stuff like CQRS/architecture/continuous integration etc).

 

Back then a website typically was just that - a website. You had some ASP/ASP.NET pages, most often just showing some data from a database backend... Today the web platform (.NET) is SO much more than just the website. There are so many more types of endpoints than basic HTML over HTTP - MSMQ/JSON (Web API)/websockets (SignalR) etc, and I want to start over completely, with my blog able to cover these areas without the old luggage.

So - Happy New Year, and hopefully I'll start posting some real articles pretty soon!

Tags:

Blog
