
As you might already know, reader, if you hit this page, indexing is and will always be about two things:
- Indexing comes from a simple need, dev related or not.
- It helps you get to the data you want FASTER.
“I want to find all people whose first name starts with S”: with a set of sorted folders, that kind of question is easy to answer.
When it comes to data storage, things are the same:
- file systems index the position of files on the actual hard drive or flash drive,
- databases allow you to create multiple indexes on specific columns, so that matching rows can be found faster.
Why make the same point for code, then? Well, it comes down to the exact same problem: performance.
Most of the time, people use LINQ to filter data and get subsets of it for processing, in a loop for example:
foreach (var countryCode in _countryCodes)
{
    var countriesPerCode = _entitiesLists.Where(e => e.Country == countryCode).ToList();
    var count = countriesPerCode.Count;
    // Code supposedly doing something with the entities.
}
It can be fine if not too many of these loops run, but why is it so different when it runs on, let's say, a million rows?
The same problem occurs with a database: if the SQL engine running the query doesn't know anything about how to find the rows matching your WHERE clause, it has to scan all rows.
In the case of our loop above, LINQ does the same: how would LINQ know which items match your lambda until it has tried it on every item in your list?
To solve that issue we are going to use a lesser-known LINQ type: Lookup.
The goal is simple: we are going to use it to build an index over our data, grouped by a given key. Building it only once on our dataset fixes our problem: getting the data subset back for each loop iteration becomes practically instant with our Lookup.
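Here is a minimal sketch of what that looks like, reusing the _entitiesLists and _countryCodes fields from the loop above (names taken from the sample, not necessarily the exact repro code):

// Build the index once: ToLookup groups all entities by their country code.
var entitiesByCountry = _entitiesLists.ToLookup(e => e.Country);

foreach (var countryCode in _countryCodes)
{
    // Indexing the lookup by key is a direct access, not a scan of the whole list.
    // A missing key simply returns an empty sequence.
    var countriesPerCode = entitiesByCountry[countryCode].ToList();
    var count = countriesPerCode.Count;
    // Code supposedly doing something with the entities.
}

Note that the lookup returned by ToLookup is a snapshot: if the underlying data changes, it has to be rebuilt, which is fine here since we only read from it inside the loop.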
Here is the performance difference you can get from our test app (output from the console app):
Data: building
Data: done
Test1: start
Test1: processed in 535 milliseconds
Test2: start
Test2: lookup done in 140 milliseconds
Test2: processed in 141 milliseconds
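For context, Test1 is the Where-in-a-loop version shown earlier and Test2 is the Lookup version. The timings boil down to something like this (a simplified sketch using System.Diagnostics.Stopwatch and System.Linq, not the exact repro code):

// Test1: repeated Where() calls, each one scanning the whole list.
var stopwatch = Stopwatch.StartNew();
foreach (var countryCode in _countryCodes)
{
    var countriesPerCode = _entitiesLists.Where(e => e.Country == countryCode).ToList();
}
Console.WriteLine($"Test1: processed in {stopwatch.ElapsedMilliseconds} milliseconds");

// Test2: pay the indexing cost once...
stopwatch.Restart();
var entitiesByCountry = _entitiesLists.ToLookup(e => e.Country);
Console.WriteLine($"Test2: lookup done in {stopwatch.ElapsedMilliseconds} milliseconds");

// ...then every iteration is a direct key access.
stopwatch.Restart();
foreach (var countryCode in _countryCodes)
{
    var countriesPerCode = entitiesByCountry[countryCode].ToList();
}
Console.WriteLine($"Test2: processed in {stopwatch.ElapsedMilliseconds} milliseconds");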
To summarize our little article (TL;DR):
- Building the lookup first takes some time, but doing it once offers around a x4 performance gain
- Make use of indexing capabilities when your dataset grows above a certain number of items (that is to say, most of the time!)
- This is a really simple sample that does not reflect everything else that can happen around your code; I have seen processing time go from 45 minutes to 5 (a x9 speedup)
- Repro code can be found on GitHub here.