Comparing two or more datasets is one of the data monkey’s most rudimentary functions. Whether it’s checking records from one set have been included in (or excluded from) another, highlighting amended data between two sets or validating your tabular stats against last week’s, comparison of data is something your average data monkey does several times per day if not per working hour.
It rears it’s head not only as a function in it’s own right, required by pretty much anyone who works with data at any level, but in almost every other function performed by your general data geek such as ETL, data migration and MI reporting (for generation, testing and verification purposes). Effectively and accurately comparing datasets, be they large or small, is our bread and butter.
In the next few posts I want to talk about some of the most common data comparison techniques I use and how I do them. (Just as an aside, it’s how I do them – I’d love to hear your comments on alternatives and improvements!)
I’m going to start off with my little rant on how people too often (especially in Excel) ignore the simplest and most basic comparison method in lieu of more complicated methods (I allow myself a little data rant every now and then…) before moving on to:
1. VLookUp 101
Perhaps the most ubiquitous Excel formula around. There are hundreds of sites offering tutorials throughout the web on how to do a VLookUp but I wouldn’t be able to sleep at night if my blog didn’t include one!
2. Improving VLookUp Performance
Ever sat at your desk for five, even ten, minutes while Excel calculates a large VLookup, or went to make a coffee only to return to a lookup only 78% calculated? It’s highly irritating, really stunts your performance and puts you off your flow. And if you have to do several?! Painful… This post will be about making lookups faster, how to necessitate doing fewer of them and when a VLookUp is just flat out the wrong tool to be using.
3. Large dataset comparison in MySQL
Even if you’ve migrated to Excel 2007 or 2010 with it’s hugely increased range, quite often manipulation of large datasets is just too big for Excel even for such basic manipulation as a lookup or a countif. If data is given to you in an Excel or CSV format, most people will grin and bear it but thankfully – with a little initial set up – sticking data into MySQL, querying/transforming it and spitting it out again can take significantly less time than any spreadsheet calculation ever could.
4 – 7. Identifying changes between 2 datasets
Firstly a little VBA to highlight changes (additions, deletions and amendments) between an old and new dataset in Excel. Then an SQL query to perform the same function between two database tables (or two subqueries). How to make it happen with XML data and finally with JSON.
8. Identifying changes in tabular data
Hugely embarrassing when your boss or client ask you why the running sales figures in your dashboard have gone down and it’s due to a manual error. To avoid that this kind of checking, this very soon becomes a staple check before sending out any MI reports to make sure your stats haven’t somehow gone in the wrong direction! (Or just to quickly find out where the changes are in larger tables… Both are good)
9. Standardising data for comparison
One problem with comparing data is that lots of things can make it look like a mismatch when actually it isn’t (and vice versa although this is less common). To round off the series I’ll list a few of the typical cleaning methods which should normally be used before trying to compare disparate datasets. I’m not going into real ‘data cleansing’ per se, just highlight some basic tidy ups that often need to be done for comaprison techniques to work. But fear not (!!) I have a series of posts planned on data cleansing to follow! (I can sense the anticipation from here…) =D
These are all the ones I’ve come up with so far and plan to do more in the future but if any stray reader has other ideas for common data comparison requirements that I’ve been remiss in omitting please feel free to leave a comment and I’ll add a post onto the series!