Skip navigation

Category Archives: XML

eXtensible Markup Language

Hello world,

Comparing two or more datasets is one of the data monkey’s most rudimentary functions. Whether it’s checking records from one set have been included in (or excluded from) another, highlighting amended data between two sets or validating your tabular stats against last week’s, comparison of data is something your average data monkey does several times per day if not per working hour.

It rears it’s head not only as a function in it’s own right, required by pretty much anyone who works with data at any level, but in almost every other function performed by your general data geek such as ETL, data migration and MI reporting (for generation, testing and verification purposes). Effectively and accurately comparing datasets, be they large or small, is our bread and butter.

In the next few posts I want to talk about some of the most common data comparison techniques I use and how I do them. (Just as an aside, it’s how I do them – I’d love to hear your comments on alternatives and improvements!)

I’m going to start off with my little rant on how people too often (especially in Excel) ignore the simplest and most basic comparison method in lieu of more complicated methods (I allow myself a little data rant every now and then…) before moving on to:

1. VLookUp 101

Perhaps the most ubiquitous Excel formula around. There are hundreds of sites offering tutorials throughout the web on how to do a VLookUp but I wouldn’t be able to sleep at night if my blog didn’t include one!

2. Improving VLookUp Performance

Ever sat at your desk for five, even ten, minutes while Excel calculates a large VLookup, or went to make a coffee only to return to a lookup only 78% calculated? It’s highly irritating, really stunts your performance and puts you off your flow. And if you have to do several?! Painful… This post will be about making lookups faster, how to necessitate doing fewer of them and when a VLookUp is just flat out the wrong tool to be using.

3. Large dataset comparison in MySQL

Even if you’ve migrated to Excel 2007 or 2010 with it’s hugely increased range, quite often manipulation of large datasets is just too big for Excel even for such basic manipulation as a lookup or a countif. If data is given to you in an Excel or CSV format, most people will grin and bear it but thankfully – with a little initial set up – sticking data into MySQL, querying/transforming it and spitting it out again can take significantly less time than any spreadsheet calculation ever could.

4 – 7. Identifying changes between 2 datasets

Firstly a little VBA to highlight changes (additions, deletions and amendments) between an old and new dataset in Excel. Then an SQL query to perform the same function between two database tables (or two subqueries). How to make it happen with XML data and finally with JSON.

8. Identifying changes in tabular data

Hugely embarrassing when your boss or client ask you why the running sales figures in your dashboard have gone down and it’s due to a manual error. To avoid that this kind of checking, this very soon becomes a staple check before sending out any MI reports to make sure your stats haven’t somehow gone in the wrong direction! (Or just to quickly find out where the changes are in larger tables… Both are good)

9. Standardising data for comparison

One problem with comparing data is that lots of things can make it look like a mismatch when actually it isn’t (and vice versa although this is less common). To round off the series I’ll list a few of the typical cleaning methods which should normally be used before trying to compare disparate datasets. I’m not going into real ‘data cleansing’ per se, just highlight some basic tidy ups that often need to be done for comaprison techniques to work. But fear not (!!) I have a series of posts planned on data cleansing to follow! (I can sense the anticipation from here…) =D

 

These are all the ones I’ve come up with so far and plan to do more in the future but if any stray reader has other ideas for common data comparison requirements that I’ve been remiss in omitting please feel free to leave a comment and I’ll add a post onto the series!

 

Boy, was my face red…

I’m always partly overjoyed and partly embarassed when I realise I’ve been doing something the long way round and there’s been a brutally easy method available all along.

Like many other developers, I started learning with VBA, and was largely self-taught. As such I became used to doing some things the long way round. When I subsequently started learning other languages such as c# and PHP I learned how to do things quicker and more efficiently and for a while it just never crossed my mind to rethink how I was coding in VBA so it’s only recently dawned on me that I could use xPath in VBA development, even though I’ve been using it for ages with other frameworks.

xPath is a method of navigating though all elements of an XML document. It is the basis of xQuery which is often described as the XML equivalent of SQL for relational databases. I’m not going to go much into the syntax of xPath today, that will have to follow in a subsequent post. (It’s fairly intuitive anyway!) For now I’ll just assume you know what it is and post some code snippets on how to use it through VBA. And if you simply can’t wait for an xPath post then sites such as the w3Schools, as always, have very comprehensive tutorials.

Query XML and return results

Firstly, to do anything with XML in VBA you need to set a reference to the Microsoft XML library (Tools -> References) – I’m using v6.0 but everything back to v2.6 will work fine for the examples here.

And secondly, a brief overview of the objects to be used:

  • MSXML2: This is the library reference you made above (and is not strictly necessary but I think prefixing it here makes the code easier to read).
  • DOMDocument: Represents the entire XML document
  • IXMLDOMNode: A single XML node
  • IXMLDOMNodeList: This is a list of nodes which will hold the results of the xPath query.
The snippet below is going to:
  1. Load an XML document (The example file I used can be viewed here: contacts.xml). The Load method returns a boolean value so can be tested for error trapping.
  2. Define a basic xPath query. The below example selects contacts with an ID of “C0005″. Note this will select the whole contact node. If I wanted to select just one of the child nodes (e.g. email) that could be appended to the query: “//contacts/contact[@id='C0005']/email”
  3. Fill the Node List with the results of the query. Note – If you’re familiar with using xPath in javascript then using it in VBA is (quite understandably) comparable to using it in IE (“.selectNodes(xPath)”) as compared to every other browser which uses the .evaluate method… but I digress.. In this case there will be only one contact node returned for contact C0005.
  4. Give some output. Here it will print the first child node (base 0) which in this case is ‘dept’.
Sub xPathQueryByID()

    Dim theDoc As New MSXML2.DOMDocument: Set theDoc = New DOMDocument
    Dim xmlFilePath As String: xmlFilePath = "F:\Documents\contacts.xml"
    Dim xList As IXMLDOMNodeList, xNode As IXMLDOMNode, xChild As IXMLDOMNode
    Dim xPath As String

    '' ** Load XML file
    If Not theDoc.Load(xmlFilePath) Then
        MsgBox ("Failed to load XML file."), vbCritical, "Sub Aborted"
        Exit Sub
    End If

    '' ** Query
    xPath = "//contacts/contact[@id='C0005']" 'Return Contact node with ID = C0005
    Set xList = theDoc.SelectNodes(xPath)

    '' ** Do something with the results
    For Each xNode In xList
        Debug.Print xNode.ChildNodes(0).nodeTypedValue
    Next xNode

    Set theDoc = Nothing

End Sub

Simple enough =D

As a brief refresher on xPath, remember that the ‘@’ is required to query the node’s attributes. Note also that the query can include multiple parametres…

xPath = "//contacts/contact[staff='1' and grade<='2']"

… which would output a nodeset of staff who are grade 2 or less. It could also include any of the numerous xPath functions, such as ‘fn:position()’.

Query and Update XML File

Finally, note that having a node list creates a reference to the returned nodes, not a copy. As such you can amend and update the DOM by looping through each node in your results:

'' ** Query XML file
  xPath = "//contacts/contact[@id='C0007']/staff" 'Staff child node from Contact ID = C0007,
  Set xList = theDoc.SelectNodes(xPath)

'' ** Update each node returned
  For Each xNode In xList
        xNode.nodeTypedValue = 0 'Update node value: Mr Stockman is no longer Staff
  Next xNode

  theDoc.Save xmlFilePath
  Set theDoc = nothing

Main lesson for me here: Always question if there is a better way of doing things. If you’ve been doing something the same way for over 12 months, it’s a fair bet there are better ways of doing it now!

I still haven’t found any way of using xQuery in VBA so am still more limited than with other languages but it nevertheless made my day when that little light bulb appeared over my head =D

And if any reader has applied xQuery to VBA (natively, without the use of xlDNA etc.) then I’d love to hear about it in the comments!