19 May 2013

Excel for serious data analysis?!

Let's all laugh at Excel - sure, I do. When doing data analysis, it's
good for data entry but I'd hate to rely on it for anything serious.

But a while ago, I came across a good use case for it. I'm sure R
could do the same thing fairly well if needed, but here's a nice quick
and dirty method of exploring data sets with lots of variables.

What's the problem with lots of variables? Data analysts these days
are suppose to start drooling when they get more data but the reality
is that this makes things harder to see what's going on. The chances
are that the noise is outgrowing the data. Yeah, we can all be pompous
and promote ourselves by saying, "Hey, dirty data is what I live for -
so just get out of my way, I'm coming through" and other such macho

If we're really analysing data, at some point we want answers. So how
can we decrease the noise?

For an example, I'm going to look at an available data set from
SEOMoz. What they did was monitor a number of websites and their
Google rankings in response to various queries.

But to me in the task of trying to understand what happened here, it's
confusing. Models with this number of variables are probably going to
fail because the ability to discriminate a variable's effect will
often lower with the addition of a influencing variable - even if that
influencing variable is orthogonal to what we're measuring.

So my first job was to more clearly define my problem space by
reducing noise. I did this by correlating each variable with eath
other variable. This formed a nice square matrix of correlations, with
a diagonal consisting of exactly 1.0 (each variable correlated with
itself). By the way, I'm not looking at significance here so alpha
inflation is not an issue.

But this was still hard to really visualise. Visualising is a critical early step in almost any analysis process. It helps me develop a mental model of the data I'm going to be working with.

But here, Excel and conditional formulae came to the rescue. I figured
that if I could colour a cell according to the strength of a
correlation, I might be able to get a high level view of the data
which let's me focus on bits of interest.

So I copied out a conditional formulae which is probably useful for
any correlation matrix.

Then by making cells tiny, I could get a nice visualisation that
allowed me to identify groups of variables that appeared to co-vary
strongly (the darker bits)

I could look at these in more detail and decide whether or not to
delete them or keep them and simplify the data set somewhat. It was
helpful to go to others with a list of variables to ignore and tell
them to focus on replacements instead.

Of course, this is only a very small part of the story, but as a few
first steps, it was useful to reduce the noise.