Let's all laugh at Excel - sure, I do. For data analysis it's fine for data entry, but I'd hate to rely on it for anything serious. A while ago, though, I came across a good use case for it. I'm sure R could do the same thing perfectly well if needed, but here's a nice quick and dirty method of exploring data sets with lots of variables.

What's the problem with lots of variables? Data analysts these days are supposed to start drooling when they get more data, but the reality is that more variables make it harder to see what's going on. The chances are that the noise is growing faster than the signal. Yes, we can all be pompous and promote ourselves by saying, "Hey, dirty data is what I live for - so just get out of my way, I'm coming through" and other such macho nonsense. But if we're really analysing data, at some point we want answers. So how can we decrease the noise?

As an example, I'm going to look at an available data set from SEOMoz. They monitored a number of websites and their Google rankings in response to various queries. To me, in the task of trying to understand what happened here, it's confusing. Models with this many variables are probably going to fail, because the ability to discriminate a variable's effect will often drop with the addition of an influencing variable - even if that influencing variable is orthogonal to what we're measuring.

So my first job was to define my problem space more clearly by reducing noise. I did this by correlating each variable with each other variable. This formed a nice square matrix of correlations, with a diagonal consisting of exactly 1.0 (each variable correlated with itself). By the way, I'm not looking at significance here, so alpha inflation is not an issue.

But this was still hard to really visualise, and visualising is a critical early step in almost any analysis process: it helps me develop a mental model of the data I'm going to be working with. Here, Excel and conditional formatting came to the rescue. I figured that if I could colour a cell according to the strength of a correlation, I might be able to get a high-level view of the data which lets me focus on bits of interest. So I copied out a conditional formula which is probably useful for any correlation matrix.
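The formula itself didn't survive here, so what follows is a minimal reconstruction of the approach rather than the exact rule from the post. It assumes the raw data sit on a sheet named Data with one variable per column and headers in row 1, that the correlation matrix starts in cell B2 of another sheet, and the 0.7 cut-off is likewise my own illustrative choice. To build the matrix, enter this once in B2 and fill across and down:

    =CORREL(OFFSET(Data!$A:$A,0,ROW()-2), OFFSET(Data!$A:$A,0,COLUMN()-2))

OFFSET picks out the two variable columns from the matrix cell's own row and column position, and CORREL quietly ignores the text headers. Then select the whole matrix and add a conditional formatting rule of the "Use a formula" type with a dark fill for strong correlations, positive or negative:

    =ABS(B2)>=0.7

Repeating the rule with lighter fills at lower cut-offs (say 0.4 and 0.2) gives a graded heat-map effect across the matrix.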
Then, by making the cells tiny, I got a nice visualisation that allowed me to identify groups of variables that appeared to co-vary strongly (the darker bits). I could look at these in more detail and decide whether to delete them or keep them, simplifying the data set somewhat. It was also helpful to go to others with a list of variables to ignore and point them to the replacements instead. Of course, this is only a very small part of the story, but as a few first steps, it was useful to reduce the noise.