25 February 2017

Statistics Routines

In addition to work on SigmaSwiftStatistics, we're writing some statistics libraries for other languages.

We're a bit sneaky, because we find this a great way to get familiar with languages old and new, but it also provides a useful service for everyone else.

Currently, we're writing some routines in Ada to complement the Swift ones. You can get the Ada code here. We'll be creating a Prolog repository soon.

19 February 2017

Consulting at the Department for International Trade

Thought Into Design Ltd is heavily involved with some research activities at the Department for International Trade (DIT). We're very excited to be collaborating with DIT on a range of research projects. We've already undertaken some usability testing and we're hoping to add our skills and knowledge to all the rest that the DIT has available.

This is coming very soon after our stint at helping the Office for National Statistics improve their Interdepartmental Business Register (IDBR), which is a register of 2.1 million UK businesses and fully compliant with the European Union regulation on harmonisation of business registers for statistical purposes (EC No 177/2008).

18 February 2017

Trying out Swift

Lately, at Thought Into Design, we've been trying out Swift. This was partially so that we could write apps for iOS and OSX (or is that MacOS now?) but it is also good to learn the foibles of a new language's syntax.

Our most recent work has me (Alan) contributing to an open source statistics library called SigmaSwiftStatistics. It's been great fun so far and we've contributed a few descriptive functions (to be cleaned up and made better by the project's maintainer, Evgenii Neumerzhitckii) such as the coefficient of variation, geometric mean, harmonic mean, skewness, kurtosis and Hyndman & Fan's nine quantile methods, plus a handful of nonparametric routines such as a decent mode routine and a routine to rank data.
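For readers more at home in Python, a few of these descriptive measures can be sketched with the standard library. This is only an illustration, not the SigmaSwiftStatistics code: the function name `coefficient_of_variation` and the choice of the sample (n - 1) standard deviation are assumptions.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumes a non-zero mean)."""
    return statistics.stdev(values) / statistics.mean(values)

# Made-up sample data, purely for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("geometric mean:", statistics.geometric_mean(sample))
print("harmonic mean:", statistics.harmonic_mean(sample))
print("coefficient of variation:", coefficient_of_variation(sample))
```

The geometric and harmonic means both assume positive data, so they suit ratios and rates rather than values that can cross zero.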

The mode is more useful than, say, that found in SPSS because it reports not just the modal value but all the indices at which it occurs. It's been some years since I used SPSS, but when I did, I recall it only provided the first index of occurrence. Sometimes, it's useful to know everywhere that it occurs.
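The idea of a mode that reports every position can be sketched in a few lines of Python. This is not the Swift routine itself; the name `modes_with_indices` and the shape of the return value are assumptions made for the example.

```python
from collections import defaultdict

def modes_with_indices(values):
    """Return (modal_values, count, indices).

    indices maps each modal value to every position at which it occurs,
    so ties and repeated occurrences are all reported.
    """
    positions = defaultdict(list)
    for idx, value in enumerate(values):
        positions[value].append(idx)
    if not positions:
        return [], 0, {}
    top = max(len(idxs) for idxs in positions.values())
    indices = {v: idxs for v, idxs in positions.items() if len(idxs) == top}
    return list(indices), top, indices

values, count, where = modes_with_indices([3, 7, 3, 9, 7, 3, 1])
print(values, count, where)
```

Because every modal value is kept, multimodal data comes back with all its modes rather than just the first one found.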

Anyway, we've been committing to this project and are sure you'll find it useful. It's in CocoaPods so it should be simple to add to a project.

Over the last few evenings, we've been coding up a routine for univariate analysis of variance (both within and between subjects). It's passed its initial tests (using ANOVA tables generated from SPSS) so the code is almost ready to go. We hope to get it committed sometime this weekend.
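As a rough illustration of what the between-subjects case involves, here is a minimal Python sketch of a one-way ANOVA table. It is not the Swift code described above; the function name and return layout are assumptions, and the within-subjects case (which also partitions out subject variance) is not covered.

```python
def one_way_anova(groups):
    """One-way between-subjects ANOVA on a list of groups of scores.

    Returns (ss_between, ss_within, df_between, df_within, f_ratio).
    """
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

# Made-up scores for three independent groups.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

Checking the sums of squares and degrees of freedom against a table from an established package, as described above, is a sensible way to test a routine like this.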

We're also keen to produce some decent nonparametric tests. When people write routines, there is a tendency to focus on tests for parametric data, and tests for nonparametric data become the forgotten children. We've found nonparametric tests to be excellent in many real-life circumstances (heavily skewed or kurtotic data, non-normal distributions, scales of less than 11 items if I remember my Nunnally correctly) so we want to ensure they are included.
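Most nonparametric tests start by replacing the data with ranks, so a ranking routine is the natural building block. The following Python sketch assumes the common convention that tied values share the mean of the ranks they span; the name `rank_data` is hypothetical and this is not the Swift implementation.

```python
def rank_data(values):
    """Rank values from 1 (smallest); ties receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

print(rank_data([10, 20, 10, 30]))
```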

In the future, I would like to investigate writing a Swift wrapper of some kind around BLAS so that a massive range of linear algebraic routines (often heavily optimised) will be available. When using Swift, I find I miss NumPy and SciPy, and a layer on top of BLAS would help bring some of that serious power over.

Here's to statistics on Swift!

13 February 2015

Fast prime numbers in Python

I spent some time recently on Project Euler and got side-tracked by the efficient calculation of prime numbers. After using a brute force method (iterating through a range of numbers and trying to find their factors), I read around and found a nice page at http://rebrained.com/?p=458 and a good Stack Overflow page at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n. These inspired me to try again and I came up with the following routine. It's faster than most, but not as fast as primes6 on the first page I linked to once the upper limit passes approximately 350,000-400,000. Below that, nothing seems to touch it.

It's also different. It uses numpy (which is cheating, in a way) but does the job well. I like it because it seems more understandable once you've grasped that the routine doesn't do any division. Instead, it's pure sieve operations on a vector of booleans. Anything found to be divisible by anything other than 1 and itself is marked as False, and the routine finishes by returning the indices of True values - which are primes.

I've triangulated the results by summing them and comparing the sums against those of other routines, and I haven't noticed any differences yet.

import numpy as np
from math import sqrt

def ajs_primes3a(upto):
    """Return the primes below upto as a numpy array of indices."""
    if upto < 2:
        return np.array([], dtype=int)  # no primes below 2
    mat = np.ones(upto, dtype=bool)  # set up a long boolean array
    mat[0] = False  # remove 0
    mat[1] = False  # remove 1
    mat[4::2] = False  # remove every multiple of 2 except 2 itself
    for idx in range(3, int(sqrt(upto)) + 1, 2):  # odd candidates up to sqrt(upto)
        mat[idx * 2::idx] = False  # remove every multiple of idx except idx itself
    return np.flatnonzero(mat)  # the indices still True are the primes

I'm quite pleased with this early foray into optimising a routine but there's work to do compared to prime6. What I like is that it has no division and instead seems to be a pure sieve and doesn't create a long list of numbers.
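The sum check mentioned earlier can be sketched as follows. To keep the example free of dependencies, a pure-Python `slice_sieve` stands in for the numpy routine (it uses the same slice-marking idea on a bytearray), and `trial_division_primes` and `agrees_with_reference` are hypothetical helper names. Any function that returns the primes below `upto` could be passed in instead.

```python
def trial_division_primes(upto):
    """Slow but easy-to-trust reference: primes below upto."""
    primes = []
    for n in range(2, upto):
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
    return primes

def slice_sieve(upto):
    """Stand-in for the routine under test, using slice assignment on a bytearray."""
    if upto < 2:
        return []
    flags = bytearray([1]) * upto
    flags[0] = flags[1] = 0
    for idx in range(2, int(upto ** 0.5) + 1):
        # Zero every multiple of idx except idx itself.
        flags[idx * 2::idx] = bytes(len(range(idx * 2, upto, idx)))
    return [i for i, flag in enumerate(flags) if flag]

def agrees_with_reference(candidate, upto):
    """Compare the sum and count of candidate(upto) with the reference."""
    expected = trial_division_primes(upto)
    got = [int(p) for p in candidate(upto)]
    return sum(got) == sum(expected) and len(got) == len(expected)

print(agrees_with_reference(slice_sieve, 10_000))
```

Matching sums and counts are strong evidence rather than proof; comparing the full lists is stricter if memory allows.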

I tried other versions with a half-series so that anything divisible by 2 just wasn't considered, but what I came up with just wasn't as fast.
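For reference, one way a half-series sieve can be laid out is sketched below. This is not the version tried above: it is a pure-Python illustration under the assumption that index i stands for the odd number 2*i + 1, and it starts marking at the square of each prime rather than at twice the prime.

```python
def odd_only_primes(upto):
    """Primes below upto, keeping flags only for odd numbers."""
    if upto < 3:
        return []  # no primes below 2, so nothing to return for upto <= 2
    size = upto // 2  # how many odd numbers 1, 3, 5, ... lie below upto
    flags = bytearray([1]) * size
    flags[0] = 0  # index 0 is the number 1, which is not prime
    i = 1
    while (2 * i + 1) ** 2 < upto:
        if flags[i]:
            p = 2 * i + 1
            start = p * p // 2  # index of p*p; stepping p indices skips even multiples
            flags[start::p] = bytes(len(range(start, size, p)))
        i += 1
    return [2] + [2 * i + 1 for i, flag in enumerate(flags) if flag]

print(odd_only_primes(30))
```

The flag array is half the size, at the cost of the index arithmetic, which is where much of the readability of the full-length version is lost.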

Times (msecs, same machine, best of 3-6 multiple runs)

             10k      100k     500k     1m       20m
prime6       0.001258 0.002722 0.007229 0.001229 0.22388
erat         0.005414 0.059047 0.333737 0.673749 15+ seconds
ajs_primes3a 0.000360 0.001897 0.008540 0.016952 0.70135

Up to 100k, mine leads but prime6 takes over strongly after that. Mine doesn't lose too much ground, considering, so it's best to think of mine as fast-ish but nicely understandable. 
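A best-of-several timing like the table above can be gathered with the standard `timeit` module. The sketch below is an assumption about method rather than the harness actually used for those figures; `best_time` and `simple_sieve` are hypothetical names, and the plain-Python sieve is only there so the example has something to time.

```python
import timeit

def simple_sieve(upto):
    """Plain-Python sieve used only as something to time."""
    flags = [True] * upto
    flags[:2] = [False] * min(2, upto)
    for idx in range(2, int(upto ** 0.5) + 1):
        if flags[idx]:
            for multiple in range(idx * idx, upto, idx):
                flags[multiple] = False
    return [i for i, flag in enumerate(flags) if flag]

def best_time(func, arg, repeats=5):
    """Best wall-clock time in seconds for a single call of func(arg)."""
    return min(timeit.repeat(lambda: func(arg), number=1, repeat=repeats))

for n in (10_000, 100_000):
    print(n, best_time(simple_sieve, n))
```

Taking the minimum rather than the mean is the usual choice here, since slower runs mostly reflect other activity on the machine.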

23 August 2013

Crowd-sourcing research

One idea I had a few years ago was to use the various crowd-sourcing websites as a source of willing and cheaply-paid participants for UX research.

Wait, you fool! You cannot do an hour-long usability session like that!

Well, the keyword is reductionism. UX research is discovering brain and behaviour. It's psychology. And a few psychologists have already been using crowd-source sites as sources of participants. [1, 2]

The overall results are quite promising: It is possible to undertake such experiments with a crowd-sourced experiment sample. There are, however, precautions to be taken, and it's only fair to pay participants a decent amount if only to ensure a low drop-out rate and faster participation.

So it's not cheaper but it is faster. One study [2] said, "Performing a full-sized replication of the Nosofsky et al. [40] data set in under 96 hours is revolutionary." For UX research, it shows a wonderful promise for particular questions as long as the experiment is designed well.

But I also want to make sure that I'm dealing with an ethical company. In my new role as a research lead for Vodafone, there is a reputational risk to the company. This means that I've been participating in some of these sites as a worker to check out conditions from within. A lot depends upon the conditions of the micro-tasks, but companies also have their own attitudes to workers, which we take into account. We will not work with companies that argue about or unnecessarily delay payments to workers, or use petty reasoning to 'trick' workers out of their money.

To guard against reputational risk, we will engage only those companies that treat workers with at least some respect.

I don't expect this post to make any waves, at least not amongst the crowd-sourcing sites, because our work is fairly small potatoes for them. But we are a large company with expanding requirements, and it's always good to remember that those on the bottom can also, with a slight change in context, become the one who pays the piper.


[1]   Paolacci G, Chandler J, Ipeirotis PG (2010) Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5(5): 411-419.

[2]   Crump MJC, McDonnell JV, Gureckis TM (2013) Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE 8(3): e57410. doi:10.1371/journal.pone.0057410

09 July 2013

Will UX exhaust itself?

When I first started in the field (originally doing usability along with some design), it was easy to make an impact. Just think of all those nasty 1990s websites with fundamental flaws that could be remedied with a wave of a good developer's hand. It was like that.

But now, the whole world is getting on-board with user experience and the upshot is that everyone is a little more savvy than they used to be. This in turn makes small-change-big-impact gigs far less frequent. UX is focusing increasingly upon minutiae because that's where the benefits will be found.

But eventually, the law of diminishing returns will kick in and we will come to the point where UX talent will be employed for the quick wins and little else. Or perhaps even not at all.

Another possibility is that UX practitioners will increasingly focus on niche areas. "Hey!", they'll say. "You need someone to optimise a revenue stream from selling vintage astronomy books? Well, I've done vintage chemistry books which is much the... Oh, okay. So you think I haven't got the skills..."

I wonder if this might lead not so much to a contraction in the market, but rather a slowdown? I can envisage companies saying, "Well, these are all UX problems but they're pretty much solved so let's hold off on that freelancer."

19 May 2013

Excel for serious data analysis?!

Let's all laugh at Excel - sure, I do. When doing data analysis, it's
good for data entry but I'd hate to rely on it for anything serious.

But a while ago, I came across a good use case for it. I'm sure R
could do the same thing fairly well if needed, but here's a nice quick
and dirty method of exploring data sets with lots of variables.

What's the problem with lots of variables? Data analysts these days
are supposed to start drooling when they get more data but the reality
is that this makes it harder to see what's going on. The chances
are that the noise is outgrowing the data. Yeah, we can all be pompous
and promote ourselves by saying, "Hey, dirty data is what I live for -
so just get out of my way, I'm coming through" and other such macho

If we're really analysing data, at some point we want answers. So how
can we decrease the noise?

For an example, I'm going to look at an available data set from
SEOMoz. What they did was monitor a number of websites and their
Google rankings in response to various queries.

But to me, trying to understand what happened here, it's
confusing. Models with this number of variables are probably going to
fail because the ability to discriminate a variable's effect
often falls with the addition of an influencing variable - even if that
influencing variable is orthogonal to what we're measuring.

So my first job was to more clearly define my problem space by
reducing noise. I did this by correlating each variable with every
other variable. This formed a nice square matrix of correlations, with
a diagonal consisting of exactly 1.0 (each variable correlated with
itself). By the way, I'm not looking at significance here so alpha
inflation is not an issue.
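Outside Excel, the same square matrix can be built in a few lines of Python. This is a minimal sketch assuming Pearson correlations and complete data; the column names and values are made up for illustration and are not taken from the SEOMoz data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns maps a variable name to its list of values."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical variables, purely for illustration.
data = {
    "links": [1, 2, 3, 4, 5],
    "shares": [2, 4, 6, 8, 10],
    "length": [5, 3, 4, 1, 2],
}
names, matrix = correlation_matrix(data)
for name, row in zip(names, matrix):
    print(name, [round(r, 2) for r in row])
```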

But this was still hard to really visualise. Visualising is a critical early step in almost any analysis process. It helps me develop a mental model of the data I'm going to be working with.

But here, Excel and conditional formatting came to the rescue. I figured
that if I could colour a cell according to the strength of a
correlation, I might be able to get a high level view of the data
which lets me focus on bits of interest.

So I copied out a conditional formula which is probably useful for
any correlation matrix.

Then by making cells tiny, I could get a nice visualisation that
allowed me to identify groups of variables that appeared to co-vary
strongly (the darker bits).
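The shading idea carries over to plain text as well. In this rough Python
sketch each correlation becomes one character, darker for stronger. The
five-step character scale and the example matrix are assumptions, not
the Excel rule used above.

```python
SHADES = " .:*#"  # lightest to darkest, one step per 0.2 of |r|

def shade(r):
    """Pick a character by the absolute strength of the correlation."""
    level = min(int(abs(r) * len(SHADES)), len(SHADES) - 1)
    return SHADES[level]

def render(matrix):
    """Render a correlation matrix as one line of characters per row."""
    return "\n".join("".join(shade(r) for r in row) for row in matrix)

# Hypothetical correlations between four variables.
example = [
    [1.00, 0.93, 0.10, -0.05],
    [0.93, 1.00, 0.15, 0.02],
    [0.10, 0.15, 1.00, -0.88],
    [-0.05, 0.02, -0.88, 1.00],
]
print(render(example))
```

Using the absolute value means strong negative correlations show up as
dark as strong positive ones, which is what matters when hunting for
variables that move together.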

I could look at these in more detail and decide whether or not to
delete them or keep them and simplify the data set somewhat. It was
helpful to go to others with a list of variables to ignore and tell
them to focus on replacements instead.
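Picking out the candidate variables can also be done mechanically. The
sketch below lists every pair whose correlation is at least as strong as
a threshold; the 0.9 cut-off, the function name and the example values
are all assumptions for illustration.

```python
def strongly_correlated_pairs(names, matrix, threshold=0.9):
    """List (name_a, name_b, r) for each pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skips the diagonal
            if abs(matrix[i][j]) >= threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs

# Hypothetical variable names and correlations.
names = ["links", "shares", "length", "age"]
matrix = [
    [1.00, 0.93, 0.10, -0.05],
    [0.93, 1.00, 0.15, 0.02],
    [0.10, 0.15, 1.00, -0.91],
    [-0.05, 0.02, -0.91, 1.00],
]
for a, b, r in strongly_correlated_pairs(names, matrix):
    print(a, b, r)
```

Which member of each pair to drop is still a judgement call that depends
on what the variables mean.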

Of course, this is only a very small part of the story, but as a few
first steps, it was useful to reduce the noise.