29 January 2013

SEO: Scraping synonyms from Wikipedia

Here's a Python script for scraping synonyms from Wikipedia. You provide the core keywords, and Python (plus the BeautifulSoup module) will get the synonyms.

The last article I wrote about getting SEO keywords from Wikipedia seemed interesting to people. The method was, however, manual, which takes time and effort to complete for more than a couple of keywords.

If you want to hurry things up or automate the process, below is a Python script which scrapes potential keywords from Wikipedia. If you’re going to be doing a lot of this, I’d be tempted to download a snapshot of the Wikipedia database and use that rather than downloading. This script is really intended for educational purposes or those who want to retrieve a very small list of synonyms.

The output is a CSV (comma separated file) that can be opened in a spreadsheet. I find LibreOffice much better at handling Unicode content from a CSV file than Excel and it’s a free download. Just don’t ask me how to import Unicode CSV with Excel!

Each series of synonyms requires a lot of cleaning but it’s easier than downloading it all yourself.

"""
Scrape synonyms from Wikipedia
"""

import urllib2
import BeautifulSoup as BS

# phrases go in here as a list of strings
# e.g., names = ['United_kingdom','United_States','Philippines'] looks for synonyms for the UK, US and the Philippines

names = ['United_kingdom','United_States','Philippines'] 

URL = "http://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/%s&hidetrans=1&hidelinks=1&limit=500"

fout = open('synonyms_test.csv','w')
for name in names:
    active_URL = URL%name
    print active_URL
    req = urllib2.Request(active_URL, headers={'User-Agent' : "Magic Browser"})
    data = urllib2.urlopen(req)
    stuffs = data.read()
    soup = BS.BeautifulSoup(stuffs)
    links_body = soup.find("ul", {"id" : "mw-whatlinkshere-list"})
    fout.write('%s, '%name)
    try:
        links = links_body.findAll('a')
        for link in links:
            if link.text != "links":
                fout.write('%s, '%link.text.encode('utf-8'))
        fout.write('\n')
    except AttributeError: # needed in case nothing is returned
        pass
fout.close()

A couple of things: this code needs BeautifulSoup installed. See their install notes on how to do this. What this module does is parse the Wikipedia page. The script then iterates through the names you’ve provided with scrapes a page for the first 500 links that are not transduction pages or just plain links. This reduces the problem space down to mostly synonymous re-directs which is what we want.

To run this script, you need to see line 11 and insert the phrases you want to retrieve synonyms for. Don’t leave spaces: replace them with underscores. The script also doesn’t check whether the phrase you’ve used is the canonical page either so that’s something you need to check for.

Once the script’s finished, load the CSV file into LibreOffice Calc (or some other form of spreadsheet that can load CSV files with Unicode), and delete anything that clearly isn’t a valid synonym for SEO purposes.

When that’s done, delete all the blanks and shift cells left (not up!), and you should have a spreadsheet full of nice synonyms that can enhance your SEO.

Happy scraping!

Post a Comment