Scraping the Geno 2.0 Next Generation webpage in Python using BeautifulSoup

Beautiful Soup is a Python library for searching and extracting what we need from a document. In this post, I use it to access the data on the Geno 2.0 Next Generation webpage for each population. The project contains information from more than 830,000 volunteers in 140 countries who have participated. The webpage summarizes the results and shows the share of various genetic affiliations in each population. Novel insights may be hidden in the data, but first of all we need to collect them for post-processing!

The overall workflow looks like this:

  1. Identify a source, either a website URL or a locally saved file.
  2. In BeautifulSoup, use a parser to parse the HTML source code. The default is html.parser. Other options include the html5lib library for sources written in HTML5, but you have to install it separately (see the instructions here). A toy example follows this list.
  3. Find the HTML elements, such as div or a, that hold the required data. We can also select elements with a certain id or class.
  4. Use commands such as findAll or find to get all instances, or a single instance, of the data you are looking for.
  5. Possibly post-process the scraped data into the required format. Here, I collect the data in ordered dictionaries and then convert the dataset into a JSON file and a pandas dataframe.
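To make these steps concrete, here is a tiny, self-contained example on a made-up HTML snippet (the tag and class names are invented for illustration):

from bs4 import BeautifulSoup

# a made-up HTML snippet, just to illustrate the workflow
html = '<div class="pop-1"><p>Eastern Africa</p><h3>2%</h3></div>'

# parse with the built-in parser; pass "html5lib" instead if it is installed
toy_soup = BeautifulSoup(html, "html.parser")

# find a single element with a given class, then pull out the pieces it contains
div = toy_soup.find('div', class_="pop-1")
print(div.find('p').text)                          # Eastern Africa
print([h.text for h in toy_soup.findAll('h3')])    # ['2%']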

This script uses the following libraries:

  • BeautifulSoup: To scrape the webpage
  • collections: To hold the scraped items in an ordered dictionary
  • json: To save extracted data in JSON format
  • pandas: To create a dataframe
  • numpy: For numerical arrays
from bs4 import BeautifulSoup
from collections import OrderedDict
import json
import pandas as pd
import numpy as np

Specifying the source

As mentioned in the workflow above, we can use either a locally saved file or a URL. The usual way is to fetch a URL with a library like urllib or requests. But for this particular webpage, we would not be able to extract all the data that way, because part of it is generated by JavaScript. There are workarounds for scraping JavaScript-rendered pages (like dryscrape), but since I only work with one page, it is easier to save it locally. I have saved a local copy of the webpage in the webpage directory.

# url to scrape
url_to_scrape = 'https://genographic.nationalgeographic.com/reference-populations-next-gen/'

# local file to scrape
file_to_scrape = open("./webpage/Reference Populations - Geno 2.0 Next Generation.html")
# Create a beautifulsoup object from html content
soup = BeautifulSoup(file_to_scrape,"html.parser")
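
For reference, fetching the page over HTTP would look something like the sketch below (using the requests library, which is not imported above). Keep in mind that the response would still be missing the JavaScript-generated parts:

import requests

# fetch the server-rendered HTML; the JavaScript-generated data
# would not be included in this response
response = requests.get(url_to_scrape)
soup_from_url = BeautifulSoup(response.text, "html.parser")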

Looking into soup!

Let’s see what is inside the variable soup. It contains all the HTML elements of the webpage. Looking through the code, I realized that the info I’m interested in is wrapped in <div> elements that look like this:

<div class="pop-211">
...
</div>

The class name is pop-x where x ranges from 200 to 260. But we don’t need to know the exact range, as we will see later. Within each of these <div> elements, there are a few <li> items which look like the following block:

<li class="pop-id-2105" style="width:8%;">
            <div class="wp-autosomal-bar-label">
                <p>Eastern Africa</p>
                <div class="wp-autosomal-bar-line"></div>
            </div>
            <div class="wp-autosomal-bar-section">
                <h3>2%</h3>
            </div>
        </li>

We are interested in the strings within <p> (<p>Eastern Africa</p>) and <h3> (<h3>2%</h3>) tags. So the idea is this:

  1. Find all <div> elements with class=pop-x.
  2. Extract the text within <p> elements in <div class="wp-autosomal-bar-label">.
  3. Extract the text within <h3> elements in <div class="wp-autosomal-bar-section">.

Number two gives the population name and number three gives the percentage. We’re ready to crawl the webpage and extract the data. The idea is to go through the page, collect the data in dictionaries, and later use the dictionaries to create a dataframe.
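As a quick sanity check, we can first try these steps on the single <li> block shown above, pasted here as a string:

snippet = '''
<li class="pop-id-2105" style="width:8%;">
    <div class="wp-autosomal-bar-label">
        <p>Eastern Africa</p>
        <div class="wp-autosomal-bar-line"></div>
    </div>
    <div class="wp-autosomal-bar-section">
        <h3>2%</h3>
    </div>
</li>
'''
test = BeautifulSoup(snippet, "html.parser")
label = test.find('div', class_="wp-autosomal-bar-label").find('p').text
percent = test.find('div', class_="wp-autosomal-bar-section").find('h3').text
print(label, percent)   # Eastern Africa 2%

With that working, the loop below repeats the same extraction for every population on the page.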

# create an empty parent list that will hold
# one ordered dictionary per population
dic = []

for identifier in range(200,261):
    # make sure you use a wide enough range
    # to include all possible numbers

    # create an ordered dictionary to keep
    # all info about genetic contributions
    # of this identifier
    d = OrderedDict()

    # find the <div> element corresponding to `identifier`;
    # it contains all the HTML within that <div>
    data = soup.find('div', class_="pop-"+str(identifier))
    if data is None:
        # skip identifiers that do not appear on the page
        continue

    # the population name is in the first <h3> within this <div>
    population_label = data.findAll('h3')[0].get_text()
    d['title'] = population_label

    # how much each regional affiliation contributes to this population:
    # find the <div>s with the classes mentioned above
    label = [key.find('p').text for key in data.findAll('div',class_="wp-autosomal-bar-label")]
    percent = [key.find('h3').text for key in data.findAll('div',class_="wp-autosomal-bar-section")]

    # make sure that the number of labels
    # and percentages match!
    if (len(label)==len(percent)):
        # if yes, put them in an ordered dictionary
        for i in range(len(label)):
            d[label[i]]=percent[i].split('%')[0]

    # append the ordered dictionary to the parent list
    dic.append(d)

Now we can see what the dictionary for each population looks like. It contains a title (e.g. 'Chinese') and a set of labels (e.g. 'Finland & Northern Siberia', …) with values (e.g. '2', …) for each regional affiliation.

dic[7]
OrderedDict([('title', 'Chinese'),
             ('Finland & Northern Siberia', '2'),
             ('Eastern Asia', '81'),
             ('Central Asia', '8'),
             ('Southeast Asia & Oceania', '7')])

Saving the results

Finally, we need to decide on the best way to store the data in a file. For example, I can save all the results in a JSON (JavaScript Object Notation) file to use later in a D3 visualization.

with open('data.json', 'w') as outfile:
    json.dump(dic, outfile)

But a very common alternative is to store the data in a dataframe. We want a dataframe that looks like this:

Population                           Arabia  Asia Minor
African-American (Southwestern US)        0           0
Altaian (Siberian)                        0           0
Amerindian (Mexico)                       0           0

So all populations are stored in one column while each regional affiliation has a column of its own. The numbers show the percentage share of each regional affiliation in that population. First, I need to find all regional affiliations and all populations by looping through dic, which holds all the scraped data. Since an affiliation can appear in more than one population, we use a set to keep the affiliations unique.

# find all regional affiliations and sort them
regions = sorted(list(set([keys for v in dic for keys in v if keys!='title'])))
# find all populations
titles = [v['title'] for v in dic]
# what is the number of rows in our dataset?
n = len(titles)
# initialize the dataframe, setting all shares to zero temporarily
columns = OrderedDict()
for r in regions:
    columns[r] = np.zeros(n)
df = pd.DataFrame(columns, index=titles)

Now we have created the dataframe with pandas, but all of its elements are zero. So we loop through dic again to fill the dataframe with the scraped values:

for d in dic:
    # the population name
    title = d['title']
    # fill each cell of this row with the percentage value
    for k in d:
        if k != 'title':
            df.loc[title, k] = float(d[k])
df.iloc[0:5, 0:5]
                                    Arabia  Asia Minor  Central Asia  Eastern Africa  Eastern Asia
African-American (Southwestern US)     0.0         0.0           0.0             2.0           0.0
Altaian (Siberian)                     0.0         8.0          42.0             0.0          18.0
Amerindian (Mexico)                    0.0         0.0           0.0             0.0           0.0
Bermudian                              0.0         0.0           0.0             0.0           0.0
Bougainville-Nasioi (Oceania)          0.0         0.0           0.0             0.0          13.0
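
As a side note, pandas can also build a dataframe directly from the list of dictionaries; a rough one-liner sketch (the column order would differ from the sorted one above, and the percentage strings still need to be converted to numbers) is:

# alternative construction (sketch): index by population, fill missing
# affiliations with 0, and convert the percentage strings to floats
df_alt = pd.DataFrame(dic).set_index('title').fillna(0).astype(float)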

Now the dataframe contains the percentage values, and if an affiliation does not contribute to a population, its share is 0. Finally, I move the populations into a column of their own and save the dataset to use in my next project.

df.reset_index(inplace=True)
df.iloc[0:6, 0:5]
                                 index  Arabia  Asia Minor  Central Asia  Eastern Africa
0   African-American (Southwestern US)     0.0         0.0           0.0             2.0
1                   Altaian (Siberian)     0.0         8.0          42.0             0.0
2                  Amerindian (Mexico)     0.0         0.0           0.0             0.0
3                            Bermudian     0.0         0.0           0.0             0.0
4        Bougainville-Nasioi (Oceania)     0.0         0.0           0.0             0.0
5             British (United Kingdom)     0.0         0.0           0.0             0.0
df.to_csv('./geno2.csv')

This post was written in a Jupyter notebook and is available, together with the required dataset, in a GitHub repository.
