Beautiful Soup is a Python library for searching a document and extracting what we need from it. In this post, I use it to access the data for each population on the Geno 2.0 Next Generation webpage. This project contains information on more than 830,000 volunteers from 140 countries who have participated in the project. The webpage summarizes the results and shows the share of various genetic affiliations in each population. Novel insights may be hidden in the data, but first of all we need to collect it for post-processing!
The overall workflow looks like this:
1. Identify a source, whether a website URL or a locally saved file.
2. In BeautifulSoup, use a parser to parse the HTML source code. The parser built into Python is html.parser. Other options include the html5lib library for parsing sources written in HTML5, but you have to install it separately (see the Beautiful Soup documentation for instructions).
3. Find HTML elements such as <a> that hold the required data. We can also select elements with certain attributes, such as a class.
4. Then use commands such as findAll or find to find all instances or a single instance of the data you are looking for.
5. Possibly post-process the scraped data to put it in the required format. Here, I collect it in an ordered dictionary to convert the dataset into a JSON file and a pandas dataframe.
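To make the workflow concrete, here is a minimal sketch that runs the five steps on a tiny hand-written HTML string rather than on the Geno page. The markup, the link texts, and the variable names are made up for illustration; find_all is the current spelling of findAll in Beautiful Soup 4.

```python
from collections import OrderedDict
import json

from bs4 import BeautifulSoup

# 1. Source: a small inline HTML string stands in for a URL or a saved file.
html = """
<ul>
  <li><a href="/one">First link</a></li>
  <li><a href="/two">Second link</a></li>
</ul>
"""

# 2. Parse it with the parser that is built into Python.
soup = BeautifulSoup(html, "html.parser")

# 3-4. Find every <a> element, or only the first one.
all_links = soup.find_all("a")
first_link = soup.find("a")

# 5. Post-process: collect text -> href in an ordered dictionary
#    and serialize it as JSON.
links = OrderedDict((a.get_text(), a["href"]) for a in all_links)
as_json = json.dumps(links)
```

The same pattern applies to a real page: only the source in step 1 and the element names in steps 3 and 4 change.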
This script uses the following libraries:
- BeautifulSoup: to scrape the webpage
- collections: to hold an ordered list of items in a dictionary
- json: to save the extracted data in JSON format
- pandas: to create a dataframe
- numpy: to do numerics
from bs4 import BeautifulSoup
from collections import OrderedDict
import json
import pandas as pd
import numpy as np
Specifying the source
As I said in the workflow above, we could use either a locally saved file or a URL. The most usual way is to fetch the URL using a library such as dryscrape, but since I only work with one page, it is easier to save it locally. I have saved a local copy of the webpage in the webpage folder.
# url to scrape
url_to_scrape = 'https://genographic.nationalgeographic.com/reference-populations-next-gen/'

# local file to scrape
file_to_scrape = open("./webpage/Reference Populations - Geno 2.0 Next Generation.html")

# Create a beautifulsoup object from html content
soup = BeautifulSoup(file_to_scrape, "html.parser")
Looking into soup!
Let’s see what is inside the variable soup. It contains all the HTML elements of the webpage. Looking through the code, I realized that the information I’m interested in is wrapped in <div> elements that look like this:
<div class="pop-211"> ... </div>
The class name is pop-x, where x ranges from 200 to 260. But we don’t need to know the exact range, as we will see later. Within each of these <div> elements, there are a few <li> items which look like the following block:
<li class="pop-id-2105" style="width:8%;">
  <div class="wp-autosomal-bar-label">
    <p>Eastern Africa</p>
    <div class="wp-autosomal-bar-line"></div>
  </div>
  <div class="wp-autosomal-bar-section">
    <h3>2%</h3>
  </div>
</li>
We are interested in the strings within the <p> (<p>Eastern Africa</p>) and <h3> (<h3>2%</h3>) tags. So the idea is this:
1. Find all <li> items within the <div> of a population.
2. Extract the text within the <p> tag.
3. Extract the text within the <h3> tag.
Number two gives the name of the regional affiliation and number three gives us the percentage. We’re ready to crawl the webpage and extract the data. The idea is to go through the page and collect the data in dictionaries. Later, we use the dictionaries to create a dataframe.
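Before running this on the whole page, the extraction can be tried on the single <li> item shown earlier. This is only a sketch: the markup is copied from the block above, while the variable names and the use of rstrip to drop the percent sign are my own choices for illustration.

```python
from bs4 import BeautifulSoup

# One <li> item copied from the page source shown earlier.
snippet = """
<li class="pop-id-2105" style="width:8%;">
  <div class="wp-autosomal-bar-label">
    <p>Eastern Africa</p>
    <div class="wp-autosomal-bar-line"></div>
  </div>
  <div class="wp-autosomal-bar-section">
    <h3>2%</h3>
  </div>
</li>
"""

item = BeautifulSoup(snippet, "html.parser").find("li")

# Text of the <p> tag: the name shown on the bar label.
name = item.find("p").get_text()

# Text of the <h3> tag: the share, with the trailing '%' removed.
share = item.find("h3").get_text().rstrip("%")

print(name, share)  # Eastern Africa 2
```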
# create an empty parent list that will hold
# one dictionary per population
dic = []

# make sure you use a wide enough range
# to include all possible numbers
for identifier in range(200, 261):
    # find the <div> element corresponding to `identifier`;
    # it contains all the HTML code within that <div>
    data = soup.find('div', class_="pop-" + str(identifier))

    # skip identifiers that do not exist on the page
    if data is None:
        continue

    # create an ordered dictionary to keep all info about
    # the genetic contributions of this identifier
    d = OrderedDict()

    # population selected to find its genetic contributions
    population_label = data.find('h3').get_text()
    d['title'] = population_label

    # how much each affiliation contributes to the selected population:
    # find the <div>s with the mentioned classes
    label = [key.find('p').text for key in data.findAll('div', class_="wp-autosomal-bar-label")]
    percent = [key.find('h3').text for key in data.findAll('div', class_="wp-autosomal-bar-section")]

    # make sure that the number of labels and percentages match!
    if len(label) == len(percent):
        # if yes, put them in the ordered dictionary
        for i in range(len(label)):
            # keep only the number in front of the '%' sign
            d[label[i]] = percent[i].split('%')[0]

    # append the ordered dictionary to the parent list
    dic.append(d)
Now we can see what the dictionary for each population looks like. It contains a title (e.g. Chinese) with a set of labels (e.g. ‘Finland & Northern Siberia’, …) and values (e.g. ‘2’, …) for each genetic affiliation.
OrderedDict([('title', 'Chinese'), ('Finland & Northern Siberia', '2'), ('Eastern Asia', '81'), ('Central Asia', '8'), ('Southeast Asia & Oceania', '7')])
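The percentages arrive from the page as strings such as '2%', so the percent sign has to be stripped before the values can be used as numbers. Here is a small sketch of that step on one hand-written value; the variable names are only illustrative.

```python
raw = '2%'

# split('%') returns the pieces around the percent sign,
# including an empty string after it
parts = raw.split('%')

# the first piece is the number as text
number_text = parts[0]

# convert to a float when a numeric value is needed
value = float(number_text)

print(parts, number_text, value)  # ['2', ''] 2 2.0
```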
Saving the results
Finally, we need to decide on the best way to store the data in a file. For example, I can save all the results in a JSON file:
with open('data.json', 'w') as outfile:
    json.dump(dic, outfile)
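If the order of the keys matters when the data is read back, the json module can rebuild each object as an OrderedDict. The sketch below uses the Chinese record shown earlier and works on an in-memory string instead of data.json, which is an assumption made to keep the example self-contained.

```python
import json
from collections import OrderedDict

# The record shown earlier in this post.
record = OrderedDict([
    ('title', 'Chinese'),
    ('Finland & Northern Siberia', '2'),
    ('Eastern Asia', '81'),
    ('Central Asia', '8'),
    ('Southeast Asia & Oceania', '7'),
])

# Serialize a list of records to a JSON string.
text = json.dumps([record])

# object_pairs_hook rebuilds every JSON object as an OrderedDict,
# keeping the keys in the order they appear in the text.
restored = json.loads(text, object_pairs_hook=OrderedDict)
```

The same object_pairs_hook argument is accepted by json.load when reading from a file.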
But a very common way is to save the data in a dataframe. We want a dataframe that looks like this:
|African-American (Southwestern US)||0||0|
So all populations are stored in one column, while each regional affiliation has its own column. The numbers show the percentage share of each regional affiliation in that population. So I need to find all regional affiliations and all populations by looping through dic, which holds all the scraped data. Since an affiliation can appear in more than one population, we need to find the unique affiliations, so we use a set to hold them. We then sort them.
# find all regional affiliations and sort them
regions = sorted(set(key for v in dic for key in v if key != 'title'))

# find all populations
titles = [v['title'] for v in dic]

# what is the number of rows in our dataset?
n = len(titles)

# initialize a dataset but set the share equal to zero temporarily
columns = OrderedDict()
for r in regions:
    columns[r] = np.zeros(n)

df = pd.DataFrame(columns, index=titles)
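To see why a set is handy here, consider this small sketch on hand-written records. The first two records reuse values that appear in this post; the third one is hypothetical and is only there to repeat an affiliation.

```python
records = [
    {'title': 'Chinese', 'Eastern Asia': '81', 'Central Asia': '8'},
    {'title': 'African-American (Southwestern US)', 'Eastern Africa': '2'},
    # hypothetical record that repeats 'Central Asia'
    {'title': 'Example population', 'Central Asia': '5'},
]

# A set keeps each affiliation once, however often it appears;
# sorted() then returns the names as an alphabetical list.
unique_regions = sorted({key for r in records for key in r if key != 'title'})

print(unique_regions)  # ['Central Asia', 'Eastern Africa', 'Eastern Asia']
```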
Now we have created the dataframe using pandas, but all of its elements are zero. So we loop through dic again to fill the dataset with the scraped values:
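for d in dic:
    # the population
    title = d['title']
    # fill the cells in the row related to the title
    # using the percentage values
    for k in d:
        if k != 'title':
            # assign through df.loc so that the dataframe itself is updated,
            # and convert the scraped string to a number
            df.loc[title, k] = float(d[k])

# show the first five rows and columns
# (df.ix has been removed from pandas, so use df.iloc)
df.iloc[0:5, 0:5]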
| | Arabia | Asia Minor | Central Asia | Eastern Africa | Eastern Asia |
|---|---|---|---|---|---|
| African-American (Southwestern US) | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
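As an alternative to pre-filling a dataframe with zeros and looping, pandas can build the same shape directly from a list of dictionaries. This is a sketch of a different route, not the method used in this post; the second record is partial and hand-written from the row shown above, and df_alt is a name I made up.

```python
import pandas as pd

records = [
    {'title': 'Chinese', 'Finland & Northern Siberia': '2', 'Eastern Asia': '81',
     'Central Asia': '8', 'Southeast Asia & Oceania': '7'},
    # partial, hand-written record used only for illustration
    {'title': 'African-American (Southwestern US)', 'Eastern Africa': '2'},
]

df_alt = (
    pd.DataFrame(records)    # one row per record, one column per key
    .set_index('title')      # populations become the row labels
    .astype(float)           # text such as '81' becomes 81.0, gaps stay NaN
    .fillna(0.0)             # a missing affiliation means a share of 0
    .sort_index(axis=1)      # order the affiliation columns alphabetically
)
```

Affiliations that are absent from a record come out as missing values first, which is why the fillna step is needed to reproduce the zeros.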
Now the dataset contains the percentage values, and if an affiliation does not contribute to a population, its share is 0. Finally, I convert the populations into a column of their own and save the dataset to use it in my next project.
| | index | Arabia | Asia Minor | Central Asia | Eastern Africa |
|---|---|---|---|---|---|
| 0 | African-American (Southwestern US) | 0.0 | 0.0 | 0.0 | 2.0 |
| 5 | British (United Kingdom) | 0.0 | 0.0 | 0.0 | 0.0 |
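The step that turns the populations into their own column can be sketched with reset_index. The two rows below reuse values from the table above. Writing CSV is an assumption on my part, since the post does not say which format the dataset was saved in, and the in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Two rows taken from the table above, reduced to one column.
df_small = pd.DataFrame(
    {'Eastern Africa': [2.0, 0.0]},
    index=['African-American (Southwestern US)', 'British (United Kingdom)'],
)

# reset_index() moves the row labels into a regular column named 'index'
# and numbers the rows from 0.
flat = df_small.reset_index()

# Save as CSV (assumed format); a file path could be passed instead of the buffer.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
```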
This post was written in a Jupyter notebook and is available, together with the required dataset, as a GitHub repository.