Pipelines incidents: not as rare!

Dakota Access Pipeline is no longer an unfamiliar name. The controversy over the pipeline comes from two opposing views. On the one hand, people are of afraid of (water) pollution. On the contrary, the corporation responsible for the pipeline says it’s safe. So the question is clear: Is the pipeline really safe?

It’s hard to answer this for a not operational pipeline, but we can look at pipeline incidents data to answer if pipelines are safe. Fortunately, the Pipeline and Hazardous Materials Safety (PHMS) Administration provides data on every single pipeline accidents from 1986 to now. The dataset is rich with a lot of raw data showing all kinds of details about the accidents.

The analysis of the pipeline data is not unprecedented. The New York Times in 2014, created an interactive map of environmental incidents in North Dakota from 2006-2014. Back in 2015, High Country News wrote a piece on crude oil spills after an incident in Santa Barbara costing more than $62M at the time. Also after an incident in North Dakota, Inside Energy analyzed wastewater spills in North Dakota. All these articles emphasize that the spills are not as rare as one might think.

This post is my take on the pipeline incidents. I’m interested in answering these specific questions:

  1. How common are spills?
  2. What is their spatial and temporal distributions?
  3. What is their scale regarding volume and cost?
  4. What are the main causes of spills?
  5. What places have a higher risk?

There are tons of other questions to ask, but these are basic questions to understand the problem better. Hopefully, similar studies are done for policy making.

Dataset and technical details

I am doing the analysis in Python using its standard libraries:

  • numpy
  • pandas
  • matplotlib

I also use plotly, but only for plotting points on the US map. I have customized matplotlib style using the recommendations of this post.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.serif'] = 'Helvetica Neue'
plt.rcParams['font.monospace'] = 'Helvetica Neue'
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 10
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['figure.titlesize'] = 12

You can download the dataset from PHMS website with xlsx format. The dataset is broken into different time periods. Here I only focus on the data from 2010-2016 (filename is hl2010toPresent, converted to CSV format). Although, many incidents in 2016 are not reported yet. This data contains massive details of each incident (having 255 columns for each record shows this!). I only select a few columns and rename them to be more descriptive:

  • Year
  • Location (longitude and latitude)
  • Commodity type
  • The released volume (in barrels)
  • State
  • Release type
  • Whether water was contaminated
  • Total cost (in 1984 dollars)
  • Cause

I make a few changes in the dataset to make it more approachable:

  • The cost comes in 1984 dollar. According to this website, \$1 in 1984 is equal to \$2.3 in 2016.
  • The released volume is in barrels, so I convert it to gallons which I have a better sense of (1 barrel = 42 US gallons).
  • The commodity names are long, so I make them shorter.
dataset = pd.read_csv("./hl2010toPresent.csv", usecols = usefulcols)

dataset.columns = ["time", "year", "lat", "long", "commodity", "volume","state", "release type", "water","cost","cause"]

    , inplace=True)

# convert to gallons
dataset['volume'] = 42 * (dataset['volume'].astype(int))

# convert to 2016 dollars
dataset['cost'] = 2.3 * dataset['cost']

# save the cleaned dataset in a new file
time year lat long commodity volume state release type water cost cause
0 2/16/10 7:42 2010 41.94352 -88.23353 REFINED LIQUIDS 0 IL LEAK NO 38379.865371 EQUIPMENT FAILURE
1 3/1/10 11:50 2010 37.10847 -100.80037 CARBON DIOXIDE 84 KS OTHER NO 4414.659264 OTHER INCIDENT CAUSE
2 2/22/10 10:38 2010 32.22471 -101.40440 FLAMMABLE GASES 42 TX LEAK NO 33782.834143 CORROSION FAILURE
3 2/19/10 6:50 2010 40.60860 -74.23990 REFINED LIQUIDS 0 NJ LEAK NO 21507.314365 EQUIPMENT FAILURE
4 2/21/10 12:45 2010 31.13284 -101.18974 CRUDE OIL 378 TX LEAK NO 21991.543373 CORROSION FAILURE

Spread of spills

The first plot shows the volume of the released commodities for all incidents in each year. Both plots are exactly the same, except the marker size is representing two different things. On the left plot, the size shows the cost of the spill. Unsurprisingly, the most costly incidents are those with a high volume usually $>10^4$.

The right panel, on the other hand, offers an alternative and interesting view (and it looks like they are a bunch of earrings!). The size of each point is proportional to the cost of the spill divided by the release volume. This reveals a very interesting trend: The cost per volume is more for small spills! I don’t know exactly why but my guess is the actions need to be taken are partly independent of the spill volume (this is up for debate, of course).

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,6))
dataset.plot.scatter(x="year", y="volume",s=dataset.cost/80000,ax=axes[0])
dataset.plot.scatter(x="year", y="volume",s=0.04*dataset.cost/dataset.volume,ax=axes[1])
for ax in axes:
    ax.set_ylabel('Volume (Gallons)')


Based on the commodity, the crude oil is the leading released commodity. Among 2693 incidents (roughly 1.5 incidents a day) reported from 2010 to October 2016, 1344 incidents involve releasing crude oil followed by refined fluid by 911 incidents. Together they make up ~$\%84$ of the incidents.

ax = dataset["commodity"].value_counts(sort=True).plot.barh()
ax.set_title("The number of spills by commodity")
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_width()), (1.1*p.get_x()+p.get_width(), p.get_y()+p.get_height()/3 ))


It is worrying that number of the incidents has been constantly growing. Let’s look at the data by year and the reason of spill. Obviously leak is the leading cause and unfortunately constantly number of the leaks is increasing. I see no reason that this trend stops at this year. Other causes have more or less the same frequency over these years.

release_year = pd.crosstab(dataset['year'],dataset['release type'])
ax = release_year.plot(kind='bar', stacked=True, grid=False)
ax.set_ylabel("number of incidents")
ax.legend(loc = 'upper left')


From here on, I will focus on incidents involving crude oil only. The number of such incidents has constantly increased with the exception of 2011. But the released volume follows no such pattern. Also, as dataset clearly mentions, part of the released oil can be recovered but it is not reported in the dataset. In 1344 incidents of crude oil, more than 8.5 million gallons are released.

data = dataset[dataset["commodity"]=="CRUDE OIL"]
time year lat long commodity volume state release type water cost cause date ctime
4 2/21/10 12:45 2010 31.13284 -101.18974 CRUDE OIL 378 TX LEAK NO 21991.543373 CORROSION FAILURE 2010-02-21 2/21/10 12:45
8 2/20/10 6:30 2010 32.47850 -94.86790 CRUDE OIL 336 TX OVERFILL OR OVERFLOW YES 117614.823872 EQUIPMENT FAILURE 2010-02-20 2/20/10 6:30
10 3/1/10 9:16 2010 47.68857 -95.41732 CRUDE OIL 126 MN OVERFILL OR OVERFLOW NO 23997.634976 INCORRECT OPERATION 2010-03-01 3/1/10 9:16
13 3/1/10 8:10 2010 32.48325 -94.83034 CRUDE OIL 8316 TX RUPTURE NO 20146.442193 CORROSION FAILURE 2010-03-01 3/1/10 8:10
16 1/25/10 11:07 2010 29.47000 -90.25444 CRUDE OIL 8484 LA OVERFILL OR OVERFLOW YES 838911.034004 EQUIPMENT FAILURE 2010-01-25 1/25/10 11:07
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,6))

axes[0].set_title("The number of crude oil accidents per year")
axes[0].set_ylabel("number of accidents")
for p in axes[0].patches:
    # write number on the bars
    axes[0].annotate('{:.0f}'.format(p.get_height()), (p.get_x() + p.get_width()/8, p.get_height() * 1.005))

axes[1].set_title("The volume of crude oil in million gallons per year")
axes[1].set_ylabel("volume (million gallons)")
for p in axes[1].patches:
    axes[1].annotate('{:.1f}'.format(p.get_height()), (p.get_x() + p.get_width()/8, p.get_height() * 1.005))


Which states are dealing with the incidents?

Texas by far has the most spills, followed by Oklahoma, California, and Wyoming. In fact, Texas and Oklahoma deal with more than %51 of the incidents. This is not unexpected as according to Railroad Commission of Texas:

Texas has the largest pipeline infrastructure in the nation, with more than 439,771 miles of pipeline representing about 16 of the total pipeline mileage of the entire United States.

ax.set_title("The number of crude oil accidents in different states")
ax.set_ylabel("number of accidents")
# share of the first two states from the incidents



What is total money spent?

Another factor in a pipeline incident is the cost of each incident. Shown in 2016 dollar, look at the alternating pattern in the cost. This is probably because cleaning and fixing the issue takes a long time (maybe about a year?). Also, the companies are supposed to report the data at the end of fixing and cleaning process, so some incidents are not reported completely. The total reported cost so far has been $2.16 billion since 2010.

ax = (data.groupby(['year'])['cost'].sum()/10**6).plot(kind='bar',rot=0)
ax.set_title("The total cost of spills in 2016 dollar")
ax.set_ylabel("million dollar")
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+ p.get_width()/2, p.get_height() * 0.98),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')


Was water contaminated?

Maybe the most concerning aspect of spills is whether they pollute the water. Fortunately, this is often not the case (in %91 of cases). Although, we don’t know if those spills happen close to the water or not. Maybe, we have been lucky that pipelines mostly are not crossing close to the water resources.

data['water'].value_counts().plot.bar(title="Whether water was contaminated or not",rot=0)


Geographic distributions

The color intensity represents the number of spills in each states. The color palettes on the US map is from colorbrewer2. As it was mentioned, Texas is by far the worst state in terms of number of incididents.

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import Image
import plotly.plotly as py
df = pd.DataFrame(data["state"].value_counts(sort=True).reset_index())
df.columns = ["code","counts"]
datax = [ dict(
        autocolorscale = False,
        colorscale=[[0, 'rgb(255,237,160)'],[0.5, 'rgb(254,178,76)'],[1, 'rgb(240,59,32)']],
        locations = df["code"],
        z = df["counts"],
        locationmode = 'USA-states',
        marker = dict(line = dict (color = 'rgb(0,0,0)',width = 1,) ),
        colorbar = dict(title = ""),
        ) ]

layout = dict(
        title = 'The number of spills from 2010-October 2016',  
        geo = dict(
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(229, 229, 229)",
            subunitcolor = "rgb(0,5,5)",
            countrycolor = "rgb(0,50,5)",
            countrywidth = 0.5,
            subunitwidth = 0.5        

fig = dict( data=datax, layout=layout )
#iplot( fig, filename='./d3-cloropleth-map',) #image='png'


Incidents on map

Finally, maybe the most informative plot is the spatial distribution of all incidents. The map shows that spills really happen wherever there is a pipeline. Possibly connecting points will show exactly where pipelines are placed. The color of points is proportional to the released volume. The two large black points are two accidents in North Dakota (July 2013, 865000 gallons, \$20 million) and Michigan (July 2010, 843000 gallons, \$1 billion).

datax = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = data['long'],
        lat = data['lat'],
        text = data['cost'].astype(str),
        mode = 'markers',
        hoverinfo = 'lon+lat+location',
        marker = dict(
            size = 9,
            symbol = 'circle',
            line = dict(width=1,color='rgba(10, 10, 1)'),
            opacity = 0.9,
            reversescale = False,
            autocolorscale = False,
            colorscale=[[0, 'rgb(255,237,160)'],[0.5, 'rgb(254,178,76)'],[1, 'rgb(240,59,32)']],
            cmin = 0,
            color = data['volume'],
            cmax = data['volume'].max(),

layout = dict(
        title = 'All crude oil incidents from 2010-October 2016',   
        geo = dict(
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(229, 229, 229)",
            subunitcolor = "rgb(0,5,5)",
            countrycolor = "rgb(0,50,5)",
            countrywidth = 0.5,
            subunitwidth = 0.5        

fig = dict( data=datax, layout=layout )
#iplot( fig, filename='d3-airports' )


Conclusion & future ideas

During the period 2010 to October 2016, more than 8.5 million gallons of crude oil is spilled. The total cost of these spills is well over 2 billion dollars. In one of the worst cases, 843444 gallons was released in Michigan, resulting in a massive 1 billion cost in 2016 dollars and water contamination. In 2016 alone, more than 54 million dollars is the cost of releasing 800000 gallons of crude oil. Although, it is claimed that most incidents happen within the facilities not in the public land. Although, confirming this needs further analysis.

As Christofer Jones, historian of energy at ASU points out: “While it’s true that improved technology and regulation have reduced spills significantly—much like flying today is far safer than in the early years of commercial aviation—the fact remains that there exists no such thing as a spill-proof pipeline. Recognizing this historical reality is crucial to crafting future policy.”

Whether this is currently a real priority is another question. At the time of writing, there are 188 federal and 340 state inspectors to inspect “2.7 million miles of pipelines, 148 liquefied natural gas plants, and 7,574 hazardous liquid breakout tanks.”

This quick analysis shows different aspects of the pipeline incidents but many aspects of this dataset is unexplored. Some that I can think of are:

  • The number of casualties,
  • Whether the pipeline is underground or on ground ,
  • possible seasonal/temporal patterns in incidents (one might guess winters are the worst seasons!),
  • Whether the spill happened close to the well or far from the pipeline. This is important as some pipelines cross lands close to farming areas and cities.
  • How long was the response time?
  • What are the most vulnerable areas? Is the vulnerability related to their age?
  • How does cost depend on the commodity, volume and cost?
  • Categorizing spills based on the volume and cost because many spills have really small volumes.

I’ll be grateful to hear your ideas, comments, and suggestions!

This post is written in Jupyter notebook and is available with the required dataset as a github repository.

comments powered by Disqus