The cap_1 from kevhandel

The Toxics Release Inventory (TRI)

Contains EPA Data on Chemical Fates within US Industry

Motivation and Background...

The United States Environmental Protection Agency (EPA) is charged with keeping our air, the waterways, and ground free from harmful chemicals. The Agency maintains a database which contains the "fate" of industrial chemicals in use by US-based companies, so that we might better track the whereabouts of those which are potentially harmful.

Data on the amounts of each chemical traveling in any of the 60+ streams are delivered through self-reporting.
Complete sets are available for years 1987 - 2016, inclusive.

The publicly-available repository contains over 6 million records on 600 unique chemicals traversing various pathways. While the data are highly organized, the overlap of certain categories, some of which combine multiple fates, and the emergence of new fates that did not appear in the earlier reports, does allow for obfuscation.

Plenty of ponderables served to motivate this study, such as:

Do fates of chemicals predictable and reliably evolve over time?
Can identified trends be correlated with enhanced Federal and/or local environmental regulations? Might we become able to better predict the effect of future proposed regulation?
Is the early data detectably padded due to the very lax reporting standards which, anecdotally, resulted in over-reporting of certain hazardous chemicals?
Are trends toward more society-wide recycling apparent in industry as well?

The primary goal was to plot a few fates of select chemicals in order to test for industry-wide evolution in chemical fate. Quantification would involve Hypothesis Testing.

As the database contains no information regarding environmental regulation, regression testing to model the effects of regulations would require merging this data with a secondary regulations database.

Investigation...

The primary data source is the EPA's own website. A single csv contains all of the records. A secondary copy is housed on Kaggle.com, where the records are housed in 29 separate files and sorted by calendar year.

Data was downloaded from both sites. However, the csv file from the EPA site was corrupted, and when imported as a Pandas dataframe, did not reliably return complete records. As an alternative, we reconstructed a single database by merging all of the Kaggle files as individual Pandas dataframes, and saving this object in a single csv file. Data values were maintained in their present form throughout the project. At nearly one GB in size, the single file was far too massive to manage; we addressed the following issues while reorganizing the data:

We shrunk the data by dropping 39 of the 109 columns that dealt with company info, EPA-specific chemical classification, or other EPA geographical site-related information. We kept all of the numeric columns associated with chemical fate.
We found many rows with NaN values. As these appeared to exist due to the expansion in the reporting options from original conception, no data imputation was warranted at the early stage of exploration. We then split the dataframe into 613 dataframes each of which housed the data for a unique chemical, and stored these as csv files.
Each dataframe contained the 66 columns related to fate, and the file names were coded according to chemical name, which meant that the future process of querying any given chemical could be fairly well automated. A lookup function initially queried the monster file for chemical names, and then stored these in a separate tiny file to be accessed at the startup of each session.

Start the EDA...

Some of the 630 chemicals

New individual csv files

Fate classifications

Cross section of fate data for Benzene

Graphing by the 1000's...

A few graphs showing fate over time

Benzene total fig_chemicals_BENZENE_TOTAL_RELEASES

Gathering Data...

To test the proposed hypothesis, "Fates of chemicals do not change over time across US industry", we next sought to gather fate data for the compelling case of FUGITIVE AIR emissions, which appeared to be diminishing over time, especially for volatile chemicals. We divided the emissions data for this fate into two temporal groupings: 1987-2001 and 2002-2016.

The next distribution graph of emission amounts would demonstrate that an effect is worthy of investigating emperically.

Had this worked well, we could select other fates and attempt to discover which ones change over time.

For starters, the admittedly selective data (emission amounts up to 100,000 pounds) for select volatile chemicals reveal diminished fates in fugitive and stack air emissions. (We also included styrene in water, for additional comparison.)

M-Xylene Fugitve	M-Xylene stack
Methylene fugitive	Methylene stack
Styrene fugitve	Styrene water
Toluene fugitive	Toluene stack
Xylene fugitive	Xylene stack

Results...

The data are interesting in that emission fates exhibit marked variability over time for many of the chemicals. However, without significant further numerical analysis, we cannot test the data using regression models in an attempt to ascertain true correlation between the various fates.

Looking Ahead...

We certainly know that the chemical fate numbers in this database are intrinsically related since the database is affords a complete accounting for the measured chemicals. If one fate decreased as a percentage of the the total, then one or more others must compensate. Further, there may be correlation with outside forces, such as Environmental regulations. We may be able to develop a model that predicts the change in fates based upon regulation, for instance.

Data Sources...

The US EPA maintains the TRI database.
Kaggle.com maintains a copy of the original.
With gratitude, we accessed both.

kevhandel / cap_1 Goto Github PK

cap_1's Introduction

The Toxics Release Inventory (TRI)

Motivation and Background...

Investigation...

Start the EDA...

Graphing by the 1000's...

Gathering Data...

Results...

Looking Ahead...

Data Sources...

cap_1's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent