This repository holds the lessons I've learned as I've explored various sub-disciplines of data science (Python, R, statistics, neural network programming, etc.). Most of the insights and code come from my MS Data Science studies at Indiana University, which I plan to complete in May 2020.
I hope you learn something useful as you read the code and discussions in the topics below.
I have been a data geek all my life. For science fair projects, I didn't build baking soda volcanoes; instead, I analyzed weather forecast accuracy and factors in student fitness. As an undergrad at Princeton, I aced linear algebra, multivariable calculus, and several math-track economics courses. My interests in social justice and statistical analysis aligned as I pored over US Census Bureau data to identify gentrification trends for a course in urban economics. Continuing in this vein, I enrolled in a graduate-level course in Comparative Urban Development, where I happily immersed myself in analyzing migration and growth patterns in India.
Shortly after my graduation in 1983, I joined a Christian humanitarian organization as the administrator of a nutrition and health education project in West Africa. To increase our effectiveness, my wife and I were required to abandon our American lifestyle in favor of local dress, custom, and language. Fortunately, I was not called upon to abandon my passion for data.
The previous project management had collected children's weight and age data as points on a scatter plot. While this permitted a rough evaluation of the health of enrolled children, my leadership and I wanted to explore questions such as:
- What percentage of the children were improving or regressing? Our intervention would be different for perpetually malnourished children than for those who fluctuated between malnutrition and proper development.
- Was there a critical age span where children were at greatest risk? If so, we wanted to focus interventions on that cohort.
- Were children in some of the five regions struggling more than in others? If so, was it the result of random variation or a trend that warranted further investigation? We might, for example, want to give extra training to volunteers at centers serving the most at-risk children.
To answer these questions, I designed a dBase III-based system for tracking each child's age and weight. A colleague regularly exported the data to a spreadsheet, which I used to generate analyses and visualizations for our stakeholders.
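For readers curious how questions like these might be approached today, here is a minimal Python sketch. It is not the original dBase III system. It assumes each visit has already been converted to a weight-for-age z-score (the `waz` field), and that a z-score below -2 marks a child as underweight. The field names, the helper functions, and the sample records are all made up for illustration.

```python
from collections import defaultdict

# Assumed cutoff: a weight-for-age z-score below -2 counts as underweight.
UNDERWEIGHT_Z = -2.0


def nutrition_status(z_scores):
    """Label one child's history from z-scores ordered by age at visit."""
    flags = [z < UNDERWEIGHT_Z for z in z_scores]
    if all(flags):
        return "persistently underweight"
    if any(flags):
        return "fluctuating"
    return "adequate"


def underweight_rate(records, key):
    """Share of visits below the cutoff, grouped by key(record)."""
    totals = defaultdict(int)
    low = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if rec["waz"] < UNDERWEIGHT_Z:
            low[group] += 1
    return {group: low[group] / totals[group] for group in totals}


def age_band(rec, width=12):
    """Bucket a visit into an age span of `width` months, e.g. '12-23 mo'."""
    start = (rec["age_months"] // width) * width
    return f"{start}-{start + width - 1} mo"


# Hypothetical sample data: one dict per weigh-in.
records = [
    {"child": "A", "region": "North", "age_months": 8, "waz": -2.5},
    {"child": "A", "region": "North", "age_months": 14, "waz": -2.3},
    {"child": "B", "region": "North", "age_months": 10, "waz": -1.0},
    {"child": "B", "region": "North", "age_months": 16, "waz": -2.1},
    {"child": "C", "region": "South", "age_months": 30, "waz": -0.5},
    {"child": "C", "region": "South", "age_months": 36, "waz": -0.2},
]

# Question 1: which children are persistently underweight vs. fluctuating?
histories = defaultdict(list)
for rec in sorted(records, key=lambda r: r["age_months"]):
    histories[rec["child"]].append(rec["waz"])
status = {child: nutrition_status(zs) for child, zs in histories.items()}

# Question 2: is there an age span with a higher underweight rate?
by_age = underweight_rate(records, key=age_band)

# Question 3: do some regions show a higher underweight rate?
by_region = underweight_rate(records, key=lambda r: r["region"])

print(status)
print(by_age)
print(by_region)
```

The grouped rates only describe the data. Deciding whether a regional difference reflects random variation or a real trend would take a further statistical test, which this sketch leaves out.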
In the early 1990s, I became a Merrill Lynch financial consultant. I am probably the only retail broker in the history of the firm who taught himself C in order to write an asset allocation recommendation system based on modern portfolio theory! This was the beginning of a course of self-study in programming that led to a lasting career shift.
Since 1994, I have worked as a software engineer and architect. I have coded at every application layer in at least a dozen languages; debugged assembly code in Windows 98; worked with clients to craft system requirements and architecture; led development teams; designed system collaborations via both proprietary APIs and industry-standard SOA/message exchanges; and blogged shamelessly about it all. But my favorite assignments have always involved intensive data analysis. As a Microsoft consultant, for example, I led customers through data-intensive performance labs. More recently, I have analyzed method call time-series data to identify the root causes of system outages in an extremely large federal system, and have optimized Spark jobs in a big data curation/analysis system for the Defense Intelligence Agency. I am currently working on a microservices architecture project for the VA.