PROJECT - Wrangle Twitter data via API

Introduction
Initial Setup
Global Functions
Methodology
References

Introduction

Real-world data and data provided for academia are very different. Academic data is prepared for teaching: it is generally close to ideal and needs only minimal corrections, so that it can serve as a worked example. Real-world data is messy and untidy, often with no real structure, and would otherwise overwhelm students.

This project's intent is to simulate the middle ground, where real-world data has already been structured to suit the provider's needs, and to put into practice what has been learnt academically.

Generally:

Gathering data, either by web scraping raw data or via an API
Assessing the format and quality of the data, and the conversions required to make analysis possible and straightforward
Cleaning with programmatic methods, so that the process scales to future expansion and can be edited to handle unpredictable changes

Initial Setup

Generally, run from conda/terminal (replace pip with conda for an Anaconda environment):

pip install pandas (numpy is installed automatically as a dependency)
pip install requests
pip install tweepy

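Optional - jupyter_contrib_nbextensions provides a Table of Contents extension; pandas-profiling generates a profiling report for a dataframe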

pip install jupyter_contrib_nbextensions
pip install -U pandas-profiling

Global Functions

Definition:

add_files(*filename)

returns file added, as a string
global variable: filelist

tracks the files added so far in a list, which saves time scrolling
accepts multiple parameters; a for loop iterates over all of them
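
The description above could be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: only the name add_files, the global filelist and the variadic parameter come from the description, while the comma-joined return string is an assumption.

```python
# Global tracker of every file added so far (named as in the description).
filelist = []


def add_files(*filenames):
    """Record each filename in the global list and return what was added.

    Assumption: the returned string is a comma-separated list of names.
    """
    for name in filenames:
        filelist.append(name)
    return ", ".join(filenames)


# Usage (file names are hypothetical):
# add_files("twitter_archive.csv", "image_predictions.tsv")
```

Printing filelist at any point then shows every file gathered so far, without scrolling back through the notebook.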

get_values(df, col)

returns value_counts, as a series
parameters: <dataframe>, <dataframe.column>

automates .value_counts() for all columns/variables in a dataframe
based on .values, prints whether duplicates are evident and the maximum number of recurrences
does not provide feedback on indexes

go_assess(df)

calls get_values for each column in a dataframe
parameter: <dataframe>
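
A possible shape for these two helpers is sketched below. It assumes col is passed as a column name (the notebook may pass the series itself), and the printed messages and the dictionary returned by go_assess are illustrative choices, not the original implementation.

```python
import pandas as pd


def get_values(df, col):
    """Return value_counts for one column and report on duplicates.

    Assumption: `col` is a column name rather than a Series.
    """
    counts = df[col].value_counts()  # sorted descending by default
    if not counts.empty and counts.iloc[0] > 1:
        print(f"{col}: duplicates found, most frequent value occurs "
              f"{counts.iloc[0]} times")
    else:
        print(f"{col}: no duplicates found")
    return counts


def go_assess(df):
    """Call get_values for every column; returns a dict keyed by column."""
    return {col: get_values(df, col) for col in df.columns}
```

Because value_counts() ignores missing values by default, nulls still need a separate check such as isnull().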

Methodology:

Gather data

1.1 Twitter archive data:

Read in provided CSV, from subfolder using pandas.
Check file imported correctly.
Create two copies, one backup of original and one to be cleaned.
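
The three steps above might look like the sketch below. The function name load_archive and the example path are hypothetical; only the use of pandas, the subfolder and the two copies come from the steps.

```python
import pandas as pd


def load_archive(csv_path):
    """Read the archive CSV and return (backup, clean) independent copies."""
    raw = pd.read_csv(csv_path)
    backup = raw.copy()  # untouched reference copy of the original
    clean = raw.copy()   # working copy to be cleaned
    return backup, clean


# Usage (path is hypothetical):
# backup, df_twitter = load_archive("data/twitter_archive.csv")
# df_twitter.head()   # check the file imported correctly
```

DataFrame.copy() makes a deep copy by default, so edits to the working copy do not leak into the backup.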

1.2 Twitter Image Prediction data:

Access URL and download image predictions using Requests.
Binary read/write is not required: the TSV file holds the image predictions as text, not the images themselves.
Read in TSV file into pandas dataframe.
Check file imported correctly.
Create two copies, one backup of original and one to be cleaned.
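
A minimal sketch of the download and import steps, assuming the file is served as plain text. The function names, the timeout value and the example URL are hypothetical; requests.get, raise_for_status and pandas.read_csv with sep="\t" are standard calls.

```python
import pandas as pd
import requests


def download_text_file(url, dest_path, timeout=30):
    """Download a text file (such as a TSV) and save it in text mode."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early on HTTP errors
    with open(dest_path, "w", encoding="utf-8", newline="") as fh:
        fh.write(response.text)
    return dest_path


def load_predictions(tsv_path):
    """Read the TSV and return (backup, clean) independent copies."""
    raw = pd.read_csv(tsv_path, sep="\t")
    return raw.copy(), raw.copy()


# Usage (URL is a placeholder, not the project's real address):
# path = download_text_file("https://example.com/image-predictions.tsv",
#                           "image_predictions.tsv")
# backup, df_images = load_predictions(path)
```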

1.3 Import tweet_id JSON:

Sign up to Twitter and refer to the developer documentation
Access the Twitter API and obtain the token it requires
Wait for the script to run, then read the file and print the finish time
Visually analyse the text file for a holistic overview
Each entry is enclosed in {}, which indicates the correct format to be read in as JSON
Read in the data from the stored .txt file
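
Reading the stored file back in could be sketched as below, assuming one JSON object per line. The keys id_str, retweet_count and favorite_count are the ones used in standard Twitter API tweet objects and are an assumption about this file; the function name is hypothetical.

```python
import json

import pandas as pd


def read_tweet_json(txt_path):
    """Build a dataframe from a text file holding one JSON tweet per line."""
    rows = []
    with open(txt_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)  # each line parses to a dict
            rows.append({
                "tweet_id": tweet["id_str"],  # assumed key
                "retweet_count": tweet["retweet_count"],
                "favourite_count": tweet["favorite_count"],  # assumed key
            })
    return pd.DataFrame(
        rows, columns=["tweet_id", "retweet_count", "favourite_count"]
    )
```

Taking the string form of the id avoids the precision loss that can occur when very large integers pass through floats.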

Assess data:

3.1 Iteration 1

Visually assess using .sample(qty); the files are already in the folder, ready to be assessed in external programs.
Programmatically assess using .info, .describe, value_counts, isnull

Quality checklist (Completeness, Validity, Accuracy, Consistency):
Format
Duplicates
Nulls
Data Types
Matching

Structure checklist:
Single values per variable/column
Variables match purpose of table i.e. correct schema
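
As an illustration, parts of the quality checklist can be gathered into one table per dataframe. The helper name quality_summary and its output columns are hypothetical; it only combines standard pandas calls (dtypes, isnull, nunique, duplicated).

```python
import pandas as pd


def quality_summary(df):
    """One row per column: data type, null count, unique and duplicate counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
        "duplicated": [int(df[c].duplicated().sum()) for c in df.columns],
    })


# Usage:
# df.sample(5)          # visual check
# df.info()             # data types and non-null counts
# df.describe()         # value ranges, useful for spotting outliers
# quality_summary(df)   # compact overview of the checklist items
```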

3.1.1 Observations from .head() - Twitter data archive: [V]=visual, [P]=programmatically, [O]=Optional

17 Variables in total

Col0 tweet_id: [P]

Integer datatype suits content
No nulls: .info()
No duplicates: df_twitter.tweet_id[df_twitter.tweet_id.duplicated()]
No outliers in values: .describe()

Col1-2 in_reply_to_status_id, in_reply_to_user_id: [P][V]

Float datatype precision not required [P]
A significant quantity of values is missing
Possible outlier values: scientific notation (e+) appears in .describe(), and the min values differ in scale from the other values

Col3 timestamp:

[P][V] Contains date and time, which can be split into additional columns: Date, Time
[V][O] Contains +0000 at the end; research indicates that it displays the timezone as an offset from UTC. Extract into a timezone column.
Remove data after 01 Aug 2017 (2017/08/01), as requested
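
The timestamp handling could be sketched as follows. It assumes the raw strings look like "2017-08-01 16:23:56 +0000" (a space before the offset) and that rows dated on 01 Aug 2017 itself are kept; both are assumptions, as are the function name and the new column names.

```python
import pandas as pd


def clean_timestamp(df, cutoff="2017-08-01"):
    """Parse timestamp, split it into date/time/timezone, drop later rows."""
    out = df.copy()
    ts = pd.to_datetime(
        out["timestamp"], format="%Y-%m-%d %H:%M:%S %z", utc=True
    )
    out["timestamp"] = ts
    out["date"] = ts.dt.date
    out["time"] = ts.dt.time
    out["timezone"] = ts.dt.strftime("%z")  # e.g. "+0000"
    # Keep everything up to the end of the cutoff day (assumption).
    limit = pd.Timestamp(cutoff, tz="UTC") + pd.Timedelta(days=1)
    return out[ts < limit].reset_index(drop=True)
```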

Col4 source:

[P][V] Contains HTML tags, extract URL
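
One way to pull the URL and the visible text out of the anchor tag is a pair of regular expressions, sketched below. The column name source_url is hypothetical and source_app follows the rename listed in the cleaning summary. The sample value in the usage comment is illustrative. BeautifulSoup would be a sturdier alternative for less regular HTML.

```python
import pandas as pd


def split_source(df):
    """Extract the href URL and the link text from the source column."""
    out = df.copy()
    # URL inside href="..."
    out["source_url"] = out["source"].str.extract(r'href="([^"]+)"', expand=False)
    # Text between the opening and closing tag
    out["source_app"] = out["source"].str.extract(r">([^<]+)<", expand=False)
    return out


# Usage (illustrative value):
# '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
# gives source_url "http://twitter.com/download/iphone"
# and source_app "Twitter for iPhone"
```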

Col5 text: [V]

Contains description of the tweet, details of the dog, description of the picture, URL

Col6-7 retweet_status_id, ..._user_id: [V]

See Col1-2 comments above, similar findings

Col8 retweet_status_timestamp: [P][V]

Significant values missing

Col9 expanded_urls:

[P][V] Contains duplicate URLs within a single row value

Col10, 11 rating: [P][V]

Numerator exceeds 10 and values greater than 100 are present; most likely decimal points were not factored in when the rating was extracted.
Denominator is always 10; the information is redundant, but no change is required, as requested
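
To test the decimal-point hypothesis, the rating can be re-extracted from the text column and compared with the stored numerator. This is a sketch under assumptions: the pattern takes the first "number/number" in each text, and the helper and its output column names are hypothetical.

```python
import pandas as pd

# First "number/number" in the text, allowing a decimal numerator.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"


def check_ratings(df):
    """Re-extract ratings from text and flag rows where numerators differ."""
    extracted = df["text"].str.extract(RATING_PATTERN)
    out = df.copy()
    out["text_numerator"] = extracted[0].astype(float)
    out["text_denominator"] = extracted[1].astype(float)
    out["numerator_mismatch"] = out["text_numerator"] != out["rating_numerator"]
    return out
```

Rows flagged in numerator_mismatch are candidates for manual review, since a text can contain more than one fraction.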

Col12 name:

[P][V] .value_counts shows a significant number of names not provided, as well as duplicates

3.1.2 Observations from .head() - Twitter image recognition: [V]=visual, [P]=programmatically, [O]=Optional [P][V]

Globally, there are no null entries for all columns
12 Variables in total

Col0 tweet_id: [P]

Integer datatype matches that found in twitter dataframe, will merge based on this primary/foreign key [P][V]
Data is valid
Correct integer lengths
unique and conforms to a schema
No structure issues found.

Col1 jpg_url [P][V]

Values appear consistent
Variable name is descriptive but inaccurate, as some extensions are not .jpg
Datatype suits content [P]
Duplicates are present and seem plausible, as these could be retweets

Col2 img_num [P][V]

Appears complete
Unsure of purpose, information lacking
Max value is 4, min is 1
Duplicate values are expected; the data appears to be categorical, but how it is quantified/measured is unclear from initial observation

Col3 p1, Col6 p2, Col9 p3 [P][V]

Consistency: strings have a mix of lower and proper case
Data validity/machine learning prediction accuracy issues: non-dog predictions, e.g. canoe, suit, candle, are prevalent, although the purpose of the file is to provide image predictions of dogs
Contains no white space

Col4 p1_conf, Col7 p2_conf, Col10 p3_conf [P][V]

Confidence of the p1 observation made by the program
Datatype suits
Value is not greater than 1, i.e. 100%

Col5 p1_dog, Col8 p2_dog, Col11 p3_dog [P][V]

Data validity issue: the number of False values does not match the non-dog predictions found when masking col3; a cross reference is required to see which col5 False values correspond to the matches in col3

3.1.3 Observations after write to .txt file - Twitter API scrape: [V]

text file shows the structure of the data is correct and formatted to suit JSON [P]
imports as dict type after using json.loads
for loop required to sift through the object keys within the file
append and merge into a dataframe
no nulls

Col0 tweet_id [P][V]

Data quality fixed during import of .txt file, tweet_id and tweet_idstr were available keys.

Col1 retweet_count [P][V]

Correct data type to suit values within column

Col2 favourite_count [P][V]

Correct data type to suit values within column

3.1.4 Define data dictionary (basic description)

3.1.4.1 Twitter Archive

tweet_id: numeric, user identifier
in_reply_to_status_id: numeric, user identifier, with NaN
in_reply_to_user_id: numeric, user identifier, with NaN
timestamp: date & time, YYYY-MM-DD HH:MM:SS+GMT
source: string, html tag with URL
text: string, twitter user text
retweet_status_id: numeric, user identifier, with NaN
retweet_status_user_id: numeric, user identifier, with NaN
retweet_status_timestamp: date & time, with NaN
expanded_urls: string, user twitter URL
rating_numerator: numeric, exceeding 10, extracted from text column
rating_denominator: numeric, value = 10 across column
name: string, name extracted from text column
floofer: category, dog type extracted from text column
doggo: category, dog type extracted from text column
pupper: category, dog type extracted from text column
puppo: category, dog type extracted from text column

3.1.4.2 Twitter Image Predictions

tweet_id: numeric, user identifier
jpg_url: string, image URL
img_num: number, corresponds to algorithm with highest probability
p1: string, predicted image 1 out of top 3
p1_conf: numeric value, algorithm confidence in recognition
p1_dog: boolean, image is a dog
p2: see p1
p3: see p1

3.1.4.3 Twitter API extract

tweet_id: numeric, user identifier
retweet_count: numeric, retweet count of twitter id
favourite_count: numeric, favourite count of twitter id

3.2 Iteration 2

The sizes of the three archives differ and are inconsistent. Join the dataframes on tweet_id, limited to the smallest common set of tweet_ids.
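
Reading this as an inner join, the step might be sketched as below. The function name is hypothetical, and casting tweet_id to string first follows the cleaning summary rather than anything stated in this iteration.

```python
import pandas as pd


def join_on_tweet_id(archive, predictions, api):
    """Inner-join the three dataframes so only shared tweet_ids remain."""
    frames = [
        frame.assign(tweet_id=frame["tweet_id"].astype(str))
        for frame in (archive, predictions, api)
    ]
    merged = frames[0].merge(frames[1], on="tweet_id", how="inner")
    return merged.merge(frames[2], on="tweet_id", how="inner")
```

An inner join drops any tweet missing from one of the sources, so the result can never be larger than the smallest dataframe.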

3.3 Iteration 3

Names were incorrect and needed to be extracted from text column

4.1 Cleaning Summary:

4.1.1 Quality Issues:

col0: change tweet_id data type to string, all dataframes

df_twitter

col3: change timestamp datatype to datetime
3.1 col4: split string to remove the HTML tag and extract the text within
3.2 col4: rename column heading from source to source_app
col1,2,6,7: change datatype from float to string
5.1 remove whitespaces in string/object columns
review col12 to ensure the correct name was transferred over
check numerator rating against text to confirm it is valid/correct
remove retweets, indicated by RT @ in text column, retweet status id and in reply to id
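
The last item could be sketched as a boolean filter. The column names follow this document's naming (retweet_status_id, in_reply_to_status_id) and may differ in the actual file; the function name is hypothetical.

```python
import pandas as pd


def drop_retweets_and_replies(df):
    """Keep only original tweets: no RT @ text, retweet id or reply id."""
    is_rt_text = df["text"].str.startswith("RT @", na=False)
    is_retweet = df["retweet_status_id"].notna()
    is_reply = df["in_reply_to_status_id"].notna()
    return df[~(is_rt_text | is_retweet | is_reply)].reset_index(drop=True)
```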

twitter_image_predictor

5.2 remove whitespaces in string/object columns
6. col3,6,9: change to lower case
7. col1: rename from jpg_url to img_url
8. col2: rename from img_num to conf_tweet_img
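
A sketch of the image-prediction cleaning steps, assuming the string columns are jpg_url, p1, p2 and p3 as listed in the data dictionary. The function name is hypothetical.

```python
import pandas as pd


def clean_predictions(df):
    """Strip whitespace, lower-case predictions, rename two columns."""
    out = df.copy()
    for col in ("jpg_url", "p1", "p2", "p3"):  # assumed string columns
        out[col] = out[col].str.strip()
    for col in ("p1", "p2", "p3"):
        out[col] = out[col].str.lower()
    return out.rename(
        columns={"jpg_url": "img_url", "img_num": "conf_tweet_img"}
    )
```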

twitter_api

5.3 remove whitespaces in string/object columns

4.1.2 Structure Issues:

timestamp split into three columns: date, time, timezone
categorize dog type into one column
merge, denormalize dataframe to contain the relevant columns required for analysis
3.1 twitter_data to contain all relevant twitter data
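
Collapsing the four dog-type columns into one could be sketched as follows. It assumes that a missing type is stored as the string "None" and that a tweet with more than one type should keep all of them, comma-separated; the function name and the new column name dog_type are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_dog_type(df, missing="None"):
    """Merge the four dog-type columns into a single categorical column."""
    out = df.copy()

    def pick(row):
        found = [s for s in STAGES
                 if pd.notna(row[s]) and row[s] != missing]
        return ", ".join(found) if found else None

    out["dog_type"] = out[STAGES].apply(pick, axis=1).astype("category")
    return out.drop(columns=STAGES)
```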

==================================================

Upgrades:

add file tracker
assess function
requirements: specify column (series) as functions do not apply to whole dataframe
argument = dataframe.seriesname
create memory release, for dataframes that have been copied
add function to create compiled dataframes i.e raw and clean
container to list all functions present and the arguments required

Results

Prepare:
1.1 wrangle_act.ipynb
1.2 wrangle__report.pdf/html for documentation of steps

References

Udacity

Docs

Misc.

JUPYTER CONVERT: HOW TO GET A TABLE OF CONTENTS
Index Match
Does not contain string (search for "does-not-contain" on a DataFrame)
HTML type Attribute
Apply BeautifulSoup function to Pandas DataFrame
Pandas replace strings
Ignoring NaNs with str.contains
Python regex - visual guide
Show Tableau dashboard in Jupyter Notebook

Incomplete functions

Problems

SweetViz compare PROBLEM: -> Raw and Clean dataframes required to be same shape.

Repository: jcalaunan / prj_wrangle-twitter
