
prj_wrangle-twitter's Introduction

PROJECT - Wrangle Twitter data via API

Introduction

Real-world data and the data provided in academia are very different. Academic data exists for teaching purposes, so it is generally close to ideal and needs only minimal corrections to serve as an example. Real-world data is messy and untidy, with little real structure, and would otherwise overwhelm students.

This project's intent is to simulate the middle ground, where real-world data has already been structured to suit the provider's needs, and to put into practice what has been learnt academically.

Generally, this involves:

  • Gathering data, by web scraping raw data or via an API
  • Assessing the format and quality of the data, and the conversions required to make analysis possible and straightforward
  • Cleaning, using programmatic methods that scale to future expansion and to unpredictable changes

Initial Setup

Generally, run the following from a conda prompt or terminal (replace pip with conda in an Anaconda environment):

  • pip install pandas (installs numpy as a dependency)
  • pip install requests
  • pip install tweepy

Optional - provides Table Of Contents

  • pip install jupyter_contrib_nbextensions
  • pip install -U pandas-profiling

Global Functions

Definition:

add_files(*filename)

  • returns file added, as a string
  • global variable: filelist
  • tracks files added so far in a list, saves time scrolling
  • can pass multiple parameters, for loop will iterate to catch all
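
A minimal sketch of how such a file tracker could look, based only on the description above. The body of the function and the file names in the usage lines are assumptions, not the project's actual code.

```python
# Hypothetical sketch of the file tracker described above.
filelist = []  # module-level list of every file name registered so far


def add_files(*filenames):
    """Record one or more file names and return the ones just added as a string."""
    added = []
    for name in filenames:  # the loop catches every argument passed in
        filelist.append(name)
        added.append(name)
    return ", ".join(added)


# Illustrative file names only
print(add_files("archive.csv", "predictions.tsv"))
print(filelist)
```

Because the list is only mutated and never reassigned, no `global` statement is needed inside the function.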

get_values(df, col)

  • returns value_counts, as a series
  • parameter: <dataframe>, <dataframe.column>
  • automates .value_counts() for all columns/variables in a dataframe
  • based on .values, will print out whether duplicates are evident and the maximum number of times a value recurs.
  • does not provide feedback on indexes

go_assess(df)

  • calls get_values for each column in a dataframe
  • parameter: <dataframe>
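
The two assessment helpers could be sketched roughly as follows. This is an illustration under assumptions: here `col` is taken to be a column name rather than a series, and the printed messages are invented for the example.

```python
import pandas as pd


def get_values(df, col):
    """Return value counts for one column and report whether values repeat."""
    counts = df[col].value_counts()  # sorted descending, NaN excluded
    if not counts.empty and counts.iloc[0] > 1:
        print(f"{col}: duplicates found, most frequent value occurs {counts.iloc[0]} times")
    else:
        print(f"{col}: no duplicated values")
    return counts


def go_assess(df):
    """Call get_values for every column and collect the results by column name."""
    return {col: get_values(df, col) for col in df.columns}


# Illustrative data only
sample = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["a", "a", "b"]})
results = go_assess(sample)
```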

Methodology:

  1. Gather data

1.1 Twitter archive data:

  • Read in provided CSV, from subfolder using pandas.
  • Check file imported correctly.
  • Create two copies, one backup of original and one to be cleaned.
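
As a rough sketch of these steps, the snippet below reads a CSV and returns a backup plus a working copy. The helper name `load_with_backup` is hypothetical, and an in-memory sample stands in for the provided file so the example runs on its own.

```python
import io

import pandas as pd


def load_with_backup(source):
    """Read a CSV and return (backup, clean) as two independent copies."""
    raw = pd.read_csv(source)
    backup = raw.copy()  # untouched copy of the original
    clean = raw.copy()   # working copy to be cleaned
    return backup, clean


# In-memory stand-in for the CSV in the subfolder (illustrative data only)
sample = io.StringIO("tweet_id,text\n1,hello\n2,world\n")
df_backup, df_clean = load_with_backup(sample)

# Check the import: first rows and overall shape
print(df_clean.head())
print(df_clean.shape)
```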

1.2 Twitter Image Prediction data:

  • Access URL and download image predictions using Requests.
  • Binary read/write is not required, as the TSV file holds the image predictions as text rather than the images themselves.
  • Read in TSV file into pandas dataframe.
  • Check file imported correctly.
  • Create two copies, one backup of original and one to be cleaned.
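
One way to sketch the download and import, assuming the response is plain tab-separated text. The helper names are hypothetical, and the URL is left as a parameter because it is not given here. The demonstration parses an inline sample so it runs without network access.

```python
import io

import pandas as pd


def download_text(url, timeout=30):
    """Fetch a URL with Requests and return the body as text."""
    import requests  # imported here so the parsing helper works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def read_tsv(text):
    """Parse tab-separated text into a dataframe."""
    return pd.read_csv(io.StringIO(text), sep="\t")


# Inline sample standing in for the downloaded text (illustrative data only)
sample = "tweet_id\tjpg_url\timg_num\n1\thttp://example.com/a.jpg\t1\n"
df_images = read_tsv(sample)
print(df_images.head())
```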

1.3 Import tweet_id JSON:

  • Sign up to twitter and refer to developer documentation
  • Access twitter API, obtain token required for API
  • Wait for the script to run, then read the file and print the finish time
  • Visually analyse the text file for a holistic overview
  • Each entry is enclosed in {}, indicating the correct format to be read in as JSON
  • Read in data from stored .txt file
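
Reading the stored file back could look like the sketch below. It assumes one JSON object per line and the standard Twitter API v1.1 keys `id_str`, `retweet_count` and `favorite_count`; the helper name and the sample lines are illustrative.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)  # each line parses to a dict
        rows.append({
            "tweet_id": tweet["id_str"],  # assumed key: string form of the id
            "retweet_count": tweet["retweet_count"],
            "favourite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favourite_count"])


# In-memory stand-in for the stored .txt file (illustrative data only)
sample = io.StringIO(
    '{"id": 1, "id_str": "1", "retweet_count": 5, "favorite_count": 9}\n'
    '{"id": 2, "id_str": "2", "retweet_count": 0, "favorite_count": 3}\n'
)
df_api = read_tweet_json(sample)
```

Taking the string form of the id avoids any loss of precision on very large integers.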
  3. Assess data

3.1 Iteration 1

  • Visually assess using .sample(qty); the files are already in the folder, ready to be assessed in external programs.
  • Programmatically assess using .info, .describe, value_counts, isnull

Quality checklist (Completeness, Validity, Accuracy, Consistency):

  • Format
  • Duplicates
  • Nulls
  • Data Types
  • Matching

Structure checklist:

  • Single values per variable/column
  • Variables match purpose of table i.e. correct schema
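
Part of the quality checklist can be gathered in one pass. The sketch below is an assumption about how that might be done, not the notebook's code: the hypothetical `quality_summary` reports data type, null count and duplicate count per column.

```python
import pandas as pd


def quality_summary(df):
    """Return one row per column with its dtype, null count and duplicate count."""
    duplicates = pd.Series({col: df[col].duplicated().sum() for col in df.columns})
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "duplicates": duplicates,  # repeated missing values also count as duplicates
    })


# Illustrative data only
sample = pd.DataFrame({"tweet_id": [1, 2, 2], "name": ["a", None, None]})
summary = quality_summary(sample)
print(summary)
```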

3.1.1 Observations from .head() - Twitter data archive: [V]=visual, [P]=programmatically, [O]=Optional

  • 17 Variables in total
  • Col0 tweet_id: [P]
      • Integer datatype suits content
      • No nulls: .info()
      • No duplicates: df_twitter.tweet_id[df_twitter.tweet_id.duplicated()]
      • No outliers in values: .info()
  • Col1-2 in_reply_to_status_id, in_reply_to_user_id: [P][V]
      • Float datatype precision not required [P]
      • A significant quantity of values is missing
      • Possible outlier values: e+ notation seen in .describe(), with min values that differ markedly from the other values
  • Col3 timestamp:
      • [P][V] Contains date and timestamp that can be split into additional columns = Date, Time
      • [V][O] Contains +0000 at the end; research indicates its purpose is to display the timezone. Extract into timezone column.
      • Remove data past 01 Aug 2017 i.e. 2017/08/01 as requested
  • Col4 source:
      • [P][V] Contains HTML tags, extract URL
  • Col5 text: [V]
      • Contains description of the tweet, details of the dog, description of the picture, URL
  • Col6-7 retweet_status_id, ..._user_id: [V]
      • See Col1-2 comments above, similar findings
  • Col8 retweet_status_timestamp: [P][V]
      • A significant number of values are missing
  • Col9 expanded_urls:
      • [P][V] Contains duplicates within row value
  • Col10, 11 rating: [P][V]
      • Numerator exceeds 10 and values greater than 100 are present; most likely decimal points were not factored in.
      • Denominator is always 10; redundant information, no change required as requested
  • Col12 name:
      • [P][V] .value_counts shows a significant number of names not provided, and duplicates
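
To illustrate the decimal-point issue noted for the rating columns, here is a sketch that re-extracts ratings from the text with a pattern that allows a decimal numerator. The pattern, helper name and sample tweets are assumptions made for the example.

```python
import pandas as pd

# Numerator may contain a decimal point, e.g. 13.5/10
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"


def extract_rating(text):
    """Return numerator and denominator of the first rating found in each string."""
    ratings = text.str.extract(RATING_PATTERN)
    ratings.columns = ["rating_numerator", "rating_denominator"]
    return ratings.astype(float)


# Illustrative tweets only
sample = pd.Series(["This is Bella. 13.5/10 would pet", "Meet Max. 12/10"])
ratings = extract_rating(sample)
print(ratings)
```

A pattern that only accepts whole numbers would read the first tweet as 5/10, which is one plausible source of incorrect numerators.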

3.1.2 Observations from .head() - Twitter image recognition: [V]=visual, [P]=programmatically, [O]=Optional

  • [P][V] Globally, there are no null entries for all columns
  • 12 Variables in total
  • Col0 tweet_id:
      • [P] Integer datatype matches that found in twitter dataframe; will merge based on this primary/foreign key
      • [P][V] Data is valid
      • Correct integer lengths
      • Unique and conforms to a schema
      • No structure issues found.
  • Col1 jpg_url
      • [P][V] Values appear consistent
      • Variable name is descriptive but inaccurate, as some extensions are not .jpg
      • Datatype suits content
      • [P] Duplicates are present and seem correct, as these could possibly be retweets
  • Col2 img_num
      • [P][V] Appears complete
      • Unsure of purpose, information lacking
      • Max value is 4, min is 1
      • Duplicate values are expected; the data here appears to be categorical, unsure of how it is quantified/measured from initial observation
  • Col3 p1, Col6 p2, Col9 p3
      • [P][V] Consistency - strings are a mix of lower and proper case
      • Data validity/machine learning prediction accuracy issues: non-dog predictions such as canoe, suit, candle are prevalent, although the purpose of the file is to provide image predictions of dogs
      • Contains no white space
  • Col4 p1_conf, Col7 p2_conf, Col10 p3_conf
      • [P][V] Confidence of the matching prediction made by the program (e.g. p1_conf for p1)
      • Datatype suits
      • Value is not greater than 1, i.e. 100%
  • Col5 p1_dog, Col8 p2_dog, Col11 p3_dog
      • [P][V] Data validity issue: False values do not match what is seen in Col3; cross-referencing is required to see which Col3 predictions the Col5 False values correspond to

3.1.3 Observations after write to .txt file - Twitter API scrape:

  • [V] text file shows structure of data is correct and formatted to suit JSON
  • [P] imports as dict type after using json.loads
  • for loop required to sift through each object's keys within the file
  • append and merge into data frame
  • no nulls
  • Col0 tweet_id
      • [P][V] Data quality fixed during import of .txt file; tweet_id and tweet_idstr were available keys.
  • Col1 retweet_count
      • [P][V] Correct data type to suit values within column
  • Col2 favourite_count
      • [P][V] Correct data type to suit values within column

3.1.4 Define data dictionary (basic description)

3.1.4.1 Twitter Archive

  • tweet_id: numeric, user identifier
  • in_reply_to_status_id: numeric, user identifier, with NaN
  • in_reply_to_user_id: numeric, user identifier, with NaN
  • timestamp: date & time, YYYY-MM-DD HH:MM:SS+GMT
  • source: string, html tag with URL
  • text: string, twitter user text
  • retweet_status_id: numeric, user identifier, with NaN
  • retweet_status_user_id: numeric, user identifier, with NaN
  • retweet_status_timestamp: numeric, user identifier, with NaN
  • expanded_urls: string, user twitter URL
  • rating_numerator: numeric, exceeding 10, extracted from text column
  • rating_denominator: numeric, value = 10 across column
  • name: string, name extracted from text column
  • floofer: category, dog type extracted from text column
  • doggo: category, dog type extracted from text column
  • pupper: category, dog type extracted from text column
  • puppo: category, dog type extracted from text column

3.1.4.2 Twitter Image Predictions

  • tweet_id: numeric, user identifier
  • jpg_url: string, image URL
  • img_num: number, corresponds to algorithm with highest probability
  • p1: string, predicted image 1 out of top 3
  • p1_conf: numeric value, algorithm confidence in recognition
  • p1_dog: boolean, image is a dog
  • p2: see p1
  • p3: see p1

3.1.4.3 Twitter API extract

  • tweet_id: numeric, user identifier
  • retweet_count: numeric, retweet count of twitter id
  • favourite_count: numeric, favourite count of twitter id

3.2 Iteration 2

  • The sizes of the three datasets differ and are inconsistent. Join the dataframes on the tweet_ids common to all three, i.e. the lowest number of tweet_ids.
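
A sketch of such a join, assuming an inner merge on `tweet_id` is what is meant. The helper name and the three small frames are illustrative stand-ins for the real dataframes.

```python
from functools import reduce

import pandas as pd


def merge_on_tweet_id(*frames):
    """Inner-join any number of dataframes on tweet_id, keeping only shared ids."""
    return reduce(
        lambda left, right: left.merge(right, on="tweet_id", how="inner"),
        frames,
    )


# Illustrative data only: three sources of different sizes
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
images = pd.DataFrame({"tweet_id": ["2", "3", "4"], "p1": ["pug", "corgi", "canoe"]})
api = pd.DataFrame({"tweet_id": ["3", "2"], "retweet_count": [7, 4]})

merged = merge_on_tweet_id(archive, images, api)
print(merged)
```

An inner join keeps only the rows whose tweet_id appears in every source, so the result can never be longer than the smallest input.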

3.3 Iteration 3

  • Names were incorrect and needed to be extracted from text column
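
One possible way to pull names out of the text column is a regular expression over common introductory phrases. The phrases in the pattern, the helper name and the sample tweets are all assumptions for illustration; the phrasing in the real tweets may differ.

```python
import pandas as pd

# Hypothetical introductory phrases followed by a capitalised name
NAME_PATTERN = r"(?:This is|Meet|Say hello to|named)\s([A-Z][a-z]+)"


def extract_names(text):
    """Return the first capitalised name after a known phrase, or NaN if none."""
    return text.str.extract(NAME_PATTERN, expand=False)


# Illustrative tweets only
sample = pd.Series([
    "This is Bella. She's a good girl 12/10",
    "Meet Max. 13/10",
    "This is a dog 10/10",
])
names = extract_names(sample)
print(names)
```

Requiring a capital letter stops lower-case words such as "a" or "the" from being picked up as names.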

4.1 Cleaning Summary:

4.1.1 Quality Issues:

  1. col0: change tweet_id data type to string, all dataframes

df_twitter

  2. col3: change timestamp datatype to datetime
  3.1 col4: split string to remove html tag and extract text within
  3.2 col4: rename column heading from source to source_app
  4. col1,2,6,7: change datatype from float to string
  5.1 remove whitespaces in string/object columns
  • review col12 to ensure correct name transferred over
  • check numerator rating against text to confirm it is valid/correct
  • remove retweets, indicated by RT @ in text column, retweet status id and in reply to id
twitter_image_predictor

  5.2 remove whitespaces in string/object columns
  6. col3,6,9: change to lower case
  7. col1: rename from jpg_url to img_url
  8. col2: rename from img_num to conf_tweet_img
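
The image prediction fixes are small enough to sketch together. The helper name and the sample row are illustrative assumptions.

```python
import pandas as pd


def clean_predictions(df):
    """Strip whitespace, lower-case the prediction labels and rename two columns."""
    df = df.copy()
    for col in ["p1", "p2", "p3"]:
        df[col] = df[col].str.strip().str.lower()
    return df.rename(columns={"jpg_url": "img_url", "img_num": "conf_tweet_img"})


# Illustrative data only
sample = pd.DataFrame({
    "tweet_id": ["1"],
    "jpg_url": ["http://example.com/a.jpg"],
    "img_num": [1],
    "p1": ["Welsh_springer_spaniel "],
    "p2": ["collie"],
    "p3": ["Shetland_sheepdog"],
})
predictions = clean_predictions(sample)
print(predictions)
```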

twitter_api

  5.3 remove whitespaces in string/object columns

4.1.2 Structure Issues:

  1. timestamp split into three columns: date, time, timezone
  2. categorize dog type into one column
  3. merge, denormalize dataframe to contain the relevant columns required for analysis
     3.1 twitter_data to contain all relevant twitter data
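
The first two structure fixes might be sketched as below. The sketch assumes each dog type column holds its own name when that type applies and the string "None" otherwise. The `dog_type` column name and the sample rows are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def tidy_structure(df):
    """Split timestamp into date/time/timezone and merge dog types into one column."""
    df = df.copy()
    ts = pd.to_datetime(df["timestamp"])
    df["date"] = ts.dt.date
    df["time"] = ts.dt.time
    df["timezone"] = ts.dt.strftime("%z")  # e.g. +0000

    def combine(row):
        # Assumption: a column equals its own name when that dog type applies
        found = [stage for stage in STAGES if row[stage] == stage]
        return ",".join(found) if found else None

    df["dog_type"] = df[STAGES].apply(combine, axis=1)
    return df.drop(columns=STAGES + ["timestamp"])


# Illustrative data only
sample = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "doggo": ["doggo", "None"],
    "floofer": ["None", "None"],
    "pupper": ["pupper", "None"],
    "puppo": ["None", "None"],
})
tidy = tidy_structure(sample)
print(tidy)
```

Rows with more than one dog type keep both, joined by a comma, so no information is lost when the four columns collapse into one.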

Upgrades:

  • add file tracker
  • assess function
  • requirements: specify column (series) as functions do not apply to whole dataframe
  • argument = dataframe.seriesname
  • create memory release, for dataframes that have been copied
  • add function to create compiled dataframes i.e raw and clean
  • container to list all functions present and the arguments required

Results

  • Prepare:
      1.1 wrangle_act.ipynb
      1.2 wrangle__report.pdf/html for documentation of steps

References

Udacity

Docs

Misc.

Incomplete functions

Problems

SweetViz compare problem: the raw and clean dataframes are required to be the same shape.
