
sds2019's People

Contributors

abjer, asgerandersen, elben10, kristianuruplarsen


sds2019's Issues

A1 7.1.3 – number of rows in weather dataset

Hello,

In exercise 7.1.3 of Assignment 1 we are asked to:

merge station locations onto the weather data spanning 1864-1867.

In the previous exercise, these datasets are combined to form a DataFrame with 30003 rows. However, the assert statement in 7.1.3 goes like this:

assert answer_73.shape == (5686, 15)

which seems to indicate that we are supposed to merge location data with only one dataset (the one from 1864?).

Maybe I'm just going crazy, but hopefully someone can clarify! :)

/ Mathias

problem 0.3.4

Hi

I am working on the last problem (0.3.4) and find it a little difficult

First of all, how can we call logexp(1, 1000) when it has to be stored as a variable and not a function?

Secondly, if the inner function has to return func(e, k), my guess was that the value should equal func(e, k), but it does not work out for me. Maybe I just do not understand the problem?
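Without the exact problem text at hand, this sounds like it is about functions being first-class values: a variable can hold a function, and a variable that holds a function can be called with (). A generic illustration (all names below are made up for this sketch, not the assignment's):

def make_caller(func):
    # Return a new function; the returned 'inner' remembers func (a closure).
    def inner(e, k):
        return func(e, k)
    return inner

# 'logexp' below is just a variable, but it holds a function,
# so logexp(1, 1000) is a valid call.
logexp = make_caller(lambda e, k: e ** k)
print(logexp(2, 10))  # 1024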

Assignment 1 - Problem 6.1.4

Group 20 here. We have a question regarding the country codes in problem 6.1.4: when we run the assert, it fails because you have fewer unique ID numbers than we do. Have you removed some of the stations?

We have 21 unique station identifiers and you only have 11.

Assignment0 file type

Hi there

Should the assignment be handed in as a specific type of file in Absalon?
Should it be a .ipynb or a .pdf file, for example?

Thanks in advance

Example

This is an example issue. Here you can pose questions about the material and assignments, as well as practical matters.

Address KU

Hi! I'm arriving in CPH today. I read on the blog that teaching takes place at 9 AM in "lecture hall CSS 35-01-05". This is at Øster Farimagsgade 5, right?

And as an external student, should I do anything in particular before class - any registration, maybe?

Thanks!

Guido.

add list() to Box 6 under section 0.3?

Hi

To illustrate the example with ranges, did you intend to write the following?

print("Range from 0 to 100, step=1:", list(range(100)))
print("Range from 0 to 100, step=2", list(range(0, 100, 2)))
print("Range from 10 to 65, step=3", list(range(10, 65, 3)))

How should I analyze the Log?

The objective of analyzing the log is to document data quality. This means being transparent about your data collection. Analytically, you look for signs of potentially systematic missing data (certain error codes being systematically distributed in parts of the scrape, or holes in the time series indicating an error in the scraping program) and for artifacts (suspiciously similar response sizes or suspiciously short responses).

  1. Analyze systematic connection errors / error codes and systematically missing data.
  • Plot the number of error codes over time, to see if there is any systematic pattern in the missing answers.
  • Plot the number of error codes in relation to different subsections of your scrape (e.g. cnn.com/health vs. cnn.com/business), to see if there is any systematic pattern in the missing answers.
  • Plot the time before the response arrives (the delta_t column) over time, to see if server response times are changing, indicating potential problems.
  2. Look for artifacts and potential signs of different HTML formatting. Systematically different formatting of the HTML will probably force you to design two or more separate parsing procedures.
  • Plot the size distribution (length of the HTML/JSON response), e.g. as a histogram or sns.distplot, to look for potential artifacts and errors (unexpectedly small responses, or standard responses with the exact same length).
  • Plot the size of the response over time, or in relation to specific subsections (e.g. cnn.com/health or cnn.com/business), to look for potential formatting issues or errors in different subsections. A minimal plotting sketch follows this list.

If any problems are present, you get the chance to demonstrate your serious attitude towards methodological issues. You should sample anomalies (breaks in the time series, suspiciously small response lengths, or responses that are too similar, e.g. a standard empty response) and inspect them manually to find the explanation (report this).
If it is a real issue, think about the potential consequences (if any) for your analysis, and comment on potential causes and explanations, thereby demonstrating strong methodological scraping skills.
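A minimal sketch of these diagnostics, assuming the log is a CSV with columns timestamp, error, delta_t and response_size (the column names are assumptions; adapt them to your own log):

import pandas as pd
import matplotlib.pyplot as plt

# Assumed log layout: timestamp, error (HTTP status or error code),
# delta_t (time before the response), response_size (length of the response).
log = pd.read_csv('log.csv', parse_dates=['timestamp'])

# 1. Errors over time: count non-200 responses per day to spot systematic gaps.
errors = log[log['error'] != 200]
errors.set_index('timestamp').resample('D').size().plot(title='Errors per day')
plt.show()

# 2. Response-size distribution: many responses with the exact same (small)
#    size may indicate standard empty responses or changed HTML formatting.
log['response_size'].plot(kind='hist', bins=50, title='Response sizes')
plt.show()

# 3. Response time over time: rising delta_t can indicate server problems.
log.set_index('timestamp')['delta_t'].plot(title='Response time')
plt.show()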

What do I write after the for in a for loop?

When I have to create a for loop, what do I write after the for?

Ex. for x in y:

What am I supposed to write instead of the x?
And is it supposed to be related to the topic?
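For illustration, a minimal example: the name after for is a loop variable that you choose yourself. Python binds it to each element of the iterable in turn, so any valid variable name works; it does not have to be related to the topic, a descriptive name is simply easier to read:

names = ['Alice', 'Bob', 'Carol']

# 'name' is an arbitrary choice; 'for n in names:' would work just as well.
for name in names:
    print(name)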

Problem 0.3.3 (How does sigma sum notation translate into code?)

I have an issue understanding the description of problem 0.3.3. I have defined the function and tried to loop it, but there is clearly something wrong with my code. There is something wrong with the structure; should I return the value twice?

def natural_logarithm(x, k_max):
    total = 0
    for k in range(0, k_max):
        total += k
    total = total * 2
    return (1/2k_max+1)((x-1)/((x+1))**2*k_max+1)

answer_033 = natural_logarithm(2.71, 1000)
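For reference, the series this exercise appears to be based on (an assumption inferred from the code above) is ln(x) = 2 * sum_{k=0}^{k_max} (1/(2k+1)) * ((x-1)/(x+1))^(2k+1). A minimal sketch of how such a sigma sum translates into a loop, with a single return at the end:

def natural_logarithm(x, k_max):
    # Each term of the sum is (1/(2k+1)) * ((x-1)/(x+1)) ** (2k+1);
    # accumulate the terms, then multiply the total by 2 once at the end.
    total = 0
    for k in range(k_max + 1):
        total += (1 / (2 * k + 1)) * ((x - 1) / (x + 1)) ** (2 * k + 1)
    return 2 * total

print(natural_logarithm(2.71, 1000))  # approximately ln(2.71) = 0.9969...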

Assignment 1 – unique or identical hand-ins?

Hi,

Just to be clear, is Assignment 1 to be handed in as unique hand-ins (i.e. where group members work together on the issues but each write their own pieces of code), or should we agree together on the pieces of code for the assignment?

Thanks!

Is there a minimum number of pages requirement?

Hello,
My group and I are wondering whether there is such a thing as a minimum page requirement in order to get a good grade?

We saw that there is a MAX limit on the number of pages (https://abjer.github.io/sds2019/page/practical/), but is there a lower bound? I checked the previous years' reports published in Absalon and it seems they have between 15 and 18 pages, including the table of contents and various visualisations. However, we are not sure how many people wrote those reports.

Thank you for the clarification!

Exercise 3.3.4 shows no plot

When I run the code in exercise 3.3.4, it does not create a plot. The code executes without an error message, but no plot is shown. I have not changed any of the code. The lists boys and girls work, and I can see them as lists.

On Stack Exchange some people have experienced similar problems with matplotlib, and they suggest changing the "backend" of matplotlib. I am not sure how to do that, or whether that is why I am experiencing this problem?
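In Jupyter, the usual fixes are the inline backend magic or an explicit plt.show(); a minimal sketch (assuming the exercise plots with matplotlib.pyplot):

# Run this once near the top of the notebook so figures render inline.
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])
plt.show()  # outside Jupyter, an explicit show() is what opens the figure window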

Assignment 2 | assert 12.2.2.

The assert for answer 12.2.2 seems to check the dimensions of the output DataFrame: (20, 3). For checking the number of columns it asserts all(len(i) == 3 for i in output), which raises an unspecified error. Wouldn't len(output.columns) == 3 be a more robust assertion?
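For comparison, a minimal sketch of the two checks (assuming output is a 20-by-3 DataFrame; the column labels here are made up):

import pandas as pd

output = pd.DataFrame({'aaa': range(20), 'bbb': range(20), 'ccc': range(20)})

# Iterating over a DataFrame yields the column *labels*, so len(i) measures
# the length of each label string, not the number of columns:
assert all(len(i) == 3 for i in output)  # only passes because each label has 3 characters

# Checking the shape or the columns directly is robust to the label names:
assert output.shape == (20, 3)
assert len(output.columns) == 3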

8.2.3 - Scrape links from category pages on Trustpilot

Hi, I have written the following function, but I can't get it to work.
The lists 'reviews' and 'firmaer' are empty. The tags and classes should be right.

Can you help me?

import requests
from bs4 import BeautifulSoup

firmaer = []
reviews = []

def scraper(url):
    trin1 = requests.get(url)
    trin2 = BeautifulSoup(trin1.text, 'html.parser')

    firmaer.append(trin2.find_all('h3', {'class': 'category-business-card__header'}))
    temp_url = trin2.find_all('a', {'class': 'category-business-card card'})
    for review in temp_url:
        reviews.append(url + review['href'])

scraper('https://www.trustpilot.com/categories/social_club')

About the Connector Class

Hi!

We are several people on my team using the requests module to collect tweets from the Twitter API.
We might use several computers to make our requests to the API. As it stands, we will end up with several log files. Is that a problem? And should we hand in our log files?

Best regards

Problems with fetching GitHub

After I tried to fetch the new updates from GitHub, I got the following error message/warning:
[screenshot of the error message not included]
I have already tried finding the issue myself, but I can't seem to solve it. If anyone can point me in the right direction, that would be great.

Ex. 6.1.5

Hi,

I tried to create the code for 6.1.5. I tested it line by line, and it doesn't seem to work with the loop. I am trying to get the country codes and insert them into the appropriate column, "Country_Code". What should I change?

Note: This is not the full code I intend to write, but it's what I have so far:

def weather(year):
    import pandas as pd
    import re
    url="https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"+year+".csv.gz"
    data=pd.read_csv(url,header=-1)
    data=data.drop(data.columns[4:],axis=1)
    COLS=["Station_Identifier","Observation_Time","Observation_Type","Observation_Value"]
    data.columns=COLS
    data["Observation_Value"]=data["Observation_Value"]/10
    data.round(decimals=2)
    data2=data.loc[(data["Observation_Type"]=="TMAX")]
    data2["TMAX_F"]=data2["Observation_Value"]*1.8+32
    data2["Observation_Time"]=data2["Observation_Time"].astype(str)
    data2.Observation_Time=pd.to_datetime(data2["Observation_Time"]) #.loc[row_indexer,col_indexer] = value instead
    data2["Month"]=data2["Observation_Time"].dt.month
    data2["Country_Code"]=""
    for i,row in data2.iterrows():
            data2.loc[i,"Country_Code"]=" ".join(re.findall("[a-zA-Z]+", data2.loc[i,"Station_Identifier"]))
    data2.set_index("Observation_Time")
    print(data2)


weather("1905")`
```

Thanks,
Andreas
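Not the official solution, but one likely simplification: GHCN station identifiers begin with a two-character country code, so the iterrows() loop can be replaced by a vectorized string operation (assuming that prefix convention holds for your data):

# Vectorized alternative to the iterrows() loop: slice the leading
# characters of every station identifier in one pass.
data2["Country_Code"] = data2["Station_Identifier"].str[:2]

# Also note that set_index returns a new DataFrame unless assigned back:
data2 = data2.set_index("Observation_Time")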

sum([1,2,3]) error: TypeError: 'int' object is not callable

Hi,

I have been playing around with Python in Jupyter Notebook (not necessarily directly relevant to the assignment). I tried to run the sum() function on a list with a start value of 10:

thirdlist = [1, 2, 3, 4, 5]
x = sum(thirdlist, 10)
print(x)

I get the following output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
      1 thirdlist=[1,2,3,4,5]
----> 2 x = sum(thirdlist,10)
      3 print(x)

TypeError: 'int' object is not callable

I can't seem to understand what I'm doing wrong. I've tried Googling without any luck. Moreover, when I try the same code in SublimeText and PythonTutor's Visualizer, it gives the correct output (25).

Thank you,
Andreas
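A likely cause (an assumption, since the rest of the notebook isn't shown): an earlier cell rebound the name sum to an integer, shadowing the built-in; restarting the kernel or deleting the binding restores it. A minimal reproduction:

thirdlist = [1, 2, 3, 4, 5]

sum = 15                  # an earlier cell like this shadows the built-in
# x = sum(thirdlist, 10)  # would now raise TypeError: 'int' object is not callable

del sum                   # remove the shadowing name (or restart the kernel)
x = sum(thirdlist, 10)    # the built-in works again: 1+2+3+4+5 plus start value 10
print(x)                  # 25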

Do we need to submit our raw data?

Hi everyone,

So, in the project requirements we read that we only need to hand in our report in PDF format and our Jupyter notebook as documentation. However, in order to run our notebook, you also need the raw data as input. We were wondering whether we need to hand in our raw data as well. If so, how would we do that?

Looking forward to hearing from you!

Purpose of "answer_0xx" comments

Just a quick question on the formatting of the Assignment 0 notebook:

What is the purpose of the comments at the top of the exercise cells? Are we supposed to actually fill them in (i.e. # answer_0xx = whatever the value of the variable is) as well as just passing the assertions?

Working With Lists In DataFrames

Hi,

My group and I use the hashtags #MakeAmericaGreatAgain and #ImWithHer as the basis for our project. However, we would also like to see the other hashtags used in the tweets. To that end, we wrote code that inserts a list of all hashtags used into an "All Hashtags" column for every tweet.

hu=[]
for i in range(len(data["results"])):
    ho=[]
    for d in data['results'][i]['entities']['hashtags']:
        ho.append(str(d["text"]))
    hu.append(ho)
        
df["Hashtag"].copy()[0]=hu[0]

We were wondering the following:

  1. How can we easily count the number of times the different hashtags have been used? We tried df["Hashtag"].value_counts(), but that counts the number of times specific lists occur rather than the elements in them. I guess we could write a loop, but I'd hope for a more elegant solution.

  2. Is there a way to write the general code in a more 'smooth' way? And should we even use lists in the way we have done?

Thank you!
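One possible approach to question 1, assuming the column holds one list of hashtags per row: pandas' explode (available from pandas 0.25) flattens the lists into one row per hashtag, after which value_counts() counts the individual elements. A sketch with made-up data:

import pandas as pd

# Hypothetical data: one list of hashtags per tweet.
df = pd.DataFrame({'Hashtag': [['MAGA', 'ImWithHer'], ['MAGA'], ['ImWithHer', 'MAGA']]})

# explode() gives each list element its own row; value_counts() then
# counts hashtags instead of whole lists.
counts = df['Hashtag'].explode().value_counts()
print(counts)  # MAGA: 3, ImWithHer: 2

On question 2: note that df["Hashtag"].copy()[0] = hu[0] assigns into a temporary copy and leaves df unchanged; assigning the whole column at once (df["Hashtag"] = hu, assuming hu has one entry per row of df) avoids that.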

Solutions for ex 13

Could you please upload the solutions for exercise 13? It would be nice to be able to cross-check.
Thank you

Jupyter Notebook Directory

How do I change the home directory in Jupyter Notebook so that none of my personal files are visible? It should be noted that I have a Mac.
/Kasper.
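One common way (the path below is a hypothetical example): generate a Jupyter config file and set the notebook directory in it.

# In a terminal, generate the config file once:
#   jupyter notebook --generate-config
# Then edit ~/.jupyter/jupyter_notebook_config.py and set:
c.NotebookApp.notebook_dir = '/Users/kasper/sds2019'  # hypothetical project folder

# Alternatively, start the server pointed at that folder each time:
#   jupyter notebook --notebook-dir=/Users/kasper/sds2019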

assignment1 ex. 7.1.2

Problem with the assert statement. The assert statement states:

assert answer_72.shape == (30003, 7)

however, the csv-file weather_data_1864to1867.csv only contains 29638 lines.

Hand-in of notebooks

Should we hand in several notebooks, or should we combine them all and hand in only one notebook?

Ex. 8.2.3 - cannot get any company review links?

In Ex. 8.2.3 (the Trustpilot case) we are asked to extract links to company reviews from a category page. But when I retrieve the HTML code from the category page, there are no review links in it (I have tried a number of different categories).

If I go to e.g. https://www.trustpilot.com/categories/consultant and >Inspect Element< on a company review link, I can easily find the company review links in the HTML. But when I retrieve the code using requests.get() or connector.get(), nothing...

I even tried running the uploaded solutions (the exercise_8_sol notebook) as is, but the code in Ex. 8.2.3-4 simply returns an empty list. If I run the following code block (Ex. 8.2.5-7) and print the randomly drawn company_links, I also get an empty list (again, this is from the uploaded solutions, not my own code).

Am I missing something here?

Exercises 5 – Missing figures

In exercises 5.1.2, 5.1.3, 5.2.3 and 5.2.4, a drawing/image should normally show what we need to remake or help us understand the dataset. But the drawings are not showing for me.
Can anyone help me load these drawings?


Twitter API maximum amount of tweets retrieved

Hi Everybody,

We cannot seem to find a function, using the Twitter API, that finds the tweets between two specified dates. So far we have used:

id_start = 1125634881223000064
id_end = 1136190080220061698
api.user_timeline(id=name, since_id=id_start, max_id=id_end)

but user_timeline will only retrieve 20 tweets, so we tried adding count=200:

id_start = 1125634881223000064
id_end = 1136190080220061698
api.user_timeline(id=name, since_id=id_start, max_id=id_end, count=200)

which defeats the purpose of since_id and max_id. In other words:

isn't there a smarter way to get the tweets from one date to another, still using the Twitter API?

Kind regards,
Group 10
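One approach worth trying (a sketch, not a verified solution for your exact setup): tweepy's Cursor paginates through user_timeline for you, so you are not limited to a single page of 20 or 200 tweets. Note that user_timeline only reaches a limited number of a user's most recent tweets, regardless of pagination.

import tweepy

# auth is assumed to be set up as elsewhere in this thread.
api = tweepy.API(auth, wait_on_rate_limit=True)

id_start = 1125634881223000064
id_end = 1136190080220061698

# Cursor keeps requesting pages until the since_id/max_id window
# is exhausted, instead of returning a single page.
tweets = [tweet for tweet in tweepy.Cursor(api.user_timeline,
                                           id=name,  # 'name' as in the code above
                                           since_id=id_start,
                                           max_id=id_end,
                                           count=200).items()]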

Lecture 8 - Connector function use "with open"?

I haven't looked in detail at the Connector function, but I noticed that after running it myself once, the file didn't close immediately, i.e. I couldn't view it (perhaps it hadn't flushed the cache?).

My question is, would it be useful to open the log file in the Connector function using "with open" to always close the file, when needed?

Perhaps this has been considered and isn't necessary?
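For illustration, a minimal sketch of the suggested pattern (not the actual Connector code): opening the log inside a with block guarantees the file is closed, and therefore flushed, after every write.

import time

def log_call(logfile, url, status_code, delta_t):
    # 'with' closes (and flushes) the file as soon as the block exits,
    # so the log is always readable between calls.
    with open(logfile, 'a') as f:
        f.write(f'{time.time()};{url};{status_code};{delta_t}\n')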

Problem 0.3.1 - Writing average function

Hi! I'm having a problem with problem 0.3.1. I tried different ways to store the result of the average function in answer_031; the output of the following returns 5.0, so I converted it to int, but I still can't figure out what I'm missing...

# answer_031 = 

l = [1,2,3,4,5,6,7,8,9]
    
def average(l): 
    return sum(l) / len(l) 

answer_031 = average(l)

print(answer_031)

int(answer_031)
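A possible catch (an assumption, if the assertion expects an int): int(answer_031) returns a new value but leaves answer_031 unchanged; the result has to be assigned back. A minimal sketch:

l = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def average(l):
    return sum(l) / len(l)

# int() does not modify its argument in place; rebind the name instead.
answer_031 = int(average(l))

print(answer_031)  # 5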

Basics: Opening and editing the .ipynb file

Hi! I'm starting on Assignment 0. In order to complete it, should we download the .ipynb file, work on it in our local Jupyter Notebook, and then submit it somehow? Or how does it work?

Many thanks!

Guido.

Snorre's "Log" implementation API + Analysis of datalog

Hi Everybody,

We have two questions in regards to Snorre's datalog file:

  1. Logging file with API:
    We have tried to implement the 'logging code' from lecture 8 in our function that extracts tweets, but it does not save anything to the file. Our function goes through Twitter's API to extract the data, not a URL. We do not use the requests module; we use tweepy:
auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth) 

Maybe you can give us an idea of how to implement it correctly? (One possibility is sketched below.)

  2. Describing the data (maybe this one is more directed at Snorre):
    We understand that the log file will log errors and save timestamps etc. How do you want us to analyze the data?

Kindest regards,
Group 10
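Not Snorre's actual implementation, but one way to log tweepy calls with Python's standard logging module (all file and function names here are illustrative):

import logging
import time

# Append one line per API call to a log file.
logging.basicConfig(filename='scrape_log.csv',
                    level=logging.INFO,
                    format='%(asctime)s;%(message)s')

def fetch_timeline(api, user_id, **kwargs):
    t0 = time.time()
    try:
        tweets = api.user_timeline(id=user_id, **kwargs)
        logging.info(f'{user_id};OK;{len(tweets)};{time.time() - t0:.2f}')
        return tweets
    except Exception as e:  # e.g. rate-limit or connection errors
        logging.info(f'{user_id};ERROR;{e};{time.time() - t0:.2f}')
        return []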

Saving API CSV-response in a JSON-file

Hi, we talked with one of the TAs, who advised us to choose CSV as the format when choosing the settings of the GET call in the Danmarks Statistik API (JSON was not in the list of available settings). How do we transform the CSV format into JSON format, which is necessary to complete 3.3.2? Or should we change the link constructor in 3.3.1?

First, we created the function that builds the link for specific parameters (sorry, it's not perfect):

def construct_link(table_id, lst):
    url = "https://api.statbank.dk/v1/data/" + table_id + "/CSV?"
    for element in lst:
        url += element + "&"
    return url[:-1]

This is the code for 3.3.2, where we tried without success to use the requests module on the CSV format:

import requests
import csv

response = requests.get(construct_link("FOD", ['tid=*', "barnkon=p"]))
response_csv = response.csv()

with open('my_csv.csv', 'w') as csv:
    csv.write(response_csv)
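A sketch of one way to make this work: a requests Response has no .csv() method (and with ... as csv shadows the csv module). Assuming the API returns delimiter-separated text (inspect response.text to confirm the delimiter), pandas can parse it and write JSON:

import io
import requests
import pandas as pd

response = requests.get(construct_link("FOD", ['tid=*', 'barnkon=p']))

# Parse the CSV text; Statistics Denmark CSV is typically semicolon-separated
# (an assumption here; check response.text to confirm).
df = pd.read_csv(io.StringIO(response.text), sep=';')

# Write the same data as JSON, one record per row.
df.to_json('my_data.json', orient='records')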
