
sds's People

Contributors

abjer, krier, kristianuruplarsen, snorreralund


sds's Issues

What is the size of int and float?

I understand that the default for an int is int64, but you can specify int8, which is less precise but would, I imagine, take less memory and storage space.

I was unable to find a list of the exact storage sizes for the different types. Does anyone have a URL or a list?
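Assuming the question is about NumPy/pandas dtypes, the exact sizes and value ranges can be inspected directly rather than looked up:

```python
import numpy as np

# Each fixed-width integer type stores a known number of bytes,
# which also bounds the values it can represent.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(np.dtype(dtype).name, np.dtype(dtype).itemsize, "bytes,",
          "range", info.min, "to", info.max)

# Floats have machine limits too:
print(np.finfo(np.float64).eps)  # spacing between 1.0 and the next float
```

One caveat: for integers, int8 is not "less precise" but exact with a smaller range (-128 to 127); precision trade-offs only apply to the float types.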

ratelimit and iterations

We are having a problem with the following code.
First we specify our url, and now we are trying to set iterations=10 and a rate limit. Does anyone have a clue why it is not working?

import time

def ratelimit():
    time.sleep(1)  # sleep one second.

def get(url, iterations=10, check_function=lambda x: x.ok):
    for iteration in range(iterations):
        try:
            ratelimit(10)
            response = session.get(url)
            if check_function(response):
                return response
        except exceptions as e:
            print(e)
    return None
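Two things stand out: ratelimit(10) passes an argument to a function defined with none, which raises a TypeError before any request is made, and `except exceptions` names an identifier that is never defined. A sketch of a working version, assuming the requests library (which provides session objects and a base RequestException class):

```python
import time
import requests

session = requests.Session()

def ratelimit():
    time.sleep(1)  # sleep one second between requests

def get(url, iterations=10, check_function=lambda x: x.ok):
    for iteration in range(iterations):
        try:
            ratelimit()  # no argument: the delay is fixed inside ratelimit()
            response = session.get(url)
            if check_function(response):
                return response
        except requests.exceptions.RequestException as e:
            # Catch a concrete exception class, not a bare name
            print(e)
    return None
```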

Multiprocessing Pool doesn't work on Windows 10

I can use Pool to parallelize my work on my MacBook, but when I try to run the code faster on my more powerful Windows machine, it does not work. None of the cores start doing any work; for a sample of 100 I let it run for 10 minutes, but nothing happened.

So I use this code:

def tree_paralel(x):
    tree = DecisionTreeClassifier(criterion="gini", max_depth= x, random_state=1)  
    accuracy_ = []
    for train_idx, val_idx in kfolds.split(X_dev, y_dev):

        X_train, y_train = X_dev.iloc[train_idx], y_dev.iloc[train_idx]
        X_val, y_val = X_dev.iloc[val_idx], y_dev.iloc[val_idx] 
        
        X_train = pd.DataFrame(im.fit_transform(X_train),index = X_train.index)
        X_val = pd.DataFrame(im.transform(X_val), index = X_val.index)
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_val)
        accuracy_.append(accuracy_score(y_val, y_pred))
    print("This was the "+str(x)+" iteration", (dt.now() - start).total_seconds())
    return accuracy_

and then run:

start = dt.now()
p = Pool(4)

input_ = range(1,11)
output_ = []
accuracy = []
for result in p.imap(tree_paralel, input_):
    output_.append(result)
p.close()
temp = pd.DataFrame(output_).mean(axis = 1)
temp.index = input_
optimal_t = temp.nlargest(1)
print("Time:", (dt.now() - start).total_seconds())
print("Optimal hyperparameter: "+ str(optimal_t.index[0]) + " with accuracy: " + str(optimal_t.values) )
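This hanging behavior is a known Windows limitation rather than a bug in the logic above: Windows starts worker processes with "spawn", which re-imports the main module, so Pool code must be protected by a main guard, and the worker function must be importable (defined at module top level in a .py file, not only in a notebook cell). A minimal sketch of the pattern, with a toy worker function:

```python
from multiprocessing import Pool

def square(x):
    # Worker function: must be defined at module top level so that
    # spawned worker processes (the default on Windows) can import it.
    return x * x

if __name__ == "__main__":
    # Without this guard, each spawned worker re-executes the module,
    # tries to create its own Pool, and the program hangs on Windows.
    with Pool(4) as p:
        results = p.map(square, range(10))
    print(results)
```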

Exercise 9.2.3

Can you elaborate on exercise 9.2.3? We are having a bit of trouble interpreting the task.

Our interpretation is: we have some amounts (from exercise 9.2.2), and we now have to fish out the sentences in which these amounts occur, in order to understand the context. Is this understood correctly?

Visualizing data in other programs

Are we allowed to visualize our data in programs other than IPython and add the plots to the final exam paper?

Since time is limited, it would be very beneficial to use other programs for visualization, drawing on the knowledge we already have of them, and then describe, document, and add the plots to the paper.
We know that learning python/seaborn/matplotlib "the hard way" would be more beneficial for us in the long run, but for now we only have until Saturday morning...

Index problem.

ex. 13.1.3

KeyError: ['.., ,...'] not in index.

Has anyone solved the index problem in this exercise?
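For reference, pandas raises exactly this error when you select columns by a list of labels that are not all present, for example after a rename or a whitespace mismatch in the header. A minimal reproduction, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"price": [1, 2], "area": [50, 60]})

try:
    df[["price", "rooms"]]          # "rooms" does not exist
except KeyError as e:
    print(e)                        # reports that ['rooms'] is not in the index

# A common cause is stray whitespace in the header; inspect the real names:
print(list(df.columns))
# and select defensively, keeping only labels that actually exist:
subset = df[[c for c in ["price", "rooms"] if c in df.columns]]
```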

Some practical questions for the exam

Should we write names, student IDs, or exam numbers on the project?

The project has a maximum of 24 pages (normalsider, i.e. standard pages). How do you measure a standard page that contains graphs, for instance?

And how is the project graded? Should we write who did which parts of the project for an individual grade, or are we graded as a group?

What does PolynomialFeatures do?

I have some difficulty completely understanding what this function does.

$y = ax_1 + bx_2 + ...$

If I run this through PolynomialFeatures without specifying the degree, it will default to degree 2. Do we get this:

$y = ax_1^2 + bx_2^2 ...$

or:

$y = a x_1^2 + b x_1 x_2 + c x_2^2 + ...$

?
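The easiest way to check is to transform a tiny example. With the default degree=2, scikit-learn's PolynomialFeatures produces the bias term, the original features, the squares, and the cross terms, so it is the second interpretation (plus the original linear terms):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, x1 = 2 and x2 = 3
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures()  # degree=2 by default
print(poly.fit_transform(X))
# Columns are: 1, x1, x2, x1^2, x1*x2, x2^2
# so the output is [[1. 2. 3. 4. 6. 9.]]
```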

How do I change my working directory?

I find it irritating that all the files I create through Jupyter Notebook are saved to the Home folder on my computer by default. I would like to arrange a neat file structure for this course like for any other course.

  1. How do I change this?

  2. Is there anything I need to think of regarding my future work with git?

I have tried Stack Overflow but find it hard to jump around in the terminal using 'cd'.
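A sketch of the usual fix, assuming a macOS/Linux shell and a hypothetical folder name: Jupyter uses the directory it is launched from as its root, so create a course folder, cd into it, and start Jupyter there. A git repository can live in (or be cloned into) that same folder without any special handling.

```shell
# Make a folder for the course and move into it
mkdir -p ~/sds_course
cd ~/sds_course

# pwd prints where you are; ls lists what is here
pwd
ls

# Launching Jupyter from here makes this folder its root, so new
# notebooks are created and saved in it (uncomment to run):
# jupyter notebook
```

The same navigation applies inside the terminal generally: `cd foldername` goes down, `cd ..` goes up, and `pwd` confirms where you are before launching anything.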

loading a shapefile error

I am trying to load in a shapefile. And yes, I have installed Geopandas and imported geopandas and shapely, but I am getting the error shown below the code. Any idea why?

# set the filepath and load in a shapefile
fp = “Desktop/DEU_adm1.shp”
map_df = gpd.read_file(fp)

# check the data type so we can see that this is not a normal dataframe, but a GeoDataFrame
map_df.head()

File "", line 2
fp = “Desktop/DEU_adm1.shp”
^
SyntaxError: invalid character in identifier
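The caret in the traceback points at curly ("smart") quotes, which word processors and some websites substitute for the straight quotes Python requires. A minimal illustration of the difference:

```python
# Straight ASCII quotes are valid Python syntax:
fp = "Desktop/DEU_adm1.shp"

# Curly quotes (U+201C / U+201D) are not, and produce exactly this
# kind of SyntaxError when they appear in source code:
try:
    compile('fp = \u201cDesktop/DEU_adm1.shp\u201d', '<snippet>', 'exec')
except SyntaxError as e:
    print("SyntaxError:", e)
```

Retyping the path by hand in the editor, rather than pasting it from a document, usually avoids the problem.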

Vague questions

Sometimes it would be nice to have an excerpt or an example of the solution you want us to find, like: "you should end up with strings that look like this: 'string_example'", or even a picture of a sample of the expected dataframe. For example, this assignment:

Ex. 8.2.1: Visit the https://www.trustpilot.com/ website and locate the categories page. From this page you find links to company listings. Get the category page using the requests module and extract each link to a specific category page from the HTML. This can be done using the basic python .split() string method. Make sure only links within the /categories/ section are kept, checking each string using the if 'pattern' in string condition.

I then find the url https://www.trustpilot.com/categories/companies, which has links to company listings from the category page. Even after reading the question several times it was difficult to understand how to proceed from there. After looking at the solution guide and reading the question again, I see that I misunderstood the crucial line "extract each link to a specific category page from the HTML", and I still find it difficult to completely understand the assignment. After asking around, I was not the only one with this issue, and this is not the first time it has been quite difficult to understand exactly what an assignment asks of us, which is a source of a lot of frustration and an immense time sink.

The following exercise, which is also quite difficult to get your head around, seems like it should have an example, but it is sadly missing, even when I read the file directly from the repository:

Ex. 7.2.13: Turn the dataset from wide to long so hourly data is now vertically stacked. Store this dataset in a dataframe called data. Name the column with hourly information hour_period. Your resulting dataframe should look something like this.

Geopandas: Map of Denmark

We are trying to make a map of Denmark with Geopandas. We have found a shapefile with the boundaries of DK that we hope we can use, but we can't read the shapefile into Python. Our code is:

import geopandas as gpd
import pandas as pd

fp = '/Users/louiseankersen/anaconda3/lib/python3.6/site-packages/geopandas/datasets/dk_100km.shp '
map_df = gpd.read_file(fp)

map_df.head()

But Python won't read the path for the file. Does anyone have suggestions as to what we can do to read the shapefile into Python?
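One detail worth checking: the path string in the snippet above ends with a space before the closing quote, and file lookups treat that space as part of the filename. A quick way to see and normalise this, using a hypothetical path:

```python
import os

fp = '/tmp/dk_100km.shp '            # note the trailing space
print(os.path.basename(fp))          # 'dk_100km.shp ' with the space included
print(repr(fp.strip()))              # '/tmp/dk_100km.shp' after stripping it
```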

Specifying hyperparameters

In class/at lecture we used np.logspace(-4,4,50) (or 33 as last input) to specify the range of hyperparameters to search.

This is quite broad and would then give us the hyperparameter that minimizes the MSE. In class, ABN mentioned that we could then create a new range using np.logspace around the hyperparameter returned by the first minimization.

My question is as follows:

Are we expected to go through this process and thereby narrow down the search for the best hyperparameter of them all?

If we were to do it, would we just set the original range to the newly found range (and comment that we found the new range via iteration), or would we be expected to copy all of the code and insert the new range into that copy, so the code that produced the new range would still be in the notebook?

Thanks in advance.
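The coarse-to-fine idea described above can be sketched as follows, with a hypothetical scoring function standing in for the cross-validated MSE the real notebook would compute:

```python
import numpy as np

def cv_mse(lam):
    # Hypothetical stand-in for a cross-validated MSE; in the real
    # notebook this would refit the model for each candidate lambda.
    return (np.log10(lam) - 1.3) ** 2

# Stage 1: broad search over many orders of magnitude, as in class
coarse = np.logspace(-4, 4, 50)
best = coarse[np.argmin([cv_mse(l) for l in coarse])]

# Stage 2: a narrower grid centred (in log space) on the stage-1 winner
fine = np.logspace(np.log10(best) - 1, np.log10(best) + 1, 50)
best_fine = fine[np.argmin([cv_mse(l) for l in fine])]
print(best, best_fine)
```

Keeping both stages in the notebook, as above, documents how the final range was found without duplicating the model-fitting code.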

Working with Jobindex in the exam

We are working with Jobindex.dk for our project and are trying to scrape it successfully. If any other groups are trying to scrape Jobindex, we could meet and exchange knowledge about this.

assignment submission

Hi everyone,

I am logged in on Peergrade but am not able to see where to submit assignment 1.
I am not sure whether one of my group members has uploaded it yet. Is there a way to check if the assignment has already been submitted?

Thanks,
Paula

(screenshot attached, taken 2018-08-19 4:58 PM)

Anyone using geopandas?

Is anyone using Geopandas who wants to share information on how to install and import it properly?

I am trying to install geopandas with:

pip install geopandas

It installs correctly, but when importing it using:

import geopandas as gpd

I get this error:

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 import geopandas as gpd

~\Anaconda3\lib\site-packages\geopandas\__init__.py in <module>()
      2 from geopandas.geodataframe import GeoDataFrame
      3
----> 4 from geopandas.io.file import read_file
      5 from geopandas.io.sql import read_postgis
      6 from geopandas.tools import sjoin

~\Anaconda3\lib\site-packages\geopandas\io\file.py in <module>()
      1 import os
      2
----> 3 import fiona
      4 import numpy as np
      5 import six

~\Anaconda3\lib\site-packages\fiona\__init__.py in <module>()
     67 from six import string_types
     68
---> 69 from fiona.collection import Collection, BytesCollection, vsi_path
     70 from fiona._drivers import driver_count, GDALEnv
     71 from fiona.drvsupport import supported_drivers

~\Anaconda3\lib\site-packages\fiona\collection.py in <module>()
      7
      8 from fiona import compat
----> 9 from fiona.ogrext import Iterator, ItemsIterator, KeysIterator
     10 from fiona.ogrext import Session, WritingSession
     11 from fiona.ogrext import (

ImportError: DLL load failed: Det angivne modul blev ikke fundet. (The specified module was not found.)

answers to exercise session 18

Can the answers to exercise session 18 be uploaded before the weekend?

Also: I can't even get the example running in Jupyter Notebook on Windows 10. Nothing happens, except that the kernel gets stuck.

Ex 12.2.3 - What on x-axis?

Ex.12.2.3: Make a plot with on the x-axis and the RMSE measures on the y-axis.

Missing a word about what to plot on the x-axis. Lambda?
