mhoogen / ml4qs

Code accompanying the book Machine Learning for the Quantified Self.

Python 91.62% R 8.26% Batchfile 0.01% Shell 0.03% Dockerfile 0.07%

ml4qs's People

florisdenhengst ashvinmanoj dromescu asutosh7hota zjy63562680 penningmeester tomescumihail93 kevanputten mrbads felixtan markwk symphony2014 vijayvardhan94 yonischirris spijkervet hedayat-r rvrheenen sssalcedo jszkodon jeba91 arumoy-shome asahoo1995 mh305 thommyson guusjeb dankonig avs78 rahulj123 sinberlin2 adah98 edadamian tmaaiveld xytreyum yaaani85 ml4qs-2 funkemt mick-ijzer l8518 mrthefastfender renxie luayoxu nhemisirmkow rskeskin jesse-ende daviddgd abijithanikkuruthi sebastiaangroeneveld nilshmeier esteban123210 anwarasif rubenhorn marinoandrea lagewel001 moin2435 tasosmitsi marijn111 hkstm michellekln baselaslan niyaij imtiaz-nazarali maxiels mahir079 serafim179 dtenwolde deborahvans estsaon romnn dutchcodes tessad okkevaneck zohee95 jwillekes brianlochanan rinapiggy buelentuendes adwitiya23 dariusbarsony lccls eantjon sandergs92 tisnn abhilashbalaji max-faber miker1423 guo-weiqiang yijie007 thomasdingemanse lauralatorrem hhuo7 mullaisanthanam behnam7171 kianfar77 philiphoek rsaeta warosaurus data-monk-123 hnouraei mansuba yanxinlan

ml4qs's Issues

Is the code just the answers to the "Pen and Paper" exercises?

Distance metrics other than Euclidean fail for hierarchical clustering

Hierarchical clustering fails when the Manhattan distance, or the Minkowski distance with a specified value of p, is used as the distance metric. See the following lines in the code:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/Chapter5/Clustering.py#L307-L310
This is due to:
a) pdist expecting the string 'cityblock' for the Manhattan distance;
b) linkage not accepting additional arguments, so p cannot be specified.
A possible workaround for b) would be:
from scipy.spatial.distance import pdist
self.link = linkage(temp_dataset.as_matrix(), method=link_function, metric=lambda x, y: pdist([x, y], 'minkowski', p=p)[0])
This is, however, significantly slower.
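A faster alternative to the per-pair lambda may be to compute the condensed distance matrix once with pdist and pass that to linkage, which accepts a precomputed condensed matrix. The sketch below is not the repository's code: the helper name linkage_with_metric and its parameters are made up for illustration, and the default link function 'average' is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def linkage_with_metric(data, link_function="average", metric="euclidean", p=2):
    """Hypothetical helper: cluster on a precomputed condensed distance matrix."""
    # pdist knows the Manhattan distance under the name 'cityblock'.
    name = "cityblock" if metric == "manhattan" else metric
    # Only the Minkowski metric takes the extra parameter p.
    kwargs = {"p": p} if name == "minkowski" else {}
    condensed = pdist(np.asarray(data, dtype=float), metric=name, **kwargs)
    # linkage treats a 1-D input as a condensed distance matrix.
    return linkage(condensed, method=link_function)


rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))  # toy data, not the book's dataset
link_manhattan = linkage_with_metric(points, metric="manhattan")
link_minkowski = linkage_with_metric(points, metric="minkowski", p=3)
```

Note that the 'centroid', 'median' and 'ward' link functions are only well defined for Euclidean distances, so this approach is best combined with 'single', 'complete' or 'average'.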

plot_numerical_prediction_vs_real uses fixed ylims

As the title says. If the method is used for anything other than heart rate prediction, the plots often come out empty.
In our opinion, the following line causes this:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/util/VisualizeDataset.py#L302
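One way to avoid fixed limits would be to derive them from the plotted values. This is a minimal sketch, assuming the real and predicted values are available as array-likes; the helper padded_limits and its margin parameter are hypothetical and not part of VisualizeDataset.py.

```python
import numpy as np


def padded_limits(*series, margin=0.05):
    """Hypothetical helper: y-limits covering all given series plus a margin."""
    values = np.concatenate([np.asarray(s, dtype=float).ravel() for s in series])
    values = values[np.isfinite(values)]  # ignore NaN and inf entries
    if values.size == 0:
        return (0.0, 1.0)  # nothing to plot, fall back to a unit range
    low, high = values.min(), values.max()
    # Use a fixed pad of 1.0 when all values are identical.
    pad = (high - low) * margin or 1.0
    return (float(low - pad), float(high + pad))


# Example with made-up numbers; with matplotlib one could then call
# ax.set_ylim(*padded_limits(real_values, predicted_values)).
real_values = [0, 10]
predicted_values = [5, 20]
limits = padded_limits(real_values, predicted_values, margin=0.1)
```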

max cluster number not used

In the following line:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/crowdsignals_ch5.py#L112
the value k for the maximum number of clusters is not used; instead, the function is always called with the input 5.
(This also explains the constant silhouette value.)

Wrong computing of LOF

ML4QS/Python3Code/Chapter3/OutlierDetection.py

Line 144 in 6381eb8

reachability_distances_array[i] = self.reachability_distance(k, i, neighbor)

In this line, the variable i is redefined by the enumerate in the line above. As a result, the reachability distance is computed between rows 0, 1, ..., k and neighbor (a neighbor of the main i passed in the function's arguments), whereas it should be computed between i and neighbor. I suggest renaming the function argument from i to root_i.
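The suggested rename can be sketched as follows. This is a simplified stand-alone version, not the method from OutlierDetection.py: the function signature is an assumption, and reachability_distance is passed in as a callable so that the example is self-contained.

```python
def local_reachability_density(root_i, neighbors, reachability_distance):
    """Simplified sketch: lrd of root_i given its neighbors.

    reachability_distance is a hypothetical callable (instance, neighbor) -> float.
    """
    distances = [0.0] * len(neighbors)
    # The loop index gets its own name, so root_i is never overwritten.
    for idx, neighbor in enumerate(neighbors):
        distances[idx] = reachability_distance(root_i, neighbor)
    total = sum(distances)
    # The lrd is the inverse of the mean reachability distance.
    return float("inf") if total == 0 else len(neighbors) / total


# Record which pairs are evaluated to show that root_i is used throughout.
calls = []

def constant_distance(instance, neighbor):
    calls.append((instance, neighbor))
    return 2.0

lrd = local_reachability_density(7, [3, 4], constant_distance)
```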

Reservoir matrix unlikely to be sparse

In our opinion this matrix W for the echo state network is not going to be sparse.
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/Chapter8/LearningAlgorithmsTemporal.py#L68
Is this intended? According to the lecture slides, the matrix should be sparse.
If it is intended, why?
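For comparison, a sparse reservoir is commonly built by masking a random matrix and rescaling it to a target spectral radius. The sketch below only illustrates that general recipe; the function name, the density of 0.1 and the spectral radius of 0.9 are assumptions and are not taken from LearningAlgorithmsTemporal.py or the lecture slides.

```python
import numpy as np


def sparse_reservoir(n_units, density=0.1, spectral_radius=0.9, seed=0):
    """Hypothetical helper: random reservoir matrix with mostly zero entries."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    # Keep each connection with probability `density`, zero out the rest.
    mask = rng.random((n_units, n_units)) < density
    weights = weights * mask
    # Rescale so that the largest absolute eigenvalue equals spectral_radius.
    radius = np.max(np.abs(np.linalg.eigvals(weights)))
    if radius > 0:
        weights *= spectral_radius / radius
    return weights


W = sparse_reservoir(50)
fraction_nonzero = np.count_nonzero(W) / W.size
```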

Method gower_distance does not exist

In the compute_distance_matrix_instances method on line 161 of Chapter5/Clustering.py the gower_distance method is referenced:

ML4QS/Python3Code/Chapter5/Clustering.py

Line 161 in f567f71

 distances.iloc[i,j] = self.gower_distance(dataset.iloc[i:i+1,:], dataset.iloc[j:j+1,:]) 

This method does not exist. A similar method called gowers_similarity does exist, however, and is referenced by k_means_over_instances. Unfortunately, a simple replacement seems to raise further errors: KeyError: 0.

Dividing INF by INF produces unexpected NaN values

ML4QS/Python3Code/Chapter3/OutlierDetection.py

Line 174 in 6381eb8

lrd_ratios_array[i] = neighbor_lrd / instance_lrd

This line does not handle the case where both lrd values are INF. One of the lines above assumes that an lrd can become INF, but what if both of them are INF? An if statement needs to be added above this line to catch that case and set a finite value (e.g. 1 or 0). Luckily, due to another mistake, this never happens, but once the previous issue is fixed it produces errors.
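The guard described above could look like this. It is a minimal sketch with a hypothetical helper name; the fallback value of 1.0 is one of the two options mentioned in the issue and is an assumption, not the repository's choice.

```python
import math


def lrd_ratio(neighbor_lrd, instance_lrd, both_infinite_value=1.0):
    """Hypothetical helper: ratio of two lrd values without producing NaN."""
    # inf / inf evaluates to NaN, so handle that case explicitly.
    if math.isinf(neighbor_lrd) and math.isinf(instance_lrd):
        return both_infinite_value
    return neighbor_lrd / instance_lrd


inf = float("inf")
unguarded = inf / inf          # NaN, the problem described in the issue
guarded = lrd_ratio(inf, inf)  # falls back to the configured value
```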

Add unittest.py to check if environment is correctly set up

It would be great to have a unit test that checks whether the local environment is set up correctly.
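Such a check could be as small as a test that tries to locate each required package. The sketch below is only an illustration: the package list is an assumption and would have to be replaced by the contents of the repository's requirements file.

```python
import importlib.util
import unittest

# Assumed package list for illustration only; take the real one from the
# requirements file of the repository.
REQUIRED_PACKAGES = ["numpy", "pandas", "scipy"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


class EnvironmentTest(unittest.TestCase):
    def test_required_packages_are_installed(self):
        missing = missing_packages(REQUIRED_PACKAGES)
        self.assertEqual(missing, [], f"Missing packages: {missing}")
```

Saved as a test module, this can be run with python -m unittest from the code directory.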

Wrong path to Python requirements file in Dockerfile

The Docker image cannot be built because the path to the Python requirements file is wrong. Running Python3Code/start_docker.sh produces the following error:

 sh start_docker.sh
[+] Building 0.3s (9/14)
 => [internal] load build definition from Dockerfile                                                                                                         0.1s
 => => transferring dockerfile: 356B                                                                                                                         0.0s
 => [internal] load .dockerignore                                                                                                                            0.0s
 => => transferring context: 2B                                                                                                                              0.0s
 => [internal] load metadata for docker.io/library/ubuntu:latest                                                                                             0.0s
 => [ 1/10] FROM docker.io/library/ubuntu:latest                                                                                                             0.1s
 => [internal] load build context                                                                                                                            0.1s
 => => transferring context: 2B                                                                                                                              0.0s
 => CACHED [ 2/10] RUN apt-get update                                                                                                                        0.0s
 => CACHED [ 3/10] RUN apt-get install sudo                                                                                                                  0.0s
 => CACHED [ 4/10] RUN apt-get install git -y                                                                                                                0.0s
 => ERROR [ 5/10] ADD Python3_requirements.txt /src/requirements.txt                                                                                         0.0s
------
 > [ 5/10] ADD Python3_requirements.txt /src/requirements.txt:
------
failed to compute cache key: "/Python3_requirements.txt" not found: not found

Frequency Domains Incorrectly Added based on collist ordering

https://github.com/mhoogen/ML4QS/blob/master/Python3Code/Chapter4/FrequencyAbstraction.py#L48

Sorry if this is incorrect, but it seems that the calculated frequency features are always inserted at the beginning of the stored object, whereas the real amplitudes of the FFT are added to the end of the dataset. The referenced features are added in the order in which the code computes them, but each one at the beginning, so they end up reversed. If I'm not wrong, this means that the displayed data would be incorrectly labeled. I can open a PR with the fix we implemented if people agree that this is wrong. If I'm wrong, I would love to know that as well.
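The reversal described above follows from how list insertion at index 0 behaves. The snippet below only illustrates that mechanism with placeholder feature names; it does not reproduce FrequencyAbstraction.py.

```python
# Placeholder names, in the order in which the values would be computed.
computed_order = ["feature_a", "feature_b", "feature_c"]

# Inserting every new name at position 0 reverses the order.
prepended = []
for name in computed_order:
    prepended.insert(0, name)

# Appending keeps the names aligned with the order of the computed values.
appended = []
for name in computed_order:
    appended.append(name)

print(prepended)  # ['feature_c', 'feature_b', 'feature_a']
print(appended)   # ['feature_a', 'feature_b', 'feature_c']
```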

Can't use distance_metric except 'default' for k_medoids_over_instances

Calling the k_medoids_over_instances function from the NonHierarchicalClustering class will result in an error if the distance_metric parameter is different from 'default'.

Here calling idxmin(axis=1) will raise an error.

ML4QS/Python3Code/Chapter5/Clustering.py

Line 195 in f567f71

points_to_centroid = D[centers].idxmin(axis=1)

This is because the dataframe contains multidimensional array objects instead of numerical values, which in turn is caused by dist.pairwise(X, Y) in the distance functions returning a multidimensional array rather than the distance value itself.

Changing the distance function to return dist.pairwise(X, Y)[0][0] solves the first issue, but the dataframe still considers its elements to be non-numerical.

D[centers] can be changed to D[centers] = D[centers].apply(pd.to_numeric, errors='coerce', axis=0), which at least allows idxmin(axis=1) to execute correctly. However, the examples still run into another error: pandas.core.indexing.IndexingError: Too many indexers.

As a result, practical exercise 5.9.2.4 is complicated, as the code accompanying the book cannot be used in its current form.
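The underlying idea, storing plain floats in a numeric distance matrix so that idxmin works, can be sketched independently of the repository. In the example below, scipy's cdist is swapped in for dist.pairwise, and the helper name, the toy data and the chosen centers are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def scalar_distance(row_a, row_b, metric="cityblock"):
    """Hypothetical helper: distance between two rows as a plain float."""
    a = np.asarray(row_a, dtype=float).reshape(1, -1)
    b = np.asarray(row_b, dtype=float).reshape(1, -1)
    # cdist returns a (1, 1) array here; extract the single value.
    return float(cdist(a, b, metric=metric)[0, 0])


data = pd.DataFrame({"x": [0.0, 1.0, 10.0], "y": [0.0, 1.0, 10.0]})  # toy data

# Pre-filling with 0.0 gives the distance matrix a float dtype from the start.
D = pd.DataFrame(0.0, index=data.index, columns=data.index)
for i in range(len(data)):
    for j in range(len(data)):
        D.iloc[i, j] = scalar_distance(data.iloc[i], data.iloc[j])

centers = [0, 2]  # assumed medoid indices
points_to_centroid = D[centers].idxmin(axis=1)
```

With numeric entries, idxmin(axis=1) returns for each instance the label of its nearest center.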