mhoogen / ml4qs
Code belonging to the book machine learning for the quantified self.
Hierarchical clustering fails when using manhattan or minkowski with specified value p as a distance metric. See the following lines in the code:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/Chapter5/Clustering.py#L307-L310
This is due to:
a) pdist expecting the string 'cityblock' for the Manhattan distance;
b) linkage not accepting additional arguments, so specifying p is not possible.
A possible workaround for b) would be:
from scipy.spatial.distance import pdist
self.link = linkage(temp_dataset.as_matrix(), method=link_function, metric=lambda x, y: pdist([x, y], 'minkowski', p=p)[0])
This is, however, significantly slower.
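An alternative that avoids the per-pair lambda is to compute the condensed distance matrix once with pdist and pass that to linkage, which accepts a condensed matrix in place of raw observations. The following is a minimal sketch under that assumption; the helper name `minkowski_linkage` and the random sample data are hypothetical and not part of the repository. Note that some linkage methods (for example 'ward' and 'centroid') are only well defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def minkowski_linkage(data, link_function="average", p=1):
    """Hierarchical clustering with a Minkowski metric of order p.

    Hypothetical helper: computes all pairwise distances once with pdist
    and hands the condensed matrix to linkage.
    """
    condensed = pdist(np.asarray(data, dtype=float), metric="minkowski", p=p)
    return linkage(condensed, method=link_function)


# Illustrative data only: 20 random points in 3 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
link = minkowski_linkage(points, link_function="average", p=1)
```

With p=1 this corresponds to the Manhattan ('cityblock') distance, so both failing cases from the issue go through the same code path.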
As the title says. If one wants to use the method for anything other than heart rate prediction, the plots are often empty.
In our opinion, the following line causes this:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/util/VisualizeDataset.py#L302
In the following line:
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/crowdsignals_ch5.py#L112
the value k for the maximum number of clusters is not used; instead, the function is always called with the input 5.
(This also explains the constant silhouette value.)
In this line the variable i is redefined by the line above (enumerate), which leads to computing the reachability distance between rows 0, 1, ..., k and neighbor (which is a neighbor of the main i from the function's arguments). It should be computed between i and neighbor. I suggest renaming the function argument from i to root_i.
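The shadowing problem described above can be illustrated in isolation. This is only a sketch of the pattern, not the repository's function: `pairs_buggy` and `pairs_fixed` are hypothetical names, and the tuples stand in for the pairs of rows between which a reachability distance would be computed.

```python
def pairs_buggy(i, neighbors):
    """Loop variable i rebinds the argument i, so the root index is lost."""
    pairs = []
    for i, neighbor in enumerate(neighbors):
        pairs.append((i, neighbor))  # i is now 0, 1, ..., not the root
    return pairs


def pairs_fixed(root_i, neighbors):
    """Renaming the argument keeps the root index intact inside the loop."""
    pairs = []
    for _, neighbor in enumerate(neighbors):
        pairs.append((root_i, neighbor))
    return pairs


buggy = pairs_buggy(7, [3, 5])
fixed = pairs_fixed(7, [3, 5])
```

In the buggy version the pairs are built from the loop counter, while the fixed version always pairs the neighbor with the root index that was passed in.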
In our opinion this matrix W for the echo state network is not going to be sparse.
https://github.com/mhoogen/ML4QS/blob/master/PythonCode/Chapter8/LearningAlgorithmsTemporal.py#L68
Is this intended? Because according to the lecture slides the matrix should be sparse.
If this is intended, why?
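For comparison, one common way to build a sparse reservoir matrix is to draw dense random weights, zero out most of them with a random mask, and rescale to a target spectral radius. The sketch below assumes that approach; the function name `sparse_reservoir` and the `density` and `spectral_radius` values are illustrative assumptions and are not taken from the repository or the lecture slides.

```python
import numpy as np


def sparse_reservoir(n_neurons, density=0.1, spectral_radius=0.9, seed=0):
    """Random reservoir matrix W with roughly `density` non-zero entries."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_neurons, n_neurons))
    # Keep each weight with probability `density`, set the rest to zero.
    mask = rng.random((n_neurons, n_neurons)) < density
    W = W * mask
    # Rescale so the largest absolute eigenvalue equals spectral_radius.
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    if radius > 0:
        W = W * (spectral_radius / radius)
    return W


W = sparse_reservoir(50)
nonzero_fraction = np.count_nonzero(W) / W.size
```

Checking `nonzero_fraction` on the matrix produced by the repository code would be a quick way to confirm whether it is in fact dense.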
In the compute_distance_matrix_instances method on line 161 of Chapter5/Clustering.py, the gower_distance method is referenced:
ML4QS/Python3Code/Chapter5/Clustering.py
Line 161 in f567f71
This method does not exist; however, a similar method called gowers_similarity does exist and is referenced by k_means_over_instances. Unfortunately, a simple replacement seems to raise further errors: KeyError: 0.
The case where both lrd values are INF is not caught here. One of the lines above assumes that lrd can become INF, but what if both of them are INF? An if statement needs to be added above to catch this case and set a non-INF value (e.g. 1 or 0). Luckily, due to another mistake, this never happens, but after fixing the previous one it produces errors :(.
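The guard suggested above could look roughly like this. It is a sketch only: `lrd_ratio` is a hypothetical helper, and the fallback value of 1.0 is an assumption (the issue mentions 1 or 0 as candidates). In Python, dividing inf by inf yields nan, which is what the check avoids.

```python
import math


def lrd_ratio(lrd_neighbor, lrd_point, fallback=1.0):
    """Ratio of two local reachability densities, guarding the inf/inf case."""
    if math.isinf(lrd_neighbor) and math.isinf(lrd_point):
        # inf / inf would be nan; return an agreed finite value instead.
        return fallback
    return lrd_neighbor / lrd_point


both_inf = lrd_ratio(math.inf, math.inf)
regular = lrd_ratio(2.0, 4.0)
```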
It would be great to have a unittest to see if the environment is set up correctly locally
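A minimal sketch of such a check, assuming the standard library's unittest and importlib are acceptable: it only verifies that a list of packages can be found, without importing them. The `REQUIRED` list is an assumption for illustration; the authoritative list is whatever the repository's requirements file specifies.

```python
import importlib.util
import unittest

# Assumed package names for illustration; replace with the entries
# from the repository's requirements file.
REQUIRED = ["numpy", "pandas", "scipy"]


def missing_packages(names):
    """Return the subset of names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


class EnvironmentTest(unittest.TestCase):
    def test_required_packages_are_importable(self):
        self.assertEqual(missing_packages(REQUIRED), [])
```

Saved as a test module, this can be run with `python -m unittest` from the project directory.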
Unable to build the Docker image due to a wrong path to the Python requirements file. Running Python3Code/start_docker.sh produces an error:
sh start_docker.sh
[+] Building 0.3s (9/14)
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 356B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:latest 0.0s
=> [ 1/10] FROM docker.io/library/ubuntu:latest 0.1s
=> [internal] load build context 0.1s
=> => transferring context: 2B 0.0s
=> CACHED [ 2/10] RUN apt-get update 0.0s
=> CACHED [ 3/10] RUN apt-get install sudo 0.0s
=> CACHED [ 4/10] RUN apt-get install git -y 0.0s
=> ERROR [ 5/10] ADD Python3_requirements.txt /src/requirements.txt 0.0s
------
> [ 5/10] ADD Python3_requirements.txt /src/requirements.txt:
------
failed to compute cache key: "/Python3_requirements.txt" not found: not found
https://github.com/mhoogen/ML4QS/blob/master/Python3Code/Chapter4/FrequencyAbstraction.py#L48
Sorry if this is incorrect, but it seems that the calculated frequency features are always inserted at the beginning of the returned (or rather stored) object, such that the real amplitudes of the FFT end up at the end of the dataset. The referenced calculated activities are added in the order in which they appear in the code, but each one at the beginning, so their order is reversed. If I'm not wrong, this means that the data being displayed would be incorrectly labeled. I can open a PR with the fix we implemented if people agree that this is wrong. If I'm wrong, I would love to know that as well.
Calling the k_medoids_over_instances function from the NonHierarchicalClustering class will result in an error if the distance_metric parameter is different from 'default'.
Here calling idxmin(axis=1) will raise an error.
ML4QS/Python3Code/Chapter5/Clustering.py
Line 195 in f567f71
This is due to the dataframe containing multidimensional array objects instead of numerical values, which in turn is caused by dist.pairwise(X, Y) in the distance functions returning a multidimensional array instead of simply returning the distance value.
Changing the distance function to return dist.pairwise(X, Y)[0][0] instead solves the first issue, but the dataframe still considers its elements to be non-numerical.
The D[centers] can be changed to D[centers] = D[centers].apply(pd.to_numeric, errors='coerce', axis=0), which will at least allow the correct execution of idxmin(axis=1); however, the examples will still run into another error: pandas.core.indexing.IndexingError: Too many indexers.
As a result, practical exercise 5.9.2.4 is complicated, as the code accompanying the book cannot be used in its current form.
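The two fixes described in this issue (unwrapping the 1x1 result and coercing the columns to numeric) can be reproduced on a toy example. This sketch does not use the repository's classes: `pairwise_like` is a hypothetical stand-in for a pairwise distance call that returns a 2-D array, and the points and center labels are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_like(x, y):
    """Stand-in for a pairwise call: returns a 1x1 array, not a scalar."""
    return np.array([[np.abs(np.asarray(x) - np.asarray(y)).sum()]])


# Made-up data: four points, two of which act as cluster centers.
points = {"a": (0, 0), "b": (1, 0), "c": (10, 10), "d": (9, 10)}
centers = ["a", "c"]

# Unwrap the 1x1 array with [0][0] so each cell holds a plain number.
columns = {
    center: [pairwise_like(points[name], points[center])[0][0] for name in points]
    for center in centers
}
# astype(object) mimics the non-numerical dataframe described in the issue.
D = pd.DataFrame(columns, index=list(points)).astype(object)

# Coerce the columns to a numeric dtype before calling idxmin.
D[centers] = D[centers].apply(pd.to_numeric, errors="coerce", axis=0)
closest = D[centers].idxmin(axis=1)
```

Here `closest` maps each point to the label of its nearest center; the remaining IndexingError mentioned in the issue is not addressed by this sketch.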