
mlfromscratch's Introduction

ML algorithms from Scratch!

Machine Learning algorithm implementations from scratch.

You can find tutorials with the math and code explanations on my channel: Here

Algorithms Implemented

  • KNN
  • Linear Regression
  • Logistic Regression
  • Naive Bayes
  • Perceptron
  • SVM
  • Decision Tree
  • Random Forest
  • Principal Component Analysis (PCA)
  • K-Means
  • AdaBoost
  • Linear Discriminant Analysis (LDA)

Installation and usage

This project has 4 dependencies.

  • numpy for the math and the algorithm implementations
  • scikit-learn for data generation and testing
  • matplotlib for plotting
  • pandas for loading data

NOTE: Only numpy is used for the implementations themselves. The other packages handle data generation, testing, and plotting, so we don't have to write those from scratch too.

You can install these using the commands below.

# Linux or MacOS
pip3 install -r requirements.txt

# Windows
pip install -r requirements.txt

You can run the files as follows:

python -m mlfromscratch.<algorithm-file>

where <algorithm-file> is the filename of the algorithm module without the extension.

For example, to run the linear regression example: python -m mlfromscratch.linear_regression

Watch the Playlist


mlfromscratch's People

Contributors: dependabot[bot], janasunrise, patrickloeber


mlfromscratch's Issues

Explanation of `get_hyperplane_value`

Hi, I find your tutorials on SVM very helpful, but I do not understand the get_hyperplane_value method in svm_tests.py. Could you explain it? Thank you!
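For anyone else wondering: the helper just solves the hyperplane's line equation for the second coordinate so the boundary can be drawn in 2D. A minimal sketch, assuming the decision function is w·x - b and that offset selects the boundary (0) or one of the margins (±1); the repo's exact code may differ slightly:

def get_hyperplane_value(x0, w, b, offset):
    # Solve w[0]*x0 + w[1]*x1 - b = offset for x1, so the boundary
    # (offset = 0) and the margins (offset = +/-1) can be plotted
    # by evaluating the resulting line at two x0 values.
    return (-w[0] * x0 + b + offset) / w[1]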

Project dependencies may have API risk issues

Hi, in MLfromscratch, overly strict dependency version constraints can cause risks.

Below are the dependencies and version constraints that the project is using:

numpy==1.22.0
scikit-learn==0.24.2
matplotlib==3.4.2
pandas==1.2.4

The == constraint introduces a risk of dependency conflicts, because it pins each dependency to a single exact version.
Constraints with no upper bound (or *) introduce a risk of missing-API errors, because the latest version of a dependency may remove APIs the project calls.

After further analysis, the constraints in this project could be relaxed as follows:
The version constraint of dependency numpy can be changed to >=1.8.0,<=1.23.0rc3.
The version constraint of dependency matplotlib can be changed to >=1.3.0,<=3.0.3.
The version constraint of dependency pandas can be changed to >=0.4.0,<=1.2.5.

These suggested ranges reduce the chance of dependency conflicts as much as possible,
while still allowing versions as recent as possible without triggering API errors in the project.
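For illustration, a requirements.txt applying the suggested ranges (scikit-learn stays pinned, since no range was suggested for it) would look like:

numpy>=1.8.0,<=1.23.0rc3
scikit-learn==0.24.2
matplotlib>=1.3.0,<=3.0.3
pandas>=0.4.0,<=1.2.5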

The current project invokes all of the following methods.

Methods called from numpy:
numpy.linalg.inv
numpy.linalg.eig

Methods called from matplotlib:
matplotlib.colors.ListedColormap

Methods called from pandas:
pandas.read_csv

All called methods across the project:
numpy.argwhere
self._grow_tree
self._best_criteria
self.plot
numpy.unique
numpy.amin
LDA.transform
RandomForest.predict
numpy.mean
range
numpy.exp
numpy.argsort
numpy.dot
sklearn.datasets.make_blobs
df.fillna.fillna
self._create_clusters
self._traverse_tree
numpy.log
self._approximation
numpy.sign
matplotlib.pyplot.figure
self._is_converged
numpy.linalg.eig
numpy.where
NaiveBayes
matplotlib.pyplot.show
numpy.sum
DecisionTree
mean_overall.mean_c.reshape.dot
SVM
matplotlib.colors.ListedColormap
SW.np.linalg.inv.dot
numpy.empty
csv.reader
centroid_idx.clusters.append
most_common_label
numpy.argmax
sklearn.datasets.make_classification
ax.scatter
matplotlib.pyplot.cm.get_cmap
matplotlib.pyplot.figure.add_subplot
KNN.predict
numpy.genfromtxt
bootstrap_sample
Node
LinearRegression
self._predict
fig.add_subplot.plot
Adaboost.fit
LinearRegression.predict
Perceptron.predict
enumerate
list
SVM.fit
Adaboost.predict
KMeans.predict
node.is_leaf_node
numpy.sqrt
self.trees.append
sum
matplotlib.pyplot.plot
numpy.swapaxes
self._pdf
DecisionTree.predict
numpy.random.seed
self._information_gain
matplotlib.pyplot.xlabel
KNN.fit
numpy.amax
DecisionStump
Perceptron
len
posteriors.append
numpy.log2
numpy.argmin
numpy.linalg.inv
self.clfs.append
self._get_cluster_labels
Perceptron.fit
numpy.cov
abs
accuracy
LogisticRegression.predict
numpy.array
mean_c.X_c.T.dot
visualize_svm
numpy.bincount
decision_tree.DecisionTree.fit
float
entropy
RandomForest.fit
sklearn.datasets.make_regression
mean_overall.mean_c.reshape
sklearn.datasets.load_iris
LinearRegression.fit
mean_squared_error
NaiveBayes.fit
KMeans.plot
PCA.transform
k_neighbor_labels.Counter.most_common
numpy.loadtxt
cmap
self._sigmoid
RandomForest
decision_tree.DecisionTree
numpy.zeros
sklearn.model_selection.train_test_split
self._split
pandas.read_csv
X_c.mean
X_c.var
self._get_centroids
df.fillna.to_numpy
LDA
fig.add_subplot.set_ylim
split_thresh.X_column.np.argwhere.flatten
collections.Counter.most_common
numpy.full
euclidean_distance
decision_tree.DecisionTree.predict
min
matplotlib.pyplot.scatter
self._most_common_label
print
get_hyperplane_value
matplotlib.pyplot.ylabel
PCA
Adaboost
numpy.corrcoef
self.activation_func
matplotlib.pyplot.subplots
numpy.ones
r2_score
matplotlib.pyplot.get_cmap
LogisticRegression.fit
KNN
open
sklearn.datasets.load_breast_cancer
NaiveBayes.predict
numpy.random.choice
DecisionTree.fit
self._closest_centroid
matplotlib.pyplot.colorbar
collections.Counter
KMeans
LDA.fit
PCA.fit
LogisticRegression

@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.

Repository has no License

Thanks for the wonderful work on this repository! Unfortunately, your code does not have a license. Could you please add one, so it's clear whether we are allowed to reuse it, e.g. for teaching students?

Thank you!

AdaBoost suggestions

Thanks much for putting this material together!

Looking at Lucky 13: AdaBoost. A few items are a bit unclear for us newbies.

First, in the fit() method there is just a single pass over the data X, while the original Freund & Schapire (1995) paper suggests looping for T iterations, refitting the classifiers on each pass based on the evolving weights. It looks like the version here is based on Zhu et al. (2009). It might be worth a few words explaining where the algorithm comes from, and why this version needs only one pass over the samples. (A sketch of the classic loop appears after the stump listing below.)

Second, just from a learning perspective, it would be great to provide a data set that mimics the illustrations in the video, so we can verify that things work as expected. For extra credit, use Matplotlib to recreate the decision boundary visualization from the video.

Third, it might be worthwhile pointing out refinements a real design would need. For example, here are the decision stumps created from the test code. Notice that feature 23 is used twice: same polarity, just a different threshold. Is this a limitation of this simple example, or actually a useful quirk of AdaBoost?

0: {'polarity': -1, 'feature_idx': 27, 'threshold': 0.1424, 'alpha': 1.2271759901553476}
1: {'polarity': -1, 'feature_idx': 23, 'threshold': 728.3, 'alpha': 0.9273811402788633}
2: {'polarity': -1, 'feature_idx': 1, 'threshold': 19.98, 'alpha': 0.7916733128875748}
3: {'polarity': -1, 'feature_idx': 23, 'threshold': 876.5, 'alpha': 0.6099992009200025}
4: {'polarity': -1, 'feature_idx': 26, 'threshold': 0.2177, 'alpha': 0.5775069918855832}
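For reference, here is a minimal sketch of the classic multi-round training loop discussed in the first point, assuming y takes values in {-1, +1} and that fit_weighted_stump is a hypothetical helper that fits a decision stump to weighted samples; this is a generic illustration, not the repo's exact code:

import numpy as np

def adaboost_fit(X, y, n_clf, fit_weighted_stump):
    # start with uniform sample weights
    n_samples = X.shape[0]
    w = np.full(n_samples, 1.0 / n_samples)
    clfs = []
    for _ in range(n_clf):                      # T boosting rounds
        clf = fit_weighted_stump(X, y, w)       # refit on the evolving weights
        preds = clf.predict(X)
        error = np.sum(w[y != preds])           # weighted misclassification rate
        clf.alpha = 0.5 * np.log((1.0 - error) / (error + 1e-10))
        w *= np.exp(-clf.alpha * y * preds)     # up-weight the mistakes
        w /= np.sum(w)                          # renormalize to a distribution
        clfs.append(clf)
    return clfs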

IndexError: index 6 is out of bounds for axis 0 with size 6

Hi!

I tried your naive Bayes classifier on
https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data

n_classes = len(self._classes) returns class labels like [1, 2, 3, 4, 5, 6], unlike iris's [0, 1, 2].

In the loop

for c in self._classes:
    X_c = X[y == c]
    self._mean[c, :] = X_c.mean(axis=0)
    self._var[c, :] = X_c.var(axis=0)
    self._priors[c] = X_c.shape[0] / float(n_samples)

the code will try to access self._mean[6, :], which is out of bounds.
Shouldn't it be

for index, c in enumerate(self._classes):

with index instead of c in the array indexing?
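In other words, the suggested fix would look like this (a sketch; the prediction path would then also need to index mean/var/priors by position rather than by raw label):

for idx, c in enumerate(self._classes):
    X_c = X[y == c]
    self._mean[idx, :] = X_c.mean(axis=0)
    self._var[idx, :] = X_c.var(axis=0)
    self._priors[idx] = X_c.shape[0] / float(n_samples)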

Multiclass SVM classifier

Hello,
Could you please provide an example implementation of a multiclass SVM classifier from scratch?
Thanks!
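Not the author, but a common way to get multiclass behavior out of a binary SVM like the one in this repo is one-vs-rest: train one binary SVM per class and predict the class with the largest raw score. A minimal sketch, assuming a binary SVM class with fit(X, y) taking labels in {-1, +1} and learned attributes w and b (the attribute names are assumptions):

import numpy as np

class OneVsRestSVM:
    def __init__(self, svm_factory):
        self.svm_factory = svm_factory          # callable returning a fresh binary SVM

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            y_bin = np.where(y == c, 1, -1)     # current class vs. the rest
            clf = self.svm_factory()
            clf.fit(X, y_bin)
            self.clfs_.append(clf)
        return self

    def predict(self, X):
        # pick the class whose binary SVM produces the largest raw score
        scores = np.array([np.dot(X, clf.w) - clf.b for clf in self.clfs_])
        return self.classes_[np.argmax(scores, axis=0)]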

Clustering Predict a single tuple

Hey guys, I understand how the clustering works, but how do I save this model and predict on a new tuple? I want to put this model into production, hence the hassle.
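Not the author, but once training is done the fitted state of k-means is just the centroids, so you can persist those and assign any new point to the nearest one. A minimal sketch, assuming the fitted object exposes a centroids attribute (the attribute name is an assumption):

import pickle
import numpy as np

# after training: persist just the fitted centroids
with open("kmeans_centroids.pkl", "wb") as f:
    pickle.dump(kmeans.centroids, f)

# later, in production: load them and assign a single new point
with open("kmeans_centroids.pkl", "rb") as f:
    centroids = np.array(pickle.load(f))

x_new = np.array([1.5, -0.3])
distances = np.sqrt(((centroids - x_new) ** 2).sum(axis=1))  # distance to each centroid
cluster = int(np.argmin(distances))                          # index of the nearest centroid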

Euclidean distance should be sqrt((x1-x2)**2 + (y1-y2)**2)

The Euclidean distance should be sqrt((x1-x2)**2 + (y1-y2)**2), or, written more concisely with numpy:

euclid_dist = np.linalg.norm(np.array(feature) - np.array(predict))

import warnings
from collections import Counter
import numpy as np

def knn(data, predict, k=3):
    # data: dict mapping group label -> list of feature vectors
    if len(data) >= k:
        warnings.warn("k is set to a value less than the total number of voting groups")
    distances = []
    for group in data:
        for feature in data[group]:
            euclid = np.linalg.norm(np.array(feature) - np.array(predict))
            distances.append([euclid, group])
    # labels of the k nearest points, then majority vote
    votes = [d[1] for d in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
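A quick usage sketch with a toy dataset in the dict-of-lists format the function above expects:

data = {"r": [[1, 2], [2, 3], [3, 1]], "b": [[6, 5], [7, 7], [8, 6]]}
print(knn(data, predict=[5, 7], k=3))   # the 3 nearest points are all "b"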

Regression tree

Hey,
Could you please provide an implementation of a regression tree from scratch?
Thanks!
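Not the author, but for anyone attempting this: the main changes relative to the classification tree in this repo are the split criterion (variance reduction instead of information gain) and leaves that store the mean target value instead of the most common label. A minimal sketch of such a criterion, as an assumption about how the existing code could be adapted:

import numpy as np

def variance_reduction(y, left_idxs, right_idxs):
    # regression analogue of information gain:
    # how much a candidate split reduces the variance of the targets
    if len(left_idxs) == 0 or len(right_idxs) == 0:
        return 0.0
    n = len(y)
    var_parent = np.var(y)
    var_children = (len(left_idxs) / n) * np.var(y[left_idxs]) \
                 + (len(right_idxs) / n) * np.var(y[right_idxs])
    return var_parent - var_children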
