Hand-written implementations of the basic machine learning workflow and algorithms, using only numpy and the Python standard library.
TODO list:
- test cases
- an efficient bp network
- more optimal methods
- train test split func in helper
- other feature select method to add
- lasso and Ridge
- add GBDT feature select
- update Readme
- setup.py
- examples
- get more datasets
When the number of features is smaller than the number of samples:
import numpy as np
from simple_ml.pca import *
pca = PCA(1)
a = np.array([[1,3,2], [3,5,1], [4,7,3], [1,2,0], [0,2,1]])
print(pca.fit_transform(a))
print(pca.explain_ratio)
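When the number of features is far larger than the number of samples (the example below transposes the data to 3 samples with 5 features), a low-dimensional PCA is computed via matrix decomposition:
import numpy as np
from simple_ml.pca import *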
pca = SuperPCA(1)
a = np.array([[1,3,2], [3,5,1], [4,7,3], [1,2,0], [0,2,1]])
print(pca.fit_transform(a.T))
print(pca.explain_ratio)
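The README does not show how SuperPCA works internally. The following is a minimal sketch of one common way to do PCA when features far outnumber samples: decompose the small sample-by-sample Gram matrix instead of the large feature covariance matrix. The function name `pca_gram` is hypothetical and is not part of simple_ml.

```python
import numpy as np

def pca_gram(X, n_components):
    """PCA sketch for an (n_samples, n_features) matrix with many features.

    Assumes the selected eigenvalues are strictly positive.
    """
    Xc = X - X.mean(axis=0)                    # center every feature
    gram = Xc @ Xc.T                           # (n_samples, n_samples), cheap to decompose
    eigvals, eigvecs = np.linalg.eigh(gram)    # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Map eigenvectors of Xc Xc^T back to feature space and scale them to unit length.
    components = Xc.T @ eigvecs / np.sqrt(eigvals)
    explained_ratio = eigvals / np.trace(gram)
    return Xc @ components, explained_ratio

a = np.array([[1, 3, 2], [3, 5, 1], [4, 7, 3], [1, 2, 0], [0, 2, 1]], dtype=float)
projected, ratio = pca_gram(a.T, 1)            # 3 samples, 5 features
print(projected.shape, ratio)
```

The Gram matrix here is 3 x 3 rather than 5 x 5; the saving becomes significant when there are thousands of features and only a few samples.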
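Four filter-based feature selection methods are currently provided:
- Variance
- Correlation coefficient
- Chi-squared test
- Mutual information
Example:
import numpy as np
from simple_ml.filter_select import *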
X = np.random.random(20).reshape(-1, 4)
Y = np.random.randint(0,2,5)
mf = MyFilter(filter_type=FilterType.chi2, top_k=3)
mf.fitTransform(X,Y)
mf.transform(X)
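To illustrate what a filter method does, here is a minimal sketch of the correlation-coefficient variant: score each feature by its absolute Pearson correlation with the label and keep the top k. The function `top_k_by_correlation` is a hypothetical name for illustration and may differ from how MyFilter is implemented.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Return the indices of the k features most correlated with y, plus all scores."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.zeros(X.shape[1])
    # Constant features have zero variance; their score is left at 0.
    np.divide(Xc.T @ yc, denom, out=corr, where=denom > 0)
    keep = np.sort(np.argsort(-np.abs(corr))[:k])
    return keep, corr

X_demo = np.array([[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]], dtype=float)
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
keep, corr = top_k_by_correlation(X_demo, y_demo, k=2)
print(keep, corr)
print(X_demo[:, keep])   # the reduced feature matrix
```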
- accuracy
- precision
- recall
- f1
- auc
- ROC curve plotting
- f1micro
- f1macro
- f1weight
- explainedvariance
- absoluteerror
- squarederror
- RMSE(root mean squared error)
- RMSLE(root mean squared log error, in case of the abnormal value)
- r2
- medianabsoluteerror
Example:
import numpy as np
from simple_ml.score import *
print(classify_accuracy(np.array([1,0,1]), np.array([1, 1, 1])))
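The list above mentions RMSLE as the choice when abnormal values are present. A minimal sketch of why, using the hypothetical helper names `rmse` and `rmsle` (the names and signatures in simple_ml.score may differ): taking logarithms first compresses large values, so one badly predicted outlier dominates the score far less.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared log error; assumes all values are greater than -1."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

y_true = np.array([1.0, 10.0, 1000.0])
y_pred = np.array([1.0, 10.0, 500.0])   # only the outlier is predicted badly
print(rmse(y_true, y_pred))             # dominated by the single large error
print(rmsle(y_true, y_pred))            # much smaller, because of the log scale
```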
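Notes:
- This plotting method trains the model internally and then plots. If there are more than 2 features, the data is first reduced to 2 dimensions and the model is trained on the reduced data, rather than training first and plotting afterwards. The reason is that a prediction is needed for every 2-D point on the plot, so the model must support a 2-D training set (a random forest with m > 2, for example, does not).
- If you want to train first and plot afterwards, and there are more than 2 features, the decision regions cannot be drawn.
Example: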
from simple_ml import classify_plot
classify_plot.classify_plot(model, X_train, y_train, X_test, Y_test, title='My Support Vector Machine')
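The requirement that every 2-D point on the plot be predicted can be sketched as follows. This is an illustration under assumptions, not the code of classify_plot: `decision_grid` is a hypothetical helper, and `predict` stands for any function that maps an (n, 2) array to n labels.

```python
import numpy as np

def decision_grid(predict, X2d, steps=50, margin=0.5):
    """Evaluate `predict` on a regular grid that covers the 2-D data."""
    x_min, y_min = X2d.min(axis=0) - margin
    x_max, y_max = X2d.max(axis=0) + margin
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    grid = np.column_stack([xx.ravel(), yy.ravel()])   # one row per grid point
    zz = predict(grid).reshape(xx.shape)               # predicted label per point
    return xx, yy, zz

# Toy stand-in for a trained 2-D model: class 1 to the right of x = 0.
toy_predict = lambda pts: (pts[:, 0] > 0).astype(int)
X2d = np.array([[-1.0, -1.0], [1.0, 1.0]])
xx, yy, zz = decision_grid(toy_predict, X2d)
print(zz.shape)
```

The three arrays can be passed to matplotlib's `contourf` to color the regions. Because the grid points have only two coordinates, a model trained on more than 2 features cannot score them.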
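Two cross-validation methods are currently provided:
- Holdout (holdout)
- K-fold (kfolder)
The accepted parameters are:
- Model instance
- Feature data
- Label data
- Cross-validation type
- Proportion of training samples: holdout only
- Number of cross-validation rounds
Example: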
from simple_ml.cross_validation import *
cross_validation(model, X, y, CrossValidationType.holdout, 0.3, 5)
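The two splitting strategies can be sketched with plain index arrays. This is a minimal illustration; `holdout_indices` and `kfold_indices` are hypothetical names and do not reflect how simple_ml.cross_validation is written.

```python
import numpy as np

def holdout_indices(n_samples, train_ratio, rng):
    """One random split into train and test indices."""
    perm = rng.permutation(n_samples)
    n_train = int(round(n_samples * train_ratio))
    return perm[:n_train], perm[n_train:]

def kfold_indices(n_samples, k, rng):
    """Yield k (train, test) index pairs; every sample is tested exactly once."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

rng = np.random.default_rng(0)
train_idx, test_idx = holdout_indices(10, 0.7, rng)
print(len(train_idx), len(test_idx))
for fold_train, fold_test in kfold_indices(10, 3, rng):
    print(len(fold_train), len(fold_test))
```

Holdout is typically repeated several times with fresh permutations and the scores averaged, whereas k-fold produces its k scores from a single permutation.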
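In base.py I provide BaseClassifier, the abstract class that all classification algorithms inherit from.
Its main responsibilities are:
- Validating the X and Y inputs
- Detecting the label type of Y: continuous, binary, or multi-class
- Declaring class attributes such as the number of samples, the number of variables, the training set, and the test set
The methods that must be overridden are:
- fit(X,Y): fit the model to the given data set X and Y
- predict(X): predict on the given test set
- score(X,Y): score the prediction quality on the given X and Y
Example: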
from simple_ml.knn import *
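from dataset.classify_data import get_iris
# train_test_split is not imported in this example; scikit-learn's function is assumed here
from sklearn.model_selection import train_test_split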
knn_test = myKNN(K=3,distance_type=DisType.CosSim)
X, y = get_iris()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
knn_test.fit(X_train, y_train)
print(knn_test.predict(X_test))
print(knn_test.score(X_test, y_test))
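The contract described for BaseClassifier can be sketched with Python's `abc` module. This is an illustration only: `SketchBaseClassifier`, `MajorityClassifier`, and the attribute names are hypothetical and do not mirror the actual contents of base.py.

```python
from abc import ABC, abstractmethod
import numpy as np

class SketchBaseClassifier(ABC):
    """Validates inputs and forces subclasses to implement fit, predict, and score."""

    def _check_inputs(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        if X.ndim != 2 or y.ndim != 1 or X.shape[0] != y.shape[0]:
            raise ValueError("X must be 2-D and y 1-D with matching sample counts")
        self.sample_num, self.variable_num = X.shape
        return X, y

    @staticmethod
    def _label_type(y):
        uniq = np.unique(y)
        if np.issubdtype(y.dtype, np.floating) and not np.all(uniq == np.round(uniq)):
            return "continuous"
        return "binary" if uniq.size <= 2 else "multi_class"

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def score(self, X, y): ...

class MajorityClassifier(SketchBaseClassifier):
    """Toy subclass: always predicts the most frequent training label."""

    def fit(self, X, y):
        X, y = self._check_inputs(X, y)
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(np.asarray(X).shape[0], self.majority_)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))

clf = MajorityClassifier().fit([[0, 1], [1, 1], [2, 0]], [1, 0, 1])
print(clf.predict([[5, 5]]), clf.score([[0, 1], [1, 1], [2, 0]], [1, 0, 1]))
```

Instantiating the base class directly raises a TypeError, which is how the "must override" rule is enforced in this sketch.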
Coming Soon
Example:
import numpy as np
from simple_ml.logistic import *
X = np.array([[2,1], [4,2], [3,3], [4,1], [3,2], [2,3], [1,3]])
y = np.array([1,2,0,1,0,1,2])
lr = MyLogisticRegression(step=0.01,tol=1e-10)
lr.fit(X, y)
print(lr.predict(X))
print(lr.score(X, y))
lr.auc_plot(X, y)
Example:
import numpy as np
from simple_ml.naive_bayes import *
X = np.array([[0, 0, 0, 1],
[0, 1, 0, 0],
[1, 1, 0, 1],
[0, 1, 1, 1],
[0, 0, 0, 0]])
y = np.array([0,1,0,1,0])
nb = MyNaiveBayes()
nb.fit(X, y)
X_test = np.array([0, 0, 0, 0]).reshape(1, -1)
print(nb.predict(X_test))
Coming Soon
Note: only discrete labels are supported.
import numpy as np
from simple_ml.bayes import MyBayesMinimumError
X = np.array([[2,1],
[0,3],
[3,0],
[1,2],
[2,0],
[0,1.5]])
y = np.array([1,0,1,0,1,0])
bme = MyBayesMinimumError()
bme.fit(X, y)
print(bme.predict(X))
Note: only discrete labels are supported.
import numpy as np
from simple_ml.bayes import MyBayesMinimumRisk
X = np.array([[2,1],
[0,3],
[3,0],
[1,2],
[2,0],
[0,1.5]])
y = np.array([1,0,1,0,1,0])
bme = MyBayesMinimumRisk(np.array([[0,10], [1,0]]))
bme.fit(X, y)
print(bme.predict(X))
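The difference between the minimum-error and minimum-risk rules can be sketched in a few lines. The sketch assumes that `loss[i, j]` is the cost of deciding class j when the true class is i; the convention used by MyBayesMinimumRisk is not stated here and may be the transpose. The function name `min_risk_decision` is hypothetical.

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """Pick, per sample, the decision with the lowest expected loss.

    posteriors: (n_samples, n_classes) array of P(class | x)
    loss:       (n_classes, n_decisions), loss[i, j] = cost of deciding j when truth is i
    """
    risk = posteriors @ loss            # expected loss of each decision
    return np.argmin(risk, axis=1), risk

loss = np.array([[0, 10], [1, 0]])      # wrongly deciding 1 is ten times as costly
posteriors = np.array([[0.8, 0.2], [0.2, 0.8], [0.05, 0.95]])
decisions, risk = min_risk_decision(posteriors, loss)
print(decisions)                        # minimum-risk decisions
print(posteriors.argmax(axis=1))        # minimum-error decisions for comparison
```

For the second sample the two rules disagree: class 1 is more probable, yet deciding 0 carries the lower expected loss under this cost matrix.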
Example:
import numpy as np
from simple_ml.tree import *
np.random.seed(1234)
rt = RegressionTree(min_leaf_samples=3)
X = np.random.rand(20, 10)
Y = np.random.rand(20)
x_new = np.random.rand(10)  # a single unseen sample with 10 features
rt.fit(X, Y)
print(rt.predict(x_new))
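Example:
from simple_ml.tree import *
from simple_ml.classify_data import get_iris  # import path as used in the SVM example
from sklearn.model_selection import train_test_split  # assumed: scikit-learn's function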
X, y = get_iris()
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
mrf = MyRandomForest(2)
mrf.fit(X_train, y_train)
print(mrf.predict(X_test))
print(y_test)
mrf.classifyPlot(X_test, y_test)
- Only binary classification is supported for now
- The following kernel functions are provided:
class KernelType(Enum):
    linear = 0 # linear kernel
    polynomial = 1 # polynomial kernel
    gassian = 2 # Gaussian kernel
    laplace = 3 # Laplace kernel
    sigmoid = 4 # sigmoid kernel
Example:
import numpy as np
from simple_ml.svm import *
from simple_ml.classify_data import get_iris
X, y = get_iris()
X = X[(y==1) | (y==2)]
y = y[(y==1) | (y==2)]
y = np.array([i if i ==1 else -1 for i in y])
mysvm = MySVM(0.6, 0.001, 0.00001, 50, KernelType.linear)
mysvm.fit(X, y)
print(mysvm.alphas, mysvm.b)
print(mysvm.predict(X))
mysvm.classifyPlot(X, y)
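For reference, the five kernel types listed above are commonly defined as follows. This is a sketch using textbook formulas; the function names, parameter names, and default values are assumptions and may not match what MySVM uses internally.

```python
import numpy as np

def linear_kernel(x, z):
    return float(x @ z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return float((x @ z + coef0) ** degree)

def gaussian_kernel(x, z, sigma=1.0):
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

def laplace_kernel(x, z, sigma=1.0):
    return float(np.exp(-np.linalg.norm(x - z) / sigma))

def sigmoid_kernel(x, z, beta=1.0, theta=0.0):
    return float(np.tanh(beta * (x @ z) + theta))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel,
               laplace_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

The Gaussian and Laplace kernels depend only on the distance between the two points and equal 1 when the points coincide; the other three depend on the inner product.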
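Only the single-sample case is implemented so far.
Example:
import numpy as np
from simple_ml.cluster import *
# Option 1: a small hand-written data set (5 samples, 2 features)
X = np.array([1, 2, 3, 5, 6, 10, 11, 12, 20, 35]).reshape(-1, 2)
# Option 2: 50 random 2-D points; this assignment overrides the one above
X = np.random.rand(50, 2)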
km = MyKMeans(3, DisType.Minkowski, d=2)
km.fit(X)
print(km.labels)
# plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:,0], y=X[:, 1], c=km.labels)
plt.show()
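As a rough picture of what a K-means fit with a Minkowski distance involves, here is a minimal sketch of Lloyd's algorithm. It assumes that the `d=2` argument in the example above is the order of the Minkowski distance (order 2 is the Euclidean distance); `minkowski` and `kmeans` are hypothetical names and not the MyKMeans implementation.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance of order p along the last axis."""
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)

def kmeans(X, k, p=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n_samples, k).
        dist = minkowski(X[:, None, :], centers[None, :, :], p)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its members; keep empty clusters in place.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(pts, k=2)
print(labels)
```

Using the mean as the center update is strictly matched to the Euclidean case; for other orders it is a common simplification.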
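Example:
import numpy as np
from simple_ml.cluster import *
# Option 1: a small hand-written data set (5 samples, 2 features)
X = np.array([1, 2, 3, 5, 6, 10, 11, 12, 20, 35]).reshape(-1, 2)
# Option 2: 50 random 2-D points; this assignment overrides the one above
X = np.random.rand(50, 2)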
km = MyHierarchical(DisType.Minkowski, d=2)
km.fit(X)
print(km.max_dis)
print(km.cluster(km.max_dis/4))
# plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:,0], y=X[:, 1], c=km.labels)
plt.show()
from simple_ml.ensemble import MyAdaBoost
import numpy as np
X = np.array([[2,1], [4,2], [3,3], [4,1], [3,2], [2,3], [1,3]])
y = np.array([1,2,0,1,0,1,2])
lr = MyAdaBoost(nums=10)
lr.fit(X, y)
lr.predict(X)
- Only 0-1 features are supported
- Only continuous labels are supported
- Only squared loss is supported
import numpy as np
from simple_ml.ensemble import *
X = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]).reshape(2, -1).T
y = np.array([3., 3.2, 2., 2.1, 1.5, 2.3, 1.4, 2.1])
gbdt = MyGBDT()
gbdt.fit(X, y)
print(gbdt.predict(np.array([[1, 1], [0, 0], [1, 0], [0, 1]])))
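The three restrictions above (0-1 features, continuous labels, squared loss) describe the simplest gradient boosting setting, which can be sketched as follows. With squared loss the negative gradient is simply the residual, so each round fits a one-split tree to what the current model still gets wrong. All function names and the `learning_rate` parameter are hypothetical; MyGBDT may be structured differently.

```python
import numpy as np

def fit_stump(X, residual):
    """Choose the 0/1 feature whose split gives the lowest squared error."""
    best = None
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue                              # constant feature, no split possible
        left, right = residual[~mask].mean(), residual[mask].mean()
        err = (np.sum((residual[~mask] - left) ** 2)
               + np.sum((residual[mask] - right) ** 2))
        if best is None or err < best[0]:
            best = (err, j, left, right)
    return None if best is None else best[1:]

def stump_predict(stump, X):
    j, left, right = stump
    return np.where(X[:, j] == 1, right, left)

def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    base = y.mean()                               # initial constant prediction
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                       # negative gradient of squared loss
        stump = fit_stump(X, residual)
        if stump is None:
            break
        stumps.append(stump)
        pred = pred + learning_rate * stump_predict(stump, X)
    return base, learning_rate, stumps

def predict_gbdt(model, X):
    base, learning_rate, stumps = model
    pred = np.full(X.shape[0], base)
    for stump in stumps:
        pred = pred + learning_rate * stump_predict(stump, X)
    return pred

X_bin = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]).reshape(2, -1).T
y_cont = np.array([3., 3.2, 2., 2.1, 1.5, 2.3, 1.4, 2.1])
model = fit_gbdt(X_bin, y_cont)
print(predict_gbdt(model, np.array([[1, 1], [0, 0], [1, 0], [0, 1]])))
```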
Losers Always Whine About Their Best
Dedicated to everyone who strives tirelessly for their dreams.