hahacyd / emoji_recomendation

Recommend the corresponding emoji to me based on the Weibo text

Python 100.00%

emoji_recomendation's Introduction

Emoji_Recomendation

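Overview

This project contains several text classification algorithms, such as Naive Bayes, SVM, and CNN. This document focuses on training the CNN.
The CNN task consists of the following steps:
- Segment the training and test text into words.
- Train word vectors (word2vec) on the training and test text. Training word2vec on a larger corpus should give better results; one such resource is linked from the original README.
- Implement the CNN in PyTorch and train it. cnn.py contains the network architecture along with the training and validation code.
- Evaluate the trained model on the test set.
The other algorithms are trained with methods provided by the sklearn library. Their data preprocessing differs from the CNN's in one respect: after word segmentation, they need feature extraction rather than word2vec training.
The input data here is already segmented, so by default no further segmentation is needed. To train on other text, modify jieba_lac.py to perform the segmentation.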
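
To illustrate what "feature extraction instead of word2vec" means for the non-CNN classifiers, here is a minimal sketch of bag-of-words counting feeding a multinomial Naive Bayes model. It uses only the Python standard library and is not the repository's code, which relies on sklearn. The function names `train_naive_bayes` and `predict`, the toy tokens, and the labels `smile` and `cry` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on already-segmented documents."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        # Bag-of-words counts per class: this is the feature extraction step.
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        # Laplace smoothing (alpha) keeps unseen class/word pairs above zero.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {
                w: math.log((word_counts[label][w] + alpha) / total)
                for w in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the most probable label; words outside the vocabulary are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"]
        )
    return max(model, key=score)


# Toy, hypothetical data: each document is a list of tokens after segmentation.
train_docs = [
    ["so", "happy", "today"],
    ["happy", "weekend"],
    ["so", "sad"],
    ["sad", "rainy", "today"],
]
train_labels = ["smile", "smile", "cry", "cry"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict(model, ["happy", "rainy", "weekend"])
```

The CNN path replaces the word counts with dense word vectors, which is why its preprocessing trains word2vec rather than extracting count features.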

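How do I get started?

Taking the CNN as an example, just run python cnn.py
It is this simple for two reasons:

- For flexibility, the intermediate files produced between steps are saved in the dump folder, so they are not regenerated on every training run.
- Each step automatically checks whether the intermediate files it depends on exist. If they do not, it first calls the previous step.

In addition, every function has a detailed explanatory comment to make the code easier to follow.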
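
The dump-folder behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cnn.py: the helper `cached_step`, the step functions `segment` and `build_vocab`, the `.pkl` file names, and the toy tokens are all hypothetical.

```python
import os
import pickle
import tempfile


def cached_step(dump_dir, name, compute):
    """Load dump_dir/name.pkl if it exists; otherwise compute, save, and return it."""
    path = os.path.join(dump_dir, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(dump_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records which steps actually had to run


def segment(dump_dir):
    """First step: produce segmented text (toy data stands in for real segmentation)."""
    def compute():
        calls.append("segment")
        return [["so", "happy", "today"], ["so", "sad"]]
    return cached_step(dump_dir, "segmented", compute)


def build_vocab(dump_dir):
    """Second step: depends on the segmented text and triggers it when missing."""
    def compute():
        calls.append("vocab")
        tokens = segment(dump_dir)  # falls back to the previous step if needed
        return sorted({w for sentence in tokens for w in sentence})
    return cached_step(dump_dir, "vocab", compute)


dump_dir = os.path.join(tempfile.mkdtemp(), "dump")
first = build_vocab(dump_dir)   # runs both steps and writes both files
second = build_vocab(dump_dir)  # served from the dump folder, nothing is recomputed
```

Deleting a file from the dump folder forces only that step, and any later step whose own file is also missing, to run again.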

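What are the files other than the Python code?

- corpus.csv is the corpus file, a combination of train.csv and test.csv.
- fine-tune.txt contains some of the experiment records.
- test.csv and train.csv are the preprocessed (word-segmented) test and training data.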
