
word2vec_java's Introduction

Word2VEC_java

A Java implementation of word2vec.

Some people complained that there was no test code. I use this library in my own work, so I wrote an example and am posting it here; take it as a rough guide to how the API is meant to be used.

Some people complained that there was no corpus; use this one: https://pan.baidu.com/s/1jIy3YSY

package com.kuyun.document_class;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import com.alibaba.fastjson.JSONObject;
import com.ansj.vec.Learn;
import com.ansj.vec.Word2VEC;

import love.cq.util.IOUtil;
import love.cq.util.StringUtil;

public class Word2VecTest {
    private static final File sportCorpusFile = new File("corpus/result.txt");

    public static void main(String[] args) throws IOException {
        File[] files = new File("corpus/sport/").listFiles();
        
        // Build the corpus: segment every sport article and write it to one file
        try (FileOutputStream fos = new FileOutputStream(sportCorpusFile)) {
            for (File file : files) {
                if (file.canRead() && file.getName().endsWith(".txt")) {
                    parserFile(fos, file);
                }
            }
        }
        
        // Train word vectors on the segmented corpus and save the model
        Learn learn = new Learn();
        learn.learnFile(sportCorpusFile);
        learn.saveModel(new File("model/vector.mod"));

        // Load the saved model and query the nearest neighbours of a word
        Word2VEC w2v = new Word2VEC();
        w2v.loadJavaModel("model/vector.mod");
        System.out.println(w2v.distance("姚明"));

    }

    private static void parserFile(FileOutputStream fos, File file) throws IOException {
        // Each line of the input file is a JSON document; segment its title and content
        // and append them to the corpus.
        try (BufferedReader br = IOUtil.getReader(file.getAbsolutePath(), IOUtil.UTF8)) {
            String temp = null;
            JSONObject parse = null;
            while ((temp = br.readLine()) != null) {
                parse = JSONObject.parseObject(temp);
                parseStr(fos, parse.getString("title"));
                parseStr(fos, StringUtil.rmHtmlTag(parse.getString("content")));
            }
        }
    }

    private static void parseStr(FileOutputStream fos, String text) throws IOException {
        // Segment the text with ansj and write the space-separated tokens as one corpus line
        List<Term> terms = ToAnalysis.parse(text);
        StringBuilder sb = new StringBuilder();
        for (Term term : terms) {
            sb.append(term.getName());
            sb.append(" ");
        }
        fos.write(sb.toString().getBytes());
        fos.write("\n".getBytes());
    }
}

word2vec_java's People

Contributors

ansjsun, chongwf, elloray


word2vec_java's Issues

Could you briefly describe the workflow?

1. Load the corpus
2. Segment the corpus into words
3. Train on the segmented text and save the resulting word vectors
4. Use the saved vectors to compute the nearest words for an input word

Is my understanding correct?
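
For reference, the four steps map roughly onto the README example above (this is just a condensed sketch of that example, not new functionality):

    // 1.-2. segment the raw text and write space-separated tokens, one document per line (see parseStr above)
    // 3. train on the segmented corpus and save the vectors
    Learn learn = new Learn();
    learn.learnFile(new File("corpus/result.txt"));
    learn.saveModel(new File("model/vector.mod"));
    // 4. load the saved vectors and query nearest neighbours
    Word2VEC w2v = new Word2VEC();
    w2v.loadJavaModel("model/vector.mod");
    System.out.println(w2v.distance("姚明"));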

Question about one line

What does the line File[] files = new File("corpus/sport/").listFiles(); do? Thanks!

Garbled text when loading a Google-format model

I trained a binary model with the C version. Loading it works fine when the program is run directly, but when the model is loaded from a web application, any Chinese text comes out garbled. How can this be fixed?

The test output is all empty, even though model training and loading look fine

Learn learn = new Learn();
// train the model
learn.learnFile(new File("library/xh.txt"));
// save the model
learn.saveModel(new File("library/javaVector"));

Word2VEC w1 = new Word2VEC();
// load the model
w1.loadJavaModel("library/javaVector");

System.out.println(w1.distance("朋友"));
System.out.println(w1.distance("主席"));
System.out.println(w1.distance("***"));
System.out.println(w1.distance("魔术队"));

Output:
Vocab size: 26
Words in train file: 31
sucess train over!
模型加载成功
[]
[]
[]
[]
[]

After a day of debugging I finally got results with the author's corpus, and found a bug

Word2VEC w2v = new Word2VEC();
w2v.loadJavaModel("model.bin");
System.out.println(w2v.distance("魔术队"));
Result: [奥兰多 0.8990011, 新泽西 0.83124423, 奇才队 0.82303494, 网队 0.6876496, 顾明 0.68449014, 喻广生 0.6766388, 大年初五 0.67316043, 实习生 0.67124707, 佛罗里达州 0.6711269, 刘国强 0.66510504, 利纳雷斯 0.6648634, 郑金发 0.66467416, 寒风料峭 0.6624502, 孙星文 0.6613987, 廿七 0.66014194, 谷利源 0.65965, 孙永明 0.6595951, 辛祥利 0.6593127, 蓝宝石 0.6587766, 秦凤桐 0.6582865, 乔颖 0.656125, 潘家埠 0.6528409, 安卡拉 0.6501446, 刘文国 0.6500238, 马那瓜 0.6496207, 盛世良 0.64810187, 年初四 0.64746207, 时装展 0.64492613, 孟军 0.6446666, 俞俭 0.64461464, 谢湘 0.64448756, 刘世昕 0.6439994, 摩洛哥王国 0.6434583, 科托努 0.64319485, 周健伟 0.6416397, 王波 0.6413159, 阿鲁沙省 0.64103186, 刘永华 0.64082164, 侯嘉 0.64043736, 里斯本 0.6402035]

If you use the corpus provided by the author, the following processing is needed:
1. Make sure the corpus text file is UTF-8 encoded; convert it if it is not.
2. The author's corpus separates tokens with tabs, but the code splits on spaces. Either replace every tab with a space, or change the code: at Learn.java line 271, use String[] split = temp.split("[\\s　]+"); so that runs of half-width spaces, full-width spaces, or tabs are all treated as separators.
3. Found a bug: in both distance methods of Word2VEC, min = result.last().score; should sit inside the resultSize < result.size() block. Only once the number of results exceeds resultSize may the last entry's score be assigned to min as the minimum admissible score; before that it must not be assigned (see the sketch after this list).
The updated code has been pushed to: https://github.com/linshouyi/Word2VEC_java
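
A sketch of the corrected update described in point 3 (the variable names follow the issue's description, not the repository verbatim):

    if (dist > min) {
        result.add(new WordEntry(name, dist));
        if (resultSize < result.size()) {
            result.pollLast();               // evict the weakest of the resultSize+1 entries
            min = result.last().score;       // only now is min a valid cutoff for later words
        }
    }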

Why is the variable g in the CBOW model computed differently from the original source?

Hello, lines 233-236 of your Learn.java compute the variable g. The original word2vec source computes it the same way as line 234, but you commented that line out and replaced it with the formula on line 236. Why the change? The results computed this way differ a lot from the original source.

  // 'g' is the gradient multiplied by the learning rate
  // double g = (1 - word.codeArr[d] - f) * alpha;
  // double g = f*(1-f)*( word.codeArr[i] - f) * alpha;
  double g = f * (1 - f) * (word.codeArr[d] - f) * alpha;

The alibaba.fastjson link used by the test code is broken

Just a heads-up: your test code depends on alibaba.fastjson, but the relevant file link on the alibaba.fastjson GitHub is unreachable. Someone seems to have reported this on the alibaba.fastjson GitHub before, but there has been no response there yet.

Calling loadGoogleModel produces the following error

Exception in thread "main" java.lang.NumberFormatException: For input string: " �1 � 2.2万吨��`6=+f�>o��>�;b>��R=~"2��W6=��)�%�������E��=s���l"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at com.ansj.vec.Word2VEC.loadGoogleModel(Word2VEC.java:80)
at test.Test.main(Test.java:10)

compile error

com/ansj/vec/Learn.java:17: error: package love.cq.util does not exist
import love.cq.util.MapCount;
^
com/ansj/vec/Learn.java:264: error: cannot find symbol
MapCount mc = new MapCount<>();
^
symbol: class MapCount
location: class Learn
com/ansj/vec/Learn.java:264: error: cannot find symbol
MapCount mc = new MapCount<>();
^
symbol: class MapCount
location: class Learn
3 errors

The gradient g in the cbowGram model

Hi! I noticed that the formula for the gradient g in cbowGram at line 232 of Learn.java differs from the one in skipGram at line 166:

// 'g' is the gradient multiplied by the learning rate
//     skipGram:    double g = (1 - word.codeArr[d] - f) * alpha;
//     cbowGram:    double g = f*(1-f)*( word.codeArr[i] - f) * alpha;
double g = f * (1 - f) * (word.codeArr[d] - f) * alpha;

I went through Mikolov's paper and the original C code very carefully, and the gradient formula there is the same for cbowGram and skipGram. Was this change made because this formula trains better?
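
For reference, a sketch of where the original formula comes from (this derivation follows the standard hierarchical-softmax objective in word2vec.c, not this repository's code). At Huffman node d, let f = sigmoid(x·theta_d) and let code_d in {0,1} be that node's label:

    log p_d = (1 - code_d) * log(f) + code_d * log(1 - f)
    d(log p_d) / d(x·theta_d) = (1 - code_d) - f

so the C code uses g = (1 - code_d - f) * alpha. The alternative f * (1 - f) * (code_d - f) * alpha is what a squared-error objective would give, with the extra sigmoid-derivative factor f * (1 - f) shrinking the step when the sigmoid saturates; whether that actually trains better here is exactly the question being asked.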

Question about MapCount's size

 private void readVocab(File file) throws IOException {
    MapCount<String> mc = new MapCount<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(file)))) {
      String temp = null;
      while ((temp = br.readLine()) != null) {
        String[] split = temp.split(" ");
        trainWordsCount += split.length;
        for (String string : split) {
          mc.add(string);
        }
      }
    }
    // frequency passed to WordNeuron: count / mc.size(), i.e. normalized by the number of
    // distinct words in the vocabulary, not by the total token count
    for (Entry<String, Integer> element : mc.get().entrySet()) {
      wordMap.put(element.getKey(), new WordNeuron(element.getKey(),
          (double) element.getValue() / mc.size(), layerSize));
    }
  }

Shouldn't the size used here be the sum of all values in the HashMap (the total token count) rather than the HashMap's size (the number of distinct words)?
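
For illustration only, a sketch of the variant the issue is suggesting: normalize by the total token count (trainWordsCount, which readVocab already accumulates) instead of by the number of distinct words. Whether this is the intended behaviour is exactly what the issue asks.

    for (Entry<String, Integer> element : mc.get().entrySet()) {
      wordMap.put(element.getKey(), new WordNeuron(element.getKey(),
          (double) element.getValue() / trainWordsCount, layerSize));   // hypothetical change
    }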

Optimizing the distance function in Word2VEC

Word2VEC.distance finds the words most similar to a query word by scanning the whole vocabulary, repeatedly locating and replacing the entry with the lowest similarity, and finally sorting with a TreeSet. My approach uses a TreeSet directly and evicts the lowest-scored entry after every insertion; it is about 1 ms faster. Details below:

public Set<WordEntry> similar(String queryWord){

    float[] center = wordMap.get(queryWord);
    if (center == null){
        return Collections.emptySet();
    }

    int resultSize = wordMap.size() < topNSize ? wordMap.size() : topNSize;
    // seed the set with resultSize+1 sentinels so pollLast() always has something to evict
    TreeSet<WordEntry> result = new TreeSet<WordEntry>();
    for (int i = 0; i < resultSize + 1; i++){
        result.add(new WordEntry("^_^", -Float.MAX_VALUE));
    }

    for (Map.Entry<String, float[]> entry : wordMap.entrySet()){
        float[] vector = entry.getValue();
        float dist = 0;
        for (int i = 0; i < vector.length; i++){
            dist += center[i] * vector[i];   // accumulate the dot-product similarity
        }
        result.add(new WordEntry(entry.getKey(), dist));
        result.pollLast();                   // keep only the top resultSize+1 entries
    }
    result.pollFirst();

    return result;
}

The most similar word is always the query word itself, so pollFirst removes it at the end.

Why do I get the following error with your corpus?

Exception in thread "main" com.alibaba.fastjson.JSONException: syntax error, pos 1, line 1, column 2迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ( 附 图片 1 张 )
at com.alibaba.fastjson.parser.DefaultJSONParser.parse(DefaultJSONParser.java:1447)
at com.alibaba.fastjson.parser.DefaultJSONParser.parse(DefaultJSONParser.java:1333)
at com.alibaba.fastjson.JSON.parse(JSON.java:152)
at com.alibaba.fastjson.JSON.parse(JSON.java:162)
at com.alibaba.fastjson.JSON.parse(JSON.java:131)
at com.alibaba.fastjson.JSON.parseObject(JSON.java:223)
at tets_word_vec.Test.parserFile(Test.java:61)
at tets_word_vec.Test.main(Test.java:29)

Model loads successfully, but words cannot be vectorized because of an encoding problem

I ran into a strange bug: everything works when run from Eclipse, but after packaging the project into a jar with Maven and running it from the console, the model can no longer vectorize any word. After a lot of debugging, the following change fixed it. In the readString method of Word2VEC.java, change

sb.append(new String(bytes));
sb.append(new String(bytes, 0, i + 1));

to

sb.append(new String(bytes, "UTF-8"));
sb.append(new String(bytes, 0, i + 1, "UTF-8"));

Missing content

org.ansj.domain.Term;
org.ansj.splitWord.analysis.ToAnalysis;

Question about the training parameters

private int layerSize = 300;
private int window = 5;
private double sample = 1e-3;
private double alpha = 0.025;
private Boolean isCbow = false;
The parameters above are the vector size, the window size, the sub-sampling threshold for frequent words, the learning rate, and the flag selecting skip-gram rather than CBOW.

Could you explain what the following two parameters represent? Also, does this Java implementation make only a single training pass over the corpus by default, and if not, where is the number of iterations set?
public int EXP_TABLE_SIZE = 1000;
private int MAX_EXP = 6;
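
For context, a hedged sketch of how these two constants are used in the original C implementation and most ports (not verbatim code from this repository): they define a precomputed sigmoid lookup table, where EXP_TABLE_SIZE is the table resolution and MAX_EXP is the clipping range beyond which the sigmoid is treated as 0 or 1.

    // precompute sigmoid(x) for x in [-MAX_EXP, MAX_EXP]
    double[] expTable = new double[EXP_TABLE_SIZE];
    for (int i = 0; i < EXP_TABLE_SIZE; i++) {
        double x = (i / (double) EXP_TABLE_SIZE * 2 - 1) * MAX_EXP;  // map index to [-MAX_EXP, MAX_EXP]
        double e = Math.exp(x);
        expTable[i] = e / (e + 1);                                   // sigmoid(x)
    }
    // during training: if |z| >= MAX_EXP the update is skipped (or clipped), otherwise
    // double f = expTable[(int) ((z + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];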

Implementing negative sampling

A beginner's question: why isn't negative sampling implemented? I remember the Google paper reporting that skip-gram with negative sampling gives the best results. Would it be hard to add negative sampling on top of the existing code? I'm an undergraduate just getting started, so any pointers would be appreciated.
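
For orientation, a generic sketch of the skip-gram negative-sampling update as it appears in the original word2vec.c; all names here (syn0, syn1neg, unigramTable, random, contextWordIndex, targetWordIndex) are hypothetical and do not refer to this repository's code. The unigram table samples words with probability proportional to count^0.75.

    double[] neu1e = new double[layerSize];
    for (int d = 0; d < negative + 1; d++) {
        int target; int label;
        if (d == 0) { target = targetWordIndex; label = 1; }                              // the true word
        else { target = unigramTable[random.nextInt(unigramTable.length)]; label = 0; }   // a sampled negative
        if (d != 0 && target == targetWordIndex) continue;
        double z = 0;
        for (int c = 0; c < layerSize; c++) z += syn0[contextWordIndex][c] * syn1neg[target][c];
        double g = (label - 1.0 / (1.0 + Math.exp(-z))) * alpha;                          // gradient * learning rate
        for (int c = 0; c < layerSize; c++) neu1e[c] += g * syn1neg[target][c];
        for (int c = 0; c < layerSize; c++) syn1neg[target][c] += g * syn0[contextWordIndex][c];
    }
    for (int c = 0; c < layerSize; c++) syn0[contextWordIndex][c] += neu1e[c];            // update the input vector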

Number Format Exception from loadGoogleModel

I got a corpus.bin from https://pan.baidu.com/s/1dFKevtv . When I try to run the test program, it fails with the following error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "****"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at com.ansj.vec.Word2VEC.loadGoogleModel(Word2VEC.java:80)

Vector normalization in Word2VEC's loadJavaModel

Hi! I've been studying your Java implementation and it has helped a lot. While reading the two model-loading methods, loadGoogleModel and loadJavaModel, I noticed that loadJavaModel normalizes each word's vector to unit length, but the loop never resets len to 0 before computing the next word's norm, so len keeps accumulating and grows larger and larger (line 116):

for (int i = 0; i < words; i++) {
    key = dis.readUTF();
    value = new float[size];
    for (int j = 0; j < size; j++) {
        vector = dis.readFloat();
        len += vector * vector;
        value[j] = vector;
    }
    len = Math.sqrt(len);
    for (int j = 0; j < size; j++) {
        value[j] /= len;
    }
    wordMap.put(key, value);
}

This differs from the corresponding code in loadGoogleModel, so I wanted to ask: was resetting len to 0 at the start of each iteration simply forgotten here?
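
For clarity, a sketch of the fix being suggested (not the repository's actual code); the only change is resetting the accumulator at the top of the outer loop:

    for (int i = 0; i < words; i++) {
        key = dis.readUTF();
        value = new float[size];
        len = 0;                            // reset the squared norm for each word
        for (int j = 0; j < size; j++) {
            vector = dis.readFloat();
            len += vector * vector;
            value[j] = vector;
        }
        len = Math.sqrt(len);
        for (int j = 0; j < size; j++) {
            value[j] /= len;                // normalize to unit length
        }
        wordMap.put(key, value);
    }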

Cannot load a word2vec model trained with gensim

The following error is thrown:
Exception in thread "main" java.lang.NumberFormatException: For input string: "��cgensim.models.word2vec"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at com.ansj.vec.Word2VEC.loadGoogleModel(Word2VEC.java:81)
at test.Test.main(Test.java:12)
