
imdict-chinese-analyzer's Issues

How to support decimal numbers

I find that the decimal 3.5 gets split into 3 and 5; without any stop words it becomes "3, 5".
How does everyone solve the decimal problem?
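One common workaround is to pre-tokenize numbers with a regular expression so that a decimal like 3.5 survives as a single token. The sketch below is a plain-Java stand-in with hypothetical names, not the analyzer's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-tokenizer: decimals match as one unit before any
// character-by-character segmentation sees them.
public class DecimalTokenDemo {
    // A decimal or integer, a single CJK character, or a run of word characters.
    private static final Pattern TOKEN =
            Pattern.compile("\\d+(?:\\.\\d+)?|[\\u4e00-\\u9fff]|\\w+");

    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("3.5元"));  // [3.5, 元]
    }
}
```

Because the decimal alternative is tried first, "3.5" wins over matching "3" and "5" separately.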

Original issue reported on code.google.com by [email protected] on 24 Jan 2012 at 4:58

Developer documentation

Professor Gao, could you provide developer documentation? I am quite interested in this
project of yours and would like to study it.


Original issue reported on code.google.com by [email protected] on 10 Mar 2011 at 1:38

Problem with the segmentation matching algorithm

What steps will reproduce the problem?
1.
Segmenting "北京市招商办" yields "北京市" and "招商办". When we then search for
"北京", nothing is found.

What is the expected output? What do you see instead?
When building a Lucene index, maximum matching should be used; when searching, minimum
matching should be used. In the example above, indexing should produce "北京市",
"北京", "市", "招商办", and "招商", so that the search can find a match.
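The idea of emitting every dictionary word at index time can be sketched with a toy dictionary and a brute-force substring scan. This is purely illustrative of the expected index-time output, not the analyzer's actual algorithm:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch: keep every substring found in the dictionary, so that both
// "北京市" and the shorter "北京" end up in the index.
public class SubwordIndexDemo {
    public static List<String> allDictWords(String text, Set<String> dict) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(
                Arrays.asList("北京市", "北京", "市", "招商办", "招商"));
        System.out.println(allDictWords("北京市招商办", dict));
        // [北京, 北京市, 市, 招商, 招商办]
    }
}
```

A real analyzer would use a trie or the segmenter's lexicon instead of scanning all substrings, but the emitted terms are the ones the issue asks for.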

Original issue reported on code.google.com by [email protected] on 6 Jan 2010 at 3:26

Part-of-speech tagging

Does imdict-chinese-analyzer provide part-of-speech tagging?

Original issue reported on code.google.com by [email protected] on 6 May 2009 at 2:55

A small bug in SegGraph.makeIndex()

The local variable index inside public List<SegToken> makeIndex()
should be an int rather than a short:
a short overflows when processing long articles.

Thanks again for the author's work!
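The overflow is easy to demonstrate in isolation (a standalone sketch, not the actual SegGraph code):

```java
// Demonstrates why the index variable must be an int, not a short: a Java
// short silently wraps to a negative value past 32767, so on a long article
// the token index becomes an invalid negative list index.
public class ShortOverflowDemo {
    static short nextIndex(short index) {
        return (short) (index + 1);  // 32767 + 1 wraps to -32768
    }

    public static void main(String[] args) {
        System.out.println(nextIndex(Short.MAX_VALUE));  // prints -32768
    }
}
```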

Original issue reported on code.google.com by [email protected] on 7 May 2009 at 6:41

Summary problem

In the performance chart on the homepage, the left axis is labeled K/S,
but the caption below says bytes/s. A labeling mistake?
I have not tested it myself; from the description it does not seem suitable for promotion.
I just made a comparison with
http://code.google.com/p/mmseg4j/
whose segmentation speed is really high; this project should be benchmarked against it.

Original issue reported on code.google.com by [email protected] on 30 Oct 2012 at 5:44

Segmentation problems

1.
他从马上摔下来了。你马上下来一下。(Expected: "马上" should be segmented as one word.)

2.
薄熙来自从担任商务部长以来,一直兢兢业业。(Expected: "薄熙来" should be one word,
and "来自" should not be a word.)

start: Tue May 17 15:26:16 CST 2011
1
他
从
马
上
摔
下
来
了
你
马
上
下来
一下
2
薄
熙
来自
从
担任
商务
部长
以来
一直
兢兢业业
time: 0.125 s

Original issue reported on code.google.com by [email protected] on 17 May 2011 at 7:26

NullPointerException in BiSegGraph.java

When adding the tokenizer to Nutch to build an index,
for (SegTokenPair edge : edges) at line 181 of BiSegGraph.java
throws a NullPointerException.

Since List<SegTokenPair> edges = getToList(current);
and the comment says getToList may return null, shouldn't this be checked here?

I do not understand the segmentation algorithm; I simply changed it to:
if (edges == null)
    continue;

Segmentation now works in Nutch, but I do not know whether this breaks the algorithm.

Thanks for the code analysis.

Original issue reported on code.google.com by [email protected] on 11 Mar 2010 at 1:45

Please consider Solr support

Solr is the enterprise search solution. It is based on Lucene, extends it,
and has attracted many users from the start.

imdict-chinese-analyzer is a nice Chinese tokenizer for me, but I have to
write some extra code to support Solr. If it were supported by imdict itself,
that would be very nice.
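For reference, Solr can usually wire a Lucene Analyzer class directly into a field type from schema.xml. A hypothetical fragment (the exact class and package names depend on how imdict-chinese-analyzer is packaged in your build) might look like:

```xml
<!-- Hypothetical schema.xml fragment: register the analyzer as a field type. -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>
```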

Original issue reported on code.google.com by [email protected] on 14 Jan 2010 at 10:04

Segmentation errors on IP-mapping-table files

What steps will reproduce the problem?
1.
Segmenting content similar to the attachment raises NullPointerException and other
errors. The errors are as follows:
java.lang.NullPointerException
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath
(BiSegGraph.java:188)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process
(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence
(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken
(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken
(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken
(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields
(DocInverterPerField.java:138)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument
(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument
(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument
(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument
(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument
(IndexWriter.java:2446)
java.lang.ArrayIndexOutOfBoundsException: -2
    at java.util.ArrayList.get(ArrayList.java:324)
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath
(BiSegGraph.java:191)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process
(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence
(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken
(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken
(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken
(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields
(DocInverterPerField.java:189)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument
(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument
(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument
(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument
(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument
(IndexWriter.java:2446)


Original issue reported on code.google.com by [email protected] on 28 Jan 2010 at 9:50

Attachments:

Lucene 3 is not supported

With Lucene 3 all kinds of errors occur; it seems the API has changed. When will
Lucene 3 be supported?

Original issue reported on code.google.com by arsenepark on 16 Oct 2010 at 4:43

Poor results on English text

What steps will reproduce the problem?

Testing segmentation with the following code gives poor results for some of the English:
public class AnalyzerTest {
    @Test
    public void test() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);
        String text = "我是一名大学生,我来自China Jiliang University";
        TokenStream stream = analyzer.tokenStream("", new StringReader(text));
        TermAttribute att = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(att.term());
        }
        stream.close();
    }
}
What is the expected output? What do you see instead?
I expect this output:

大学生
来自
china
jiliang
university

The actual output is:

我
是
一
名
大学生
我
来自
china
jiliang
univers

What version of the product are you using? On what operating system?

Lucene 3.0.0, with SmartChineseAnalyzer as the tokenizer.

Please provide any additional information below.

If I want to filter out some words that I do not want indexed, how should I do it? Thank
you very much!
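For the stop-word question, one simple approach is to post-filter the tokens against a user-supplied set. The sketch below is plain Java with hypothetical names; Lucene's own StopFilter does the same thing inside the token stream, and some versions of SmartChineseAnalyzer also accept a stop-word set in the constructor:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: drop any token that appears in a user-supplied stop-word set.
public class StopwordDemo {
    public static List<String> filter(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("我", "是", "一", "名"));
        System.out.println(filter(
                Arrays.asList("我", "是", "一", "名", "大学生"), stop));
        // [大学生]
    }
}
```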

Original issue reported on code.google.com by [email protected] on 7 Dec 2009 at 7:27

Could you provide documentation for the system?

Hello, Professor Gao.
I am a student at Wuhan University, also doing research on Chinese word segmentation. I
downloaded your program, but it is rather hard to follow. Could you provide
documentation for the system, such as the papers behind the key algorithms, or a
flowchart of the whole system?
Thank you.

Original issue reported on code.google.com by [email protected] on 24 Feb 2010 at 2:31
