zzmjohn / imdict-chinese-analyzer
Automatically exported from code.google.com/p/imdict-chinese-analyzer
I found that the decimal 3.5 is segmented into 3 and 5. Without any stopwords it becomes 3, 5.
How does everyone handle decimals?
Original issue reported on code.google.com by [email protected]
on 24 Jan 2012 at 4:58
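One way to keep decimals intact is to protect them before segmentation. A minimal sketch, assuming a regex pre-pass done by the caller (this is not part of imdict-chinese-analyzer itself):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalDemo {
    public static void main(String[] args) {
        // Match a decimal such as "3.5" as one unit before segmentation,
        // so it is not split into "3" and "5".
        Matcher m = Pattern.compile("\\d+\\.\\d+").matcher("价格是3.5元");
        while (m.find()) {
            System.out.println(m.group()); // prints "3.5"
        }
    }
}
```

Matched spans could then be passed through as single tokens while the remaining text goes to the segmenter.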
Professor Gao, could you provide a development guide? I am quite interested in this project and would like to study it.
Original issue reported on code.google.com by [email protected]
on 10 Mar 2011 at 1:38
What steps will reproduce the problem?
1.
Segmenting “北京市招商办” yields “北京市” and “招商办”. A search for “北京” then finds nothing.
What is the expected output? What do you see instead?
For Lucene indexing, index creation should use the maximum-matching algorithm and retrieval should use the minimum-matching algorithm. In the example above, indexing should produce “北京市”, “北京”, “市”, “招商办”, and “招商”, so that the search can succeed.
Original issue reported on code.google.com by [email protected]
on 6 Jan 2010 at 3:26
Does imdict-chinese-analyzer provide part-of-speech tagging?
Original issue reported on code.google.com by [email protected]
on 6 May 2009 at 2:55
In public List<SegToken> makeIndex(), the local variable index should be an int rather than a short; a short overflows on long documents.
Thanks again for the author's work!
Original issue reported on code.google.com by [email protected]
on 7 May 2009 at 6:41
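The overflow the reporter describes is easy to see directly; a minimal sketch (the class name is made up for illustration):

```java
public class ShortOverflowDemo {
    public static void main(String[] args) {
        // A token index stored in a short wraps past 32767,
        // so long documents end up with negative positions.
        short shortIndex = (short) 40000; // wraps to -25536
        int intIndex = 40000;             // stays 40000
        System.out.println(shortIndex);
        System.out.println(intIndex);
    }
}
```

With roughly 40,000 tokens, a short index has already wrapped to a negative value, which is why only long articles trigger the bug.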
How should this module be modified to segment Traditional Chinese?
Original issue reported on code.google.com by [email protected]
on 30 Sep 2009 at 8:48
In the performance chart on the home page, the left axis is labeled K/S, but the chart below uses bytes/S. A labeling mistake?
I haven't tested it yet; from the description it doesn't seem ready to promote, heh.
I just compared it with
http://code.google.com/p/mmseg4j/
whose segmentation speed is really high; this project should be benchmarked against it.
Original issue reported on code.google.com by [email protected]
on 30 Oct 2012 at 5:44
The default locations of stopwords.txt, coredict.mem, and bigramdict.mem are resolved relative to the class, so I think it would be better to offer users a .jar file, or a build.xml, to make this easier.
By the way, nice work!
Original issue reported on code.google.com by [email protected]
on 23 Jul 2009 at 3:01
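Because the dictionary files are resolved relative to the class, they must be on the classpath, which packaging into a jar would guarantee. A minimal sketch of the lookup behavior (the resource name mirrors the report; the class name is made up):

```java
import java.io.InputStream;

public class ResourceDemo {
    public static void main(String[] args) {
        // A leading '/' resolves the name from the classpath root;
        // the stream is null when the file is not on the classpath.
        InputStream in = ResourceDemo.class.getResourceAsStream("/stopwords.txt");
        System.out.println(in != null ? "found on classpath" : "not on classpath");
    }
}
```

This is why a prebuilt jar (or a build.xml that copies the data files next to the classes) removes the setup friction the reporter hit.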
1.
他从马上摔下来了。你马上下来一下。(Expected: “马上” should be a single word.)
2.
薄熙来自从担任商务部长以来,一直兢兢业业。(Expected: “薄熙来” should be a single word, and “来自” should not be one.)
start: Tue May 17 15:26:16 CST 2011
1
他
从
马
上
摔
下
来
了
你
马
上
下来
一下
2
薄
熙
来自
从
担任
商务
部长
以来
一直
兢兢业业
time: 0.125 s
Original issue reported on code.google.com by [email protected]
on 17 May 2011 at 7:26
How can new words be added to the dictionary?
Original issue reported on code.google.com by [email protected]
on 15 Mar 2010 at 9:01
When adding the segmenter to Nutch for index building, a NullPointerException occurs at line 181 of BiSegGraph.java, in
for (SegTokenPair edge : edges)
The cause is List<SegTokenPair> edges = getToList(current);
The comment says getToList may return null, so shouldn't that be checked here?
I don't understand the segmentation algorithm; I simply changed it to:
if (edges == null)
    continue;
Segmentation in Nutch then works without problems, but I don't know whether this breaks the algorithm.
Thanks for sharing the code.
Original issue reported on code.google.com by [email protected]
on 11 Mar 2010 at 1:45
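The reporter's guard is the standard pattern for a method that may return null; a stdlib-only sketch (the graph shape here is made up for illustration, not BiSegGraph's real structure):

```java
import java.util.Arrays;
import java.util.List;

public class NullGuardDemo {
    // Sum edge counts, skipping vertices whose edge list is null,
    // mirroring the reporter's "if (edges == null) continue;" fix.
    static int countEdges(List<List<String>> graph) {
        int n = 0;
        for (List<String> edges : graph) {
            if (edges == null) {
                continue; // without this guard, edges.size() would throw NPE
            }
            n += edges.size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> g = Arrays.asList(
                Arrays.asList("a", "b"), null, Arrays.asList("c"));
        System.out.println(countEdges(g)); // prints 3
    }
}
```

Skipping a null edge list avoids the crash, but whether a vertex with no outgoing edges is ever legitimate in the HHMM graph is a separate question for the algorithm's author.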
Solr is the enterprise search solution; it is based on Lucene and extends it, and it has attracted many users from the start.
imdict-chinese-analyzer is a nice Chinese tokenizer for me, but I have to write a bit of extra code to support Solr. If it were supported by imdict itself, that would be very nice.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2010 at 10:04
The test sentence is:
“手机主题首页”
The segmentation result is:
“手机”
“主题”
“首”
The final character “页” is lost, so a search for “首页” cannot find the correct result.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2010 at 3:05
What steps will reproduce the problem?
1.
Segmenting content like the attachment raises a NullPointerException and other errors. The errors are as follows:
java.lang.NullPointerException
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(BiSegGraph.java:188)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2446)
java.lang.ArrayIndexOutOfBoundsException: -2
    at java.util.ArrayList.get(ArrayList.java:324)
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(BiSegGraph.java:191)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:189)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2446)
Original issue reported on code.google.com by [email protected]
on 28 Jan 2010 at 9:50
Attachments:
With Lucene 3, all kinds of errors occur. It seems the API has changed. I wonder when Lucene 3 will be supported.
Original issue reported on code.google.com by arsenepark
on 16 Oct 2010 at 4:43
What steps will reproduce the problem?
Run the following segmentation test; the handling of some English words is poor:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class AnalyzerTest {
    @Test
    public void test() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);
        String text = "我是一名大学生,我来自China Jiliang University";
        TokenStream stream = analyzer.tokenStream("", new StringReader(text));
        TermAttribute att = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(att.term());
        }
        stream.close();
    }
}
What is the expected output? What do you see instead?
I expect this output:
大学生
来自
china
jiliang
university
The actual output is:
我
是
一
名
大学生
我
来自
china
jiliang
univers
What version of the product are you using? On what operating system?
Lucene 3.0.0; the analyzer is SmartChineseAnalyzer.
Please provide any additional information below.
If I want to filter out some words that I don't want indexed, how should I do it? Thank you very much!
Original issue reported on code.google.com by [email protected]
on 7 Dec 2009 at 7:27
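Two notes on the report above. The truncated "univers" output likely comes from the English Porter stemmer in the analysis chain, not from the Chinese segmenter. For filtering unwanted words, stop-word filtering after segmentation is the usual approach; I believe SmartChineseAnalyzer also offers a constructor that accepts a stop-word set, but check your Lucene version. A stdlib-only sketch of the filtering step (class and method names made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Drop every token that appears in the stop-word set.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("我", "是", "一", "名"));
        List<String> tokens = Arrays.asList("我", "是", "一", "名", "大学生");
        System.out.println(filter(tokens, stop)); // prints [大学生]
    }
}
```

The same set-membership check is what a Lucene StopFilter performs inside the token stream.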
Calling it under Lucene 3 fails!
Original issue reported on code.google.com by [email protected]
on 28 Mar 2010 at 11:45
Hello, Professor Gao.
I am a student at Wuhan University, also doing research on Chinese word segmentation. I downloaded your program, but it is rather hard to follow. Could you provide documentation for the system, such as papers on the key algorithms, or a flowchart of the whole system?
Thank you.
Original issue reported on code.google.com by [email protected]
on 24 Feb 2010 at 2:31