zzmjohn / imdict-chinese-analyzer
Automatically exported from code.google.com/p/imdict-chinese-analyzer
I found that the decimal 3.5 is segmented into 3 and 5. Without any stopwords it becomes 3, 5.
How does everyone handle decimals?
Original issue reported on code.google.com by [email protected]
on 24 Jan 2012 at 4:58
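One way to keep decimals intact is to protect them before segmentation. A minimal sketch, assuming a regex pre-pass done by the caller (this is not part of imdict-chinese-analyzer itself):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalDemo {
    public static void main(String[] args) {
        // Match a decimal such as "3.5" as one unit before segmentation,
        // so it is not split into "3" and "5".
        Matcher m = Pattern.compile("\\d+\\.\\d+").matcher("价格是3.5元");
        while (m.find()) {
            System.out.println(m.group()); // prints "3.5"
        }
    }
}
```

Matched spans could then be passed through as single tokens while the remaining text goes to the segmenter.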
Professor Gao, could you provide a development guide? I am quite interested in this project and would like to study it.
Original issue reported on code.google.com by [email protected]
on 10 Mar 2011 at 1:38
What steps will reproduce the problem?
1.
Segmenting “北京市招商办” yields “北京市” and “招商办”. A search for “北京” then finds nothing.
What is the expected output? What do you see instead?
For Lucene indexing, index creation should use the maximum-matching algorithm and retrieval should use the minimum-matching algorithm. In the example above, indexing should produce “北京市”, “北京”, “市”, “招商办”, and “招商”, so that the search can succeed.
Original issue reported on code.google.com by [email protected]
on 6 Jan 2010 at 3:26
Does imdict-chinese-analyzer provide part-of-speech tagging?
Original issue reported on code.google.com by [email protected]
on 6 May 2009 at 2:55
In public List<SegToken> makeIndex(), the local variable index should be an int rather than a short; a short overflows on long documents.
Thanks again for the author's work!
Original issue reported on code.google.com by [email protected]
on 7 May 2009 at 6:41
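The overflow the reporter describes is easy to see directly; a minimal sketch (the class name is made up for illustration):

```java
public class ShortOverflowDemo {
    public static void main(String[] args) {
        // A token index stored in a short wraps past 32767,
        // so long documents end up with negative positions.
        short shortIndex = (short) 40000; // wraps to -25536
        int intIndex = 40000;             // stays 40000
        System.out.println(shortIndex);
        System.out.println(intIndex);
    }
}
```

With roughly 40,000 tokens, a short index has already wrapped to a negative value, which is why only long articles trigger the bug.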
How should this module be modified to segment Traditional Chinese?
Original issue reported on code.google.com by [email protected]
on 30 Sep 2009 at 8:48
In the performance chart on the home page, the left axis is labeled K/S, but the chart below uses bytes/S. A labeling mistake?
I haven't tested it yet; from the description it doesn't seem ready to promote, heh.
I just compared it with
http://code.google.com/p/mmseg4j/
whose segmentation speed is really high; this project should be benchmarked against it.
Original issue reported on code.google.com by [email protected]
on 30 Oct 2012 at 5:44
The default locations of stopwords.txt, coredict.mem, and bigramdict.mem are resolved relative to the class, so I think it would be better to offer users a .jar file, or a build.xml, to make this easier.
By the way, nice work!
Original issue reported on code.google.com by [email protected]
on 23 Jul 2009 at 3:01
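Because the dictionary files are resolved relative to the class, they must be on the classpath, which packaging into a jar would guarantee. A minimal sketch of the lookup behavior (the resource name mirrors the report; the class name is made up):

```java
import java.io.InputStream;

public class ResourceDemo {
    public static void main(String[] args) {
        // A leading '/' resolves the name from the classpath root;
        // the stream is null when the file is not on the classpath.
        InputStream in = ResourceDemo.class.getResourceAsStream("/stopwords.txt");
        System.out.println(in != null ? "found on classpath" : "not on classpath");
    }
}
```

This is why a prebuilt jar (or a build.xml that copies the data files next to the classes) removes the setup friction the reporter hit.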
1.
他从马上摔下来了。你马上下来一下。(Expected: “马上” should be a single word.)
2.
薄熙来自从担任商务部长以来,一直兢兢业业。(Expected: “薄熙来” should be a single word, and “来自” should not be one.)
start: Tue May 17 15:26:16 CST 2011
1
他
从
马
上
摔
下
来
了
你
马
上
下来
一下
2
薄
熙
来自
从
担任
商务
部长
以来
一直
兢兢业业
time: 0.125 s
Original issue reported on code.google.com by [email protected]
on 17 May 2011 at 7:26
How can new words be added to the dictionary?
Original issue reported on code.google.com by [email protected]
on 15 Mar 2010 at 9:01
When adding the segmenter to Nutch for index building, a NullPointerException occurs at line 181 of BiSegGraph.java, in
for (SegTokenPair edge : edges)
The cause is List<SegTokenPair> edges = getToList(current);
The comment says getToList may return null, so shouldn't that be checked here?
I don't understand the segmentation algorithm; I simply changed it to:
if (edges == null)
    continue;
Segmentation in Nutch then works without problems, but I don't know whether this breaks the algorithm.
Thanks for sharing the code.
Original issue reported on code.google.com by [email protected]
on 11 Mar 2010 at 1:45
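The reporter's guard is the standard pattern for a method that may return null; a stdlib-only sketch (the graph shape here is made up for illustration, not BiSegGraph's real structure):

```java
import java.util.Arrays;
import java.util.List;

public class NullGuardDemo {
    // Sum edge counts, skipping vertices whose edge list is null,
    // mirroring the reporter's "if (edges == null) continue;" fix.
    static int countEdges(List<List<String>> graph) {
        int n = 0;
        for (List<String> edges : graph) {
            if (edges == null) {
                continue; // without this guard, edges.size() would throw NPE
            }
            n += edges.size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> g = Arrays.asList(
                Arrays.asList("a", "b"), null, Arrays.asList("c"));
        System.out.println(countEdges(g)); // prints 3
    }
}
```

Skipping a null edge list avoids the crash, but whether a vertex with no outgoing edges is ever legitimate in the HHMM graph is a separate question for the algorithm's author.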
Solr is the enterprise search solution; it is based on Lucene and extends it, and it has attracted many users from the start.
imdict-chinese-analyzer is a nice Chinese tokenizer for me, but I have to write a bit of extra code to support Solr. If it were supported by imdict itself, that would be very nice.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2010 at 10:04
The test sentence is:
“手机主题首页”
The segmentation result is:
“手机”
“主题”
“首”
The final character “页” is lost, so a search for “首页” cannot find the correct result.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2010 at 3:05
What steps will reproduce the problem?
1.
Segmenting content like the attachment raises a NullPointerException and other errors. The errors are as follows:
java.lang.NullPointerException
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(BiSegGraph.java:188)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2446)
java.lang.ArrayIndexOutOfBoundsException: -2
    at java.util.ArrayList.get(ArrayList.java:324)
    at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(BiSegGraph.java:191)
    at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:202)
    at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
    at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
    at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
    at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:189)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:779)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2446)
Original issue reported on code.google.com by [email protected]
on 28 Jan 2010 at 9:50
Attachments:
With Lucene 3, all kinds of errors occur. It seems the API has changed. I wonder when Lucene 3 will be supported.
Original issue reported on code.google.com by arsenepark
on 16 Oct 2010 at 4:43
What steps will reproduce the problem?
Run the following segmentation test; the handling of some English words is poor:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class AnalyzerTest {
    @Test
    public void test() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);
        String text = "我是一名大学生,我来自China Jiliang University";
        TokenStream stream = analyzer.tokenStream("", new StringReader(text));
        TermAttribute att = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(att.term());
        }
        stream.close();
    }
}
What is the expected output? What do you see instead?
I expect this output:
大学生
来自
china
jiliang
university
The actual output is:
我
是
一
名
大学生
我
来自
china
jiliang
univers
What version of the product are you using? On what operating system?
Lucene 3.0.0; the analyzer is SmartChineseAnalyzer.
Please provide any additional information below.
If I want to filter out some words that I don't want indexed, how should I do it? Thank you very much!
Original issue reported on code.google.com by [email protected]
on 7 Dec 2009 at 7:27
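Two notes on the report above. The truncated "univers" output likely comes from the English Porter stemmer in the analysis chain, not from the Chinese segmenter. For filtering unwanted words, stop-word filtering after segmentation is the usual approach; I believe SmartChineseAnalyzer also offers a constructor that accepts a stop-word set, but check your Lucene version. A stdlib-only sketch of the filtering step (class and method names made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Drop every token that appears in the stop-word set.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("我", "是", "一", "名"));
        List<String> tokens = Arrays.asList("我", "是", "一", "名", "大学生");
        System.out.println(filter(tokens, stop)); // prints [大学生]
    }
}
```

The same set-membership check is what a Lucene StopFilter performs inside the token stream.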
Calling it under Lucene 3 fails!
Original issue reported on code.google.com by [email protected]
on 28 Mar 2010 at 11:45
Hello, Professor Gao.
I am a student at Wuhan University, also doing research on Chinese word segmentation. I downloaded your program, but it is rather hard to follow. Could you provide documentation for the system, such as papers on the key algorithms, or a flowchart of the whole system?
Thank you.
Original issue reported on code.google.com by [email protected]
on 24 Feb 2010 at 2:31