
simhash's People

Contributors

alexyangfox, bitdeli-chef, innernull, micheal-zhang-0111, yanyiwu

simhash's Issues

question about dict

Sorry, I am a newbie.

There are four UTF-8 files in the dict directory, and I am confused about where they come from and what each of them is used for. Can I change them?

Could you provide some clearer information about them? Thanks.

Assertion failed:

tw:build tw$ ./bin/simhash.demo
Assertion failed: (buf.size() == DICT_COLUMN_NUM), function _loadDict, file /Users/taowei/simhash/src/CppJieba/DictTrie.hpp, line 161.
Abort trap: 6

The same problem also occurs when I create a project and import src into Xcode.

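Question about simhash bits

Hi,
I have a question about the number of bits in a simhash. The original paper uses 64 bits, so if each char represents 4 bits (0-f), the result should be a string of length 16.
In the demo, however, I see a string of length 20. If each char is 4 bits (which I doubt, since I only see 0-9 and never a-f), what are the extra 16 bits for?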
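One possible reading, not confirmed by the project: if the demo prints the 64-bit fingerprint as a decimal integer rather than in hexadecimal, each character is a decimal digit rather than a 4-bit nibble, and a 64-bit unsigned value needs up to 20 decimal digits. That would account for both the length and the absence of a-f without any extra bits. The sketch below compares the two renderings; the helper names `toDecimal` and `toHex` are hypothetical and not part of the library.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Render a 64-bit fingerprint as a decimal string (at most 20 digits).
std::string toDecimal(std::uint64_t v) {
    std::ostringstream oss;
    oss << v;
    return oss.str();
}

// Render the same fingerprint as zero-padded hexadecimal (always 16 digits).
std::string toHex(std::uint64_t v) {
    std::ostringstream oss;
    oss << std::hex << std::setw(16) << std::setfill('0') << v;
    return oss.str();
}
```

For the largest 64-bit value, `toDecimal(UINT64_MAX)` gives `18446744073709551615` (20 characters) while `toHex(UINT64_MAX)` gives `ffffffffffffffff` (16 characters).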

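Path problem

I want to run the demo, but all the paths are relative, so it does not run on my machine. I don't know C++ very well, and changing the paths one by one is too tedious. Do you have any good suggestions?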

Algorithm optimization question

Hi,
For the following two sentences, using the top 32 keywords, the Hamming distance is 6.

都听别人说好做,自己还没尝试呢 我是个大学生,我也想开店,楼主要好好教我啊 加油把

都听别人说好做,自己还没尝试呢 我是个大学生,我也想开店,楼主要好好教我啊 学习了

What ideas are there for optimizing this?
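For reference, the distance quoted above is the number of bit positions in which two 64-bit fingerprints differ. A minimal sketch of that computation follows, together with a threshold check. The function names are hypothetical, and the default threshold of 3 is only an illustrative assumption; a workload like the one in this issue might need a looser threshold or a different `topN`.

```cpp
#include <cstdint>

// Count the bit positions in which two 64-bit fingerprints differ.
inline unsigned hammingDistance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t diff = a ^ b;  // set bits mark differing positions
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Treat two fingerprints as near-duplicates when they differ in at most
// `threshold` bits. The default value is an assumption for illustration.
inline bool isNearDuplicate(std::uint64_t a, std::uint64_t b,
                            unsigned threshold = 3) {
    return hammingDistance(a, b) <= threshold;
}
```

With this check, a pair at distance 6 is rejected at the default threshold and accepted only when the caller raises the threshold to 6 or more.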

compiling error

simhash/cppjieba/../limonp/StdExtension.hpp:19: error: 'unordered_map' is already declared in this scope
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
Copyright © 2010 Free Software Foundation, Inc.

Has anybody else run into this?

How can I fix it?

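I have been trying simhash for news deduplication and have three questions I hope you can help answer

1. How was the IDF in your dictionary computed? When processing a very large number of documents, does the IDF need to be updated?
2. How do you handle sentences such as “鍗楁棆鎺ц偂寮 姤1.32鍏 楂树笂甯备环10% 銆  鍗楁棆鎺ц偂锛” and “懆浜斿憿鏄 [富锷涢槾闄╃殑鐜╀竴鎶婏纴涓嶈Е纰”?
3. When computing Hamming distances between several texts, I found that the documents differ a lot but the Hamming distance is small. What might cause this? Is it a term-frequency setting issue?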

Documentation for simhash?

Does this project have any documentation?

In demo.cpp, what do the following two calls

simhasher.extract(s, res, topN);
simhasher.make(s, topN, u64);

each mean?
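Judging only from the call shapes above and from the signature and assertion quoted in other issues on this page, `extract` appears to fill `res` with up to `topN` (keyword, weight) pairs from `s`, and `make` appears to turn `s` into a single 64-bit value `u64`. This is an inference, not documentation. The general simhash idea behind such a `make` step can be sketched as below. The sketch is not the library's implementation: `makeFingerprint` is a hypothetical name, and `std::hash` is a stand-in for whatever hash the project actually uses (it is also assumed to yield 64 bits on the target platform).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A (keyword, weight) pair, for example a term and its TF-IDF score.
typedef std::pair<std::string, double> WeightedWord;

// Combine weighted keywords into one 64-bit fingerprint.
// std::hash is a stand-in hash chosen for this sketch only.
std::uint64_t makeFingerprint(const std::vector<WeightedWord>& words) {
    double tally[64] = {0.0};
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i < words.size(); ++i) {
        const std::uint64_t h =
            static_cast<std::uint64_t>(hasher(words[i].first));
        for (int bit = 0; bit < 64; ++bit) {
            // A set bit votes +weight, a clear bit votes -weight.
            if ((h >> bit) & 1u) {
                tally[bit] += words[i].second;
            } else {
                tally[bit] -= words[i].second;
            }
        }
    }
    std::uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        // Keep the bit only where the positive votes outweigh the negative.
        if (tally[bit] > 0.0) {
            fingerprint |= (static_cast<std::uint64_t>(1) << bit);
        }
    }
    return fingerprint;
}
```

Because every keyword votes on every bit in proportion to its weight, two texts that share their heaviest keywords tend to agree on most bits, which is why similar texts end up at a small Hamming distance.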

Error when compiling with g++ 4.8.2 and the option -std=c++11

In file included from /home/janfan/documents/simhash/src/CppJieba/MixSegment.hpp:5:0,
from /home/janfan/documents/simhash/src/CppJieba/KeywordExtractor.hpp:4,
from /home/janfan/documents/simhash/src/Simhasher.hpp:4,
from /home/janfan/documents/simhash/src/main.cpp:6:
/home/janfan/documents/simhash/src/CppJieba/MPSegment.hpp: In member function ‘bool CppJieba::MPSegment::cut(Limonp::LocalVector::const_iterator, Limonp::LocalVector::const_iterator, std::vector<Limonp::LocalVector >&) const’:
/home/janfan/documents/simhash/src/CppJieba/MPSegment.hpp:103:106: error: no matching function for call to ‘make_pair(size_t&, NULL)’
segmentChars[i].dag.insert(make_pair<DagType::key_type, DagType::mapped_type>(i, NULL));
^
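The failing line passes explicit template arguments to `make_pair`. Under C++11, `std::make_pair` takes its parameters by forwarding reference, so spelling out `make_pair<K, V>` turns them into rvalue references, and the lvalue `i` can no longer bind. A minimal sketch of one way around this is to build the map's `value_type` directly. Here `DictUnit` is an empty stand-in and the `DagType` typedef is an assumption about its shape, not the definition from CppJieba.

```cpp
#include <cstddef>
#include <map>
#include <utility>

struct DictUnit {};  // hypothetical stand-in for CppJieba::DictUnit

// Assumed shape of DagType: position -> pointer to a dictionary entry.
typedef std::map<std::size_t, const DictUnit*> DagType;

// Insert an entry with a null mapped pointer without naming make_pair's
// template arguments, so the code compiles as both C++98 and C++11.
void insertNullEntry(DagType& dag, std::size_t i) {
    // The explicit cast keeps NULL from being treated as an integer.
    dag.insert(DagType::value_type(i, static_cast<const DictUnit*>(NULL)));
}
```

Constructing the pair with an explicitly typed null pointer also avoids relying on how a given compiler deduces the type of `NULL`.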

Compilation fails with gcc 4.4.6, with error: invalid conversion from ‘long int’ to ‘const CppJieba::DictUnit*’

[ 16%] Building CXX object src/CMakeFiles/simhash.demo.dir/main.cpp.o
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algobase.h:66,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/char_traits.h:41,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/ios:41,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/ostream:40,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/iostream:40,
from /search/billczhang/xfs/simhash/src/main.cpp:2:
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_pair.h: In constructor ‘std::pair<_T1, _T2>::pair(_U1&&, _U2&&) [with _U1 = size_t&, _U2 = long int, T1 = long unsigned int, T2 = const CppJieba::DictUnit]’:
/search/billczhang/xfs/simhash/src/CppJieba/MPSegment.hpp:96: instantiated from here
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_pair.h:90: error: invalid conversion from ‘long int’ to ‘const CppJieba::DictUnit*’

make[2]: *** [src/CMakeFiles/simhash.demo.dir/main.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/simhash.demo.dir/all] Error 2
make: *** [all] Error 2

Cannot compile when simhash and the cppjieba segmenter are placed in the same directory

Simhasher.hpp: In member function ‘bool simhash::Simhasher::extract(const string&, std::vector<std::pair<std::__cxx11::basic_string, double> >&, size_t) const’:
Simhasher.hpp:23:58: error: void value not ignored as it ought to be
return _extractor.Extract(text, res, topN);
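The message "void value not ignored as it ought to be" means the `bool` function is trying to return the result of a call that returns `void`. That suggests the `Extract` found at compile time has a different return type from the one `Simhasher.hpp` was written against, possibly because the two projects are at mismatched versions, though the issue does not confirm this. A minimal sketch of the pattern and one way to adapt a wrapper follows. `FakeExtractor` is a hypothetical stand-in, not the cppjieba class.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, double> > WordWeights;

// Hypothetical extractor whose Extract returns void, mirroring the error.
struct FakeExtractor {
    void Extract(const std::string& text, WordWeights& res,
                 std::size_t topN) const {
        res.clear();
        if (!text.empty() && topN > 0) {
            res.push_back(std::make_pair(text, 1.0));  // placeholder result
        }
    }
};

// Wrapper that keeps a bool return type: call the void function as a
// statement, then report success explicitly.
bool extract(const FakeExtractor& extractor, const std::string& text,
             WordWeights& res, std::size_t topN) {
    extractor.Extract(text, res, topN);  // not `return extractor.Extract(...)`
    return true;
}
```

Whether to patch the wrapper this way or to pair the header with a matching extractor version depends on which behaviour the caller expects from the return value.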

Assertion failed: (topN == wordweights.size())

I pass in the following text:
string s("我想吃饭,我最喜欢计算机了。");
Running it then produces this error:
Assertion failed: (topN == wordweights.size()), function make, file /Users/taowei/Documents/工程/simhash/simhash/Simhasher.hpp, line 33.
