chenglongchen / kaggle-crowdflower

1st Place Solution for CrowdFlower Product Search Results Relevance Competition on Kaggle.

Home Page: https://www.kaggle.com/c/crowdflower-search-relevance

Languages: C++ 66.37%, Python 26.25%, Logos 5.16%, Yacc 1.63%, Perl 0.30%, TeX 0.14%, Makefile 0.08%, Mathematica 0.08%
Topics: kaggle-crowdflower, search-relevance, natural-language-processing, nlp, kaggle-competetion, relevance-competition, semantic-matching, kaggle, search-engine, product-search

kaggle-crowdflower's Issues

Run the code in Python 3

Hi,

Is it possible to run the whole codebase in Python 3?

When I run run_all.py under Python 3, importing pickle instead of cPickle, I get the following errors:

File "./preprocess.py", line 79, in
pickle.dump(dfTrain, f, -1)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7f8a35868ae8>: attribute lookup <lambda> on __main__ failed
Traceback (most recent call last):
File "./genFeat_id_feat.py", line 36, in
dfTrain = pickle.load(f)
EOFError: Ran out of input
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 172, in
dfTrain = pickle.load(f)
EOFError: Ran out of input
Traceback (most recent call last):
File "./genFeat_distance_feat.py", line 236, in
dfTrain = pickle.load(f)
EOFError: Ran out of input

Then I imported dill and replaced pickle.dump(dfTrain, f, -1) with:
dill.dump(dfTrain, f, -1)

But then I got a new error when loading the data:

File "./genFeat_id_feat.py", line 36, in
dfTrain = pickle.load(f)
ModuleNotFoundError: No module named '__builtin__'
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 176, in
skf = dill.load(f)
File "/home/mwp141/anaconda3/envs/chenQA/lib/python3.6/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
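
A note on the first set of tracebacks: the PicklingError says pickle could not serialize a function object, and the later "EOFError: Ran out of input" messages are what pickle.load reports when the file it reads is empty, which is what a failed dump leaves behind. The sketch below only reproduces those two behaviours in isolation, under Python 3. The helper names (clean_row, can_pickle, load_or_none) are made up for illustration and are not taken from the repository.

```python
import io
import pickle


def clean_row(text):
    """A module-level function: pickle stores it by name, not by value."""
    return text.strip().lower()


def can_pickle(obj):
    """Return True if pickle.dumps accepts obj, False if it refuses."""
    try:
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


def load_or_none(buffer):
    """Return the unpickled object, or None if the buffer holds no data."""
    try:
        return pickle.load(buffer)
    except EOFError:
        # Same "Ran out of input" condition as in the tracebacks above.
        return None


# A lambda has no importable name, so the standard pickle module rejects it.
lambda_ok = can_pickle(lambda text: text.strip().lower())
named_ok = can_pickle(clean_row)

# An empty file (for example, one left behind by a failed dump) cannot be loaded.
empty_result = load_or_none(io.BytesIO(b""))

print("lambda picklable:", lambda_ok)
print("named function picklable:", named_ok)
print("load from empty buffer:", empty_result)
```

If a lambda does end up inside the object being dumped, replacing it with a named module-level function is usually enough for the standard pickle module. Whether that is the actual cause here cannot be confirmed from the traceback alone.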

An error when running this code

Hello. I ran into the errors below when running the code. Is there something I have not configured correctly?

[zhouge@fly Feat]$ python run_all.py
Load data...
Done.
Pre-process data...
./preprocess.py:54: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
dfTrain["median_relevance_%d" % (i+1)][dfTrain["median_relevance"]==(i+1)] = 1
Traceback (most recent call last):
File "./preprocess.py", line 67, in
dfTrain = dfTrain.apply(clean, axis=1)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 3718, in apply
return self.apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 3808, in apply_standard
results[i] = func(v)
File "./preprocess.py", line 66, in
clean = lambda line: clean_text(line, drop_html_flag=config.drop_html_flag)
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/nlp_utils.py", line 184, in clean_text
l = drop_html(l)
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/nlp_utils.py", line 211, in drop_html
return BeautifulSoup(html).get_text(separator=" ")
TypeError: ("'NoneType' object is not callable", u'occurred at index 0')
Traceback (most recent call last):
File "./genFeat_id_feat.py", line 35, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 171, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_distance_feat.py", line 235, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_basic_tfidf_feat.py", line 46, in
from sklearn.manifold import TSNE
ImportError: cannot import name TSNE
Traceback (most recent call last):
File "./genFeat_cooccurrence_tfidf_feat.py", line 144, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat_[LSA_and_stats_feat_Jun09]_[Low].py", line 387, in
gen_info(feat_path_name="LSA_and_stats_feat_Jun09")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat_[LSA_svd150_and_Jaccard_coef_Jun14]_[Low].py", line 387, in
gen_info(feat_path_name="LSA_svd150_and_Jaccard_coef_Jun14")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat_[svd100_and_bow_Jun23]_[Low].py", line 391, in
gen_info(feat_path_name="svd100_and_bow_Jun23")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat_[svd100_and_bow_Jun27]_[High].py", line 437, in
gen_info(feat_path_name="svd100_and_bow_Jun27")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
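
The first real failure in this log is the TypeError raised by BeautifulSoup(html).get_text(separator=" "). Every later IOError follows from it, because train.processed.csv.pkl was never written. That TypeError is consistent with an old BeautifulSoup 3 install, where an unknown attribute such as get_text is treated as a child-tag lookup and comes back as None. This is an assumption, since the issue does not show which BeautifulSoup version is imported. As a way to check the rest of the pipeline without that dependency, here is a minimal tag-stripping sketch using only the Python 3 standard library. drop_html_stdlib is a hypothetical stand-in, not the repository's drop_html, and its whitespace handling differs from get_text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def drop_html_stdlib(html, separator=" "):
    """Return the visible text of html, with text nodes joined by separator."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Trim each text node and drop the ones that are only whitespace.
    pieces = [part.strip() for part in parser.parts]
    return separator.join(piece for piece in pieces if piece)


print(drop_html_stdlib("<p>red <b>shoes</b></p>"))
```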

Statistical Distance Features for Test data

How do you generate the statistical distance features (described in Sect. 3.2.2 of your notes) for the test data? There are no median_relevance labels for the test data, so how is it possible to group the test data by median_relevance?

What version of sklearn did you use for this project? And could you show us the code to generate stratifiedKFold.query.pkl and stratifiedKFold.relevance.pkl?
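
This thread has no reply, so the following is only one plausible reading, not the author's confirmed method: the groups are built from the training rows and their labels, and each test row is described by statistics of its distances to every training group. Under that reading the test row itself needs no label. The sketch uses Jaccard distance on token sets and the Python 3 standard library. The function names, the toy data, and the choice of statistics are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on token sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def group_distance_stats(sample_tokens, train_tokens, train_relevance, skip_index=None):
    """Map each training relevance label to (min, mean, max, std) of distances.

    skip_index lets a training row leave itself out of its own statistics;
    for a test row it stays None.
    """
    by_label = {}
    for i, (tokens, label) in enumerate(zip(train_tokens, train_relevance)):
        if i == skip_index:
            continue
        distance = jaccard_distance(sample_tokens, tokens)
        by_label.setdefault(label, []).append(distance)
    return {
        label: (min(d), mean(d), max(d), pstdev(d))
        for label, d in sorted(by_label.items())
    }


# Toy training data (hypothetical): token lists plus their relevance labels.
train_tokens = [["red", "shoes"], ["red", "dress"], ["blue", "phone"], ["phone", "case"]]
train_relevance = [4, 4, 1, 1]

# An unlabeled test row: only the training labels define the groups.
test_tokens = ["red", "shoes", "women"]
stats = group_distance_stats(test_tokens, train_tokens, train_relevance)
for label, values in stats.items():
    print(label, values)
```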

Hi, I am trying to reproduce your solution, but the following error was raised when I executed

python3 genFeat_id_feat.py

/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

Generate id features...
For cross-validation...
Traceback (most recent call last):
File "genFeat_id_feat.py", line 56, in
for fold, (validInd, trainInd) in enumerate(skf[run]):
File "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py", line 82, in __iter__
ind = np.arange(self.n)
AttributeError: 'StratifiedKFold' object has no attribute 'n'

So this looks like an issue caused by the deprecation of that attribute. Could you show the code that generates the two files stratifiedKFold.query.pkl and stratifiedKFold.relevance.pkl?

Thanks!
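
The generating code is not shown in this thread. The loop in the traceback, enumerate(skf[run]), suggests the pickle holds one entry per run, each yielding a pair of index collections per fold. One way to sidestep the version mismatch is to store plain index lists rather than a pickled sklearn object, so the file no longer depends on the installed sklearn version. The sketch below does that with the Python 3 standard library only. The number of runs and folds, the seed, and the function names are assumptions, and which half of each pair the scripts treat as training data should be checked against the repository.

```python
import pickle
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Split indices into n_folds lists with roughly equal label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this label round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return [sorted(fold) for fold in folds]


def make_runs(labels, n_runs=3, n_folds=3, base_seed=2015):
    """Return runs[run][fold] = (remaining_indices, held_out_indices)."""
    runs = []
    for run in range(n_runs):
        folds = stratified_folds(labels, n_folds, seed=base_seed + run)
        pairs = []
        for k in range(n_folds):
            held_out = folds[k]
            rest = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            pairs.append((rest, held_out))
        runs.append(pairs)
    return runs


# Toy labels (hypothetical); in practice these would be the stratification key.
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
skf = make_runs(labels)

# Plain lists and tuples round-trip through pickle on any Python 3 setup.
payload = pickle.dumps(skf, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
print(restored[0][0])
```

With a current sklearn, the library analog is StratifiedKFold(n_splits=..., shuffle=True, random_state=...) from sklearn.model_selection, whose split(X, y) method yields (train, test) index arrays that can be stored the same way.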
