chenglongchen / kaggle-crowdflower

1st Place Solution for CrowdFlower Product Search Results Relevance Competition on Kaggle.

Home Page: https://www.kaggle.com/c/crowdflower-search-relevance

Python 26.25% TeX 0.14% Makefile 0.08% C++ 66.37% Perl 0.30% Logos 5.16% Yacc 1.63% Mathematica 0.08%

kaggle-crowdflower search-relevance natural-language-processing nlp kaggle-competetion relevance-competition semantic-matching kaggle search-engine product-search


kaggle-crowdflower's Issues

Run the code in Python 3

Is it possible to run the whole code base in Python 3?

When I run run_all.py in Python 3, importing pickle instead of cPickle, I get the following errors:

File "./preprocess.py", line 79, in
pickle.dump(dfTrain, f, -1)
_pickle.PicklingError: Can't pickle <function at 0x7f8a35868ae8>: attribute lookup on main failed
Traceback (most recent call last):
File "./genFeat_id_feat.py", line 36, in
dfTrain = pickle.load(f)
EOFError: Ran out of input
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 172, in
dfTrain = pickle.load(f)
EOFError: Ran out of input
Traceback (most recent call last):
File "./genFeat_distance_feat.py", line 236, in
dfTrain = pickle.load(f)
EOFError: Ran out of input

I then imported dill and used it in place of pickle.dump(dfTrain, f, -1):
dill.dump(dfTrain, f, -1)

But I got a new error when loading the file:

File "./genFeat_id_feat.py", line 36, in
dfTrain = pickle.load(f)
ModuleNotFoundError: No module named 'builtin'
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 176, in
skf = dill.load(f)
File "/home/mwp141/anaconda3/envs/chenQA/lib/python3.6/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
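
A possible reading of the first set of tracebacks above, offered as a sketch rather than a confirmed diagnosis: pickle serializes functions by name, so an anonymous function (for instance a lambda) reachable from the object being dumped would raise a PicklingError like the one shown. A dump that fails after the output file has been opened can leave that file empty, and loading an empty file gives `EOFError: Ran out of input`. The snippet below reproduces both behaviours using only the standard library. The sample objects and the helper names `can_pickle` and `load_from_empty_file` are made up for illustration and are not taken from preprocess.py.

```python
import io
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj, -1)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def load_from_empty_file():
    """Return the exception raised when unpickling from an empty stream."""
    try:
        pickle.load(io.BytesIO(b""))
    except EOFError as exc:
        return exc
    return None


# Plain data pickles without trouble.
plain_ok = can_pickle({"query": "red shoes", "median_relevance": 4})

# An anonymous function has no importable name, so pickling it fails.
lambda_ok = can_pickle({"cleaner": lambda text: text.lower()})

# An empty file (what a failed dump can leave behind) cannot be loaded.
empty_error = load_from_empty_file()

print(plain_ok, lambda_ok, type(empty_error).__name__)
```

If this is indeed the cause, replacing the lambda with a module-level function, or making sure no function object ends up inside the data being dumped, would be the first thing to try.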

An error when running this code

Hello. I ran into the error below when running the code. Is there something I have not configured correctly?

[zhouge@fly Feat]$ python run_all.py
Load data...
Done.
Pre-process data...
./preprocess.py:54: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
dfTrain["median_relevance_%d" % (i+1)][dfTrain["median_relevance"]==(i+1)] = 1
Traceback (most recent call last):
File "./preprocess.py", line 67, in
dfTrain = dfTrain.apply(clean, axis=1)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 3718, in apply
return self.apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 3808, in apply_standard
results[i] = func(v)
File "./preprocess.py", line 66, in
clean = lambda line: clean_text(line, drop_html_flag=config.drop_html_flag)
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/nlp_utils.py", line 184, in clean_text
l = drop_html(l)
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/nlp_utils.py", line 211, in drop_html
return BeautifulSoup(html).get_text(separator=" ")
TypeError: ("'NoneType' object is not callable", u'occurred at index 0')
Traceback (most recent call last):
File "./genFeat_id_feat.py", line 35, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_counting_feat.py", line 171, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_distance_feat.py", line 235, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./genFeat_basic_tfidf_feat.py", line 46, in
from sklearn.manifold import TSNE
ImportError: cannot import name TSNE
Traceback (most recent call last):
File "./genFeat_cooccurrence_tfidf_feat.py", line 144, in
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat[LSA_and_stats_feat_Jun09][Low].py", line 387, in
gen_info(feat_path_name="LSA_and_stats_feat_Jun09")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat_[LSA_svd150_and_Jaccard_coef_Jun14][Low].py", line 387, in
gen_info(feat_path_name="LSA_svd150_and_Jaccard_coef_Jun14")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat[svd100_and_bow_Jun23][Low].py", line 391, in
gen_info(feat_path_name="svd100_and_bow_Jun23")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
Traceback (most recent call last):
File "./combine_feat[svd100_and_bow_Jun27]_[High].py", line 437, in
gen_info(feat_path_name="svd100_and_bow_Jun27")
File "/home/zhouge/software/tool/kaggle/Kaggle_CrowdFlower/Code/Feat/gen_info.py", line 38, in gen_info
with open(config.processed_train_data_path, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../Feat/solution/train.processed.csv.pkl'
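
Some hedged observations on the log above. The `'NoneType' object is not callable` error at `BeautifulSoup(html).get_text(separator=" ")` is what one would expect if the installed BeautifulSoup is the older version 3 package, where `get_text` is not a method and an unknown attribute resolves to a tag lookup that returns None; `get_text` belongs to bs4. The repeated `No such file or directory` errors look like a consequence rather than separate problems, since preprocess.py stopped before writing train.processed.csv.pkl. The `cannot import name TSNE` error suggests a scikit-learn release older than the one that introduced TSNE. As an illustration of what `drop_html` is doing, here is a minimal stand-in built on the standard library's `html.parser` instead of BeautifulSoup. It is a swapped-in technique for demonstration, not the repository's implementation, and the class name `_TextExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def drop_html(html, separator=" "):
    """Return the visible text of html with tags removed.

    Stand-in for BeautifulSoup(html).get_text(separator=" "); unlike a
    full HTML library it does not skip <script> or <style> contents.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    pieces = (chunk.strip() for chunk in parser.chunks)
    return separator.join(piece for piece in pieces if piece)


print(drop_html("<p>red <b>shoes</b> &amp; socks</p>"))
```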

Statistical Distance Features for Test data

How do you generate the statistical distance features (described in Sect. 3.2.2 of your notes) for the test data? There are no median_relevance labels for the test data, so how is it possible to group the test data by median_relevance?
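
One way such features can be built without test labels, sketched here as an assumption and not as the author's confirmed method: group only the training rows by median_relevance, then describe any row (train or test) by statistics of its distance to the training rows in each group. The test row never needs a label, because it is only compared against labelled training rows. The function names, the choice of a Jaccard coefficient on word sets, the use of the mean as the only statistic, and the label range 1 to 4 are all illustrative assumptions.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard coefficient between the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def group_by_label(train_titles, train_labels):
    """Map each relevance label to the training titles that carry it."""
    groups = {}
    for title, label in zip(train_titles, train_labels):
        groups.setdefault(label, []).append(title)
    return groups


def dist_stats(title, groups, labels=(1, 2, 3, 4)):
    """Mean similarity of title to the training titles of each label.

    Works for test rows too: only the training rows need labels.
    """
    feats = []
    for label in labels:
        sims = [jaccard(title, other) for other in groups.get(label, [])]
        feats.append(mean(sims) if sims else 0.0)
    return feats


# Toy data, made up for illustration.
train_titles = ["blue hat", "red shoes", "red running shoes"]
train_labels = [1, 4, 4]
groups = group_by_label(train_titles, train_labels)

# An unlabelled test title gets one feature per relevance group.
test_feats = dist_stats("red shoes", groups)
print(test_feats)
```

For training rows in cross-validation, the same idea would call for building the groups from the training fold only, so that a row is not compared against its own label.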

What version of sklearn did you use for this project? And could you show us the code to generate stratifiedKFold.query.pkl stratifiedKFold.relevance.pkl ?

Hi, I am trying to reproduce your solution, but the following error was raised when I executed:

python3 genFeat_id_feat.py

/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

Generate id features...
For cross-validation...
Traceback (most recent call last):
File "genFeat_id_feat.py", line 56, in
for fold, (validInd, trainInd) in enumerate(skf[run]):
File "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py", line 82, in iter
ind = np.arange(self.n)
AttributeError: 'StratifiedKFold' object has no attribute 'n'

So this looks like a deprecation issue with that attribute. Could you help by showing the code that generates the two files stratifiedKFold.query.pkl and stratifiedKFold.relevance.pkl?

Thanks!
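
The `AttributeError` above is consistent with a fold object pickled under an older scikit-learn and loaded under a newer one, where the class no longer has the same attributes. Pending the original generation code, one workaround is to regenerate the folds and pickle plain index lists, which do not depend on any library class. The sketch below is a pure-Python stand-in for stratified K-fold splitting and is not the author's code; the function name `stratified_folds`, the toy labels, the fold count, and the number of runs are all assumptions. With scikit-learn installed, `sklearn.model_selection.StratifiedKFold` (the module named in the deprecation warning) would be the usual tool instead.

```python
import pickle
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Split row indices into n_folds (train, test) pairs.

    Rows of each label are shuffled and dealt round-robin to the folds,
    so every fold keeps roughly the overall label proportions.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = {}
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % n_folds

    n_rows = len(labels)
    folds = []
    for k in range(n_folds):
        test = [i for i in range(n_rows) if fold_of[i] == k]
        train = [i for i in range(n_rows) if fold_of[i] != k]
        folds.append((train, test))
    return folds


# Toy labels; the real ones would be median_relevance or the query ids.
labels = [1, 1, 1, 2, 2, 2, 4, 4, 4]

# One list of folds per run, matching the skf[run] indexing in the traceback.
skf = [stratified_folds(labels, n_folds=3, seed=run) for run in range(2)]

# Plain lists and tuples round-trip through pickle on any library version.
payload = pickle.dumps(skf, -1)
restored = pickle.loads(payload)
print(restored[0][0])
```

Note that the loop in genFeat_id_feat.py unpacks each pair as (validInd, trainInd), so the order in which the two index lists are stored matters to the downstream scripts.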
