minerva-ml / open-solution-toxic-comments

Open solution to the Toxic Comment Classification Challenge

Home Page: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

License: MIT License

Jupyter Notebook 15.01% Python 83.78% Shell 1.21%

data-science machine-learning deep-learning kaggle python nlp kaggle-competition pipeline challenge prediction

open-solution-toxic-comments's Introduction

Starter code: Kaggle Toxic Comment Classification Challenge

More competitions 🎇

Check the collection of public projects 🎁, where you can find multiple Kaggle competitions with code, experiments and outputs.

Here at Neptune, we enjoy participating in Kaggle competitions. The Toxic Comment Classification Challenge is especially interesting because it touches on the important issue of online harassment.

Ensemble our predictions in the cloud!

You need to be registered at neptune.ml to use our predictions for your ensemble models.

click start notebook
choose browse button
select the neptune_ensembling.ipynb file from this repository
choose worker type: gcp-large is the recommended one
run the first few cells to load our predictions on the held-out validation set along with the labels
grid search over many possible parameter options; the more runs you choose, the longer it will take
train your second-level ensemble model (it should take less than an hour once you have the parameters)
load our predictions on the test set
feed our test set predictions to your ensemble model and get final predictions
save your submission file
click on browse files and find your submission file to download it

Running the notebook as is got 0.986+ on the LB.
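
The steps above amount to choosing ensemble parameters on held-out validation predictions and then applying the fitted ensemble to test predictions. Below is a minimal sketch of that flow, assuming a plain weighted average of two first-level models as the ensemble and AUC as the score. It is not the model used in neptune_ensembling.ipynb, and every name and number in it is made up for illustration.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def blend(preds_a, preds_b, weight):
    """Weighted average of two first-level models' predictions."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


def grid_search_weight(labels, preds_a, preds_b, steps=10):
    """Pick the blend weight that maximizes AUC on the held-out validation set."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda w: roc_auc(labels, blend(preds_a, preds_b, w)))


# Toy held-out validation data (hypothetical numbers).
valid_labels = [1, 0, 1, 0, 1, 0]
valid_model_a = [0.9, 0.4, 0.5, 0.6, 0.8, 0.1]
valid_model_b = [0.6, 0.3, 0.7, 0.2, 0.4, 0.5]

best_weight = grid_search_weight(valid_labels, valid_model_a, valid_model_b)

# Apply the chosen weight to (hypothetical) test-set predictions.
test_model_a = [0.7, 0.2]
test_model_b = [0.9, 0.1]
final_predictions = blend(test_model_a, test_model_b, best_weight)
```

Because the candidate weights include 0 and 1, the blend chosen this way never scores worse on the validation set than either model alone.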

Disclaimer

In this open source solution you will find references to neptune.ml. It is a free platform for community users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution. You may run it as a plain Python script 😉.

The idea

We are contributing starter code that is easy to use and extend. We did it before with Cdiscount’s Image Classification Challenge, and we believe that it is the right way to open data science to the wider community and encourage more people to participate in challenges. This starter is a ready-to-use, end-to-end solution. Since all computations are organized in separate steps, it is also easy to extend. Check devbook.ipynb for more information about the different pipelines.

Now we want to go one step further and invite you to participate in the development of this analysis pipeline. At a later stage of the competition (early February) we will invite top contributors to join our team on Kaggle.

Contributing

You are welcome to extend this pipeline and contribute your own models or procedures. Please refer to CONTRIBUTING for more details.

Installation

option 1: Neptune cloud

on the neptune site

log in: neptune account login
create a new project named toxic: follow the Projects link (top bar, left side), then click the New project button. This action will generate the project key TOX, which is already listed in neptune.yaml.

run setup commands

$ git clone https://github.com/neptune-ml/kaggle-toxic-starter.git
$ pip3 install neptune-cli
$ neptune login

start experiment

$ neptune send --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config best_configs/fasttext_gru.yaml -- train_evaluate_predict_cv_pipeline --pipeline_name fasttext_gru --model_level first

This should get you to 0.9852. Happy training :)

Refer to Neptune documentation and Getting started: Neptune Cloud for more.

option 2: local install

Please refer to Getting started: local instance for the installation procedure.

Solution visualization

The end-to-end pipeline is visualized below. You can run exactly this one!

We have also prepared something simpler to just get you started:

User support

There are several ways to seek help:

Read the project's Wiki, where we publish descriptions of the code, pipelines and Neptune.
Kaggle discussion is our primary way of communication.
You can submit an issue directly in this repo.

open-solution-toxic-comments's Issues

Congratulations on your silver medal

Hope to see your final solution to the problem......And thanks for providing the environment for running the models...it made a lot of difference in running massive stackers :)

Seems lots of package installing issues with requirements.txt

I have tried to run "neptune send experiment_manager.py --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config neptune_config.yaml -- train_evaluate_predict_pipeline --pipeline_name glove_lstm"

but got lots of package installation issues. I tried commenting out some entries in requirements.txt, but there seem to be too many. Is there a solution for this, or did I miss something in the configuration? Thanks.

Vanished experiment_manager.py

How can I run this code on a local machine? According to your wiki, the file experiment_manager.py exists in the repo and I should use it, but I don't see it. Could you help me?

https://github.com/neptune-ml/kaggle-toxic-starter/wiki/Experimentation-guideline

KeyError:'meta'

Hi, I just pulled the master branch and ran the experiment without changing anything, but I still get this problem:
2018-03-13 21-02-21 toxic >>> Training...
2018-03-13 21-02-21 toxic >>> step xy_train adapting inputs
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 113, in <module>
execute()
File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 109, in execute
execfile(job_filepath, job_globals)
File "/usr/local/lib/python3.6/dist-packages/past/builtins/misc.py", line 82, in execfile
exec_(code, myglobals, mylocals)
File "main.py", line 382, in <module>
action()
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "main.py", line 255, in train_evaluate_predict_cv_pipeline
pipeline_name)
File "main.py", line 311, in _fold_fit_loop
_ = pipeline.fit_transform(data_train)
File "/neptune/steps/base.py", line 71, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "/neptune/steps/base.py", line 71, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "/neptune/steps/base.py", line 71, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "/neptune/steps/base.py", line 74, in fit_transform
step_inputs = self.adapt(step_inputs)
File "/neptune/steps/base.py", line 153, in adapt
raw_inputs = [step_inputs[step_name][step_var] for step_name, step_var in step_mapping]
File "/neptune/steps/base.py", line 153, in <listcomp>
raw_inputs = [step_inputs[step_name][step_var] for step_name, step_var in step_mapping]
KeyError: 'meta'

I used pip3 and installed all the packages in the requirements, so did I do something wrong? Or could you please fix this?
Thank you
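
The last frame of the traceback looks up step_inputs[step_name][step_var] for every pair in the adapter mapping, so the KeyError means that either a step named 'meta' or an output variable named 'meta' is missing from the upstream results. Below is a minimal sketch of the same lookup with a more explicit error message. Only the lookup itself is taken from the traceback; the function name, the step names and the data are hypothetical.

```python
def adapt_inputs(step_inputs, step_mapping):
    """Collect raw inputs like steps/base.py line 153, but fail with a clearer message."""
    raw_inputs = []
    for step_name, step_var in step_mapping:
        if step_name not in step_inputs:
            raise KeyError(f"no output recorded for step '{step_name}'")
        outputs = step_inputs[step_name]
        if step_var not in outputs:
            raise KeyError(
                f"step '{step_name}' produced {sorted(outputs)}, "
                f"but the adapter expects '{step_var}'"
            )
        raw_inputs.append(outputs[step_var])
    return raw_inputs


# Hypothetical upstream outputs: the 'input' step provides no 'meta' key.
step_inputs = {"input": {"X": ["some comment"], "y": [0]}}

ok = adapt_inputs(step_inputs, [("input", "X"), ("input", "y")])

try:
    adapt_inputs(step_inputs, [("input", "meta")])
except KeyError as error:
    message = str(error)
```

A message of this kind shows which step produced which keys, which narrows down whether the data passed to fit_transform or the adapter configuration is at fault.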

Is it possible to have access to the OOF files?

ModuleNotFoundError: No module named 'attrdict' when running

is there still public/toxic_comments directory?

Hi, I ran the experiment script and the error says that it cannot find /public/toxic_comments/single_model_predictions_20180226. Is this directory still there?

Unable to access auth server. Login url is incorrect

When I run command "neptune login"
I got nothing but "Unable to access auth server. Login url is incorrect"
Is there anything wrong with the server?

ModuleNotFoundError: No module named 'seaborn'

Hi,

I tried the notebook neptune_ensembling.ipynb on neptune, but I got this error message:
...
ModuleNotFoundError: No module named 'seaborn'

I am not sure which combination of worker type, Python version and leading library I should use where this seaborn module is installed. Thanks.

Hard to reproduce results locally

Hi, I was able to run all models locally (run_end_to_end.sh) but wasn't able to run catboost on top of the models.

TypeError: fit() missing 1 required positional argument: 'validation_data'

It looks like I am missing something. Is it possible to reproduce your pipeline without loading data from the cloud?

old test data

The test data in /public folder is old and so the submission output is generated for old test data. The test data for this competition has been recently changed.

Bugs and errors I ran into during experiment

Found bugs:

a missing "," at the end of line 28 in the file "pipeline_config.py".
a missing "import nltk" in the file "/steps/preprocessing.py".

Error:

1. FileNotFoundError
30.2639 | /usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
30.26557 | from ._conv import register_converters as register_converters
30.417965 | Using TensorFlow backend.
35.018845 | Traceback (most recent call last):
35.019116 | File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 113, in <module>
35.019273 | execute()
35.019429 | File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 109, in execute
35.019568 | execfile(job_filepath, job_globals)
35.019697 | File "/usr/local/lib/python3.6/dist-packages/past/builtins/misc.py", line 82, in execfile
35.019829 | exec_(code, myglobals, mylocals)
35.019959 | File "main.py", line 12, in <module>
35.020092 | from pipelines import PIPELINES
35.020226 | File "/neptune/pipelines.py", line 8, in <module>
35.020365 | from steps.preprocessing import XYSplit, TextCleaner, TfidfVectorizer, WordListFilter, Normalizer, TextCounter,
35.020523 | File "/neptune/steps/preprocessing.py", line 24, in <module>
35.020654 | with open('../external_data/apostrophes.json', 'r') as f:
35.020784 | FileNotFoundError: [Errno 2] No such file or directory: '../external_data/apostrophes.json'

After I copied the data from that file into the code as follows, another error popped up.
#with open('../external_data/apostrophes.json', 'r') as f:
# APPO = json.load(f)
APPO = {
"arent": "are not",
...,
"well": "will"
}
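
The FileNotFoundError above comes from a path that is resolved against the current working directory, so it only works when the script is started from one particular folder. As an alternative to pasting the JSON into the code, the path can be resolved against the location of the module itself. The sketch below assumes that external_data sits next to the steps directory and is actually shipped to the worker, neither of which is confirmed here; the helper name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_json_relative_to(anchor_file, relative_path):
    """Load JSON from a path resolved against anchor_file's directory, not the CWD."""
    path = (Path(anchor_file).resolve().parent / relative_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"expected data file at {path}")
    with open(path, "r") as f:
        return json.load(f)


# Demo with a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "steps").mkdir()
    data_dir = Path(root) / "external_data"
    data_dir.mkdir()
    (data_dir / "apostrophes.json").write_text(json.dumps({"arent": "are not"}))
    module_file = Path(root) / "steps" / "preprocessing.py"  # stands in for __file__
    loaded = load_json_relative_to(module_file, "../external_data/apostrophes.json")
```

Inside steps/preprocessing.py the call would pass __file__ as the anchor, under the same layout assumption.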

2. TypeError
645.742938 Traceback (most recent call last):
645.743234 File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 113, in <module>
645.743377 execute()
645.74352 File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 109, in execute
645.743696 execfile(job_filepath, job_globals)
645.743897 File "/usr/local/lib/python3.6/dist-packages/past/builtins/misc.py", line 82, in execfile
645.744034 exec_(code, myglobals, mylocals)
645.744164 File "main.py", line 382, in <module>
645.744294 action()
645.744436 File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 722, in __call__
645.744561 return self.main(*args, **kwargs)
645.744717 File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 697, in main
645.744849 rv = self.invoke(ctx)
645.744996 File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
645.745155 return _process_result(sub_ctx.command.invoke(sub_ctx))
645.745335 File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 895, in invoke
645.745471 return ctx.invoke(self.callback, **ctx.params)
645.745602 File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 535, in invoke
645.745742 return callback(*args, **kwargs)
645.745871 File "main.py", line 129, in train_evaluate_predict_pipeline
645.746002 _train_pipeline(pipeline_name)
645.746266 File "main.py", line 68, in _train_pipeline
645.746432 _ = pipeline.fit_transform(data)
645.746566 File "/neptune/steps/base.py", line 71, in fit_transform
645.746704 step_inputs[input_step.name] = input_step.fit_transform(data)
645.746835 File "/neptune/steps/base.py", line 77, in fit_transform
645.746964 step_output_data = self._cached_fit_transform(step_inputs)
645.747093 File "/neptune/steps/base.py", line 91, in _cached_fit_transform
645.747223 step_output_data = self.transformer.fit_transform(**step_inputs)
645.747351 File "/neptune/steps/base.py", line 213, in fit_transform
645.747482 self.fit(*args, **kwargs)
645.747618 File "/neptune/models.py", line 54, in fit
645.747749 self.callbacks = self._create_callbacks(**self.callbacks_config)
645.747895 File "/neptune/models.py", line 30, in _create_callbacks
645.748021 neptune = NeptuneMonitor(**kwargs['neptune_monitor'])
645.748166 File "/neptune/steps/keras/callbacks.py", line 11, in __init__
645.748296 self.batch_loss_channel_name = get_correct_channel_name(self.ctx, 'Batch Log-loss training')
645.748426 File "/neptune/steps/keras/callbacks.py", line 38, in get_correct_channel_name
645.748554 channels_with_name = [channel for channel in ctx.job._channels if name in channel.name]
645.74869 File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/client_library/context_factory.py", line 91, in job
645.748819 , JobPropertyDeprecationWarning)
645.748948 File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/common/utils/neptune_warnings.py", line 52, in neptune_warn
645.749108 warnings.warn(message, warning_type)
645.749242 File "/usr/lib/python3.6/warnings.py", line 101, in _showwarnmsg
645.749421 _showwarnmsg_impl(msg)
645.749568 File "/usr/lib/python3.6/warnings.py", line 28, in _showwarnmsg_impl
645.749706 text = _formatwarnmsg(msg)
645.749838 File "/usr/lib/python3.6/warnings.py", line 116, in _formatwarnmsg
645.749987 msg.filename, msg.lineno, line=msg.line)
645.75013 TypeError: custom_formatwarning() got an unexpected keyword argument 'line'

How can I get it running successfully? I am running on Neptune using the command "neptune send --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium -- train_evaluate_predict_pipeline --pipeline_name glove_lstm". I've tried other pipelines, and they all fail :(
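
The last frames of the TypeError traceback show Python's warnings module calling the installed replacement for warnings.formatwarning with a line keyword argument, which custom_formatwarning does not accept. Below is a minimal sketch of a replacement with a compatible signature. Where the real custom_formatwarning is defined is not visible in the traceback, and the output format and the example message here are made up.

```python
import warnings


def custom_formatwarning(message, category, filename, lineno, line=None):
    """Formatter with the same parameters as warnings.formatwarning, including `line`."""
    return f"{filename}:{lineno}: {category.__name__}: {message}\n"


# Install the replacement; the standard library will call it when a warning is shown.
warnings.formatwarning = custom_formatwarning

# The standard library passes `line` by keyword, as in the traceback above.
formatted = warnings.formatwarning(
    "example message", DeprecationWarning, "callbacks.py", 38, line=None
)
```

Accepting line with a default of None keeps the function usable both by callers that pass it and by callers that omit it.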

How to add parameter to select GPU at runtime?

Hi,
how do we add a command line parameter to select the GPU at run time, especially on a multi-GPU machine?
I tried adding

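import os
import click

@action.command()  # `action` is the click group from main.py
@click.option('-g', '--gpu', help='select gpu', default='1', required=True)
def select_gpu(gpu):
    # environment variables must be strings, e.g. "0" or "0,1"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)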

I tried to add a similar code block for train_evaluate_predict_pipeline and pass the gpu parameter,
but I keep getting "invalid option" at runtime. I know it's not a package issue, but the documentation did not help either.

requirements.txt doesn't seem to work.

When I tried to run $ neptune send experiment_manager.py --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config neptune_config.yaml -- train_evaluate_predict_pipeline --pipeline_name glove_lstm, the experiment failed because it could not import attrdict.

However, this library is already listed in the requirements, so I guess the program somehow failed to process them?
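
One way to tell whether the requirements were actually installed in the environment that runs the experiment is to check for the modules before the pipeline imports them. Below is a minimal sketch using only the standard library. The function name is hypothetical, and the check only reports which modules are missing; it does not explain why requirements.txt was not picked up.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# 'attrdict' and 'seaborn' are the modules reported missing in the issues on this page.
print(missing_modules(["attrdict", "seaborn", "json"]))
```

Running this at the top of the experiment script would show the missing modules in the job log before any pipeline code is imported.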