qedan / quora-insincere
Private team repo for Quora Insincere Questions Kaggle competition
Assuming all important modeling decisions are put into a configuration file (see issue #7), we would like to be able to search (e.g. grid search, random search, Bayesian optimization) through the space of modeling and hyperparameter choices to find the best generalization performance.
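A minimal sketch of what the config-driven search could look like, assuming the config flattens to a dict of hyperparameter choices (the keys and values below are placeholders, not the actual contents of the issue #7 config file):

```python
import itertools

# Hypothetical search space; the real keys would come from the configuration file.
search_space = {
    "lstm_units": [64, 128],
    "dropout": [0.2, 0.5],
    "lr": [1e-3, 3e-4],
}

def grid_configs(space):
    """Yield one flat config dict per point in the grid."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(search_space))
# 2 * 2 * 2 = 8 candidate configurations to evaluate
```

Random search would just sample from `configs` (or from continuous ranges) instead of enumerating the full grid, which scales much better once the space grows.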
Additional features include is_capital, is_punctuation, and is_number. These would be fed into the character CNN (and/or LSTM?) model, which then feeds into the bi-LSTM word model.
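A sketch of the proposed per-character features; the feature names come from this issue, but the exact encoding (one 0/1 triple per character) is an assumption:

```python
import string

def char_features(word):
    """Return one (is_capital, is_punctuation, is_number) triple per character."""
    return [
        (int(c.isupper()), int(c in string.punctuation), int(c.isdigit()))
        for c in word
    ]

features = char_features("Q3?")
# 'Q' -> (1, 0, 0), '3' -> (0, 0, 1), '?' -> (0, 1, 0)
```

These triples would be concatenated with the character embeddings before the character CNN.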
Configuration file containing most important hyperparameters and modeling decisions.
As a competitor, I can get feedback on the questions that my model gets the most wrong. Functionality existed in this commit, and needs to be restored in the new framework.
Depending on the embedding used, it may be useful to have specific preprocessing steps. For example, some embeddings contain special characters while others do not.
Is there a way to improve the classification performance by using an anomaly detection approach trained only on sincere questions?
https://da-cloud-team.slack.com/archives/GE5NHK38F/p1542597024020400
There will be a lot of strange characters in our corpus, many of which read the same as a character that we recognize, for example: é, è, â, î and ô. Maybe someone can create a mapping from these to a better representation, i.e. é -> e, etc.
Part of this task will be to look at the counts of characters in our corpus; I have code for this in my branch "reformat_data_class" — see the class CorpusInfo in Data.py.
edit: I see that some of these are already handled in the data preprocessing step "clean_specials", but there are definitely more that can be added.
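Rather than hand-writing every mapping, Unicode decomposition covers most accented Latin characters automatically; a minimal sketch using the standard library (a manual mapping like "clean_specials" is still useful for characters with no decomposition):

```python
import unicodedata

def strip_accents(text):
    """Map accented characters to their base form, e.g. é -> e, ô -> o.

    NFKD normalization splits a character into its base character plus
    combining marks; dropping the combining marks leaves the plain letter.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

strip_accents("résumé")  # -> "resume"
```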
For some of the models, I will occasionally get the following error after a long time of successful training (i.e. it is a weird edge case).
Traceback (most recent call last):
File "../src/script.py", line 266, in <module>
main()
File "../src/script.py", line 241, in main
model_config=model.get('args')))
File "../src/script.py", line 168, in cross_validate
models[-1].fit(train_indices=train, val_indices=test, curve_file_suffix=str(i))
File "/tmp/tmpwkqs60qt/src/InsincereModel.py", line 139, in fit
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training.py", line 1047, in fit
validation_steps=validation_steps)
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training_arrays.py", line 200, in fit_loop
callbacks._call_batch_hook('train', 'end', batch_index, batch_logs)
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/callbacks.py", line 95, in _call_batch_hook
delta_t_median)
TypeError: integer argument expected, got float
As a data scientist, I can define a grid or random search through hyperparameter values and launch each choice as a Kaggle Kernel.
This is probably best achieved by:
Bonus points:
Improve the handling of words not found in embeddings so that the information they contain can still be used by the model(s)
An idea to try out, that may help with the noisy labels, is to allow the model to learn parameters of its loss function. For example, it might be that the punishment for records that are "very wrong" shouldn't be that severe (and definitely shouldn't approach infinity, as is the case with log loss/cross-entropy).
(Ask Matt)
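One concrete version of this idea is to cap the per-example cross-entropy, so a confidently wrong (possibly mislabeled) record can only contribute so much. A sketch in NumPy with a fixed cap for illustration; in the actual model the cap would be the trainable loss parameter:

```python
import numpy as np

def capped_cross_entropy(y_true, y_prob, cap=5.0, eps=1e-7):
    """Binary cross-entropy whose per-example loss is clipped at `cap`.

    With standard log loss, a confidently wrong prediction approaches
    infinity; capping limits how much a single noisy label can dominate
    the gradient. `cap=5.0` is an arbitrary illustrative value.
    """
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return np.minimum(per_example, cap).mean()
```

For example, a record with y_true = 1 and predicted probability 1e-6 would contribute about 13.8 nats under plain log loss but at most `cap` here.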
Preprocessing alters data so that information can be lost. What if we build a model that combines the preprocessed inputs with the raw inputs so that all information is present?
This could be added as a parallel input or used in an ensemble.
'Unknown' words = words that pass our word_count threshold but do not have representations in any of our pre-trained embeddings.
Can we use surrounding word embeddings to represent these 'unknown' words?
Allow part of the word embedding to be trained by adding a new trainable embedding layer and concatenating it with the pretrained layer.
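A minimal sketch of the first idea — backing off to the mean of the surrounding known words' embeddings for an unknown word. This is illustrative NumPy, not the actual model code; the function name, window size, and zero-vector fallback are all assumptions:

```python
import numpy as np

def fill_unknown(tokens, embeddings, dim=4, window=2):
    """Return one vector per token; tokens missing from `embeddings` get
    the mean of the known embeddings within `window` positions of them.

    `embeddings` is a word -> vector dict; an unknown word with no known
    neighbours falls back to a zero vector.
    """
    vectors = []
    for i, tok in enumerate(tokens):
        if tok in embeddings:
            vectors.append(embeddings[tok])
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = [embeddings[t] for t in tokens[lo:hi] if t in embeddings]
        vectors.append(np.mean(neighbours, axis=0) if neighbours else np.zeros(dim))
    return np.stack(vectors)
```

The second idea (a trainable layer concatenated with the frozen pretrained layer) would instead be expressed in Keras as two Embedding layers, one with trainable=False, merged with Concatenate.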
Does using pseudo-labels (i.e. predicting on the testing data and retraining with all data) improve the results over a suitable baseline model?
See discussion: https://da-cloud-team.slack.com/archives/GE5NHK38F/p1542688752024200
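A toy illustration of one pseudo-label round, using a nearest-centroid classifier as a stand-in for the real model (the function names and the single-round, keep-everything strategy are assumptions; a real version would keep only high-confidence predictions and compare against the baseline):

```python
import numpy as np

def centroid_fit(X, y):
    """Toy stand-in for model training: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(centroids, X):
    """Assign each row of X to the class of its nearest centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def pseudo_label_round(X_train, y_train, X_test):
    """Predict on the test set, then retrain on train + pseudo-labelled test."""
    model = centroid_fit(X_train, y_train)
    pseudo = centroid_predict(model, X_test)
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, pseudo])
    return centroid_fit(X_all, y_all)
```

Answering the issue's question then means evaluating the retrained model and the baseline on the same held-out fold.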
Control the model selection and hyperparameters of the new model types through the config file and the kernel launcher.