qedan / quora-insincere
Private team repo for Quora Insincere Questions Kaggle competition
Assuming all important modeling decisions are put into a configuration file (see issue #7), we would like to be able to search (e.g. grid search, random search, Bayesian optimization) through the space of modeling and hyperparameter choices to find the best generalization performance.
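A minimal sketch of what the config-driven search could look like, assuming the config flattens to a dict of hyperparameter choices (the keys and values below are placeholders, not the actual contents of the issue #7 config file):

```python
import itertools

# Hypothetical search space; the real keys would come from the configuration file.
search_space = {
    "lstm_units": [64, 128],
    "dropout": [0.2, 0.5],
    "lr": [1e-3, 3e-4],
}

def grid_configs(space):
    """Yield one flat config dict per point in the grid."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(search_space))
# 2 * 2 * 2 = 8 candidate configurations to evaluate
```

Random search would just sample from `configs` (or from continuous ranges) instead of enumerating the full grid, which scales much better once the space grows.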
Additional features include is_capital, is_punctuation, and is_number. These would be fed into the character CNN (and/or LSTM?) model, which then feeds into the bi-LSTM word model.
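A sketch of the proposed per-character features; the feature names come from this issue, but the exact encoding (one 0/1 triple per character) is an assumption:

```python
import string

def char_features(word):
    """Return one (is_capital, is_punctuation, is_number) triple per character."""
    return [
        (int(c.isupper()), int(c in string.punctuation), int(c.isdigit()))
        for c in word
    ]

features = char_features("Q3?")
# 'Q' -> (1, 0, 0), '3' -> (0, 0, 1), '?' -> (0, 1, 0)
```

These triples would be concatenated with the character embeddings before the character CNN.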
Configuration file containing most important hyperparameters and modeling decisions.
As a competitor, I can get feedback on the questions that my model gets the most wrong. Functionality existed in this commit, and needs to be restored in the new framework.
Depending on the embedding used, it may be useful to have specific preprocessing steps. For example, some embeddings contain special characters while others do not.
Is there a way to improve the classification performance by using an anomaly detection approach trained only on sincere questions?
https://da-cloud-team.slack.com/archives/GE5NHK38F/p1542597024020400
There will be a lot of strange characters in our corpus, many of which read the same as a character that we recognize, for example: é, è, â, î and ô. Maybe someone can create a mapping from these to a better representation, i.e. é -> e, etc.
Part of this task will be to look at the counts of characters in our corpus; I have code for this in my branch "reformat_data_class" — see the class CorpusInfo in Data.py.
edit: I see that some of these are already handled in the data preprocessing step "clean_specials", but there are definitely more that can be added.
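Rather than hand-writing every mapping, Unicode decomposition covers most accented Latin characters automatically; a minimal sketch using the standard library (a manual mapping like "clean_specials" is still useful for characters with no decomposition):

```python
import unicodedata

def strip_accents(text):
    """Map accented characters to their base form, e.g. é -> e, ô -> o.

    NFKD normalization splits a character into its base character plus
    combining marks; dropping the combining marks leaves the plain letter.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

strip_accents("résumé")  # -> "resume"
```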
For some of the models, I will occasionally get the following error after a long time of successful training (i.e. it is a weird edge case).
Traceback (most recent call last):
File "../src/script.py", line 266, in <module>
main()
File "../src/script.py", line 241, in main
model_config=model.get('args')))
File "../src/script.py", line 168, in cross_validate
models[-1].fit(train_indices=train, val_indices=test, curve_file_suffix=str(i))
File "/tmp/tmpwkqs60qt/src/InsincereModel.py", line 139, in fit
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training.py", line 1047, in fit
validation_steps=validation_steps)
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training_arrays.py", line 200, in fit_loop
callbacks._call_batch_hook('train', 'end', batch_index, batch_logs)
File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/callbacks.py", line 95, in _call_batch_hook
delta_t_median)
TypeError: integer argument expected, got float
As a data scientist, I can define a grid or random search through hyperparameter values and launch each choice as a Kaggle Kernel.
This is probably best achieved by:
Bonus points:
Improve the handling of words not found in embeddings so that the information they contain can still be used by the model(s)
An idea to try out, that may help with the noisy labels, is to allow the model to learn parameters of its loss function. For example, it might be that the punishment for records that are "very wrong" shouldn't be that severe (and definitely shouldn't approach infinity, as is the case with log loss/cross-entropy).
(Ask Matt)
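One concrete version of this idea is to cap the per-example cross-entropy, so a confidently wrong (possibly mislabeled) record can only contribute so much. A sketch in NumPy with a fixed cap for illustration; in the actual model the cap would be the trainable loss parameter:

```python
import numpy as np

def capped_cross_entropy(y_true, y_prob, cap=5.0, eps=1e-7):
    """Binary cross-entropy whose per-example loss is clipped at `cap`.

    With standard log loss, a confidently wrong prediction approaches
    infinity; capping limits how much a single noisy label can dominate
    the gradient. `cap=5.0` is an arbitrary illustrative value.
    """
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return np.minimum(per_example, cap).mean()
```

For example, a record with y_true = 1 and predicted probability 1e-6 would contribute about 13.8 nats under plain log loss but at most `cap` here.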
Preprocessing alters data so that information can be lost. What if we build a model that combines the preprocessed inputs with the raw inputs so that all information is present?
This could be added as a parallel input or used in an ensemble.
'Unknown' words = words that pass our word_count threshold but do not have representations in any of our pre-trained embeddings.
Can we use surrounding word embeddings to represent these 'unknown' words?
Allow part of the word embedding to be trained by adding a new trainable embedding layer and concatenating it with the pretrained layer.
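A minimal sketch of the first idea — backing off to the mean of the surrounding known words' embeddings for an unknown word. This is illustrative NumPy, not the actual model code; the function name, window size, and zero-vector fallback are all assumptions:

```python
import numpy as np

def fill_unknown(tokens, embeddings, dim=4, window=2):
    """Return one vector per token; tokens missing from `embeddings` get
    the mean of the known embeddings within `window` positions of them.

    `embeddings` is a word -> vector dict; an unknown word with no known
    neighbours falls back to a zero vector.
    """
    vectors = []
    for i, tok in enumerate(tokens):
        if tok in embeddings:
            vectors.append(embeddings[tok])
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = [embeddings[t] for t in tokens[lo:hi] if t in embeddings]
        vectors.append(np.mean(neighbours, axis=0) if neighbours else np.zeros(dim))
    return np.stack(vectors)
```

The second idea (a trainable layer concatenated with the frozen pretrained layer) would instead be expressed in Keras as two Embedding layers, one with trainable=False, merged with Concatenate.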
Does using pseudo-labels (i.e. predicting on the testing data and retraining with all data) improve the results over a suitable baseline model?
See discussion: https://da-cloud-team.slack.com/archives/GE5NHK38F/p1542688752024200
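A toy illustration of one pseudo-label round, using a nearest-centroid classifier as a stand-in for the real model (the function names and the single-round, keep-everything strategy are assumptions; a real version would keep only high-confidence predictions and compare against the baseline):

```python
import numpy as np

def centroid_fit(X, y):
    """Toy stand-in for model training: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(centroids, X):
    """Assign each row of X to the class of its nearest centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def pseudo_label_round(X_train, y_train, X_test):
    """Predict on the test set, then retrain on train + pseudo-labelled test."""
    model = centroid_fit(X_train, y_train)
    pseudo = centroid_predict(model, X_test)
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, pseudo])
    return centroid_fit(X_all, y_all)
```

Answering the issue's question then means evaluating the retrained model and the baseline on the same held-out fold.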
Control the model selection and hyperparameters of the new model types through the config file and the kernel launcher.