quora-insincere's People

Contributors

mkierans, qedan

quora-insincere's Issues

Implement a search through modeling decisions

Assuming all important modeling decisions are put into a configuration file (see issue #7), we would like to be able to search (e.g. grid search, random search, Bayesian optimization) through the space of modeling and hyperparameter choices to find the configuration with the best generalization performance.
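A minimal sketch of grid and random search over a configuration dictionary. The search space keys and values, and the `evaluate` callback (which would stand in for a cross-validated score), are assumptions for illustration and not taken from the repo.

```python
import itertools
import random

# Hypothetical search space; keys and values are illustrative only.
SEARCH_SPACE = {
    "lstm_units": [32, 64, 128],
    "dropout": [0.1, 0.3, 0.5],
    "embedding": ["glove", "paragram"],
}

def grid_search(space):
    """Yield every combination of choices in the search space."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def random_search(space, n_trials, seed=0):
    """Yield n_trials configurations, sampling each key uniformly."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(v) for k, v in sorted(space.items())}

def best_config(configs, evaluate):
    """Return the (score, config) pair with the highest score."""
    return max(((evaluate(c), c) for c in configs), key=lambda pair: pair[0])
```

Here `evaluate` is whatever function trains a model from one configuration and returns its validation score. Bayesian optimization would replace the sampling step with a model-guided proposal and is not shown.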

Additional features at character level

Additional features include is_capital, is_punctuation, and is_number. These would be fed into the character CNN (and/or LSTM?) model, which then feeds into the bi-LSTM word model.
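A rough sketch of how the three flags could be computed per character. The function name and the output layout (one triple per character) are assumptions, and `string.punctuation` only covers ASCII punctuation.

```python
import string

def char_features(word):
    """Return one [is_capital, is_punctuation, is_number] triple per character."""
    return [
        [int(ch.isupper()), int(ch in string.punctuation), int(ch.isdigit())]
        for ch in word
    ]
```

For example, `char_features("Hi!2")` gives `[[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1]]`. These triples could be concatenated with the character embeddings before the character-level model.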

Embedding-specific preprocessing steps

Depending on the embedding used, it may be useful to have specific preprocessing steps. For example, some embeddings contain special characters while others do not.
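One possible approach, sketched under the assumption that each embedding's vocabulary is available as a set or dict: keep a special character (separated by spaces) only if that embedding has a vector for it, and drop it otherwise. `make_cleaner` is a hypothetical helper, not something that exists in the repo.

```python
import string

def make_cleaner(vocab, specials=string.punctuation):
    """Build a text cleaner tailored to one embedding's vocabulary."""
    keep = {ch for ch in specials if ch in vocab}

    def clean(text):
        pieces = []
        for ch in text:
            if ch in keep:
                pieces.append(" " + ch + " ")  # isolate it so it maps to its own vector
            elif ch in specials:
                pieces.append(" ")             # no vector for it in this embedding
            else:
                pieces.append(ch)
        # Collapse the extra whitespace introduced above.
        return " ".join("".join(pieces).split())

    return clean
```

With a vocabulary containing "?", `make_cleaner(vocab)("what?#now")` returns `"what ? now"`. With a vocabulary that has no punctuation, it returns `"what now"`.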

Normalization of characters

There will be a lot of strange characters in our corpus, many of which read the same as a character that we recognize, for example é, è, â, î and ô. Maybe someone can create a mapping from these to a simpler representation, i.e. é -> e, etc.?
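As a starting point, the standard library can produce most of this mapping automatically. This sketch decomposes each character with `unicodedata` and drops the combining accent marks. It is an alternative to a hand-written table, not the existing "clean_specials" code.

```python
import unicodedata

def strip_accents(text):
    """Map accented letters to their base letter, e.g. é -> e."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

`strip_accents("éèâîô")` returns `"eeaio"`. Characters that have no Unicode decomposition (ø or ß, for instance) pass through unchanged, so an explicit mapping would still be needed for those.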

Part of this task will be to look at the character counts in our corpus. I have code written for this in my branch "reformat_data_class"; look at class CorpusInfo in Data.py.
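For reference, counting characters needs very little code. The sketch below is independent of CorpusInfo (whose implementation is not shown here), and the function names are made up for illustration.

```python
from collections import Counter

def char_counts(texts):
    """Count how often each character occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts

def non_ascii_chars(counts):
    """List (character, count) pairs outside ASCII, most frequent first."""
    return [(ch, n) for ch, n in counts.most_common() if ord(ch) > 127]
```

Sorting the non-ASCII characters by frequency shows which normalizations would affect the most text.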

Edit: I see that some of these are already handled in the data preprocessing step "clean_specials", but there are definitely more that can be added.

Data type error in callback

For some of the models, I will occasionally get the following error after a long time of successful training (i.e. it is a weird edge case).

Traceback (most recent call last):
  File "../src/script.py", line 266, in <module>
    main()
  File "../src/script.py", line 241, in main
    model_config=model.get('args')))
  File "../src/script.py", line 168, in cross_validate
    models[-1].fit(train_indices=train, val_indices=test, curve_file_suffix=str(i))
  File "/tmp/tmpwkqs60qt/src/InsincereModel.py", line 139, in fit
  File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training.py", line 1047, in fit
    validation_steps=validation_steps)
  File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/engine/training_arrays.py", line 200, in fit_loop
    callbacks._call_batch_hook('train', 'end', batch_index, batch_logs)
  File "/opt/conda/lib/python3.6/site-packages/Keras-2.2.4-py3.6.egg/keras/callbacks.py", line 95, in _call_batch_hook
    delta_t_median)
TypeError: integer argument expected, got float

Hyperparameter search using kernels

As a data scientist, I can define a grid or random search through hyperparameter values and launch each choice as a Kaggle Kernel.

This is probably best achieved by:

  • For each hyperparameter configuration:
      • Generate src/config.py
      • Use generate_script.sh to generate the kernel script
      • Use the Kaggle API ('kaggle kernels push') to push the kernel script
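The loop above could be sketched roughly as follows. The layout of src/config.py (plain Python assignments), the `kernel_dir` argument, and calling generate_script.sh with no arguments are all assumptions. `kaggle kernels push -p <folder>` is the real CLI command, and it expects a kernel-metadata.json file in that folder.

```python
import subprocess
from pathlib import Path

def render_config(params):
    """Render hyperparameters as Python assignments (assumed config.py format)."""
    return "".join("{} = {!r}\n".format(k, v) for k, v in sorted(params.items()))

def kernel_commands(kernel_dir):
    """Commands to run for one configuration, in order."""
    return [
        ["bash", "generate_script.sh"],  # real arguments unknown; assumed none
        ["kaggle", "kernels", "push", "-p", str(kernel_dir)],
    ]

def launch(configs, kernel_dir, config_path="src/config.py", dry_run=True):
    """Write each config and push a kernel for it; only plan when dry_run is set."""
    planned = []
    for params in configs:
        commands = kernel_commands(kernel_dir)
        planned.append((params, commands))
        if not dry_run:
            Path(config_path).write_text(render_config(params))
            for cmd in commands:
                subprocess.run(cmd, check=True)
    return planned
```

Keeping `dry_run=True` as the default makes it possible to inspect the planned commands before anything is pushed.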

Bonus points:

  • Make it easy to support also searching through model architectures
  • Some kind of collection and automated analysis of the results

Handling of unknown words

Improve the handling of words not found in the embeddings so that the information they contain can still be used by the model(s).
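One simple option, shown here only as a sketch, is to try looser spellings of a word before treating it as unknown. The lookup order is an assumption, and `embeddings` is assumed to be a dict from word to vector.

```python
def find_vector(word, embeddings):
    """Try progressively looser spellings before giving up on a word."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate in embeddings:
            return embeddings[candidate]
    return None  # caller decides what to do: shared UNK vector, character model, etc.
```

Words that still return `None` are the ones that would need a different strategy, such as spelling correction or a character-level representation.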

Learnable loss function parameters

An idea to try out that may help with the noisy labels is to allow the model to learn parameters of its loss function. For example, it might be that the penalty for records that are "very wrong" shouldn't be that severe (definitely not approaching infinity, as is the case in log loss/cross-entropy).
(Ask Matt)
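To make the idea concrete, here is one possible parameterization (an assumption, not necessarily what was intended above): treat each label as flipped with probability `eps`, which caps the loss at -log(eps) however wrong the prediction is. In a real model `eps` would be a trainable variable; this sketch only shows the loss value in plain Python.

```python
import math

def noisy_log_loss(y_true, p_pred, eps):
    """Log loss under an assumed symmetric label-flip rate eps, with 0 < eps < 0.5."""
    # Probability the model assigns to the observed label.
    p_correct = p_pred if y_true == 1 else 1.0 - p_pred
    # Mix in the chance that the observed label was flipped.
    q = (1.0 - eps) * p_correct + eps * (1.0 - p_correct)
    return -math.log(q)
```

With `eps = 0.1`, a maximally wrong prediction costs -log(0.1), about 2.3, instead of infinity. As `eps` approaches 0, the ordinary log loss is recovered.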

Combine raw inputs with preprocessed inputs

Preprocessing alters data so that information can be lost. What if we build a model that combines the preprocessed inputs with the raw inputs so that all information is present?
