text-difficulty-detection's Introduction

Data Mining and Machine Learning 2021 - Team Microsoft

Detecting the difficulty level of French texts

Team members:

  • Melinda Femminis
  • Estelle Valerie Tsague Mbialeu
  • Catherine Pedroni

Project description

This project is part of a Kaggle competition. The main goal of the competition is to predict the difficulty level of a French text according to the Common European Framework of Reference for Languages (CEFR), which describes six levels of language proficiency: A1, A2, B1, B2, C1 and C2.

The data

For this project we have two datasets. The first, called training_data, has three features: id, sentence and difficulty. The second, called unlabelled_test_data, has only two features: id and sentence. By training several models, we try to predict the difficulty level for the unlabelled_test_data with the best possible accuracy.

First results

            Logistic regression   KNN    Decision Tree   Random Forest
Precision   0.48                  0.45   0.31            0.38
Recall      0.48                  0.27   0.31            0.37
F1-Score    0.48                  0.34   0.31            0.38
Accuracy    0.48                  0.27   0.31            0.37

We can clearly see from this table that the best model is the logistic regression. An interesting point is that for both the logistic regression and the decision tree, all scores (precision, recall, F1, accuracy) have the same value. This suggests that these models produce about as many false positives as false negatives. For our KNN model, the scores differ clearly: precision (0.45) is well above recall (0.27). This gap probably means that its confusion matrix won't show as clear a diagonal as the logistic regression or decision tree matrices. Overall, these results could still improve, but they are above the baseline of 0.169.
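To make the relationship between these four scores concrete, here is a minimal sketch that computes them from a confusion matrix. It assumes the table reports support-weighted averages, which is one of the averages printed by scikit-learn's classification_report; the README does not say which average was used. The function name weighted_scores and the 3-class matrix are hypothetical and are not taken from the project.

```python
def weighted_scores(confusion):
    """Return accuracy and support-weighted precision, recall and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = confusion[k][k]
        support = sum(confusion[k])                   # samples whose true class is k
        predicted = sum(row[k] for row in confusion)  # samples predicted as class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                      # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix, for illustration only.
toy_confusion = [
    [5, 3, 2],
    [4, 4, 2],
    [1, 1, 8],
]
accuracy, precision, recall, f1 = weighted_scores(toy_confusion)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Under this assumption, weighted recall is always equal to accuracy, because each class's recall is weighted by its support. That would explain why the Recall and Accuracy rows of the table are identical for every model, while precision is free to differ, as it does for KNN.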

Methodology

After loading the training_data and doing some exploratory data analysis, we took several steps to build our classifier:

  • split the dataset into two sets: 80% for training and 20% for testing
  • create the metrics function that prints a classification report (accuracy, precision, recall, F1 score)
  • create TF-IDF vectorisers with different parameters (e.g. sublinear_tf=True, tokenizer=word_tokenize, ngram_range=(1, 1))
  • test multiple models (logistic regression, K-Nearest Neighbors, Decision Tree and Random Forest) to find the one with the best accuracy. After this step, we ran our models on the unlabelled data to submit the predictions to the Kaggle competition.
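The steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the team's actual notebook: the toy sentences and labels stand in for training_data, the helper name print_metrics is hypothetical, and the vectoriser uses scikit-learn's default tokenizer instead of word_tokenize so that the example runs without NLTK resources.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for training_data (hypothetical sentences and labels).
sentences = [
    "Le chat dort.",
    "Il fait beau.",
    "J'aime le pain.",
    "Elle a un chien.",
    "Nous allons au parc.",
    "Bien que la conjoncture demeure incertaine, les marchés ont réagi favorablement.",
    "Il eût fallu que nous anticipassions les conséquences de cette réforme.",
    "Cette hypothèse, quoique séduisante, ne résiste guère à un examen rigoureux.",
    "Nonobstant les réserves émises, le comité entérina la proposition initiale.",
    "L'auteur déconstruit les présupposés idéologiques qui sous-tendent ce discours.",
]
difficulty = ["A1"] * 5 + ["C1"] * 5


def print_metrics(y_true, y_pred):
    """Print a classification report and return the accuracy."""
    print(classification_report(y_true, y_pred, zero_division=0))
    return accuracy_score(y_true, y_pred)


# 80% training / 20% test split, stratified so both labels appear in each set.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, difficulty, test_size=0.2, random_state=0, stratify=difficulty
)

# TF-IDF features followed by a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = print_metrics(y_test, y_pred)
```

Swapping LogisticRegression for KNeighborsClassifier, DecisionTreeClassifier or RandomForestClassifier in the pipeline gives a comparison of the same kind as the one in the results table.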

What we did to try to improve the accuracy:

  • Preprocessing: remove punctuation, remove stopwords, remove the X most/least frequent words, stemming
  • Dimensionality reduction: reduce the number of features using TruncatedSVD
  • Try other models: multinomial naive Bayes, Support Vector Machine, neural networks with Keras

Unfortunately, none of these methods and models improved the accuracy.
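As an illustration of the preprocessing step listed above, here is a minimal sketch of punctuation and stopword removal in plain Python. The stopword set is a tiny hypothetical sample and the function name preprocess is an assumption; the README does not say which stopword list or stemmer the team used, and stemming is left out of this sketch.

```python
import string

# Tiny illustrative stopword set (an assumption; a real run would use a fuller list).
FRENCH_STOPWORDS = {"le", "la", "les", "un", "une", "de", "des", "et", "à"}

# Map every ASCII punctuation character to a space.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(sentence, stopwords=FRENCH_STOPWORDS):
    """Lowercase the sentence, strip ASCII punctuation and drop stopwords."""
    tokens = sentence.lower().translate(PUNCTUATION_TABLE).split()
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("Le chat, la souris et le chien !")
print(tokens)  # ['chat', 'souris', 'chien']
```

One possible reason this kind of cleaning did not help is that short function words and punctuation may themselves carry information about sentence difficulty, though the README does not investigate this.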

The video

DMML 2021 - Groupe Microsoft | Detecting the difficulty level of french texts

text-difficulty-detection's People

Contributors

cpedroni, melindafemminis, valerie112299
