
nlp-area-of-law-classification's Introduction

NLP Task: Legal Text Classification

There are two parts of this project:

  1. Text classification: The dataset contains ~1000 legal text documents, 900 of which are labeled with the corresponding area of law (e.g., LNIND_1993_DEL_112 is labeled as Criminal Laws). The goal is to predict the correct area of law for the remaining 100 files.
  2. Topic modeling: For a selected area of law, the goal is to extract the topics from the documents and thereby analyze the correlation within the same area of law.

Work to achieve higher accuracy is still in progress.

Introduction

In this project I accomplished the following things:

  • Text preprocessing
  • Feature extraction and evaluation
  • Model selection, training, and result comparison
  • Pipeline setup and hyperparameter tuning
  • Topic modeling
  • Data visualization

Process:

Load Data
Text Cleaning and Preprocessing

Tools/Libraries used: NLTK, LexNLP

Used WordNetLemmatizer for lemmatization

Removed punctuation and stopwords
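
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `clean_text` helper, the tiny stopword set, and the sample sentence are all assumptions made up for the example, and the lemmatizer defaults to a no-op so the snippet runs without downloading any NLTK data.

```python
import string

# Tiny illustrative stopword list (assumption); with NLTK installed one could
# use set(nltk.corpus.stopwords.words("english")) instead.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "by"}


def clean_text(text, stopwords=STOPWORDS, lemmatize=lambda token: token):
    """Lowercase, strip punctuation, drop stopwords, lemmatize each token."""
    # Replace every punctuation character with a space so that
    # "appeal,dismissed" still splits into two tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [lemmatize(tok) for tok in tokens if tok not in stopwords]


tokens = clean_text("The appeal was dismissed by the High Court.")
print(tokens)  # ['appeal', 'dismissed', 'high', 'court']
```

To plug in NLTK, pass `lemmatize=WordNetLemmatizer().lemmatize` (from `nltk.stem`) once the WordNet corpus has been downloaded.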

Feature Extraction
  • first attempt: TF-IDF

TF-IDF (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

  • second attempt: bag of words first, then TF-IDF
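
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF computed from bag-of-words counts. It uses the textbook form tf × log(N / df); libraries such as scikit-learn add smoothing and normalization, so their numbers will differ. The `tfidf` function and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # bag-of-words counts for this document
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["court", "bail", "accused"],
    ["court", "tax", "assessment"],
    ["court", "bail", "custody"],
]
weights = tfidf(docs)
print(weights[1])
```

A term such as "court" that occurs in every document gets weight 0, while a term confined to one document ("tax") gets the highest weight, which is the behavior that makes TF-IDF useful for telling areas of law apart.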
Fit Models

Naive Bayes (worst performance): 32.9%

Logistic Regression: 63.1%

  • Used a binary classifier to solve a multiclass problem
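
One common way to apply a binary classifier to a problem with several classes is one-vs-rest, where one classifier is trained per area of law. The sketch below assumes that reading of the bullet above and uses scikit-learn; the toy texts and all label names other than "Criminal Laws" are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): two short documents per area of law.
texts = [
    "accused granted bail by the sessions court",
    "conviction for theft upheld on appeal",
    "income tax assessment order set aside",
    "sales tax demand on the assessee quashed",
    "custody of the minor child awarded to the mother",
    "petition for divorce and maintenance allowed",
]
labels = ["Criminal Laws", "Criminal Laws", "Tax Laws", "Tax Laws",
          "Family Laws", "Family Laws"]

# One binary logistic regression is fitted per class (one-vs-rest).
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, labels)
print(model.predict(["bail application of the accused"]))
```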

SVM: 57.3%

  • Decomposition first: SVD
  • Standardize the data
  • Before standardization, SVM accuracy was 18.2%
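
The decomposition and standardization steps can be chained in front of the SVM as a single scikit-learn pipeline. This is a sketch under assumptions: `TruncatedSVD` stands in for the SVD step because it works on sparse TF-IDF matrices, `n_components=2` is chosen only to fit the toy data, and the texts and labels are invented.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy corpus (assumption), two documents per class.
texts = [
    "accused granted bail by the sessions court",
    "conviction for theft upheld on appeal",
    "income tax assessment order set aside",
    "sales tax demand on the assessee quashed",
    "custody of the minor child awarded to the mother",
    "petition for divorce and maintenance allowed",
]
labels = ["criminal", "criminal", "tax", "tax", "family", "family"]

svm_model = make_pipeline(
    TfidfVectorizer(),                             # sparse TF-IDF features
    TruncatedSVD(n_components=2, random_state=0),  # dense low-rank projection
    StandardScaler(),                              # zero mean, unit variance
    SVC(),
)
svm_model.fit(texts, labels)

# Inspect the features the SVM actually sees (all steps except the last).
reduced = svm_model[:-1].transform(texts)
print(reduced.shape)  # (6, 2)
```

Distance-based kernels in SVMs are sensitive to feature scale, which is consistent with the accuracy jump reported after standardization.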

XGBoost: 61.3%

All performance scores above are test-set accuracy.

Prediction
Build Pipeline and tune Hyperparameters

GridSearchCV

I built the pipeline specifically for the Logistic Regression and XGBoost models, since they had the highest accuracy in the first round.
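
A pipeline plus grid search for the Logistic Regression case might look like the sketch below. The step names (`tfidf`, `clf`), the parameter grid, `cv=2`, and the toy data are all assumptions for illustration; the grid actually searched in the project is not stated here. An XGBoost classifier could be swapped in as the final step in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (assumption), two documents per class so that cv=2 can stratify.
texts = [
    "accused granted bail by the sessions court",
    "conviction for theft upheld on appeal",
    "income tax assessment order set aside",
    "sales tax demand on the assessee quashed",
    "custody of the minor child awarded to the mother",
    "petition for divorce and maintenance allowed",
]
labels = ["criminal", "criminal", "tax", "tax", "family", "family"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means it is refitted on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.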

Topic Modeling

Any document can be seen as a mixture of topics, and legal documents especially so. Essentially, topic modeling is a text clustering problem.

Here I used LDA (Latent Dirichlet Allocation).

I found it difficult to guess how many topics a file or an area of law contains.

Data Visualization

I used the `mglearn` library to display the top 10 words within each topic.

The pyLDAvis library was used to visualize the topic models.

Last but not least, I used wordcloud to generate a word cloud over all the legal documents in the selected area of law, to highlight the most recurrent terms.


Problems:

  1. After building several models, I realized that legal documents are a special case in Natural Language Processing that requires different techniques and tools than regular text data. I plan to run Information Extraction on these texts first and then see whether that improves the accuracy.
  2. Guessing how many topics a file or an area of law contains is difficult.

I noticed there are some powerful packages, such as LexNLP, that deal with NLP problems in legal documents.

Conclusion(s)/Discussion.

In Progress:
  • Information Extraction

Appendix:

Some Useful Links:

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur

LexNLP: Natural language processing and information extraction for legal and regulatory texts

