This project has two parts:
- Text classification: The dataset contains ~1000 legal text documents, 900 of which are labeled with the corresponding area of law (e.g., `LNIND_1993_DEL_112` is labeled as *Criminal Laws*). The goal is to predict the correct area of law for the remaining 100 files.
- Topic modeling: For selected areas of law, extract the topics from the documents and analyze the correlation among documents within the same area of law.
Work is still in progress to achieve higher accuracy.
- Text preprocessing
- Feature extraction and evaluation
- Model selection, training, and result comparison
- Setup pipeline and hyperparameter tuning
- Topic modeling
- Data Visualization
Tools/Libraries used: NLTK, LexNLP
Used `WordNetLemmatizer` for lemmatization, and removed punctuation and stopwords.
- First attempt: TF-IDF (term frequency–inverse document frequency), a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
- Second attempt: bag of words first, then TF-IDF on top of the counts.
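The second attempt can be sketched with scikit-learn: a bag-of-words count matrix first, then TF-IDF weighting applied to it. The toy corpus here is illustrative, not drawn from the actual dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "the court dismissed the appeal",
    "the appeal was allowed by the high court",
    "the accused was convicted under the penal code",
]

counts = CountVectorizer().fit_transform(corpus)   # bag of words
tfidf = TfidfTransformer().fit_transform(counts)   # reweight counts by IDF

# Same document-by-term shape; only the cell weights change.
print(counts.shape, tfidf.shape)
```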
Naive Bayes: 32.9% (worst performance)
Logistic Regression: 63.1%
- Uses a binary classifier (one-vs-rest) to solve the multiclass problem
SVM: 57.3%
- Decomposition first with SVD
- Standardized the data
- Before standardization, SVM's accuracy was only 18.2%
XGBoost: 61.3%
All performance scores above are test-set accuracy.
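The SVM setup described above (decomposition first, then standardization) can be sketched as a scikit-learn pipeline. The corpus and labels below are a synthetic stand-in; the real dataset and the number of SVD components are not reproduced here.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),  # decomposition first
    StandardScaler(),              # standardizing made a large difference for SVM
    SVC(),
)

# Toy stand-in data with two areas of law.
docs = ["criminal appeal court", "tax assessment income",
        "criminal trial court", "income tax return"]
labels = ["Criminal Laws", "Tax Laws", "Criminal Laws", "Tax Laws"]

svm_clf.fit(docs, labels)
print(svm_clf.predict(["criminal court case"]))
```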
GridSearchCV
I built the pipeline specifically for the Logistic Regression and XGBoost models, since they had the highest accuracy in the first place.
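A minimal sketch of this tuning setup for the Logistic Regression case: a pipeline wrapping TF-IDF and the classifier, searched with `GridSearchCV`. The parameter grid and the toy data are illustrative assumptions, not the grid actually used in the project.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: vectorizer and classifier parameters tuned jointly.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)

docs = ["criminal appeal court", "tax assessment income",
        "criminal trial court", "income tax return"]
labels = ["Criminal Laws", "Tax Laws", "Criminal Laws", "Tax Laws"]

search.fit(docs, labels)
print(search.best_params_)
```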
Any document seems to be a mixture of topics, and legal documents especially so. Essentially, topic modeling is a text clustering problem.
Here I used LDA (Latent Dirichlet Allocation).
I found it difficult to guess how many topics a file or area of law contains.
I used the `mglearn` library to display the top 10 words within each topic.
The `pyLDAvis` library was used to visualize the topic models.
Last but not least, I used `wordcloud` on the full set of legal documents for the selected area of law to highlight the most recurrent terms.
- After building several models in this project, I realized that legal documents are a special domain within Natural Language Processing, requiring different techniques and tools than regular text data. I plan to apply Information Extraction to these texts first and then see whether that improves accuracy.
- Guessing the number of topics in a file or area of law remains difficult.
I noticed there are some powerful packages, such as LexNLP, that deal with NLP problems in legal documents.
- Information Extraction
Complete Guide to Parameter Tuning in XGBoost (with codes in Python)
Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur
LexNLP: Natural language processing and information extraction for legal and regulatory texts