
Lawma: The power of specialization for legal tasks

This is the primary code base for the project:

Lawma: The Power of Specialization for Legal Tasks. Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, and Michael Livermore. 2024.

Lawma 8B and Lawma 70B are language models fine-tuned on 260 legal classification tasks derived from the Supreme Court and Songer Court of Appeals databases. The Lawma models outperform GPT-4 on 95% of these legal classification tasks, on average by over 17 accuracy points.

  • The models: Lawma 8B and Lawma 70B are fine-tunes of Llama 3 Instruct.
  • The fine-tuning dataset: our fine-tuning dataset contains a diverse set of 260 legal classification tasks, with around 500k task examples and 2 billion tokens.
  • The legal classification tasks: they comprise almost all of the variables of the Supreme Court and Songer Court of Appeals databases; see Appendix B of the paper.
  • The details: see our arXiv preprint for more details, including a number of fine-tuning experiments on the scaling behaviour of fine-tuning, its sample efficiency, its generalization to unseen tasks and Courts, and the effect of single task specialization.

What are the Lawma models useful for? We recommend using the Lawma models only for the legal classification tasks that they were fine-tuned on. The main takeaway of our paper is that specializing models leads to large improvements in performance. We therefore strongly recommend that practitioners further fine-tune Lawma on the actual tasks that the models will be used for. Relatively few examples (i.e., dozens or hundreds) may already lead to large gains in performance.

Why these legal classification tasks? Our reasons to study legal classification tasks are both technical and substantive. From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement. From a substantive legal perspective, efficient solutions to such classification problems have rich and important applications in legal research. We provide code to evaluate the performance of HF models on these classification tasks.

Evaluation

To evaluate language models on each of the 260 legal tasks, please refer to the evaluation folder, and in particular hf_eval.py. You must first download the task files from here, or generate them yourself by following the instructions in the data_generation folder. We evaluated a range of language models:
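At a high level, evaluating a model on one of these classification tasks amounts to scoring each candidate answer and predicting the highest-scoring one. The sketch below illustrates that pattern; it is not the repo's actual `hf_eval.py`, and `option_logprob` is a stand-in for a real language-model log-likelihood (e.g., the sum of token log-probabilities under a Hugging Face model).

```python
# Illustrative multiple-choice evaluation: score every answer option for
# each example and predict the option with the highest score.

def option_logprob(prompt: str, option: str) -> float:
    # Placeholder scorer. A real implementation would sum the model's
    # token log-probabilities for `option` conditioned on `prompt`; here
    # we fake the scores so the sketch runs end to end.
    fake_scores = {"A": -1.2, "B": -0.3, "C": -2.5}
    return fake_scores.get(option, -10.0)

def predict(prompt: str, options: list[str]) -> str:
    # Pick the answer option assigned the highest likelihood.
    return max(options, key=lambda o: option_logprob(prompt, o))

def accuracy(examples: list[dict]) -> float:
    correct = sum(predict(ex["prompt"], ex["options"]) == ex["label"]
                  for ex in examples)
    return correct / len(examples)

examples = [
    {"prompt": "Which party won the appeal?", "options": ["A", "B", "C"], "label": "B"},
    {"prompt": "Was the decision reversed?", "options": ["A", "B"], "label": "A"},
]
print(accuracy(examples))  # 0.5 with the fake scorer above
```

With a real model, `option_logprob` is the only piece that changes; the argmax-over-options loop and the accuracy computation stay the same.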

| Model | All tasks | Supreme Court tasks | Court of Appeals tasks |
|---|---:|---:|---:|
| Lawma 70B | 81.9 | 84.1 | 81.5 |
| Lawma 8B | 80.3 | 82.4 | 79.9 |
| GPT-4 | 62.9 | 59.8 | 63.4 |
| Llama 3 70B Inst | 58.4 | 47.1 | 60.3 |
| Mixtral 8x7B Inst | 43.2 | 24.4 | 46.4 |
| Llama 3 8B Inst | 42.6 | 32.8 | 44.2 |
| Majority classifier | 41.7 | 31.5 | 43.5 |
| Mistral 7B Inst | 39.9 | 19.5 | 43.4 |
| Saul 7B Inst | 34.4 | 20.2 | 36.8 |
| LegalBert | 24.6 | 13.6 | 26.4 |

The Lawma models substantially outperform all other models tested, and in particular GPT-4. Note that, while Lawma 70B generally outperforms Lawma 8B, the difference in performance is typically rather small. Therefore, practitioners may prefer to use Lawma 8B for its significantly cheaper inference and fine-tuning, with little cost in terms of model performance.

Note: evaluating models on all 260 classification tasks is fairly compute-intensive. However, for the purposes of language model benchmarking, we may be mostly interested in aggregate performance. We are currently working on making aggregate evaluations less resource-intensive by considering only a limited number of examples per task.
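One way to make aggregate evaluation cheaper, along the lines of the note above, is to score only a small sample of examples per task and then macro-average the per-task accuracies so that large tasks do not dominate. A minimal sketch, where the function name, sampling scheme, and toy data are illustrative rather than the repo's actual code:

```python
import random

def macro_average_accuracy(task_results, k=100, seed=0):
    """Estimate aggregate accuracy by subsampling at most `k` examples
    per task and macro-averaging the per-task accuracies.

    `task_results` maps task name -> list of booleans (was the
    prediction correct?).
    """
    rng = random.Random(seed)
    per_task = []
    for task, outcomes in task_results.items():
        sample = rng.sample(outcomes, k) if len(outcomes) > k else outcomes
        per_task.append(sum(sample) / len(sample))
    return sum(per_task) / len(per_task)

# Two toy "tasks": one large and easy, one small and hard.
results = {
    "task_large": [True] * 900 + [False] * 100,  # 90% accurate
    "task_small": [True] * 5 + [False] * 5,      # 50% accurate
}
print(macro_average_accuracy(results, k=100))
```

Because each task contributes equally to the average, the estimate stabilizes quickly even with few examples per task, at the cost of some sampling noise on the largest tasks.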

Fine-tuning on our dataset

We fine-tune Lawma using the axolotl library. Please refer to the README in the fine-tune folder for the training scripts and configuration files that we used to fine-tune Lawma.

To fine-tune on our dataset of legal classification tasks, simply indicate so in your config.yml file:

datasets:
  - path: ricdomolm/lawma-all-tasks
    type: alpaca

and then train using axolotl as usual:

accelerate launch -m axolotl.cli.train config.yml

Fine-tuning Lawma 8B on 7xH100 GPUs required a total of 600 H100 hours (3 epochs), whereas fine-tuning Lawma 70B on 8 H100 nodes of 8 GPUs each required around 1600 H100 hours (1 epoch). We find that further epochs hurt average task performance.
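Putting the pieces above together, a fuller config might look like the sketch below. Only the `datasets` stanza comes from this README; the base model, epoch count, and optimizer settings are illustrative assumptions (the actual configuration files live in the fine-tune folder).

```yaml
# Illustrative axolotl config sketch -- see the fine-tune folder for the
# configs actually used. Only the datasets stanza is from this README;
# all other values are assumptions.
base_model: meta-llama/Meta-Llama-3-8B-Instruct

datasets:
  - path: ricdomolm/lawma-all-tasks
    type: alpaca

num_epochs: 3        # the authors report 3 epochs for Lawma 8B
learning_rate: 2e-5  # illustrative value
bf16: true
output_dir: ./lawma-8b-out
```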

Reproducing the experiments and figures of the paper

To reproduce the results of the paper, take the following steps:

  • Go to data_generation for all code to create the classification tasks and the fine-tuning dataset.
  • The directory evaluation contains code used to evaluate various models on the classification tasks.
  • The directory fine-tune contains code to fine-tune Lawma, as well as for the additional fine-tuning experiments included in the paper.
  • The directory notebooks contains ipynb files to generate the plots and tables of the paper.

See the README.md files in the subdirectories for additional documentation.

Citation

Please cite as:

@misc{dominguezolmedo2024lawmapowerspecializationlegal,
      title={Lawma: The Power of Specialization for Legal Tasks}, 
      author={Ricardo Dominguez-Olmedo and Vedant Nanda and Rediet Abebe and Stefan Bechtold and Christoph Engel and Jens Frankenreiter and Krishna Gummadi and Moritz Hardt and Michael Livermore},
      year={2024},
      eprint={2407.16615},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.16615}, 
}

