
nocap's Introduction

Easy Bulk export, no cap

This repository provides scripts and notebooks that make it easy to export data in bulk from CourtListener's freely available downloads.

  • Create a first version of a notebook suitable for data scientists
    • Choose appropriate dtypes to optimize pandas storage
    • Select only the necessary columns via usecols; for example, the 'created_by' date field, which only records when a row was inserted into the database, isn't necessary
    • Read opinions.csv (190+ GB) one chunk at a time from disk while converting it to JSON
  • Create a standalone script that can be piped to other tools
  • Improve speed by using a Dask DataFrame

nocap's People

Contributors

sabzo, varun-magesh

Stargazers

Matteo Cargnelutti and two others

Watchers

Jack Cushman, James Cloos, Ben Steinberg, Rebecca Lynn Cremona, and one other

nocap's Issues

Running locally is impractical

Several issues when running locally:

  • Memory: importing the datasets (excluding the opinions dataset) requires at least 32 GB
  • When importing the opinions dataset chunk by chunk (500 rows at a time), a memory allocation error shows up after several hours:
OSError: [Errno 12] Cannot allocate memory
2023-03-21 08:48:17,358 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 22 MB fds: 26>>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/system_monitor.py", line 134, in update
    net_ioc = psutil.net_io_counters()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/psutil/__init__.py", line 2114, in net_io_counters
    rawdict = _psplatform.net_io_counters()
OSError: [Errno 12] Cannot allocate memory
(the same traceback repeats at 08:49:12 and 08:51:10)

Solution:

Create a distributed cloud cluster with at least hundreds of cores that can be turned on and off automatically or with a click.
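
A minimal sketch of that idea using dask.distributed's adaptive scaling. `LocalCluster` stands in for a real cloud cluster here; the same `adapt()` pattern applies to cloud-backed cluster classes (dask-kubernetes, dask-cloudprovider, etc.), which are assumptions, not what this repo currently uses:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster is a local stand-in; swap in a cloud-backed cluster class
# for real "hundreds of cores" scaling.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
# Scale the worker pool up and down automatically with load.
# minimum=0 would release every worker when the cluster is idle.
cluster.adapt(minimum=1, maximum=200)
client = Client(cluster)

# Submit work as usual; workers come and go behind the scenes.
result = client.submit(lambda: sum(range(10))).result()
client.close()
cluster.close()
```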


Converting the opinion-clusters CSV into a dict causes an error

Converting the opinion-clusters CSV (6.7 GB) into a dict results in the following error:

field larger than field limit (131072)
Error                                     Traceback (most recent call last)
File <timed exec>:1

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lil_nocap/__init__.py:34, in NoCap.__init__(self, opinions_fn, opinion_clusters_fn, courts_fn, dockets_fn, citation_fn)
     32 # if memory is < 32 GB use Dask for large files, otherwise use dicts
     33 self._df_dockets = self.get_pickled_docket() #self.init_dockets_dict() # self.init_dockets_df()
---> 34 self._df_opinion_clusters = self.init_opinion_clusters_dict() #self.init_opinion_clusters_df()
     35 self._df_courts = self.init_courts_df()
     36 self._df_citation = self.init_citation_df()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lil_nocap/__init__.py:161, in NoCap.init_opinion_clusters_dict(self, fn)
    159 with open(fn) as oc:
    160     reader = csv.DictReader(oc)
--> 161     for row in reader:
    162         self._cluster_dict[int(row['id'])] = row['judges'], row['docket_id'], row['case_name'], row['case_name_full']
    163 end = time.perf_counter()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/csv.py:111, in DictReader.__next__(self)
    108 if self.line_num == 0:
    109     # Used only for its side effect.
    110     self.fieldnames
--> 111 row = next(self.reader)
    112 self.line_num = self.reader.line_num
    114 # unlike the basic reader, we prefer not to return blanks,
    115 # because we will typically wind up with a dict full of None
    116 # values

Error: field larger than field limit (131072)

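The usual fix is to raise the csv module's per-field limit before parsing, since opinion-text fields routinely exceed the default 131072 bytes:

```python
import csv
import sys

# Raise the csv module's per-field limit (default 131072 bytes) so that
# multi-megabyte opinion text fields parse instead of raising an Error.
# On Windows, where C long is 32 bits, sys.maxsize can overflow -- use a
# smaller explicit limit there.
csv.field_size_limit(sys.maxsize)
```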

Add appropriate dtypes

To conserve memory, pandas needs the correct dtype for each column. Add correct dtypes for:

  • Dockets: columns 7,20,22,31,34
  • Opinion clusters: columns 10,17,18,19,20,21,22,23,24,25,26,29
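
The columns above are identified only by position, so here is a sketch of what the dtype mapping might look like once those positions are resolved against each CSV's header. The column names below are placeholders, not the real schema:

```python
import io
import pandas as pd

# Placeholder names -- resolve the column positions listed above against
# each CSV's header row first.
DOCKET_DTYPES = {
    "court_id": "category",    # low-cardinality strings -> category
    "pacer_case_id": "Int64",  # nullable integer (plain int64 can't hold NaN)
    "date_blocked": "string",  # sparse text, avoid the generic object dtype
}

sample = io.StringIO(
    "court_id,pacer_case_id,date_blocked\n"
    "scotus,12345,\n"
    "ca9,,2020-01-01\n"
)
df = pd.read_csv(sample, dtype=DOCKET_DTYPES)
```

`category` and the nullable extension dtypes typically cut memory several-fold compared with plain `object` columns.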

Known issues

Memory error

Using 16 GB of RAM or less:

Traceback (most recent call last):
  File "../lil_nocap/__init__.py", line 481, in <module>
    NoCap.cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "../lil_nocap/__init__.py", line 56, in cli
    NoCap(o, oc, c, d, cm).start()
  File "../lil_nocap/__init__.py", line 43, in __init__
    self.init_citation_dict() #self.init_citation_df()
  File "../lil_nocap/__init__.py", line 282, in init_citation_dict
    list(map(lambda x: self.cites_to.setdefault(x['citing_opinion_id'], []).append(x['cited_opinion_id']), df_dict))
  File "../lil_nocap/__init__.py", line 282, in <lambda>
    list(map(lambda x: self.cites_to.setdefault(x['citing_opinion_id'], []).append(x['cited_opinion_id']), df_dict))
MemoryError
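
One way to shrink the citation mapping's footprint is to build it row by row from the csv reader, storing ints, rather than first materializing the whole file as a list of dicts. A sketch (the column names are taken from the traceback above; the function name is hypothetical):

```python
import csv
from collections import defaultdict

def build_cites_to(citation_csv):
    """Map citing_opinion_id -> list of cited_opinion_ids, one row at a time."""
    cites_to = defaultdict(list)
    with open(citation_csv, newline="") as f:
        for row in csv.DictReader(f):
            # ints are far smaller than the strings the csv module yields
            cites_to[int(row["citing_opinion_id"])].append(
                int(row["cited_opinion_id"])
            )
    return cites_to
```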

Parallel processing failing due to a silent error related to a missing column

Issue

Debugging the parallel processing was misleading because the errors were silent, resulting only in my machine freezing and restarting.

After retesting the library function by function, I realized the issue: when I set a dataframe to use a column as its index, that column is no longer a column but an index. This happened to the id column in a few of the dataframes (court opinions, citation maps, opinions, etc.). The library would look for the "id" column after that column had been turned into the index, and this failed silently when using the Dask library.

Setting the id as the index is useful because pandas provides sort_index(), so the dataframe can be sorted for faster lookups.

Solution

  • Look up a record by the index rather than by a column containing the key.
  • Or... keep the 'id' column as-is and search for the value inside it, although the id column won't be sorted... essentially a linear search.
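
The two options can also be combined: `set_index(drop=False)` keeps 'id' available as a column while `sort_index()` enables the fast sorted lookups. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": [3, 1, 2], "name": ["c", "a", "b"]})

# Option 1: promote 'id' to a sorted index for fast .loc lookups.
by_index = df.set_index("id").sort_index()
row = by_index.loc[2]  # index lookup -- 'id' is no longer a column here

# Combined: drop=False keeps 'id' as BOTH index and column, so code that
# still expects an 'id' column won't fail silently.
both = df.set_index("id", drop=False).sort_index()
```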

Dockets file not working for SQLite

For some reason, the dockets file has no keys when read into SQLite:


This is causing "key doesn't exist" errors.

To reproduce, initialize nocap with the .bz2 files:

# get file paths
opinions_fn = f'{path}opinions-2022-12-31.csv.bz2'
courts_fn = f'{path}courts-2022-12-31.csv.bz2'
dockets_fn = f'{path}dockets-2022-12-31.csv.bz2'
opinion_clusters_fn = f'{path}opinion-clusters-2022-12-31.csv.bz2'
citation_fn = f'{path}citation-map-2022-12-31.csv.bz2'

# initialize nc
nc = lnc.NoCap(opinions_fn, opinion_clusters_fn, courts_fn, dockets_fn, citation_fn)

# check length of docket keys
len(list(nc._df_dockets.keys()))

Current status

After days of running the script and exhausting my hard drive space, I purchased an external drive. I ran the script for 33 hours (producing 1.2 TB) and the dataset compilation was still not complete. I began to suspect there must be duplicates:


CourtListener has ~9M opinions.

nocap makes minibatches of 1,000 opinions each, so it should produce fewer than 10,000 files in total. After 1.2 TB, nocap had produced ~238,000 files.

I suspect the threads are each processing the same batch of 1,000 rows, instead of each thread processing a separate batch.
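
If that is the cause, the fix is to give each worker a distinct slice before fanning out, rather than letting every thread read the same batch. A sketch with ThreadPoolExecutor (`process_batch` is a placeholder for the real export step, not nocap's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch_index, rows):
    # placeholder for the real per-batch export work
    return batch_index, len(rows)

def export_all(all_rows, batch_size=1000, workers=4):
    # Pre-slice: every task gets its own disjoint batch of rows, so no
    # two threads can ever process the same 1000 rows.
    batches = [
        (i, all_rows[start:start + batch_size])
        for i, start in enumerate(range(0, len(all_rows), batch_size))
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: process_batch(*b), batches))
```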

Next steps

Tech

  • remove duplicates
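
One way to remove the duplicates, assuming duplicate batches produced byte-identical output files (the glob pattern is a guess at nocap's output naming, not confirmed):

```python
import hashlib
from pathlib import Path

def remove_duplicate_files(directory, pattern="*.json"):
    """Delete any file whose contents hash identically to an earlier one."""
    seen = set()
    removed = 0
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()   # byte-identical to a file we already kept
            removed += 1
        else:
            seen.add(digest)
    return removed
```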

Storage

Harvard Dataverse

Licensing

Pending -- but best suited for academic purposes only

Ethics & Data Nutrition

Pending -- more analysis

Using Dask's map functionality is slower than the single-process native map

Using Dask's client.map is much slower than using the native map function.

Additionally, another error appears:

distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/scheduler.py", line 5250, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py", line 873, in handle_stream
    msgs = await comm.read()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/comm/tcp.py", line 235, in read
    n = await stream.read_into(chunk)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 474, in read_into
    self._try_inline_read()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 844, in _try_inline_read
    pos = self._read_to_buffer_loop()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 757, in _read_to_buffer_loop
    if self._read_to_buffer() == 0:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 869, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 1138, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
OSError: [Errno 22] Invalid argument
Task exception was never retrieved
future: <Task finished name='Task-31363' coro=<Server._handle_comm() done, defined at /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py:726> exception=OSError(22, 'Invalid argument')>
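
On the speed point: Dask's distributed scheduler adds roughly a millisecond of overhead per task, so mapping a tiny per-row function over individual items is easily slower than the built-in map. The usual remedy is to batch the rows so each task does substantial work. A sketch under that assumption (not nocap's actual code):

```python
from dask.distributed import Client

def process_rows(rows):
    # one task per *batch*: scheduler overhead is paid once per 1000 rows,
    # not once per row
    return [r * 2 for r in rows]

client = Client(processes=False)  # in-process scheduler for the demo
rows = list(range(10_000))
batches = [rows[i:i + 1000] for i in range(0, len(rows), 1000)]
futures = client.map(process_rows, batches)
results = [r for batch in client.gather(futures) for r in batch]
client.close()
```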
