
nocap's Introduction

Easy Bulk export, no cap

This repository provides scripts and notebooks that make it easy to export data in bulk from CourtListener's freely available downloads.

  • Create a first version of a notebook suitable for data scientists
    • Choose appropriate dtypes to optimize pandas storage
    • Select only the necessary columns via usecols; for example, the 'created_by' date field, which only records when a row was inserted into the database, isn't necessary
    • Read opinions.csv (190+ GB) one chunk at a time from disk while converting it to JSON
  • Create a standalone script that can be piped to other tools
  • Improve speed by using a Dask DataFrame

nocap's People

Contributors

sabzo, varun-magesh

Stargazers

Matteo Cargnelutti and two others

Watchers

Jack Cushman, James Cloos, Ben Steinberg, Rebecca Lynn Cremona, and one other

nocap's Issues

Running locally is impractical

Several issues when running locally:

  • Memory: importing the datasets (excluding the opinions dataset) requires at least 32 GB
  • When importing the opinions dataset chunk by chunk (500 rows at a time), a memory allocation error shows up after several hours:
OSError: [Errno 12] Cannot allocate memory
2023-03-21 08:48:17,358 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 22 MB fds: 26>>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/system_monitor.py", line 134, in update
    net_ioc = psutil.net_io_counters()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/psutil/__init__.py", line 2114, in net_io_counters
    rawdict = _psplatform.net_io_counters()
OSError: [Errno 12] Cannot allocate memory
(the same traceback repeats at 08:49:12 and 08:51:10)

Solution:

Create a distributed cloud cluster with at least hundreds of cores that can be turned on and off automatically or with a click.
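
A minimal sketch of that idea using dask.distributed's adaptive scaling. `LocalCluster` stands in for a real cloud cluster here; the same `adapt()` pattern applies to cloud-backed cluster classes (dask-kubernetes, dask-cloudprovider, etc.), which are assumptions, not what this repo currently uses:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster is a local stand-in; swap in a cloud-backed cluster class
# for real "hundreds of cores" scaling.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
# Scale the worker pool up and down automatically with load.
# minimum=0 would release every worker when the cluster is idle.
cluster.adapt(minimum=1, maximum=200)
client = Client(cluster)

# Submit work as usual; workers come and go behind the scenes.
result = client.submit(lambda: sum(range(10))).result()
client.close()
cluster.close()
```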


Converting the opinion-clusters CSV into a dict causes an error

Converting the opinion-clusters CSV (6.7 GB) into a dict results in the following error:

field larger than field limit (131072)
Error                                     Traceback (most recent call last)
File <timed exec>:1

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lil_nocap/__init__.py:34, in NoCap.__init__(self, opinions_fn, opinion_clusters_fn, courts_fn, dockets_fn, citation_fn)
     32 # if memory is < 32 GB use Dask for large files, otherwise use dicts
     33 self._df_dockets = self.get_pickled_docket() #self.init_dockets_dict() # self.init_dockets_df()
---> 34 self._df_opinion_clusters = self.init_opinion_clusters_dict() #self.init_opinion_clusters_df()
     35 self._df_courts = self.init_courts_df()
     36 self._df_citation = self.init_citation_df()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lil_nocap/__init__.py:161, in NoCap.init_opinion_clusters_dict(self, fn)
    159 with open(fn) as oc:
    160     reader = csv.DictReader(oc)
--> 161     for row in reader:
    162         self._cluster_dict[int(row['id'])] = row['judges'], row['docket_id'], row['case_name'], row['case_name_full']
    163 end = time.perf_counter()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/csv.py:111, in DictReader.__next__(self)
    108 if self.line_num == 0:
    109     # Used only for its side effect.
    110     self.fieldnames
--> 111 row = next(self.reader)
    112 self.line_num = self.reader.line_num
    114 # unlike the basic reader, we prefer not to return blanks,
    115 # because we will typically wind up with a dict full of None
    116 # values

Error: field larger than field limit (131072)

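The usual fix is to raise the csv module's per-field limit before parsing, since opinion-text fields routinely exceed the default 131072 bytes:

```python
import csv
import sys

# Raise the csv module's per-field limit (default 131072 bytes) so that
# multi-megabyte opinion text fields parse instead of raising an Error.
# On Windows, where C long is 32 bits, sys.maxsize can overflow -- use a
# smaller explicit limit there.
csv.field_size_limit(sys.maxsize)
```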

Add appropriate dtypes

To conserve memory, pandas needs the correct dtype for each column. Add correct dtypes for:

  • Dockets: columns 7,20,22,31,34
  • Opinion clusters: columns 10,17,18,19,20,21,22,23,24,25,26,29
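
The columns above are identified only by position, so here is a sketch of what the dtype mapping might look like once those positions are resolved against each CSV's header. The column names below are placeholders, not the real schema:

```python
import io
import pandas as pd

# Placeholder names -- resolve the column positions listed above against
# each CSV's header row first.
DOCKET_DTYPES = {
    "court_id": "category",    # low-cardinality strings -> category
    "pacer_case_id": "Int64",  # nullable integer (plain int64 can't hold NaN)
    "date_blocked": "string",  # sparse text, avoid the generic object dtype
}

sample = io.StringIO(
    "court_id,pacer_case_id,date_blocked\n"
    "scotus,12345,\n"
    "ca9,,2020-01-01\n"
)
df = pd.read_csv(sample, dtype=DOCKET_DTYPES)
```

`category` and the nullable extension dtypes typically cut memory several-fold compared with plain `object` columns.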

Known issues

Memory error

Using 16 GB of RAM or less:

Traceback (most recent call last):
  File "../lil_nocap/__init__.py", line 481, in <module>
    NoCap.cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "../lil_nocap/__init__.py", line 56, in cli
    NoCap(o, oc, c, d, cm).start()
  File "../lil_nocap/__init__.py", line 43, in __init__
    self.init_citation_dict() #self.init_citation_df()
  File "../lil_nocap/__init__.py", line 282, in init_citation_dict
    list(map(lambda x: self.cites_to.setdefault(x['citing_opinion_id'], []).append(x['cited_opinion_id']), df_dict))
  File "../lil_nocap/__init__.py", line 282, in <lambda>
    list(map(lambda x: self.cites_to.setdefault(x['citing_opinion_id'], []).append(x['cited_opinion_id']), df_dict))
MemoryError
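
One way to shrink the citation mapping's footprint is to build it row by row from the csv reader, storing ints, rather than first materializing the whole file as a list of dicts. A sketch (the column names are taken from the traceback above; the function name is hypothetical):

```python
import csv
from collections import defaultdict

def build_cites_to(citation_csv):
    """Map citing_opinion_id -> list of cited_opinion_ids, one row at a time."""
    cites_to = defaultdict(list)
    with open(citation_csv, newline="") as f:
        for row in csv.DictReader(f):
            # ints are far smaller than the strings the csv module yields
            cites_to[int(row["citing_opinion_id"])].append(
                int(row["cited_opinion_id"])
            )
    return cites_to
```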

Parallel processing failing due to a silent error related to a missing column

Issue

Debugging the parallel processing was misleading because the errors were silent, resulting only in my machine freezing and restarting.

After retesting the library function by function, I realized the issue: when I set a dataframe to use a column as its index, that column is no longer a column but an index. This happened to the id column in a few of the dataframes (court opinions, citation maps, opinions, etc.). The library would look for the "id" column after that column had been turned into the index, and this failed silently when using the Dask library.

Setting the id as the index is useful because pandas provides sort_index(), so the dataframe can be sorted for faster lookups.

Solution

  • Look up a record by the index rather than by a column containing the key.
  • Or... keep the 'id' column as-is and search for the value inside it, although the id column won't be sorted... essentially a linear search.
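
The two options can also be combined: `set_index(drop=False)` keeps 'id' available as a column while `sort_index()` enables the fast sorted lookups. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": [3, 1, 2], "name": ["c", "a", "b"]})

# Option 1: promote 'id' to a sorted index for fast .loc lookups.
by_index = df.set_index("id").sort_index()
row = by_index.loc[2]  # index lookup -- 'id' is no longer a column here

# Combined: drop=False keeps 'id' as BOTH index and column, so code that
# still expects an 'id' column won't fail silently.
both = df.set_index("id", drop=False).sort_index()
```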

Dockets file not working for SQLite

For some reason, the dockets file has no keys when read into SQLite:


This is causing "key doesn't exist" errors.

To reproduce, initialize nocap with the .bz2 files:

# get file paths
opinions_fn = f'{path}opinions-2022-12-31.csv.bz2'
courts_fn = f'{path}courts-2022-12-31.csv.bz2'
dockets_fn = f'{path}dockets-2022-12-31.csv.bz2'
opinion_clusters_fn = f'{path}opinion-clusters-2022-12-31.csv.bz2'
citation_fn = f'{path}citation-map-2022-12-31.csv.bz2'

# initialize nc
nc = lnc.NoCap(opinions_fn, opinion_clusters_fn, courts_fn, dockets_fn, citation_fn)

# check length of docket keys
len(list(nc._df_dockets.keys()))

Current status

After days of running the script and exhausting my hard drive space, I purchased an external drive. I ran the script for 33 hours (producing 1.2 TB) and the dataset compilation was still not complete. I began to suspect there must be duplicates:


CourtListener has ~9M opinions.

nocap makes minibatches of 1,000 opinions each, so it should produce fewer than 10,000 files in total. After 1.2 TB, nocap had produced ~238,000 files.

I suspect the threads are each processing the same batch of 1,000 rows, instead of each thread processing a separate batch.
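
If that is the cause, the fix is to give each worker a distinct slice before fanning out, rather than letting every thread read the same batch. A sketch with ThreadPoolExecutor (`process_batch` is a placeholder for the real export step, not nocap's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch_index, rows):
    # placeholder for the real per-batch export work
    return batch_index, len(rows)

def export_all(all_rows, batch_size=1000, workers=4):
    # Pre-slice: every task gets its own disjoint batch of rows, so no
    # two threads can ever process the same 1000 rows.
    batches = [
        (i, all_rows[start:start + batch_size])
        for i, start in enumerate(range(0, len(all_rows), batch_size))
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: process_batch(*b), batches))
```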

Next steps

Tech

  • remove duplicates
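
One way to remove the duplicates, assuming duplicate batches produced byte-identical output files (the glob pattern is a guess at nocap's output naming, not confirmed):

```python
import hashlib
from pathlib import Path

def remove_duplicate_files(directory, pattern="*.json"):
    """Delete any file whose contents hash identically to an earlier one."""
    seen = set()
    removed = 0
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()   # byte-identical to a file we already kept
            removed += 1
        else:
            seen.add(digest)
    return removed
```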

Storage

Harvard Dataverse

Licensing

Pending -- but best suited for academic purposes only

Ethics & Data Nutrition

Pending -- more analysis

Using Dask's map functionality is slower than the single-process native map

Using Dask's client.map is much slower than using the native map function.

Additionally, another error appears:

distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/scheduler.py", line 5250, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py", line 873, in handle_stream
    msgs = await comm.read()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/comm/tcp.py", line 235, in read
    n = await stream.read_into(chunk)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 474, in read_into
    self._try_inline_read()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 844, in _try_inline_read
    pos = self._read_to_buffer_loop()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 757, in _read_to_buffer_loop
    if self._read_to_buffer() == 0:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 869, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tornado/iostream.py", line 1138, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
OSError: [Errno 22] Invalid argument
Task exception was never retrieved
future: <Task finished name='Task-31363' coro=<Server._handle_comm() done, defined at /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/core.py:726> exception=OSError(22, 'Invalid argument')>
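
On the speed point: Dask's distributed scheduler adds roughly a millisecond of overhead per task, so mapping a tiny per-row function over individual items is easily slower than the built-in map. The usual remedy is to batch the rows so each task does substantial work. A sketch under that assumption (not nocap's actual code):

```python
from dask.distributed import Client

def process_rows(rows):
    # one task per *batch*: scheduler overhead is paid once per 1000 rows,
    # not once per row
    return [r * 2 for r in rows]

client = Client(processes=False)  # in-process scheduler for the demo
rows = list(range(10_000))
batches = [rows[i:i + 1000] for i in range(0, len(rows), 1000)]
futures = client.map(process_rows, batches)
results = [r for batch in client.gather(futures) for r in batch]
client.close()
```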
