mueller's Introduction

Report On The Investigation Into
Russian Interference In The
2016 Presidential Election

Volumes I and II

Special Counsel Robert S. Mueller, III

2019

This document was generated with the Tesseract Optical Character Recognition (OCR) engine. Because of the poor quality of the original PDF and the numerous redacted sections, the text contains many errors and should not be regarded as the definitive text. It is intended for those who wish to process the text at scale. The difficulties associated with the OCR process also signal the failure of the Justice Department to accommodate citizens who use screen readers to access this document.

import io
import os

import pyocr
import pyocr.builders
from PIL import Image as PI
# PdfFileReader/PdfFileWriter require PyPDF2 < 3.0
from PyPDF2 import PdfFileWriter, PdfFileReader
from wand.image import Image

# Split the full report into one PDF per page.
os.makedirs("pages", exist_ok=True)
with open("report.pdf", "rb") as source:
    inputpdf = PdfFileReader(source)
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open(f"pages/report-page{i:03}.pdf", "wb") as outputStream:
            output.write(outputStream)

# Build a sorted list of the page files.
file_list = []
for path, subdirs, files in os.walk("pages"):  # change depending on system
    for file in files:
        file_list.append(os.path.join(path, file))
file_list = sorted(file_list)

# Use the first OCR tool that pyocr finds.
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError("No OCR tool found; install Tesseract first.")
tool = tools[0]
# Prefer English; the first language listed is not necessarily "eng".
langs = tool.get_available_languages()
lang = "eng" if "eng" in langs else langs[0]

# Render each page to a JPEG at a resolution of 300.
image = []
text = []
for file in file_list:
    pdf = Image(filename=file, resolution=300)
    jpg = pdf.convert('jpeg')
    img_page = Image(image=jpg)
    image.append(img_page.make_blob('jpeg'))

# Run OCR on each rendered page.
for img in image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder())
    text.append(txt)

# Write the recognized text of every page to a single file.
with open("report.txt", "w") as file:
    for page in text:
        file.write(page)
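
For readers who want to process the resulting text at scale, the following is a minimal sketch of one possible first step: counting word frequencies while tolerating some OCR noise. It is not part of the original script. The function names `dehyphenate` and `word_counts` and the `min_length` parameter are hypothetical, and the only assumption about the data is that the OCR output was saved as `report.txt`.

```python
import os
import re
from collections import Counter


def dehyphenate(text: str) -> str:
    """Rejoin words that were split across a line break with a hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def word_counts(text: str, min_length: int = 3) -> Counter:
    """Count lowercase alphabetic tokens of at least min_length characters.

    Digits and punctuation are dropped, which discards some OCR debris
    but does not correct misrecognized letters.
    """
    tokens = re.findall(r"[a-z]+", dehyphenate(text).lower())
    return Counter(t for t in tokens if len(t) >= min_length)


if __name__ == "__main__":
    # Assumes the OCR output above was written to report.txt.
    if os.path.exists("report.txt"):
        with open("report.txt", encoding="utf-8") as f:
            counts = word_counts(f.read())
        for word, count in counts.most_common(20):
            print(word, count)
```

Note that `dehyphenate` also merges genuinely hyphenated compounds that happen to break at a line end, so the counts are approximate.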

mueller's People

Contributors

aaronmauro, psudhlab