Comments (2)

emcf commented on June 1, 2024

Hi @sisyga, the ai_extraction parameter is currently only available through the API.

When running locally on PDFs with lots of pages, I experience this problem too. That is a reasonable workaround, although I don't think it is sufficient for the reasons you mentioned.

I am actually not sure what would be sufficient -- I am toying with the idea of training a page-image classifier to filter pages without visuals/tables, but this is quite demanding. If you had any additional ideas I would love to hear them!

from thepipe.

sisyga commented on June 1, 2024

Hey, thanks for working to open-source the AI classifier. In the meantime, I use the following workaround:

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz  # PyMuPDF
        # extract the text of each page, plus a snapshot of pages that contain visuals
        with fitz.open(file_path) as doc:
            for page in doc:
                text = page.get_text()
                image = None
                if not text_only:
                    image_list = page.get_image_info()
                    drawing_count = len(page.get_drawings())
                    # only make a snapshot if there is an image or more than 5 drawing commands
                    if image_list or drawing_count > 5:
                        pix = page.get_pixmap()
                        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=image, source_type=SourceTypes.PDF))
    return chunks

Basically, I count the drawing commands on each page, and if the page contains an image or the count exceeds a threshold (here 5, which could be exposed as an option), I take an image snapshot. This works reasonably well because complex formulas and table lines also count toward the drawing commands, which is what I want.
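For illustration, here is a minimal sketch of how the threshold might be exposed as an option. The names `PageStats`, `needs_snapshot`, and `drawing_threshold` are hypothetical and not part of thepipe; the counts are assumed to come from `len(page.get_image_info())` and `len(page.get_drawings())`, as in the workaround above.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page counts gathered from the PDF library (hypothetical container)."""
    image_count: int    # e.g. len(page.get_image_info())
    drawing_count: int  # e.g. len(page.get_drawings())


def needs_snapshot(stats: PageStats, text_only: bool = False,
                   drawing_threshold: int = 5) -> bool:
    """Decide whether a page should be rendered to an image.

    A page qualifies if it embeds at least one image, or if it has more
    drawing commands than `drawing_threshold`. `text_only` disables
    snapshots entirely.
    """
    if text_only:
        return False
    return stats.image_count > 0 or stats.drawing_count > drawing_threshold
```

Keeping the decision in a small pure function would make the threshold easy to tune per document type and to test without opening a PDF. Whether 5 is a good default likely depends on the documents involved.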

