txtmarker's Introduction

Highlight text in documents


txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document and creates a modified version with the highlights embedded.

Current file formats supported:

  • pdf

Installation

The easiest way to install is via pip and PyPI:

pip install txtmarker

You can also install txtmarker directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtmarker

Python 3.8+ is supported.

Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

Notebooks

  • Introducing txtmarker: Overview of the functionality provided by txtmarker
  • Highlighting with Transformers: AI-driven highlighting with Transformers

Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

Create a new highlighter

from txtmarker.factory import Factory
highlighter = Factory.create("pdf")

extension

extension: string

Type of highlighter to create (e.g. pdf)

Optional constructor arguments:

formatter

formatter: callable

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
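As a minimal sketch of what a formatter could look like, the function below strips stray symbols and collapses whitespace. It assumes the formatter is called with a single string and returns the cleaned string, and that it is passed as a `formatter` keyword argument. Neither detail is confirmed in this section, and the names `clean` and `create_highlighter` are hypothetical.

```python
import re

def clean(text):
    """Replace uncommon symbols with spaces, then collapse runs of whitespace."""
    # Keep word characters, whitespace and basic punctuation; drop everything else
    text = re.sub(r"[^\w\s.,;:!?()'\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def create_highlighter():
    """Build a pdf highlighter that uses clean() as its formatter."""
    # Imported lazily so clean() can be tried without txtmarker installed.
    # Assumption: Factory.create accepts the formatter as a keyword argument.
    from txtmarker.factory import Factory
    return Factory.create("pdf", formatter=clean)
```

Because the same function is applied to both queries and document text, a formatter like this is best kept conservative so that it does not remove characters the queries rely on.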

chunks

chunks: int

Splits queries into multiple chunks. This is designed for very long text matches.
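To illustrate the idea of chunking a long query, here is a small sketch that splits text into roughly equal groups of words. This is not txtmarker's internal implementation, which is not described in this section; `split_chunks` is a hypothetical helper shown only to make the concept concrete.

```python
def split_chunks(text, chunks):
    """Split text into at most `chunks` pieces with a similar number of words."""
    words = text.split()
    if not words or chunks < 1:
        return []
    # Ceiling division so that no more than `chunks` pieces are produced
    size = -(-len(words) // chunks)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

pieces = split_chunks("a very long sentence that spans several lines", 2)
```

With the library itself, the equivalent setting would presumably be passed at creation time, for example `Factory.create("pdf", chunks=4)`, assuming the argument is accepted as a keyword.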

Highlight text

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])

infile

infile: string

Full path to input file

outfile

outfile: string

Full path to output file, i.e. the highlighted file

highlights

highlights: list of (string, string|regex)

List of highlight elements. Each pair has a name (can be None) and text value. The text can either be a string or a regular expression.
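The sketch below shows one way to build a highlights list that mixes a plain string with a regular expression. It assumes regular expressions are supplied as pattern strings, as in the issue reports further down this page. `flexible_pattern` is a hypothetical helper that lets a sentence match even when the document text breaks it across lines.

```python
import re

def flexible_pattern(sentence):
    """Return a regex string matching the sentence with any whitespace between words."""
    # Escape each word so punctuation matches literally, then allow
    # spaces, tabs or newlines between consecutive words
    return r"\s+".join(re.escape(word) for word in sentence.split())

highlights = [
    ("exact", "text to highlight"),                            # plain string
    (None, flexible_pattern("text that wraps across lines")),  # regex, no name
]
```

The resulting list has the shape expected by the third argument of `highlighter.highlight`.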

txtmarker's People

Contributors

davidmezzetti


txtmarker's Issues

Highlighter Can't Highlight All the Text in the Document

I tried to highlight the entire document using the list of sentences parsed by the txtai pipeline extractor, but not all of them were highlighted. Every sentence should be highlighted in this case. Can anyone help me?

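import re
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
highlights = [(None, re.escape(sent)) for sent in sent_list]
# infile/outfile are the PDF paths ("in" is a reserved word in Python)
highlighter.highlight(infile, outfile, highlights)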

Single word marking?

Hey dear devs,
this is a wonderful project, I have been looking for something like it for days.
I have a small question: is it possible to mark a single word,
and to mark that word every time it appears?
Have a nice day!

Support txt

Add support for highlighting matches in raw text files

How to increase text label size of annotation

This is a very interesting library and highly appreciated, but changing the default pdf-annotate text annotation size in pdf.py seems to have no effect on the label annotated in output.pdf.

Any help is deeply appreciated

Highlighter too slow to highlight sentences in pdf

I want to highlight some sentences in a PDF, but the process is too slow. Annotation takes 0.04 seconds per sentence on average, and in my use case I have to annotate thousands of sentences. For example, 1000 sentences take 1000 * 0.04 = 40 seconds, which is too slow. How can I speed up the annotation process?

Support docx

Add support for highlighting matches in docx files

Wrong annotation places

A fix is needed to correctly annotate query text that spans different pages, columns, or other positions in the PDF. In the screenshots, the annotator handles such text one page at a time rather than considering every position the text occupies. As a result, it annotates text that should not be annotated, because it only finds the text within its current scope. Annotations that cover text that should not be annotated also make the annotation indicators confusing.

Columns problem:

  • Query
    image
  • Annotations
    image

Pages problem:

  • Query
    image
  • Annotations
    image
    image

Unable to find and highlight all requested sentences

Hello,
I was looking for an automated way to highlight specific sentences in a PDF and was very happy to find your work.
I identify sentences based on their affinity to topics using an NLP tool and produce the dictionary to pass to highlighter.highlight.
However, highlight finds only a subset of the sentences, and I cannot understand what the issue might be.

For example, for the attached document:
P011117.pdf
I use the dictionary:
[('#255', 'Regression algo(.|\n)+utcomes\).'),
 ('#390', 'Likewise, the a(.|\n)+k models.'),
 ('#105', 'Firms usually h(.|\n)+tivities.'),
 ('#397', '2 Model risk ma(.|\n)+visible.'),
 ('#53', 'research highli(.|\n)+be used.'),
 ('#255', 'In order to sel(.|\n)+election.'),
 ('#397', 'In order to sup(.|\n)+e model\).'),
 ('#105', 'For example, cu(.|\n)+nd where?'),
 ('#394', 'Some supervisor(.|\n)+s launch.'),
 ('#43', 'Such liability(.|\n)+damages.')]

Output of highlight is
[('#255', (0.914, 0.118, 0.388), 9, 70.91999999999985, 614.436, 527.52, 641.436),
 ('#105', (0.129, 0.588, 0.953), 17, 70.91999999999996, 665.436, 527.5188, 692.436),
 ('#255', (1.0, 0.757, 0.027), 26, 70.92000000000002, 695.436, 527.5200000000001, 722.436),
 ('#397', (0.298, 0.686, 0.314), 31, 70.91999999999996, 290.436, 527.5211999999999, 362.436),
 ('#394', (0.404, 0.227, 0.718), 37, 70.9199999999999, 368.436, 527.52, 410.436)]
That is, only 5 of the 10 sentences were found.
As you can see, I am using a regex match for multiline text, because I found it more reliable than a basic multiline match.

Any suggestion about how to improve the reliability of sentence matching is greatly appreciated. Would really love to be able to use txtmarker! thank you!
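One way to see which requested highlights went unmatched is to compare names between the input list and the returned list. The sketch below assumes, based only on the output shown in this report, that the first element of each returned tuple is the highlight name. `missing_names` is a hypothetical helper and the sample data is shortened for illustration.

```python
from collections import Counter

def missing_names(requested, results):
    """Count, per name, how many requested highlights have no returned match."""
    wanted = Counter(name for name, _ in requested)
    found = Counter(row[0] for row in results)
    # Counter subtraction keeps only names with a positive remainder
    return wanted - found

requested = [("#255", "first pattern"), ("#390", "second pattern"), ("#255", "third pattern")]
results = [("#255", (0.914, 0.118, 0.388), 9, 70.92, 614.436, 527.52, 641.436)]
missing = missing_names(requested, results)
```

Since names can repeat, this reports how many highlights per name are missing rather than which exact pattern failed. Giving each pair a unique name makes the comparison exact.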

multiple marks in one page?

Hey @davidmezzetti ,

I'm back to bother you again lol.
Do you know how to mark multiple occurrences of one word on a single page?
So far I have only managed to mark the first occurrence of the word on each page; all the other occurrences remain unmarked.

best regards

Support html

Add support for highlighting matches in html files
