neuml / txtmarker

Highlight text in documents

License: Apache License 2.0

Python 96.20% Makefile 3.80%

txtmarker's Introduction

txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document, and creates a modified version with highlights embedded.

Current file formats supported: pdf

Installation

The easiest way to install is via pip and PyPI:

pip install txtmarker

You can also install txtmarker directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtmarker

Python 3.8+ is supported

Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

Notebooks

Notebook	Description
Introducing txtmarker	Overview of the functionality provided by txtmarker
Highlighting with Transformers	AI-driven highlighting with Transformers

Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

Create a new highlighter

from txtmarker.factory import Factory
highlighter = Factory.create("pdf")

extension

extension: string

Type of highlighter to create (e.g. pdf)

Optional constructor arguments:

formatter

formatter: callable

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.

chunks

chunks: int

Splits queries into multiple chunks. This is designed for very long text matches.
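
As a minimal sketch of what a formatter might look like, the function below normalizes text by dropping symbols and collapsing whitespace. The name clean and the specific cleanup rules are illustrative assumptions, not part of txtmarker; the only requirement taken from the description above is that the formatter is a callable applied to both queries and input text.

```python
import re


def clean(text):
    """Hypothetical formatter: drop symbols and collapse whitespace."""
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into a single space
    return re.sub(r"\s+", " ", text).strip()


print(clean("a-b  c\n(d)"))
```

Assuming the optional constructor arguments are passed through the factory call shown earlier, a formatter like this would presumably be supplied as Factory.create("pdf", formatter=clean).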

Highlight text

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])

infile

infile: string

Full path to input file

outfile

outfile: string

Full path to output file, i.e. the highlighted file

highlights

highlights: list of (string, string|regex)

List of highlight elements. Each pair has a name (can be None) and text value. The text can either be a string or a regular expression.
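
The sketch below shows one way such a highlights list might be assembled, mixing a literal string, a regular expression, and an entry whose name is None. The sample text and the entry names are made up for illustration, and the patterns are only checked with Python's standard re module; nothing here calls txtmarker itself.

```python
import re

# Sample text standing in for text extracted from a document (made up for illustration)
text = "Firms usually have\nmany activities. Costs rose by 3.5% (est.)."

highlights = [
    # Literal string: escape it so '.', '(' and ')' are not read as regex syntax
    ("literal", re.escape("3.5% (est.)")),
    # Regular expression: [\s\S]+? lets the match span a line break
    ("multiline", r"Firms usually[\s\S]+?activities\."),
    # The name may be None
    (None, "Costs rose"),
]

# Check each pattern against the sample text with the standard re module
matches = {name: bool(re.search(pattern, text)) for name, pattern in highlights}
print(matches)
```

A list built this way has the same shape as the third argument in the highlight call shown above.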

txtmarker's Issues

Highlighter Can't Highlight All the Text in the Document

I tried to highlight an entire document using the list of sentences parsed from it by the txtai pipeline extractor, but not all of them were highlighted. Since the list covers the whole document, everything should be highlighted. Can anyone help me?

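import re
from txtmarker.factory import Factory

highlighter = Factory.create("pdf")
# sent_list holds the sentences parsed from the document
highlights = [(None, re.escape(sent)) for sent in sent_list]
# infile and outfile are the input and output paths ("in" is a reserved word in Python)
highlighter.highlight(infile, outfile, highlights)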

Single word marking?

Hey dear devs,
this is a wonderful project, I have been looking for something like it for days.
However, I have a small question: is it possible to mark up a single word,
and to mark that word every time it appears?
Have a nice day!

Support txt

Add support for highlighting matches in raw text files

How to increase text label size of annotation

Very interesting lib, it is highly appreciated, but changing the default pdf-annotate text annotation size in pdf.py seems to have no effect on the label that is annotated in output.pdf.

Any help is deeply appreciated

Highlighter too slow to highlight sentences in pdf

I want to highlight some sentences in a pdf, but the process is too slow. Annotation takes 0.04 seconds per sentence on average. The problem is that in my use case I have to annotate thousands of sentences. For example, 1000 sentences take 1000 * 0.04 = 40 seconds, which is too slow. How can I speed up the annotation process?

Support docx

Add support for highlighting matches in docx files

Wrong annotation places

A fix is needed to correctly annotate query text that spans different pages, columns, or other layout positions in the pdf. In the screenshots, the annotator tries to annotate such text by considering only the current page rather than all the positions where the text is placed. As a result, it annotates text that should not be annotated, because it only finds the text within its current scope. Annotations that cover text that should not be annotated also make the annotation indicators confusing.

Columns problem:

Query
Annotations

Pages problem:

Query
Annotations

Unable to find and highlight all requested sentences

Hello,
I was looking for an automated way to highlight specific sentences in a pdf and I was very happy to find your work.
I am identifying sentences based on affinity to topics using an NLP tool and producing the dictionary to use in highlighter.highlight.
However, I find that highlight is not able to find all the sentences but only a subset of them, and I cannot understand what the issue might be.

For example, for the attached document:
P011117.pdf
I use the dictionary:
[('#255', 'Regression algo(.|\n)+utcomes\).'), ('#390', 'Likewise, the a(.|\n)+k models.'), ('#105', 'Firms usually h(.|\n)+tivities.'), ('#397', '2 Model risk ma(.|\n)+visible.'), ('#53', 'research highli(.|\n)+be used.'), ('#255', 'In order to sel(.|\n)+election.'), ('#397', 'In order to sup(.|\n)+e model\).'), ('#105', 'For example, cu(.|\n)+nd where?'), ('#394', 'Some supervisor(.|\n)+s launch.'), ('#43', 'Such liability(.|\n)+damages.')]

Output of highlight is
[('#255', (0.914, 0.118, 0.388), 9, 70.91999999999985, 614.436, 527.52, 641.436), ('#105', (0.129, 0.588, 0.953), 17, 70.91999999999996, 665.436, 527.5188, 692.436), ('#255', (1.0, 0.757, 0.027), 26, 70.92000000000002, 695.436, 527.5200000000001, 722.436), ('#397', (0.298, 0.686, 0.314), 31, 70.91999999999996, 290.436, 527.5211999999999, 362.436), ('#394', (0.404, 0.227, 0.718), 37, 70.9199999999999, 368.436, 527.52, 410.436)]
That is, only 5 out of 10 sentences have been found.
As you can see I am using regex match for multiline, because I found it more reliable than basic multiline match.
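
For reference, the behavior of a (.|\n)+ pattern like the ones above can be checked with Python's re module alone. The sketch below uses made-up text, not the contents of P011117.pdf, and is not a diagnosis of the missing matches; it only illustrates that the greedy form runs to the last occurrence of the closing fragment, while a non-greedy +? stops at the first.

```python
import re

# Made-up text standing in for extracted PDF text; not taken from P011117.pdf
text = "In order to select\nthe model, run election. Later text ends with election."

# Greedy: (.|\n)+ consumes as much as possible before the closing fragment
greedy = re.search(r"In order to sel(.|\n)+election\.", text)
# Non-greedy: (.|\n)+? consumes as little as possible
lazy = re.search(r"In order to sel(.|\n)+?election\.", text)

# The greedy match runs to the last "election.", the lazy match stops at the first
print(len(greedy.group(0)), len(lazy.group(0)))
```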

Any suggestion about how to improve the reliability of sentence matching is greatly appreciated. Would really love to be able to use txtmarker! thank you!

Is it possible to use single color on the word I input?

Hey dear devs,

I'd like to ask whether it is possible to assign a specific color to each word I give to the highlighter,
for example yellow for the word "first" and blue for the word "second".

have a nice day

multiple marks in one page?

Hey @davidmezzetti ,

I'm back to bother you again lol.
Do you know how to mark multiple occurrences of one word on a single page?
So far I have only managed to mark the first occurrence of the word on each page; all the other occurrences remain unmarked.

best regards

Support html

Add support for highlighting matches in html files