
obsidian-extract-pdf-highlights's Introduction

Extract your PDF text-highlights into Obsidian

This plugin allows you to extract highlighted and underlined text from your PDFs into a markdown file in your Obsidian vault.

How it works

After you've installed and activated the plugin:

  1. Drop your highlighted PDF into your Obsidian vault
  2. Open the PDF in Obsidian
  3. Click the "PDF" icon in the left sidebar

Demo with default settings

Simple

Demo with all optional settings turned on

Settings

Optional settings

  • Include page number (Default: off)
  • Include highlight color (Default: off)
  • Create links (Default: off)

Backlog

The list of features and improvements for this plugin.

ICEBOX

  • Record demo video, quick-start walk-through for new users

TODO

  • Auto-create notes from links with highlight/annotation as quote with backlink to source PDF
  • Group highlights by highlight color (Optional)
  • Add progress bar/modal to show "Processed Page 5/10 (50%)" or similar for longer PDFs
  • Fix missing space after newline (Very complex)

DOING

...

DONE

  • Refactor pdfjs import to not overload Obsidian worker (Ideas from @lishid?)
  • Show highlight color (Optional)
  • Auto-link list items (Optional)
  • Refactor/extract PDF from main.ts
  • Add Page-number to each highlight (Optional)
  • Sort highlights by position in document and page (Mandatory)
  • Extract unsorted list of HIGHLIGHT annotations
  • Extract unsorted list of TEXT annotations
  • Extract unsorted list of UNDERLINE annotations
  • Decide whether to integrate with existing Highlights Plugin

Contribute

I'd love to hear from you, so please check out the Contribution page to get in touch!

Major Thanks

This plugin stands on the shoulders of Joseph Devietti and his 2012 pull-request for PDFJS.

obsidian-extract-pdf-highlights's People

Contributors

akaalias

obsidian-extract-pdf-highlights's Issues

No effect when clicking pdf button

I tried the following:

  • install + activate plugin
  • open a pdf in PDF Expert
  • highlight a few sentences
  • copy the pdf into Obsidian vault
  • open in Obsidian (I can see the highlighted text)
  • click PDF button

Nothing happens. Does it matter which tool is used to create the highlights?

No effect on clicking extract PDF icon

Hello @akaalias,

Thank you for creating the plugin.

I have been trying to use it but it does not seem to work. These are the steps I have followed:

  1. I highlight the text in the PDF (in a PDF previewer or Acrobat Reader).
  2. I import the PDF into Obsidian by dragging it.
  3. I preview the PDF in Obsidian.
  4. I click the PDF icon on the left bar.
  5. No new file is created with the annotations.

Could you suggest how to address the issue?

Thank you.
Regards

Expose API to process PDFs

Hi there,

I recently built a plugin to pull in data from Zotero to Obsidian: https://github.com/mgmeyers/obsidian-zotero-desktop-connector

My plugin can access PDFs stored in Zotero, and I'd love to be able to send them to this plugin to extract highlights. I think I could do this if ExtractPDFHighlightsPlugin had something similar to processPDFHighlights that received an ArrayBuffer. That way I could load the PDF and send it to your plugin for processing.

Let me know what you think!
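
To make the request concrete, here is a minimal sketch of what such an entry point could look like. Everything in it is an assumption for illustration: the `HighlightProcessor` class, the `Extractor` type, and the output layout are hypothetical and are not taken from the plugin's source. The only idea borrowed from the request is a `processPDFHighlights` method that accepts an ArrayBuffer.

```typescript
// Hypothetical: a function that turns raw PDF bytes into highlight strings.
// The plugin's real extraction logic would be plugged in here.
type Extractor = (data: ArrayBuffer) => Promise<string[]>;

class HighlightProcessor {
  private readonly extract: Extractor;

  constructor(extract: Extractor) {
    this.extract = extract;
  }

  // Accepts raw PDF bytes, so a caller does not need a file in the vault.
  async processPDFHighlights(data: ArrayBuffer, sourceName: string): Promise<string> {
    const highlights = await this.extract(data);
    const items = highlights.map((h) => `- ${h}`);
    // Assumed layout: list of highlights, then a link back to the source PDF.
    return items.concat(["", "Source", "", `[[${sourceName}]]`]).join("\n");
  }
}

// Usage with a stand-in extractor that ignores the bytes.
const fakeExtractor: Extractor = async () => ["first highlight", "second highlight"];
const processor = new HighlightProcessor(fakeExtractor);
processor
  .processPDFHighlights(new ArrayBuffer(8), "paper.pdf")
  .then((markdown) => console.log(markdown));
```

Keeping file loading separate from processing means the same method could serve both PDFs in the vault and PDFs supplied by another plugin.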

Advice on the highlight-note's format

Hello, thanks for your excellent plugin! It's very useful for organizing my PDF documents.
I have a suggestion about the format of the highlight note. Would it be possible to give each highlight a wiki link with the page information, such as [[filename.pdf#page=n]]? That would make it quick and convenient to preview the context!
Thanks!

Allowing "screenshot" of the page inside a rectangle/square box to capture images/charts/diagrams

Benefits

Having a way to include a screenshot of the region inside a drawn square (or rectangle) would help with

  • Extracting useful images or charts off the pdf by simply drawing a rectangle around it.
  • Extracting handwritten scribbles or diagrams by simply enclosing them in a rectangle.

Image naming and handling

The extracted images could be saved directly to the asset/image/attachment location in the vault, named incrementally, by page number, or randomly, just as Obsidian does when pasting.

Example

PDF sample: (image)

Output sample: (image)

I have implemented this crudely in Python, but I am new to programming. If you could work on something like this, it would be really helpful, as I have a lot of diagrams that I currently screenshot and paste manually. This could make life easier for everyone who needs to extract images or diagrams.

Also:

  1. A post I wrote on the forum: https://forum.obsidian.md/t/discussion-extracting-annotations-from-pdfs/24411
  2. A pdfAnnotate library that I found which might be useful: https://github.com/highkite/pdfAnnotate

duplicate highlights when extracting a second time

I was trying to import a PDF book with 300 pages (5 MB). It took about a minute for the note with the extracted highlights to load.
After reading and highlighting more pages in the PDF I clicked the "PDF" button again and it appended all highlights below the original highlights.
Maybe my use case isn't a good use case for this plugin?
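
One possible way to avoid the duplication described here is to compare the newly extracted lines with what the note already contains before appending. The sketch below is only an illustration under stated assumptions: `mergeHighlights` is a hypothetical helper, not part of the plugin, and it assumes each highlight occupies one line and that an exact match after trimming counts as a duplicate.

```typescript
// Hypothetical helper: append only the highlight lines that the note
// does not already contain (exact match after trimming whitespace).
function mergeHighlights(existingNote: string, extracted: string[]): string {
  const seen = new Set<string>(existingNote.split("\n").map((l) => l.trim()));
  const fresh: string[] = [];
  for (const candidate of extracted) {
    const key = candidate.trim();
    if (key.length > 0 && !seen.has(key)) {
      seen.add(key); // also guards against duplicates within one extraction
      fresh.push(candidate);
    }
  }
  if (fresh.length === 0) {
    return existingNote; // nothing new, leave the note untouched
  }
  const needsNewline = existingNote.length > 0 && !existingNote.endsWith("\n");
  return existingNote + (needsNewline ? "\n" : "") + fresh.join("\n");
}

// Second extraction after highlighting one more passage.
const existing = "- first highlight\n- second highlight";
const merged = mergeHighlights(existing, [
  "- first highlight",
  "- second highlight",
  "- third highlight",
]);
console.log(merged);
```

Matching on exact text is the simplest rule; it would miss a highlight that was edited in the note after the first extraction.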

Accept pull request

Hi, can you please accept the pull request by Steven Kraft? It works a lot faster that way.

Output Pdf Page Num with Obsidian Style

After Obsidian version 0.10.8, Obsidian allows you to use [[book.pdf#page=3]] to jump to page 3 of a PDF.

So is it possible to add a feature to output those highlights and notes in this style?

BLABLABLABLA -- [[book.pdf#page=3]]
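
The requested output could be produced by a small formatting step like the one sketched here. The `Highlight` shape and both function names are hypothetical and chosen only for illustration; the link syntax `[[file.pdf#page=n]]` is the one quoted in this request.

```typescript
// Hypothetical shape for an extracted highlight; not the plugin's real type.
interface Highlight {
  text: string;
  page: number; // 1-based page number
}

// Build an Obsidian-style link that jumps to a page of the PDF.
function pageLink(fileName: string, page: number): string {
  return `[[${fileName}#page=${page}]]`;
}

// Render one highlight as a list item followed by its page link.
function formatHighlight(fileName: string, h: Highlight): string {
  return `- ${h.text.trim()} ${pageLink(fileName, h.page)}`;
}

const formatted = formatHighlight("book.pdf", { text: "BLABLABLABLA", page: 3 });
console.log(formatted); // - BLABLABLABLA [[book.pdf#page=3]]
```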

image extraction?

Hi folks, I often grab image selections of portions of PDFs (for things other than text, like mathematics, figures, plots, etc.). Is there any way to use this plugin to import these into my Obsidian note?

Highlights are extracted, annotations aren't

I opened a pdf of my tax draft, highlighted one line, and added an annotation.

The corresponding note only shows the highlight, not the annotation.

The note extracted is

  • [[New York's 529 college savings program deduction/earnings. (Page 42) ]]

Source

[[Client Copy Return for TF4089.pdf]]

Screenshot of the PDF highlight and annotation: (image, "Screenshot from 2021-04-26 11-59-05")

Problem with two-column PDF

Thank you @akaalias for this amazing contribution!
I have observed an issue with the ordering of extracted highlights in PDFs where the text is arranged in two columns and flows from the 1st column to the 2nd on each page. Specifically, the plugin appears to extract highlights in top-to-bottom order across the whole page. In a two-column PDF (which is common for research articles), this interleaves highlights from the two columns, so the flow of the text is broken in the extracted note.
For example, on the first page of a two-column PDF I highlight the text as follows:

  1. Column 1
     • lines 10 to 15
     • line 20
  2. Column 2
     • lines 2 to 5
     • line 18

In the extracted note, the ordering of the highlighted lines is shown as:

  • highlight in lines 2-5 from 2nd column
  • highlight in lines 10-15 from 1st column
  • highlight in line 18 from 2nd column
  • highlight in line 20 from 1st column

So, I am wondering whether any workaround can be made to tackle this issue!
Thanks again!
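
One possible workaround is to assign each highlight to a column before sorting, then order by page, column, and vertical position. The sketch below is only an illustration: the `PositionedHighlight` shape, the `sortTwoColumn` function, and the rule "left half of the page is column 1" are all assumptions, not the plugin's actual logic. It also assumes `top` is measured from the top of the page, which is not how raw PDF coordinates work (they start at the bottom-left).

```typescript
// Hypothetical shape; `left` and `top` are assumed to be in page units,
// with `top` measured downward from the top edge of the page.
interface PositionedHighlight {
  text: string;
  page: number;
  left: number;
  top: number;
}

// Sort by page, then column (left half before right half), then top to bottom.
function sortTwoColumn(
  highlights: PositionedHighlight[],
  pageWidth: number
): PositionedHighlight[] {
  const column = (h: PositionedHighlight): number => (h.left < pageWidth / 2 ? 0 : 1);
  return highlights
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.page - b.page || column(a) - column(b) || a.top - b.top);
}

// The example from this issue, with line numbers used as vertical positions.
const sample: PositionedHighlight[] = [
  { text: "col 2, lines 2-5", page: 1, left: 320, top: 2 },
  { text: "col 1, lines 10-15", page: 1, left: 40, top: 10 },
  { text: "col 2, line 18", page: 1, left: 320, top: 18 },
  { text: "col 1, line 20", page: 1, left: 40, top: 20 },
];
const ordered = sortTwoColumn(sample, 600).map((h) => h.text);
console.log(ordered);
// [ "col 1, lines 10-15", "col 1, line 20", "col 2, lines 2-5", "col 2, line 18" ]
```

A fixed midpoint split would misplace full-width elements such as titles or wide figures, so a real implementation would probably need column detection per page.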

Icon not showing in ribbon bar

The icon from the plugin is very faint in the Sidebar with the Minimal Theme. Is there any way for you to increase the contrast on the gif?

PDF extract with no spaces and no colors

Hello, I annotated a journal article with different colors. I turned on the optional checkboxes for page numbers and colors. However, in the file that is created, the PDF text has no spaces, my annotation comments are not extracted, and there are no color indications. Any solutions?

Unnecessary page rendering when extracting highlights

Hi, I noticed that await page.render(renderContext, annotations); gets called even if annotations.length is 0. Is there a reason for this?

I think making it so that only pages with annotations are rendered would greatly improve the time it takes to extract the highlights.

It probably depends on the PDF, but I measured about 100-200 ms per page to render, so a 500-page PDF takes 1-2 minutes regardless of the number of annotations.
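
The suggested change amounts to checking the annotation count before rendering. The sketch below illustrates the idea under stated assumptions: `PageLike` is a hypothetical stand-in for a PDF page, and the render step is passed in as a callback rather than calling the plugin's actual `page.render(renderContext, annotations)`.

```typescript
// Hypothetical minimal page interface. pdf.js pages do expose
// getAnnotations(), but this type is a stand-in for illustration only.
interface PageLike {
  pageNumber: number;
  getAnnotations(): Promise<unknown[]>;
}

// Render only pages that have annotations; return the page numbers rendered.
async function renderAnnotatedPages(
  pages: PageLike[],
  renderPage: (page: PageLike, annotations: unknown[]) => Promise<void>
): Promise<number[]> {
  const rendered: number[] = [];
  for (const page of pages) {
    const annotations = await page.getAnnotations();
    if (annotations.length === 0) {
      continue; // nothing to extract, so skip the costly render
    }
    await renderPage(page, annotations);
    rendered.push(page.pageNumber);
  }
  return rendered;
}

// Usage with fake pages: only page 2 carries an annotation.
const fakePages: PageLike[] = [
  { pageNumber: 1, getAnnotations: async () => [] },
  { pageNumber: 2, getAnnotations: async () => [{ subtype: "Highlight" }] },
  { pageNumber: 3, getAnnotations: async () => [] },
];
renderAnnotatedPages(fakePages, async () => {}).then((pagesRendered) =>
  console.log(pagesRendered) // [ 2 ]
);
```

With the timing reported above, a 500-page PDF with highlights on 10 pages would spend render time on 10 pages instead of 500.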
