lucasrla / remarks

Extract annotations (highlights and scribbles) from PDF, EPUB, and notebooks marked with reMarkable tablets. Export to Markdown, PDF, PNG, SVG

License: GNU General Public License v3.0

Python 100.00%

remarkable-tablet markdown pymupdf pdf-converter annotations highlighting pdf svg-images ocr ocrmypdf

remarks's Introduction

remarks

⚠️ remarks does NOT work with annotations created by reMarkable software >= 3.0 yet. Follow issue #58 for updates ⚠️

Extract annotations (text highlights and scribbles) and convert them to Markdown, PDF, PNG, and SVG.

remarks works with documents annotated on reMarkable™ paper tablets — both 1st and 2nd generation — up to software version 2.15.0.1067.

Note that remarks is not only highly experimental, it is also very likely to break after you update the software of your tablet. I might find some spare time to continue to maintain it, but I make no promises.

remarks code is fairly straightforward but not elegant at all. It has been put together in a couple of hours. You are free to fork and run with it though ;)

Most of the actual heavy lifting has been done by the open source community and PyMuPDF. See Credits and Acknowledgements.

Some use cases

In: PDF highlighted on reMarkable → Out: PDF with parseable highlights
Someone who highlights lots of PDFs (e.g., researchers, academics, etc) can export their highlights for processing with a reference management tool, like Zotero (e.g., issue #2).
Extract highlighted text from PDF to Markdown
Infovores of the world can export highlighted text to Markdown and insert it into their preferred "tool for networked thought", like Obsidian or Roam Research.
Export annotated PDF pages to full-page images
Sometimes having just the textual content is not enough, sometimes you need the actual (visual) context around your annotation. To help you in such situations, remarks can export each annotated PDF page to a PNG image file. Images can be easily uploaded or embedded anywhere, from personal websites to "tools for networked thought".

A visual example

Highlight and annotate PDFs with your Marker on your reMarkable tablet:

And then use remarks to export annotated pages to Markdown, PDF, PNG, or SVG on your computer:

[Images: annotated pages of the example document ("What Is Life?")]

Compatibility and dependencies

Because remarks depends only on PyMuPDF and Shapely, there is no need to install imagemagick, opencv, or any additional image library. Both PyMuPDF and Shapely have pre-built wheels for several platforms (macOS, Linux, Windows) and recent Python 3 versions, so installing them should be smooth and easy for most people.

I currently use remarks with reMarkable 1 and reMarkable 2 tablets running software versions 2.14.3.1047 and 2.15.0.1067 on macOS Ventura (13.2.1) with CPython 3.10.9. I don't have other equipment to test it thoroughly, but I expect remarks to work just fine in all common setups.

Incidentally, please help the community keep track of remarks compatibility across different setups:

If it is working well for you, make a quick comment with your setup
If you run into any problems, raise an issue

If OCRmyPDF is available on your computer, remarks may (optionally) use it to OCR PDFs before extracting their highlighted text.
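
One way to check whether OCRmyPDF is available is to look for its command-line executable on the PATH. The sketch below is only an illustration: `ocrmypdf_available` is a hypothetical helper, and remarks itself may detect OCRmyPDF differently.

```python
import shutil


def ocrmypdf_available(executable="ocrmypdf"):
    """Return True if the given executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None if it is missing
    return shutil.which(executable) is not None
```

Calling `ocrmypdf_available()` before running remarks tells you whether the optional OCR step can be expected to run on your machine.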

Setup

To get remarks up and running on your local machine, follow the instructions below:

1. Copy reMarkable's "raw" document files to your computer

In order to reconstruct your highlights and annotations, remarks relies on specific files that are created by the reMarkable device as you use it. Because these specific files are internal to the reMarkable device, first we need to transfer them to your computer.

There are several options for getting them to your computer. Find below some suggestions. Choose whatever fits you:

Use rsync (i)
Check out the repository @lucasrla/remarkable-utils for the SSH & rsync setup I use (which includes automatic backups based on cron).
Use scp (i)
On your reMarkable tablet, go to Menu > Settings > Help, then under About tap on Copyrights and licenses. In General information, right after the section titled "GPLv3 Compliance", there will be the username (root), password and IP address needed for SSHing into it. Using these credentials, scp the contents of /home/root/.local/share/remarkable/xochitl from your reMarkable to a directory on your computer. (Copying may take a while depending on the size of your document collection and the quality of your WiFi network.) To prevent any unintended interruptions, you can (optionally) switch off the Auto sleep feature in Menu > Settings > Battery before transferring your files.
Use @juruen/rmapi or @subutux/rmapy
Both are free and open source software that allow you to access your reMarkable tablet files through reMarkable's cloud service.
Copy from reMarkable's official desktop application
If you have reMarkable's official desktop app installed, most of the files we need are already easily available on your computer. For macOS users, the files are located at ~/Library/Application\ Support/remarkable/desktop. To avoid interfering with reMarkable's official app, copy and paste all the contents of ~/Library/Application\ Support/remarkable/desktop to another directory (one that you can safely interact with – say, ~/Documents/remarkable/docs). Please note that this method won't allow you to use remarks' EPUB functionality. That's because this directory doesn't seem to include the PDF files that reMarkable auto converts your EPUBs to.
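
For the desktop-app option, the copy step can also be scripted. This is a minimal sketch using only the standard library; `copy_raw_files` is a hypothetical helper name, and the two paths in the comment are the macOS examples given above.

```python
import shutil
from pathlib import Path


def copy_raw_files(src, dst):
    """Copy the whole source directory tree to dst and return dst as a Path."""
    src, dst = Path(src).expanduser(), Path(dst).expanduser()
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    # dirs_exist_ok=True (Python 3.8+) allows re-running the copy into an existing folder
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call (macOS paths from the text above):
# copy_raw_files("~/Library/Application Support/remarkable/desktop",
#                "~/Documents/remarkable/docs")
```

Working on a copy means remarks never touches the files the official app is using.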

2. Clone this repository and install the dependencies

### 2.1 Clone
git clone https://github.com/lucasrla/remarks.git && cd remarks


### 2.2 Create and activate a virtual environment

# If you're using poetry, a new virtual env should be created automatically (as set forth in our `poetry.toml`)
# But feel free to manage your virtual env needs with any of the alternatives (e.g. virtualenv, virtualenvwrapper, etc)


### 2.3 Install the dependencies

# Install dependencies with:
poetry install

Usage and Demo

Run remarks and check out what arguments are available:

python -m remarks --help

Next, for a quick hands-on experience of remarks, run the demo:

# Alan Turing's 1936 foundational paper (with a few highlights and scribbles)

# Original PDF file downloaded from:
# "On Computable Numbers, with an Application to the Entscheidungsproblem"
# https://londmathsoc.onlinelibrary.wiley.com/doi/abs/10.1112/plms/s2-42.1.230

python -m remarks demo/on-computable-numbers/xochitl remarks-example/ --per_page_targets png md pdf --modified_pdf

A few other examples:

# Assuming your `xochitl` files are at `~/backups/remarkable/xochitl/`

python -m remarks ~/backups/remarkable/xochitl/ example_1/ --ann_type highlights --per_page_targets md

python -m remarks ~/backups/remarkable/xochitl/ example_2/ --per_page_targets png
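
If you want to call remarks from another Python script, one option is to build the same command line and run it in a subprocess. The sketch below only uses flags that appear in the examples above; `build_remarks_command` and `run_remarks_cli` are hypothetical helper names, not part of remarks.

```python
import subprocess
import sys


def build_remarks_command(input_dir, output_dir, ann_type=None, per_page_targets=()):
    """Build the argument list for `python -m remarks` from the flags shown above."""
    cmd = [sys.executable, "-m", "remarks", str(input_dir), str(output_dir)]
    if ann_type:
        cmd += ["--ann_type", ann_type]
    if per_page_targets:
        cmd += ["--per_page_targets", *per_page_targets]
    return cmd


def run_remarks_cli(*args, **kwargs):
    """Run remarks in a subprocess; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_remarks_command(*args, **kwargs), check=True)
```

`run_remarks_cli` would need to be called from the root of the cloned repository, with the virtual environment active, so that `python -m remarks` can find the package.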

Tests

Run pytest in the root directory of the project after installing the dependencies using poetry. This will create files in the tests/out directory. The contents of this directory can safely be deleted.

Example:

python -m pytest -q remarks/test_initial.py
..                                         [100%]
2 passed in 2.51s

ls tests/out
  1936 On Computable Numbers, with an Application to the Entscheidungsproblem - A. M. Turing _highlights.md  
  Gosper _remarks.pdf
  1936 On Computable Numbers, with an Application to the Entscheidungsproblem - A. M. Turing _remarks.pdf

Credits and Acknowledgements

@JorjMcKie who wrote and maintains the great PyMuPDF
u/stucule who posted to r/RemarkableTablet the first account (that I could find online) about reverse engineering .rm files
@ax3l who wrote lines-are-rusty / lines-are-beautiful and also contributed to reverse engineering of .rm files
@edupont, @Liblor, @florian-wagner, and @jackjackk for their contributions to rM2svg
@ericsfraga, @jmiserez, @peerdavid, @phill777 and @lschwetlick for updating rM2svg to the most recent .rm format
@lschwetlick who wrote rMsync and also two blog posts about reMarkable-related software [1, 2]
@soulisalmed who wrote biff
@benlongo who wrote remarkable-highlights

For more reMarkable resources, check out awesome-reMarkable and remarkablewiki.com.

License

remarks is Free Software distributed under the GNU General Public License v3.0.

Disclaimers

This is a hobby project of an enthusiastic reMarkable user. There is no warranty whatsoever. Use it at your own risk.

The author(s) and contributor(s) are not associated with reMarkable AS, Norway. reMarkable is a registered trademark of reMarkable AS in some countries. Please see https://remarkable.com for their products.


remarks's Issues

Review the eraser tool

I've never really tested the eraser tool thoroughly. So I won't be surprised at all if remarks has issues with it...

Ref #63 (comment) and #63 (comment).

Upgrade to reMarkable 3.0

The upgrade to reMarkable 3.0 has broken this extractor. The first issue I found was that the pages are now more structured and nested in a "cPages" field. I was able to add a local fix for that, but the more fundamental issue is that the binary .rm file format seems to have changed substantially.

Are there any plans to modify remarks to work with 3.0? Is there any information I can provide which would help? (Also, if there are pointers to how the binary format was reverse engineered in the past, then I am happy to help do so for 3.0.)
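
A local fix for the page list could look roughly like the sketch below. Only the "cPages" field name comes from this issue; the exact nesting (a "pages" list of objects with an "id" key inside "cPages", versus a flat "pages" list of ID strings in older files) is an assumption, and `get_page_ids` is a hypothetical helper. It does nothing about the changed binary .rm format.

```python
def get_page_ids(content):
    """Return page IDs from a parsed .content dict, for either layout.

    ASSUMPTION: older files have a flat "pages" list of ID strings, while
    newer files have "cPages" -> "pages" -> list of dicts with an "id" key.
    """
    c_pages = content.get("cPages")
    if isinstance(c_pages, dict):
        # Newer layout: collect the "id" of every page entry that has one
        return [page["id"] for page in c_pages.get("pages", []) if "id" in page]
    # Older layout: the list already holds the IDs (or is missing entirely)
    return list(content.get("pages") or [])
```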

Publish to PyPI

Any plans to publish this to PyPI?

I'm about to publish an open source project that uses this library and realized this isn't published on PyPI 😅 !

I'll open a PR shortly to prepare the repo for publication... but looking to know if there was a reason this was not a priority?

Thanks @lucasrla !

Multiple, wrong highlights in certain edge cases

What I mean by this are highlight patterns like those seen in this image:

I have come across this same issue when working on zotero2remarkable_bridge, and it seems we are using very similar approaches for highlight generation: use PyMuPDF's search feature to find the extracted text from the .highlights file and then create an annotation based on the quads from this search. The issue comes from the fact that if PyMuPDF finds multiple occurrences of the same string, it returns all of them, so generating highlights from these quads highlights every occurrence. Most of the time strings are sufficiently unique for this not to be an issue, but edge cases exist. In the picture above, for example, I highlighted the phrase "recent research indicates ..." in the first column, but as "recent" was broken over two lines, reMarkable created two highlights, one with the text "re" and another with the text "cent research ...". "re" was then found all over the text because it is such a common sequence of letters.

I must admit that I never got around to finding a solution to this problem for my own program, which is why I would like to see if we could come up with one together. My line of thinking so far has been that one could probably use the "rect" or "start" and "length" data from the .highlights file to find the correct sequence when there is more than one result, but I could never get it to work...
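
To make the "rect" idea concrete, here is a minimal sketch that keeps only the search hit closest to the rectangle stored in the .highlights file. It is not a tested solution: it assumes both rectangles are plain (x0, y0, x1, y1) tuples already converted to the same coordinate space, which may well be the hard part in practice, and `pick_closest_hit` is a hypothetical helper.

```python
def rect_center(rect):
    """Return the center (x, y) of a rectangle given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def pick_closest_hit(hits, expected_rect):
    """Pick the search hit whose center is nearest to the expected rectangle's center.

    ASSUMPTION: hits and expected_rect are already in the same coordinate space.
    Returns None when there are no hits.
    """
    if not hits:
        return None
    ex, ey = rect_center(expected_rect)

    def squared_distance(hit):
        hx, hy = rect_center(hit)
        return (hx - ex) ** 2 + (hy - ey) ** 2

    return min(hits, key=squared_distance)
```

With this, a short string like "re" would still match all over the page, but only the match nearest to the highlight's own rectangle would be turned into an annotation.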

Skipping out_path.mkdir step when processing paths in directories and per_page_targets isn't used

Command (MacOS):
python3 -m remarks "/Users/../Library/Application Support/remarkable/desktop" ./exper-output/ --file_path zotero

Exception:

Traceback (most recent call last):
 File "<frozen runpy>", line 198, in _run_module_as_main
 File "<frozen runpy>", line 88, in _run_code
 File "/Users/../Documents/temp/remarks/remarks/__main__.py", line 151, in <module>
   main()
 File "/Users/../Documents/temp/remarks/remarks/__main__.py", line 147, in main
   run_remarks(input_dir, output_dir, **args_dict)
 File "/Users/../Documents/temp/remarks/remarks/remarks.py", line 92, in run_remarks
   process_document(metadata_path, out_path, doc_type, **kwargs)
 File "/Users/../Documents/temp/remarks/remarks/remarks.py", line 432, in process_document
   pdf_src.save(f"{out_doc_path_str} _remarks.pdf")
 File "/Users/../Documents/temp/remarks/.venv/lib/python3.12/site-packages/fitz/__init__.py", line 5392, in save
   mupdf.pdf_save_document(pdf, filename, opts)
 File "/Users/../Documents/temp/remarks/.venv/lib/python3.12/site-packages/fitz/mupdf.py", line 50393, in pdf_save_document
   return _mupdf.pdf_save_document(doc, filename, opts)
fitz.mupdf.FzErrorSystem: code=2: cannot open file 'exper-output-1/zotero/my_file.pdf _remarks.pdf': No such file or directory

Debug:

out_path.parent: exper-output/zotero
per_page_targets: []
has_ann: True
has_smart_hl: True

per_page_targets being an empty list means out_path.mkdir isn't being executed. Adding out_path.mkdir again before the point where files are saved avoids the exception. (If the per_page_targets argument is required, I think this should be reported to the user instead of raising a file-related exception.)
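
The workaround can be sketched as a small guard that runs right before any file is written. `ensure_parent_dir` is a hypothetical helper, not code from remarks; it only shows the `pathlib` call involved.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing; return the Path."""
    file_path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes the call idempotent
    file_path.parent.mkdir(parents=True, exist_ok=True)
    return file_path
```

Calling it on the target path before saving means the save no longer depends on whether per_page_targets was given.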

Export either just scribbles or just highlights on the pdf

Might be useful to choose which type of annotations to generate on pdf pages. See this discussion

Syntax Error after installing

I cloned the repository like this: git clone https://github.com/lucasrla/remarks.git && cd remarks
Then I installed the requirements: pip install -r requirements.txt

But then, when I tried running remarks, with this:
python -m remarks --help

I get this:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 111, in _get_module_details
    __import__(mod_name)  # Do not catch exceptions initializing package
  File "remarks/__init__.py", line 1, in <module>
    from . import conversion
  File "remarks/conversion/__init__.py", line 1, in <module>
    from .parsing import (
  File "remarks/conversion/parsing.py", line 106
    name_code = f"{tool}_{pen}"

I'm not sure if the problem is me installing it wrong, but it would be great if somebody could help.
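
The paths in the traceback above point to a Python 2.7 interpreter, while the line it stops on is an f-string, a syntax that only exists in Python 3.6 and later. A quick way to confirm which interpreter you are running is a version check like the sketch below (`supports_f_strings` is a hypothetical helper).

```python
import sys


def supports_f_strings(version_info=None):
    """Return True if the given interpreter version can parse f-strings (3.6+)."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); f-strings were added in Python 3.6
    return tuple(version_info[:2]) >= (3, 6)
```

If this returns False for your `python`, running the module with a Python 3 interpreter instead (for example `python3 -m remarks --help`) is the likely fix.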

Annotations at wrong position

I believe this is happening with pdf files in landscape format. One test example is https://pressable.docraptor.com/wp-content/uploads/2012-12-10-landscape-format.pdf . The extracted annotations are positioned at the wrong location and are rotated. Depending on the pdf, I have seen 90 degree and 180 degree rotations.
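
For reference, this is how a point moves when a page is rotated by a multiple of 90 degrees. It is only a sketch to help reason about the symptom: whether page rotation is the actual cause here is not confirmed, `rotate_point` is a hypothetical helper, and the coordinate conventions are assumptions stated in the docstring.

```python
def rotate_point(x, y, width, height, rotation):
    """Map (x, y) on an unrotated page to its position after a clockwise page rotation.

    ASSUMPTIONS: origin at the top-left corner, y grows downward, width/height
    describe the unrotated page, and rotation is a multiple of 90 degrees.
    """
    rotation %= 360
    if rotation == 0:
        return (x, y)
    if rotation == 90:
        return (height - y, x)
    if rotation == 180:
        return (width - x, height - y)
    if rotation == 270:
        return (y, width - x)
    raise ValueError("rotation must be a multiple of 90 degrees")
```

If annotations are drawn without applying (or undoing) a mapping like this, they would appear displaced and rotated by exactly the 90 or 180 degrees described above.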

Extracting annotated PDFs from `rmapi`

Hi, thank you so much for working on this project!

I am using a new Remarkable 2 with the rmapi toolchain to get PDFs and docs from my device. I expected to be able to run:

rmapi get Document # get document from remarkable
unzip Document.zip -d document #unzip into `document` dir
python -m remarks document output --combined_pdf #export to combined pdf in `output` dir

But because the current rmapi get command does not export the metadata file for the document, remarks fails with no error (but creates the output directory). Here's the output of unzipping the rmapi get zip file:

Archive:  Document.zip
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd.content  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd.pagedata  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/0.rm  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/0-metadata.json  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/1.rm  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/1-metadata.json  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/2.rm  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd/2-metadata.json  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd.highlights/9cc49bf2-0e47-4e0a-8c4b-2cfb45362def.json  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd.highlights/164af9e1-0b5b-444d-95fe-b22732563d10.json  
  inflating: f547ab4e-23aa-420a-83ae-1de02555d6fd.pdf

If I look up a document ID using rmapi stat, and manually copy the files matching that ID from my remarkable desktop app library directory, then run remarks, the export works perfectly.

I imagine this might be something I should bring up with the rmapi devs, but wanted to flag it here in case someone else has the same issue, and in case someone here can point me in the right direction!
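
Until that is sorted out, a quick check before running remarks can at least make the silent failure visible. The sketch below assumes each document is stored as `<id>.content` plus `<id>.metadata` (the `.metadata` name is an assumption based on the missing "metadata file" described above); `find_missing_metadata` is a hypothetical helper.

```python
from pathlib import Path


def find_missing_metadata(directory):
    """Return sorted document IDs that have a .content file but no .metadata file.

    ASSUMPTION: each document is stored as <id>.content plus <id>.metadata.
    """
    directory = Path(directory)
    return sorted(
        content_file.stem
        for content_file in directory.glob("*.content")
        if not content_file.with_suffix(".metadata").exists()
    )
```

For the unzipped archive listed above, this would report the document's ID, since the zip contains a `.content` file but no matching metadata file.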

Handwritten annotations searchable in output PDF

(This is far from trivial but proposing it here as an enhancement, as this is the nearest project to it to my knowledge).

It would be very useful to have HWR-texts (obtained in some way* from the texts handwritten on reMarkable on blank pages or annotations on PDF) placed on each page of the output PDF as invisible (searchable) text, so that the search tool also considers the handwritten notes.

If the original PDF has been completely handwritten on reMarkable from scratch, this would make it searchable to some degree: not with a word-by-word matching of image and invisible text area, but enough to find a page.
If the original PDF is, e.g., a scientific paper with notes handwritten on reMarkable, this would make the search also consider the handwritten notes.

*The HWR-texts to use as input could be extracted by processing an email obtained from [email protected] when using the HWR feature on the reMarkable, or maybe using the myscript.com service.

Change default page offset to 1

I think that while 0 seems the logical default offset from a programmer's perspective, 1 would actually be a better default as it would fit the average document pagination better. I can make a pull request, but as it's not a big change I wanted to bring it up here first.

Thanks for all the great work, sorry to bother you again!

How to run?

Sorry bit of a noob but how to run?

python3 -m remarks --pdf_name c:\Users\Lee\Downloads\test.pdf --ann_type highlights --targets md --combined_pdf

remarks: error: the following arguments are required: INPUT_DIRECTORY, OUTPUT_DIRECTORY

Didn't notice the input or output directory specified in the examples
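
The error means the two directories are positional arguments and must always be given. The sketch below mimics that behaviour with `argparse`; it is an illustration, not the actual parser from remarks, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a small parser that mimics the two required positional arguments."""
    parser = argparse.ArgumentParser(prog="remarks")
    # Positional arguments are required by default in argparse
    parser.add_argument("input_dir", metavar="INPUT_DIRECTORY")
    parser.add_argument("output_dir", metavar="OUTPUT_DIRECTORY")
    # Optional flag taken from the command in this issue
    parser.add_argument("--ann_type", default=None)
    return parser
```

So the command needs the directory with the reMarkable "raw" files and an output directory first, followed by any options, along the lines of `python3 -m remarks INPUT_DIRECTORY OUTPUT_DIRECTORY --ann_type highlights`.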

pip install fails when installing pymupdf on macOS Big Sur

I want to export annotations from my PDFs on my reMarkable 2, so I'm trying to install this, but it fails:

$% sudo -H pip3 install -r ./requirements.txt
Collecting pymupdf==1.17.4
  Using cached PyMuPDF-1.17.4.tar.gz (202 kB)
Collecting shapely==1.7.0
  Using cached Shapely-1.7.0.tar.gz (349 kB)
Using legacy 'setup.py install' for pymupdf, since package 'wheel' is not installed.
Using legacy 'setup.py install' for shapely, since package 'wheel' is not installed.
Installing collected packages: pymupdf, shapely
    Running setup.py install for pymupdf ... error
    ERROR: Command errored out with exit status 1:
     command: /opt/remarks/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/tmp/pip-install-i49yvpm6/pymupdf/setup.py'"'"'; __file__='"'"'/private/tmp/pip-install-i49yvpm6/pymupdf/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/tmp/pip-record-bdzeg8ed/install-record.txt --single-version-externally-managed --compile --install-headers /opt/remarks/venv/include/site/python3.9/pymupdf
         cwd: /private/tmp/pip-install-i49yvpm6/pymupdf/
    Complete output (209 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-11-x86_64-3.9
    creating build/lib.macosx-11-x86_64-3.9/fitz
    copying fitz/__init__.py -> build/lib.macosx-11-x86_64-3.9/fitz
    copying fitz/fitz.py -> build/lib.macosx-11-x86_64-3.9/fitz
    copying fitz/utils.py -> build/lib.macosx-11-x86_64-3.9/fitz
    copying fitz/__main__.py -> build/lib.macosx-11-x86_64-3.9/fitz
    running build_ext
    building 'fitz._fitz' extension
    creating build/temp.macosx-11-x86_64-3.9
    creating build/temp.macosx-11-x86_64-3.9/fitz
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/usr/local/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include/mupdf -I/usr/local/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/opt/remarks/venv/include -I/usr/local/Cellar/[email protected]/3.9.0_5/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c fitz/fitz_wrap.c -o build/temp.macosx-11-x86_64-3.9/fitz/fitz_wrap.o
    fitz/fitz_wrap.c:4381:30: warning: expression result unused [-Wunused-value]
                if (dest->alpha) *s++;
                                 ^~~~
    fitz/fitz_wrap.c:4380:19: warning: unsequenced modification and access to 's' [-Wunsequenced]
                    *s++ = 255 - *s;
                      ^           ~
    fitz/fitz_wrap.c:4403:11: warning: assigning to 'unsigned char *' from 'char *' converts between pointers to integer types with different sign [-Wpointer-sign]
            c = PyBytes_AS_STRING(imagedata);
              ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fitz/fitz_wrap.c:4407:11: warning: assigning to 'unsigned char *' from 'char *' converts between pointers to integer types with different sign [-Wpointer-sign]
            c = PyByteArray_AS_STRING(imagedata);
              ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fitz/fitz_wrap.c:4867:18: warning: unused variable 'popup' [-Wunused-variable]
            pdf_obj *popup = pdf_dict_get(ctx, annot->obj, PDF_NAME(Popup));
                     ^
    fitz/fitz_wrap.c:4959:14: warning: unused variable 'name' [-Wunused-variable]
        pdf_obj *name = NULL;
                 ^
    fitz/fitz_wrap.c:5052:12: warning: unused variable 'len' [-Wunused-variable]
        size_t len = 0;
               ^
    fitz/fitz_wrap.c:5696:47: warning: passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign [-Wpointer-sign]
        res = fz_new_buffer_from_copied_data(ctx, data, strlen(data));
                                                  ^~~~
    /usr/local/include/mupdf/fitz/buffer.h:83:81: note: passing argument to parameter 'data' here
    fz_buffer *fz_new_buffer_from_copied_data(fz_context *ctx, const unsigned char *data, size_t size);
                                                                                    ^
    fitz/fitz_wrap.c:6024:19: warning: unused variable 'pdf' [-Wunused-variable]
        pdf_document *pdf = pdf_get_bound_document(ctx, annot->obj);
                      ^
    fitz/fitz_wrap.c:6081:16: warning: unused variable 'res' [-Wunused-variable]
        fz_buffer *res = NULL;
                   ^
    fitz/fitz_wrap.c:6080:27: warning: unused variable 'js' [-Wunused-variable]
        pdf_obj *obj = NULL, *js = NULL, *o = NULL;
                              ^
    fitz/fitz_wrap.c:6080:39: warning: unused variable 'o' [-Wunused-variable]
        pdf_obj *obj = NULL, *js = NULL, *o = NULL;
                                          ^
    fitz/fitz_wrap.c:6486:64: warning: passing 'char [3]' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign [-Wpointer-sign]
                               fz_new_buffer_from_copied_data(ctx, "  ", 1),
                                                                   ^~~~
    /usr/local/include/mupdf/fitz/buffer.h:83:81: note: passing argument to parameter 'data' here
    fz_buffer *fz_new_buffer_from_copied_data(fz_context *ctx, const unsigned char *data, size_t size);
                                                                                    ^
    fitz/fitz_wrap.c:7408:40: warning: declaration of 'struct Document' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN void delete_Document(struct Document *self){
                                           ^
    fitz/fitz_wrap.c:8030:17: warning: unused variable 'entry' [-Wunused-variable]
                int entry = 0;
                    ^
    fitz/fitz_wrap.c:8150:17: warning: unused variable 'page_n' [-Wunused-variable]
                int page_n = -1;
                    ^
    fitz/fitz_wrap.c:8380:68: warning: passing 'unsigned char *' to parameter of type 'const char *' converts between pointers to integer types with different sign [-Wpointer-sign]
                            LIST_APPEND_DROP(idlist, JM_UnicodeFromStr(hex));
                                                                       ^~~
    fitz/fitz_wrap.c:3448:41: note: passing argument to parameter 'c' here
    PyObject *JM_UnicodeFromStr(const char *c)
                                            ^
    fitz/fitz_wrap.c:8626:17: warning: unused variable 'cwlen' [-Wunused-variable]
                int cwlen = 0;
                    ^
    fitz/fitz_wrap.c:8630:36: warning: unused variable 'fb_font' [-Wunused-variable]
                fz_font *font = NULL, *fb_font= NULL;
                                       ^
    fitz/fitz_wrap.c:8627:17: warning: unused variable 'lang' [-Wunused-variable]
                int lang = 0;
                    ^
    fitz/fitz_wrap.c:8743:24: warning: unused variable 'len' [-Wunused-variable]
                Py_ssize_t len = 0;
                           ^
    fitz/fitz_wrap.c:9325:69: warning: passing 'char [3]' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign [-Wpointer-sign]
                                   fz_new_buffer_from_copied_data(gctx, "  ", 1), NULL, 0);
                                                                        ^~~~
    /usr/local/include/mupdf/fitz/buffer.h:83:81: note: passing argument to parameter 'data' here
    fz_buffer *fz_new_buffer_from_copied_data(fz_context *ctx, const unsigned char *data, size_t size);
                                                                                    ^
    fitz/fitz_wrap.c:9284:22: warning: unused variable 'page2' [-Wunused-variable]
                pdf_obj *page2 = NULL;
                         ^
    fitz/fitz_wrap.c:9361:26: warning: unused variable 'page2' [-Wunused-variable]
                    pdf_obj *page2 = pdf_lookup_page_loc(gctx, pdf, nb, &parent2, &i2);
                             ^
    fitz/fitz_wrap.c:9504:29: warning: unused variable 'seps' [-Wunused-variable]
                fz_separations *seps = NULL;
                                ^
    fitz/fitz_wrap.c:9943:104: warning: declaration of 'struct Colorspace' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN struct Pixmap *Page__makePixmap(struct Page *self,struct Document *doc,PyObject *ctm,struct Colorspace *cs,int alpha,int annots,PyObject *clip){
                                                                                                           ^
    fitz/fitz_wrap.c:10204:142: warning: declaration of 'struct Graftmap' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN PyObject *Page__showPDFpage(struct Page *self,struct Page *fz_srcpage,int overlay,PyObject *matrix,int xref,PyObject *clip,struct Graftmap *graftmap,char *_imgname){
                                                                                                                                                 ^
    fitz/fitz_wrap.c:10276:29: warning: unused variable 'seps' [-Wunused-variable]
                fz_separations *seps = NULL;
                                ^
    fitz/fitz_wrap.c:10494:35: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
                        for (i = 0; i < n; i++) {
                                    ~ ^ ~
    fitz/fitz_wrap.c:10542:53: warning: declaration of 'struct Colorspace' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_0(struct Colorspace *cs,PyObject *bbox,int alpha){
                                                        ^
    fitz/fitz_wrap.c:10552:53: warning: declaration of 'struct Colorspace' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_1(struct Colorspace *cs,struct Pixmap *spix){
                                                        ^
    fitz/fitz_wrap.c:10621:53: warning: declaration of 'struct Colorspace' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_4(struct Colorspace *cs,int w,int h,PyObject *samples,int alpha){
                                                        ^
    fitz/fitz_wrap.c:10633:32: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
                    if (stride * h != size) THROWMSG("bad samples length");
                        ~~~~~~~~~~ ^  ~~~~
    fitz/fitz_wrap.c:11683:19: warning: unused variable 'data' [-Wunused-variable]
                char *data = NULL;              // for new file content
                      ^
    fitz/fitz_wrap.c:11686:21: warning: unused variable 'size' [-Wunused-variable]
                int64_t size = 0;
                        ^
    fitz/fitz_wrap.c:12267:40: warning: declaration of 'struct Graftmap' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN void delete_Graftmap(struct Graftmap *self){
                                           ^
    fitz/fitz_wrap.c:12284:42: warning: declaration of 'struct TextWriter' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN void delete_TextWriter(struct TextWriter *self){
                                             ^
    fitz/fitz_wrap.c:12299:96: warning: declaration of 'struct Font' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN PyObject *TextWriter_append(struct TextWriter *self,PyObject *pos,char *text,struct Font *font,float fontsize,char *language,int wmode,int bidi_level){
                                                                                                   ^
    fitz/fitz_wrap.c:12362:36: warning: declaration of 'struct Font' will not be visible outside of this function [-Wvisibility]
    SWIGINTERN void delete_Font(struct Font *self){
                                       ^
    fitz/fitz_wrap.c:12739:19: warning: incompatible pointer types passing 'struct Document *' to parameter of type 'struct Document *' [-Wincompatible-pointer-types]
      delete_Document(arg1);
                      ^~~~
    fitz/fitz_wrap.c:7408:50: note: passing argument to parameter 'self' here
    SWIGINTERN void delete_Document(struct Document *self){
                                                     ^
    fitz/fitz_wrap.c:16769:63: warning: incompatible pointer types passing 'struct Colorspace *' to parameter of type 'struct Colorspace *' [-Wincompatible-pointer-types]
        result = (struct Pixmap *)Page__makePixmap(arg1,arg2,arg3,arg4,arg5,arg6,arg7);
                                                                  ^~~~
    fitz/fitz_wrap.c:9943:116: note: passing argument to parameter 'cs' here
    SWIGINTERN struct Pixmap *Page__makePixmap(struct Page *self,struct Document *doc,PyObject *ctm,struct Colorspace *cs,int alpha,int annots,PyObject *clip){
                                                                                                                       ^
    fitz/fitz_wrap.c:17217:74: warning: incompatible pointer types passing 'struct Graftmap *' to parameter of type 'struct Graftmap *' [-Wincompatible-pointer-types]
        result = (PyObject *)Page__showPDFpage(arg1,arg2,arg3,arg4,arg5,arg6,arg7,arg8);
                                                                             ^~~~
    fitz/fitz_wrap.c:10204:152: note: passing argument to parameter 'graftmap' here
    SWIGINTERN PyObject *Page__showPDFpage(struct Page *self,struct Page *fz_srcpage,int overlay,PyObject *matrix,int xref,PyObject *clip,struct Graftmap *graftmap,char *_imgname){
                                                                                                                                                           ^
    fitz/fitz_wrap.c:17603:50: warning: incompatible pointer types passing 'struct Colorspace *' to parameter of type 'struct Colorspace *' [-Wincompatible-pointer-types]
        result = (struct Pixmap *)new_Pixmap__SWIG_0(arg1,arg2,arg3);
                                                     ^~~~
    fitz/fitz_wrap.c:10542:65: note: passing argument to parameter 'cs' here
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_0(struct Colorspace *cs,PyObject *bbox,int alpha){
                                                                    ^
    fitz/fitz_wrap.c:17638:50: warning: incompatible pointer types passing 'struct Colorspace *' to parameter of type 'struct Colorspace *' [-Wincompatible-pointer-types]
        result = (struct Pixmap *)new_Pixmap__SWIG_1(arg1,arg2);
                                                     ^~~~
    fitz/fitz_wrap.c:10552:65: note: passing argument to parameter 'cs' here
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_1(struct Colorspace *cs,struct Pixmap *spix){
                                                                    ^
    fitz/fitz_wrap.c:17777:50: warning: incompatible pointer types passing 'struct Colorspace *' to parameter of type 'struct Colorspace *' [-Wincompatible-pointer-types]
        result = (struct Pixmap *)new_Pixmap__SWIG_4(arg1,arg2,arg3,arg4,arg5);
                                                     ^~~~
    fitz/fitz_wrap.c:10621:65: note: passing argument to parameter 'cs' here
    SWIGINTERN struct Pixmap *new_Pixmap__SWIG_4(struct Colorspace *cs,int w,int h,PyObject *samples,int alpha){
                                                                    ^
    fitz/fitz_wrap.c:21495:19: warning: incompatible pointer types passing 'struct Graftmap *' to parameter of type 'struct Graftmap *' [-Wincompatible-pointer-types]
      delete_Graftmap(arg1);
                      ^~~~
    fitz/fitz_wrap.c:12267:50: note: passing argument to parameter 'self' here
    SWIGINTERN void delete_Graftmap(struct Graftmap *self){
                                                     ^
    fitz/fitz_wrap.c:21557:21: warning: incompatible pointer types passing 'struct TextWriter *' to parameter of type 'struct TextWriter *' [-Wincompatible-pointer-types]
      delete_TextWriter(arg1);
                        ^~~~
    fitz/fitz_wrap.c:12284:54: note: passing argument to parameter 'self' here
    SWIGINTERN void delete_TextWriter(struct TextWriter *self){
                                                         ^
    fitz/fitz_wrap.c:21678:59: warning: incompatible pointer types passing 'struct Font *' to parameter of type 'struct Font *' [-Wincompatible-pointer-types]
        result = (PyObject *)TextWriter_append(arg1,arg2,arg3,arg4,arg5,arg6,arg7,arg8);
                                                              ^~~~
    fitz/fitz_wrap.c:12299:102: note: passing argument to parameter 'font' here
    SWIGINTERN PyObject *TextWriter_append(struct TextWriter *self,PyObject *pos,char *text,struct Font *font,float fontsize,char *language,int wmode,int bidi_level){
                                                                                                         ^
    fitz/fitz_wrap.c:21817:15: warning: incompatible pointer types passing 'struct Font *' to parameter of type 'struct Font *' [-Wincompatible-pointer-types]
      delete_Font(arg1);
                  ^~~~
    fitz/fitz_wrap.c:12362:42: note: passing argument to parameter 'self' here
    SWIGINTERN void delete_Font(struct Font *self){
                                             ^
    49 warnings generated.
    clang -bundle -undefined dynamic_lookup -L/usr/local/lib -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk build/temp.macosx-11-x86_64-3.9/fitz/fitz_wrap.o -L/usr/local/lib -L/usr/local/lib -L/usr/local/opt/[email protected]/lib -L/usr/local/opt/sqlite/lib -lmupdf -lmupdf-third -o build/lib.macosx-11-x86_64-3.9/fitz/_fitz.cpython-39-darwin.so
    ld: library not found for -lmupdf
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command '/usr/bin/clang' failed with exit code 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/remarks/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/tmp/pip-install-i49yvpm6/pymupdf/setup.py'"'"'; __file__='"'"'/private/tmp/pip-install-i49yvpm6/pymupdf/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/tmp/pip-record-bdzeg8ed/install-record.txt --single-version-externally-managed --compile --install-headers /opt/remarks/venv/include/site/python3.9/pymupdf Check the logs for full command output.

I had a bunch of other installation issues before, which I fixed. First off, I had to brew install geos to get past the first issue. Then fitz.c was missing, because apparently pymupdf needs mupdf, and according to this comment by the author of PyMuPDF it needs the correct version (same major and minor version). Since remarks uses PyMuPDF 1.17.4, I downloaded MuPDF 1.17.0, extracted it, and moved the mupdf folder in the include folder into /usr/local/include. This didn't work, so I followed this other comment by the author of PyMuPDF and replaced /usr/local/include/mupdf/fitz/config.h with the fitz/_config.h file from the PyMuPDF release 1.17.4.

And so I get the error above. I'm stuck, and I feel this whole process is way too complicated. It would be great to automate this installation somehow, but for now I just want this to work. Any suggestions for how I can get it working?

My setup:

$% sw_vers
ProductName:	macOS
ProductVersion:	11.1
BuildVersion:	20C5048k
$% python3 -V
Python 3.9.0
$% pip3 -V
pip 20.3.1 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)

Upgrade to rM v2.11

In my previous issue #37, I noted that the rM software will "snap" to text, and therefore make a new type of highlight. At the time, rM made each line a new highlight rectangle. Therefore, when I fixed this in PR #38, I assumed that there was one rectangle per highlight.

In rM v2.11, this is no longer true. You can highlight multiple lines or even a whole paragraph, so a single highlight can contain multiple rectangles. I need to fix this, and I will branch off my previous pull request.

Replace tools code with Maxio's tool code

In ricklupton's rmc project, I found that he took the drawing-tool-specific values from the maxio project.

I think it might be worth replacing the tools logic in remarks with this logic, here's why:

The code is clean
It includes logic for tilt and pressure
Based on the screenshots on their GitHub page, the tool output looks very good.

Compatibility with older highlight colors

I encountered the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/$user/.programs/remarks/remarks/__main__.py", line 141, in <module>
    main()
  File "/home/$user/.programs/remarks/remarks/__main__.py", line 137, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/home/$user/.programs/remarks/remarks/remarks.py", line 84, in run_remarks
    process_document(metadata_path, out_path, doc_type, **kwargs)
  File "/home/$user/.programs/remarks/remarks/remarks.py", line 332, in process_document
    ann_page = add_smart_highlight_annotations(smart_hl_data, ann_page)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/$user/.programs/remarks/remarks/conversion/drawing.py", line 186, in add_smart_highlight_annotations
    color_array = fitz.utils.getColor(HL_COLOR_CODES[hl["color"]])
                                      ~~~~~~~~~~~~~~^^^^^^^^^^^^^

KeyError: 0

From working on zotero2remarkable_bridge I remember that at some point reMarkable shuffled around the number codes for their highlight colors and moved "Yellow" from 0 to 3. But they have not updated older highlights files that were created before, so you can still encounter code 0 for yellow.

I think this could be solved either by adding 0 as an additional key mapped to yellow or by adding something like this to the add_smart_highlight_annotations function:

try:
    color_array = fitz.utils.getColor(HL_COLOR_CODES[hl["color"]])
except KeyError:
    # default to yellow if the color code is not defined
    color_array = fitz.utils.getColor(HL_COLOR_CODES[3])
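
The first option (treating the legacy code 0 as yellow) could be sketched with plain dictionary lookups. The table below is a hypothetical stand-in for HL_COLOR_CODES: only the 0 and 3 codes for yellow come from the description above, and resolve_highlight_color is an illustrative name, not a function in remarks.

```python
# Hypothetical stand-in for HL_COLOR_CODES; the real table in remarks may differ.
HL_COLOR_CODES = {
    3: "yellow",  # current code for yellow, per the description above
}

# Codes written by older reMarkable software, mapped to their current equivalents.
LEGACY_COLOR_CODES = {
    0: 3,  # yellow used to be stored as 0
}

DEFAULT_COLOR_CODE = 3  # fall back to yellow for unknown codes


def resolve_highlight_color(code):
    """Return a color name for a highlight code, tolerating legacy or unknown codes."""
    code = LEGACY_COLOR_CODES.get(code, code)
    return HL_COLOR_CODES.get(code, HL_COLOR_CODES[DEFAULT_COLOR_CODE])
```

The returned name could then be passed to fitz.utils.getColor as in the snippet above. Keeping legacy codes in a separate table makes it easy to see which entries exist only for backward compatibility.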

Add basic tests

Woops! Sorry for that, and thank you for catching it! I meant to type continue there but I slipped. Testing the codebase on the demo directory did not raise problems so I went along.

No promises here, but would you be open to PRs related to building an integration test suite? Such regressions could be caught pretty easily with basic tests.

Originally posted by @clement-elbaz in #12 (comment)

remarks started a few months ago as a set of hacks/scripts that I wrote for my own use. Now that it has grown to a project that dozens use, we must level up.

Going forward, having a set of basic tests would be very helpful.

Add support for converting PDFs to Remarkable bundles

I am very interested in having a program that converts your PDF and annotations into the remarkable bundle format. I believe a solution exists for PDFs to RM, but none that supports annotations. In fact, something with this capability (being able to convert a "remarkable bundle" into a PDF and PDF to "remarkable bundle") would create the possibility for having bidirectional syncing, which would be very useful.

For this case, one must use the same annotation formats (and PDF parser) for both types of conversions. Then, we would be able to make comparisons between annotated files. I propose the following:

RM to PDF:

Convert remarkable bundles to PDFs as you do now (*.rm files become scribbles/highlights, which overlay the stored PDF file)

PDF to RM:

Extract any PyMuPDF-compatible scribbles or highlights as .rm files. Ignore any annotations that would not be considered "scribbles" or "highlights" by the remarkable.
Remove the extracted annotations and create an updated PDF file (which might still include ignored annotations in the above step).
Then, the remarkable bundle will include the .rm files and updated PDF file.

I think this is an important feature to have and would help make the open-source remarkable pipeline more feature-complete. Hopefully, this approach could be a decent starting point. I think remarks is the best "RM to PDF" codebase and, as you are most familiar with it, it would be fantastic if you could consider adding support for this or letting me know your thoughts. Thanks so much for your consideration!

Feature request: Highlighter/markup height same as Textheight

It would be great if highlights had the usual height produced by other applications, i.e. the same height as the text they mark up (or perhaps slightly larger, I'm not sure).

Segments with just one point (report them here)

Following #63, please report issues related to 1-point segments here.

Crop and split original PDF, then reconstruct it back as original, but with marks

remarks could enable a better workflow with reMarkable, if it could support this feature:
being able to "prepare" a PDF for the best reading and writing experience on a reMarkable in landscape mode (and reconstruct back with marks).

That would be achieved by cropping all margins (or only some of them) and splitting a page in two halves, so to have bigger font size on reMarkable in landscape mode. The two halves should have some overlap, so to avoid problems on the split edge. The cropping and splitting data should be inserted into the PDF metadata (because they will be needed for reconstruction).

The "prepared" PDF could be used on reMarkable normally, making marks (highlights, scribbles, annotations).

In order to get the PDF back in shape (but with marks), remarks should detect from metadata that the PDF had been "prepared", and create output PDF taking that metadata into account, thus reconstructing the original PDF as it was before cropping and splitting, and adding the marks made on reMarkable.
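
The geometry of the "prepare" step can be sketched as follows. This is only an illustration under assumptions not stated above: the split is top/bottom, the same margin is cropped from every side, and the overlap is given as a fraction of the cropped height. The function name crop_and_split is hypothetical.

```python
def crop_and_split(width, height, margin=0.0, overlap=0.1):
    """Crop `margin` from every side, then split the cropped area into a top
    and a bottom half that share `overlap` (a fraction of the cropped height).

    Returns two (x0, y0, x1, y1) boxes in the original page's coordinates.
    """
    x0, y0, x1, y1 = margin, margin, width - margin, height - margin
    middle = (y0 + y1) / 2
    pad = (y1 - y0) * overlap / 2  # each half extends this far past the middle
    top = (x0, y0, x1, middle + pad)
    bottom = (x0, middle - pad, x1, y1)
    return top, bottom


# Example: a 600x800 page, 50-unit margins, halves overlapping by 25%.
top, bottom = crop_and_split(600, 800, margin=50, overlap=0.25)
```

Storing the two boxes per page (for instance in the PDF metadata, as proposed above) would give the reconstruction step what it needs to place each half's marks back on the original page.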

Notebooks and Quick Sheets support?

Hello and thank you for your work on this project!

I'm trying to use my brand new remarkable V2 without the remarkable cloud and have my data stay within my local network at all times. So far it seems no open-source project provides a feature-complete solution to convert the entire xochitl directory into a directory of well-located, well-named PDF files. Your project seems to be the most mature solution to get there, as you already handle the annotated PDF use case, arguably the most complex. Handling notebooks and quick sheets seems doable in comparison.

Would you consider that "in scope" of your project and accept pull requests in that regard? (assuming the pull requests themselves are decent enough of course) Or should I consider contributing to other projects for this use case? Thanks !

Give pdf

Hi, would it be possible to pass in the PDF itself (or the raw document) to extract highlights from, instead of giving its name and fetching it from the reMarkable?
Have a good day :)

incorrect order of pdf pages with --modified_pdf

It generates the PDF with the annotations, but the pages seem to be in arbitrary order. When I use --combined-pdf it works: the annotations are there and the pages are in order. I tried it with two books, one with two columns and the other with one column.

Check the order of the page indexes:

Book Writers (2017, Createspace Independent Publishing Platform) - libgen.lc.pdf"
PDF in-device directory: .
-------PAGE IDX #104
-------PAGE IDX #114
-------PAGE IDX #8
-------PAGE IDX #132
-------PAGE IDX #26
-------PAGE IDX #43
-------PAGE IDX #79
-------PAGE IDX #88
-------PAGE IDX #14
-------PAGE IDX #107
-------PAGE IDX #119
-------PAGE IDX #115
-------PAGE IDX #52

Probably it should be sorted before saving if we save the order of pages in an array.
Maybe something like this:

pages_order = []
...
# at remarks.py:180
if modified_pdf:
    mod_pdf.insertPDF(ann_doc, start_at=-1)
    pages_order.append(page_idx)  # remember which original page this was

# at remarks.py:203
if modified_pdf:
    mod_pdf = _sort_document(mod_pdf, pages_order)
    mod_pdf.save(f"{output_dir}/{name} _remarks-only.pdf")
    mod_pdf.close()
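
_sort_document is not defined in the snippet above. One way it might work, assuming PyMuPDF's Document.select is used to reorder pages (it accepts page numbers in any order), is to compute the permutation that puts the appended pages back in ascending original order. The helper name sorted_positions is hypothetical.

```python
def sorted_positions(pages_order):
    """Given the original page index of each appended page (in append order),
    return the positions that list those pages in ascending original order."""
    return sorted(range(len(pages_order)), key=lambda pos: pages_order[pos])


# Page indexes in the order they were appended (first entries of the log above).
pages_order = [104, 114, 8, 132, 26]
positions = sorted_positions(pages_order)
# mod_pdf.select(positions) would then reorder the document (not run here).
```

Sorting positions rather than the page indexes themselves matters because the output document only contains the annotated pages, so its page numbers run from 0 to len(pages_order) - 1.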

or put everything together and delete the blank pages after.

for example
at remarks.py: 180

 if modified_pdf:
      mod_pdf.insertPDF(ann_doc, start_at=page_idx)

and
at remarks.py:203

if modified_pdf:
    # keep only the pages that contain text, dropping the blank filler pages
    keep = [i for i in range(mod_pdf.pageCount) if mod_pdf.getPageText(i)]
    mod_pdf.select(keep)  # select the remaining pages from the PDF
    mod_pdf.save(f"{output_dir}/{name} _remarks-only.pdf")
    mod_pdf.close()

PDF without annotations

Hi,
I have a problem when I try to use remarks (either the main version or the one from azeirah's PR). It returns a PDF without any annotations or highlighting. What's strange is that when I pass the --per-page-target option, it returns a file for each page where annotations are present, but every one of these files is empty.

Thanks a lot!

Combined pdf?

Thanks for this!

I notice that when it runs, it simply makes one pdf of each annotated page. Does it provide a total pdf with the annotations?

EDIT: I just saw that it is commented out in the source. Is the problem the page size and keeping the ToC working? Is it not possible to simply "apply" the annotations on top of the original PDF page?

Not able to understand how to install the script

Hi, would it be possible to make a wiki for people like me who don't understand anything about programming? I really want/need to use the script, so it's quite frustrating not understanding how to do it.

About step 2.1 Clone:

git clone https://github.com/lucasrla/remarks.git && cd remarks

Do I copy it and paste it into Terminal on a Mac?
I tried, but it doesn't work.
Do you have an instruction video?

Can this work with non-remarkable PDFs?

I have pdfs highlighted in Adobe Acrobat or Preview on macOS.
Would it be possible to get this to work with those or is a xochitl folder required?

More general questions if anyone can help:
What is the difference between how Remarkable, Acrobat, and Preview store annotations? Does any of them save the annotations into the PDF (modifying it)? If so, how?

Any help with this would be greatly appreciated. Thanks.

Two-column PDFs and markdown

The current process for extracting the highlights into markdown files sorts the words by coordinates and joins them. While this may be important for PDFs that aren't formatted correctly, I found that it basically prevents using remarks on e.g. PDFs with two columns.
Instead, for well-formatted PDFs, one could use the PDF text ordering in order to generate the markdown files.
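
A toy illustration of the problem, using made-up word positions rather than real PDF data: sorting by coordinates interleaves the two columns, while keeping the extraction order preserves each column.

```python
# Each word: (x, y, text). Two columns with two lines each; the list order
# stands in for the text order stored in a well-formatted PDF.
words = [
    (50, 100, "left-1"),
    (50, 120, "left-2"),
    (300, 100, "right-1"),
    (300, 120, "right-2"),
]

# Sorting top-to-bottom, then left-to-right, mixes the columns together.
by_coordinates = [w[2] for w in sorted(words, key=lambda w: (w[1], w[0]))]

# Keeping the extraction order reads the left column first, then the right.
by_pdf_order = [w[2] for w in words]

print(by_coordinates)
print(by_pdf_order)
```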

Identifying a specific pdf when filename is uuid

Great software/docs, it looks to be very promising.

However with my remarkable 2, I have downloaded the xochitl directory but can anyone suggest how to identify a specific pdf when all the filenames are uuids?
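
One possible approach, sketched under the assumption that each document in the xochitl directory has a UUID-named .metadata JSON file with a visibleName field: build a lookup table from UUID to the name shown on the tablet. The function name list_visible_names and the demo file are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def list_visible_names(xochitl_dir):
    """Map each document UUID to the name shown on the tablet."""
    names = {}
    for meta_path in Path(xochitl_dir).glob("*.metadata"):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError, OSError):
            continue  # skip unreadable or non-JSON files
        if isinstance(meta, dict):
            names[meta_path.stem] = meta.get("visibleName", "")
    return names


# Demo on a throwaway directory with one fake metadata file.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "example-uuid.metadata"
    fake.write_text(json.dumps({"visibleName": "My paper"}), encoding="utf-8")
    demo = list_visible_names(tmp)
```

Printing the resulting dictionary for a real xochitl copy would show which UUID belongs to which document.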

Upgrade to rM 2.8

In the last version of the rM software, they added a functionality where they will attempt to adjust your hand highlights to the text. So if the pdf has text (not an image of text), it will "reformat" the highlight to be a block over the text.

The highlight extraction for this no longer works in remarks. I tested this by making a malformatted highlight on a pdf (one that rM would not "refactor") and it shows up just fine after remarks in the annotated pdf. But the highlights that rM "refactors" do not show up.

I tried to look through the parse_rm_file function, but I could not easily see what the problem would be without digging into the rM output to see how the new highlights are stored.

Any chance for an update?

Full markdown with annotations

First of all, thanks for this great app! I run it on WSL without any issue.

I didn't find a way to export the full text with highlights to Markdown, only the highlighted parts. Is there such functionality? Or is there a way to add it?

ATX headers for Markdown

Right now the combined md file uses Setext headers. It would be nice to support ATX headers such as:

# Name of File

## Page number

- highlight 
- highlight
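
As a rough sketch of the conversion (a post-processing step, not how remarks generates its output), Setext headers underlined with = or - can be rewritten as ATX headers. The function name setext_to_atx is hypothetical, and edge cases such as thematic breaks directly under a paragraph are not handled.

```python
import re


def setext_to_atx(markdown):
    """Rewrite Setext headers (text underlined with = or -) as ATX headers."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        title = lines[i].strip()
        underline = lines[i + 1] if i + 1 < len(lines) else ""
        if title and re.fullmatch(r"=+\s*", underline):
            out.append("# " + title)  # level-1 header
            i += 2
        elif title and re.fullmatch(r"-+\s*", underline):
            out.append("## " + title)  # level-2 header
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)


sample = "Name of File\n============\n\nPage 1\n------\n\n- highlight"
converted = setext_to_atx(sample)
```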

add_smart_highlight_annotations does not work for paragraphs

add_smart_highlight_annotations does not work when having full paragraph smart highlights.

For example:
The highlighted text is a 5-line paragraph. hl["text"] contains all 5 lines of text, and page.search_for(hl["text"], quads=True) returns no matches. page.search_for only matches text that is at most 1 line long.

Example highlights json:

{
    "highlights": [
        [
            {
                "color": 3,
                "length": 454,
                "rects": [
                    {
                        "height": 27.12016196910554,
                        "width": 1009.0730312247251,
                        "x": 140.4741632434343,
                        "y": 599.9028268116126
                    },
                    {
                        "height": 27.974953197517607,
                        "width": 1029.1417209793326,
                        "x": 114.89519231887243,
                        "y": 630.2660855855902
                    },
                    {
                        "height": 27.12016196910554,
                        "width": 1033.5790201634973,
                        "x": 115.73704027993699,
                        "y": 662.0772446970841
                    },
                    {
                        "height": 27.12016196910554,
                        "width": 1034.2260248394723,
                        "x": 115.55571551852154,
                        "y": 693.1527501490382
                    },
                    {
                        "height": 27.12016196910554,
                        "width": 807.2974610806432,
                        "x": 115.11536671875547,
                        "y": 724.2282556009923
                    }
                ],
                "start": 819,
                "text": "This work is motivated by our own environment at the European Organization for NuclearResearch (CERN), a large multinational research organization where thousands of visiting researchers bring IoT devices into the network for research purposes. Devices includes networkedoscilloscopes, storage, printers, cameras, all kinds of networked sensors that researchers can easilyintegrate into the network with the help of a bring-your-own-device policy."
            }
        ]
    ]
}

Would it be possible to fix it by using the rects provided in the highlights json directly (instead of page.search_for)? Do you know what the relation is between the rects' dimensions and quads in pymupdf?

The rect[0] from highlights json would be (x0, y0, x1, y1):

(140.4741632434343, 599.9028268116126, 1149.5471944681594, 627.0229887807182)

Bbox highlighting the same line in pymupdf (got from page.get_text("dict")):

(55.73711013793945, 228.21629333496094, 440.51470947265625, 242.69195556640625)

In this case it could be fixed by scaling down the rect[0] coordinates by factors in the range (2.52, 2.63).

The rect[1] from highlights json:

(114.89519231887243, 630.2660855855902, 1144.036913298205, 658.2410387831078)

and corresponding bbox:

(45.77450942993164, 240.16844177246094, 441.9892578125, 254.64410400390625)

Again, this could be fixed by scaling down by factors in a similar range. The scaling factor for each individual coordinate (x0, y0, x1, y1) seems pretty constant (delta of about 0.01).
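
The per-coordinate factors can be recomputed from the numbers quoted above. This is just the arithmetic, not a proposed fix; rect_to_bbox and scale_factors are illustrative helper names.

```python
def rect_to_bbox(rect):
    """Convert an x/y/width/height rect to (x0, y0, x1, y1)."""
    return (
        rect["x"],
        rect["y"],
        rect["x"] + rect["width"],
        rect["y"] + rect["height"],
    )


def scale_factors(rm_bbox, pdf_bbox):
    """Ratio of each reMarkable coordinate to the matching PyMuPDF coordinate."""
    return tuple(rm / pdf for rm, pdf in zip(rm_bbox, pdf_bbox))


# First two rects from the highlights json above.
rects = [
    {"height": 27.12016196910554, "width": 1009.0730312247251,
     "x": 140.4741632434343, "y": 599.9028268116126},
    {"height": 27.974953197517607, "width": 1029.1417209793326,
     "x": 114.89519231887243, "y": 630.2660855855902},
]

# Matching line bboxes reported from page.get_text("dict") above.
pdf_bboxes = [
    (55.73711013793945, 228.21629333496094, 440.51470947265625, 242.69195556640625),
    (45.77450942993164, 240.16844177246094, 441.9892578125, 254.64410400390625),
]

factors = [scale_factors(rect_to_bbox(r), b) for r, b in zip(rects, pdf_bboxes)]
for f in factors:
    print([round(v, 3) for v in f])
```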

If you don't want to go this way, a simple fix would be any of:

determining line length and searching for substrings of hl["text"]
bruteforcing the longest matchable prefix from hl["text"], removing matched text - repeat until hl["text"] empty
iterating page.get_text("dict")["blocks"] and their ["lines"]
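
The second option (repeatedly taking the longest matchable prefix) could look roughly like the sketch below. The can_match callable is a stand-in for a real check such as whether page.search_for returns any result; here it is simulated with plain strings so the example runs without PyMuPDF. The function name and sample lines are illustrative.

```python
def split_into_matchable_chunks(text, can_match):
    """Greedily split text into chunks that each satisfy can_match."""
    chunks = []
    rest = text.lstrip()
    while rest:
        # Try the longest prefix first and shrink until something matches.
        for end in range(len(rest), 0, -1):
            if can_match(rest[:end].rstrip()):
                break
        else:
            raise ValueError(f"no matchable prefix for: {rest[:20]!r}")
        chunks.append(rest[:end].strip())
        rest = rest[end:].lstrip()
    return chunks


# Simulated page: two lines whose boundary lost its space in hl["text"].
page_lines = ["Organization for Nuclear", "Research (CERN), a large"]
highlight_text = "Organization for NuclearResearch (CERN), a large"

chunks = split_into_matchable_chunks(
    highlight_text,
    lambda s: any(s in line for line in page_lines),
)
```

Working on characters rather than words matters here because the highlight text joins line ends without a space ("NuclearResearch"). The loop makes many match calls in the worst case, so a real implementation might want something smarter than a linear shrink.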

Do you have any preference for how it should be fixed?

Generate a single combined markdown file

It would be helpful to be able to generate a single markdown file, just as there's a combined_pdf option.

Drawing many annotations is very slow

I tried converting my private diary which has over 250 pages of hand-written text. It takes well over two minutes on my high-end PC to convert this file to PDFs.

The primary bottleneck lies in drawing the line segments, see the "update_annotations" call below.

See this code: https://github.com/lucasrla/remarks/blob/master/remarks/conversion/drawing.py#L178

Only top layer of annotations exported

Hi! When I use several layers on the remarkable to write my annotations on, only the top layer seems to be handled by remarks. So if I add a second layer, only the contents of Layer 2 will be visible in the exported PDF.

Page offset for markdown

To aid in citations, it would be nice to have a page offset so that the combined markdown file shows the "real" page numbers and not the PDF page numbers. For a journal article, this would mean that the page offset is the number of the first page (since the PDF pages are 0-indexed).

This is also useful when a PDF has a copyright page: it can be accounted for in the offset so that everything lines up.
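
The arithmetic behind the request is small; a sketch, with a hypothetical function name and parameter:

```python
def real_page_number(pdf_page_idx, offset=1):
    """Convert a 0-indexed PDF page index to the printed page number.

    offset is the printed number that PDF page index 0 corresponds to;
    the default of 1 gives ordinary 1-based page numbers.
    """
    return pdf_page_idx + offset


# Journal article whose first PDF page is printed page 233.
article_first = real_page_number(0, offset=233)

# Same article with one extra copyright page in front: the offset drops by one.
article_with_cover = real_page_number(1, offset=232)
```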

Will the code work for ipad annotated pdfs as well?

I am interested in extracting highlights and associated comments as markdown or even plain text. If the drawn annotations are also captured, it's a bonus.

Error when attempting to run script with --pdf_name

Hi there...

I'm not sure what I'm doing wrong here?

C:\Users\brian\Google Drive\Zotfile\remarks>python -m remarks "C:\Users\brian\Google Drive\Zotfile\remarks\desktop" "C:\Users\brian\Google Drive\Zotfile\remarks" --pdf_name "Sperry_Klich_1992_Speech Breathing in Senescent and Younger Women During Oral Reading" --targets pdf --combined_pdf
Traceback (most recent call last):
  File "C:\Users\brian\Python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\brian\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\brian\Google Drive\Zotfile\remarks\remarks\__main__.py", line 94, in <module>
    main()
  File "C:\Users\brian\Google Drive\Zotfile\remarks\remarks\__main__.py", line 90, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "C:\Users\brian\Google Drive\Zotfile\remarks\remarks\remarks.py", line 48, in run_remarks
    if not is_document(path):
  File "C:\Users\brian\Google Drive\Zotfile\remarks\remarks\utils.py", line 16, in is_document
    metadata = read_meta_file(path)
  File "C:\Users\brian\Google Drive\Zotfile\remarks\remarks\utils.py", line 11, in read_meta_file
    data = json.loads(open(file).read())
  File "C:\Users\brian\Python\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\brian\Python\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\brian\Python\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Any idea what I'm doing wrong?
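
"Expecting value: line 1 column 1 (char 0)" means the file passed to json.loads was empty or did not start with a JSON value, so one of the files in the input directory is probably not a valid metadata file. A defensive variant of the reader might look like the sketch below; read_meta_file_safe is a hypothetical name, not the function in remarks, and the demo files are made up.

```python
import json
import tempfile
from pathlib import Path


def read_meta_file_safe(path):
    """Return the parsed JSON content of path, or None if it is not valid JSON."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None


# Demo: an empty file is skipped instead of raising, a valid one is parsed.
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "empty.metadata"
    empty.write_text("", encoding="utf-8")
    good = Path(tmp) / "good.metadata"
    good.write_text(json.dumps({"type": "DocumentType"}), encoding="utf-8")
    empty_result = read_meta_file_safe(empty)
    good_result = read_meta_file_safe(good)
```

A caller could then treat a None result as "not a document" and move on to the next file.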

[Feature Request] Maintain directory structure

In the output directory, would be great if it maintained the same directory structure as on the reMarkable. I have a lot of pdfs and have them organized, so the output directory gets a bit cluttered.
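
A sketch of how the in-device path could be rebuilt, assuming each entry's metadata carries a visibleName and a parent field holding the UUID of its containing folder (empty for the top level). The function name device_path and the sample data are illustrative only.

```python
def device_path(doc_id, metadata):
    """Build the folder path of a document by walking up its parent chain.

    metadata maps each UUID to a dict with "visibleName" and "parent" keys.
    """
    parts = []
    seen = set()
    current = doc_id
    # Stop at the top level, at unknown parents, or if a cycle is detected.
    while current in metadata and current not in seen:
        seen.add(current)
        entry = metadata[current]
        parts.append(entry["visibleName"])
        current = entry.get("parent", "")
    return "/".join(reversed(parts))


sample_metadata = {
    "a": {"visibleName": "Papers", "parent": ""},
    "b": {"visibleName": "2020", "parent": "a"},
    "c": {"visibleName": "article", "parent": "b"},
}
path = device_path("c", sample_metadata)
```

The resulting relative path could be appended to the output directory so exported files mirror the folders on the tablet.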