shreevatsa / pdf-glyph-mapping
Some scripts to help make sense of individual glyphs in a PDF, and map them to actual text equivalents.
License: MIT License
I am getting a segfault at this line while dumping the Tjs. The page number on which this happens is not deterministic (sometimes past page 1000, sometimes past page 10000), but it has happened in every one of the ten runs I have tried.
pdf-glyph-mapping/src/dump-tjs.rs
Line 121 in 438e712
I get (with mupdf-tools 1.18.0-2, Arch Linux) .jpg output by this command instead of .png. So we need to explicitly specify the format in the command (if it is possible) for doit.sh to work.
I had abandoned this project in favour of https://github.com/shreevatsa/pdf-explorer but that itself seems abandoned so it may be good to resurrect one or both of them.
At minimum, it seems the Cargo.toml here needs to be changed to:
clap = "=3.0.0-beta.2"
clap_derive = "=3.0.0-beta.2"
per here. On top of that, it may be good to clarify the API a bit, e.g. remove the dependency on mupdf and have clear boundaries / separate binaries: one for just dumping the text content stream, etc.
On running without making any changes to the code (other than .png being replaced with .jpg in doit.sh), no Tjs-*.map files were generated even after dump-tjs ran successfully.
Bold (can be got from font name) and Italic.
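As a rough illustration of reading style from the font name, the sketch below guesses bold and italic from a PDF BaseFont string. The function name `guess_style` and the keyword list ("bold", "italic", "oblique") are assumptions made for this example, not anything taken from the repository, and fonts whose names do not mention their style will not be detected.

```rust
/// Hypothetical helper (not part of pdf-glyph-mapping): guess (bold, italic)
/// from a font name using simple keyword matching.
fn guess_style(font_name: &str) -> (bool, bool) {
    // Subset fonts are named with a six-uppercase-letter tag and '+',
    // e.g. "ABCDEF+Sanskrit-Bold"; strip the tag so it cannot match a keyword.
    let base = match font_name.split_once('+') {
        Some((tag, rest))
            if tag.len() == 6 && tag.chars().all(|c| c.is_ascii_uppercase()) =>
        {
            rest
        }
        _ => font_name,
    };
    let lower = base.to_ascii_lowercase();
    // Assumption: the style is spelled out in the name; this is only a heuristic.
    let bold = lower.contains("bold");
    let italic = lower.contains("italic") || lower.contains("oblique");
    (bold, italic)
}

fn main() {
    for name in ["ABCDEF+Sanskrit-Bold", "Times-BoldItalic", "Helvetica-Oblique", "Sanskrit"] {
        println!("{name}: {:?}", guess_style(name));
    }
}
```

Italic text that is produced by a slanted text matrix rather than a separate font would need a different check; this sketch only looks at names.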
To save the compute time spent extracting operators, etc. from the PDF, save the results in a convenient format (perhaps text, or using 'serde') so that things like glyph maps and regexes can be improved and tested quickly on the text alone.
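One possible shape for such a cache, using only the standard library, is one tab-separated line per text operation. The `TjRecord` struct, its fields, and the file name below are hypothetical and chosen for illustration; the real data dumped by dump-tjs may look different, and a serde-based format would work just as well.

```rust
use std::fs;
use std::io;

/// Hypothetical record for one text-showing operation (fields are assumptions).
#[derive(Debug, PartialEq, Clone)]
struct TjRecord {
    page: u32,
    font: String,
    glyph_ids: Vec<u16>,
}

impl TjRecord {
    /// Serialize as "page<TAB>font<TAB>hex glyph ids separated by spaces".
    /// Assumes the font name itself contains no tab or newline.
    fn to_line(&self) -> String {
        let glyphs: Vec<String> = self.glyph_ids.iter().map(|g| format!("{g:04X}")).collect();
        format!("{}\t{}\t{}", self.page, self.font, glyphs.join(" "))
    }

    /// Parse a line written by `to_line`; returns None if it is malformed.
    fn from_line(line: &str) -> Option<TjRecord> {
        let mut fields = line.split('\t');
        let page = fields.next()?.parse().ok()?;
        let font = fields.next()?.to_string();
        let glyph_ids = fields
            .next()?
            .split_whitespace()
            .map(|h| u16::from_str_radix(h, 16).ok())
            .collect::<Option<Vec<u16>>>()?;
        Some(TjRecord { page, font, glyph_ids })
    }
}

fn main() -> io::Result<()> {
    let records = vec![
        TjRecord { page: 1, font: "F1".to_string(), glyph_ids: vec![0x0041, 0x01AB] },
        TjRecord { page: 2, font: "F2".to_string(), glyph_ids: vec![0x0003] },
    ];
    // Write the cache once (the expensive PDF pass would happen before this).
    let path = std::env::temp_dir().join("tjs-cache-example.tsv");
    let text: Vec<String> = records.iter().map(TjRecord::to_line).collect();
    fs::write(&path, text.join("\n"))?;
    // Later runs can reload the cache without touching the PDF.
    let reloaded: Vec<TjRecord> = fs::read_to_string(&path)?
        .lines()
        .filter_map(TjRecord::from_line)
        .collect();
    assert_eq!(reloaded, records);
    Ok(())
}
```

A plain-text cache like this can also be inspected with grep or a regex tool directly, which is the kind of quick text-only iteration the request describes.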
[vvasuki:~/shreevatsa/pdf-glyph-mapping/work:main]─[08:54:13]─$make
RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- /home/vvasuki/Documents/books/granthasangrahaH/purANam/unabridged_full.pdf font-usage/ --phase phase1
error: no such subcommand: `+nightly`
make: *** [Makefile:41: font-usage/] Error 101