lapwat / papeer
Scrape the web in the eink era. Convert websites into ebooks and markdown.
Home Page: https://papeer.tech
License: GNU General Public License v3.0
As per JohannesKaufmann/html-to-markdown#49, some websites don't use semantic markup but specify <tt> directly.
Adding this rule for the markdown converter improves the output considerably:
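converter.AddRules(
	md.Rule{
		// Match <tt> (teletype) elements.
		Filter: []string{"tt"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			// Wrap the element's content in backticks so it renders as inline code.
			content = "`" + content + "`"
			return &content
		},
	},
)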
I'm trying to convert gamemath.com to EPUB, but one of its pages has a broken img link, and papeer exits the whole process when it reaches it.
The command I want to run, which fails:
papeer get https://gamemath.com/book/index.html --depth=1 --format=epub
This is the command narrowed down to the specific chapter:
papeer get https://gamemath.com/book/index.html --format=epub --selector='#MainColumn > div.TableOfContents > div:nth-child(69) > a'
This is the broken link: https://gamemath.com/book/figs/matrixintro/2d_matrix_l.png (it may work in the future if the author fixes it).
Possible solution: just emit a warning and continue?
Edit: Is this something that should go in https://github.com/bmaupin/go-epub instead?
In the documentation, selector is described as a flag that can be chained in order to capture multiple 'outline' levels of a table of contents, which seems like it would be super useful for mirroring the format of some sites that don't necessarily follow a flat single-level chapter structure.
So, just as an example, if one were to use papeer to download a product manual with the following TOC:
I. Intro (located at 'section.toc>h2>a')
II. Main Document
    A. Chapter One. (located at 'section.toc>h3>ul>li>a')
    B. Chapter Two.
    ...
III. Appendices
    A. Appendix A.
        1. Addendum A-1 (located at 'section.toc>h4>ul>li>a')
    B. Appendix B
and so on...
What would the command syntax look like to include these three selectors chained?
Many valuable web resources require authentication, limiting the ability of "Papeer" to access and scrape these sites. Users frequently encounter password-protected pages, making it challenging to include these resources in their e-reader compatible formats. Current functionality does not support a way to authenticate against these sites directly through "Papeer."
Introduce functionality within "Papeer" to allow users to provide authentication details (e.g., username and password, API keys, session cookies) that can be used to log in to password-protected sites before scraping. This could be implemented in several ways:
I believe this feature would be a valuable addition to "Papeer" and would greatly appreciate the team's consideration in implementing it. Thank you for your work on making "Papeer" such a useful tool for the community!
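As a rough illustration of the request above, the sketch below attaches a session cookie and, optionally, HTTP basic auth credentials to a request using Go's standard `net/http` package. It is not an existing papeer feature: `newAuthenticatedRequest`, the cookie name `session`, and the example URL are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// newAuthenticatedRequest builds a GET request that carries a session cookie
// and, if a user name is given, HTTP basic auth credentials.
// Hypothetical helper: papeer does not expose anything like this today.
func newAuthenticatedRequest(target, cookieName, cookieValue, user, pass string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	if cookieName != "" {
		req.AddCookie(&http.Cookie{Name: cookieName, Value: cookieValue})
	}
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	return req, nil
}

func main() {
	req, err := newAuthenticatedRequest("https://example.com/private", "session", "abc123", "", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The request is only built here, not sent.
	fmt.Println(req.Header.Get("Cookie"))
}
```

Passing a cookie copied from a logged-in browser session is often the simplest option, since it avoids scripting each site's login form.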
The list command implicitly allows using no selector and uses the default of "". get doesn't work when only passing -l, nor does it work when passing -s "". In my case the "default" selector that get uses works quite well, but it simply spits out a few chapters at the end that I intend to omit. Using papeer list -l 5 <uri> does that; however, I cannot do the same with papeer get -l 5 <uri>.
I think ideally it should be allowed to use -l and others with a default selector.
https://pokde.net/system/software/mobile-application/chatgpt-ios-app
https://www.dailydot.com/news/instacart-shopper-criticized-rebagging-groceries/
https://visualstudiomagazine.com/articles/2023/05/23/build-2023-ai.aspx
https://www.inman.com/2023/05/23/ai-on-the-go-the-new-chatgpt-ios-app-is-finally-here/
This isn't working for me for nytimes; they seem to have some way of blocking it.
The title says all.
While we at Nixpkgs were packaging it, I decided to look at the repository, but there is no explicit license I could find.
I was trying to make an ebook of my site, which has pages a few layers deep. Some of those pages then link out to other sites. If I use --depth 5 then papeer also tries to scrape those external sites, which I don't want. There isn't a depth setting I can use that incorporates my content but doesn't overlap with external links.
wget --mirror has a nice --no-parent option which prevents any recursion outside of the provided URL. You do need to make an exception for directly linked CSS etc., as those are often near the root of a site.
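A check in the spirit of --no-parent might look like the sketch below, which only follows a link if it is on the same host and at or below the starting URL's directory. The function name `withinParent` is hypothetical, and this is not how wget or papeer actually implement it; the exception for directly linked CSS and similar assets is left out.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withinParent reports whether candidate is on the same host as root and
// located at or below root's directory. Hypothetical helper for illustration.
func withinParent(root, candidate string) bool {
	r, err := url.Parse(root)
	if err != nil {
		return false
	}
	c, err := url.Parse(candidate)
	if err != nil {
		return false
	}
	if !strings.EqualFold(r.Host, c.Host) {
		return false
	}
	// Everything up to the last "/" of the root path is the parent directory.
	dir := r.Path[:strings.LastIndex(r.Path, "/")+1]
	if dir == "" {
		dir = "/"
	}
	path := c.Path
	if path == "" {
		path = "/"
	}
	return strings.HasPrefix(path, dir)
}

func main() {
	root := "https://example.com/docs/index.html"
	for _, link := range []string{
		"https://example.com/docs/chapter1/intro.html",
		"https://example.com/blog/post.html",
		"https://other.org/docs/page.html",
	} {
		fmt.Println(link, withinParent(root, link))
	}
}
```

A crawler could call such a check before queueing each link, so the depth setting only ever applies to pages under the starting URL.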
The tool 'papeer' is unable to download a tweet and convert it into a markdown file.
G:\>papeer.exe get https://twitter.com/SmokeAwayyy/status/1661494617156775936
2023/05/25 15:32:38 Get "/SmokeAwayyy/status/1661494617156775936": stopped after 10 redirects
Is it possible to disable all hyperlinks and generate a plain text EPUB? I downloaded the Wikipedia webpage, but whenever I click on a new word in the EPUB reader, it takes me back to the webpage.
Nice work. If the article title contains ":", it downloads a 0-byte file, and if it contains "?", it gives the error "The filename, directory name, or volume label syntax is incorrect."
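One way to avoid this class of error is to sanitize the title before using it as a filename. The sketch below replaces the characters Windows does not allow in file names (`< > : " / \ | ? *` and control characters) with underscores. `sanitizeFilename` and the `untitled` fallback are hypothetical and not taken from papeer's code.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeFilename replaces characters that are invalid in Windows file names
// with underscores. Hypothetical helper for illustration.
func sanitizeFilename(title string) string {
	cleaned := strings.Map(func(r rune) rune {
		if r < 32 || strings.ContainsRune(`<>:"/\|?*`, r) {
			return '_'
		}
		return r
	}, title)
	// Windows also rejects names that end in a space or a dot.
	cleaned = strings.TrimRight(cleaned, " .")
	if cleaned == "" {
		return "untitled"
	}
	return cleaned
}

func main() {
	fmt.Println(sanitizeFilename("Part 1: Intro"))
	fmt.Println(sanitizeFilename("What is Go?"))
}
```

Applying the same replacement on every platform keeps generated file names portable, at the cost of slightly altered titles on systems that would have accepted the original characters.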