lapwat / papeer


195.0 4.0 14.0 602 KB

Scrape the web in the eink era. Convert websites into ebooks and markdown.

Home Page: https://papeer.tech

License: GNU General Public License v3.0

Go 99.65% Makefile 0.35%

eink ereader epub mobi kindle remarkable markdown scraper ebook command-line

papeer's People

jrtberlin ljrk0 yinshanyang joshklein averyneumann gbogarinb r2d2meuleu dei-layborer dolanor-galaxy muhmud joaovitor123jv ibndias poa00

papeer's Issues

"Just a moment..." appears instead of scraped content (in EPUB)

Scraping seems to time out at some point and templated waiting text from the title is inserted instead.
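
One way to keep the waiting text out of the book is to check the extracted title before accepting a page. The following is a minimal sketch, not papeer's actual behaviour: the function name `isInterstitial` and the title list are hypothetical, and only the "Just a moment..." string comes from this report.

```go
package main

import (
	"fmt"
	"strings"
)

// interstitialTitles lists page titles that suggest a waiting page was
// scraped instead of the real content. Only the first entry comes from the
// report above; extending the list is left to the reader.
var interstitialTitles = []string{"Just a moment..."}

// isInterstitial reports whether a scraped title matches a known waiting
// page title, ignoring surrounding whitespace and letter case.
// Hypothetical helper, not part of papeer.
func isInterstitial(title string) bool {
	t := strings.TrimSpace(title)
	for _, s := range interstitialTitles {
		if strings.EqualFold(t, s) {
			return true
		}
	}
	return false
}

func main() {
	for _, title := range []string{"Just a moment...", "Chapter 1"} {
		fmt.Printf("%q interstitial: %v\n", title, isInterstitial(title))
	}
}
```

A caller could retry or skip the chapter with a warning when the check returns true.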

Support for <tt> for code/typewriter text

As per JohannesKaufmann/html-to-markdown#49, some websites don't use semantic markup but specify <tt> directly.

Adding this rule for the markdown converter improves the output considerably:

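	converter.AddRules(
		md.Rule{
			// Match <tt> elements, which some sites use instead of <code>.
			Filter: []string{"tt"},
			Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
				// Wrap the element's text in backticks so it renders as inline code.
				content = "`" + content + "`"

				return &content
			},
		},
	)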

Command get --format=epub fails entirely if a single image link is broken

I'm trying to convert gamemath.com to EPUB, but one of its image links is broken, and papeer exits the whole process when it reaches it.

This is the command I want to run, which fails:
papeer get https://gamemath.com/book/index.html --depth=1 --format=epub

This is the command pinned down to the specific chapter.
papeer get https://gamemath.com/book/index.html --format=epub --selector='#MainColumn > div.TableOfContents > div:nth-child(69) > a'
This is the broken link: https://gamemath.com/book/figs/matrixintro/2d_matrix_l.png (it may work in the future if the author fixes it).

Possible solution: just emit a warning and continue?

Edit: Is this something that should go in https://github.com/bmaupin/go-epub instead?

[Question] How does one chain the `selector` flag?

In the documentation, selector is described as a flag that can be chained in order to capture multiple 'outline' levels of a table of contents, which seems like it would be super useful for mirroring the format of some sites that don't necessarily follow a flat single-level chapter structure.

So, just as an example, if one were to use papeer to download a product manual with the following TOC:

I.  Intro (located at 'section.toc>h2>a')
II. Main Document
    A. Chapter One. (located at 'section.toc>h3>ul>li>a')
    B. Chapter Two.
    ...
III. Appendices
    A. Appendix A.
        1. Addendum A-1 (located at 'section.toc>h4>ul>li>a')
    B. Appendix B
and so on...

What would the command syntax look like to include these three selectors chained?

Feature Request: Support for Password-Protected Sites

Problem Description

Many valuable web resources require authentication, limiting the ability of "Papeer" to access and scrape these sites. Users frequently encounter password-protected pages, making it challenging to include these resources in their e-reader compatible formats. Current functionality does not support a way to authenticate against these sites directly through "Papeer."

Proposed Solution

Introduce functionality within "Papeer" to allow users to provide authentication details (e.g., username and password, API keys, session cookies) that can be used to log in to password-protected sites before scraping. This could be implemented in several ways:

Command Line Arguments: Allow users to pass authentication credentials as arguments when invoking "Papeer."
Configuration File: Enable users to store authentication details securely in a configuration file "Papeer" reads from.
Interactive Prompt: When encountering a password-protected site, "Papeer" could prompt the user to enter their credentials interactively, with an option to save these for future sessions.
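
As a rough illustration of the first two options, the sketch below attaches basic-auth credentials or a session cookie to a request using Go's standard `net/http` package. Everything specific here is an assumption made for the example: the `Credentials` type, the `newAuthedRequest` function, and the `PAPEER_USER` / `PAPEER_PASSWORD` environment variables do not exist in papeer.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// Credentials is a hypothetical container for the authentication details
// a user might supply via flags, a config file, or a prompt.
type Credentials struct {
	Username    string
	Password    string
	CookieName  string
	CookieValue string
}

// newAuthedRequest builds a GET request and attaches whichever credentials
// are present: HTTP basic auth, a session cookie, or both.
func newAuthedRequest(target string, c Credentials) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	if c.Username != "" {
		req.SetBasicAuth(c.Username, c.Password)
	}
	if c.CookieName != "" {
		req.AddCookie(&http.Cookie{Name: c.CookieName, Value: c.CookieValue})
	}
	return req, nil
}

func main() {
	// Hypothetical environment variables; reading them avoids putting
	// secrets on the command line, where they end up in shell history.
	creds := Credentials{
		Username: os.Getenv("PAPEER_USER"),
		Password: os.Getenv("PAPEER_PASSWORD"),
	}
	req, err := newAuthedRequest("https://example.com/private", creds)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	_, _, ok := req.BasicAuth()
	fmt.Println("basic auth attached:", ok)
}
```

This covers only basic auth and cookies; form-based login and OAuth would need their own flows.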

Additional Context

Ensuring security and privacy of the stored credentials is paramount. Consideration should be given to encrypting the stored credentials if they are saved on disk.
This feature would significantly enhance "Papeer's" utility by expanding the range of accessible content for users, making it a more versatile tool for e-reader content preparation.

Potential Challenges

Handling different authentication mechanisms (basic auth, OAuth, form-based login, etc.) might require a flexible implementation.
Security concerns around storing and handling user credentials.

I believe this feature would be a valuable addition to "Papeer" and would greatly appreciate the team's consideration in implementing it. Thank you for your work on making "Papeer" such a useful tool for the community!

list allows usage of `-l` w/o `-s` while get doesn't

The list command implicitly allows using no selector and uses the default of "". Get doesn't work when only passing -l, nor does it work with passing -s "". In my case the "default" selector that get uses works quite well, but simply spits out a few chapters at the end that I intend to omit. Using papeer list -l 5 <uri> does that, however I cannot do the same with papeer get -l 5 <uri>.

I think ideally it should be allowed to use -l and others with a default selector.

Does this work with ZSH?

 10:57:57am ﬌ papeer list https://developer.spotify.com/documentation/web-api --selector='#main'
zsh: command not found: papeer

Unable to manage multiple web pages

https://www.digitaljournal.com/pr/news/theexpresswire/systems-engineering-software-market-2023-2028-competitive-strategies-and-market-positioning-by-top-players-with-industry-share

https://pokde.net/system/software/mobile-application/chatgpt-ios-app

https://www.dailydot.com/news/instacart-shopper-criticized-rebagging-groceries/

https://visualstudiomagazine.com/articles/2023/05/23/build-2023-ai.aspx

https://www.inman.com/2023/05/23/ai-on-the-go-the-new-chatgpt-ios-app-is-finally-here/

NYTimes

This isn't working for me for nytimes; they seem to have some way of blocking it.

What is the license of this software?

The title says all.

While we at Nixpkgs were packaging it, I decided to look at the repository, but there is no explicit license I could find.

Limit to site with --depth

I was trying to make an ebook of my site, which has pages a few layers deep. Some of those pages then link out to other sites. If I use --depth 5, papeer also tries to scrape those external sites, which I don't want. There isn't a depth setting I can use that covers all of my content without also following external links.

wget --mirror has a nice --no-parent option which prevents any recursion outside of the provided url. You do need to make an exception for directly linked css etc as those are often near the root of a site.
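
A scope check in the spirit of `--no-parent` could look like the sketch below. It is an assumption about how such a filter might work, not papeer's implementation: `inScope` and `parentDir` are hypothetical helpers, and the exception for directly linked CSS and other assets is deliberately left out.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// parentDir returns the directory part of a URL path with a trailing slash,
// e.g. "/book/index.html" becomes "/book/".
func parentDir(p string) string {
	if p == "" {
		return "/"
	}
	if strings.HasSuffix(p, "/") {
		return p
	}
	d := path.Dir(p)
	if !strings.HasSuffix(d, "/") {
		d += "/"
	}
	return d
}

// inScope reports whether link, resolved against root, stays on the same
// host and at or below root's directory. Unparsable URLs are out of scope.
func inScope(root, link string) bool {
	r, err := url.Parse(root)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	resolved := r.ResolveReference(l)
	if resolved.Host != r.Host {
		return false
	}
	rp := resolved.Path
	if rp == "" {
		rp = "/"
	}
	return strings.HasPrefix(rp, parentDir(r.Path))
}

func main() {
	root := "https://example.com/book/index.html"
	links := []string{"chapter1.html", "/about.html", "https://other.org/page"}
	for _, link := range links {
		fmt.Printf("%s in scope: %v\n", link, inScope(root, link))
	}
}
```

With a filter like this applied before each recursive fetch, a large --depth value would no longer pull in external sites.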