papeer's Issues

Support for <tt> for code/typewriter text

As per JohannesKaufmann/html-to-markdown#49, some websites don't use semantic markup but specify <tt> directly.

Adding this rule for the markdown converter improves the output considerably:

	// Render <tt> elements as inline code by wrapping their content in backticks.
	converter.AddRules(
		md.Rule{
			Filter: []string{"tt"},
			Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
				content = "`" + content + "`"

				return &content
			},
		},
	)

Command get --format=epub fails entirely if a single image link is broken

I'm trying to convert gamemath.com to EPUB, but one of its pages has a broken image link, and papeer aborts the whole process when it reaches it.

The command I want to run, which fails:
papeer get https://gamemath.com/book/index.html --depth=1 --format=epub

The same command narrowed down to the chapter that triggers the failure:
papeer get https://gamemath.com/book/index.html --format=epub --selector='#MainColumn > div.TableOfContents > div:nth-child(69) > a'

This is the broken link: https://gamemath.com/book/figs/matrixintro/2d_matrix_l.png (it may work in the future if the author fixes it).

Possible solution: just emit a warning and continue?

Edit: Is this something that should go in https://github.com/bmaupin/go-epub instead?
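
The warn-and-continue idea suggested above could look roughly like the sketch below. It is not papeer's or go-epub's actual code: `collectImages`, `fetchFunc`, and `fakeFetch` are hypothetical names, and the fetch step is injected as a function so the example runs offline.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// fetchFunc downloads one image. It is a hypothetical stand-in for
// whatever HTTP call the tool actually makes.
type fetchFunc func(url string) ([]byte, error)

// collectImages tries every URL, logs a warning for each failure,
// and returns the images that did download plus the URLs that failed.
func collectImages(urls []string, fetch fetchFunc) (map[string][]byte, []string) {
	images := make(map[string][]byte, len(urls))
	var failed []string
	for _, u := range urls {
		data, err := fetch(u)
		if err != nil {
			// Warn and keep going instead of aborting the whole export.
			log.Printf("warning: skipping image %s: %v", u, err)
			failed = append(failed, u)
			continue
		}
		images[u] = data
	}
	return images, failed
}

// fakeFetch simulates one broken link so the example needs no network.
func fakeFetch(url string) ([]byte, error) {
	if url == "https://example.com/broken.png" {
		return nil, errors.New("404 Not Found")
	}
	return []byte("png-bytes"), nil
}

func main() {
	urls := []string{
		"https://example.com/ok.png",
		"https://example.com/broken.png",
	}
	images, failed := collectImages(urls, fakeFetch)
	fmt.Printf("embedded %d image(s), skipped %d\n", len(images), len(failed))
}
```

Returning the failed URLs alongside the successful downloads lets the caller decide whether to print a summary, leave the original `<img>` reference in place, or substitute a placeholder.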

[Question] How does one chain the `selector` flag?

In the documentation, selector is described as a flag that can be chained in order to capture multiple 'outline' levels of a table of contents, which seems like it would be super useful for mirroring the format of some sites that don't necessarily follow a flat single-level chapter structure.

So, just as an example, if one were to use papeer to download a product manual with the following TOC:

I.  Intro (located at 'section.toc>h2>a')
II. Main Document
    A. Chapter One. (located at 'section.toc>h3>ul>li>a')
    B. Chapter Two.
    ...
III. Appendices
    A. Appendix A.
        1. Addendum A-1 (located at 'section.toc>h4>ul>li>a')
    B. Appendix B
and so on...

What would the command syntax look like to include these three selectors chained?

Feature Request: Support for Password-Protected Sites

Problem Description

Many valuable web resources require authentication, limiting the ability of "Papeer" to access and scrape these sites. Users frequently encounter password-protected pages, making it challenging to include these resources in their e-reader compatible formats. Current functionality does not support a way to authenticate against these sites directly through "Papeer."

Proposed Solution

Introduce functionality within "Papeer" to allow users to provide authentication details (e.g., username and password, API keys, session cookies) that can be used to log in to password-protected sites before scraping. This could be implemented in several ways:

  • Command Line Arguments: Allow users to pass authentication credentials as arguments when invoking "Papeer."
  • Configuration File: Enable users to store authentication details securely in a configuration file "Papeer" reads from.
  • Interactive Prompt: When encountering a password-protected site, "Papeer" could prompt the user to enter their credentials interactively, with an option to save these for future sessions.
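
As a rough illustration of the simplest cases, the sketch below attaches HTTP basic auth credentials or a session cookie to an outgoing request using Go's standard `net/http` package. `Credentials` and `newAuthenticatedRequest` are hypothetical names, not part of papeer, and form-based login or OAuth would need considerably more than this.

```go
package main

import (
	"fmt"
	"net/http"
)

// Credentials is a hypothetical container for whatever the user supplied
// via flags, a config file, or an interactive prompt.
type Credentials struct {
	Username string
	Password string
	Cookie   *http.Cookie // optional pre-existing session cookie
}

// newAuthenticatedRequest builds a GET request and attaches the supplied
// credentials. Only basic auth and a single cookie are handled here.
func newAuthenticatedRequest(url string, c Credentials) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if c.Username != "" {
		req.SetBasicAuth(c.Username, c.Password)
	}
	if c.Cookie != nil {
		req.AddCookie(c.Cookie)
	}
	return req, nil
}

func main() {
	req, err := newAuthenticatedRequest(
		"https://example.com/members",
		Credentials{Username: "alice", Password: "s3cret"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// BasicAuth reads back the credentials stored in the Authorization header.
	user, _, ok := req.BasicAuth()
	fmt.Println("basic auth set:", ok, "user:", user)
}
```

Building the request separately from sending it keeps the credential handling in one place, which would make it easier to add other mechanisms later.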

Additional Context

  • Ensuring security and privacy of the stored credentials is paramount. Consideration should be given to encrypting the stored credentials if they are saved on disk.
  • This feature would significantly enhance "Papeer's" utility by expanding the range of accessible content for users, making it a more versatile tool for e-reader content preparation.

Potential Challenges

  • Handling different authentication mechanisms (basic auth, OAuth, form-based login, etc.) might require a flexible implementation.
  • Security concerns around storing and handling user credentials.

I believe this feature would be a valuable addition to "Papeer" and would greatly appreciate the team's consideration in implementing it. Thank you for your work on making "Papeer" such a useful tool for the community!

list allows usage of `-l` w/o `-s` while get doesn't

The list command implicitly allows omitting the selector and falls back to the default of "". Get doesn't work when passing only -l, nor does it work when passing -s "". In my case the "default" selector that get uses works quite well, but it also spits out a few chapters at the end that I want to omit. Using papeer list -l 5 <uri> does that; however, I cannot do the same with papeer get -l 5 <uri>.

I think ideally it should be allowed to use -l and others with a default selector.

Unable to manage multiple web pages

NYTimes

This isn't working for me on nytimes; they seem to have some way of blocking it.

Limit to site with --depth

I was trying to make an ebook of my site, which has pages a few layers deep. Some of those pages then link out to other sites. If I use --depth 5, papeer also tries to scrape those external sites, which I don't want. There isn't a depth setting I can use that covers my content but doesn't also follow external links.

wget --mirror has a nice --no-parent option which prevents any recursion outside of the provided URL. You do need to make an exception for directly linked CSS and similar assets, as those are often near the root of a site.
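
A scope check in the spirit of --no-parent could be sketched as below. This is an assumption about how such a filter might work, not papeer's implementation: `inScope` is a hypothetical helper, and it does not include the exception for CSS and other assets mentioned above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// inScope reports whether link stays on the same host as root and under
// root's directory. Relative links are resolved against root first.
func inScope(root, link string) (bool, error) {
	r, err := url.Parse(root)
	if err != nil {
		return false, err
	}
	l, err := r.Parse(link) // resolves relative references against root
	if err != nil {
		return false, err
	}
	if l.Host != r.Host {
		return false, nil // external site
	}
	// Derive the directory of the root URL, always ending in "/".
	dir := r.Path
	if dir == "" {
		dir = "/"
	}
	if !strings.HasSuffix(dir, "/") {
		dir = path.Dir(dir)
		if !strings.HasSuffix(dir, "/") {
			dir += "/"
		}
	}
	return strings.HasPrefix(l.Path, dir), nil
}

func main() {
	root := "https://example.com/book/index.html"
	links := []string{"chapter1.html", "/about.html", "https://other.org/page"}
	for _, link := range links {
		ok, err := inScope(root, link)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%-24s in scope: %v\n", link, ok)
	}
}
```

A crawler could call a check like this before queueing each discovered link, so that the depth limit only applies to pages under the starting URL.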

Twitter

The tool 'papeer' is unable to download a tweet and convert it into a markdown file.

G:\>papeer.exe get https://twitter.com/SmokeAwayyy/status/1661494617156775936
2023/05/25 15:32:38 Get "/SmokeAwayyy/status/1661494617156775936": stopped after 10 redirects
