rdrview's Issues

Question: Does rdrview also support images?

Hi! First of all: awesome tool. I use it to extract the readable part of a page from its URL and convert that to EPUB (see blog post).

However, the pictures are missing from my EPUBs. I haven't investigated yet (it might be pandoc's fault), but could it be that rdrview filters out images?

Cheers,
Victor

Won't Follow Google News Redirects

Hi. I'm helping a blind friend who uses rdrview to read articles. It seems to follow some, but not all, redirects. In particular, Google News URLs don't work. Here's an example:

https://www.google.com/url?rct=j&sa=t&url=https://www.intellinews.com/turkey-s-karel-telco-equipment-maker-buys-local-peer-telesis-for-0-5mn-198953/%3Fsource%3Dcee-telecoms-media-it-newswatch&ct=ga&cd=CAEYASoUMTM2MzQwNzExNDcyNTAyMzUwMzMyGjg1NDM3YjRkY2FmN2QyZGE6Y29tOmVuOlVT&usg=AFQjCNH-3gYzxzSXyPZLrWczutgjqq3kag

rdrview reports "rdrview: document has no body tag".
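
One possible workaround, sketched in Python: unwrap the redirect before handing the address to rdrview. It assumes the article address is carried in the `url` query parameter, as it is in the example above. The function name `unwrap_google_redirect` is made up for this illustration and is not part of rdrview.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(url: str) -> str:
    """Return the target of a google.com/url redirect, or the URL unchanged."""
    parts = urlsplit(url)
    if parts.hostname in ("google.com", "www.google.com") and parts.path == "/url":
        # parse_qs percent-decodes values, so %3F becomes "?" and %3D becomes "=".
        targets = parse_qs(parts.query).get("url")
        if targets:
            return targets[0]
    return url


example = (
    "https://www.google.com/url?rct=j&sa=t&url=https://www.intellinews.com/"
    "turkey-s-karel-telco-equipment-maker-buys-local-peer-telesis-for-0-5mn-198953/"
    "%3Fsource%3Dcee-telecoms-media-it-newswatch&ct=ga"
    "&cd=CAEYASoUMTM2MzQwNzExNDcyNTAyMzUwMzMyGjg1NDM3YjRkY2FmN2QyZGE6Y29tOmVuOlVT"
    "&usg=AFQjCNH-3gYzxzSXyPZLrWczutgjqq3kag"
)
print(unwrap_google_redirect(example))
```

The printed address could then be passed to rdrview instead of the Google News link.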

Umlauts not displayed through rdrview

Umlauts are displayed as fragments when a page is extracted through rdrview.
Example: https://www.heise.de/news/Abschaltungen-von-Mining-Farmen-in-China-Gefahr-oder-Chance-fuer-Bitcoin-6160001.html
My .mailcap contains:

text/html; /usr/local/bin/lynx --dump %s; copiousoutput; description=HTML Text; nametemplate=%s.html

Snippet in rdrview:

...Kryptowährung, die anfällig für eine...

Snippet through lynx --dump:

...Kryptowährung, die anfällig für eine 51-Prozent...
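
The garbled snippet is what UTF-8 text looks like when its bytes are read as Latin-1. The Python sketch below reproduces the symptom. That this is the actual cause in the rdrview or mailcap pipeline is an assumption, and the helper names `misdecode` and `repair` are invented for the illustration.

```python
def misdecode(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode as UTF-8, then decode with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo misdecode(): recover the original bytes and read them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = misdecode("Kryptowährung, die anfällig für eine")
print(garbled)          # same damage as in the rdrview snippet
print(repair(garbled))  # the text as lynx --dump shows it
```

If the symptom really has this origin, making sure that every stage of the pipeline agrees on UTF-8 would be the place to look.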

User Agent -- How to set?

Is it possible to set the User-Agent that rdrview sends to the HTTP server?
Some servers refuse to deliver a page to curl when it uses its default User-Agent.
As far as I understand, this also applies to libcurl.

versioning

rdrview doesn't have a version yet. I am packaging it for Guix and I'd like to be able to refer to a version number in the package definition.

Can we add a 0.1.0 version?

Rendering

the problem

Various glyphs are rendered strangely when the output is piped to less, compared with the --html dump option:
’ as <E2>
… as <E2><A6>
ł as <C5>
▲ as <E2><B2>
and likely many more.

These appear highlighted within less, or as � if not piped. lynx -dump -nolist URL renders nearly everything.

steps to reproduce

at least on my setup:

  1. make install rdrview
  2. rdrview -B 'lynx -dump -nolist' URL | less;

the URL must contain an apostrophe ' (which for some reason gets converted to ’); the same goes for three sequential periods ..., which get transformed into an ellipsis.

I'm unsure if the problem is in the encoding, or something else.
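
The four examples above fit one pattern: each glyph's UTF-8 encoding with every byte in the range 0x80 to 0x9F removed. The Python sketch below checks that pattern. It is an observation about the reported output, not a confirmed diagnosis, and where in the pipeline the bytes get dropped is not established here. The function name `as_less_shows` is invented for the illustration.

```python
def as_less_shows(glyph: str) -> str:
    """Render a glyph's UTF-8 bytes the way less marks them, minus 0x80-0x9F."""
    kept = [b for b in glyph.encode("utf-8") if not 0x80 <= b <= 0x9F]
    # less prints bytes it cannot display as <XX>; plain ASCII passes through.
    return "".join(f"<{b:02X}>" if b >= 0x80 else chr(b) for b in kept)


for glyph in ("’", "…", "ł", "▲"):
    print(glyph, "as", as_less_shows(glyph))
```

Bytes 0x80 to 0x9F are control characters in Latin-1, so a stage that treats the stream as a single-byte encoding and strips control characters would produce exactly this kind of damage.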

Works only with 'disable-sandbox'

Hello,
I am on a musl-based system. When I launch rdrview, the output is empty. I have to use
'disable-sandbox' to get a web page rendered.
I use libseccomp 2.5.1, but without the Python bindings. Other dependencies are 'libxml2 2.9.10'
and 'curl 7.73.0'. Maybe you have an idea?

Non-english links and images are broken

Try it yourself:

rdrview "https://en.wikipedia.org/wiki/Wikipedia" and rdrview "https://ja.wikipedia.org/wiki/ウィキペディア". In the first case you can see images and follow links, but in the second case all links become "file:///...", so you can't open them or view images.

I don't know exactly what the problem is, so sorry if the title is inaccurate.
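
A possible workaround to try, sketched in Python: percent-encode the non-ASCII parts of the URL before passing it to rdrview. Whether this avoids the "file:///..." links is an assumption that has not been verified against rdrview. The function name `to_ascii_url` is invented here, and host names with non-ASCII characters are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path, query and fragment."""
    parts = urlsplit(url)
    # "%" stays in the safe set so already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(to_ascii_url("https://ja.wikipedia.org/wiki/ウィキペディア"))
```

An all-ASCII URL such as the English one passes through unchanged.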

recipe for target 'src/regex.o' failed

Fails on Ubuntu 16.04 (gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0):

gcc -DNDEBUG -O2 -Wall -Wextra -fno-strict-aliasing  -I/usr/include/libxml2 -o src/regex.o -c src/regex.c
src/regex.c:95:2: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
  ^
src/regex.c:95:2: note: (near initialization for ‘REGEXES[0]’)
src/regex.c:95:15: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
               ^
src/regex.c:95:15: note: (near initialization for ‘REGEXES[1]’)
src/regex.c:95:29: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
                             ^
src/regex.c:95:29: note: (near initialization for ‘REGEXES[2]’)
src/regex.c:95:40: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
                                        ^
src/regex.c:95:40: note: (near initialization for ‘REGEXES[3]’)
src/regex.c:95:53: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
                                                     ^
src/regex.c:95:53: note: (near initialization for ‘REGEXES[4]’)
src/regex.c:95:62: error: initializer element is not constant
  UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
                                                              ^
src/regex.c:95:62: note: (near initialization for ‘REGEXES[5]’)
src/regex.c:96:2: error: initializer element is not constant
  HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
  ^
src/regex.c:96:2: note: (near initialization for ‘REGEXES[6]’)
src/regex.c:96:17: error: initializer element is not constant
  HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
                 ^
src/regex.c:96:17: note: (near initialization for ‘REGEXES[7]’)
src/regex.c:96:30: error: initializer element is not constant
  HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
                              ^
src/regex.c:96:30: note: (near initialization for ‘REGEXES[8]’)
src/regex.c:96:43: error: initializer element is not constant
  HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
                                           ^
src/regex.c:96:43: note: (near initialization for ‘REGEXES[9]’)
src/regex.c:96:60: error: initializer element is not constant
  HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
                                                            ^
src/regex.c:96:60: note: (near initialization for ‘REGEXES[10]’)
src/regex.c:97:2: error: initializer element is not constant
  SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
  ^
src/regex.c:97:2: note: (near initialization for ‘REGEXES[11]’)
src/regex.c:97:13: error: initializer element is not constant
  SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
             ^
src/regex.c:97:13: note: (near initialization for ‘REGEXES[12]’)
src/regex.c:97:21: error: initializer element is not constant
  SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
                     ^
src/regex.c:97:21: note: (near initialization for ‘REGEXES[13]’)
src/regex.c:97:32: error: initializer element is not constant
  SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
                                ^
src/regex.c:97:32: note: (near initialization for ‘REGEXES[14]’)
src/regex.c:97:42: error: initializer element is not constant
  SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
                                          ^
src/regex.c:97:42: note: (near initialization for ‘REGEXES[15]’)
Makefile:18: recipe for target 'src/regex.o' failed
make: *** [src/regex.o] Error 1

Provide library

It should be possible to put the core functionality into a library which can be reused by other projects.

Add option to inject css

Plain text is OK, but sometimes I want more readable pages. I propose adding an option to inject CSS. For now I use a shell wrapper and a modified CSS block borrowed from simplyread (https://njw.name/simplyread/):

p{margin:0ex auto;} h1,h2,h3,h4{font-weight:normal}
p+p{text-indent:2em;} body{background:#cccccc none}
img{display:block; max-width: 32em; padding:1em; margin: auto}
h1{text-align:center;text-transform:uppercase}
div#readability-page-1{width:34em; padding:8em; padding-top:2em;
background-color:white; margin:auto; line-height:1.4;
text-align:justify; font-family:serif; hyphens:auto;}

It looks nice and my shell script works fine, but it would be good to have the same option natively. Maybe something like rdrview --css 'p{margin:0ex auto;} h1,h2,h3,h4{font-weight:normal}' https://.... or something similar.
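
Until such an option exists, the wrapper approach can be sketched as a small Python filter applied to rdrview's HTML output. This is not the author's shell script. The function name `inject_css` is invented, and the sketch makes no assumption about whether the HTML contains a head element: it handles both cases.

```python
import re


def inject_css(html: str, css: str) -> str:
    """Insert a <style> block before </head>, or prepend it if there is no head."""
    style = f"<style>{css}</style>"
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        return style + html
    return html[:match.start()] + style + html[match.start():]


page = "<html><head><title>t</title></head><body><p>text</p></body></html>"
print(inject_css(page, "p{margin:0ex auto;} h1,h2,h3,h4{font-weight:normal}"))
```

The result can be written to a file and opened in any browser.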

Convert to text

Hi, rdrview is absolutely fantastic! The fastest and most relevant output I've come across from all the firefox readability based tools I've tried.

One new feature would be a great addition, I think: converting the readable HTML output to text. Right now I use rdrview to get the readable HTML, output it with "-H", and use the links or lynx browser to dump the formatted text with the -dump option.

It would be nice if rdrview had an option for outputting text.

In any case, thanks a lot for rdrview!
(By the way I also had to throw away the sandbox stuff from the code because libseccomp would not compile on my system.)
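
As a rough idea of what such a conversion involves, here is a minimal Python sketch built on the standard library's `html.parser`. It is far simpler than what links or lynx do (no link lists, tables or wrapping) and says nothing about how rdrview itself would implement text output. The names `TextExtractor` and `html_to_text` are invented for the illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, putting line breaks around block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<h1>Title</h1><p>Hello <b>world</b>.</p>"))
```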

"Operation not permitted": on ARM it only runs with "disable-sandbox"

Hi, first of all, thanks for your work on this extremely useful and promising tool, written in C for speed.

I've been testing and using it on my x64 Arch Linux for some months. Very happy.

It's intended to be used with terminal RSS readers, to make the articles more readable on web browsers such as lynx.

I use it with w3m. Feel free to update the README as well.

w3m is a very underrated (and badly documented) CLI browser, but it has incredible customization options. It works amazingly fast with rdrview.

You can use it for a one shot operation like this:

$ rdrview -H https://www.bbc.com/news/world-asia-china-55784231 | w3m -T text/html

or

$ rdrview -H https://www.bbc.com/news/world-asia-china-55784231 | w3m -T text/html -dump

Or for interactive browsing.
You can for example add these lines to the config file, ~/.w3m/keymap

keymap \\\r COMMAND "SHELL 'rdrview -H $W3M_URL > /tmp/rdrview.html' ; LOAD /tmp/rdrview.html"

or

keymap \\\r COMMAND "SHELL 'clear; echo \"parsing page with rdrview\" ; echo ; rdrview -H $W3M_URL > /tmp/rdrview.html' ; LOAD /tmp/rdrview.html"

and then use "\r" when you're browsing a page inside w3m.

My issue arises when I try to run it on ARM (armv7), also under Arch Linux.

I tried it on both a chromebook running:

Linux alarmsung 5.10.10-1-ARCH #1 SMP PREEMPT Sat Jan 23 23:26:35 UTC 2021 armv7l GNU/Linux

and a Raspberry Pi 2 running:

Linux alarmpi 5.4.83-4-ARCH #1 SMP PREEMPT Wed Jan 20 14:06:49 UTC 2021 armv7l GNU/Linux

I install rdrview by hand. I do not use the AUR package, as the user in #13 does. The AUR package is not well maintained and is marked only for x64, not ARM: https://aur.archlinux.org/packages/rdrview-git

What I do instead is this.
I do git clone .. and then run make.

On both of these ARM systems I have exactly the same required dependencies installed.

They are the same as in my working x64 system, and they are official distro packages:

 
local/libseccomp 2.5.1-2
    Enhanced seccomp library
local/libxml2 2.9.10-8
    XML parsing library, version 2
local/libcurl-gnutls 7.74.0-1
    An URL retrieval library (linked against gnutls)

Compilation with make runs without a problem. But running it, whatever the options chosen, always gets me the message:

rdrview: Operation not permitted

If I use the flag "--disable-sandbox" then it works.

I'm not a developer, and have zero understanding of C programming, syscalls or security. The only thing I could find that brought me here was the similar issue by the other user.

#10 (comment)
...I am on a musl based system....
... have to use 'disable-sandbox' to get a webpage rendered....

Additional info that might be useful:

  1. Some libs

Architecture : armv7h

glibc, Version         : 2.32-2
gcc, Version         : 10.2.0-1

  2. strace

$ strace rdrview -M https://www.bbc.com/news/world-asia-china-55784231 2>> error_log.txt

error_log.txt

  3. The compiled binary on arm:

$ file ./rdrview

./rdrview: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=b7438379398f311b185c1ba3a7ba9019f245321d, for GNU/Linux 3.2.0, not stripped

Add LDFLAGS to Makefile for full relro and fortify

The Makefile currently does not honor the system's LDFLAGS, CPPFLAGS or CFLAGS, so the binaries it produces are not fully hardened:

[jelle@natrium][/tmp/rdrview-git/src/rdrview]%checksec --file=/usr/bin/rdrview
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH	Symbols		FORTIFY	Fortified	Fortifiable	FILE
Partial RELRO   Canary found      NX enabled    PIE enabled     No RPATH   No RUNPATH   No Symbols	  No	0		9		/usr/bin/rdrview

With a simple patch:

diff --git a/Makefile b/Makefile
index 18a0e8f..b018a9c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,8 @@
 SYSTEM = $(shell uname)
 CC = gcc
 
+LDFLAGS ?=
+CPPFLAGS ?=
 CFLAGS = -DNDEBUG -O2 -Wall -Wextra -fno-strict-aliasing
 override CFLAGS += $(shell curl-config --cflags) $(shell xml2-config --cflags)
 
@@ -21,10 +23,10 @@ SRCS = $(wildcard src/*.c)
 OBJS = $(SRCS:.c=.o)
 
 rdrview: $(OBJS)
-       $(CC) $(CFLAGS) -o rdrview $(OBJS) $(LDLIBS)
+       $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o rdrview $(OBJS) $(LDLIBS)
 
 %.o: %.c src/rdrview.h
-       $(CC) $(CFLAGS) -o $@ -c $<
+       $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o $@ -c $<
 
 clean:
        rm -f $(OBJS) rdrview
[jelle@natrium][/tmp/rdrview-git/src/rdrview]%checksec --file=rdrview
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH	Symbols		FORTIFY	Fortified	Fortifiable	FILE
Full RELRO      Canary found      NX enabled    PIE enabled     No RPATH   No RUNPATH   360) Symbols	  Yes	3		9		rdrview

https://www.redhat.com/en/blog/hardening-elf-binaries-using-relocation-read-only-relro

error while loading shared libraries: libicui18n.so.67

My OS is Arch Linux. I have recently upgraded my system.

rdrview gives the following error on startup:

rdrview: error while loading shared libraries: libicui18n.so.67: cannot open shared object file: No such file or directory

/usr/lib contains libicui18n.so.68

Handle local content

rdrview is great for viewing remote content, but it would be nice if it could also handle local content. The ability to read standard input or a file would be appreciated.
