eafer / rdrview
Firefox Reader View as a command line tool
License: Apache License 2.0
Hi! First of all: awesome tool. I use it to extract the readable part of a page from a URL and convert it to EPUB (see blog post).
However, my EPUBs are missing pictures. I haven't investigated yet (it might be pandoc's fault), but could it be that rdrview filters out images?
Cheers,
Victor
Hi. I'm helping a blind friend who uses rdrview to read articles. It seems to follow some, but not all, redirects. In particular, Google News URLs don't work. Here's an example:
rdrview reports "rdrview: document has no body tag".
Umlauts are displayed as fragments when a page is extracted through rdrview.
Example: https://www.heise.de/news/Abschaltungen-von-Mining-Farmen-in-China-Gefahr-oder-Chance-fuer-Bitcoin-6160001.html
My .mailcap contains:
text/html; /usr/local/bin/lynx --dump %s; copiousoutput; description=HTML Text; nametemplate=%s.html
Snippet in rdrview:
...Kryptowährung, die anfällig für eine...
Snippet through lynx --dump:
...Kryptowährung, die anfällig für eine 51-Prozent...
Is it possible to set the User-Agent that rdrview sends to the web server?
Some web servers refuse to deliver a page to curl when it uses its default User-Agent.
As far as I understand, this also applies to libcurl.
rdrview doesn't have a version yet. I am packaging it for Guix and I'd like to be able to refer to a version number in the package definition.
Can we add a 0.1.0 version?
I get the message
The futex facility returned an unexpected error code
on pages like this, http://www.softpanorama.org/Admin/Monitoring/sar.shtml
With the flag "--disable-sandbox", rdrview works and does the job.
The only "abnormal" things I notice are that this page is served over http rather than https, that it is .shtml rather than .html,
and that its document charset is cp1252 (Latin-1), not Unicode.
Is this expected?
rdrview does not open the URL when lynx is the browser; it shows the help menu instead.
rdrview with -B w3m opens the URL as expected.
Something might have changed in how the link is opened with the lynx browser.
Various glyphs get rendered in strange ways when comparing the piped-to-less content with the output of the --html dump option:
’ as <E2>
… as <E2><A6>
ł as <C5>
▲ as <E2><B2>
and likely many more.
These appear highlighted within less, or as � if not piped. lynx -dump -nolist URL renders nearly everything.
At least on my setup, the URL must contain an apostrophe ' (which for some reason gets converted to ’); the same goes for three sequential periods ..., which get transformed into an ellipsis.
I'm unsure if the problem is in the encoding, or something else.
Hello,
I am on a musl based system. When launching rdrview the output is empty. I have to use
'disable-sandbox' to get a webpage rendered.
I use libseccomp 2.5.1 but without the python bindings. Other programs are 'libxml2 2.9.10'
and 'curl 7.73.0'. Maybe you have an idea?
Try it yourself: run rdrview "https://en.wikipedia.org/wiki/Wikipedia" and then rdrview "https://ja.wikipedia.org/wiki/ウィキペディア".
In the first case you can see images and follow links. In the second case all links become "file:///...", so you can't open them or view images.
I don't know what exactly the problem is, so sorry if the title is inaccurate.
Fails on Ubuntu 16.04 (gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0):
gcc -DNDEBUG -O2 -Wall -Wextra -fno-strict-aliasing -I/usr/include/libxml2 -o src/regex.o -c src/regex.c
src/regex.c:95:2: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:2: note: (near initialization for ‘REGEXES[0]’)
src/regex.c:95:15: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:15: note: (near initialization for ‘REGEXES[1]’)
src/regex.c:95:29: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:29: note: (near initialization for ‘REGEXES[2]’)
src/regex.c:95:40: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:40: note: (near initialization for ‘REGEXES[3]’)
src/regex.c:95:53: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:53: note: (near initialization for ‘REGEXES[4]’)
src/regex.c:95:62: error: initializer element is not constant
UNLIKELY_RE, CANDIDATE_RE, BYLINE_RE, PROPERTY_RE, NAME_RE, IMGEXT_RE,
^
src/regex.c:95:62: note: (near initialization for ‘REGEXES[5]’)
src/regex.c:96:2: error: initializer element is not constant
HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
^
src/regex.c:96:2: note: (near initialization for ‘REGEXES[6]’)
src/regex.c:96:17: error: initializer element is not constant
HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
^
src/regex.c:96:17: note: (near initialization for ‘REGEXES[7]’)
src/regex.c:96:30: error: initializer element is not constant
HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
^
src/regex.c:96:30: note: (near initialization for ‘REGEXES[8]’)
src/regex.c:96:43: error: initializer element is not constant
HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
^
src/regex.c:96:43: note: (near initialization for ‘REGEXES[9]’)
src/regex.c:96:60: error: initializer element is not constant
HASCONTENT_RE, NEGATIVE_RE, POSITIVE_RE, SENTENCE_DOT_RE, B64_DATAURL_RE,
^
src/regex.c:96:60: note: (near initialization for ‘REGEXES[10]’)
src/regex.c:97:2: error: initializer element is not constant
SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
^
src/regex.c:97:2: note: (near initialization for ‘REGEXES[11]’)
src/regex.c:97:13: error: initializer element is not constant
SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
^
src/regex.c:97:13: note: (near initialization for ‘REGEXES[12]’)
src/regex.c:97:21: error: initializer element is not constant
SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
^
src/regex.c:97:21: note: (near initialization for ‘REGEXES[13]’)
src/regex.c:97:32: error: initializer element is not constant
SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
^
src/regex.c:97:32: note: (near initialization for ‘REGEXES[14]’)
src/regex.c:97:42: error: initializer element is not constant
SRCSET_RE, SRC_RE, VIDEOS_RE, SHARE_RE, ABSOLUTE_RE,
^
src/regex.c:97:42: note: (near initialization for ‘REGEXES[15]’)
Makefile:18: recipe for target 'src/regex.o' failed
make: *** [src/regex.o] Error 1
On https://www.geeksforgeeks.org/dynamic-programming/ all the list items are missing.
It should be possible to put the core functionality into a library which can be reused by other projects.
rdrview does not work with NixCraft articles (https://www.cyberciti.biz/).
It produces this output: rdrview: no content could be extracted
It would be nice to be able to read NixCraft articles on my terminal.
Plain text is OK, but sometimes I want to make pages more readable. I propose adding an option to inject CSS. For now I use a shell wrapper and a modified CSS block borrowed from simplyread (https://njw.name/simplyread/):
p{margin:0ex auto;} h1,h2,h3,h4{font-weight:normal}
p+p{text-indent:2em;} body{background:#cccccc none}
img{display:block; max-width: 32em; padding:1em; margin: auto}
h1{text-align:center;text-transform:uppercase}
div#readability-page-1{width:34em; padding:8em; padding-top:2em;
background-color:white; margin:auto; line-height:1.4;
text-align:justify; font-family:serif; hyphens:auto;}
It looks nice and my shell script works fine, but it would be good to have the same option natively. Maybe something like rdrview --css 'p{margin:0ex auto;} h1,h2,h3,h4{font-weight:normal}' https://.... or similar.
./rdrview -B lynx 'https://github.com/eafer/rdrview'
There is no output.
None of lynx, w3m, links, elinks, or netsurf works.
Environment: Termux.
Only rdrview -H url | w3m -dump works; no other browser does.
The -B browser option doesn't work at all.
Hi, rdrview is absolutely fantastic! The fastest and most relevant output I've come across from all the firefox readability based tools I've tried.
One new feature would be a great addition I think: convert the readable html output to text. Right now I'm using rdrview to get the readable html, output it with "-H" and use the links or lynx browser to dump the formatted text with the -dump option.
Would be nice if rdrview would have an option for outputting text.
In any case, thanks a lot for rdrview!
(By the way I also had to throw away the sandbox stuff from the code because libseccomp would not compile on my system.)
Hello,
the latest master requires the sandbox to be disabled on my musl system.
Hi, first of all thanks for your work on this extremely useful and promising tool, written in C for speed.
I've been testing and using it on my x64 Arch Linux for some months. Very happy.
It's intended to be used with terminal RSS readers, to make the articles more readable on web browsers such as lynx.
I use it with w3m. Feel free to update the README as well.
W3m is a very underrated (and badly documented) cli browser. But with incredible customizing options. It works amazingly fast with rdrview.
You can use it for a one shot operation like this:
$ rdrview -H https://www.bbc.com/news/world-asia-china-55784231 | w3m -T text/html
or
$ rdrview -H https://www.bbc.com/news/world-asia-china-55784231 | w3m -T text/html -dump
Or for interactive browsing.
You can for example add these lines to the config file, ~/.w3m/keymap
keymap \\\r COMMAND "SHELL 'rdrview -H $W3M_URL > /tmp/rdrview.html' ; LOAD /tmp/rdrview.html"
or
keymap \\\r COMMAND "SHELL 'clear; echo \"parsing page with rdrview\" ; echo ; rdrview -H $W3M_URL > /tmp/rdrview.html' ; LOAD /tmp/rdrview.html"
and then use "\r" when you're browsing a page inside w3m.
My issue is when I try to run it on ARM, also Arch Linux, armv7.
I tried it on both a chromebook running:
Linux alarmsung 5.10.10-1-ARCH #1 SMP PREEMPT Sat Jan 23 23:26:35 UTC 2021 armv7l GNU/Linux
and a Raspberry Pi 2 running:
Linux alarmpi 5.4.83-4-ARCH #1 SMP PREEMPT Wed Jan 20 14:06:49 UTC 2021 armv7l GNU/Linux
I install rdrview by hand. I do not use the AUR package like the other user in #13. That AUR package is not well maintained and is marked only for x64, not ARM: https://aur.archlinux.org/packages/rdrview-git
What I do instead is this.
I do git clone .. and then run make.
On both these ARM systems I have the exact same needed dependencies installed
They are the same as in my working x64 system, and they are official distro packages:
local/libseccomp 2.5.1-2
Enhanced seccomp library
local/libxml2 2.9.10-8
XML parsing library, version 2
local/libcurl-gnutls 7.74.0-1
An URL retrieval library (linked against gnutls)
Compilation with make runs without a problem. But running it, whatever options I choose, always gets me this message:
rdrview: Operation not permitted
If I use the flag "--disable-sandbox" then it works.
I'm not a developer, and have zero understanding of C programming, syscalls or security. The only thing I could find that brought me here was the similar issue by the other user.
#10 (comment)
...I am on a musl based system....
... have to use 'disable-sandbox' to get a webpage rendered....
Additional info that might be useful:
Architecture : armv7h
glibc, Version : 2.32-2
gcc, Version : 10.2.0-1
$ strace rdrview -M https://www.bbc.com/news/world-asia-china-55784231 2>> error_log.txt
$ file ./rdrview
./rdrview: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=b7438379398f311b185c1ba3a7ba9019f245321d, for GNU/Linux 3.2.0, not stripped
The Makefile currently ignores the system's LDFLAGS, CPPFLAGS, and CFLAGS, so the resulting binaries aren't fully hardened:
[jelle@natrium][/tmp/rdrview-git/src/rdrview]%checksec --file=/usr/bin/rdrview
RELRO STACK CANARY NX PIE RPATH RUNPATH Symbols FORTIFY Fortified Fortifiable FILE
Partial RELRO Canary found NX enabled PIE enabled No RPATH No RUNPATH No Symbols No 0 9 /usr/bin/rdrview
With a simple patch:
diff --git a/Makefile b/Makefile
index 18a0e8f..b018a9c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,8 @@
SYSTEM = $(shell uname)
CC = gcc
+LDFLAGS ?=
+CPPFLAGS ?=
CFLAGS = -DNDEBUG -O2 -Wall -Wextra -fno-strict-aliasing
override CFLAGS += $(shell curl-config --cflags) $(shell xml2-config --cflags)
@@ -21,10 +23,10 @@ SRCS = $(wildcard src/*.c)
OBJS = $(SRCS:.c=.o)
rdrview: $(OBJS)
- $(CC) $(CFLAGS) -o rdrview $(OBJS) $(LDLIBS)
+ $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o rdrview $(OBJS) $(LDLIBS)
%.o: %.c src/rdrview.h
- $(CC) $(CFLAGS) -o $@ -c $<
+ $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o $@ -c $<
clean:
rm -f $(OBJS) rdrview
[jelle@natrium][/tmp/rdrview-git/src/rdrview]%checksec --file=rdrview
RELRO STACK CANARY NX PIE RPATH RUNPATH Symbols FORTIFY Fortified Fortifiable FILE
Full RELRO Canary found NX enabled PIE enabled No RPATH No RUNPATH 360) Symbols Yes 3 9 rdrview
https://www.redhat.com/en/blog/hardening-elf-binaries-using-relocation-read-only-relro
My OS is Arch Linux. I have recently upgraded my system.
rdrview gives the following error on startup:
rdrview: error while loading shared libraries: libicui18n.so.67: cannot open shared object file: No such file or directory
/usr/lib contains libicui18n.so.68
rdrview is great for viewing remote content, but it'd be great to handle local content. The ability to read standard input or a file would be appreciated.