mbirth / ttrss_plugin-af_feedmod

Article Filter plugin for Tiny-Tiny-RSS to replace article stubs by website contents.

PHP 100.00%

ttrss_plugin-af_feedmod

This is a plugin for Tiny Tiny RSS (tt-rss). It allows you to replace an article's contents by the contents of an element on the linked URL's page, i.e. create a "full feed".

‼️ Since I don't use Tiny Tiny RSS anymore, this project is abandoned for now. ‼️

Please have a look at the FeedIron plugin by @m42e.

Installation

Clone the repository into your plugins folder like this (from the tt-rss root directory):

$ cd /var/www/ttrss
$ git clone https://github.com/mbirth/ttrss_plugin-af_feedmod.git plugins/af_feedmod

Then enable the plugin in preferences.

Configuration

The configuration is done in JSON format. In the preferences, you'll find a new tab called FeedMod. Use the large field to enter/modify the configuration data and click the Save button to store it.

A configuration looks like this:

{
    "heise.de": {
        "type": "xpath",
        "xpath": "div[@class='meldung_wrapper']",
        "force_charset": "utf-8"
    },
    "berlin.de/polizei": {
        "type": "xpath",
        "xpath": "div[@class='bacontent']"
    },
    "n24.de": {
        "type": "xpath",
        "xpath": "div[@class='news']"
    },
    "golem0Bde0C": {
        "type": "xpath",
        "xpath": "article"
    },
    "oatmeal": {
        "type": "xpath",
        "xpath": "div[@id='comic']"
    },
    "blog.beetlebum.de": {
        "type": "xpath",
        "xpath": "div[@class='entry-content']",
        "cleanup": [ "header", "footer" ]
    }
}

The array key is a part of the URL of the article links (!). You'll notice the golem0Bde0C key in the fourth entry: all their articles link to something like http://rss.feedsportal.com/c/33374/f/578068/p/1/s/3f6db44e/l/0L0Sgolem0Bde0Cnews0Cthis0Eis0Ean0Eexample0A10Erss0Bhtml/story01.htm. To have the plugin match that URL without interfering with other feeds that use feedsportal.com, I used the part golem0Bde0C.
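
The matching rule described above can be sketched in PHP as follows. This is only an illustration, not the plugin's actual code: the function name feedmod_find_rule is hypothetical, and it assumes a plain case-sensitive substring test where the first matching key wins.

```php
<?php
// Hypothetical helper: returns the first rule whose key occurs in the article link.
function feedmod_find_rule(array $config, string $link): ?array
{
    foreach ($config as $urlPart => $rule) {
        // Assumption: a plain substring test, as "part of the URL" suggests.
        if (strpos($link, (string) $urlPart) !== false) {
            return $rule;
        }
    }
    return null; // no rule applies; the article would be left untouched
}

$config = [
    'heise.de'    => ['type' => 'xpath', 'xpath' => "div[@class='meldung_wrapper']"],
    'golem0Bde0C' => ['type' => 'xpath', 'xpath' => 'article'],
];

$link = 'http://rss.feedsportal.com/c/33374/f/578068/p/1/s/3f6db44e/l/'
      . '0L0Sgolem0Bde0Cnews0Cthis0Eis0Ean0Eexample0A10Erss0Bhtml/story01.htm';

$rule      = feedmod_find_rule($config, $link);
$unmatched = feedmod_find_rule($config, 'http://www.example.org/some/article.html');
```

With a substring test, a short key such as "oatmeal" would also match any other URL containing that word, which is why a more specific key like golem0Bde0C is the safer choice for shared redirect hosts.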

type has to be xpath for now. Maybe there will be more types in the future.

The xpath value is the actual XPath expression for the element to fetch from the linked page. Omit the leading // - it is prepended automatically.

If type is set to xpath, an additional cleanup option is available. It's an array of XPath expressions (relative to the fetched node) for elements to remove from the fetched node. Omit the leading // here as well - it is prepended automatically.
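
How xpath and cleanup work together can be sketched like this. It is a minimal sketch, not the plugin's implementation: feedmod_extract is a hypothetical name, only the first match is used, and the cleanup expressions are prefixed with .// (an assumption) so that the search stays inside the fetched node.

```php
<?php
// Hypothetical helper: extract one element and strip unwanted children.
function feedmod_extract(string $html, string $xpath, array $cleanup = []): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);

    // The configured expression has no leading "//"; prepend it here.
    $matches = $xp->query('//' . $xpath);
    if ($matches === false || $matches->length === 0) {
        return null;
    }
    $base = $matches->item(0);

    foreach ($cleanup as $expr) {
        // Assumption: ".//" keeps the search relative to the fetched node.
        $unwanted = $xp->query('.//' . $expr, $base);
        if ($unwanted === false) {
            continue; // invalid expression, skip it
        }
        foreach (iterator_to_array($unwanted) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the fetched node, not the whole page.
    return $doc->saveHTML($base);
}

$html = '<html><body><div class="entry-content"><header>Title bar</header>'
      . '<p>Article text</p><footer>Share links</footer></div></body></html>';

$result = feedmod_extract($html, "div[@class='entry-content']", ['header', 'footer']);
```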

force_charset lets you override the automatic charset detection. If it is omitted, the charset is parsed from the HTTP headers; failing that, loadHTML() decides on its own.
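
One possible way to force a charset before handing the page to loadHTML() is sketched below. This is not necessarily how the plugin does it: feedmod_load_html is a hypothetical name, and the approach (convert to UTF-8, then encode all non-ASCII characters as numeric entities) assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: load HTML while treating its bytes as $forceCharset.
function feedmod_load_html(string $html, ?string $forceCharset = null): DOMDocument
{
    if ($forceCharset !== null) {
        // Convert from the forced charset to UTF-8 ...
        $utf8 = mb_convert_encoding($html, 'UTF-8', $forceCharset);
        // ... then turn every non-ASCII character into a numeric entity, so
        // the parser's own charset guess no longer matters.
        $html = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// "\xD6" is the letter O with umlaut in ISO-8859-1.
$doc  = feedmod_load_html("<html><body><p>\xD6ffnen</p></body></html>", 'ISO-8859-1');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

DOM methods always return UTF-8 strings, so $text can be stored or displayed as UTF-8 regardless of the source charset.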

If you get an error about "Invalid JSON!", you can use JSONLint to locate the erroneous part.

XPath

Tools

To test your XPath expressions, you can use Chrome extensions such as XPath Helper.

Examples

Some XPath expressions you might need (the // is prepended automatically and must be omitted in the FeedMod configuration):

HTML5 <article> tag

<article>…article…</article>

//article

DIV inside DIV

<div id="content"><div class="box_content">…article…</div></div>

//div[@id='content']/div[@class='box_content']

Multiple classes

<div class="post-body entry-content xh-highlight">…article…</div>

//div[starts-with(@class, 'post-body')]

//div[contains(@class, 'entry-content')]
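
The two expressions above are substring tests on the class attribute, so they can also match unrelated class names that merely contain the same text. The sketch below compares contains() with the usual whole-token idiom; the second div and its class name are invented for the comparison.

```php
<?php
$html = '<html><body>'
      . '<div class="post-body entry-content xh-highlight">article</div>'
      . '<div class="entry-content-related">teaser</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Substring test: also matches "entry-content-related".
$loose = $xp->query("//div[contains(@class, 'entry-content')]");

// Whole-token test: pad the attribute with spaces and look for " entry-content ".
$strict = $xp->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' entry-content ')]"
);

$looseCount  = $loose->length;
$strictCount = $strict->length;
```

In a FeedMod configuration, the same expression would be written without the leading //.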

ttrss_plugin-af_feedmod's Issues

Call to undefined method PluginHost::getInstance()

Hi!

After cloning the plugin on a default Debian installation, I get the following error.
Any idea?

Thanks.

[Sat Jun 22 02:17:37 2013] [error] [client 91.119.71.125] PHP Fatal error: Call to undefined method PluginHost::getInstance() in /usr/share/tt-rss/www/plugins/af_feedmod/init.php on line 162, referer: https://opossum.htu.tuwien.ac.at/tt-rss/prefs.php

tt-rss package:

apt-cache showpkg tt-rss
Package: tt-rss
Versions:
1.7.8+dfsg-2 (/var/lib/apt/lists/gd.tuwien.ac.at_opsys_linux_debian_dists_testing_main_binary-amd64_Packages) (/var/lib/dpkg/status)
Description Language:
File: /var/lib/apt/lists/gd.tuwien.ac.at_opsys_linux_debian_dists_testing_main_binary-amd64_Packages
MD5: 02bd340a64d29c6b17e906e3b16d5f62
Description Language: en
File: /var/lib/apt/lists/gd.tuwien.ac.at_opsys_linux_debian_dists_testing_main_i18n_Translation-en
MD5: 02bd340a64d29c6b17e906e3b16d5f62

Reverse Depends:
Dependencies:
1.7.8+dfsg-2 - debconf (18 0.5) debconf-2.0 (0 (null)) dbconfig-common (0 (null)) libjs-dojo-core (2 1.5.0) libjs-dojo-dijit (2 1.5.0) libjs-scriptaculous (0 (null)) libphp-phpmailer (0 (null)) libphp-simplepie (0 (null)) php-gettext (0 (null)) libapache2-mod-php5 (18 5.3.0) php5-cgi (18 5.3.0) php5 (2 5.3.0) php5-cli (0 (null)) php5-mysql (16 (null)) php5-pgsql (0 (null)) phpqrcode (0 (null)) mysql-server (16 (null)) postgresql (0 (null)) mysql-client (16 (null)) postgresql-client (0 (null)) sphinxsearch (0 (null)) php-apc (0 (null)) apache2 (16 (null)) lighttpd (16 (null)) httpd (0 (null)) php5-gd (0 (null))
Provides:
1.7.8+dfsg-2 -
Reverse Provides:

Support tagesschau.de

Please add the tagesschau.de RSS to your preconfigured json files.

German Umlaut not properly displayed

I love your tt-rss plugin, and while it works for me most of the time without any issues, there are some sites where German umlauts are not displayed properly (e.g. iphoneblog.de). My config for this blog is as follows:

"iphoneblog.de": {
    "type": "xpath",
    "xpath": "div[@class='beitragstext']"
}

I already tried the "force_charset": "utf-8" option but this does not work either. A post where you can see the wrong encoding is the article from the 10th of November ("Besserer Dateitausch ...") where the Öffnen-In is not correctly encoded.

I would deeply appreciate your help on this issue.

Best regards

Andy

error: Invalid JSON!

I get an "error: Invalid JSON!", even when http://jsonlint.com/ tells me everything is correct. This also happens with the examples on here. What am I doing wrong?
Thanks for any advice!
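
To narrow down such an error outside the preferences dialog, the configuration text can be run through PHP's own JSON parser, which may be stricter than an online validator. This is a sketch only: feedmod_json_error is a hypothetical name, and the trailing comma in the second sample is just one common cause of a syntax error, not a confirmed diagnosis for this report.

```php
<?php
// Hypothetical helper: returns null when $json is valid, else the parser's message.
function feedmod_json_error(string $json): ?string
{
    json_decode($json, true);
    return json_last_error() === JSON_ERROR_NONE ? null : json_last_error_msg();
}

$good = '{"n24.de": {"type": "xpath", "xpath": "div[@class=\'news\']"}}';

// A trailing comma after the last member is not allowed in JSON.
$bad = '{"n24.de": {"type": "xpath", "xpath": "article",}}';

$goodError = feedmod_json_error($good);
$badError  = feedmod_json_error($bad);
```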

Warning about feedmod during update_daemon2.php

I have feedmod installed but not configured yet. I was watching update_daemon2.php run and noticed this warning:

Warning: Invalid argument supplied for foreach() in /home/.../plugins/af_feedmod/init.php on line 55

My environ is shared hosting through Dreamhost. All of my .php files are vanilla.
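
A warning like this typically appears when foreach receives something other than an array, for example when no configuration has been stored yet. Whether that is what happens on line 55 is an assumption; the guard below is a sketch with a hypothetical function name, not the plugin's code.

```php
<?php
// Hypothetical helper: always returns an array, even for unset or broken config.
function feedmod_rules($stored): array
{
    if (!is_string($stored) || $stored === '') {
        return []; // nothing saved yet
    }
    $decoded = json_decode($stored, true);
    return is_array($decoded) ? $decoded : [];
}

$seen = [];

// An unset preference (modelled here as false) yields no iterations and no warning.
foreach (feedmod_rules(false) as $urlPart => $rule) {
    $seen[] = $urlPart;
}

// A stored configuration is iterated as usual.
foreach (feedmod_rules('{"heise.de": {"type": "xpath", "xpath": "article"}}') as $urlPart => $rule) {
    $seen[] = $urlPart;
}
```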

Enable images in generated feeds

Could this please be fixed?
http://tt-rss.org/forum/viewtopic.php?f=1&t=2470&p=14006#p14006

Thanks

Feeds with updates

On some sites, an article shows up in the RSS feed again because the article has been updated. Unfortunately, af_feedmod does not fetch the article content again in that case.

Example from Tagesspiegel:
Old URL (which apparently also remains the feed URL): http://www.tagesspiegel.de/berlin/polizei-justiz/kaputte-gasleitung-in-berlin-mitte-s-bahnverkehr-lahm-gelegt/9458990.html

URL when opening the link in the browser: http://www.tagesspiegel.de/berlin/polizei-justiz/zwischen-friedrichstrasse-und-alexanderplatz-nach-vollsperrung-bahnverkehr-in-mitte-rollt-wieder/9458990.html

Modify article content

First of all, thanks for an excellent plugin! It is the reason I'm now using tt-rss :)

I do have a suggestion, though: a way to modify the article content after it has been fetched. This would allow for all kinds of nice things, but the problem I have at the moment is a site that uses scheme-less img src URLs (starting with //), which aren't handled properly by the feed reader I'm using.

I took a stab at implementing a fix for this particular issue; something like this seems to work very well (to be placed right after the cleanup code):

// Prefix scheme-less image URLs ("//host/path") with "http:".
$nodelist = $basenode->getElementsByTagName("img");
foreach ($nodelist as $node) {
    $imgsrc = $node->getAttribute("src");
    if (substr($imgsrc, 0, 2) === '//') {
        $node->setAttribute("src", "http:" . $imgsrc);
    }
}

Evolution of feedmod

Hi mbirth,

I've played around with feedmod for a while, adding new features in my fork, restructuring the code and so on. I have now created a new repository that contains all the changes, since they seemed too far-reaching to submit as pull requests. I hope this is OK with you. If not, please contact me and I'll delete it.

The repository is located here:
https://github.com/m42e/ttrss_plugin-feediron

Thank you @mbirth

cookies

After some frustration, it appears that the site I'm trying to get content from, the New England Journal of Medicine (e.g. http://www.nejm.org/doi/full/10.1056/NEJMp1306065), won't talk to browsers at all unless they accept cookies, which poses a problem. Even when following a link to a specific page, you have to go through two redirects and only get through if cookies are accepted.

It seems trivial to fix if curl is used for the download, by adding just two lines. But the fetch function within tt-rss doesn't currently do this.

Not accepting cookies might not affect anyone else, but it was quite tricky to find out that this was the problem, and it may be the reason that other people's seemingly correct XPaths don't work.
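
A cookie-aware download with PHP's curl extension could look roughly like the sketch below. Which two lines the report has in mind is not stated; the cookie options shown here are an assumption, and feedmod_curl_options is a hypothetical name. The actual request is left commented out because it needs network access.

```php
<?php
// Hypothetical helper: cURL options that keep cookies across redirects.
function feedmod_curl_options(string $cookieFile = ''): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow the redirects
        CURLOPT_MAXREDIRS      => 5,
        // An empty file name enables the in-memory cookie engine
        // without reading any cookie file.
        CURLOPT_COOKIEFILE     => $cookieFile,
    ];
    if ($cookieFile !== '') {
        // Persist received cookies when a real file is given.
        $options[CURLOPT_COOKIEJAR] = $cookieFile;
    }
    return $options;
}

$options = feedmod_curl_options();

// Usage (needs network access):
// $ch = curl_init('http://www.nejm.org/doi/full/10.1056/NEJMp1306065');
// curl_setopt_array($ch, $options);
// $html = curl_exec($ch);
// curl_close($ch);
```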

Use "readability" to auto-select article body

The "readability" library can extract the article content of a html page. With that, the configuration file would no longer be needed.

More info about readability: it was at first a js lib, which was then turned into a proprietary service (with an API). However there are now a lot of open source ports of the original js library to other languages, including PHP.

php lib: http://code.fivefilters.org/php-readability (or just google "readability php")

Maybe using this lib in this plugin could help :)

Regards

suggestion: URL_REWRITE Type

I am currently trying to write a good XPath extraction for a local newspaper website, but their markup contains a lot of unnecessary stuff and no single div tag or similar that holds just the article text.

My suggestion for cases like this would be a URL rewrite feature to fetch the print version instead of the normal article version.
With a simple regex rewrite of the URL, it could fetch a very slim and clean version of the article.
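
The suggested rewrite could be sketched with preg_replace() as shown below. Nothing like this exists in the plugin according to this page: the function name is hypothetical, and the example site and its /print/ URL scheme are invented.

```php
<?php
// Hypothetical helper: apply a regex rewrite to the article link before fetching.
function feedmod_rewrite_url(string $link, string $pattern, string $replacement): string
{
    $rewritten = preg_replace($pattern, $replacement, $link);
    // preg_replace() returns null on error; fall back to the original link.
    return $rewritten ?? $link;
}

// Invented example: the site and its "/print/" scheme are made up.
$print = feedmod_rewrite_url(
    'http://news.example.com/article/12345.html',
    '#/article/(\d+)\.html$#',
    '/print/$1.html'
);

// A link that does not match the pattern is returned unchanged.
$same = feedmod_rewrite_url('http://news.example.com/about', '#/article/(\d+)\.html$#', '/print/$1.html');
```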

Error after installation in settings dialog

Hi,

first of all thanks for the nice plugin! Should come in handy now as Google Reader shuts down...

But I have a problem with it: After installing the plugin as instructed in the README, I only get a very generic error message in the settings dialog: "Es ist ein Fehler aufgetreten." ("An error has occurred.")

Any clues what causes this? Can I somehow debug it?

globo feed

I have a feed like this:
http://g1.globo.com/dynamo/rss2.xml

This is a link it points to:
http://g1.globo.com/goias/noticia/2013/06/homens-assaltam-mulher-em-cemiterio-de-itumbiara-go.html

This is the query that XPath Helper returns the right results with:
/html/body/div[@id='glb-doc']/div[@id='glb-corpo']/div[@class='glb-conteudo']/div[@class='glb-bloco']/div[@class='glb-grid-8']/div[@id='glb-materia']

is there any way i can extract this via this plugin?

Trim article (read more option)

Hi,

really useful and nice plugin - helps me a lot :D

Now I have a big wish: an option to trim / shorten the parsed article to a specific number of characters (like 200), followed by a text like "read more" that opens the full article (in the same frame/div).

Greetings
Marco

PS: Here it is again, originally written in German, since (given the predefined sites) I assume you can read it ;)

I would like to have an option to shorten the text extracted from the page to a defined length and then show it in full in the same window via a "read more" link. It would be great if you could implement that :D

Best regards
Marco

Feature Request: Regex replacements

The feed http://appleinsider.com/appleinsider.rss points to article content on the main site, which uses an anti-image-theft mechanism where images are replaced by a 1x1 pixel placeholder.

<div class="article-img"><img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" width="660" height="362" alt="only cost matters" class="lazy" data-original="http://cdn1.appleinsider.com/JDPower203113.png"><noscript><img src="http://cdn1.appleinsider.com/JDPower203113.png"></noscript></div>

Is it possible to run a regex replace on the content?

$articlebody=~s/<div class="article-img"><img src=".+?" (.+?) class=".+?" data-original="(.+?)"><noscript><img[^>]+><\/noscript><\/div>/<img $1 src="$2">/g;
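
The Perl-style substitution above translates to PHP's preg_replace() roughly as follows. This is a sketch of the requested feature, not something the plugin is known to offer, and feedmod_unlazy_images is a hypothetical name. The pattern is taken over unchanged, so it only matches markup with exactly this attribute order.

```php
<?php
// Hypothetical post-processing step: replace the 1x1 placeholder by the real image.
function feedmod_unlazy_images(string $articleBody): string
{
    $pattern = '~<div class="article-img"><img src=".+?" (.+?) class=".+?" '
             . 'data-original="(.+?)"><noscript><img[^>]+></noscript></div>~';
    // $1 = the remaining attributes (width, height, alt), $2 = the real image URL.
    $result = preg_replace($pattern, '<img $1 src="$2">', $articleBody);
    return $result ?? $articleBody; // keep the input if the regex fails
}

$input = '<div class="article-img"><img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" '
       . 'width="660" height="362" alt="only cost matters" class="lazy" '
       . 'data-original="http://cdn1.appleinsider.com/JDPower203113.png"><noscript>'
       . '<img src="http://cdn1.appleinsider.com/JDPower203113.png"></noscript></div>';

$output = feedmod_unlazy_images($input);
```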

add license

At the moment, af_feedmod is unlicensed.

problem android.pit

"androidpit.de": { "type": "xpath", "xpath": "article" },
works for almost all articles, except for app reviews,

where the text breaks off between "Bewertung - Gutes" ("Rating - The good") and the "Daumen hoch" (thumbs up).
Example article: http://www.androidpit.de/skoobe-die-mobile-bibliothek-app-test

Select /html/body/

Any way to select an entire body of a page? I'm working on one that has no div or classes or much of anything except text wrapped in a body tag.

German Umlaute are broken

I added this plugin to my installation of tt-rss and enabled "heise security" (http://www.heise.de/security/news/news-atom.xml) as a test. Now the German umlauts inside the content are broken.

Apache is running UTF-8

Title of article is OK

Database is also UTF8

Example: "DrauÃŸen meldet sich langsam der FrÃ¼hling zurÃ¼ck, drinnen rufen Adobe und Microsoft zum FrÃ¼hjahrsputz des Rechners auf."

UTF-8 problem

I'm having an issue with UTF-8 encoding on http://bankier.pl/ - the feed is at http://feeds.feedburner.com/bankier-wiadomosci-dnia

The XPath expression I'm using is:

"bankier.pl": {
 "type": "xpath",
 "xpath": "div[@id='articleContent']"
},

and the resulting article is encoded like this:

Raport wydano przy cenie 43 zÅ‚, a w poniedziaÅ‚ek na zamkniÄ™ciu akcje PCM kosztowaÅ‚y 42,98 zÅ‚. (PAP)

The original article I used in this example is http://www.bankier.pl/wiadomosc/BESI-rozpoczal-wydawanie-rekomendacji-dla-PCM-od-kupuj-z-cena-docelowa-55-zl-3147285.html

the guardian feed

I seem to be unable to pull The Guardian's full article.
Example:
feed: http://www.theguardian.com/world/rss
article: http://www.theguardian.com/world/2014/jul/30/wikileaks-australia-super-injunction-bribery-allegations

This is the XPath needed:
/html[@id='js-context']/body[@id='top']/div[@class='l-side-margins l-side-margins--layout-content']/article[@id='article']/div[@class='gs-container']/div[@class='content__main-column content__main-column--article']/div[@class='from-content-api js-article__body']

But even using this
"theguardian": {
    "type": "xpath",
    "xpath": "div[@class='gs-container']"
},
doesn't pull anything from their website.
Any idea what I'm doing wrong?

Modify after filtering

tt-rss applies plugins before its built-in filters, so when using feedmod you cannot filter articles based on text outside the main content, like a category heading.

It'd be nice if it was the other way around...

Add compatibility to Tiny Tiny RSS v1.7.9 and newer

Add post-processing or more specific content selection

N24.de has additional content inside their main article DIV. There should be some filter or a more specific way of selecting the desired content to use.

Using more than one element

There are articles out there that have two parts, a short and a detailed version, but the detailed version has some important context missing. I tried to get the content with the following:

"xpath": [ "div[@id='artdetail_short']", "div[@id='artdetail_text']" ]

This only extracts the short version of the article. Is there a way to get two parts? I looked through your examples and this did not seem to come up.
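
In plain XPath, both parts can be selected with a single expression by combining the two id tests with or. The sketch below shows this with PHP's DOMXPath and joins all matches. Whether the plugin itself uses more than the first match is not stated here, so this illustrates the XPath technique rather than a confirmed configuration.

```php
<?php
$html = '<html><body>'
      . '<div id="artdetail_short">Short version.</div>'
      . '<div id="nav">Navigation</div>'
      . '<div id="artdetail_text">Detailed version.</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// One expression, two alternatives; the leading "//" is added as the plugin would.
$expr  = "div[@id='artdetail_short' or @id='artdetail_text']";
$nodes = $xp->query('//' . $expr);

// Concatenate every match in document order.
$combined = '';
foreach ($nodes as $node) {
    $combined .= $doc->saveHTML($node);
}
```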

Problem with Golem.de

I can no longer get the full texts stored in tt-rss.
No matter which setting I use, there has been no text since moving to my own NAS and the accompanying update to v1.11.
With v1.7.9 it still worked.

Even in combinations, neither
"rss.feedsportal.com" nor "golem0Bde0C"
works, with "xpath" set to
"article" or "div[@class='g g4 g-ie6']"
et cetera...
I'm at a loss.