goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Since v2.0, goreadability uses OpenGraph tag values when they exist. You can disable OpenGraph lookup and follow the traditional readability rules by setting Option.LookupOpenGraphTags to false.
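
To make the idea of "prefer OpenGraph values, otherwise fall back" concrete, here is a minimal standard-library sketch. It is not goreadability's implementation: the names `openGraphTags`, `pageTitle`, and `lookupOpenGraph` are hypothetical, the fallback to the `<title>` element is an assumption made for illustration, and the regular expressions only handle one simple tag layout where a real extractor would use an HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ogMeta matches <meta property="og:NAME" content="VALUE"> with the
// attributes in exactly this order and double-quoted. Real pages vary,
// so this is only good enough for a sketch.
var ogMeta = regexp.MustCompile(`(?i)<meta\s+property="og:([a-z:_]+)"\s+content="([^"]*)"`)

// titleTag matches the contents of the <title> element.
var titleTag = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// openGraphTags collects og:* values from the page, keyed by the name
// without the "og:" prefix. The first occurrence of a name wins.
func openGraphTags(html string) map[string]string {
	tags := make(map[string]string)
	for _, m := range ogMeta.FindAllStringSubmatch(html, -1) {
		name := strings.ToLower(m[1])
		if _, seen := tags[name]; !seen {
			tags[name] = m[2]
		}
	}
	return tags
}

// pageTitle returns og:title when lookup is enabled and the tag exists,
// and falls back to the <title> element otherwise (assumed fallback).
func pageTitle(html string, lookupOpenGraph bool) string {
	if lookupOpenGraph {
		if t, ok := openGraphTags(html)["title"]; ok && t != "" {
			return t
		}
	}
	if m := titleTag.FindStringSubmatch(html); m != nil {
		return strings.TrimSpace(m[1])
	}
	return ""
}

func main() {
	page := `<html><head>
<title>Lego - Wikipedia</title>
<meta property="og:title" content="Lego">
</head><body></body></html>`

	fmt.Println(pageTitle(page, true))  // prints "Lego"
	fmt.Println(pageTitle(page, false)) // prints "Lego - Wikipedia"
}
```

The boolean parameter plays the role that Option.LookupOpenGraphTags plays in the library: with it off, the OpenGraph values are never consulted.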

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL to extract contents (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Default option
	opt := readability.NewOption()

	// You can modify some option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test -v

Command Line Tool

TODO

Related Projects

  • ruby-readability is the base of this project.
  • fastimage finds the type and/or size of a remote image given its URI, by fetching as little as needed.

Potential Issues

TODO

License

MIT

Contributors

bitdeli-chef, philipjkim

goreadability's Issues

Example where main text body not extracted

Not sure if that's helpful, just thought I'd dump this here as a reproducible example.

package main

import (
	"fmt"
	"log"

	"github.com/mmcdole/gofeed"
	readability "github.com/philipjkim/goreadability"
)

func main() {
	// Fetch the feed; its first item is the page used to reproduce the problem.
	fp := gofeed.NewParser()
	feed, err := fp.ParseURL("https://roadsandkingdoms.com/feed/")
	if err != nil {
		log.Fatal(err)
	}

	opt := readability.NewOption()
	opt.DescriptionExtractionTimeout = 4000
	opt.ImageRequestTimeout = 3000

	for _, i := range feed.Items {
		fmt.Println(i.Link)
		content, err := readability.Extract(i.Link, opt)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(content.Title)
		fmt.Println(content.Description)
		break // one item is enough to show the issue
	}
}

Output:

Desktop|⇒ go run thing.go
https://roadsandkingdoms.com/2018/rk-insider-lets-talk-about-brazil/
R&K Insider: Let's talk about Brazil
 Frustration, despair, and heightened risk: those are the common themes in conversations with friends and colleagues who have been covering the protests in Gaza and the West Bank.
Desktop|⇒

https://github.com/mauidude/go-readability works correctly so I'm going to switch to that one for now.

Support web pages using non-utf8 charset

For example:

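// bodyStr reads the response body and returns it as a UTF-8 string,
// converting from the detected charset when necessary.
func bodyStr(res *http.Response) (s *string, err error) {
    defer res.Body.Close()
    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return nil, fmt.Errorf("ioutil.ReadAll failed: %v", err)
    }
    result := string(body)
    // Detect the charset from the body bytes and the Content-Type header.
    _, cs, _ := charset.DetermineEncoding(body, res.Header.Get("Content-Type"))
    // This snippet treats a windows-1252 result as EUC-KR.
    if cs == "windows-1252" {
        cs = "euc-kr"
    }
    if cs != "utf-8" {
        result, err = iconv.ConvertString(result, cs, "utf-8")
        if err != nil {
            // Include the underlying error so that the cause is not lost.
            return nil, fmt.Errorf("converting string from %v to utf-8 failed, url: %v: %v",
                cs, res.Request.URL.String(), err)
        }
    }
    return &result, nil
}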
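
The decision of whether a body needs converting at all can be sketched with the standard library only. This is a hedged illustration, not part of goreadability: the helper names `declaredCharset` and `needsConversion` are hypothetical, and the actual transcoding (for example from EUC-KR) still requires a package outside the standard library, as in the snippet above.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
	"unicode/utf8"
)

// declaredCharset returns the lowercased charset parameter of a
// Content-Type header value, or "" when the header is missing,
// malformed, or carries no charset.
func declaredCharset(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// needsConversion reports whether body should be transcoded to UTF-8
// before it is handed to an HTML parser.
func needsConversion(body []byte, contentType string) bool {
	switch declaredCharset(contentType) {
	case "utf-8", "utf8":
		return false
	case "":
		// No declaration: accept the body only if it is valid UTF-8.
		return !utf8.Valid(body)
	default:
		return true
	}
}

func main() {
	eucKR := []byte{0xC7, 0xD1, 0xB1, 0xDB} // "한글" encoded as EUC-KR

	fmt.Println(needsConversion(eucKR, "text/html; charset=EUC-KR"))          // prints "true"
	fmt.Println(needsConversion(eucKR, "text/html"))                          // prints "true"
	fmt.Println(needsConversion([]byte("한글"), "text/html; charset=utf-8")) // prints "false"
}
```

A check like this trusts a declared UTF-8 charset; sniffing `<meta charset>` inside the document is deliberately left out of the sketch.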

Release it as cli command line tool and compiled binary

Hi, I saw this in the README:

Command Line Tool
TODO

Any chance you could actually do this and also release it as a binary on GitHub?
Your goreadability would make a great little portable, easy-to-install CLI tool to combine as a filter/pipe with other tools.

  1. As a "reader mode"/"pocket mode" in the browser (Qutebrowser)
    An example (based on the old python-readability), https://github.com/qutebrowser/qutebrowser/blob/master/misc/userscripts/readability

  2. As a filter/pipe for other CLI tools, e.g. newsboat (RSS reader), shell scripts, text editors (vim), CLI web browsers, hs etc.

Not everyone has the entire Go distribution installed or, as a simple user, is familiar with compiling. Besides that, releasing a binary would make it easier for others to create a package for a Linux distro based on the binary (e.g. Arch Linux AUR).

Give it a URL or an HTML file, or read from standard input, and just dump the cleaned-up HTML (with or without pictures?) to a file or standard output. Additionally, you could:

IMPORTANT: don't forget UTF-8 Unicode encoding and East Asian languages (CJK). For some reason, almost everyone who releases these HTML parsing tools forgets about foreign-language pages/sites.

Extremely useful in the Linux world.

PS:
These guys have made something similar, but it doesn't seem to be maintained/updated anymore,
https://github.com/feeeper/newspaper/blob/readability-support/main.go

The difference is that while yours derives from the original Arc90/ruby,
theirs seems to be derived from go-shiori/go-readability (mozilla pocket/readability port)
