goq's Introduction

goq

Example

package main

import (
	"log"
	"net/http"

	"astuart.co/goq"
)

// Structured representation for github file name table
type example struct {
	Title string `goquery:"h1"`
	Files []string `goquery:"table.files tbody tr.js-navigation-item td.content,text"`
}

func main() {
	res, err := http.Get("https://github.com/andrewstuart/goq")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var ex example
	
	err = goq.NewDecoder(res.Body).Decode(&ex)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(ex.Title, ex.Files)
}

Details

goq

-- import "astuart.co/goq"

Package goq was built to allow users to declaratively unmarshal HTML into Go structs using struct tags composed of CSS selectors.

I've made a best effort to behave very similarly to JSON and XML decoding, and to expose as much information as possible in the event of an error to help you debug your unmarshaling issues.

When creating struct types to be unmarshaled into, the following general rules apply:

  • Any type that implements the Unmarshaler interface will be passed a slice of *html.Node so that manual unmarshaling may be done. This takes the highest precedence.

  • Any struct fields may be annotated with goquery metadata, which takes the form of an element selector followed by arbitrary comma-separated "value selectors."

  • A value selector may be one of html, text, or [someAttrName]. html and text will result in the methods of the same name being called on the *goquery.Selection to obtain the value. [someAttrName] will result in *goquery.Selection.Attr("someAttrName") being called for the value.

  • A primitive value type will default to the text value of the resulting nodes if no value selector is given.

  • At least one value selector is required for maps, to determine the map key. The key type must follow both the rules applicable to Go map indexing and these unmarshaling rules. The value of each key will be unmarshaled in the same way the element value is unmarshaled.

  • For maps, keys will be retrieved from the same level of the DOM. The key selector may be arbitrarily nested, but the first level of children with any number of matching elements will be used.

  • For maps, any values must be nested below the level of the key selector. Parents or siblings of the element matched by the key selector will not be considered.

  • Once used, a "value selector" will be shifted off of the comma-separated list. This allows you to nest arbitrary levels of value selectors. For example, the type []map[string][]string would require one selector for the map key, and take an optional second selector for the values of the string slice.

  • Any struct type encountered in nested types (e.g. map[string]SomeStruct) will override any remaining "value selectors" that had not been used. For example, given:

    type S struct { F string `goquery:",[bang]"` }

    struct { T map[string]S `goquery:"#someId,[foo],[bar],[baz]"` }

[foo] will be used to determine the string map key, but [bar] and [baz] will be ignored, with the [bang] tag present in the S struct type taking precedence.
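
As a rough illustration of the "shifting" rule described above, the following sketch splits a tag into its element selector and value selectors and then consumes the first value selector. It uses only the standard library and is not goq's actual parser: splitTag and shift are hypothetical names, and the sketch assumes the element selector itself contains no commas.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTag separates a goquery-style tag into the element selector and the
// remaining value selectors. Hypothetical helper, not part of goq; it assumes
// the element selector contains no commas.
func splitTag(tag string) (selector string, values []string) {
	parts := strings.Split(tag, ",")
	return parts[0], parts[1:]
}

// shift returns the first value selector and the rest of the list, mimicking
// how a used value selector is consumed before descending into a nested type.
func shift(values []string) (first string, rest []string) {
	if len(values) == 0 {
		return "", nil
	}
	return values[0], values[1:]
}

func main() {
	sel, vals := splitTag("#someId,[foo],[bar],[baz]")
	key, rest := shift(vals)
	fmt.Println(sel)  // the element selector
	fmt.Println(key)  // the value selector that would pick the map key
	fmt.Println(rest) // what remains for nested types
}
```

In this model, a nested struct type such as S would simply discard whatever is left in rest and use its own field tags instead.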

Usage

func NodeSelector

func NodeSelector(nodes []*html.Node) *goquery.Selection

NodeSelector is a quick utility function to get a goquery.Selection from a slice of *html.Node. Useful for performing unmarshaling, since the decision was made to use []*html.Node for maximum flexibility.

func Unmarshal

func Unmarshal(bs []byte, v interface{}) error

Unmarshal takes a byte slice and a destination pointer to any interface{}, and unmarshals the document into the destination based on the rules above. Any error returned here will likely be of type CannotUnmarshalError, though an initial goquery error will pass through directly.

func UnmarshalSelection

func UnmarshalSelection(s *goquery.Selection, iface interface{}) error

UnmarshalSelection will unmarshal a goquery.Selection into an interface appropriately annotated with goquery tags.

type CannotUnmarshalError

type CannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

CannotUnmarshalError represents an error returned by the goquery Unmarshaler and helps consumers in programmatically diagnosing the cause of their error.

func (*CannotUnmarshalError) Error

func (e *CannotUnmarshalError) Error() string

type Decoder

type Decoder struct {
}

Decoder implements the same API you will see in encoding/xml and encoding/json, except that it does not currently support proper streaming decoding, as goquery does not support it upstream.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder returns a new decoder given an io.Reader.

func (*Decoder) Decode

func (d *Decoder) Decode(dest interface{}) error

Decode will unmarshal the contents of the decoder when given an instance of an annotated type as its argument. It will return any errors encountered during either parsing the document or unmarshaling into the given object.

type Unmarshaler

type Unmarshaler interface {
	UnmarshalHTML([]*html.Node) error
}

Unmarshaler allows for custom implementations of unmarshaling logic.

TODO

  • Callable goquery methods with args, via reflection

goq's People

Contributors

andrewstuart, undefx

goq's Issues

Out of range panic in embedded map structures

The panic stack trace is the following:

panic: runtime error: index out of range

goroutine 1 [running]:
astuart.co/goq.goqueryTag.preprocess(0x7d907c, 0x13, 0xc42025be30, 0x7)
	/n/gopath/src/astuart.co/goq/unmarshal.go:40 +0x207
astuart.co/goq.unmarshalStruct(0xc42025be30, 0x8569c0, 0xc420184090, 0x199, 0x8569c0, 0x8569c0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:283 +0x1ac
astuart.co/goq.unmarshalByType(0xc42025be30, 0x7cbd60, 0xc420184090, 0x16, 0x7c1091, 0x9, 0x0, 0x0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.unmarshalMap.func1(0x0, 0xc42025be30, 0x1)
	/n/gopath/src/astuart.co/goq/unmarshal.go:405 +0x435
github.com/PuerkitoBio/goquery.(*Selection).EachWithBreak(0xc42025ba70, 0xc4204e3658, 0x7c1091)
	/n/gopath/src/github.com/PuerkitoBio/goquery/iteration.go:21 +0x10b
astuart.co/goq.unmarshalMap(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0xc42000e028, 0x195)
	/n/gopath/src/astuart.co/goq/unmarshal.go:390 +0x385
astuart.co/goq.unmarshalByType(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0x195, 0x7ffe20)
	/n/gopath/src/astuart.co/goq/unmarshal.go:208 +0x6ab
astuart.co/goq.unmarshalStruct(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x80fea0, 0x80fea0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:289 +0x245
astuart.co/goq.unmarshalByType(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x0, 0x0, 0xc42000e028, 0x199)
	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.UnmarshalSelection(0xc42025ba40, 0x7cbda0, 0xc42000e028, 0x0, 0xc420400020)
	/n/gopath/src/astuart.co/goq/unmarshal.go:180 +0x308
astuart.co/goq.(*Decoder).Decode(0xc420400020, 0x7cbda0, 0xc42000e028, 0xc420784000, 0x2ad98)
	/n/gopath/src/astuart.co/goq/decoder.go:37 +0xc4
main.store(0xc42016e0c0)
	/home/mester/twscrap/main.go:151 +0x1d3
main.startCollecting(0xc42015a280)
	/home/mester/twscrap/main.go:106 +0x458
main.main()
	/home/mester/twscrap/main.go:72 +0x1a4
exit status 2

The problem occurs with this structure type:

type T struct {
    A string `goquery:",[second-id]"`
}
type A struct {
    B map[string]T `goquery:"div.id,[div-id]"`
}

Module declares its path as: astuart.co/goq

Command:

GO111MODULE=on go get github.com/andrewstuart/goq@v1.0.0

Outputs:

go: finding github.com v1.0.0
go: finding github.com/andrewstuart v1.0.0
go: finding github.com/andrewstuart/goq v1.0.0
go: downloading github.com/andrewstuart/goq v1.0.0
go: extracting github.com/andrewstuart/goq v1.0.0
go get: github.com/andrewstuart/goq@v1.0.0: parsing go.mod:
	module declares its path as: astuart.co/goq
	        but was required as: github.com/andrewstuart/goq

I fixed it with:

replace (
  github.com/andrewstuart/goq => astuart.co/goq v1.0.0
)

But maybe there is a way to fix it on the repository side?

Readme

Add a README to help users

Fix bug in error where path does not fully show for non-pointers

E.g. 'main.page.Items[0xc42019e318]' (type int): a type conversion error occurred: strconv.ParseInt: parsing "": invalid syntax

when the real error should show the extra type info 'main.page.Items[0xc42019f048].Score' (type unknown: invalid value): a custom Unmarshaler implementation threw an error: strconv.ParseInt: parsing "": invalid syntax

Selector mapping from goq => goquery => cascadia doesn't work as expected

Hi,

I'm using goq and am very happy with it so far.

But now I want to extract CSS links from the <head></head> section of a page and can't get it working.

Here's the HTML:

<!DOCTYPE html>
<html lang="de"
  <head>
    <link rel="stylesheet" type="text/css" href="https://foo.bar/blah1.css"/>
    <link rel="stylesheet" type="text/css" href="https://foo.bar/blah2.css"/>
  </head>
</html>

And the code I'm trying to use:

package main

import (
        "log"
        "os"

        "astuart.co/goq"
)

type Site struct {
        CSS []string `goquery:"head > link[type='text/css'],[href]"`
}

func main() {

        fd, err := os.Open(os.Args[0])
        if err != nil {
                log.Fatalln(err)
        }

        s := &Site{}
        if err = goq.NewDecoder(fd).Decode(&s); err != nil {
                log.Fatalln(err)
        }

        log.Println(s)
}

The Site struct obj is empty after execution.

However, if I use the cascadia testing cli, the selector works:

cascadia -i sample.html -o -c "head > link[type='text/css']" -p "Link=ATTR:href"
Link
https://foo.bar/blah1.css
https://foo.bar/blah2.css

I assume I have not "translated" the selector properly to goq. If that's the case, how would I do it correctly?

thanks in advance,
Tom
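
One detail in the program above is worth checking independently of the selector: in Go, os.Args[0] is the path of the running program, and the first command-line argument is os.Args[1]. Opening os.Args[0] would therefore hand the compiled binary to the decoder rather than sample.html. The sketch below shows a small argument check using only the standard library. The inputPath helper is hypothetical, and whether this accounts for the empty result is an assumption, not something confirmed in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// inputPath returns the first real command-line argument. args is expected
// to have the shape of os.Args, where args[0] is the program itself.
// Hypothetical helper for illustration.
func inputPath(args []string) (string, error) {
	if len(args) < 2 {
		return "", errors.New("usage: prog <file.html>")
	}
	return args[1], nil
}

func main() {
	path, err := inputPath(os.Args)
	if err != nil {
		fmt.Println(err)
		return
	}
	// The file at this path is what would be passed to os.Open.
	fmt.Println("would open:", path)
}
```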

Document selectors (issue?)

I'm having an issue with selectors, and in general they're hard to deal with because they are not documented, neither here nor in GoQuery.

I have this markup:

image

And I select:

type Categorie struct {
	Text string `goquery:"a,text"`
	Link string `goquery:"a,[href]"`
	Sub []Categorie `goquery:"ul"`
}

type Menu struct {
	Categorie []Categorie `goquery:".menu-l1-li-hld li"`
}

I would expect the text and href selectors to return one value each, but I get a strange result: the text of every nested sub-link is appended together, while the href is not. Is it an issue with the lib?

image

Thanks

race condition & crash

Yesterday, we had a crash that seems to come down to a data race / concurrent map access in goq, so I ran our application after compiling it with the -race flag. It seems that the library regularly creates race conditions:

WARNING: DATA RACE
Read at 0x00c0001c30e0 by goroutine 137:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/map_faststr.go:12 +0x0
  github.com/andrewstuart/goq.goqueryTag.valFunc()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:89 +0x85
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
  github.com/andrewstuart/goq.unmarshalSlice()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
  github.com/andrewstuart/goq.unmarshalStruct()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
  github.com/andrewstuart/goq.UnmarshalSelection()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc

Previous write at 0x00c0001c30e0 by goroutine 44:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:202 +0x0
  github.com/andrewstuart/goq.goqueryTag.valFunc()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:115 +0x296
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
  github.com/andrewstuart/goq.unmarshalSlice()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
  github.com/andrewstuart/goq.unmarshalStruct()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
  github.com/andrewstuart/goq.UnmarshalSelection()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc

I'm currently investigating.

how does it compare with pure CSS selectors?

Is there any reason why the goquery tags don't use pure CSS selectors? It seems some special rules are required (i.e. an element selector followed by arbitrary comma-separated "value selectors"). The documentation doesn't even mention whether the "selectors" are actually CSS(3) selectors, though from the source it looks like that's the case.

I'm asking about this decision because, just before finding your library, I was thinking of developing something similar (i.e. a library that unmarshals pure CSS selectors, using cascadia, into /x/html.Node).
