cascadia's People

Contributors

andybalholm, benoitkugler, cjoudrey, jauderho, kinoute, martinlindhe, ryancox, sethwklein, suntong, wgh-

cascadia's Issues

oss-fuzz integration

Is there interest in integrating the fuzzer into Google's continuous fuzzing service, oss-fuzz?

Through that project, Google will run the fuzzer and send bug reports if a bug is found. Google expects the bugs to be fixed, so the fuzzers can go on and find other bugs, and other than that the service is free of charge.

I will be happy to set it up. All I need is a contact email address that is the official and public contact email of the project.

unknown pseudoclass :checked

I'm trying to match the selected value of a "select" element. This throws the error above so I guess "checked" is not supported?
css.Compile(`select[name="language"] > option:checked`)

exp/html moved to net subrepo

http://code.google.com/p/go/source/detail?r=ffbff9f7596e2655eab581b0188ad2a0217778f0&repo=net

Imports now should be:
"code.google.com/p/go.net/html"

Best regards,
Dobrosław Żybort

Original issue reported on code.google.com by [email protected] on 12 Feb 2013 at 10:01

switch go.net import path to golang.org/x/

As explained here: 
https://groups.google.com/forum/#!msg/golang-nuts/eD8dh3T9yyA/l5Ail-xfMiAJ

The code.google.com/p/go.xxx imports must be updated to use golang.org/x/xxx. This is currently blocking the import path updates for my goquery package. I would've liked to send a pull request instead of just logging an issue, but I'm not familiar enough with Mercurial and Google Code's PR process.

Thanks,
Martin

Original issue reported on code.google.com by [email protected] on 6 Nov 2014 at 1:38

call Parse on a group query "a, b" returns error

Hello,
First thank you very much for cascadia which is a great lib 👏

I have an issue calling cascadia.Parse with a group query: a, b
The error is: parsing "a, b": 3 bytes left over.

I wrote a quick test case locally to reproduce the error:

diff --git a/parser_test.go b/parser_test.go
index 0dacb79..cb29f56 100644
--- a/parser_test.go
+++ b/parser_test.go
@@ -1,6 +1,7 @@
 package cascadia
 
 import (
+	"fmt"
 	"testing"
 )
 
@@ -86,3 +87,15 @@ func TestParseString(t *testing.T) {
 		}
 	}
 }
+
+func TestParseGroup(t *testing.T) {
+	source := "a, b"
+
+	got, err := Parse(source)
+
+	if err != nil {
+		t.Fatalf("parsing %q: got error (%s)", source, err)
+	}
+
+	fmt.Println(got)
+}

And here is my output:

$ go  test -v --run TestParseGroup
=== RUN   TestParseGroup
    parser_test.go:97: parsing "a, b": got error (parsing "a, b": 3 bytes left over)
--- FAIL: TestParseGroup (0.00s)
FAIL
exit status 1
FAIL    github.com/andybalholm/cascadia 0.002s

I think something is wrong in https://github.com/andybalholm/cascadia/blob/master/parser.go#L854 when deciding to just return if a comma is found. Maybe returning on ) could also lead to an error, but I'm not sure.

WDYT?
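
The "3 bytes left over" message is consistent with a parser that consumes one selector (a) and stops at the comma, leaving ", b" unread. As a rough illustration only, and not cascadia's actual parser, a group can be viewed as a list of selectors separated by top-level commas. The helper name splitGroup is hypothetical, and the sketch ignores backslash escapes.

```go
package main

import (
	"fmt"
	"strings"
)

// splitGroup is a hypothetical helper, not part of cascadia. It splits a
// selector group on commas that are not inside parentheses, brackets, or
// quoted strings.
func splitGroup(group string) []string {
	var parts []string
	depth := 0     // nesting level of () and []
	var quote rune // active quote character, or 0 outside a string
	start := 0
	for i, r := range group {
		switch {
		case quote != 0:
			if r == quote {
				quote = 0
			}
		case r == '"' || r == '\'':
			quote = r
		case r == '(' || r == '[':
			depth++
		case r == ')' || r == ']':
			if depth > 0 {
				depth--
			}
		case r == ',' && depth == 0:
			parts = append(parts, strings.TrimSpace(group[start:i]))
			start = i + 1
		}
	}
	return append(parts, strings.TrimSpace(group[start:]))
}

func main() {
	fmt.Println(splitGroup("a, b"))
	// The unread remainder after parsing "a" is ", b", which is 3 bytes long.
	fmt.Println(len(", b"))
}
```

Each element returned by the sketch could then be handed to a single-selector parser on its own.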

Performance issues with nth-child pseudo-selector for large documents

Hello,

With large HTML documents (e.g. thousands of rows in a table), the performance of pseudo-selectors such as nth-child really suffers. I created a reproducible test case in https://github.com/PuerkitoBio/cascadia-nth-child after a user reported the issue on the goquery repo, and the results are:

2016/08/30 18:42:24 #tbl tr:nth-child(1) td: 30 elements in 1m0.838429805s via goquery
2016/08/30 18:43:25 #tbl tr:nth-child(1) td: 30 elements in 1m0.199170639s via cascadia
2016/08/30 18:43:25 tr + First + td: 30 elements in 10.90311ms

That is, selecting with #tbl tr:nth-child(1) td takes over a minute while doing the equivalent doc.Find("#tbl tr").First().Find("td") takes a few ms.

I think that's mostly by design. Having checked the code, I believe the (internal) cascadia selectors are stateless, so a compiled nth-child selector loops over the children of the node's parent for every node it tests, even though the nth-child(1) match may already have been found. I don't know of any easy solution, but I wanted to raise the issue and get your thoughts on it.
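
The cost difference can be sketched with a toy tree. This is a simplified model under stated assumptions, not cascadia's implementation: the node type, the steps counter, and both function names are hypothetical. The stateless check rescans siblings for every candidate, so testing all children of one parent takes a number of sibling visits that grows quadratically, while a direct index takes a single step.

```go
package main

import "fmt"

// node is a minimal stand-in for an HTML element; it is not html.Node.
type node struct {
	parent   *node
	children []*node
}

// steps counts sibling visits so the two strategies can be compared.
var steps int

// isNthChild is a stateless check: it rescans the parent's children for
// every node it is asked about.
func isNthChild(n *node, index int) bool {
	pos := 0
	for _, c := range n.parent.children {
		steps++
		pos++
		if c == n {
			return pos == index
		}
	}
	return false
}

// nthChild indexes the parent's children directly, which is closer to what
// the First() workaround in the issue achieves.
func nthChild(parent *node, index int) *node {
	steps++
	if index < 1 || index > len(parent.children) {
		return nil
	}
	return parent.children[index-1]
}

func main() {
	parent := &node{}
	for i := 0; i < 1000; i++ {
		parent.children = append(parent.children, &node{parent: parent})
	}

	steps = 0
	matches := 0
	for _, c := range parent.children {
		if isNthChild(c, 1) {
			matches++
		}
	}
	fmt.Println("stateless scan:", matches, "match,", steps, "sibling visits")

	steps = 0
	_ = nthChild(parent, 1)
	fmt.Println("direct index:", steps, "step")
}
```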

EDIT: related goquery issue: PuerkitoBio/goquery#126

Thanks,
Martin

Patch: Additional selectors

I expanded the selector set a bit and am submitting a patch in case you want to 
include any of it in the library.

If you see any issues, let me know. I'm more than happy to make changes.

And, btw, I've gotten a lot of use out of this library. Thank you for writing 
it! I basically think it's pretty awesome. 


Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 2:55

Case-insensitive selectors without using regex under the hood

It would be nice to add a case-insensitive attribute search. Currently this is possible only through a regex, which performs poorly. Instead, it would be possible to add the CSS 4 syntax (https://css4-selectors.com/selector/css4/attribute-case-sensitivity/). Under the hood you could use Go's strings.EqualFold function to check equality, which performs much better than a regex.

For example, these tags

<div class="Red">
<div class="red">

can be found by the div[class="red" i] selector
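
A minimal sketch of the suggested comparison, assuming a hypothetical helper named attrEquals (it is not part of cascadia's API). One caveat: strings.EqualFold applies Unicode case folding, whereas the CSS "i" flag is defined as ASCII case-insensitive, so the two can differ for non-ASCII values.

```go
package main

import (
	"fmt"
	"strings"
)

// attrEquals is a hypothetical helper, not cascadia's API. It compares an
// attribute value with the selector's value; when insensitive is true it
// behaves roughly like the "i" flag in [class="red" i].
func attrEquals(attr, want string, insensitive bool) bool {
	if insensitive {
		return strings.EqualFold(attr, want)
	}
	return attr == want
}

func main() {
	for _, class := range []string{"Red", "red"} {
		fmt.Println(class, attrEquals(class, "red", false), attrEquals(class, "red", true))
	}
}
```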

:first, :last and :nth selector

Hello,

What about implementing the :first, :last and :nth selectors, and not just *-child and *-of-type?

I know they come from jQuery rather than CSS, but they would be pretty useful.

Thank you!

Move to GitHub

Can you move this project to GitHub?
Google Code is shutting down.

Original issue reported on code.google.com by [email protected] on 14 Mar 2015 at 11:09

Attribute value of zero

Is this claim true:

using quotation marks around an attribute value is required only if this value is not a valid identifier

-- taken from here

I have a multiple attribute selector as "table[border=0][cellpadding=0][cellspacing=0]", and cascadia is failing with the following message:

panic: expected identifier, found 0 instead

goroutine 1 [running]:
panic(0x749a80, 0xc420075ec0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/andybalholm/cascadia.MustCompile(0x7ffe2d3262ea, 0x1b, 0x0)
        /...gopath/src/github.com/andybalholm/cascadia/selector.go:59 +0x7e
...

At first I thought it was issue #24, but it turns out to be caused by the zero value of the specified attribute. I.e., if I change the CSS selector to table[border="0"][cellpadding="0"][cellspacing="0"], it works.

Do you think it is worthwhile for you to double-check please? I assume there's a problem because xidel handles table[border=0][cellpadding=0][cellspacing=0] without any problem. Thx.
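
The quoted claim can be checked mechanically. Under the CSS 2.1 identifier rules, an identifier cannot start with a digit, or with a hyphen followed by a digit, so 0 is not an identifier and has to be quoted. The sketch below is a simplified check that ignores backslash escapes; the function names are hypothetical and this is not cascadia's parser.

```go
package main

import "fmt"

// isIdentStart reports whether r may begin a CSS identifier (escapes ignored).
func isIdentStart(r rune) bool {
	return r == '_' || r >= 0xA0 ||
		(r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
}

// isIdentChar reports whether r may appear after the first character.
func isIdentChar(r rune) bool {
	return isIdentStart(r) || r == '-' || (r >= '0' && r <= '9')
}

// needsQuotes is a hypothetical helper: it reports whether an attribute
// value must be written as a quoted string because it is not an identifier.
func needsQuotes(value string) bool {
	runes := []rune(value)
	if len(runes) == 0 {
		return true
	}
	i := 0
	if runes[0] == '-' { // a single leading hyphen is allowed
		i = 1
		if len(runes) == 1 {
			return true
		}
	}
	if !isIdentStart(runes[i]) {
		return true
	}
	for _, r := range runes[i+1:] {
		if !isIdentChar(r) {
			return true
		}
	}
	return false
}

func main() {
	for _, v := range []string{"0", "M", "-621-", "language"} {
		fmt.Printf("%q needs quotes: %v\n", v, needsQuotes(v))
	}
}
```

By this rule, table[border=0] is outside the strict grammar, and tools that accept it are being lenient.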

Attribute Contains (*=) not working

I am only able to get the attribute contains selector to work for some arbitrary values, not consistently. However, if I switch to the regex selector, it always works as expected.

E.g. this works:
[id#=-621-] but not this [id*=-621-]
I have tried with smaller portions of that text with the same result. Am I misunderstanding the substring selector? I would think that the above selectors should yield the same result. I'm using goquery, so not directly using cascadia.

The id that I am matching against is: eventLineBook-3139607-238-621-2
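
For reference, the intended meaning of the operators can be sketched with the standard library. The helper matchAttr is hypothetical and does not reflect how cascadia is implemented; the #= case simply models the regex operator used in this issue. For the id above, both *= and #= match the value -621-, so the difference reported here may come from how the unquoted value is parsed rather than from the matching itself.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchAttr is a hypothetical helper that mirrors the meaning of a few
// attribute operators; it is not how cascadia is implemented.
func matchAttr(op, attr, val string) (bool, error) {
	switch op {
	case "=":
		return attr == val, nil
	case "*=": // substring; an empty value matches nothing
		return val != "" && strings.Contains(attr, val), nil
	case "^=": // prefix
		return val != "" && strings.HasPrefix(attr, val), nil
	case "$=": // suffix
		return val != "" && strings.HasSuffix(attr, val), nil
	case "#=": // regular-expression match, the operator used in the issue
		return regexp.MatchString(val, attr)
	}
	return false, fmt.Errorf("unsupported operator %q", op)
}

func main() {
	id := "eventLineBook-3139607-238-621-2"
	for _, op := range []string{"*=", "#="} {
		ok, err := matchAttr(op, id, "-621-")
		fmt.Println(op, ok, err)
	}
}
```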

nested MatchAll fails

Using the latest cascadia, the following test fails:

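package cascadia_test

import (
	"strings"
	"testing"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func TestSubSelect(t *testing.T) {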
	const page = `  <table>
    <tr class="odd">
      <td>
        <table class="content">
          <tr>
            <td>a1</td>
            <td>a2</td>
            <td>a3</td>
          </tr>
        </table>
      </td>
    </tr>
    <tr class="even">
      <td>
        <table class="content">
          <tr>
            <td>b1</td>
            <td>b2</td>
            <td>b3</td>
          </tr>
        </table>
      </td>
    </tr>
  </table>`

	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		t.Fatal(err)
	}

	var (
		rows  = cascadia.MustCompile(`.even,.odd`)
		first = cascadia.MustCompile(`table tr td:nth-child(1)`)
	)

	var count = 0
	for _, node := range rows.MatchAll(doc) {
		for range first.MatchAll(node) {
			count++
		}
	}
	if count != 2 {
		t.Fatalf("expected 2 matches; got %v matches", count)
	}
}

Support :content pseudo-class

What steps will reproduce the problem?
1. Try a CSS selector ":content"
2. Yields an error: unknown pseudoclass :content
3. Cry

What is the expected output? What do you see instead?

I'd like the :content pseudo class to be handled :)

What version of the product are you using? On what operating system?

Using through goquery, checked out a few moments ago.

Please provide any additional information below.



Original issue reported on code.google.com by [email protected] on 19 Oct 2014 at 9:27

Change gopath

Should we change the sub-repository path?

sed -i 's|"code\.google\.com/p/go\.|"golang.org/x/|' $(find . -name '*.go')

https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA
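
The same rewrite can be sketched in Go for anyone who prefers not to use sed. The function name rewriteImports is hypothetical. Unlike the sed command above, which has no g flag and so replaces only the first match on each line, this replaces every occurrence in the input.

```go
package main

import (
	"fmt"
	"regexp"
)

// oldImport matches the opening quote and prefix of a go.xxx sub-repository
// import, mirroring the sed pattern in this issue.
var oldImport = regexp.MustCompile(`"code\.google\.com/p/go\.`)

// rewriteImports returns src with sub-repository imports pointed at
// golang.org/x/. Other code.google.com imports are left untouched.
func rewriteImports(src string) string {
	return oldImport.ReplaceAllString(src, `"golang.org/x/`)
}

func main() {
	fmt.Println(rewriteImports(`import "code.google.com/p/go.net/html"`))
}
```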

Original issue reported on code.google.com by [email protected] on 6 Nov 2014 at 4:05

Fails with has and direct child selector

Problem

Cascadia seems to fail with the :has pseudo-class and a direct child selector

From MDN

The following selector matches only <a> elements that directly contain an <img> child:
a:has(> img)

Cascadia test

var selectorTests = []selectorTest{
	{
		`<html><a><img></img></a></html>`,
		"a:has(> img)",
		[]string{
			"<a><img></img></a>",
		},
	},
        ...
}

Result

selector_test.go:651: error compiling "a:has(> img)": expected identifier, found > instead

Is there any chance to get this implemented?
Thanks!

Null pointer dereference

Hi,

My project uses this one as a library. After some slight changes to the web page that it parses, there is a panic:
Komosa/cf#4

I suppose it just needs to be handled around the lines from the stack trace, but I am filing an issue for reference.
offending line: gopath.../github.com/andybalholm/cascadia/selector.go:217

Top level > in query

I'm using goquery, and retrieved a selection. I want to do a further query on that selection, requiring direct descendants of the selected nodes, so I start my query with '>'. For example, say I find a particular 'ul' element and then want to target its direct child 'li' elements in a later query on the selection containing the 'ul': I'd pass '> li'. Perhaps I'm not doing this right, but it seems jQuery supports it and cascadia does not. Thanks.

PuerkitoBio/goquery#117

Handle colon in elementid

Similar to that problem:
http://stackoverflow.com/questions/5552462/handling-colon-in-element-id-with-jquery

Double backslashes don't work

<div id="test:abc" value="123">

value, exists := doc.Find("div#test:abc").Attr("value")
if !exists {
	log.Fatal("Not found\n")
} else {
	log.Printf("Value: %s\n", value)
}

panic: unknown pseudoclass :abc

goroutine 1 [running]:
github.com/andybalholm/cascadia.MustCompile(0x7f1020, 0x23, 0xc82006a380)
        /go/src/github.com/andybalholm/cascadia/selector.go:59 +0x72
github.com/PuerkitoBio/goquery.(*Selection).Find(0xc8201887b0, 0x7f1020, 0x23, 0x0)
        /go/src/github.com/PuerkitoBio/goquery/traversal.go:27 +0x38
main.ExampleScrape()
        /go/src/test/test.go:16 +0x7e
main.main()
        /go/src/test/test.go:42 +0x14
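
One possible workaround that avoids backslash escaping altogether is to match the id with a quoted attribute selector, a form that other issues on this page show working (for example table[border="0"]). The helper idSelector below is hypothetical, and whether the resulting selector behaves as intended in a given cascadia or goquery version is an assumption to verify.

```go
package main

import (
	"fmt"
	"strings"
)

// idSelector is a hypothetical helper that builds tag[id="..."] so that
// characters such as ':' in the id are not read as a pseudo-class.
func idSelector(tag, id string) string {
	// Escape backslashes and double quotes so the value stays one CSS string.
	escaped := strings.NewReplacer(`\`, `\\`, `"`, `\"`).Replace(id)
	return fmt.Sprintf(`%s[id="%s"]`, tag, escaped)
}

func main() {
	fmt.Println(idSelector("div", "test:abc"))
}
```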

Elements inside <noscript> are ignored?

The following test fails; shouldn't it pass?

diff --git a/selector_test.go b/selector_test.go
index 8438d38..b372cf9 100644
--- a/selector_test.go
+++ b/selector_test.go
@@ -35,6 +35,11 @@ var selectorTests = []selectorTest{
                },
        },
        {
+               `<noscript><img src=foo/></noscript>`,
+               "img",
+               []string{"<img src=\"foo/\">"},
+       },
+       {
                `<html><head></head><body></body></html>`,
                "*",
                []string{

nodeString vs html.Render in test

Hi Andy,

I'm wondering what your considerations were when choosing to use nodeString() function (instead of html.Render()) in your selector_test.go file. I mean, html.Render() would work as well, right?

The reason I'm asking is that, I think people would be more interested in the actual effect on html.Render(). Thx.

Error that doesn't allow to parse RSS.

This selector passes the test:

	{
		`<item><link>Any link</link><title>Any title</title></item>`,
		"item link:empty",
		[]string{
			"<link>",
		},
	},

All other tags work fine. Only link always returns empty text.
For example, "item title:empty" returns 0 elements.

Support for String() method

I would like to add support for converting a Selector back to a string with the String() method. But I think this would require changing the Selector type from a function type to a struct or an interface. So it would break the API. Should I make a Cascadia2 repository for that? Or is there another way to do it that I'm not thinking of?
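
The trade-off can be sketched with toy types. Everything below is hypothetical (element, funcSelector, matcher, and tagSelector are illustration names, not cascadia's types): a bare function value keeps no record of the text it was compiled from, while a struct behind an interface can store that data and print it back.

```go
package main

import "fmt"

// element is a minimal stand-in for an HTML node; it is not html.Node.
type element struct{ tag string }

// funcSelector shows the limitation: a bare function keeps no record of the
// selector text it was compiled from.
type funcSelector func(*element) bool

// matcher is a hypothetical interface that pairs matching with String().
type matcher interface {
	Match(*element) bool
	String() string
}

// tagSelector stores its data, so it can be turned back into selector text.
type tagSelector struct{ tag string }

func (s tagSelector) Match(e *element) bool { return e.tag == s.tag }
func (s tagSelector) String() string        { return s.tag }

func main() {
	var byFunc funcSelector = func(e *element) bool { return e.tag == "td" }
	var byStruct matcher = tagSelector{tag: "td"}

	e := &element{tag: "td"}
	fmt.Println(byFunc(e), byStruct.Match(e), byStruct.String())
}
```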

cascadia fails on multiple attribute selector

Multiple CSS attribute selectors such as input[name=Sex][value=M] are valid (for HTML like <input type="radio" name="Sex" value="F" /> etc.), as per this and this; however, cascadia fails on them:

panic: expected identifier, found 1 instead

goroutine 1 [running]:
panic(0x749a80, 0xc420075ec0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/andybalholm/cascadia.MustCompile(0x7ffe2d3262ea, 0x1b, 0x0)
        /export/repo/go-arch/src/github.com/andybalholm/cascadia/selector.go:59 +0x7e
...

Would you double-check please? Thx.

Consider including a go.sum file in the repo

This package currently does not have a go.sum file, which is causing build errors in some projects, e.g. Dentrax/GMDB#6

We should add a go.sum file as described here.

The go command uses the go.sum file to ensure that future downloads of these modules retrieve the same bits as the first download, to ensure the modules your project depends on do not change unexpectedly, whether for malicious, accidental, or other reasons. Both go.mod and go.sum should be checked into version control.

Node.Child no longer exists in the tip version of exp/html

What steps will reproduce the problem?
1. go get code.google.com/p/cascadia with a tip version of go fails

Error message is:

$ go get code.google.com/p/cascadia
# code.google.com/p/cascadia
/tmp/go/src/code.google.com/p/cascadia/selector.go:46: n.Child undefined (type *html.Node has no field or method Child)
/tmp/go/src/code.google.com/p/cascadia/selector.go:234: parent.Child undefined (type *html.Node has no field or method Child)
/tmp/go/src/code.google.com/p/cascadia/selector.go:279: parent.Child undefined (type *html.Node has no field or method Child)
/tmp/go/src/code.google.com/p/cascadia/selector.go:299: n.Child undefined (type *html.Node has no field or method Child)
/tmp/go/src/code.google.com/p/cascadia/selector.go:348: p.Child undefined (type *html.Node has no field or method Child)

What version of the product are you using? On what operating system?
go/ $ hg sum
parent: 14012:b8637622df90 tip
 exp/locale/collate/build: moved some of the code to the appropriate file, as
branch: default
commit: (clean)
update: (current)


Please provide any additional information below.
The attached patch gets tests passing again

Original issue reported on code.google.com by [email protected] on 8 Sep 2012 at 1:58

New release

Hi @andybalholm

Would it be possible to add a new release tag (v1.2.0?) for use with go mod?

There are some nice changes since v1.1 that would be great to version 😀

Matching in the context of a node returns (potentially) unexpected results

Hello Andy,

The following program returns the 2 <td> nodes under the first <tr>, even though the selector gives the impression that it should look for a .start class among the descendants of that <tr> (and should not find any):

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/andybalholm/cascadia"
)

var data = `
<!DOCTYPE html>
<html>
<body>
    <table class="start">
        <tbody>
            <tr>
                <td>test1</td>
                <td>test2</td>
            </tr>
            <tr>
            <td>
                <table>
                    <tbody>
                        <tr>
                           <td>test3</td>
                           <td>test4</td>
                        </tr>
                        <tr>
                           <td>test5</td>
                           <td>test6</td>
                        </tr>
                    </tbody>
                </table>
              </td>
            </tr>
        </tbody>
    </table>
</body>
</html>
`

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}

	// find outer tr
	rowSelection := doc.Find(".start > tbody > tr")
	fmt.Println("row selection length: ", len(rowSelection.Nodes))
	rowSelection.Each(func(i int, s *goquery.Selection) {
		fmt.Println(i, goquery.NodeName(s), s.AttrOr("class", ""))
	})
	fmt.Println()

	// get first outer <tr> and look for .start inside it
	tr0 := rowSelection.Get(0)

	cs := getMatcher(".start")
	matches := cascadia.QueryAll(tr0, cs)
	fmt.Println("expecting 0, returns 0: ", len(matches))

	cs = getMatcher(".start > tbody")
	matches = cascadia.QueryAll(tr0, cs)
	fmt.Println("expecting 0, returns 0: ", len(matches))

	cs = getMatcher(".start > tbody > tr")
	matches = cascadia.QueryAll(tr0, cs)
	fmt.Println("expecting 0, returns 0: ", len(matches))

	cs = getMatcher(".start > tbody > tr > td")
	matches = cascadia.QueryAll(tr0, cs)
	fmt.Println("expecting 0, returns 2: ", len(matches))
}

func getMatcher(s string) cascadia.Matcher {
	m, err := cascadia.ParseWithPseudoElement(s)
	if err != nil {
		log.Fatal(err)
	}
	return m
}

Correct me if I'm wrong, but I think this might be working as intended in Cascadia, even though it differs from what folks may be used to with jQuery. IIUC, the selector is always matched starting from the root of the document, but only descendants of the contextual node are returned (if they match).

This has come up in the context of PuerkitoBio/goquery#468, but after investigation and reading through some issues you closed, I have the feeling it is by design.
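
The behaviour described above can be reproduced with a toy model. This is a sketch under stated assumptions, not cascadia's code: the node type and the matches and queryAll functions are hypothetical, and the name field stands in for either a tag name or the single .start class. Matching walks up through parents with no knowledge of the context node, while the context only limits which nodes are tested.

```go
package main

import "fmt"

// node is a minimal stand-in for an HTML element; it is not html.Node.
type node struct {
	name     string // tag name, or ".start" for the one classed element
	parent   *node
	children []*node
}

// add appends a new child with the given name and returns it.
func (n *node) add(name string) *node {
	c := &node{name: name, parent: n}
	n.children = append(n.children, c)
	return c
}

// matches checks a chain of child combinators ("a > b > c") against n by
// walking up through parents, with no knowledge of any context node.
func matches(n *node, chain []string) bool {
	for i := len(chain) - 1; i >= 0; i-- {
		if n == nil || n.name != chain[i] {
			return false
		}
		n = n.parent
	}
	return true
}

// queryAll returns the descendants of ctx that match the chain.
func queryAll(ctx *node, chain []string) []*node {
	var out []*node
	for _, c := range ctx.children {
		if matches(c, chain) {
			out = append(out, c)
		}
		out = append(out, queryAll(c, chain)...)
	}
	return out
}

func main() {
	table := &node{name: ".start"}
	tr0 := table.add("tbody").add("tr")
	tr0.add("td")
	tr0.add("td")

	// Only descendants of tr0 are tested, but ancestors above tr0 still count.
	fmt.Println(len(queryAll(tr0, []string{".start", "tbody", "tr"})))
	fmt.Println(len(queryAll(tr0, []string{".start", "tbody", "tr", "td"})))
}
```

In this model the first query finds nothing because no descendant of the row is itself a row, while the second finds both cells because their ancestor chain reaches .start outside the context.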

Thanks,
Martin
