jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Home Page: https://jsoup.org

License: MIT License

Java 84.24% HTML 15.76%

jsoup html java dom css java-html-parser css-selectors xml xpath parser

jsoup's Issues

Unadorned text following data-only tags doesn't parse properly

This HTML, parsed and immediately printed out:

<html>
<body>
<script type="text/javascript">
var inside = true;
</script>
this should be outside.
</body>
</html>

Results:

<html>
<head>
</head>
<body>
<script type="text/javascript">
var inside = true;

this should be outside.

</script>
</body>
</html>

Note how "this should be outside" ends up inside the <script> tag, instead of following it. From what I can tell, this only happens to data-only tags.

Cleaner.isValid improvement idea for form processing

Hi,
I'm using jsoup behind some wicket form processing. I really like it.
Maybe there could be a way to call e.g. Cleaner.isValid(String input, Whitelist list),
which returns false on the first tag removed.
Of course it could be coded manually, but I think that might be a nice feature.

What do you think?

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

You will find the page in "[email protected]:bimargulies/Misc.git" under the jsoup-tc directory.

StringIndexOutOfBoundsException when parsing link http://news.yahoo.com/s/nm/20100831/bs_nm/us_gm_china

java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(String.java:686)
at java.util.regex.Matcher.appendReplacement(Matcher.java:711)
at org.jsoup.nodes.Entities.unescape(Entities.java:69)
at org.jsoup.nodes.TextNode.createFromEncoded(TextNode.java:95)
at org.jsoup.parser.Parser.parseTextNode(Parser.java:222)
at org.jsoup.parser.Parser.parse(Parser.java:94)
at org.jsoup.parser.Parser.parse(Parser.java:54)
at org.jsoup.Jsoup.parse(Jsoup.java:30)

Handle incorrectly gzipped responses as well

Some bad sites return gzipped responses regardless of the client's capabilities. It would be nice if Jsoup detected the presence of Content-Encoding: gzip in the response header and then used GZIPInputStream to read the response. Right now it does not do that.

Example of such a site: http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

This issue is triggered by http://stackoverflow.com/questions/3406289
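
A minimal sketch of the requested behaviour, using only the JDK. The class and method names (GzipResponseSketch, decodeBody) are hypothetical and are not part of jsoup; the idea is simply to look at the Content-Encoding value and wrap the body stream when it says gzip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not jsoup code.
final class GzipResponseSketch {

    /** Wraps the body in a GZIPInputStream when the header value is "gzip". */
    static InputStream decodeBody(String contentEncoding, InputStream body) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body; // not gzipped (or no header): read as-is
    }

    /** Test helper: gzip a string so the example needs no network access. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Reads a stream fully and decodes it as UTF-8. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<p>hello</p>");
        InputStream body = decodeBody("gzip", new ByteArrayInputStream(compressed));
        System.out.println(readAll(body));
    }
}
```

With a real HttpURLConnection, the header value would come from conn.getContentEncoding() and the stream from conn.getInputStream().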

JSoup cannot parse IDs with dash

If I try to use the following expression
doc.select("#expandable-nav");
I get the following error
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query #expandable-nav

Add xpath support

I put a reward of $80 as well
http://nextsprocket.com/tasks/add-xpath-to-jsoup-java-html-parser-library

Set USER_AGENT

It would be good to be able to set the user agent on the fly for Jsoup.parse(url). Many sites block a java user_agent and return a 403.
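
For illustration, this is roughly how a User-Agent header is set on a plain JDK connection. It is a sketch only: the class name, the example URL, and the agent string are made up, and it says nothing about how jsoup would expose the option.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not jsoup code.
final class UserAgentSketch {

    /** Opens (but does not yet connect) an HTTP connection with a custom User-Agent. */
    static HttpURLConnection open(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Request headers must be set before the connection is established.
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Example values only; no request is sent here.
        HttpURLConnection conn = open("http://example.com/", "Mozilla/5.0 (compatible; ExampleBot/1.0)");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```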

Can't get text of a <link></link> node

    String html = "<link>http://www.google.com</link><link1>http://link1.com</link1>";
    Document doc = Jsoup.parse(html);
    String link = doc.select("link").first().text();
    System.out.println("Link: " + link);
    String link1 = doc.select("link1").first().text();
    System.out.println("Link1: " + link1);

The result is :

    Link: 
    Link1: http://link1.com

It seems the content of the "<link>" node is ignored

Node.absUrl returns empty String

I tried different nodes, but all I get is an empty String when calling .absUrl(.attr("href")). In contrast, when calling .attr("abs:href") I DO GET THE CORRECT absolute URL!?

So it seems something is wrong with the absUrl(...) function.

Cheers, Christian

Html entities containing digits are not unescaped correctly

Some html entities (such as sup1, sup2) are not unescaped correctly by Entities.unescape because they contain digits.

The problem is the pattern Entities.unescapePattern. I changed it to '&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?', and it worked fine for me. But there might be side effects ...

You can see my changes here : clementdenis@d65387c
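
A small JDK-only check of the proposed pattern. The letters-only pattern below is an assumption made for comparison (the issue does not quote the original Entities.unescapePattern), and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness, not jsoup code.
final class EntityPatternSketch {

    // ASSUMPTION: a letters-only name pattern, used here only to show the failure mode.
    static final Pattern LETTERS_ONLY =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[a-zA-Z]+);?");

    // The pattern proposed in the issue: entity names may contain digits.
    static final Pattern WITH_DIGITS =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?");

    /** Returns the name of the first entity the pattern finds, or null if none. */
    static String firstEntityName(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "x&sup2;";
        // The letters-only pattern stops at the digit and sees "sup".
        System.out.println(firstEntityName(LETTERS_ONLY, input));
        // The proposed pattern captures the full name "sup2".
        System.out.println(firstEntityName(WITH_DIGITS, input));
    }
}
```

One side effect worth checking with the wider pattern: because the trailing semicolon is optional, text such as "&123" would now also be treated as a candidate entity name.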

Should treat unknown tags as inline, not block

See: http://groups.google.com/group/jsoup/browse_thread/thread/711fb6d0c4818ead?hl=en_US#

We should probably treat unknown tags as inline, rather than block tags. Otherwise an unknown tag within a <p> causes the auto-closer to close the P, so <p><custom>Test</custom></p> parses to <p></p><custom>Test</custom>.

Need to think about what impact that would have on unknown tags that should be blocks.

Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.

options tags not properly normalised from ugly HTML

After parsing a large HTML document from the wild, unclosed <option> tags are not being automatically closed when a second <option> tag (or finishing </select> tag) is met.

Example:

Element node:
DetailsTurnsCRXP ... etc. Then there is another element node containing the first <option> tag (value="title") and onward. Within that element node exists a single data node: DetailsTurnsCRXP ... etc. Nothing else follows.

Incorrect normalisation on headless body

Parsing <html><body><span class="foo">bar</span> creates <html><body><span class="foo">bar</span><head></head></html>: in the normalisation process, the head element is appended to the html element, instead of prepended.

Thanks to Patrick Smith @ ucsc.edu for reporting the issue.

Add Node#remove and Node#replaceWith(node)

Add support for direct DOM tree remove and replacement.

StringIndexOutOfBoundsException when testing whether String content is valid HTML

If I try to parse a tag with an equals sign (an empty attribute) but without any single or double quotes around an attribute value, then I get a StringIndexOutOfBoundsException. The stack trace is pasted below.

An example String would be "<a =a"

The following JUnit test case should not throw a StringIndexOutOfBoundsException:

    import static org.junit.Assert.assertTrue;

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;
    import org.junit.Test;

    public class BadAttributeTest {
        @Test
        public void aTagWithABadAttributeIsValid() throws Exception {
            // Should return a result rather than throw StringIndexOutOfBoundsException.
            assertTrue(Jsoup.isValid("<a =a", Whitelist.relaxed()));
        }
    }

java.lang.StringIndexOutOfBoundsException: String index out of range: 13
at java.lang.String.charAt(String.java:686)
at org.jsoup.parser.TokenQueue.consume(TokenQueue.java:130)
at org.jsoup.parser.Parser.parseAttribute(Parser.java:207)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:142)
at org.jsoup.parser.Parser.parse(Parser.java:91)
at org.jsoup.parser.Parser.parseBodyFragment(Parser.java:64)
at org.jsoup.Jsoup.parseBodyFragment(Jsoup.java:99)
at org.jsoup.Jsoup.isValid(Jsoup.java:155)

Add :eq() pseudo-selector

Add support for the eq() jquery pseudo-selector.

Assertion error while cleaning with empty attribute

If you try to clean <img alt="" />, an IllegalArgumentException is thrown; there's an assertion in the Whitelist TypedValue that the value is not empty. Should be testing not null instead.

Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.

302 redirects are not followed

Not sure if this is a bug or done intentionally, but HTTP 302 redirects are not followed. It'd be great if they could be.

-edit-

I saw "// todo: error handling options, allow user to get !200 without exception" in HttpConnection, so maybe this is more of a feature request...

JSoup cannot parse IDs with underscores

Example:
Elements id = doc.select( "#An_ID_name" );

Error output:
Could not parse query #An_ID_name

Underscores are valid characters for IDs, but JSoup seems to choke on them. Regular IDs are working fine. There are other valid characters that I haven't tested, like dashes - these should all be accepted.

Add #html() method to Elements

Add a collecting html() method to Elements, to align with text().

Also think about supporting Elements#html(String). Not sure we want to do this (effectively you'd use this to avoid getting a single element via first() and setting HTML on it). Still, it would support some use cases. If we do, we should also implement the prepend and wrap methods.

On URL parse, complain if not text/*

Throw an IO exception if the content-type of the HTTP response is not text/*

This is to prevent trying to parse PDFs for example.
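
A sketch of what such a check could look like, assuming the Content-Type header value is available as a string. The class and method names are hypothetical; how jsoup would wire this in is not stated in the issue.

```java
import java.io.IOException;
import java.util.Locale;

// Hypothetical helper, not jsoup code.
final class ContentTypeSketch {

    /** True when the media type is text/*, ignoring case and any parameters such as charset. */
    static boolean isText(String contentType) {
        if (contentType == null) {
            return false; // no header: treated as non-text in this sketch
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    /** Throws an IOException for anything that is not text/*, e.g. application/pdf. */
    static void requireText(String contentType) throws IOException {
        if (!isText(contentType)) {
            throw new IOException("Unhandled content type: " + contentType);
        }
    }

    public static void main(String[] args) throws IOException {
        requireText("text/html; charset=UTF-8"); // passes silently
        System.out.println(isText("application/pdf"));
    }
}
```

Treating a missing header as non-text is a design choice of this sketch; a more lenient variant could let it through.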

When cleaning HTML, relative links should be resolved to absolute for protocol test

When cleaning HTML with relative links, the href attributes should be resolved to absolute links against the base URI, to confirm the protocol is allowed. At the moment relative links are dropped.

Add TextNode#text and #text(string) support

Simplify getter and setter on TextNodes.

Add support for Element class manipulation

Add support for Element addClass, removeClass, toggleClass (hasClass, classNames exist, this adds convenience)

Also include in Elements: addClass / removeClass / toggleClass act on all elements, and hasClass returns true on the first match.

Wrong html parsing (probably) due to isEmptyElement

Look at this original HTML code
----HTML START---

        <p id="pivot">
            <span style="font-weight:bold;">
                <table width="1" align="left" class="foto-v-left">
                    <tr>
                        <td>
                            <img align="left" alt="x" title="y" border="0" width="140" height="180" src="http://foo.org/iPhoneApp1.jpg"></td>
                        </tr>
                        <tr>
                            <td>Txt1</td>
                        </tr>
                    </table>
        Txt2 - </span>
        Txt3 
        </p>
</body>

---HTML END--- Try to parse it! The "toString()" of the resulting org.jsoup.nodes.Document looks like:

---ToString start---


Txt1

Txt2 - Txt3

---ToString end---

As you can see, the documents differ in structure. For example, "Txt2" and "Txt3" are not children of the "p" element; they are children of a "div".

Add element.val() to get input, textarea values

Include bulk methods in Elements, too.

Return empty string if not a textarea or an element with a value attribute.

Make Document Cloneable

I think it's useful to let Document implement Cloneable and return a deep copy on Document#clone().

JSoup cannot CSS select IDs with a colon

If I try to use the following expression
doc.select ( "#" + pageId );
where pageId happens to be 'PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage' in one case I use, I get the following error:

Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage': unexpected token at ':PlugIn0_ManageMailStoreUserMultipleSelectionsPage'

I know this issue has come up with underscores and dashes, but I thought I would bring it to your attention that it happens with colons as well.

Implement :not pseudo-selector

In version 1.3.3, the pseudo selector :not is not implemented.

JSoup unable to extract text from paragraphs

I have the following test case for a CNN url: http://pastebin.com/yqZ1fbY1

if you look at the output you'll be able to see that it doesn't print most of the paragraphs, in fact the second paragraph of the story is rendered as: http://pastebin.com/Hh8KyRwD

expected output would be the text from the 2nd paragraph
"We will continue to highlight the Democratic Party's role in strengthening it and the Republican Party's role in opposing it," etc..........

Suggestion: operators at the start of a selector

In jQuery, when doing further DOM selection on an element (e.g. using find), you can use operators at the start of the query to filter based on the current element.

For example, this jQuery: $('table.data > tbody > tr').find('> td') will select td elements that are direct children of the rows found in the first query. It will not select td elements from any nested tables.

With JSoup, this would be something like:

Elements tableRows = doc.select( "table.data>tbody>tr" );
for ( Element tr : tableRows )
{
    // do something with tr here
    tr.select(">td");
}

I currently get this error: Could not parse query >td

"charset=latin-1" is not properly detected

From the mailing list: http://groups.google.com/group/jsoup/browse_thread/thread/09d8325e0e5a46c6#

I just downloaded jsoup 1.3.3 and gave it a try. It works great for
UTF-8 encoded websites, but dies for LATIN-1 encoded sites.
The site that caused the error below is:
http://www.macupdate.com
In the html source you'll find this line:

Here the full stacktrace:
Exception in thread "main"
java.nio.charset.UnsupportedCharsetException: LATIN-1
at java.nio.charset.Charset.forName(Charset.java:505)
at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:58)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:376)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:122)
at rgse.test.Main.main(Main.java:15)
System:
Mac OS X 10.5
Java 1.6
jsoup 1.3.3

Reason:
The general problem:
In DataUtil.java, line 56, the charset name is identified as
"LATIN-1". That name is handed to Charset.forName(). However,
"LATIN-1" does not seem to be recognized as valid character set alias
as defined in http://www.iana.org/assignments/character-sets
The correct character set alias for "LATIN-1" should be "latin1". I
wrote a small test program and the following line runs without problems:
Charset c = Charset.forName("latin1"); // WORKS
Charset c = Charset.forName("LATIN-1"); // FAILS
Solution:
Maybe somewhere in DataUtil.getCharsetFromContentType()? At least this
is where the character set is parsed and turned into all uppercase
(breaks for latin1).
Thanks!
Rico
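
One defensive option, sketched with the JDK only: look the declared name up and fall back to a default when the JVM does not know it. The class and method names are hypothetical, and the fallback policy is an assumption rather than what jsoup does.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not jsoup code.
final class CharsetLookupSketch {

    /** Returns the named charset, or the fallback if the name is illegal or unsupported. */
    static Charset charsetOrDefault(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            // Charset names are matched case-insensitively, so no upper-casing is needed.
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // "latin1" is a registered alias of ISO-8859-1.
        System.out.println(charsetOrDefault("latin1", StandardCharsets.UTF_8));
        // "LATIN-1" is the name from the report, which Charset.forName rejects.
        System.out.println(charsetOrDefault("LATIN-1", StandardCharsets.UTF_8));
    }
}
```

Falling back silently can hide mis-decoded text, so a real implementation might prefer to map known misspellings such as "LATIN-1" to "ISO-8859-1" explicitly.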

Parsing tags in a custom tag library with a colon

There is a problem with parsing tags such as <abc:door />. Currently, JSoup splits the tag into <abc :door=""></abc>, which is invalid and makes the output useless.

Problem with <td tag

Hello

Doing the following:

final Elements rows = doc.select("body > table > tr");
for ( Element row: rows ) {
final Element date = row.child(0); // select("td").first();
}

For the first < tr >, this returns < td class="company"...; the first child is skipped.
For the second < tr >, it correctly returns < td >21-Feb...</ td >.

see comments beside tags in html below

This html:

< table cellspacing="0" cellpadding="0" border="0">

    <col width="12%">
    <col>
</colgroup>
<tbody>
    <tr>
        <th class="tl">Date Posted</th>
        <th>Details Preview</th>
        <th>Type</th>
        <th>Amount</th>
        <th class="tr">Location</th>
    </tr>
    <tr>    <!-- if inspect code then displayed as <tr td=""> and first child is <td class="company">...</td>    -->  
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>
    <tr class="alternate"> <!-- BUT in this row all ok. First child is <td>21-Feb...</td>-->
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>

</tbody>

toString NPE for orphans

I'm working on code that frequently calls 'remove' and then re-adds an element. While the element is in a detached state, toString throws something, so Eclipse prints only an 'invocation target exception.' It would be nice if this were not so.

Update the tag definitions to allow span and header elements to be block

Review the HTML5 spec; span and header tags (and what else?) can be block elements and not phrasing content, so the tag spec needs to be updated.

Selector for data attributes in HTML5

Hi Jhy,

is it possible to consume data elements?

    <li class="user" data-name="John Resig" data-city="Boston"
      data-lang="js" data-food="Bacon">
      <b>John says:</b> <span>Hello, how are you?</span>
    </li>

Jsoup.parse(document).select("[data]"); doesn't work for me.

I really love jsoup, thanks for your awesome work.

attr("abs:href") , absUrl("href")

Document doc = Jsoup.parse(new URL("http://www.oschina.net/bbs/thread/12975"), 5*1000);
Elements es = doc.select("a[href]");
for (Iterator<Element> it = es.iterator(); it.hasNext();) {
    Element e = it.next();
    System.out.println(e.absUrl("href"));
}

attr("abs:href") ------ <a href="?p=1">1</a>
result: ------------------- http://www.oschina.net/bbs/thread/?p=1

I think it's a wrong result~.
The correct results should be "http://www.oschina.net/bbs/thread/12975?p=1"
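
The expected result can be reproduced with a small workaround on top of java.net.URI. This is a sketch under assumptions: the class and method names are hypothetical, and the special case exists because, as far as I know, the JDK resolver follows the older RFC 2396 rules, under which a query-only reference can replace the last path segment.

```java
import java.net.URI;

// Hypothetical helper, not jsoup code.
final class QueryOnlyResolveSketch {

    /** Resolves a relative reference, keeping the base path for query-only references. */
    static URI resolve(URI base, String relative) {
        if (relative.startsWith("?")) {
            // Prepend the base path so "?p=1" applies to the current document.
            relative = base.getRawPath() + relative;
        }
        return base.resolve(relative);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.oschina.net/bbs/thread/12975");
        System.out.println(resolve(base, "?p=1"));
    }
}
```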

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

However, I can't see how to attach a file to an issue here.

Issue with <tr>

When calling append to add a table row the resulting tr gets wrapped in a table even though I appended to an existing table.

Add option to output non-pretty-printed HTML

Add an option to output HTML that is formatted (spaces / newlines / indentation) as the original source, and not force pretty-printed.

Implement with a switch in Document, to force preserve whitespace on all nodes. Will require Nodes to have a direct accessor to their parent Document

IndexOutOfBoundsException in HttpConnection when the response has empty headers

I get this exception, because a response header is empty.

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.jsoup.helper.HttpConnection$Response.setupFromConnection(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:338)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:132)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:121)
at org.jsoup.Jsoup.parse(Jsoup.java:133)

Normalise document after parse

Add a post-parse document normalisation phase.

Particularly, move text nodes that aren't in #body (ie in #root, #html, #head) into body.

Add a textNode#isWhitespace method to check if textnode should be moved.

Parsing a HTML snippet causes the leading text to be moved to back

Code:

String html = "foo <b>bar</b> baz";
String text = Jsoup.parse(html).text();
System.out.println(text);

Result:

bar baz foo

Expected:

foo bar baz

Suggestion: new method Elements.parents()

A function similar to jQuery's parents() - http://api.jquery.com/parents/ - would be a nice addition. The function would return all parent elements of the current Element. Or, given an optional parameter, would filter based on that.

So if, for example, you selected all bold text with Elements elems = doc.select("b"), you could then find all bold tags that were in paragraphs with elems.parents("p"), which would select the paragraphs themselves if you wanted to do some processing on them.

You could also add the optional selector to the parent() function too - although it is as easy in this case to simply select the parent and check if the tag or class etc matches.

Does not auto-detect undeclared character sets

When looking at: http://money.cnn.com/2010/10/25/news/companies/motley_crue_bp.fortune/index.htm?section=money_latest the umlaut character is being picked up in the paragraph text correctly as htmlentities but if you grab the title it's showing up as invalid unicode characters.

doc.getElementsByTag("title").first().text();

Modify Elements#attr to get from first element with match

Currently, Elements#attr pulls the attribute from the first element. But Elements#hasAttr scans all of the elements in the collection to check if one has an attribute. So these do not align.

Modify Elements#attr to scan for the first Element that hasAttr, and return the value from that element.

uppercase umlauts get replaced by lowercase umlaut entities

The line

System.out.println(Jsoup.clean("<h1>Überschrift</h1>", Whitelist.none()));

should print

&Uuml;berschrift

but prints

&uuml;berschrift

This used to work correctly in v0.3.1, but fails in v1.2.3.

While baseArray in Entities.java distinguishes between lowercase and uppercase umlauts, the above call yields the wrong result.

lower cased html attributes

As I already stated in a previous post, we are using JSTL tags (java custom tags) and we require the attributes to be camel cased to match some methods in our java code. Is it possible to give an option to leave the attributes as they are and not modify them by making them lower case?

e.g. <abc:ourtag returnUrl="http://abc.com" /> should not be changed to <abc:ourtag returnurl="http://abc.com" />

Thanks!