jsoup's People

Contributors

benbenw, cketti, cromoteca, dependabot[bot], hannibal218bc, hazendaz, isira-seneviratne, jairideout, jaredstehler, jhy, kno10, kovacstamasx, krystiangorecki, kzn, legioth, mccxj, mitemitreski, morokosi, offa, pascalschumacher, schmid-michael, sebkur, sedran, steinarb, suarez12138, talgatakhm, tc, tipabu, travisfw, zjiajun

jsoup's Issues

Unadorned text following data-only tags doesn't parse properly

Given this HTML, parsed and immediately printed out:

<html>
<body>
<script type="text/javascript">
var inside = true;
</script>
this should be outside.
</body>
</html>

Results:

<html>
<head>
</head>
<body>
<script type="text/javascript">
var inside = true;

this should be outside.

</script>
</body>
</html>

Note how "this should be outside" ends up inside the <script> tag, instead of following it. From what I can tell, this only happens to data-only tags.

Cleaner.isValid improvement idea for form processing

Hi,
I'm using jsoup behind some wicket form processing. I really like it.
Maybe there could be a way to call e.g. Cleaner.isValid(String input, Whitelist list), which returns false on the first tag removed.
Of course it could be coded manually, but I think that might be a nice feature.

What do you think?

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

You will find the page in "[email protected]:bimargulies/Misc.git" under the jsoup-tc directory.

StringIndexOutOfBoundsException when parsing link http://news.yahoo.com/s/nm/20100831/bs_nm/us_gm_china

java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(String.java:686)
at java.util.regex.Matcher.appendReplacement(Matcher.java:711)
at org.jsoup.nodes.Entities.unescape(Entities.java:69)
at org.jsoup.nodes.TextNode.createFromEncoded(TextNode.java:95)
at org.jsoup.parser.Parser.parseTextNode(Parser.java:222)
at org.jsoup.parser.Parser.parse(Parser.java:94)
at org.jsoup.parser.Parser.parse(Parser.java:54)
at org.jsoup.Jsoup.parse(Jsoup.java:30)

JSoup cannot parse IDs with dash

If I try to use the following expression
doc.select("#expandable-nav");
I get the following error
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query #expandable-nav

Set USER_AGENT

It would be good to be able to set the user agent on the fly for Jsoup.parse(url). Many sites block a java user_agent and return a 403.
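As a point of comparison, this is roughly how a caller-supplied user agent is set with the JDK's own HttpURLConnection. It is a minimal sketch, not jsoup's implementation: the class name UserAgentSketch and the helper open are hypothetical, and no request is sent until the caller actually connects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class UserAgentSketch {
    // Hypothetical helper: prepares (but does not send) a request with a custom User-Agent.
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Replaces the default "Java/<version>" agent that some sites reject.
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then read the response from the returned connection, for example via getInputStream().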

Can't get text of a <link></link> node

    String html = "<link>http://www.google.com</link><link1>http://link1.com</link1>";
    Document doc = Jsoup.parse(html);
    String link = doc.select("link").first().text();
    System.out.println("Link: " + link);
    String link1 = doc.select("link1").first().text();
    System.out.println("Link1: " + link1);

The result is :

    Link: 
    Link1: http://link1.com

It seems the content of the "<link>" node is ignored.

Html entities containing digits are not unescaped correctly

Some html entities (such as sup1, sup2) are not unescaped correctly by Entities.unescape because they contain digits.

The problem is the pattern Entities.unescapePattern. I changed it to '&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?', and it worked fine for me. But there might be side effects ...

You can see my changes here : clementdenis@d65387c
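The proposed pattern can be tried out on its own with java.util.regex. The sketch below is only an illustration under stated assumptions: the class EntityPatternSketch, its unescape method and the three-entry entity table are hypothetical stand-ins, not the code in Entities.java.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityPatternSketch {
    // Pattern from the report: named entities may now contain digits (sup1, sup2).
    static final Pattern UNESCAPE =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?");

    // Tiny illustrative table; a real entity table is far larger.
    static final Map<String, Integer> NAMED =
            Map.of("sup1", 0x00B9, "sup2", 0x00B2, "amp", 0x0026);

    static String unescape(String input) {
        Matcher m = UNESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = m.group(0); // unknown entities are left untouched
            String digits = m.group(3);
            if (digits != null) {
                int radix = m.group(2) != null ? 16 : 10;
                try {
                    replacement = new String(Character.toChars(Integer.parseInt(digits, radix)));
                } catch (IllegalArgumentException e) {
                    // not a valid number or code point: keep the original text
                }
            } else {
                Integer codePoint = NAMED.get(m.group(1));
                if (codePoint != null) {
                    replacement = new String(Character.toChars(codePoint));
                }
            }
            // quoteReplacement guards against '$' and '\' in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this table, "x&sup2;" comes back as "x²", while an unknown name such as "&bogus;" passes through unchanged.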

Should treat unknown tags as inline, not block

See: http://groups.google.com/group/jsoup/browse_thread/thread/711fb6d0c4818ead?hl=en_US#

We should probably treat unknown tags as inline, rather than block tags. Otherwise an unknown tag within a <p> causes the auto-closer to close the P, so <p><custom>Test</custom></p> parses to <p></p><custom>Test</custom>.

Need to think about what impact that would have on unknown tags that should be blocks.

Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.

options tags not properly normalised from ugly HTML

After parsing a large HTML document from the wild, unclosed <option> tags are not being automatically closed when a second <option> tag (or finishing </select> tag) is met.

Example:

Element node:
DetailsTurnsCRXP ... etc. Then there is another element node containing the first <option> tag (value="title") and onward. Within that element node exists a single data node: DetailsTurnsCRXP ... etc. Nothing else follows.

Incorrect normalisation on headless body

Parsing <html><body><span class="foo">bar</span> creates <html><body><span class="foo">bar</span><head></head></html>: in the normalisation process, the head element is appended to the html element, instead of prepended.

Thanks to Patrick Smith @ ucsc.edu for reporting the issue.

StringIndexOutOfBoundsException when testing whether String content is valid HTML

If I try to parse a tag with an equals sign (an empty attribute) but without any single or double quotes around an attribute value, then I get a StringIndexOutOfBoundsException. The stack trace is pasted below.

An example String would be "<a =a"

The following JUnit test case should not throw a StringIndexOutOfBoundsException:

import static org.junit.Assert.assertTrue;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.junit.Test;

public class BadAttributeTest {
    @Test
    public void aTagWithABadAttributeIsValid() throws Exception {
        // Must not throw StringIndexOutOfBoundsException on the unquoted, unnamed attribute.
        assertTrue(Jsoup.isValid("<a =a", Whitelist.relaxed()));
    }
}

java.lang.StringIndexOutOfBoundsException: String index out of range: 13
at java.lang.String.charAt(String.java:686)
at org.jsoup.parser.TokenQueue.consume(TokenQueue.java:130)
at org.jsoup.parser.Parser.parseAttribute(Parser.java:207)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:142)
at org.jsoup.parser.Parser.parse(Parser.java:91)
at org.jsoup.parser.Parser.parseBodyFragment(Parser.java:64)
at org.jsoup.Jsoup.parseBodyFragment(Jsoup.java:99)
at org.jsoup.Jsoup.isValid(Jsoup.java:155)

302 redirects are not followed

Not sure if this is a bug or done intentionally, but HTTP 302 redirects are not followed. It'd be great if they could be.

-edit-

I saw "// todo: error handling options, allow user to get !200 without exception" in HttpConnection, so maybe this is more of a feature request...

JSoup cannot parse IDs with underscores

Example:
Elements id = doc.select( "#An_ID_name" );

Error output:
Could not parse query #An_ID_name

Underscores are valid characters for IDs, but JSoup seems to choke on them. Regular IDs are working fine. There are other valid characters that I haven't tested, like dashes - these should all be accepted.
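To make the expectation concrete, here is a minimal sketch of an ID token matcher that accepts letters, digits, underscores and dashes. The class IdTokenSketch and method parseId are hypothetical and are not jsoup's selector parser. Colons are deliberately left out because in CSS they introduce a pseudo-class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdTokenSketch {
    // '#' followed by one or more letters, digits, underscores or dashes.
    private static final Pattern ID_QUERY = Pattern.compile("#([A-Za-z0-9_-]+)");

    // Returns the ID named by a simple "#id" query, or throws if the query is not of that form.
    static String parseId(String query) {
        Matcher m = ID_QUERY.matcher(query);
        if (!m.matches()) {
            throw new IllegalArgumentException("Could not parse query " + query);
        }
        return m.group(1);
    }
}
```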

Add #html() method to Elements

Add a collecting html() method to Elements, to align with text().

Also think about supporting Elements#html(String). Not sure we want to do this (effectively you'd use this to avoid getting a single element via first() and setting HTML on it). Still, it would support some use cases. If we do, we should also implement the prepend and wrap methods.

Add support for Element class manipulation

Add support for Element addClass, removeClass, toggleClass (hasClass, classNames exist, this adds convenience)

Also include in Elements: addClass / removeClass / toggleClass act on all elements, and hasClass returns true on the first match.
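The intended semantics can be sketched on a plain class attribute string. This is an illustration only: ClassAttrSketch and its methods are hypothetical helpers that work on strings, not on jsoup Element objects.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class ClassAttrSketch {
    // Splits a class attribute value into an ordered set of class names.
    static Set<String> classNames(String classAttr) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : classAttr.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Adds the class if it is not already present.
    static String addClass(String classAttr, String name) {
        Set<String> names = classNames(classAttr);
        names.add(name);
        return String.join(" ", names);
    }

    // Removes the class if present.
    static String removeClass(String classAttr, String name) {
        Set<String> names = classNames(classAttr);
        names.remove(name);
        return String.join(" ", names);
    }

    // Removes the class if present, otherwise adds it.
    static String toggleClass(String classAttr, String name) {
        Set<String> names = classNames(classAttr);
        if (!names.remove(name)) {
            names.add(name);
        }
        return String.join(" ", names);
    }
}
```

An Elements-level version would apply the same operation to each element in the collection.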

Wrong html parsing (probably) due to isEmptyElement

Look at this original HTML code
----HTML START---

        <p id="pivot">
            <span style="font-weight:bold;">
                <table width="1" align="left" class="foto-v-left">
                    <tr>
                        <td>
                            <img align="left" alt="x" title="y" border="0" width="140" height="180" src="http://foo.org/iPhoneApp1.jpg"></td>
                        </tr>
                        <tr>
                            <td>Txt1</td>
                        </tr>
                    </table>
        Txt2 - </span>
        Txt3 
        </p>
</body>
---HTML END---

Try to parse it! The toString() of the resulting org.jsoup.nodes.Document looks like:

---ToString start---

La nuova «app» per iPhone della Notte della Taranta
Txt1
Txt2 - Txt3

---ToString end---

As you can see, the documents differ in structure. For example, "Txt2" and "Txt3" are not children of the "p" element; they are children of a "div".

JSoup cannot CSS select IDs with a colon

If I try to use the following expression
doc.select ( "#" + pageId );
where pageId happens to be 'PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage' in one case I use, I get the following error:

Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage': unexpected token at ':PlugIn0_ManageMailStoreUserMultipleSelectionsPage'

I know this issue has come up with underscores and dashes, but I thought I would bring it to your attention that it happens with colons as well.

JSoup unable to extract text from paragraphs

I have the following test case for a CNN url: http://pastebin.com/yqZ1fbY1

If you look at the output, you'll see that it doesn't print most of the paragraphs. In fact, the second paragraph of the story is rendered as: http://pastebin.com/Hh8KyRwD

The expected output would be the text from the 2nd paragraph:
"We will continue to highlight the Democratic Party's role in strengthening it and the Republican Party's role in opposing it," etc.

Suggestion: operators at the start of a selector

In jQuery, when doing further DOM selection on an element (e.g. using find), you can use operators at the start of the query to filter based on the current element.

For example, this jQuery: $('table.data > tbody > tr').find('> td') will select td elements that are direct children of the rows found in the first query. It will not select td elements from any nested tables.

With JSoup, this would be something like:

Elements tableRows = doc.select( "table.data>tbody>tr" );
for ( Element tr : tableRows )
{
    // do something with tr here
    tr.select(">td");
}

I currently get this error: Could not parse query >td

"charset=latin-1" is not properly detected

From the mailing list: http://groups.google.com/group/jsoup/browse_thread/thread/09d8325e0e5a46c6#

I just downloaded jsoup 1.3.3 and gave it a try. It works great for
UTF-8 encoded websites, but dies for LATIN-1 encoded sites.
The site that caused the error below is:
http://www.macupdate.com
In the html source you'll find this line:

Here the full stacktrace:
Exception in thread "main"
java.nio.charset.UnsupportedCharsetException: LATIN-1
at java.nio.charset.Charset.forName(Charset.java:505)
at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:58)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:376)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:122)
at rgse.test.Main.main(Main.java:15)
System:
Mac OS X 10.5
Java 1.6
jsoup 1.3.3

Reason:
The general problem:
In DataUtil.java, line 56, the charset name is identified as
"LATIN-1". That name is handed to Charset.forName(). However,
"LATIN-1" does not seem to be recognized as valid character set alias
as defined in http://www.iana.org/assignments/character-sets
The correct character set alias for "LATIN-1" should be "latin1". I
wrote a small test program and the following line runs without problems:
Charset c = Charset.forName("latin1"); // WORKS
Charset c = Charset.forName("LATIN-1"); // FAILS
Solution:
Maybe somewhere in DataUtil.getCharsetFromContentType()? At least this
is where the character set is parsed and turned into all uppercase
(breaks for latin1).
Thanks!
Rico
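One defensive option is to ask the JVM whether it knows the declared charset before calling Charset.forName. The following is a minimal sketch, not jsoup's actual fix: the class CharsetFallbackSketch and the method resolve are hypothetical, and falling back to UTF-8 is an assumption made here for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetFallbackSketch {
    // Returns the named charset if the JVM supports it, otherwise UTF-8 (assumed default).
    static Charset resolve(String declaredName) {
        if (declaredName == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            String name = declaredName.trim();
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // The declared name is not even syntactically valid; fall through to the default.
        }
        return StandardCharsets.UTF_8;
    }
}
```

Here resolve("latin1") yields ISO-8859-1, and a name the JVM does not recognise no longer ends in an UnsupportedCharsetException.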

Problem with <td tag

Hello

Doing the following:

final Elements rows = doc.select("body > table > tr");
for ( Element row: rows ) {
final Element date = row.child(0); // select("td").first();
}

For the first <tr>, this returns <td class="company">..., so the real first child is ignored.
For the second <tr>, this returns <td>21-Feb...</td>, which is correct.

see comments beside tags in html below

This html:

< table cellspacing="0" cellpadding="0" border="0">



    <col width="12%">
    <col>
</colgroup>
<tbody>
    <tr>
        <th class="tl">Date Posted</th>
        <th>Details Preview</th>
        <th>Type</th>
        <th>Amount</th>
        <th class="tr">Location</th>
    </tr>
    <tr>    <!-- if inspect code then displayed as <tr td=""> and first child is <td class="company">...</td>    -->  
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>
    <tr class="alternate"> <!-- BUT in this row all ok. First child is <td>21-Feb...</td>-->
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>

</tbody>

toString NPE for orphans

I'm working on code that frequently calls 'remove' and then re-adds an element. While the element is in a detached state, toString throws something, so Eclipse prints only an 'invocation target exception.' It would be nice if this were not so.

Selector for data attributes in HTML5

Hi Jhy,

is it possible to consume data elements?

    <li class="user" data-name="John Resig" data-city="Boston"
      data-lang="js" data-food="Bacon">
      <b>John says:</b> <span>Hello, how are you?</span>
    </li>

Jsoup.parse(document).select("[data]"); doesn't work for me.

I really love jsoup, thanks for your awesome work.

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

However, I can't see how to attach a file to an issue here.

Issue with <tr>

When calling append to add a table row the resulting tr gets wrapped in a table even though I appended to an existing table.

Add option to output non-pretty-printed HTML

Add an option to output HTML that is formatted (spaces / newlines / indentation) as the original source, and not force pretty-printed.

Implement with a switch in Document, to force preserve whitespace on all nodes. Will require Nodes to have a direct accessor to their parent Document

IndexOutOfBoundsException in HttpConnection when the response has empty headers

I get this exception, because a response header is empty.

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.jsoup.helper.HttpConnection$Response.setupFromConnection(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:338)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:132)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:121)
at org.jsoup.Jsoup.parse(Jsoup.java:133)
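The trace points at an unconditional get(0) on a header's value list. A guard along the following lines would avoid it. This is a sketch under assumptions: EmptyHeaderSketch and firstValues are hypothetical names, and the input map mirrors the shape returned by URLConnection.getHeaderFields(), where the status line sits under a null key.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EmptyHeaderSketch {
    // Copies the first value of each header, skipping the status line and empty headers.
    static Map<String, String> firstValues(Map<String, List<String>> rawHeaders) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : rawHeaders.entrySet()) {
            String name = entry.getKey();
            List<String> values = entry.getValue();
            if (name == null || values == null || values.isEmpty()) {
                continue; // nothing to read, so never call get(0) here
            }
            headers.put(name, values.get(0));
        }
        return headers;
    }
}
```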

Normalise document after parse

Add a post-parse document normalisation phase.

Particularly, move text nodes that aren't in #body (ie in #root, #html, #head) into body.

Add a textNode#isWhitespace method to check if textnode should be moved.
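The whitespace check itself could be as small as the sketch below. WhitespaceSketch.isBlank is a hypothetical stand-in that works on the node's text rather than on a TextNode, and it assumes Character.isWhitespace is an acceptable definition of whitespace here.

```java
final class WhitespaceSketch {
    // True if the text is empty or consists only of whitespace characters.
    static boolean isBlank(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```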

Suggestion: new method Elements.parents()

A function similar to jQuery's parents() - http://api.jquery.com/parents/ - would be a nice addition. The function would return all parent elements of the current Element. Or, given an optional parameter, would filter based on that.

So if, for example, you selected all bold text with Elements elems = doc.select("b"), you could then find all bold tags that were in paragraphs with elems.parents("p"), and that would select the paragraphs themselves if you wanted to do some processing on them.

You could also add the optional selector to the parent() function too - although it is as easy in this case to simply select the parent and check if the tag or class etc matches.
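The suggested behaviour can be sketched on a toy tree. Everything here is hypothetical: ParentsSketch, its Node class and the tag-name filter stand in for jsoup's Element and selector matching, which this sketch does not use.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ParentsSketch {
    // Minimal stand-in for an element: a tag name and a link to its parent.
    static final class Node {
        final String tag;
        final Node parent;

        Node(String tag, Node parent) {
            this.tag = tag;
            this.parent = parent;
        }
    }

    // Collects the ancestors of all given nodes without duplicates.
    // A null tagFilter keeps every ancestor; otherwise only ancestors with that tag.
    static List<Node> parents(List<Node> nodes, String tagFilter) {
        Set<Node> found = new LinkedHashSet<>();
        for (Node node : nodes) {
            for (Node p = node.parent; p != null; p = p.parent) {
                if (tagFilter == null || tagFilter.equals(p.tag)) {
                    found.add(p);
                }
            }
        }
        return new ArrayList<>(found);
    }
}
```

Two bold tags inside the same paragraph yield that paragraph once, which matches the de-duplicating behaviour of jQuery's parents().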

Modify Elements#attr to get from first element with match

Currently, Elements#attr pulls the attribute from the first element. But Elements#hasAttr scans all of the elements in the collection to check if one has an attribute. So these do not align.

Modify Elements#attr to scan for the first Element that hasAttr, and return the value from that element.
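The proposed lookup amounts to a short scan. In this sketch each element is represented only by its attribute map. FirstAttrSketch is a hypothetical name, and returning an empty string when nothing matches is an assumption.

```java
import java.util.List;
import java.util.Map;

final class FirstAttrSketch {
    // Returns the value from the first element that has the attribute, or "" if none does.
    static String attr(List<Map<String, String>> elements, String key) {
        for (Map<String, String> attributes : elements) {
            if (attributes.containsKey(key)) {
                return attributes.get(key);
            }
        }
        return "";
    }
}
```

With this rule, attr(key) is non-empty whenever hasAttr(key) is true and the matching attribute has a value, so the two methods line up.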

uppercase umlauts get replaced by lowercase umlaut entities

The line

System.out.println(Jsoup.clean("<h1>Überschrift</h1>", Whitelist.none()));

should print

&Uuml;berschrift

but prints

&uuml;berschrift

This used to work correctly in v0.3.1, but fails in v1.2.3.

While baseArray in Entities.java distinguishes between lowercase and uppercase umlauts, the above call yields the wrong result.

lower cased html attributes

As I already stated in a previous post, we are using JSTL tags (java custom tags) and we require the attributes to be camel cased to match some methods in our java code. Is it possible to give an option to leave the attributes as they are and not modify them by making them lower case?

e.g. <abc:ourtag returnUrl="http://abc.com" /> should not be changed to <abc:ourtag returnurl="http://abc.com" />

Thanks!
