jhy / jsoup
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
Home Page: https://jsoup.org
License: MIT License
This HTML, when parsed and immediately printed out:
<html>
<body>
<script type="text/javascript">
var inside = true;
</script>
this should be outside.
</body>
</html>
Results:
<html>
<head>
</head>
<body>
<script type="text/javascript">
var inside = true;
this should be outside.
</script>
</body>
</html>
Note how "this should be outside" ends up inside the <script> tag, instead of following it. From what I can tell, this only happens to data-only tags.
Hi,
I'm using jsoup behind some wicket form processing. I really like it.
Maybe there could be a method such as Cleaner.isValid(String input, Whitelist list) that returns false as soon as the first tag would be removed.
Of course it could be coded manually, but I think that might be a nice feature.
What do you think?
The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.
You will find the page in "[email protected]:bimargulies/Misc.git" under the jsoup-tc directory.
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(String.java:686)
at java.util.regex.Matcher.appendReplacement(Matcher.java:711)
at org.jsoup.nodes.Entities.unescape(Entities.java:69)
at org.jsoup.nodes.TextNode.createFromEncoded(TextNode.java:95)
at org.jsoup.parser.Parser.parseTextNode(Parser.java:222)
at org.jsoup.parser.Parser.parse(Parser.java:94)
at org.jsoup.parser.Parser.parse(Parser.java:54)
at org.jsoup.Jsoup.parse(Jsoup.java:30)
Some badly behaved sites return gzipped responses regardless of the client's capabilities. It would be nice if Jsoup detected the presence of Content-Encoding: gzip in the response header and then used GZIPInputStream to read the response. Right now it does not do that.
Example of such a site: http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289
This issue is triggered by http://stackoverflow.com/questions/3406289
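A minimal sketch of the gzip handling requested above, using only the JDK. The class and method names (GzipResponseSketch, decodeBody, readBody, gzip) are hypothetical and are not jsoup's API; the body is compressed in memory so the example needs no network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of jsoup.
final class GzipResponseSketch {
    /** Wraps the body in a GZIPInputStream when the Content-Encoding value is "gzip". */
    static InputStream decodeBody(InputStream body, String contentEncoding) throws IOException {
        boolean gzipped = contentEncoding != null
                && contentEncoding.trim().equalsIgnoreCase("gzip");
        return gzipped ? new GZIPInputStream(body) : body;
    }

    /** Reads the (possibly compressed) body fully as UTF-8 text. */
    static String readBody(byte[] rawBody, String contentEncoding) {
        try (InputStream in = decodeBody(new ByteArrayInputStream(rawBody), contentEncoding)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: gzip-compresses text to stand in for a server response. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<p>hello</p>");
        System.out.println(readBody(compressed, "gzip"));
        System.out.println(readBody("<p>plain</p>".getBytes(StandardCharsets.UTF_8), null));
    }
}
```

With a real connection, the header value would come from something like HttpURLConnection#getContentEncoding() before the stream is handed to the parser.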
If I try to use the following expression
doc.select("#expandable-nav");
I get the following error:
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query #expandable-nav
I put a reward of $80 as well
http://nextsprocket.com/tasks/add-xpath-to-jsoup-java-html-parser-library
It would be good to be able to set the user agent on the fly for Jsoup.parse(url). Many sites block a java user_agent and return a 403.
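As a rough illustration of what setting the user agent involves at the JDK level, here is a sketch that sets the User-Agent request header on an HttpURLConnection before any request is sent. UserAgentSketch and open are hypothetical names, and the user agent string is only an example; this is not how Jsoup.parse(url) is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of jsoup.
final class UserAgentSketch {
    /** Opens (but does not connect) an HTTP connection with a custom User-Agent header. */
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Request properties must be set before the connection is established.
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = open("http://example.com/", "Mozilla/5.0 (sketch)");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```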
String html = "<link>http://www.google.com</link><link1>http://link1.com</link1>";
Document doc = Jsoup.parse(html);
String link = doc.select("link").first().text();
System.out.println("Link: " + link);
String link1 = doc.select("link1").first().text();
System.out.println("Link1: " + link1);
The result is :
Link:
Link1: http://link1.com
It seems the content of the "<link>" node is ignored.
Hi
I tried different nodes, but all I get is an empty String when calling .absUrl(.attr("href")). In contrast, when calling .attr("abs:href") I DO GET THE CORRECT absolute URL!?
So it seems something is wrong with the absUrl(...) function.
Cheers, Christian
Some html entities (such as sup1, sup2) are not unescaped correctly by Entities.unescape because they contain digits.
The problem is the pattern Entities.unescapePattern. I changed it to '&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?', and it worked fine for me. But there might be side effects ...
You can see my changes here : clementdenis@d65387c
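To make the proposed pattern concrete, here is a small stand-alone sketch that applies it with java.util.regex. The pattern string is the one quoted above; the class name, the three-entry entity table, and the unescape method are assumptions for illustration and do not reproduce Entities.unescape.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not jsoup's Entities class.
final class EntityPatternSketch {
    // Pattern from the report: named entities may now contain digits (sup1, sup2).
    static final Pattern UNESCAPE =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?");

    // Tiny illustrative table; a real one would hold the full entity list.
    static final Map<String, Integer> NAMED = Map.of("sup1", 0xB9, "sup2", 0xB2, "amp", 0x26);

    static String unescape(String input) {
        Matcher m = UNESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = -1;
            String digits = m.group(3);
            if (digits != null) {
                try {
                    // group(2) is the optional x/X marker for hexadecimal references.
                    codePoint = Integer.parseInt(digits, m.group(2) != null ? 16 : 10);
                } catch (NumberFormatException ignored) {
                    // e.g. "&#ff;": hex digits without the x prefix; leave the text untouched
                }
            } else {
                codePoint = NAMED.getOrDefault(m.group(1), -1);
            }
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(0); // unknown entity: keep the original text
            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("x&sup2; &amp; &#x41;&#66; &unknown;"));
    }
}
```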
See: http://groups.google.com/group/jsoup/browse_thread/thread/711fb6d0c4818ead?hl=en_US#
We should probably treat unknown tags as inline, rather than block tags. Otherwise an unknown tag within a <p> causes the auto-closer to close the P, so <p><custom>Test</custom></p> parses to <p></p><custom>Test</custom>.
Need to think about what impact that would have on unknown tags that should be blocks.
Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.
After parsing a large HTML document from the wild, unclosed <option> tags are not being automatically closed when a second <option> tag (or finishing </select> tag) is met.
Example:
Element node:
DetailsTurnsCRXP
... etc.
Then there is another element node containing the first <option> tag (value="title") and onward. Within that element node exists a single data node:
DetailsTurnsCRXP
... etc.
Nothing else follows.
Parsing <html><body><span class="foo">bar</span> creates <html><body><span class="foo">bar</span><head></head></html>: in the normalisation process, the head element is appended to the html element, instead of prepended.
Thanks to Patrick Smith @ ucsc.edu for reporting the issue.
Add support for direct DOM tree remove and replacement.
If I try to parse a tag with an equals sign (an empty attribute) but without any single or double quotes around an attribute value, then I get a StringIndexOutOfBoundsException. The stack trace is pasted below.
An example String would be "<a =a"
The following JUnit test case should not throw a StringIndexOutOfBoundsException:
import static org.junit.Assert.assertTrue;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.junit.Test;

public class BadAttributeTest {
    @Test
    public void aTagWithABadAttributeIsValid() throws Exception {
        // Parsing "<a =a" should not throw StringIndexOutOfBoundsException.
        assertTrue(Jsoup.isValid("<a =a", Whitelist.relaxed()));
    }
}
java.lang.StringIndexOutOfBoundsException: String index out of range: 13
at java.lang.String.charAt(String.java:686)
at org.jsoup.parser.TokenQueue.consume(TokenQueue.java:130)
at org.jsoup.parser.Parser.parseAttribute(Parser.java:207)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:142)
at org.jsoup.parser.Parser.parse(Parser.java:91)
at org.jsoup.parser.Parser.parseBodyFragment(Parser.java:64)
at org.jsoup.Jsoup.parseBodyFragment(Jsoup.java:99)
at org.jsoup.Jsoup.isValid(Jsoup.java:155)
Add support for the eq() jQuery pseudo-selector.
If you try to clean <img alt="" />, an IllegalArgumentException is thrown; there's an assertion in the Whitelist TypedValue that the value is not empty. Should be testing not null instead.
Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.
Not sure if this is a bug or done intentionally, but HTTP 302 redirects are not followed. It'd be great if they could be.
-edit-
I saw "// todo: error handling options, allow user to get !200 without exception" in HttpConnection, so maybe this is more of a feature request...
Example:
Elements id = doc.select( "#An_ID_name" );
Error output:
Could not parse query #An_ID_name
Underscores are valid characters for IDs, but JSoup seems to choke on them. Regular IDs are working fine. There are other valid characters that I haven't tested, like dashes - these should all be accepted.
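One way to picture the fix is a tokenizer step that accepts underscores and dashes while reading an ID. The sketch below is a stand-alone illustration under that assumption; IdTokenSketch, isIdChar and consumeId are hypothetical names and this is not jsoup's selector parser.

```java
// Hypothetical helper, not jsoup's selector parser.
final class IdTokenSketch {
    /** Characters accepted inside an ID token in this sketch. */
    static boolean isIdChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    /** Reads an ID token from the query, starting at the given index (just after '#'). */
    static String consumeId(String query, int start) {
        int end = start;
        while (end < query.length() && isIdChar(query.charAt(end))) {
            end++;
        }
        return query.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(consumeId("#An_ID_name", 1));
        System.out.println(consumeId("#expandable-nav > li", 1));
    }
}
```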
Add a collecting html() method to Elements, to align with text().
Also think about supporting Elements#html(String). Not sure we want to do this (effectively you'd use this to avoid getting a single element via first() and setting HTML on it). Still, it would support some use cases. If we do, we should also implement the prepend and wrap methods as well.
Throw an IO exception if the content-type of the HTTP response is not text/*
This is to prevent trying to parse PDFs for example.
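The check described above could look roughly like this. ContentTypeSketch, isText and requireText are hypothetical names, and the exception message is only an example.

```java
import java.io.IOException;
import java.util.Locale;

// Hypothetical helper, not part of jsoup.
final class ContentTypeSketch {
    /** True when the Content-Type header value is text/*, ignoring case and parameters. */
    static boolean isText(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    /** Throws an IOException for anything that is not text/*, e.g. application/pdf. */
    static void requireText(String contentType) throws IOException {
        if (!isText(contentType)) {
            throw new IOException("Unhandled content type \"" + contentType + "\"; expected text/*");
        }
    }

    public static void main(String[] args) throws IOException {
        requireText("text/html; charset=UTF-8"); // accepted
        System.out.println(isText("application/pdf"));
    }
}
```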
When cleaning HTML with relative links, the href attributes should be resolved to absolute links against the base URI, to confirm the protocol is allowed. At the moment relative links are dropped.
Simplify getter and setter on TextNodes.
Add support for Element addClass, removeClass, toggleClass (hasClass, classNames exist, this adds convenience)
Also include in Elements: addClass / removeClass / toggleClass act on all elements, and hasClass returns true on the first match.
Look at this original HTML code
----HTML START---
<p id="pivot">
<span style="font-weight:bold;">
<table width="1" align="left" class="foto-v-left">
<tr>
<td>
<img align="left" alt="x" title="y" border="0" width="140" height="180" src="http://foo.org/iPhoneApp1.jpg"></td>
</tr>
<tr>
<td>Txt1</td>
</tr>
</table>
Txt2 - </span>
Txt3
</p>
</body>
---ToString start---
---ToString end---
As you can see, the documents differ in structure. For example, "Txt2" and "Txt3" are not children of the "p" element; they are children of a "div".
Include bulk methods in Elements, too.
Return empty string if not a textarea or an element with a value attribute.
See also http://stackoverflow.com/questions/4194486
I think it's useful to let Document implement Cloneable and return a deep copy on Document#clone().
If I try to use the following expression
doc.select ( "#" + pageId );
where pageId happens to be 'PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage' in one case I use, I get the following error:
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage': unexpected token at ':PlugIn0_ManageMailStoreUserMultipleSelectionsPage'
I know this issue has come up with underscores and dashes, but I thought I would bring it to your attention that it happens with colons as well.
In version 1.3.3, the pseudo selector :not is not implemented.
I have the following test case for a CNN url: http://pastebin.com/yqZ1fbY1
If you look at the output, you'll see that it doesn't print most of the paragraphs. In fact, the second paragraph of the story is rendered as: http://pastebin.com/Hh8KyRwD
The expected output would be the text from the 2nd paragraph:
"We will continue to highlight the Democratic Party's role in strengthening it and the Republican Party's role in opposing it," etc..........
In jQuery, when doing further DOM selection on an element (e.g. using find), you can use operators at the start of the query to filter based on the current element.
For example, this jQuery: $('table.data > tbody > tr').find('> td') will select td elements that are direct children of the rows found in the first query. It will not select td elements from any nested tables.
With JSoup, this would be something like:
Elements tableRows = doc.select( "table.data>tbody>tr" );
for ( Element tr : tableRows )
{
// do something with tr here
tr.select(">td");
}
I currently get this error: Could not parse query >td
From the mailing list: http://groups.google.com/group/jsoup/browse_thread/thread/09d8325e0e5a46c6#
I just downloaded jsoup 1.3.3 and gave it a try. It works great for
UTF-8 encoded websites, but dies for LATIN-1 encoded sites.
The site that caused the error below is:
http://www.macupdate.com
In the html source you'll find this line:
Here the full stacktrace:
Exception in thread "main"
java.nio.charset.UnsupportedCharsetException: LATIN-1
at java.nio.charset.Charset.forName(Charset.java:505)
at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:58)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:376)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:122)
at rgse.test.Main.main(Main.java:15)
System:
Mac OS X 10.5
Java 1.6
jsoup 1.3.3
Reason:
The general problem:
In DataUtil.java, line 56, the charset name is identified as
"LATIN-1". That name is handed to Charset.forName(). However,
"LATIN-1" does not seem to be recognized as valid character set alias
as defined in http://www.iana.org/assignments/character-sets
The correct character set alias for "LATIN-1" should be "latin1". I
wrote a small test program and the following line runs without problems:
Charset c = Charset.forName("latin1"); // WORKS
Charset c = Charset.forName("LATIN-1"); // FAILS
Solution:
Maybe somewhere in DataUtil.getCharsetFromContentType()? At least this
is where the character set is parsed and turned into all uppercase
(breaks for latin1).
Thanks!
Rico
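One defensive option, sketched here with the JDK only, is to fall back to a default charset when the declared name is not recognised instead of letting UnsupportedCharsetException propagate. CharsetLookupSketch and charsetOrDefault are hypothetical names, and the choice of fallback is an assumption for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not jsoup's DataUtil.
final class CharsetLookupSketch {
    /** Resolves a declared charset name, returning the fallback when it is missing or unknown. */
    static Charset charsetOrDefault(String declaredName, Charset fallback) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(declaredName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Names such as "LATIN-1" are not registered aliases; use the fallback instead.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrDefault("latin1", StandardCharsets.UTF_8));
        System.out.println(charsetOrDefault("LATIN-1", StandardCharsets.UTF_8));
    }
}
```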
A problem with parsing tags such as <abc:door />. Currently, JSoup parses this and splits the tag into- <abc :door=""></abc> which is invalid and makes the code useless.
Hello
Doing the following:
final Elements rows = doc.select("body > table > tr");
for ( Element row: rows ) {
final Element date = row.child(0); // select("td").first();
}
For the first <tr>, this returns <td class="company">...; the first child is ignored.
For the second <tr>, this returns <td>21-Feb...</td>, which is correct.
See the comments beside the tags in the HTML below.
The HTML:
< table cellspacing="0" cellpadding="0" border="0">
<col width="12%">
<col>
</colgroup>
<tbody>
<tr>
<th class="tl">Date Posted</th>
<th>Details Preview</th>
<th>Type</th>
<th>Amount</th>
<th class="tr">Location</th>
</tr>
<tr> <!-- if inspect code then displayed as <tr td=""> and first child is <td class="company">...</td> -->
<td>21-Feb-2010 10:44</td>
<td class="company">
<h2>
1.
<a id="AdvertTitleForRow1"
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="TITLE"
>Title title title</a>
</h2>
<p>
vText Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text .....
<a
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="MORE_ADVERT_INFO"
>More</a>
</p>
<p>Advertiser : AAAA Services</p>
</td>
<td>Contract</td>
<td class="viewItem">
United Kingdom,City of London
<div class="view_advert_link">
<a id="view_advert_link_7801464"
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="view_advert_link"
>View </a>
</div>
</td>
</tr>
<tr class="alternate"> <!-- BUT in this row all ok. First child is <td>21-Feb...</td>-->
<td>21-Feb-2010 10:44</td>
<td class="company">
<h2>
1.
<a id="AdvertTitleForRow1"
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="TITLE"
>Title title title</a>
</h2>
<p>
vText Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text .....
<a
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="MORE_ADVERT_INFO"
>More</a>
</p>
<p>Advertiser : AAAA Services</p>
</td>
<td>Contract</td>
<td class="viewItem">
United Kingdom,City of London
<div class="view_advert_link">
<a id="view_advert_link_7801464"
href="/7801464/en/?source=Search&SearchTerms=&LocationSearchTerms=&DatePostedFilter=2&Page=1&OrderBy=0&CountryId=0&nocache=1266753038"
name="view_advert_link"
>View </a>
</div>
</td>
</tr>
</tbody>
I'm working on code that frequently calls 'remove' and then re-adds an element. While the element is in a detached state, toString throws something, so Eclipse prints only an 'invocation target exception.' It would be nice if this were not so.
Review the HTML5 spec; span and header tags (and what else?) can be block elements and not phrasing content, so the tag spec needs to be updated.
Hi Jhy,
is it possible to consume data elements?
<li class="user" data-name="John Resig" data-city="Boston"
data-lang="js" data-food="Bacon">
<b>John says:</b> <span>Hello, how are you?</span>
</li>
Jsoup.parse(document).select("[data]");
doesn't work for me.
I really love jsoup, thanks for your awesome work.
Document doc = Jsoup.parse(new URL("http://www.oschina.net/bbs/thread/12975"), 5*1000);
Elements es = doc.select("a[href]");
for (Iterator<Element> it = es.iterator(); it.hasNext();) {
Element e = it.next();
System.out.println(e.absUrl("href"));
}
attr("abs:href") ------ <a href="?p=1">1</a>
result: ------------------- http://www.oschina.net/bbs/thread/?p=1
I think this result is wrong.
The correct result should be "http://www.oschina.net/bbs/thread/12975?p=1"
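A possible workaround, sketched with java.net.URI: when the relative reference consists only of a query string, prepend the base path before resolving, so the last path segment is kept. AbsUrlSketch and resolve are hypothetical names; this illustrates the idea and is not jsoup's absUrl implementation. Resolvers that follow the older RFC 2396 merge rules may drop the last segment for such references, which would match the result reported above.

```java
import java.net.URI;

// Hypothetical helper, not jsoup's absUrl.
final class AbsUrlSketch {
    /** Resolves a relative reference against a base URI, keeping the base path for "?query" references. */
    static String resolve(String baseUri, String relative) {
        URI base = URI.create(baseUri);
        if (relative.startsWith("?") && base.getRawPath() != null) {
            // Turn "?p=1" into "/bbs/thread/12975?p=1" so the full base path survives.
            relative = base.getRawPath() + relative;
        }
        return base.resolve(relative).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.oschina.net/bbs/thread/12975", "?p=1"));
    }
}
```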
The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.
However, I can't see how to attach a file to an issue here.
When calling append to add a table row the resulting tr gets wrapped in a table even though I appended to an existing table.
Add an option to output HTML that is formatted (spaces / newlines / indentation) as the original source, and not force pretty-printed.
Implement with a switch in Document, to force preserve whitespace on all nodes. Will require Nodes to have a direct accessor to their parent Document
I get this exception, because a response header is empty.
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.jsoup.helper.HttpConnection$Response.setupFromConnection(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:338)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:132)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:121)
at org.jsoup.Jsoup.parse(Jsoup.java:133)
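A guard along these lines would avoid calling get(0) on an empty value list. The sketch works on the Map<String, List<String>> shape returned by HttpURLConnection#getHeaderFields(), where the status line is stored under a null key; HeaderSketch and firstValues are hypothetical names, not jsoup's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not jsoup's HttpConnection.
final class HeaderSketch {
    /** Keeps the first value of each header, skipping the status line and headers with no values. */
    static Map<String, String> firstValues(Map<String, List<String>> headerFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : headerFields.entrySet()) {
            String name = entry.getKey();
            List<String> values = entry.getValue();
            if (name == null) {
                continue; // status line, e.g. "HTTP/1.1 200 OK"
            }
            if (values == null || values.isEmpty()) {
                continue; // empty header: nothing to read, so do not call get(0)
            }
            result.put(name, values.get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> raw = new HashMap<>();
        raw.put(null, List.of("HTTP/1.1 200 OK"));
        raw.put("X-Empty", new ArrayList<>());
        raw.put("Content-Type", List.of("text/html"));
        System.out.println(firstValues(raw));
    }
}
```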
Add a post-parse document normalisation phase.
Particularly, move text nodes that aren't in #body (ie in #root, #html, #head) into body.
Add a textNode#isWhitespace method to check if textnode should be moved.
Code:
String html = "foo <b>bar</b> baz";
String text = Jsoup.parse(html).text();
System.out.println(text);
Result:
bar baz foo
Expected:
foo bar baz
A function similar to jQuery's parents() - http://api.jquery.com/parents/ - would be a nice addition. The function would return all parent elements of the current Element. Or, given an optional parameter, would filter based on that.
So if you, for example, selected all bold text with Elements elems = doc.select("b"), you could then find all bold tags that were in paragraphs with elems.parents("p"), and that would select the paragraphs themselves if you wanted to do some processing on them.
You could also add the optional selector to the parent() function too - although it is as easy in this case to simply select the parent and check if the tag or class etc matches.
When looking at: http://money.cnn.com/2010/10/25/news/companies/motley_crue_bp.fortune/index.htm?section=money_latest the umlaut character is being picked up in the paragraph text correctly as htmlentities but if you grab the title it's showing up as invalid unicode characters.
doc.getElementsByTag("title").first().text();
Currently, Elements#attr pulls the attribute from the first element. But Elements#hasAttr scans all of the elements in the collection to check if one has an attribute. So these do not align.
Modify Elements#attr to scan for the first Element that hasAttr, and return the value from that element.
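The proposed behaviour can be sketched with plain collections standing in for Elements: each element is modelled as a map of attribute names to values. FirstAttrSketch is a hypothetical name, and returning an empty string when no element has the attribute is an assumption made for this illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: each Map models one element's attributes.
final class FirstAttrSketch {
    /** True if any element in the collection has the attribute. */
    static boolean hasAttr(List<Map<String, String>> elements, String key) {
        for (Map<String, String> attributes : elements) {
            if (attributes.containsKey(key)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the value from the first element that has the attribute, or "" if none does. */
    static String attr(List<Map<String, String>> elements, String key) {
        for (Map<String, String> attributes : elements) {
            if (attributes.containsKey(key)) {
                return attributes.get(key);
            }
        }
        return "";
    }

    public static void main(String[] args) {
        List<Map<String, String>> elements =
                List.of(Map.of("class", "nav"), Map.of("href", "/about"));
        System.out.println(hasAttr(elements, "href") + " " + attr(elements, "href"));
    }
}
```

With this shape, attr returns a non-default value exactly when hasAttr is true for a present, non-empty attribute, which is the alignment the issue asks for.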
The line
System.out.println(Jsoup.clean("<h1>Überschrift</h1>", Whitelist.none()));
should print
Überschrift
but prints
überschrift
This used to work correctly in v0.3.1, but fails in v1.2.3.
While baseArray in Entities.java distinguishes between lowercase and uppercase umlauts, the above call yields the wrong result.
As I already stated in a previous post, we are using JSTL tags (java custom tags) and we require the attributes to be camel cased to match some methods in our java code. Is it possible to give an option to leave the attributes as they are and not modify them by making them lower case?
e.g. <abc:ourtag returnUrl="http://abc.com" /> does not change to <abc:ourtag returnurl="http://abc.com" />
Thanks!