gitpan / html-astext-fix
Read-only release history for HTML-AsText-Fix
Home Page: http://metacpan.org/release/HTML-AsText-Fix
License: Other
NAME
    HTML::AsText::Fix - extends HTML::Element::as_text() to render text
    properly

VERSION
    version 0.003

SYNOPSIS
        use HTML::AsText::Fix;
        use HTML::TreeBuilder::XPath;

        # fix individual objects
        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
        my $guard = HTML::AsText::Fix::object($tree);

        # fix deeply nested objects
        use URI;
        use Web::Scraper;

        # First, create your scraper block
        my $tweets = scraper {
            process "li.status", "tweets[]" => scraper {
                process ".entry-content", body => 'TEXT';
                process ".entry-date", when => 'TEXT';
                process 'a[rel="bookmark"]', link => '@href';
            };
        };

        my $res;
        {
            # the fix stays active only while $guard is alive
            my $guard = HTML::AsText::Fix::global();
            $res = $tweets->scrape( URI->new("http://twitter.com/creaktive") );
        }

DESCRIPTION
    Consider the following HTML sample:

        <p>
            <span>AAA</span>
            BBB
        </p>
        <h2>CCC</h2>
        DDD
        <br>
        EEE

    The HTML::Element::as_text() method stringifies it as
    AAABBBCCCDDDEEE. Although technically correct, this is far from how
    a "real" browser renders it. links(1), lynx(1) and w3m(1) break
    lines this way:

        AAABBB
        CCC
        DDD
        EEE

    This module tries to implement the same behavior in the "as_text"
    method of HTML::Element. By default, the value of $/ is inserted in
    place of line breaks, and "\x{200b}" (Unicode zero-width space)
    separates text from adjacent inline elements.

  Distinction between block/inline nodes
    "span", for instance, is an inline node:

        <p><span>A</span>pple</p>

    In that case, there really shouldn't be a space between "A" and
    "pple". To handle inline nodes properly, only block nodes are
    separated by a line break. The following nodes are currently assumed
    to be blocks:

    *   p

    *   h1 h2 h3 h4 h5 h6

    *   dl dt dd

    *   ol ul li

    *   dir

    *   address

    *   blockquote

    *   center

    *   del

    *   div

    *   hr

    *   ins

    *   noscript script

    *   pre

    *   br (just to make sense)

    (source: http://en.wikipedia.org/wiki/HTML_element#Block_elements)

FUNCTIONS
  as_text
    The replacement function. Not to be used separately. It is injected
    into HTML::Element.

  global
    Hook into every HTML::Element within the lexical scope. Returns a
    guard object; destroying it unhooks safely.

    Accepts the following options:

    *   lf_char: character inserted between block nodes (by default,
        $/);

    *   zwsp_char: character inserted between inline nodes (by default,
        "\x{200b}", Unicode zero-width space);

    *   trim: trim leading/trailing spaces (considers "\x{A0}" as
        space!);

    *   extra_chars: extra characters to trim;

    *   skip_dels: if true, text content under "del" nodes is not
        included in the returned text.

    For example, to completely get rid of the separation between inline
    nodes:

        my $guard = HTML::AsText::Fix::global(zwsp_char => '');

  object
    Hook a single object instance. Accepts the same options as "global":

        my $guard = HTML::AsText::Fix::object($tree, zwsp_char => '');

SEE ALSO
    *   HTML::Element

    *   HTML::Tree

    *   HTML::FormatText

    *   Monkey::Patch

ACKNOWLEDGEMENTS
    *   Αριστοτέλης Παγκαλτζής <https://metacpan.org/author/ARISTOTLE>

    *   Toby Inkster <https://metacpan.org/author/TOBYINK>

AUTHOR
    Stanislaw Pusep <[email protected]>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2014 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it
    under the same terms as the Perl 5 programming language system
    itself.
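
As a rough illustration of the block/inline distinction described under DESCRIPTION, the following sketch walks a toy tree and separates only block nodes with line breaks. It is not the module's implementation: the nested-array tree format and the toy_as_text and tidy helpers are hypothetical names invented for this example, it does not use HTML::Element, and it leaves out the zero-width space handling. Only the list of block tags is taken from this document.

```perl
use strict;
use warnings;

# Tags treated as blocks, copied from the list in this README.
my %IS_BLOCK = map { $_ => 1 } qw(
    p h1 h2 h3 h4 h5 h6 dl dt dd ol ul li dir address
    blockquote center del div hr ins noscript script pre br
);

# Hypothetical toy tree: a node is [ $tag, @children ]; a child is
# either another node or a plain string. This is NOT HTML::Element.
sub toy_as_text {
    my ($node, $lf) = @_;
    return $node unless ref $node;    # plain text leaf
    my ($tag, @children) = @$node;
    my $text = join '', map { toy_as_text($_, $lf) } @children;
    # Only block nodes are wrapped in line breaks; inline text is glued.
    return $IS_BLOCK{$tag} ? "$lf$text$lf" : $text;
}

# Collapse runs of separators into one and strip them from both ends,
# so adjacent or nested blocks produce a single line break.
sub tidy {
    my ($text, $lf) = @_;
    my $q = quotemeta $lf;
    $text =~ s/(?:$q)+/$lf/g;
    $text =~ s/\A(?:$q)+//;
    $text =~ s/(?:$q)+\z//;
    return $text;
}

# The HTML sample from DESCRIPTION, written as a toy tree.
my $tree = [ body =>
    [ p  => [ span => 'AAA' ], 'BBB' ],
    [ h2 => 'CCC' ],
    'DDD',
    [ 'br' ],
    'EEE',
];

# Prints four lines: AAABBB, CCC, DDD, EEE
print tidy(toy_as_text($tree, "\n"), "\n"), "\n";
```

Feeding the same helpers the tree for <p><span>A</span>pple</p> yields "Apple" with no break, because "span" is not in the block list.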