NAME
    XHTML::Util - (alpha software) powerful utilities for common but
    difficult to nail HTML munging.

  VERSION
    0.99_08

SYNOPSIS
     use strict;
     use warnings;
     use XHTML::Util;
     my $xu = XHTML::Util
        ->new(\"This is naked\n\ntext for making into paragraphs.");
     print $xu->enpara, $/;
 
     # <p>This is naked</p>
     #
     # <p>text for making into paragraphs.</p>

     $xu = XHTML::Util
         ->new(\"<blockquote>Quotes should probably have paras.</blockquote>");
     print $xu->enpara("blockquote");
 
     # <blockquote>
     #   <p>Quotes should probably have paras.</p>
     # </blockquote>

     $xu = XHTML::Util
         ->new(\'<i><a href="#"><b>Something</b></a>.</i>');
 
     print $xu->strip_tags('a');
     # <i><b>Something</b>.</i>

DESCRIPTION
    You can pass CSS selector expressions to most of the methods. E.g., to
    enpara only the contents of div tags with a class of "enpara" --
    <div class="enpara"/> -- you could do this-

     print $xu->enpara("div.enpara");

    To do the contents of all blockquotes and divs-

     print $xu->enpara("div, blockquote");

    Alterations to the XHTML in the object are persistent.

     my $xu = XHTML::Util
         ->new(\'<script>alert("OH HAI")</script>');
     $xu->strip_tags('script');

    This removes the script tags (though not the script content), so the
    next time you call anything that returns the stringified object, the
    changes will remain-

     print $xu->as_string, $/;
     # alert("OH HAI")

    Well... really you'll get "<![CDATA[alert(&quot;OH HAI&quot;)]]>".

METHODS
  new
    Creates a new "XHTML::Util" object.

  strip_tags
    Why you might need this-

     my $post_title = q{I <3 <a href="http://icanhascheezburger.com/">kittehs</a>};
     my $blog_link = some_link_maker($post_title);
     print $blog_link;

     <a href="/oh-noes">I <3 <a href="http://icanhascheezburger.com/">kittehs</a></a>

    That isn't legal, so there's no definition for what browsers should do
    with it. Some more or less tolerate it, some don't. It's never going to
    be a good user experience.

    What you can do is something like this–

     my $post_title = q{I <3 <a href="http://icanhascheezburger.com/">kittehs</a>};
     my $safe_title = $xu->strip_tags($post_title, ["a"]);
     # Menu link should only go to the single post page.
     my $menu_view_title = some_link_maker($safe_title);
     # No need to link back to the page you're viewing already.
     my $single_view_title = $post_title;

  remove
    Takes a CSS selector string. Completely removes the matched nodes,
    including their content. This differs from "strip_tags" which retains
    the child nodes intact and only removes the tags proper.

     # Remove <center/> tags and external images.
     my $cleaned = $xu->remove("center, img[src^='http']");
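
    The difference between stripping a tag and removing a node can be
    sketched directly against XML::LibXML. This is only an illustration of
    the two behaviors, not this module's code: the helper names
    "unwrap_nodes" and "remove_nodes" are hypothetical, and XPath is used
    in place of CSS selectors to keep the sketch short.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: replace each matched element with its own children,
# so the tag disappears but its content stays in place.
sub unwrap_nodes {
    my ($doc, $xpath) = @_;
    for my $node ($doc->findnodes($xpath)) {
        my $parent = $node->parentNode;
        $parent->insertBefore($_, $node) for $node->childNodes;
        $parent->removeChild($node);
    }
}

# Hypothetical helper: drop each matched element together with its content.
sub remove_nodes {
    my ($doc, $xpath) = @_;
    $_->unbindNode for $doc->findnodes($xpath);
}

my $markup = '<i><a href="#"><b>Something</b></a>.</i>';

my $stripped = XML::LibXML->load_xml(string => $markup);
unwrap_nodes($stripped, '//a');
my $stripped_out = $stripped->documentElement->toString;
print $stripped_out, "\n";
# <i><b>Something</b>.</i>

my $removed = XML::LibXML->load_xml(string => $markup);
remove_nodes($removed, '//a');
my $removed_out = $removed->documentElement->toString;
print $removed_out, "\n";
# <i>.</i>
```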

  traverse
    Walks the given nodes and executes the given callback. Can be called
    with a selector or without. If called with a selector, the callback sub
    receives the selected nodes as its arguments.

     $xu->traverse("div.fancy", sub { my $div_node = shift });

    Without a selector it receives the document root.

     $xu->traverse(sub { my $root = shift });

  translate_tags
    [Not implemented.] Translates one tag to another.

  remove_style
    [Not implemented.] Removes styles from matched nodes. To remove all
    style from a fragment-

     $xu->remove_style("*");

    (Should also remove style sheets, yes?)

  inline_stylesheets
    [Not implemented.] Moves all linked stylesheet information into inline
    style attributes. This is useful, for example, when distributing a
    document fragment like an RSS/Atom feed and having it match its online
    appearance.

  sanitize
    [Not implemented.] Upgrades old or broken HTML to valid XHTML.

  fix
    [Partially implemented.] Attempts to make many known problems go away.
    E.g., entity escaping, missing alt attributes of images, etc.

  validate
    Validates a given document or fragment (which is actually contained in a
    full document) against a DTD provided by name or, if none is provided,
    it will validate against xhtml1-transitional. Uses XML::LibXML's
    validate under the covers.

  is_valid
    A non-fatal version of "validate". Returns true on success, false on
    failure.
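
    The relationship between a fatal validator and its non-fatal
    counterpart is the usual Perl "eval" idiom. The sketch below shows
    that idiom only; "validate_or_die" is a hypothetical stand-in with a
    toy rule, not the DTD validation this module performs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a validator that dies on failure.
# The rule (require a <title>) is a toy, chosen only for the demonstration.
sub validate_or_die {
    my ($markup) = @_;
    die "Missing <title>\n" unless $markup =~ /<title>/;
    return 1;
}

# Non-fatal wrapper: trap the exception and return a plain boolean.
sub is_valid_sketch {
    my ($markup) = @_;
    return eval { validate_or_die($markup); 1 } ? 1 : 0;
}

print is_valid_sketch('<html><head><title>Hi</title></head></html>')
    ? "valid\n" : "invalid\n";
# valid
print is_valid_sketch('<p>No title here</p>')
    ? "valid\n" : "invalid\n";
# invalid
```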

  enpara
    Adds paragraph markup to naked text. There are many, many
    implementations of this basic idea out there as well as many like
    Markdown which do much more. While some are decent, none is really meant
    to sling arbitrary HTML and get DWIM behavior from places where it's
    left out; every implementation I've seen either has rigid syntax or has
    beaucoup failure prone edge cases. Consider these-

     Is this a paragraph
     or two?

     <p>I can do HTML when I'm paying attention.</p>
     <p style="color:#a00">Or I need to for some reason.</p>
     Oh, I stopped paying attention... What happens here? Or <i>here</i>?

     I'd like to see this in a paragraph so it's legal markup.
     <pre>
     now
     this
     should


     not be touched!
     </pre>I meant to do that.

    With "XHTML::Util->enpara" you will get-

     <p>Is this a paragraph<br/>
     or two?</p>

     <p>I can do HTML when I'm paying attention.</p>
     <p style="color:#a00">Or I need to for some reason.</p>
     <p>Oh, I stopped paying attention... What happens here? Or <i>here</i>?</p>

     <p>I'd like to see this in a paragraph so it's legal markup.</p>
     <pre>
     now
     this
     should
 
 
     not be touched!
     </pre>
     <p>I meant to do that.</p>
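
    For contrast, here is a sketch of the naive approach that the basic
    idea usually starts from: blank lines separate paragraphs and single
    newlines become line breaks. It is not how "enpara" works;
    "naive_enpara" is a hypothetical name, and the sketch knows nothing
    about existing markup.

```perl
use strict;
use warnings;

# Naive baseline: blank lines separate paragraphs, single newlines
# become <br/>. Existing tags in the input are not recognized at all.
sub naive_enpara {
    my ($text) = @_;
    my @paras = grep { /\S/ } split /\n{2,}/, $text;
    for my $para (@paras) {
        $para =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace
        $para =~ s{\n}{<br/>\n}g;     # keep single line breaks
        $para = "<p>$para</p>";
    }
    return join "\n\n", @paras;
}

print naive_enpara("Is this a paragraph\nor two?\n\nA second block."), "\n";
# <p>Is this a paragraph<br/>
# or two?</p>
#
# <p>A second block.</p>
```

    Given the pre example above, this version would split the pre block
    at its blank lines and wrap the pieces in paragraphs, which is the
    kind of edge case a parser-based approach is meant to avoid.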

  parser
    The XML::LibXML parser object used to parse (X)HTML.

  doc
    The XML::LibXML::Document object created from input.

  root
    The documentElement of the XML::LibXML::Document object.

  text
    The "textContent" of the root node.

  head
    The head element.

  body
    The body element.

    Note there is always an implicit head and body, even with fragments,
    because libxml creates them at our request.

  as_fragment
    Returns the original (intent-wise) fragment or the elements within the
    body if starting with a full document.

  as_string
    Stringified version of object. If the object was created from an HTML
    fragment, a fragment will be returned.

  debug
    Debug level from 1 to 5; higher values send more information to STDERR.

  is_document
    Returns true if the originally parsed item was a full HTML document.

  is_fragment
    Returns true if the originally parsed item was a fragment.

  clone
  same_same
    Takes another XHTML::Util object or the valid argument to create one.
    Attempts to determine if the resulting object is the same as the calling
    object. E.g.,

     print $xu->same_same(\"<p>OH HAI</p>") ?
         "Yepper!\n" : "Noes...\n";

  tags
    Returns a list of all known HTML tags. Please ignore this method. I'm
    not sure it's a good idea, well named, or will remain.

  selector_to_xpath
    This wraps "selector_to_xpath" in HTML::Selector::XPath. Not really
    meant to be used but exposed in case you want it.

     print $xu->selector_to_xpath("form[name='register'] input[type='password']");
     # //form[@name='register']//input[@type='password']

TO DO
    I think the default doc should be \"". There is no reason to jump
    through that hoop if wanting to build up something from scratch.

    Finish spec and tests. Get it running solid enough to remove alpha
    label. Generalize the argument handling. Provide optional setting or
    methods for returning nodes instead of serialized content. Improve
    document/head related handling/options.

    I can see this being easier to use functionally. I haven't decided on
    the argspec or method-->sub approach for that yet. I think it's a good
    idea.

BUGS AND LIMITATIONS
    All input should be UTF-8 or at least safe to run decode_utf8 on.
    Regular Latin character sets, I suspect, will be fine but extended sets
    will probably give garbage or unpredictable results; guessing.

    This will wreck XML and probably XHTML with a custom DTD too. It uses
    HTML::Tagset's conception of what valid tags are. This is not optimal
    but it is easier than DTD handling. It might improve to more automatic
    detection in the future.

    I have used many of these methods and snippets in many projects and I'm
    tired of recycling them. Some are extremely useful and, at least in the
    case of "enpara", better than any other implementation I've been able to
    find in any language.

    That said, a lot of the code herein is not well tested or at least not
    well tested in this incarnation. Bug reports and good feedback are
    adored.

SEE ALSO
    XML::LibXML, HTML::Tagset, HTML::Entities, HTML::Selector::XPath,
    HTML::TokeParser::Simple, CSS::Tiny.

    CSS W3Schools, <http://www.w3schools.com/Css/default.asp>, Learning CSS
    at W3C, <http://www.w3.org/Style/CSS/learning>.

REPOSITORY
    git://github.com/pangyre/p5-xhtml-util

AUTHOR
    Ashley Pond V, ashley at cpan.org.

COPYRIGHT & LICENSE
    Copyright (©) 2006-2009.

    This program is free software; you can redistribute it or modify it or
    both under the same terms as Perl itself.

DISCLAIMER OF WARRANTY
    Because this software is licensed free of charge, there is no warranty
    for the software, to the extent permitted by applicable law. Except when
    otherwise stated in writing the copyright holders or other parties
    provide the software "as is" without warranty of any kind, either
    expressed or implied, including, but not limited to, the implied
    warranties of merchantability and fitness for a particular purpose. The
    entire risk as to the quality and performance of the software is with
    you. Should the software prove defective, you assume the cost of all
    necessary servicing, repair, or correction.

    In no event unless required by applicable law or agreed to in writing
    will any copyright holder, or any other party who may modify and/or
    redistribute the software as permitted by the above licence, be liable
    to you for damages, including any general, special, incidental, or
    consequential damages arising out of the use or inability to use the
    software (including but not limited to loss of data or data being
    rendered inaccurate or losses sustained by you or third parties or a
    failure of the software to operate with any other software), even if
    such holder or other party has been advised of the possibility of such
    damages.
