
html-truncator's Introduction

HTML Truncator

Want to truncate an HTML string properly? This gem is for you. It's powered by Nokogiri!

How to use it

It's very simple. Install it with rubygems:

gem install html_truncator

Or, if you use bundler, add it to your Gemfile:

gem "html_truncator", "~>0.2"

Then you can use it in your code:

require "html_truncator"
HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 3)
# => "<p>Lorem ipsum dolor…</p>"

The HTML_Truncator class has only one method, truncate, with 3 arguments:

  • the HTML-formatted string to truncate
  • the number of words to keep (real words; tags and attributes aren't counted)
  • some options like the ellipsis (optional, '…' by default).

And 3 attributes:

  • ellipsable_tags, which lists the tags that can contain the ellipsis (by default: p ol ul li div header article nav section footer aside dd dt dl)
  • self_closing_tags, with the tags to keep when empty (by default: br hr img param embed)
  • punctuation_chars, with the punctuation characters to remove before the ellipsis (by default: , . : ; ! ?).

Examples

A simple example:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 3)
# => "<p>Lorem ipsum dolor…</p>"

If the text is too short to be truncated, it won't be modified:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 5)
# => "<p>Lorem ipsum dolor sit amet.</p>"
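The word-counting rule can be pictured on plain text. This is only a sketch, not the gem's implementation: `truncate_words` is a hypothetical helper that ignores HTML entirely and assumes a word is a run of non-whitespace characters.

```ruby
# Hypothetical helper: keeps the first max_words words of a plain-text string.
def truncate_words(text, max_words, ellipsis = "…")
  words = text.scan(/\s*\S+/)                # each word keeps its leading whitespace
  return text if words.length <= max_words   # short enough: returned untouched
  words.first(max_words).join + ellipsis
end

truncate_words("Lorem ipsum dolor sit amet.", 3) # => "Lorem ipsum dolor…"
truncate_words("Lorem ipsum dolor sit amet.", 5) # => "Lorem ipsum dolor sit amet."
```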

If you prefer, you can have the length in characters instead of words:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 12, :length_in_chars => true)
# => "<p>Lorem ipsum…</p>"

It doesn't cut inside a word but goes back to the immediately preceding word boundary:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 10, :length_in_chars => true)
# => "<p>Lorem…</p>"
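The step back to a word boundary can be sketched on plain text as follows. `truncate_chars` is a hypothetical helper written for illustration; it does not parse HTML and is not taken from the gem's source.

```ruby
# Hypothetical helper for plain text (no HTML handling).
def truncate_chars(text, max_chars, ellipsis = "…")
  return text if text.length <= max_chars

  cut = text[0, max_chars]
  # If the character right after the cut is not whitespace, the cut landed
  # inside a word: drop the partial word by going back to the last whitespace.
  unless text[max_chars] =~ /\s/
    boundary = cut.rindex(/\s/)
    cut = boundary ? cut[0...boundary] : ""
  end
  cut.rstrip + ellipsis
end

truncate_chars("Lorem ipsum dolor sit amet.", 12) # => "Lorem ipsum…"
truncate_chars("Lorem ipsum dolor sit amet.", 10) # => "Lorem…"
```

In this sketch, a first word longer than the limit leaves nothing but the ellipsis; how the gem handles that case is not stated here.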

You can customize the ellipsis:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 3, :ellipsis => " (truncated)")
# => "<p>Lorem ipsum dolor (truncated)</p>"

And even have HTML in the ellipsis:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 3, :ellipsis => '<a href="/more-to-read">...</a>')
# => "<p>Lorem ipsum dolor<a href="/more-to-read">...</a></p>"

The ellipsis is put at the right place, inside <p>, but not <i>:

HTML_Truncator.truncate("<p><i>Lorem ipsum dolor sit amet.</i></p>", 3)
# => "<p><i>Lorem ipsum dolor</i>…</p>"

And the punctuation just before the ellipsis is not kept:

HTML_Truncator.truncate("<p>Lorem ipsum: lorem ipsum dolor sit amet.</p>", 2)
# => "<p>Lorem ipsum…</p>"
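A minimal plain-text sketch of that punctuation clean-up, assuming the default punctuation_chars listed above. `append_ellipsis` is a hypothetical name, and stripping several consecutive punctuation characters is an assumption of this sketch rather than documented behaviour.

```ruby
# Same characters as the documented punctuation_chars defaults.
PUNCTUATION = [",", ".", ":", ";", "!", "?"].freeze

# Hypothetical helper: removes trailing punctuation, then appends the ellipsis.
def append_ellipsis(text, ellipsis = "…")
  kept = text.rstrip
  kept = kept.chop while !kept.empty? && PUNCTUATION.include?(kept[-1])
  kept + ellipsis
end

append_ellipsis("Lorem ipsum:") # => "Lorem ipsum…"
```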

You can indicate that a tag can contain the ellipsis by adding it to ellipsable_tags:

HTML_Truncator.ellipsable_tags << "blockquote"
HTML_Truncator.truncate("<blockquote>Lorem ipsum dolor sit amet.</blockquote>", 3)
# => "<blockquote>Lorem ipsum dolor…</blockquote>"

You can check whether a string was truncated with the html_truncated? method:

HTML_Truncator.truncate("<p>Lorem ipsum dolor sit amet.</p>", 3).html_truncated?
# => true
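One way this flag might be used is to show a link only when something was actually cut. The helper below is a hypothetical sketch: `excerpt_with_link`, the link text, and the URL are illustrative, and the only thing it relies on from the gem is that the returned string responds to html_truncated?.

```ruby
# Hypothetical view helper: appends a link only when the excerpt was shortened.
# The URL is inserted as-is, so it is assumed to be trusted and already escaped.
def excerpt_with_link(excerpt, url)
  truncated = excerpt.respond_to?(:html_truncated?) && excerpt.html_truncated?
  truncated ? %(#{excerpt}<a href="#{url}">Read more</a>) : excerpt.to_s
end
```

It would be called with the result of HTML_Truncator.truncate as the first argument.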

You can ignore images in the text by overriding the self_closing_tags attribute:

HTML_Truncator.self_closing_tags.delete "img"
HTML_Truncator.truncate("<p>Lorem ipsum <img src='...'>dolor sit amet.</p>", 3)
# => "<p>Lorem ipsum dolor…</p>"

If you have already parsed an HTML document with Nokogiri, you can use it directly to truncate:

document = Nokogiri::HTML::DocumentFragment.parse(text)
# Doing something with this document
options = HTML_Truncator::DEFAULT_OPTIONS.merge(length_in_chars: true)
document.truncate(12, options)

Alternatives

Rails has a truncate helper, but as the doc says:

Care should be taken if text contains HTML tags or entities, because truncation may produce invalid HTML (such as unbalanced or incomplete tags).

I know there is other Ruby code for truncating HTML, but I'm not pleased with these solutions: they either parse the content with regexps (too fragile), don't put the ellipsis where expected, cut words, or sometimes leave empty DOM nodes. So I made my own gem ;-)

Issues or Suggestions

Found an issue or have a suggestion? Please report it on GitHub's issue tracker.

If you want to make a pull request, please run the specs first:

rspec spec

Credits

Thanks to François de Metz for his awesome help! Thanks to kuroir and benhutton for their suggestions.

The code is released under the MIT license. See the MIT-LICENSE file for the full license.

♡2011 by Bruno Michel. Copying is an act of love. Please copy and share.

html-truncator's People

Contributors

aakash-dhingra, adammck, krukgit, makaio, nono, olleolleolle, tinynumbers

html-truncator's Issues

Removing img tag

When I truncate my text, it just removes the img tag:

`

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sed rhoncus mauris. Pellentesque tempus, sapien sit amet volutpat tristique, felis lectus rhoncus sem, ut laoreet velit nisi ac turpis.


Select01-assim Se Lhe Parece_de Carla Gallo_frame3 Divulgação

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sed rhoncus mauris. Pellentesque tempus, sapien sit amet volutpat tristique, felis lectus rhoncus sem, ut laoreet velit nisi ac turpis.

`

number of words vs number of chars

Any interest in making the API similar to Rails' truncate? You tell it the max number of chars instead of the max number of words. You can also tell it the separator to use, for instance ' ', to make sure it truncates on a word boundary: it will truncate at the first separator before the limit.

My concern with truncating on the number of words is that if the input has a really long 'word' (hundreds of chars without any spaces), nothing will get truncated.

Messing up Nokogiri

Am I wrong in assuming that you are modifying the way that Nokogiri's built-in classes work for all code that is calling into Nokogiri? Thus making this unusable in large projects that already depend on Nokogiri?

Characters instead of Words

First of all thanks for this awesome gem, I'm really happy with it! It's just what I needed. But I found a case where I need a fixed character length. Would it be possible for you to consider adding an option for character length as an extra option?

Pattern in words scan

To find and count words in a text node, you use a pattern like this:

words = content.scan(/\s*\S+/)

but it doesn't cover special characters like NO-BREAK SPACE.
I think the pattern /[[:space:]]*[[[:punct:]][[:word:]]]+/ is better because it also covers non-ASCII characters in UTF-8.
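The difference between the two patterns can be checked in plain Ruby. The sample string is made up for illustration; in Ruby, \s matches only ASCII whitespace, while the POSIX bracket [[:space:]] is Unicode-aware.

```ruby
# U+00A0 NO-BREAK SPACE sits between the first two words.
text = "Lorem\u00A0ipsum dolor"

ascii_words   = text.scan(/\s*\S+/)
unicode_words = text.scan(/[[:space:]]*[[[:punct:]][[:word:]]]+/)

puts ascii_words.length    # 2: the no-break space is treated as part of a word
puts unicode_words.length  # 3: the no-break space separates "Lorem" and "ipsum"
```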

Is there any way to ignore specified tags?

Suppose we have a blog with images and text. We want to show truncated posts as previews in the blog list, but without the images.

It seems the gem doesn't support this?

Thanks for any help.

HTML encoded strings are decoded

Just stumbled on this:

HTML_Truncator.truncate('12345678901', 10, length_in_chars: true)
=> "1234567890…" # good

HTML_Truncator.truncate('<br>12345678901', 10, length_in_chars: true)
=> "<br>1234567890…" # good

HTML_Truncator.truncate('<br>&lt;br&gt;12345678901', 10, length_in_chars: true)
=> "<br><br>123456…" # bad, second '<br>' is decoded!

HTML_Truncator.truncate('&lt;br&gt;', 10, length_in_chars: true)
=> "&lt;br&gt;" # inconsistent: if length is shorter, it is not decoded as opposed to example before

I think the method should never decode strings. The encoded chars could count as 1 length (&lt; is length 1 etc), so:

HTML_Truncator.truncate('<br>&lt;br&gt;12345678901', 10, length_in_chars: true)
=> "<br>&lt;br&gt;123456…"

This would make most sense I think.
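The counting rule proposed here (each character reference counts as one character and stays encoded) can be sketched on a tag-free string. `truncate_encoded` and `TOKEN` are hypothetical names, and this is an illustration of the proposal, not the gem's behaviour.

```ruby
# One token is either a whole character reference ("&lt;", "&#38;", "&#x26;")
# or a single character.
TOKEN = /&(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);|./m

# Hypothetical helper: truncates by token count, never decoding references.
def truncate_encoded(text, max_chars, ellipsis = "…")
  tokens = text.scan(TOKEN)
  return text if tokens.length <= max_chars
  tokens.first(max_chars).join + ellipsis
end

truncate_encoded("&lt;br&gt;12345678901", 10) # => "&lt;br&gt;123456…"
truncate_encoded("&lt;br&gt;", 10)            # => "&lt;br&gt;"
```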

Comma after text

I truncated some text containing a comma and received this result: 'some long text,...'
So I would like the comma after the text to be removed too.
Can you add this feature?

For speed, allow Nokogiri nodes to be passed in

I am already using Nokogiri for processing HTML, and it would be more efficient if I could just pass in a node to your sanitizer, instead of outputting to HTML, then having your library re-convert it to a node.

Strips iframe tags on ruby 2.1

@nono

2.0.0-p247 :061 > text = "<iframe width=640 height=360 src=//www.youtube.com/embed/WLIfmnlSkQ4?feature=player_detailpage frameborder=0 allowfullscreen></iframe>"
 => "<iframe width=640 height=360 src=//www.youtube.com/embed/WLIfmnlSkQ4?feature=player_detailpage frameborder=0 allowfullscreen></iframe>"
2.0.0-p247 :062 > HTML_Truncator.truncate(text, 2)
 => "<iframe width=\"640\" height=\"360\" src=\"//www.youtube.com/embed/WLIfmnlSkQ4?feature=player_detailpage\" frameborder=\"0\" allowfullscreen></iframe>"


2.1.0 :001 > text = "<iframe width=640 height=360 src=//www.youtube.com/embed/WLIfmnlSkQ4?feature=player_detailpage frameborder=0 allowfullscreen></iframe>"
 => "<iframe width=640 height=360 src=//www.youtube.com/embed/WLIfmnlSkQ4?feature=player_detailpage frameborder=0 allowfullscreen></iframe>"
2.1.0 :002 > HTML_Truncator.truncate(text, 2)
 => ""
2.1.0 :003 >

Strips script and style

HTML_Truncator.truncate("<style>Lorem ipsum dolor sit amet.</style>", 3)
HTML_Truncator.truncate("<script>Lorem ipsum dolor sit amet.</script>", 3)

results in

<style></style>…

Some tags should not be touched at all.

Truncations after tag boundary

For html like the following:

<p>
  five words in this paragraph
</p>
<p>
  some more text which will be truncated
</p>

Given the above html is stored in the html var:

HTML_Truncator.truncate(html, 5)

returns html as follows:

<p>
  five words in this paragraph
</p>
<p>
  ...
</p>

Ideally I would expect the ... to be appended to the first <p> block, with the second block removed, like so:

<p>
  five words in this paragraph...
</p>

This looks like it could be a little bit of a pain to implement, sorry :3
