extract-text-html

Extract text from HTML. By default, content inside metadata tags such as script and style is excluded, runs of multiple spaces are reduced to a single space, and whitespace is trimmed from the start and end. Set preserveWhitespace to true to disable the whitespace handling. Optionally, replace tags with text.
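To make the default whitespace handling concrete, here is a minimal sketch of the behavior described above. `normalizeWhitespace` is a hypothetical helper written for illustration, not the library's own code, and it assumes that "multiple spaces" covers any run of whitespace, including tabs and newlines.

```typescript
// Hypothetical helper illustrating the documented default; not the library's implementation.
// Assumption: any run of whitespace (spaces, tabs, newlines) collapses to a single space.
function normalizeWhitespace(text: string, preserveWhitespace = false): string {
    if (preserveWhitespace) {
        // Mirrors preserveWhitespace: true, where the text is returned untouched.
        return text
    }
    // Collapse whitespace runs, then trim the start and end.
    return text.replace(/\s+/g, ' ').trim()
}

const raw = '  Some Title \n\n   Some   text  '
console.log(normalizeWhitespace(raw)) // Some Title Some text
console.log(JSON.stringify(normalizeWhitespace(raw, true))) // the original string, unchanged
```

With `preserveWhitespace` set, line breaks in the source HTML survive, which is what makes newline replacements for tags such as br meaningful.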

Offers a much nicer out-of-the-box experience compared to striptags. See comparison here.

Single dependency on htmlparser2

export interface Replacement {
    /** Tag name to match (without brackets) */
    matchTag: string
    /** Text to replace the tag with */
    text: string
    /** Is the tag self-closing?  */
    isSelfClosing?: boolean
}

export interface Options {
    /** Exclude content from the set of tags. Defaults to all HTML metadata tags. */
    excludeContentFromTags?: string[]
    /** Whitespace is trimmed by default. Set this to true to preserve whitespace. */
    preserveWhitespace?: boolean
    /** Replace a tag with some text. Flag self-closing tags with isSelfClosing: true. */
    replacements?: Replacement[]
}

// Content from the following tags is excluded by default
export const defaultExcludeContentFromTags = [
    'head',
    'base',
    'link',
    'meta',
    'noscript',
    'script',
    'style',
    'title',
]
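The README does not say whether a custom `excludeContentFromTags` replaces the defaults or is merged with them. Assuming it replaces them, one way to keep the defaults and add a few more tags is sketched below. `withExtraExcludedTags` is a hypothetical helper, and the list is a local copy of the defaults shown above so the snippet runs without the package installed; in real code you would import `defaultExcludeContentFromTags` from the package instead.

```typescript
// Local copy of the defaults listed above, so this sketch is self-contained.
const defaultExcludeContentFromTags: string[] = [
    'head',
    'base',
    'link',
    'meta',
    'noscript',
    'script',
    'style',
    'title',
]

// Hypothetical helper: append extra tags to the defaults, dropping duplicates.
function withExtraExcludedTags(extraTags: string[]): string[] {
    return Array.from(new Set([...defaultExcludeContentFromTags, ...extraTags]))
}

// 'script' is already a default, so it is not added twice.
const excludeContentFromTags = withExtraExcludedTags(['nav', 'footer', 'script'])
console.log(excludeContentFromTags.length) // 10

// Assumption: the result is passed as the excludeContentFromTags option, e.g.
// extractText(html, { excludeContentFromTags })
```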

Example

import { extractText } from 'extract-text-html'

const html = `
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <link rel="stylesheet" href="https://static-production.npmjs.com/styles.74f9073cf68d3c5f4990.css" />
    <title data-react-helmet="true">extracttext - npm search</title>
  </head>
  <body>
    <h1>Some Title</h1>
    <div style="font-weight: bold">Some text</div>
    <script crossorigin="anonymous" src="https://static-production.npmjs.com/minicssextractbug.536095f4b1a94d2b149c.js"></script>
    <script crossorigin="anonymous" src="https://static-production.npmjs.com/search/search.9fbe393f02970084bce5.js"></script>
    <script>
      const FOO = 'bar'
    </script>
    <br />
    <br />
  </body>
</html>
    `
const extracted = extractText(html)
// Some Title Some text

Replacements example usage

const html = `<b>bold <span>text</span></b>
<div>some text</div><br /><br><p>more text</p>`
const extracted = extractText(html, {
    preserveWhitespace: true,
    replacements: [
        { matchTag: 'br', text: '\n', isSelfClosing: true },
        { matchTag: 'b', text: '__' },
    ],
})
/*
__bold text__
some text

more text
*/

Contributors

dubiousdavid
