justext

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages.

Demo

Try online

History

Version 0.0.1 - Port from the Python code
Version 0.0.2 - Add logger lib
Version 0.0.3 - Migrate to rollup

justext's Issues

Function url is incorrect

Core.getHtmlOfUrl returns a promise, so url needs to return that promise to its caller:

export function url(externalUrl, language = '', format = 'default', options = {
  lengthLow: LENGTH_LOW_DEFAULT,
  lengthHigh: LENGTH_HIGH_DEFAULT,
  stopwordsLow: STOPWORDS_LOW_DEFAULT,
  stopwordsHigh: STOPWORDS_HIGH_DEFAULT,
  maxLinkDensity: MAX_LINK_DENSITY_DEFAULT,
  maxHeadingDistance: MAX_HEADING_DISTANCE_DEFAULT,
  noHeadings: NO_HEADINGS_DEFAULT,
}) {
  return Core.getHtmlOfUrl(externalUrl).then((response) => {
    const htmlText = response.data;
    return rawHtml(htmlText, language, format, options);
  }).catch((error) => {
    throw error;
  });
}

The caller can then consume the result like this: justext.url(url, language, 'detailed', options).then((data) => console.log(data));
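
To illustrate why the missing return matters, here is a minimal sketch. The helpers fakeGetHtmlOfUrl and fakeRawHtml are hypothetical stand-ins for Core.getHtmlOfUrl and rawHtml (they are not part of the package); the only assumption carried over from the code above is that the fetch helper resolves with an object that has a data field.

```javascript
// Hypothetical stand-in for Core.getHtmlOfUrl: resolves with { data: <html> }.
function fakeGetHtmlOfUrl(externalUrl) {
  return Promise.resolve({ data: `<p>content of ${externalUrl}</p>` });
}

// Hypothetical stand-in for rawHtml: any synchronous transformation will do.
function fakeRawHtml(htmlText) {
  return htmlText.toUpperCase();
}

// Without `return`, the promise chain is created but never handed back,
// so the caller receives undefined and cannot call .then on it.
function urlWithoutReturn(externalUrl) {
  fakeGetHtmlOfUrl(externalUrl).then((response) => fakeRawHtml(response.data));
}

// With `return`, the caller receives the promise and can await the result.
function urlWithReturn(externalUrl) {
  return fakeGetHtmlOfUrl(externalUrl).then((response) => fakeRawHtml(response.data));
}

console.log(urlWithoutReturn('http://example.com')); // undefined
urlWithReturn('http://example.com').then((data) => console.log(data));
```

The .catch block that only rethrows, as in the proposed fix, does not change this behavior: a rejected promise still reaches the caller as a rejection either way.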

Question about reviseParagraphClassification

reviseParagraphClassification returns paragraphs, but paragraphs does not appear to be modified in any way; the new filtered data seems to end up in reviseParagraphs. Should reviseParagraphs be returned instead of paragraphs?

I have included the Python reference code for convenience.

def revise_paragraph_classification(paragraphs, max_heading_distance=MAX_HEADING_DISTANCE_DEFAULT):
    """
    Context-sensitive paragraph classification. Assumes that classify_paragraphs
    has already been called.
    """
    # copy classes
    for paragraph in paragraphs:
        paragraph.class_type = paragraph.cf_class

    # good headings
    for i, paragraph in enumerate(paragraphs):
        if not (paragraph.heading and paragraph.class_type == 'short'):
            continue
        j = i + 1
        distance = 0
        while j < len(paragraphs) and distance <= max_heading_distance:
            if paragraphs[j].class_type == 'good':
                paragraph.class_type = 'neargood'
                break
            distance += len(paragraphs[j].text)
            j += 1

    # classify short
    new_classes = {}
    for i, paragraph in enumerate(paragraphs):
        if paragraph.class_type != 'short':
            continue
        prev_neighbour = get_prev_neighbour(i, paragraphs, ignore_neargood=True)
        next_neighbour = get_next_neighbour(i, paragraphs, ignore_neargood=True)
        neighbours = set((prev_neighbour, next_neighbour))
        if neighbours == set(['good']):
            new_classes[i] = 'good'
        elif neighbours == set(['bad']):
            new_classes[i] = 'bad'
        # it must be set(['good', 'bad'])
        elif (prev_neighbour == 'bad' and get_prev_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood') or \
             (next_neighbour == 'bad' and get_next_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood'):
            new_classes[i] = 'good'
        else:
            new_classes[i] = 'bad'

    for i, c in new_classes.items():
        paragraphs[i].class_type = c

    # revise neargood
    for i, paragraph in enumerate(paragraphs):
        if paragraph.class_type != 'neargood':
            continue
        prev_neighbour = get_prev_neighbour(i, paragraphs, ignore_neargood=True)
        next_neighbour = get_next_neighbour(i, paragraphs, ignore_neargood=True)
        if (prev_neighbour, next_neighbour) == ('bad', 'bad'):
            paragraph.class_type = 'bad'
        else:
            paragraph.class_type = 'good'

    # more good headings
    for i, paragraph in enumerate(paragraphs):
        if not (paragraph.heading and paragraph.class_type == 'bad' and paragraph.cf_class != 'bad'):
            continue
        j = i + 1
        distance = 0
        while j < len(paragraphs) and distance <= max_heading_distance:
            if paragraphs[j].class_type == 'good':
                paragraph.class_type = 'good'
                break
            distance += len(paragraphs[j].text)
            j += 1
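
For comparison, the first two passes of the Python reference above could be sketched in JavaScript roughly as follows. This is not the package's actual code: the property names (cfClass, classType, heading, text) simply mirror the Python attributes, and the default distance of 200 is an assumed value. The point of the sketch is that the Python version mutates each paragraph object in place, so a port written the same way can return the original array and still expose the revised classes.

```javascript
// Sketch only: property names mirror the Python attributes and are
// assumptions about the JavaScript port, as is the default of 200.
function reviseHeadingsSketch(paragraphs, maxHeadingDistance = 200) {
  // Copy classes: start from the context-free classification.
  for (const paragraph of paragraphs) {
    paragraph.classType = paragraph.cfClass;
  }

  // Good headings: a short heading followed closely by a good paragraph
  // becomes 'neargood'.
  paragraphs.forEach((paragraph, i) => {
    if (!(paragraph.heading && paragraph.classType === 'short')) return;
    let distance = 0;
    for (let j = i + 1; j < paragraphs.length && distance <= maxHeadingDistance; j += 1) {
      if (paragraphs[j].classType === 'good') {
        paragraph.classType = 'neargood';
        break;
      }
      distance += paragraphs[j].text.length;
    }
  });

  // The same array is returned; its elements were modified in place.
  return paragraphs;
}

const paragraphs = [
  { heading: true, cfClass: 'short', text: 'Title' },
  { heading: false, cfClass: 'good', text: 'A long body paragraph.' },
];
const result = reviseHeadingsSketch(paragraphs);
console.log(result === paragraphs); // true
console.log(paragraphs[0].classType); // 'neargood'
```

If the port instead builds a separate array of copied objects, the in-place guarantee no longer holds and the copy is what would need to be returned.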

Question: Is this sufficiently up to date with the python implementation?

Thank you for making this package. Is it up to date with the Python implementation (https://github.com/miso-belica/jusText)? Or was it a near-identical port only at the time it was written?

Migrate to build with Rollup

[ ] Migrate to build with Rollup
[ ] Fix the demo

Ref: https://github.com/developit/microbundle

Repository: jellydn / justext
