
IWW-IntelliWebWrapper



An AI-based web-mining library for web content extraction using machine learning algorithms.

Currently, the library offers the following features and algorithms:

  • DOM extraction, mapping, reduction and flattening.
  • DoC (degree of coherence), a similarity measure based on Euclidean distance.
  • LD, a lists detector algorithm.
  • MCD, a main content detector algorithm.
  • A method that integrates the results of other algorithms into the MCD results.
  • CETD algorithm.
  • DOM tags detector script (highlights the chosen nodes).
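
The library's DoC formula is not documented here, so the following is only a minimal sketch of one common way to turn a Euclidean distance into a similarity score. The function names, the `1 / (1 + d)` mapping and the feature vectors are all assumptions made for illustration, not the iww implementation.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Identical vectors give 1.0; the score shrinks as the distance grows.
    This mapping is an assumption, not the DoC definition used by iww.
    """
    return 1.0 / (1.0 + euclidean_distance(a, b))


# Hypothetical feature vectors for two DOM nodes
# (for example: tag count, text length, link count).
node_a = [3, 4, 0]
node_b = [0, 0, 0]

print(distance_similarity(node_a, node_a))  # identical vectors: 1.0
print(distance_similarity(node_a, node_b))  # distance is 5, so 1 / 6
```

A score of this kind always falls between 0 and 1, which would fit a range such as the `coherence_threshold = (0.75, 1)` argument used in the LD example, although that link is a guess.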

P.S.:

  • The documentation isn't available yet.
  • The LD & MCD algorithms are to be released as a research article in the near future.
  • The pip package of iww will be available online as soon as possible.

USE CASE EXAMPLE :

1- extraction :

from iww.extractor import extractor
from iww.detector import detector
from iww.features_extraction.lists_detector import Lists_Detector as LD
from iww.features_extraction.main_content_detector import MCD

# page to mine, and the JSON file that will hold the extracted DOM tree
url = "https://www.theiconic.com.au/catalog/?q=kids%20sunglasses"
json_file = "./iconic.json"

extractor.extract(
    url = url,
    destination = json_file
)

2- exploratory data analysis :

from iww.utils.dom_mapper import DOM_Mapper as DM

dm = DM()
dm.retrieve_DOM_tree("./iconic.json")
print("total number of nodes : {}".format(dm.DOM['CETD']['tagsCount']))

total number of nodes : 2098
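
The same count can be checked by walking the tree directly. The sketch below assumes each node is a dict that lists its children under a `children` key. That key, the `tagName` key and the `count_nodes` helper are hypothetical and may not match the JSON layout iww actually writes.

```python
def count_nodes(node):
    """Count a node plus all of its descendants.

    Assumes children are stored in a list under the "children" key;
    nodes without that key are treated as leaves.
    """
    return 1 + sum(count_nodes(child) for child in node.get("children", []))


# Tiny hand-written tree standing in for the extracted JSON file.
tree = {
    "tagName": "body",
    "children": [
        {"tagName": "div", "children": [{"tagName": "p"}, {"tagName": "p"}]},
        {"tagName": "footer"},
    ],
}

print("total number of nodes : {}".format(count_nodes(tree)))
```

For very deep DOM trees, an iterative traversal with an explicit stack would avoid Python's recursion limit.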

3- LD algorithm :

ld = LD()
ld.retrieve_DOM_tree(file_path = "./iconic.json")
ld.apply(
    node = ld.DOM, 
    coherence_threshold = (0.75, 1), 
    sub_tags_threshold = 2
)
ld.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_ld.png",
    mark_path = "LISTS.mark", 
    mark_value = "1"
)
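
The `mark_path` and `mark_value` arguments suggest that each node carries nested attributes addressed by a dotted path, much like `dm.DOM['CETD']['tagsCount']` above. The sketch below shows how such a lookup might work on a plain nested dict. The tree layout, the `children` and `tagName` keys and both helper functions are assumptions for illustration, not part of the iww API.

```python
def get_by_path(node, path):
    """Follow a dotted path such as "LISTS.mark" through nested dicts.

    Returns None when any key along the path is missing.
    """
    current = node
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def collect_marked(node, mark_path, mark_value):
    """Return the tag names of all nodes whose value at mark_path equals mark_value."""
    found = []
    if get_by_path(node, mark_path) == mark_value:
        found.append(node.get("tagName"))
    for child in node.get("children", []):
        found.extend(collect_marked(child, mark_path, mark_value))
    return found


# Hypothetical tree in which only the <ul> node was marked as a list.
tree = {
    "tagName": "body",
    "LISTS": {"mark": "0"},
    "children": [
        {"tagName": "ul", "LISTS": {"mark": "1"}, "children": [{"tagName": "li"}]},
        {"tagName": "div"},
    ],
}

print(collect_marked(tree, "LISTS.mark", "1"))  # ['ul']
```

The mark value is compared as a string here because the examples pass `mark_value = "1"` rather than an integer.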

4- MCD algorithm :

mcd = MCD()
mcd.retrieve_DOM_tree("./iconic.json")
mcd.apply(
    node = mcd.DOM, 
    min_ratio_threshold = 0.0, 
    nbr_nodes_threshold = 1
)
mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_mcd.png",
    mark_path = "MCD.mark", 
    mark_value = "1"
)

5- LD/MCD integration (main list detection) :

mcd.integrate_other_algorithms_results(
    node = mcd.DOM, 
    nbr_nodes = 1,
    mode = "ancestry", 
    condition_features = [("LISTS.mark", "1")]
)

mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_main_list.png",
    mark_path = "MCD.main_node", 
    mark_value = "1"
)

License

MIT

MOHAMED-HMINI 2019
