
iww's Introduction

IWW-IntelliWebWrapper



An AI-based web-mining library for web content extraction using machine learning algorithms.

Currently, the library offers the following functionality and algorithms:

  • DOM extractor, mapper, reducer and flattening utilities.
  • DoC (degree of coherence), a Euclidean-distance-based similarity measure.
  • LD, the lists detector algorithm.
  • MCD, the main content detector algorithm.
  • A method for integrating the results of other algorithms into MCD.
  • CETD algorithm.
  • DOM tags detector script (highlights the chosen nodes).
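The README does not give the DoC formula, so the following is only a minimal sketch of what a Euclidean-distance-based similarity could look like. It assumes each sibling node is summarised by a numeric feature vector; the function names and the `1 / (1 + d)` mapping are illustrative assumptions, not the library's implementation.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def coherence(vectors):
    """Hypothetical coherence score in (0, 1].

    The mean pairwise Euclidean distance d is mapped to 1 / (1 + d),
    so identical vectors score exactly 1.0 and the score drops as the
    vectors spread apart. This mapping is an assumption for illustration.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # zero or one vector: nothing to disagree with
    mean_dist = sum(euclidean(a, b) for a, b in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_dist)


# Made-up feature vectors for the children of two candidate nodes.
identical = [(3, 1, 0), (3, 1, 0), (3, 1, 0)]
mixed = [(3, 1, 0), (0, 0, 7), (10, 2, 1)]
```

Under this sketch, a node whose children look alike (a product list, for instance) scores close to 1, which is the kind of value a range such as `coherence_threshold = (0.75, 1)` would select.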

P.S.:

  • The documentation isn't available yet.
  • The LD & MCD algorithms are to be released as a research article in the near future.
  • The pip package of iww will be available online as soon as possible.

USE CASE EXAMPLE :

1- extraction :

from iww.extractor import extractor
from iww.detector import detector
from iww.features_extraction.lists_detector import Lists_Detector as LD
from iww.features_extraction.main_content_detector import MCD
url = "https://www.theiconic.com.au/catalog/?q=kids%20sunglasses"
json_file = "./iconic.json"

extractor.extract(
    url = url, 
    destination = json_file
)

2- exploratory data analysis :

from iww.utils.dom_mapper import DOM_Mapper as DM

dm = DM()
dm.retrieve_DOM_tree("./iconic.json")
print("total number of nodes : {}".format(dm.DOM['CETD']['tagsCount']))

total number of nodes : 2098
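The README does not expand CETD. If it refers to content extraction via text density (an assumption), a `tagsCount` value is the denominator of a per-node ratio of characters to tags. The sketch below computes that ratio on a toy tree. The `text` and `children` keys are hypothetical and do not describe the JSON that iww actually writes.

```python
def count_tags(node):
    """Number of descendant tags under a node (hypothetical 'children' key)."""
    return sum(1 + count_tags(child) for child in node.get("children", []))


def count_chars(node):
    """Number of text characters in a node and all of its descendants."""
    own = len(node.get("text", ""))
    return own + sum(count_chars(child) for child in node.get("children", []))


def text_density(node):
    """Characters per tag; a tag count of 0 is treated as 1 to avoid dividing by zero."""
    return count_chars(node) / (count_tags(node) or 1)


# Toy nodes: one long paragraph versus a short navigation menu.
article = {"text": "", "children": [{"text": "a" * 120, "children": []}]}
menu = {
    "text": "",
    "children": [
        {"text": "Home", "children": []},
        {"text": "Shop", "children": []},
        {"text": "Help", "children": []},
    ],
}
```

Here the article node has a much higher density than the menu node, which is the intuition behind using text density to separate main content from navigation.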

3- LD algorithm :

ld = LD()
ld.retrieve_DOM_tree(file_path = "./iconic.json")
ld.apply(
    node = ld.DOM, 
    coherence_threshold = (0.75, 1), 
    sub_tags_threshold = 2
)
ld.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_ld.png",
    mark_path = "LISTS.mark", 
    mark_value = "1"
)

4- MCD algorithm :

mcd = MCD()
mcd.retrieve_DOM_tree("./iconic.json")
mcd.apply(
    node = mcd.DOM, 
    min_ratio_threshold = 0.0, 
    nbr_nodes_threshold = 1
)
mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_mcd.png",
    mark_path = "MCD.mark", 
    mark_value = "1"
)

5- LD/MCD integration (main list detection) :

mcd.integrate_other_algorithms_results(
    node = mcd.DOM, 
    nbr_nodes = 1,
    mode = "ancestry", 
    condition_features = [("LISTS.mark", "1")]
)
mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_main_list.png",
    mark_path = "MCD.main_node", 
    mark_value = "1"
)

License

MIT

MOHAMED-HMINI 2019

iww's People

Contributors

arnavdas88, mohamedhmini

iww's Issues

Please provide requirements file

@MohamedHmini Hey, your project looks really interesting. I am working on a project along similar lines and came across your repo while exploring automated web-content extraction. I am trying to set this project up on my machine and am running into issues installing the dependencies. If you could provide a requirements file, it would be a great help for me to dive deeper into this project.

This is interesting

Hey Mohamed, this is an interesting repo. Are you still working on it? Do you have any plans for future development?

Peace,

Unable to create json file

I cloned the repo and ran test.py, but it throws this error:

/content/iww/iww/extractor/resources_extractor.js https://stackoverflow.com/questions/66102275/commanderjs-i-cant-get-value-from-option /content/iww/test.json
internal/modules/cjs/loader.js:883
  throw err;
  ^

Error: Cannot find module 'puppeteer'
Require stack:
- /content/iww/iww/extractor/resources_extractor.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:880:15)
    at Function.Module._load (internal/modules/cjs/loader.js:725:27)
    at Module.require (internal/modules/cjs/loader.js:952:19)
    at require (internal/modules/cjs/helpers.js:88:18)
    at Object.<anonymous> (/content/iww/iww/extractor/resources_extractor.js:1:19)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [ '/content/iww/iww/extractor/resources_extractor.js' ]
}
Traceback (most recent call last):
  File "/content/iww/test.py", line 21, in <module>
    dm.retrieve_DOM_tree("./test.json")
  File "/content/iww/iww/utils/dom_mapper.py", line 44, in retrieve_DOM_tree
    file = open(self.DOM_file_path, 'r', encoding = 'UTF-8')
FileNotFoundError: [Errno 2] No such file or directory: '/content/test.json'
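The log shows two linked failures: Node.js cannot resolve `puppeteer`, so the extractor script never writes the JSON file, and the later `open()` call then fails. A quick pre-flight check along the following lines may help confirm the cause before running test.py. The helper name is hypothetical and this is only a sketch; it is not part of iww.

```python
import shutil
import subprocess


def puppeteer_available(node_cmd="node", cwd=None):
    """Return True if Node.js is on PATH and can resolve the 'puppeteer' module.

    cwd should be the directory Node resolves modules from, for example the
    folder that contains resources_extractor.js.
    """
    node_path = shutil.which(node_cmd)
    if node_path is None:
        return False  # Node.js itself is missing
    result = subprocess.run(
        [node_path, "-e", "require.resolve('puppeteer')"],
        cwd=cwd,
        capture_output=True,
    )
    # Node exits with a non-zero status when require.resolve throws.
    return result.returncode == 0


# Example use before calling the extractor:
#     if not puppeteer_available(cwd="./iww/extractor"):
#         print("puppeteer not found; try `npm install puppeteer`")
```

If the check returns False, running `npm install puppeteer` in the repository (or in the extractor folder) is the usual way to make the module resolvable.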
