WebPage-Segmentation--WPS-

Introduction

This is WPS-DB, our webpage segmentation method. Unlike other methods such as VIPS and Block-o-Matic, we cluster our data with DBSCAN instead of K-means.
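
As a rough illustration of the clustering step, here is a minimal pure-Python DBSCAN sketch. It is not the WPS-DB implementation: the input points (imagined as pixel centers of page elements), the `eps` radius, and the `min_pts` threshold are all hypothetical. What it shows is the general property that separates DBSCAN from K-means: the number of clusters is not given in advance, and isolated points are reported as noise.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing
    return labels


# Hypothetical (x, y) centers of page elements, in pixels.
points = [(10, 10), (12, 14), (15, 11),
          (200, 300), (204, 305), (198, 296),
          (600, 50)]
print(dbscan(points, eps=10, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, and the lone point at (600, 50) is labeled as noise rather than being forced into a cluster.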

Testing for Stack Overflow (Questions tab)

https://stackoverflow.com/questions

Testing for Stack Exchange

https://stackexchange.com

Testing on more pages (using Block-O-Matic's dataset)

Please visit this site to view the results:

https://drive.google.com/drive/folders/1uEAfsyFiR82Vejc26fgoWBLR1VpSaI-b?usp=sharing

Usage

  • Install dependencies: pip install -r requirments.txt

  • Run WPS-DB:

    • Download our Jupyter Notebook and run your tests there, or
    • Use the command: python WPS_DB_Test.py <your webpage's url>
  • Check the Screenshots folder in your current working directory to see the segmentation layout.

Contributors

lqtri, tungdaoxuan123

webpage-segmentation--wps-'s Issues

Some encoding problem

When I tested the CSDN homepage, the run failed:

python WPS_DB_Test.py https://www.csdn.net/

I got the following log:

Traceback (most recent call last):
  File "D:\TEMP\WebPage-Segmentation--WPS--main\WebPage-Segmentation--WPS--new\WPS_DB_Test.py", line 39, in <module>
    main()
  File "D:\TEMP\WebPage-Segmentation--WPS--main\WebPage-Segmentation--WPS--new\WPS_DB_Test.py", line 20, in main
    wpsdb = Wpsdb.Wpsdb(unquote(sys.argv[1], encoding="utf-8"))
  File "D:\TEMP\WebPage-Segmentation--WPS--main\WebPage-Segmentation--WPS--new\WPS_DB.py", line 36, in __init__
    self.getDomTree()
  File "D:\TEMP\WebPage-Segmentation--WPS--main\WebPage-Segmentation--WPS--new\WPS_DB.py", line 118, in getDomTree
    self.toHTMLFile(self.browser.page_source)
  File "D:\TEMP\WebPage-Segmentation--WPS--main\WebPage-Segmentation--WPS--new\WPS_DB.py", line 114, in toHTMLFile
    file.write(str(page_source))
UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 125149: illegal multibyte sequence

I'm not familiar with the PR process, so I'm just leaving an issue.

The log makes it clear that this is an encoding problem. I added encoding='utf-8' and the problem was solved. There are two places with this problem:

  • ImageOut.py: line 15
  • WPS_DB.py: line 13
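
The fix described above might look like the sketch below. Only the `file.write(str(page_source))` call and the `toHTMLFile` name come from the traceback; the function signature, the `path` parameter, and the temporary output file are assumptions made for illustration, not the project's actual code.

```python
import os
import tempfile


def to_html_file(page_source, path):
    # Hypothetical stand-in for toHTMLFile. Passing encoding explicitly keeps
    # open() from falling back to the platform default (GBK in the traceback),
    # which cannot encode characters such as '\xa9'.
    with open(path, "w", encoding="utf-8") as file:
        file.write(str(page_source))


# Write a page fragment containing the character from the error message.
path = os.path.join(tempfile.mkdtemp(), "page.html")
to_html_file("<p>\xa9 example</p>", path)

# Read it back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as file:
    print(file.read())
```

Without the `encoding` argument, `open()` uses the locale's preferred encoding, so the same script can work on one machine and fail on another.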
