
aelous's Introduction

Project of Team Aelous in Google ML Winter Camp 2020, Shanghai Site

13 - 17 January, 2020

Project Name: Webpage Wizard

Team Members: Muge Chen, Yujie Lu & Yanqiong Chen

1. Project Introduction

In machine learning, models are typically trained on fully labeled datasets. In most application domains, however, labeled data is scarce while unlabeled data is cheap to obtain. Our project faced the same problem: we classify webpages using a dataset in which only a small fraction of the samples is labeled.

Here is the basic information about our dataset:

  • Dataset name: webpage-classification
  • Data size: 10K pairs of {webpage screenshot image, webpage HTML source}.
  • 800 samples are manually labeled with {is_entity, category}.
  • 2590 samples are manually labeled with {is_entity}.

Our model tackles this problem with semi-supervised learning combined with rule-based methods.

2. Implementation Details

First, we pre-processed the dataset.

  • We crop and resize the rendered images to a fixed size of (224, 224, 3).
  • We use an HTML parser to extract useful information from the HTML source, and word2vec to embed the extracted text.
  • We convert the given labels into a probability distribution over the categories.
  • We also wrote a script that labels the unlabeled samples for which we are about 95% confident of the label, for example pages whose URL starts with wikipedia or contains imdb/title.
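The rule-based labeling step could look roughly like the sketch below. It is only an illustration: the `RULES` table, the `rule_label` helper, and the label values are hypothetical placeholders, not the project's actual script or the dataset's real category names.

```python
from urllib.parse import urlparse

# Hypothetical rules: (host suffix, required path substring, label).
# The label values are placeholders, not the dataset's real categories.
RULES = [
    ("wikipedia.org", "", {"is_entity": True, "category": "wiki_article"}),
    ("imdb.com", "/title/", {"is_entity": True, "category": "movie"}),
]


def rule_label(url):
    """Return a label dict if a rule matches the URL, otherwise None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    for host_suffix, path_part, label in RULES:
        host_ok = host == host_suffix or host.endswith("." + host_suffix)
        if host_ok and path_part in parsed.path:
            return dict(label)  # copy so callers cannot mutate the rule table
    return None
```

URLs that match no rule return `None` and stay unlabeled, so the rules only ever add labels where the pattern is unambiguous.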

Second, we built three independent models to classify the webpages.

2.1 URL model

In this sub-model, we use an unsupervised clustering algorithm to group the URLs. However, the clusters we found could not be mapped to the categories, because the dataset is heavily biased.
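The text does not say which clustering algorithm or URL features were used. As one possible illustration, the sketch below clusters URLs with plain k-means over three hand-picked features (path depth, digit count, presence of a query string). Both the features and the choice of k-means are assumptions.

```python
from urllib.parse import urlparse


def url_features(url):
    """Map a URL to a small numeric vector (hypothetical hand-picked features)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    digits = sum(ch.isdigit() for ch in url)
    has_query = 1.0 if parsed.query else 0.0
    return [float(len(segments)), float(digits), has_query]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Calling `kmeans([url_features(u) for u in urls], k)` returns one cluster index per URL. The indices carry no category meaning on their own, which is where the mapping problem described above arises.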

2.2 HTML model

First, we use an HTML parser to convert the HTML into plain text, then use doc2vec to transform the text into a 300-dimensional vector. A deep learning model then predicts the category from this vector.
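The parsing step can be sketched with Python's standard `html.parser` module. The `TextExtractor` class and `html_to_tokens` helper are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what counts as useful text. The project's own parser may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_tokens(html):
    """Return the page's visible text as a list of lowercase tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text().lower().split()
```

The resulting token list is the kind of input a doc2vec implementation expects. With gensim, for instance, each list could be wrapped in a `TaggedDocument` and a `Doc2Vec` model trained with `vector_size=300`, although the source does not say which library was used.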

The model reaches an accuracy of 80% on the dataset. However, it sometimes fails to converge, which suggests that problems remain in the model or in the feature engineering.

2.3 Image model

We train a deep CNN on the screenshots. It reaches an accuracy of 81% on the dataset and 95% on the validation set.

2.4 Combined model

In addition, we combined the two models: we train on both the HTML and the rendered screenshot, and the combined model performs well on the dataset.
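The text does not describe how the two inputs are combined. One common option is feature-level fusion: concatenate the HTML vector and the image features, then classify the joint vector. The sketch below shows that idea with a single linear layer and softmax. The function names, the weights, and the choice of concatenation are all assumptions, not the project's architecture.

```python
import math


def fuse(html_vec, image_vec):
    """Feature-level fusion: concatenate the two feature vectors."""
    return list(html_vec) + list(image_vec)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(fused, weights, biases):
    """One linear layer over the fused features, followed by softmax.

    `weights` has one row per category; each row is as long as `fused`.
    """
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)
```

In a real model the linear layer would be replaced by trained dense layers, and the image features would come from the CNN's penultimate layer rather than being supplied by hand.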

3. Demo and Posters

Webpage classification has many possible applications, for example:

  • A Chrome extension, similar to SwitchyOmega, that restyles the UI with a different strategy for each webpage category.
  • A browser that tracks which categories the user visits most often to decide which ads to show.

References

[1] S4L: Self-Supervised Semi-Supervised Learning (ICCV 2019)

[2] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results (NIPS 2017)

[3] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
