Giter VIP home page Giter VIP logo

charlotte's Introduction

Charlotte

An off-screen Chromium web browser used to extract the computed DOM from any web page using .NET Core WCF.

Installation

  1. Build project
  2. execute bin/x64/Debug/Charlotte.exe -register in PowerShell to register the console application as a Windows Service, which will automatically start the WCF Hosted Service
  3. Using WCF in your own project, connect to http://localhost:7077/Browser and retrieve the computed DOM element hierarchy as a JSON string:

IBrowser.cs

[ServiceContract]
public interface IBrowser
{
    [OperationContract]
    string Collect(string url);
}

Client.cs

public static class Client{
	static string Collect(string url){
		var binding = new BasicHttpBinding()
			{
				MaxReceivedMessageSize = 50 * 1024 * 1024 //50 MB
			};
		var endpoint = new EndpointAddress(new Uri("http://localhost:7077/Browser"));
		var channelFactory = new ChannelFactory<IBrowser>(binding, endpoint);
		var serviceClient = channelFactory.CreateChannel();
		var result = serviceClient.Collect(url);
		channelFactory.Close();
		return result;
	}
}

  1. Deserialize the JSON string into a POCO object using the following models:
public class Document
{
    public string[] a; //list of attribute names
    public Node dom;
}

public class Node
{
    public string t = ""; //tag name
    public int[] s = null; //array of style values [display (0 = none, 1 = block, 2 = inline), font-size, bold, italic]
    public Dictionary<int, string> a; //dictionary of element attributes [{attribute name index, value}, {...}]
    public List<Node> c; //list of child elements
    public string v; //optional #text element value
}

NOTE: Get the Node attribute name from the Document.a array using the Dictionary key as an index

How Does It Work?

Charlotte uses CefSharp to load a web page within an off-screen Chromium web browser.

Once the DOM is completely loaded, Charlotte injects script code from extractDOM.js into the web page, which is used to traverse the rendered DOM and generate a JSON representation of the DOM. Any style information extracted from each DOM element is actually the computed style for that element, which makes Charlotte so powerful. You don't need to decypher complex CSS files or <style> tags in order to figure out the styling for a given element on the page. That information is extracted for you and attached to each node within the JSON object.

Customize The Output

You can modify extractDOM.js to extract more information about each DOM element. For example, you could include more style information. Unfortunately, the style object only accepts integers, so you'll need to use some creativity to extract string values from element styles and convert them into integers.

NOTE: The style object (Node.s) is an integer array in order to save a significant amount of bytes when generating the JSON string.

NOTE: Right now, Charlotte extracts only four integers from an element's style attribute; display, font-size, bold, and italic. The reason being is that these values can help determine the importance of a block of text based on if it's being displayed, the size of the font, and whether or not the text is bold and/or italic.

Why the name "Charlotte"?

Named after the spider in the children's story "Charlotte's Web". Charlotte prevented the slaughter of a piglette named Wilbur by weaving words of praise for Wilbur into her web, which in turn surprised the local farmers and convinced them to keep Wilbur alive.

Similarly, the Charlotte console app weaves a JSON object from the computed DOM of a web page, so that developers can keep the contents of an article instead of letting it go to the slaughter (when the publisher of an article decides to archive the URL, which would then return a 404 or 500 error).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.