cs-traversing-trees
Learning goals
- Parse HTML using jsoup.
- Traverse a DOM tree.
- Use the
Deque
interface and its implementations.
Overview
This README introduces the application we will develop during the upcoming units. We describe the elements of a Web search engine and introduce the first application, a Web crawler that downloads and parses pages from Wikipedia. We present a recursive implementation of depth-first search and an iterative implementation that uses a Java Deque
to implement a "last in, first out" stack.
The road ahead
At the end of each unit, you will have a chance to work on an application that's a bit more substantial than a lab. You will write more code, and you'll have to make more design decisions, rather than just filling in methods.
By the time we get to the end of this track, you will have built a simple Web search engine, which is a tool, like Google Search and Bing, that takes a set of "search terms" and returns a list of web pages that are relevant to those terms (we'll discuss what "relevant" means later). If you are not familiar with search engines, you can read more here, but we'll explain what you need as we go along.
The essential pieces of a search engine are:
-
Crawling: We'll need a program that can download a web page, parse it, and extract the text and any links to other pages.
-
Indexing: We'll need an index that makes it possible to look up a search term and find the pages that contain it.
-
Retrieval: And we'll need a way to collect results from the Index and identify pages that are most relevant to the search terms.
We'll start with the Crawler; the goal of a Crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but often Crawlers are limited to a smaller domain. In our case, we will only read pages from Wikipedia.
As a first step, we'll build a Crawler that reads a Wikipedia page, finds the first link, follows the link to another page, and repeats. We will use this Crawler to test the "Getting to Philosophy" conjecture, which states:
Clicking on the first lowercase link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets one to the Philosophy article.
This conjecture is stated on this Wikipedia page, and you can read its history there.
Testing the conjecture will allow us to build the basic pieces of a Crawler without having to crawl the entire web, or even all of Wikipedia. And we think the exercise is kind of fun!
In the next unit, we'll work on the Indexer, and then we'll get to the Retriever.
Parsing HTML
When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a minimal HTML document:
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Hello world!</p>
</body>
</html>
The phrases "This is a title" and "Hello world!" are the text that actually appears on the page; the other elements are tags that indicate how the text should be displayed.
When our Crawler downloads a page, it will need to parse the HTML in order to extract the text and find the links. To do that, we'll use jsoup, which is an open-source Java library that downloads and parses HTML.
The result of parsing HTML is a Document Object Model tree or DOM tree, that contains the elements of the document, including text and tags. The tree is a linked data structure made up of nodes; the nodes represent text, tags, and other document elements.
The relationships between the nodes are determined by the structure of the document. In the example above, the first node, called the root, is the <html>
tag, which contains links to the two nodes it contains, <head>
and <body>
; these nodes are the children of the root node.
The <head>
node has one child, <title>
, and the <body>
node has one child, <p>
(which stands for "paragraph). The following figure represents this tree graphically:
Each node contains links to its children; in addition, each node contains a link to its parent, so from any node it is possible to navigate up and down the tree. The DOM tree for real pages is usually more complicated than this example.
Most web browsers provide tools for inspecting the DOM of the page you are viewing. In Chrome, you can right-click on any part of a web page and select ""Inspect" from the menu that pops up. In Firefox, you can right-click and select "Inspect Element" from the menu. Safari provides a tool called Web Inspector, which you can read about here. For Internet Explorer, you can read the instructions here.
Here is a screenshot of the DOM for the Wikipedia page on Java.
The element that's highlighted is the first paragraph of the main text of the article, which is contained in a <div>
element with id="mw-content-text"
. We'll use this element id to identify the main text of each article we download.
Using jsoup
Jsoup makes it easy to download and parse web pages, and to navigate the DOM tree. Here's an example:
String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
// download and parse the document
Connection conn = Jsoup.connect(url);
Document doc = conn.get();
// select the content text and pull out the paragraphs.
Element content = doc.getElementById("mw-content-text");
// TODO: avoid selecting paragraphs from sidebars and boxouts
Elements paras = content.select("p");
Jsoup.connect
takes a URL as a String and makes a connection to the web server; the get
method downloads the HTML, parses it, and returns a Document
object, which represents the DOM.
Document
provides lots of helpful methods for navigating the tree and selecting nodes. In fact, it provides so many methods, it can be confusing. This example demonstrates two ways to select nodes:
-
getElementById
takes a String and searches the tree for an element that has a matching "id" field. Here it selects the node<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
, which appears on every Wikipedia page to identify the<div>
element that contains the main text of the page, as opposed to the navigation sidebar and other elements.The return value from
getElementById
is anElement
object that represents this<div>
and contains the elements in the<div>
as children, grandchildren, etc. -
select
takes a String, traverses the tree, and returns all the elements with tags that match the String. In this example, it returns all paragraph tags that appear incontent
. The return value is anElements
object, which is a Collection that contains multiple elements.
Before you go on, you should skim the documentation of these classes so you know what they can do. The most important classes are: Element
, Elements
, and Node
.
It's easy to get Node
and Element
confused: Node
is the superclass of Element
, so every Element
is a Node
, but not every Node
is an Element
. Other kinds of Nodes include TextNode
(which we will use soon), DataNode
, and Comment
.
Iterating through the DOM
To make your life easier, we've provided a class called WikiNodeIterable
that lets you iterate through the nodes in a DOM tree. Here's an example that shows how to use it:
Elements paras = content.select("p");
Element firstPara = paras.get(0);
Iterable<Node> iter = new WikiNodeIterable(firstPara);
for (Node node: iter) {
if (node instanceof TextNode) {
System.out.print(node);
}
}
This example picks up where the previous one leaves off. It selects the first paragraph in paras
and then creates a WikiNodeIterable
, which implements Iterable<Node>
. The for
loop iterates the nodes in the tree using a "depth first search", which produces the nodes in the order they would appear on the page.
In this example, we print a Node
only if it is a TextNode
and ignore other types of Node
, specifically the Element
objects that represent tags. The result is the plain text of the HTML paragraph without any markup. The output is:
Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented,[13] and specifically designed...
At this point, you know what you need for the next lab, so you can skip ahead. But if you are not familiar with depth-first search, you should keep reading.
Depth-first search
There are several ways you might reasonably traverse a tree, each with different applications. We'll start with "depth-first search," or DFS. DFS starts at the root of the tree and selects the first child. If the child has children, it selects the first child again. When it gets to a node with no children, it backtracks, moving up the tree to the parent node, where it selects the next child if there is one; otherwise it backtracks again. When it has explored the last child of the root, it's done.
There are two common ways to implement DFS, recursively and iteratively. The recursive implementation is simple and elegant:
private static void recursiveDFS(Node node) {
if (node instanceof TextNode) {
System.out.print(node);
}
for (Node child: node.childNodes()) {
recursiveDFS(child);
}
}
This method gets invoked on every Node
in the tree, starting with the root. If the Node
it gets is a TextNode
, it prints the contents. If the Node
has any children, it invokes recursiveDFS
on each one of them in order.
In this example, we print the contents of each TextNode
before traversing the children, so this is an example of a "pre-order" traversal. You can read about "pre-order", "post-order", and "in-order" traversals here. For this application, the traversal order doesn't matter.
By making recursive calls, recursiveDFS
uses the call stack to keep track of the child nodes and process them in the right order. As an alternative, we can use a stack data structure to keep track of the nodes ourselves; if we do that, we can avoid the recursion and traverse the tree iteratively.
Stacks in Java
Before we can explain the iterative version of DFS, we have to explain the stack data structure. We'll start with the general concept of a stack, which we'll call a "stack" with a lowercase "s". Then we'll talk about two Java implementations of a stack, which are called Stack
and Deque
.
A stack is a data structure that is similar to a list: it is a collection that maintains the order of the elements. The primary difference between a stack and a list is that the stack provides fewer methods. In the usual convention, it provides:
push
: which adds an element to the top of the stack.pop
: which removes the top-most element from the stack.peek
: which returns the top-most element without modifying the stack.isEmpty
: which indicates whether the stack is empty.
Because pop
always returns the top-most element, a stack is also called a "LIFO", which stands for "last in, first out". An alternative to a stack is a "queue", which returns elements in the same order they are added; that is, "first in, first out", or FIFO.
It might not be obvious why stacks and queues are useful: they don't provide any capabilities that aren't provided by lists; in fact, they provide fewer capabilities. So why not use lists for everything? There are two reasons:
-
If you limit yourself to a small set of methods โ that is, a small API โ your code will be more readable and less error-prone. For example, if you use a list to represent a stack, you might accidentally remove an element in the wrong order. With the stack API, this kind of mistake is literally impossible. And the best way to avoid errors it to make them impossible.
-
If a data structure provides a small API, it is easier to implement efficiently. For example, a simple way to implement a stack is a singly-linked list. When we push an element onto the stack, we add it to the beginning of the list; when we pop an element, we remove it from the beginning. For a linked list, adding and removing from the beginning are constant time operations, so this implementation is efficient. Conversely, big APIs are harder to implement efficiently.
To implement a stack in Java, you have three options:
-
Go ahead and use
ArrayList
orLinkedList
. If you useArrayList
, be sure to add and remove from the end, which is a constant time operation. And be careful not to add elements in the wrong place or remove them in the wrong order. -
Java provides a class called
Stack
that provides the standard set of stack methods. But this class is an old part of Java: it is not consistent with the Java Collections Framework, which came later, and it does not handle synchronization as well as more recent classes. -
Probably the best choice is to use one of the implementations of the
Deque
interface, likeArrayDeque
.
"Deque" stands for "double-ended queue"; it's supposed to be pronounced "deck", but some people say "deek". In Java, the Deque
interface provides push
, pop
, peek
, and isEmpty
, so you can use a Deque
as a stack. It provides other methods you can read about here, but we won't use them for now.
Iterative DFS
Here is an iterative version of DFS that uses an ArrayDeque
to represent a stack of Node
objects:
private static void iterativeDFS(Node root) {
Deque<Node> stack = new ArrayDeque<Node>();
stack.push(root);
while (!stack.isEmpty()) {
Node node = stack.pop();
if (node instanceof TextNode) {
System.out.print(node);
}
List<Node> nodes = new ArrayList<Node>(node.childNodes());
Collections.reverse(nodes);
for (Node child: nodes) {
stack.push(child);
}
}
}
The parameter, root
, is the root of the tree we want to traverse, so we start by creating the stack and pushing the root onto it.
The loop continues until the stack it empty. Each time through, it pops a Node
off the stack. If it gets a TextNode
, it prints the contents. Then it pushes the children onto the stack. In order to process the children in the right order, we have to push them onto the stack in reverse order; we do that by copying the children into an ArrayList
, reversing the elements in place, and then iterating through the reversed ArrayList
.
One advantage of the iterative version of DFS is that it it easier to implement as a Java Iterator
; you'll find out how in the next lab. But first, one last note about the Deque
interface: in addition to ArrayDeque
, Java provides another implementation of Deque
, our old friend LinkedList
. LinkedList
implements both interfaces, List
and Deque
. Which interface you get depends on how you use it. For example, if you assign a LinkedList
object to a Deque
variable, like this:
Deqeue<Node> deque = new LinkedList<Node>();
You can use the methods in the Deque
interface, but not all methods in the List
interface. If you assign it to a List
variable, like this:
List<Node> deque = new LinkedList<Node>();
You can use List
methods but not all Deque
methods. And if you assign it like this:
LinkedList<Node> deque = new LinkedList<Node>();
You can use all the methods. But if you combine methods from different interfaces, your code will be less readable and more error-prone.
Resources
Web search engine: Wikipedia.
HyperText Markup Language: Wikipedia
Element
: jsoup documentation Elements
: jsoup documentation Node
: jsoup documentation
Tree traversal: Wikipedia Call stack: Wikipedia
Deque interface: Java documentation
View Traversing Trees on Learn.co and start learning to code for free.