HTML Parsing In Java With Jsoup

HTML is a markup language, so pulling the actual data out of it by eye is not easy, and we would rather have the computer do the extraction for us anyway. To get data out of HTML we need to parse it into a tree-like structure in memory. Java offers many classes for working with structured data, including the collections framework, but it has no good built-in HTML parser, so we need an external one. Jsoup is a popular third-party HTML parser written in Java. In this tutorial we will use it to parse HTML and extract data from it.

Getting Started

To follow along you need a JDK installed on your system (a recent version is preferred). You can edit the code in a plain text editor, a code editor or an IDE; I am going to use IntelliJ IDEA from JetBrains. You will also need to put the jsoup library on the Java classpath or add it to the project through your IDE.

Create a project with your IDE. A Maven project is the better choice because it makes dependency management easy, and Gradle works too. Create a public class with a main method in it. I am going to name the class JavaSoup, but you can name it whatever you like, as long as the name is sensible.

So, our initial code should look like the following:

public class JavaSoup {
   public static void main(String[] args){
       // Our initial code goes here.
   }
}

Getting The Library

You can download the jar from https://jsoup.org/download and add it to the classpath with the help of your IDE. There is a better option, though: let Maven fetch the library. Create a Maven project and add the following XML inside the dependencies element of your pom.xml to declare jsoup as a dependency.

<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.10.3</version>
</dependency>

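If you are using Gradle, add the following line to the dependencies block of your build script. Current Gradle versions use the implementation configuration; older builds used compile, which newer Gradle releases no longer accept.

implementation 'org.jsoup:jsoup:1.10.3'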

Retrieving HTML From The Internet

We could retrieve HTML from the internet ourselves with the URL class from the java.net package, but jsoup provides a helper that fetches the HTML and converts it to a jsoup Document object in one step. Let's fetch https://techcrunch.com/ as a jsoup Document object.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class JavaSoup {
   public static void main(String[] args){
       String url = "https://techcrunch.com/";

       try{
           Document doc = Jsoup.connect(url).get();
           System.out.println(doc);

       } catch (IOException ioe){
           ioe.printStackTrace();
       }
   }
}

Compile and run the program to see if everything is all right. You should see the HTML source of TechCrunch printed on your console.

Parsing The HTML

In the code above the HTML source is retrieved, parsed and turned into a Document object automatically. But the HTML we want to parse does not always come from the internet. Wherever it lives, we can get it into a Java string one way or another, so the most basic way of parsing HTML is parsing it from a string. The following code shows how to do that.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaSoup {
   public static void main(String[] args){
       String html = "<html><head><title>This is the title</title></head>"
               + "<body>This is a text content inside body tag</body></html>";
       Document doc = Jsoup.parse(html);
   }
}

Compile and run it to check that everything is all right. The program prints nothing; it only builds the Document object.
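When the HTML lives in a file, it does not have to be read into a string first. The following is a minimal sketch, assuming the file is UTF-8 encoded; it uses jsoup's Jsoup.parse(File, String) overload. The class name ParseFileExample, the helper titleViaTempFile and the temporary file are hypothetical and exist only to keep the example self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParseFileExample {

    // Parses an HTML file (assumed to be UTF-8) and returns its <title> text.
    static String titleOfFile(File file) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8");
        return doc.title();
    }

    // Hypothetical helper: writes the HTML to a temporary file, parses it, then cleans up.
    static String titleViaTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("sample", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOfFile(tmp.toFile());
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>This is the title</title></head>"
                + "<body>This is a text content inside body tag</body></html>";
        System.out.println(titleViaTempFile(html)); // prints "This is the title"
    }
}
```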

Note: To resolve relative links inside the HTML source properly, pass a baseUri argument to the parse() method. The method signature is parse(String html, String baseUri).
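The effect of the baseUri argument can be sketched as follows. The base URI https://example.com/ is a placeholder, and the class name BaseUriExample and the helper firstAbsoluteLink are made up for illustration. The sketch relies on Element.absUrl(), which resolves an attribute value against the document's base URI and returns an empty string when no absolute URL can be built.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {

    // Returns the absolute URL of the first link in the HTML,
    // resolved against baseUri, or "" if there is no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='/about'>About</a></body></html>";

        // With a base URI the relative href becomes an absolute URL.
        System.out.println(firstAbsoluteLink(html, "https://example.com/"));
        // prints "https://example.com/about"

        // Without a usable base URI the relative href cannot be resolved,
        // so absUrl() gives back an empty string.
        System.out.println(firstAbsoluteLink(html, "").isEmpty()); // prints "true"
    }
}
```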

HTML Nodes

HTML source code is parsed into a tree-like structure. You can traverse the tree through its different types of nodes: there are nodes for elements (a tag together with its content becomes an Element), comments, text, CDATA sections, and so on.

The parsed HTML document as a whole is represented by an object of the Document class from the jsoup library. This object is also referred to as the root node.
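To make the node types concrete, here is a small sketch that walks the direct children of the body and reports which kind of node each one is. The class name NodeTypesExample and the helper describeChildren are hypothetical; Element, TextNode and Comment are jsoup classes from the org.jsoup.nodes package.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class NodeTypesExample {

    // Lists the kind of each direct child node of the given element.
    static List<String> describeChildren(Element parent) {
        List<String> kinds = new ArrayList<>();
        for (Node child : parent.childNodes()) {
            if (child instanceof Element) {
                kinds.add("Element:" + ((Element) child).tagName());
            } else if (child instanceof TextNode) {
                kinds.add("TextNode");
            } else if (child instanceof Comment) {
                kinds.add("Comment");
            } else {
                kinds.add(child.nodeName()); // any other node type
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        String html = "<html><body>Hello<!-- a note --><p>World</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(describeChildren(doc.body()));
        // prints "[TextNode, Comment, Element:p]"
    }
}
```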

Navigating The Document

After parsing the HTML we need to navigate the tree of nodes to extract or insert data or nodes. If you have ever worked with HTML nodes in JavaScript, jsoup's API will feel familiar at first glance.

Let's say we want to get the body element from the document. We invoke the getElementsByTag() method on the Document object. This method returns an Elements object, which is a list of Element objects. To get the body we invoke get(0) on that list.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JavaSoup {
   public static void main(String[] args){
       String html = "<html><head><title>This is the title</title></head>"
               + "<body>This is a text content inside body tag</body></html>";
       Document doc = Jsoup.parse(html);
       Element body = doc.getElementsByTag("body").get(0);
       System.out.println(body);
   }
}

Hit compile and run to see the following result.

<body>
This is a text content inside body tag
</body>

Notice that the HTML of the body does not look exactly as it did originally. It is a pretty-printed version of the original HTML, and the change does not affect its meaning in any way.

We can also get a single element by its HTML id attribute by invoking getElementById(). Let's change the html string so that it contains an element with an id.
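Calling get(0) works for body because jsoup always creates a body element, but for other tags the list may be empty, and get(0) then throws an IndexOutOfBoundsException. A slightly more defensive sketch uses Elements.first(), which returns null when nothing matched, and Element.text() to read only the text content. The class name SafeLookupExample and the helper firstTextByTag are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SafeLookupExample {

    // Returns the text of the first element with the given tag name,
    // or "" when the document contains no such element.
    static String firstTextByTag(Document doc, String tag) {
        Elements matches = doc.getElementsByTag(tag);
        Element first = matches.first(); // null when the list is empty
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>This is the title</title></head>"
                + "<body><p>First paragraph</p><p>Second paragraph</p></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(firstTextByTag(doc, "p"));               // prints "First paragraph"
        System.out.println(firstTextByTag(doc, "table").isEmpty()); // prints "true"
        System.out.println(doc.getElementsByTag("p").size());       // prints "2"
    }
}
```

Document also offers the shortcuts body() and title() for the two most common lookups.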

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JavaSoup {
   public static void main(String[] args){
       String html = "<html><head><title>This is the title</title></head>"
               + "<body>"
               + "<p>This is a text content inside body tag</p>"
               + "<div id='div1'> Some text inside div tag with id div1 </div>"
               + "</body></html>";
       Document doc = Jsoup.parse(html);
       Element div1 = doc.getElementById("div1");
       System.out.println(div1);
   }
}

Hit compile and run to see the following result.

<div id="div1">
 Some text inside div tag with id div1 
</div>

Selecting With CSS Selectors

If you are familiar with CSS selectors or jQuery selector syntax, you will naturally look for a way to use them with jsoup, and jsoup supports them. You can invoke the select() method on an Element or Document object to select with CSS selectors. Let's find the div with id div1 using a CSS selector.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JavaSoup {
   public static void main(String[] args){
       String html = "<html><head><title>This is the title</title></head>"
               + "<body>"
               + "<p>This is a text content inside body tag</p>"
               + "<div id='div1'> Some text inside div tag with id div1 </div>"
               + "</body></html>";
       Document doc = Jsoup.parse(html);
       Element div1 = doc.select("#div1").get(0);
       System.out.println(div1);
   }
}

Compile and run it to check that everything is all right. It outputs the following:

<div id="div1">
 Some text inside div tag with id div1 
</div>
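select() is not limited to id selectors. As a rough sketch, the following combines a class selector, child combinators and an attribute selector, then loops over every match to read its text and href attribute. The markup, the class name SelectorExample and the helper menuLinks are all invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorExample {

    // Collects "text -> href" for every link that sits directly inside
    // a list item of a <ul class="menu"> and has an href attribute.
    static List<String> menuLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("ul.menu > li > a[href]");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul class='menu'>"
                + "<li><a href='/home'>Home</a></li>"
                + "<li><a href='/blog'>Blog</a></li>"
                + "<li><a>No link</a></li>"
                + "</ul>"
                + "<a href='/footer'>Footer</a>";
        System.out.println(menuLinks(html));
        // prints "[Home -> /home, Blog -> /blog]"
    }
}
```

The anchor without an href and the link outside the list are both skipped, because neither matches the whole selector.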

Conclusion

Jsoup is a powerful and versatile library. In this article I have covered the basics of using it. In future articles I will try to cover more advanced topics. Take a look at the official jsoup documentation and try the various classes and methods yourself.

By Md. Sabuj Sarker | 10/1/2017 | General