Web Scraping and Parsing in Java (5 Ways)

The Internet is a great resource for information, whether for study, business, or hobby. There are many ways to retrieve and parse information from the Internet, and most programming languages provide libraries to do so. In this post, I introduce 5 ways of web scraping and parsing in Java.

1. HttpURLConnection – send requests and receive data

HttpURLConnection has been part of the Java JDK since version 1.1. It provides methods to send GET/POST requests and receive responses over the HTTP protocol. It works together with BufferedReader and InputStreamReader to read the data. You don’t need any external libraries.

This is the code snippet for an HTTP GET request.
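A minimal sketch of a GET request with HttpURLConnection might look like this (https://example.com/ is a placeholder for the page you want to fetch):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpGetExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("User-Agent", "Mozilla/5.0");

        int status = con.getResponseCode();

        // Read the response body line by line
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        con.disconnect();

        System.out.println("Status: " + status);
        System.out.println(content);
    }
}
```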

This is the code snippet for an HTTP POST request.
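A POST request adds a request body; the sketch below sends form-encoded parameters (the URL and parameter names are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpPostExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/search");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        con.setDoOutput(true); // required to write a request body

        // Form parameters to send in the body
        String params = "name=value&id=1";
        try (OutputStream os = con.getOutputStream()) {
            os.write(params.getBytes(StandardCharsets.UTF_8));
        }

        // Read the response
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        con.disconnect();
        System.out.println(response);
    }
}
```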

Over the years, more capable HTTP libraries have appeared, such as Apache HttpClient and the java.net.http.HttpClient introduced in Java 11. But the idea remains the same.

2. Gson – map JSON to objects

The data you receive can come in different formats, such as plain text, HTML, XML, JSON, PDF, or JPEG. How do you extract the exact data you want? If it is plain text, you can use String methods such as indexOf() to find a position and substring() to extract the value.
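For example, extracting a value between two known markers with indexOf() and substring() can be as simple as this (the input string is made up for illustration):

```java
public class PlainTextExtract {
    public static void main(String[] args) {
        String text = "Temperature: 72F; Humidity: 40%";

        // Find the start of the value right after the label
        String label = "Temperature: ";
        int start = text.indexOf(label) + label.length();

        // The value ends at the next separator
        int end = text.indexOf(";", start);

        String temperature = text.substring(start, end);
        System.out.println(temperature); // 72F
    }
}
```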

XML used to be a popular data transfer format. Nowadays JSON is preferred for data transfer, because it is easy to read and parse. You can use libraries such as Jackson or Google Gson to map JSON data items to Java objects for processing.

First you should define the class. If you have a JSON file, you can use a tool to generate the class from the JSON; several online tools do this. After you have the Java class, you can use Google Gson to map the JSON data to objects.

This is the code snippet to extract data from JSON.
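A minimal sketch with Gson, assuming a hypothetical JSON document with "name" and "age" fields and a matching User class:

```java
import com.google.gson.Gson;

public class GsonExample {
    // Hypothetical class whose fields mirror the JSON structure
    static class User {
        String name;
        int age;
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"Alice\",\"age\":30}";

        // Gson maps the JSON fields onto the matching class fields
        Gson gson = new Gson();
        User user = gson.fromJson(json, User.class);

        System.out.println(user.name + " is " + user.age);
    }
}
```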

3. Jsoup – web scraping and parsing HTML

If the data format is HTML, Jsoup is a good tool because it handles retrieval and parsing together. This makes Jsoup ideal for web scraping and crawling. To set up, you can download Jsoup from its website. If you use Maven, you can add the following to pom.xml.
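The Maven dependency looks like this (check Maven Central for the current version number):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
```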

To use it, you load the page into a Jsoup Document. Then you can retrieve the whole page with html(). If you want to retrieve particular elements on the page, you can specify them with select().

Here is the code snippet for a Jsoup GET request.
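A minimal sketch of fetching and parsing a page with Jsoup (the URL is a placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupGetExample {
    public static void main(String[] args) throws Exception {
        // connect(...).get() fetches and parses the page in one step
        Document doc = Jsoup.connect("https://example.com/")
                .userAgent("Mozilla/5.0")
                .get();

        System.out.println(doc.title());

        // select() takes a CSS selector; here, all links with an href
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```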

Here is the code snippet for a Jsoup POST request.
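A POST with Jsoup uses the same fluent API; the form field names below are hypothetical and should match the target form:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupPostExample {
    public static void main(String[] args) throws Exception {
        // data() adds form parameters; post() sends them as the request body
        Document doc = Jsoup.connect("https://example.com/login")
                .data("username", "myuser")
                .data("password", "mypassword")
                .userAgent("Mozilla/5.0")
                .post();

        System.out.println(doc.title());
    }
}
```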

Before you scrape any website, please read its Terms of Use to see whether you have permission.

4. Jsoup and HttpURLConnection – download images

If you want to download the images from an HTML page, you can combine Jsoup and HttpURLConnection. First use Jsoup to collect the image links, then use HttpURLConnection to download each image to a local directory.

Here is the code snippet to download images.
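A sketch of the two-step approach: Jsoup finds the img tags, HttpURLConnection streams each image to disk (the page URL and the "images" directory are placeholders):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageDownloader {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/").get();

        Path dir = Paths.get("images");
        Files.createDirectories(dir);

        for (Element img : doc.select("img[src]")) {
            String src = img.attr("abs:src"); // absolute URL of the image
            if (src.isEmpty()) continue;

            // Use the last path segment as the local file name
            String name = src.substring(src.lastIndexOf('/') + 1);

            HttpURLConnection con = (HttpURLConnection) new URL(src).openConnection();
            try (InputStream in = con.getInputStream()) {
                Files.copy(in, dir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            }
            con.disconnect();
        }
    }
}
```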

5. Selenium with Chrome headless – scrape and parse dynamic data

Jsoup cannot work in some cases. For example, some websites require a login to see data, and sometimes the data is generated dynamically by JavaScript. There is a workaround for this: Selenium with Chrome headless.

Selenium is a tool for web application testing. It can simulate human actions such as clicking or entering data. It can also drive the browser in headless mode, which means the browser runs without a visible window. Google has been using Chrome headless for their crawling, and it is available for the public to use as well. Here we can use Selenium with Chrome headless to extract dynamically generated data.

First download the Selenium Java bindings from the Selenium website. Alternatively, you can add the dependency to pom.xml like this.
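The Maven dependency looks like this (check Maven Central for the current version number):

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.11.0</version>
</dependency>
```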

Next, download the Chrome driver. After you download it, unzip "chromedriver.exe" to a directory. In the code, you set the Chrome webdriver by specifying the absolute path of the exe file. Then define ChromeOptions and add the "--headless" argument. After you initialize ChromeDriver, you can retrieve the whole page or particular elements.

This is the code snippet using Selenium with Chrome headless.
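A sketch of the steps just described; the chromedriver path and the target URL are placeholders you must adjust for your system:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessScraper {
    public static void main(String[] args) {
        // Absolute path to the chromedriver executable (adjust for your system)
        System.setProperty("webdriver.chrome.driver", "C:/tools/chromedriver.exe");

        // Run Chrome without a visible window
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/");
            System.out.println(driver.getTitle());

            // The whole page, after JavaScript has rendered it
            String html = driver.getPageSource();

            // Or pick out particular elements
            WebElement heading = driver.findElement(By.tagName("h1"));
            System.out.println(heading.getText());
        } finally {
            driver.quit(); // always shut the browser down
        }
    }
}
```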

6. Conclusion

This tutorial introduced 5 ways of information retrieval and parsing in Java:
1. HttpURLConnection – send and receive data
2. Gson – map JSON to objects
3. Jsoup – receive and parse HTML
4. Jsoup and HttpURLConnection – download images
5. Selenium with Chrome headless – retrieve dynamic data

Download WebScraperAndParser (GitHub)
