Web scraping in Java – Jsoup and Selenium

Web scraping is a great way to retrieve data and save the information. With a simple Java web scraping setup, you can download content using Jsoup and Selenium. Download the source code from GitHub.



Web scraping and parsing HTML – Jsoup


If the data format is HTML, Jsoup is a good tool because it handles retrieving and parsing together, which makes it an ideal tool for web scraping or web crawling. To set up, you can download Jsoup here. If you use Maven, you can add the following dependency to pom.xml.
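As a sketch, the dependency entry might look like this (the version number below is only an example; check for the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
```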

To use it, you map the page to a Jsoup Document. Then you can retrieve the whole page with html(). If you want to retrieve specific elements on the page, you can select them with select().

Here is a code snippet for a Jsoup GET request.
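A minimal sketch of such a GET request (the URL and the CSS selector are placeholders, not from the original source code):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupGetExample {
    // Collect the absolute URLs of all links in a parsed document
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            links.add(link.attr("abs:href"));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Map the page to a Jsoup Document with a GET request
        // (https://example.com is a placeholder URL)
        Document doc = Jsoup.connect("https://example.com").get();

        System.out.println(doc.html());        // the whole page
        System.out.println(extractLinks(doc)); // selected elements
    }
}
```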

Here is a code snippet for a Jsoup POST request.
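A sketch of a POST request that submits form data (the URL, form field names, and selector are hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupPostExample {
    public static void main(String[] args) throws Exception {
        // Send form data with a POST request and parse the response page
        Document doc = Jsoup.connect("https://example.com/search")
                .data("query", "java web scraping") // hypothetical form field
                .userAgent("Mozilla/5.0")           // optional: identify as a browser
                .post();

        // Select elements from the response page
        System.out.println(doc.select("div.result").text());
    }
}
```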


Download images – Jsoup and HttpURLConnection


If you want to download the images from an HTML page, you need the help of HttpURLConnection. HttpURLConnection provides methods to send GET/POST requests and receive responses over the HTTP protocol. It works together with classes such as BufferedReader and InputStreamReader to read the data. It is part of the JDK, so you don't need any external libraries.

First, use Jsoup to get the image links. Then use HttpURLConnection to download the images to your local directory.

Here is a code snippet to download images.
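A sketch of the two steps above, assuming the images go into a local "images" directory (the page URL is a placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ImageDownloader {
    // Derive a file name from the image URL (the part after the last '/')
    static String fileNameFromUrl(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws Exception {
        new File("images").mkdirs();

        // Step 1: use Jsoup to collect the image links (placeholder URL)
        Document doc = Jsoup.connect("https://example.com").get();
        for (Element img : doc.select("img[src]")) {
            String imageUrl = img.attr("abs:src");

            // Step 2: download each image with HttpURLConnection
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(imageUrl).openConnection();
            conn.setRequestMethod("GET");
            try (InputStream in = conn.getInputStream();
                 OutputStream out = new FileOutputStream(
                         "images/" + fileNameFromUrl(imageUrl))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                conn.disconnect();
            }
        }
    }
}
```

Note that images are binary data, so the snippet copies the raw InputStream rather than reading the response line by line.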


Web scraping and parsing dynamic data – Selenium with Chrome headless

Jsoup does not work in some cases. For example, some websites require a login to see data, and sometimes the data is dynamically generated. There is a workaround for this – Selenium with Chrome headless.

Selenium is a tool for web application testing. It can simulate human actions such as clicking or entering data. It can also drive a browser in headless mode, which means the browser runs without a visible window. Google has used Chrome headless for its own crawling, and it is available for the public to use as well. Here we use Selenium with Chrome headless to extract dynamically generated data.

First, download Selenium here. Alternatively, you can add the dependency to pom.xml like this.
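As a sketch, the dependency entry might look like this (the version number below is only an example; check for the latest release):

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.11.0</version>
</dependency>
```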

Next, download the Chrome driver. After downloading, unzip “chromedriver.exe” to a directory. In the code, you set the Chrome WebDriver by specifying the absolute path of the exe file. Then define ChromeOptions and add the “--headless” argument. After you initialize ChromeDriver, you can retrieve the whole page or particular elements.

This is a code snippet using Selenium with Chrome headless.
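A minimal sketch of the steps above (the driver path, URL, and element locator are placeholders; use your own absolute path to chromedriver.exe):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessScraper {
    public static void main(String[] args) {
        // Point Selenium at the unzipped chromedriver.exe
        // (placeholder path; replace with your own)
        System.setProperty("webdriver.chrome.driver", "C:\\tools\\chromedriver.exe");

        // Run Chrome with no visible window
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com");          // placeholder URL
            System.out.println(driver.getPageSource()); // the whole page

            // Or retrieve particular elements
            System.out.println(driver.findElement(By.tagName("h1")).getText());
        } finally {
            driver.quit(); // always shut the browser down
        }
    }
}
```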


FAQ


How to scrape web pages in Java?

The simple way is to use Jsoup. You can download Jsoup or define the Jsoup dependency in Maven's pom.xml. Then you call Jsoup.connect(url).get() to map the page to a Jsoup Document, and call html() to retrieve the page.

How to download an image using Jsoup?

First, use Jsoup to get the image links. Then use HttpURLConnection to download the images to your local directory.

How to scrape web pages when the site requires a login?

You can use Selenium with Chrome headless. First, download Selenium. Next, download the Chrome driver and set the Chrome WebDriver. Then define ChromeOptions and add the “--headless” argument. After you initialize ChromeDriver, you can retrieve the whole page or particular elements.

WebScraperAndParser (GitHub)
Java web scrape using Jsoup (YouTube)
