Web Scraping and Parsing in Java (5 Ways)

The Internet is a great resource for information, whether for study, business, or hobby. There are many ways to retrieve and parse information from the Internet, and most programming languages provide libraries to do so. In this post, I introduce 5 ways of web scraping and parsing in Java.

1. HttpURLConnection – send requests and receive data

HttpURLConnection has been part of the Java JDK since version 1.1. It provides methods to send GET/POST requests and receive responses over the HTTP protocol. It works together with BufferedReader and InputStreamReader to read the data. You don’t need any external libraries.

This is the code snippet for an HTTP GET request.
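A minimal sketch of a GET request with HttpURLConnection might look like this (https://example.com/ is a placeholder for the page you want to fetch):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpGetExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("User-Agent", "Mozilla/5.0");

        int status = con.getResponseCode();

        // Read the response body line by line
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        con.disconnect();

        System.out.println("Status: " + status);
        System.out.println(content);
    }
}
```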

This is the code snippet for an HTTP POST request.
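A POST request adds a request body; the sketch below sends form-encoded parameters (the URL and parameter names are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpPostExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/search");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        con.setDoOutput(true); // required to write a request body

        // Form parameters to send in the body
        String params = "name=value&id=1";
        try (OutputStream os = con.getOutputStream()) {
            os.write(params.getBytes(StandardCharsets.UTF_8));
        }

        // Read the response
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        con.disconnect();
        System.out.println(response);
    }
}
```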

Over the years, more capable HTTP libraries have appeared, such as Apache HttpClient and the java.net.http.HttpClient introduced in Java 11. But the idea remains the same.

2. Gson – map JSON to objects

The data you receive can come in different formats, such as plain text, HTML, XML, JSON, PDF, or JPEG. How do you extract the exact data you want? If it is plain text, you can use String methods such as indexOf() to find a position and substring() to extract the value.
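For example, extracting a value between two known markers with indexOf() and substring() can be as simple as this (the input string is made up for illustration):

```java
public class PlainTextExtract {
    public static void main(String[] args) {
        String text = "Temperature: 72F; Humidity: 40%";

        // Find the start of the value right after the label
        String label = "Temperature: ";
        int start = text.indexOf(label) + label.length();

        // The value ends at the next separator
        int end = text.indexOf(";", start);

        String temperature = text.substring(start, end);
        System.out.println(temperature); // 72F
    }
}
```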

XML used to be a popular data transfer format. Nowadays JSON is preferred for data transfer, because it is easy to read and parse. You can use libraries such as Jackson or Google Gson to map JSON data items to Java objects for processing.

First you should define the class. If you have a JSON file, you can use a tool to generate the class from the JSON; several online tools do this. After you have the Java class, you can use Google Gson to map the JSON data to objects.

This is the code snippet to extract data from JSON.
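A minimal sketch with Gson, assuming a hypothetical JSON document with "name" and "age" fields and a matching User class:

```java
import com.google.gson.Gson;

public class GsonExample {
    // Hypothetical class whose fields mirror the JSON structure
    static class User {
        String name;
        int age;
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"Alice\",\"age\":30}";

        // Gson maps the JSON fields onto the matching class fields
        Gson gson = new Gson();
        User user = gson.fromJson(json, User.class);

        System.out.println(user.name + " is " + user.age);
    }
}
```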

3. Jsoup – web scraping and parsing HTML

If the data format is HTML, Jsoup is a good tool because it handles retrieval and parsing together. This makes Jsoup ideal for web scraping and crawling. To set up, you can download Jsoup from its website. If you use Maven, you can add the following to pom.xml.
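The Maven dependency looks like this (check Maven Central for the current version number):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
```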

To use it, you load the page into a Jsoup Document. Then you can retrieve the whole page with html(). If you want to retrieve particular elements on the page, you can specify them with select().

Here is the code snippet for a Jsoup GET request.
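A minimal sketch of fetching and parsing a page with Jsoup (the URL is a placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupGetExample {
    public static void main(String[] args) throws Exception {
        // connect(...).get() fetches and parses the page in one step
        Document doc = Jsoup.connect("https://example.com/")
                .userAgent("Mozilla/5.0")
                .get();

        System.out.println(doc.title());

        // select() takes a CSS selector; here, all links with an href
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```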

Here is the code snippet for a Jsoup POST request.
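A POST with Jsoup uses the same fluent API; the form field names below are hypothetical and should match the target form:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupPostExample {
    public static void main(String[] args) throws Exception {
        // data() adds form parameters; post() sends them as the request body
        Document doc = Jsoup.connect("https://example.com/login")
                .data("username", "myuser")
                .data("password", "mypassword")
                .userAgent("Mozilla/5.0")
                .post();

        System.out.println(doc.title());
    }
}
```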

Before you scrape any website, please read its Terms of Use to see whether you have permission.

4. Jsoup and HttpURLConnection – download images

If you want to download the images from an HTML page, you can combine Jsoup and HttpURLConnection. First use Jsoup to collect the image links, then use HttpURLConnection to download each image to a local directory.

Here is the code snippet to download images.
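A sketch of the two-step approach: Jsoup finds the img tags, HttpURLConnection streams each image to disk (the page URL and the "images" directory are placeholders):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageDownloader {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/").get();

        Path dir = Paths.get("images");
        Files.createDirectories(dir);

        for (Element img : doc.select("img[src]")) {
            String src = img.attr("abs:src"); // absolute URL of the image
            if (src.isEmpty()) continue;

            // Use the last path segment as the local file name
            String name = src.substring(src.lastIndexOf('/') + 1);

            HttpURLConnection con = (HttpURLConnection) new URL(src).openConnection();
            try (InputStream in = con.getInputStream()) {
                Files.copy(in, dir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            }
            con.disconnect();
        }
    }
}
```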

5. Selenium with Chrome headless – scrape and parse dynamic data

Jsoup cannot work in some cases. For example, some websites require a login to see data, and sometimes the data is generated dynamically by JavaScript. There is a workaround for this: Selenium with Chrome headless.

Selenium is a tool for web application testing. It can simulate human actions such as clicking or entering data. It can also drive the browser in headless mode, which means the browser runs without a visible window. Google has been using Chrome headless for their crawling, and it is available for the public to use as well. Here we can use Selenium with Chrome headless to extract dynamically generated data.

First download the Selenium Java bindings from the Selenium website. Alternatively, you can add the dependency to pom.xml like this.
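The Maven dependency looks like this (check Maven Central for the current version number):

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.11.0</version>
</dependency>
```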

Next, download the Chrome driver. After you download it, unzip "chromedriver.exe" to a directory. In the code, you set the Chrome webdriver by specifying the absolute path of the exe file. Then define ChromeOptions and add the "--headless" argument. After you initialize ChromeDriver, you can retrieve the whole page or particular elements.

This is the code snippet using Selenium with Chrome headless.
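A sketch of the steps just described; the chromedriver path and the target URL are placeholders you must adjust for your system:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessScraper {
    public static void main(String[] args) {
        // Absolute path to the chromedriver executable (adjust for your system)
        System.setProperty("webdriver.chrome.driver", "C:/tools/chromedriver.exe");

        // Run Chrome without a visible window
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/");
            System.out.println(driver.getTitle());

            // The whole page, after JavaScript has rendered it
            String html = driver.getPageSource();

            // Or pick out particular elements
            WebElement heading = driver.findElement(By.tagName("h1"));
            System.out.println(heading.getText());
        } finally {
            driver.quit(); // always shut the browser down
        }
    }
}
```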

6. Conclusion

This tutorial introduced 5 ways of information retrieval and parsing in Java:
1. HttpURLConnection – send and receive data
2. Gson – map JSON to objects
3. Jsoup – receive and parse HTML
4. Jsoup and HttpURLConnection – download images
5. Selenium with Chrome headless – retrieve dynamic data

Download WebScraperAndParser (GitHub)
