Using Jsoup with Codenvy

jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. In this tutorial we use jsoup in Codenvy to fetch all the links on a given URL (for example, images and hyperlinks).

Create a Project

Log in to your Codenvy workspace and create a new WAR project named “jsoup”. The project file structure is shown below:

[Figure: file structure of the jsoup project]

Create pom.xml file

Specify the dependencies below in pom.xml; they are prerequisites for the jsoup implementation. Make sure these JARs appear under Maven Dependencies.

  • jsoup-1.7.2.jar
  • javaee-web-api-6.0.jar

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.codenvy</groupId>
    <artifactId>jsoup</artifactId>
    <packaging>war</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>jsoupWeb</name>

    <dependencies>
        <dependency>
            <!-- jsoup HTML parser library @ http://jsoup.org/ -->
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>javax</groupId>
            <artifactId>javaee-web-api</artifactId>
            <version>6.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>central</id>
            <name>Maven Repository Switchboard</name>
            <url>http://repo1.maven.org/maven2</url>
        </repository>
        <repository>
            <id>java.net2</id>
            <name>Repository hosting the jee6 artifacts</name>
            <url>http://download.java.net/maven/2</url>
        </repository>
    </repositories>

    <build>
        <finalName>jsoupWeb</finalName>
    </build>
</project>

Create FetchLinks class

This servlet takes a web URL as input, parses the page at that URL, and extracts all the links found on it.

package com.codenvy;

import java.io.IOException;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FetchLinks extends HttpServlet {
	private static final long serialVersionUID = 1L;

    public FetchLinks() {
        super();
    }

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

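		StringBuffer sb = new StringBuffer();
		String url = request.getParameter("inputURL");

		// A missing parameter yields null, so check for that before the empty-string test
		if (url == null || url.trim().length() == 0)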
        {
        	sb.append("usage: supply url to fetch").append("<br/>");
        }else
        {
        	try
        	{
        		sb.append(fmt("Fetched %s...", url)).append("<br/>");
                Document doc = Jsoup.connect(url).get();
                Elements links = doc.select("a[href]");
                Elements media = doc.select("[src]");
                Elements imports = doc.select("link[href]");

                sb.append(fmt("\nMedia: (%d)", media.size())).append("<br/>");
                for (Element src : media) {
                    if (src.tagName().equals("img"))
                    	sb.append(fmt(" * %s: &lt;%s&gt; %sx%s (%s)",src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),trim(src.attr("alt"), 20))).append("<br/>");
                    else
                        sb.append(fmt(" * %s: &lt;%s&gt;", src.tagName(), src.attr("abs:src"))).append("<br/>");
                }

                sb.append(fmt("\nImports: (%d)", imports.size())).append("<br/>");
                for (Element link : imports) {
                	sb.append(fmt(" * %s &lt;%s&gt; (%s)", link.tagName(),link.attr("abs:href"), link.attr("rel"))).append("<br/>");
                }

                sb.append(fmt("\nLinks: (%d)", links.size())).append("<br/>");
                for (Element link : links) {
                	sb.append(fmt(" * a: &lt;%s&gt;  (%s)", link.attr("abs:href"), trim(link.text(), 35))).append("<br/>");
                }
        	}catch(Exception ex)
        	{
        		sb = new StringBuffer("Exception occurred, please rectify : <br/> "+ex.toString());
        	}

        }
		request.setAttribute("inputURL", url);
		request.setAttribute("data", sb.toString());
		RequestDispatcher rd = request.getRequestDispatcher("/success.jsp");
		rd.forward(request, response);
	}

    private static String fmt(String msg, Object... args) {
        // Format once, log to the server console, and return the same string
        String formatted = String.format(msg, args);
        System.out.println(formatted);
        return formatted;
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }

}
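
To make the output format of the servlet easier to follow, here is a minimal standalone sketch of how one "Links" line is assembled. It copies the logic of the `trim` helper and the format string from the listing above; the class name `LinkLine` and the method `linkLine` are hypothetical names used only for this illustration.

```java
// Standalone illustration of the fmt/trim helpers used by FetchLinks.
// LinkLine and linkLine are hypothetical names, not part of the project.
class LinkLine {

    // Same logic as FetchLinks.trim: cap the text at `width` characters,
    // replacing the last kept position with a '.' when it is cut.
    static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }

    // Builds one entry of the "Links" section, using the same format
    // string as the servlet and the same 35-character cap on link text.
    static String linkLine(String absHref, String text) {
        return String.format(" * a: &lt;%s&gt;  (%s)", absHref, trim(text, 35));
    }

    public static void main(String[] args) {
        System.out.println(trim("Codenvy documentation home page", 10));
        System.out.println(linkLine("http://jsoup.org/", "jsoup"));
    }
}
```

The `&lt;` and `&gt;` entities are there because the line ends up inside an HTML page, where they render as angle brackets around the URL.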

Create index.jsp

index.jsp provides a form for entering the URL of the web page whose links should be extracted. Submitting the form posts the URL to the FetchLinks servlet, which processes the request.

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Jsoup HTML Parser</title>
<link rel="stylesheet" href="css/style.css" type="text/css" />
</head>
<body>
<div class="listlinkform">
<p id="head">List links - using jsoup</p>
<form action="fetchLinks" method="post">
<table width="400" border="0" cellspacing="0" cellpadding="0"
class="table">
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td width="144"><strong>Website URL</strong></td>
<td width="8">&nbsp;</td>
<td width="203"><label> <input type="text" name="inputURL" />
</label></td>
</tr>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>(example : http://www.codenvy.com)</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td><label> <input type="submit" name="Fetch" value="ListLinks" class="button" />
</label></td>
</tr>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
</table>
</form>
</div>
</body>
</html>
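
The form submits whatever the user types as `inputURL`, so it can help to check the value before handing it to jsoup. The following is a minimal sketch under the assumption that only absolute http and https URLs should be fetched; the class `UrlCheck` and the method `isFetchable` are hypothetical and are not part of the project above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: decides whether a submitted value looks like an
// absolute http(s) URL before it is passed on to be fetched.
class UrlCheck {

    static boolean isFetchable(String input) {
        // A missing or blank form field is never fetchable
        if (input == null || input.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            // Require a host and an http or https scheme
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all (for example, it contains spaces)
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isFetchable("http://www.codenvy.com"));
        System.out.println(isFetchable("www.codenvy.com"));
    }
}
```

A servlet could call such a check at the top of `doPost` and show the usage message when it returns false. Note that a value without a scheme, such as `www.codenvy.com`, is rejected by this sketch.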


Create web.xml and success.jsp

web.xml declares index.jsp as the welcome file and maps the /fetchLinks URL, which the form in index.jsp posts to, to the FetchLinks servlet created earlier. The servlet then forwards the result to success.jsp.
web.xml:

<!DOCTYPE web-app PUBLIC
    "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd" >

<web-app>
    <display-name>jsoup - HTML parser</display-name>
    <welcome-file-list>
        <welcome-file>index.jsp</welcome-file>
    </welcome-file-list>
    <servlet>
        <display-name>fetchLinks</display-name>
        <servlet-name>fetchLinks</servlet-name>
        <servlet-class>com.codenvy.FetchLinks</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>fetchLinks</servlet-name>
        <url-pattern>/fetchLinks</url-pattern>
    </servlet-mapping>

</web-app>

success.jsp:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" pageEncoding="ISO-8859-1"%>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Jsoup HTML Parser</title>
<link rel="stylesheet" href="css/style.css" type="text/css" />
</head>
<body>
<div class="listlink">
<p>Data : <br/><%=request.getAttribute("data") %></p>
<br/>
<p><a href="index.jsp">Try</a> another url</p>
</div>
</body>
</html>
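
success.jsp writes the `data` attribute into the page as-is, and the servlet builds that string from link text and attribute values taken from the fetched page. If those values contain markup characters, they are rendered as HTML. One possible precaution is to escape each extracted value before appending it. The sketch below is an assumption about how that could be done with plain Java; `HtmlEscape` and `escape` are hypothetical names and are not part of jsoup or of the project above.

```java
// Hypothetical helper: escapes the characters that are significant in HTML
// so that extracted text is displayed literally instead of being rendered.
class HtmlEscape {

    static String escape(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<b>Terms & Conditions</b>"));
    }
}
```

In the servlet, this would mean wrapping values such as `link.text()` and `link.attr("abs:href")` in a call like `escape(...)` before passing them to `fmt`, while leaving the servlet's own `<br/>` separators untouched.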

Build and Run the application

After the application builds and runs successfully, it loads index.jsp, where we enter a URL. The application then fetches the links present on the given URL and displays them on success.jsp.
[Figure: links listed on success.jsp]
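
The URLs in the output are absolute even when the page uses relative links, because the servlet reads attributes with the `abs:` prefix (`abs:href`, `abs:src`), which jsoup resolves against the document's base URI. The sketch below illustrates the idea of that resolution with the JDK's `java.net.URI`. It is a stand-in for explanation only and not jsoup's own implementation; `AbsUrlDemo` and `resolve` are hypothetical names.

```java
import java.net.URI;

// Illustration of resolving a relative link against the page it came from.
// Uses java.net.URI as a stand-in; jsoup performs its own resolution.
class AbsUrlDemo {

    // Returns the absolute form of href relative to baseUri, or an empty
    // string when either value is missing or cannot be parsed.
    static String resolve(String baseUri, String href) {
        if (baseUri == null || href == null) {
            return "";
        }
        try {
            return URI.create(baseUri).resolve(href).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String page = "http://www.codenvy.com/docs/index.html";
        System.out.println(resolve(page, "page2.html"));
        System.out.println(resolve(page, "../img/logo.png"));
        System.out.println(resolve(page, "/css/style.css"));
    }
}
```

Links that are already absolute pass through unchanged, which is why external hyperlinks and local images appear in the same list with full URLs.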