Using HTMLParser with Codenvy

Get Page Title Using htmlparser

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package.
Developers are mostly drawn to HTMLParser for its simple design, its speed, and its ability to handle streaming real-world HTML.

The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, version 1.4 of HTMLParser brought substantial improvements in transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.

Create a Project

Log in to your Codenvy workspace and create a new WAR project named “htmlparser”. The project file structure is shown below:

Add Dependencies

Specify the dependencies below in pom.xml; they are prerequisites for the htmlparser implementation. Make sure the corresponding jars appear under Maven Dependencies.

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.codenvy</groupId>
  <artifactId>htmlparser</artifactId>
  <packaging>war</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>htmlparserTut</name>
  <build>
    <finalName>htmlparserTut</finalName>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.htmlparser</groupId>
      <artifactId>htmlparser</artifactId>
      <version>2.1</version>
    </dependency>
    <dependency>
      <groupId>javax</groupId>
      <artifactId>javaee-web-api</artifactId>
      <version>6.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <url>https://repo1.maven.org/maven2</url>
    </repository>
    <repository>
      <id>java.net2</id>
      <name>Repository hosting the jee6 artifacts</name>
      <url>http://download.java.net/maven/2</url>
    </repository>
  </repositories>
</project>

Create the GetTitle class:
This servlet takes a web URL as input, locates the title element of that page, and extracts its text.

package com.codenvy;

import java.io.IOException;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class GetTitle extends HttpServlet
{
   private static final long serialVersionUID = 1L;

   public GetTitle()
   {
      super();
   }

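   protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
   {
      // The form posts to this servlet, so send plain GET requests back to the input page.
      response.sendRedirect(request.getContextPath() + "/index.jsp");
   }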

   protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
   {
      StringBuilder sb = new StringBuilder();
      String url = request.getParameter("inputURL");
      // getParameter returns null when the field is missing, so check for null first.
      if (url == null || url.trim().length() == 0)
      {
         sb.append("usage: supply url to fetch title").append("<br/>");
      }
      else
      {
         try
         {
            // Fetch the page and keep only its TITLE tags.
            Parser parser = new Parser(url.trim());
            NodeList list = parser.parse(new TagNameFilter("TITLE"));
            if (list.size() > 0)
            {
               String title = list.elementAt(0).toPlainTextString();
               sb.append("Title - " + title);
            }
            else
            {
               sb.append("No title found on the page");
            }
         }
         catch (Exception ex)
         {
            sb = new StringBuilder("Exception occurred, please rectify : <br/> " + ex.toString());
         }
      }
      request.setAttribute("data", sb.toString());
      RequestDispatcher rd = request.getRequestDispatcher("/success.jsp");
      rd.forward(request, response);
   }
}
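
Because the URL comes straight from a form field, it may be worth checking it before it reaches the parser. The sketch below is one possible check, not part of the original tutorial: the class name UrlCheck and the method isFetchable are hypothetical, and the rule it applies (absolute http or https URLs with a host only) is an assumption about what this application should accept.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of htmlparser or the tutorial's project.
final class UrlCheck {
    private UrlCheck() {}

    /** Returns true only for absolute http or https URLs that name a host. */
    static boolean isFetchable(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(raw.trim());
            String scheme = uri.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return web && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Input such as "not a url" cannot be parsed as a URI at all.
            return false;
        }
    }
}
```

In doPost, a call such as UrlCheck.isFetchable(url) could guard the branch that creates the Parser, with the usage message shown when it returns false.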

Create index.jsp

index.jsp provides the form for entering the URL of the web page whose title is to be extracted. Submitting the form sends the request to the GetTitle class for processing.

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <title>htmlparser Tutorial</title>
    <link rel="stylesheet" href="css/style.css" type="text/css" />
  </head>
  <body>
    <div class="titlegetform">
      <p id="head">Get Title - using htmlparser</p>
      <form action="getTitle" method="post">
        <table width="400" border="0" cellspacing="0" cellpadding="0"
        class="table">
          <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
          </tr>
          <tr>
            <td width="144"><strong>Website URL</strong></td>
            <td width="8">&nbsp;</td>
            <td width="203"><label> <input type="text" name="inputURL" />
              </label></td>
          </tr>
          <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>(example : http://www.codenvy.com)</td>
          </tr>
          <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td><label> <input type="submit" name="getTitle" value="GetTitle" class="button" />
              </label></td>
          </tr>
          <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
          </tr>
        </table>
      </form>
    </div>
  </body>
</html>

Create web.xml and success.jsp

web.xml maps the /getTitle URL, which the form in index.jsp posts to, onto the GetTitle class created earlier. The servlet then forwards the result to success.jsp.
web.xml:

<!DOCTYPE web-app PUBLIC
"-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd" >

<web-app>
  <display-name>htmlparser Tutorial</display-name>
  <welcome-file-list>
    <welcome-file>index.jsp</welcome-file>
  </welcome-file-list>
  <servlet>
    <display-name>GetTitle</display-name>
    <servlet-name>GetTitle</servlet-name>
    <servlet-class>com.codenvy.GetTitle</servlet-class>
  </servlet>

  <servlet-mapping>
    <servlet-name>GetTitle</servlet-name>
    <url-pattern>/getTitle</url-pattern>
  </servlet-mapping>

</web-app>

success.jsp:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" pageEncoding="ISO-8859-1"%>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <title>htmlparser Tutorial</title>
    <link rel="stylesheet" href="css/style.css" type="text/css" />
  </head>
  <body>
    <div class="titleget">
      <p>Data : <br/><%=request.getAttribute("data") %></p>
      <br/>
      <p><a href="index.jsp">Try</a> another url</p>
    </div>
  </body>
</html>
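
success.jsp writes the "data" attribute into the page as-is, so a title containing markup would be rendered as HTML. A minimal sketch of an escaping helper follows. The class name HtmlEscape is hypothetical, and the set of characters it replaces is an assumption that covers the usual HTML-significant ones rather than a complete sanitizer.

```java
// Hypothetical helper, not part of htmlparser or the tutorial's project.
final class HtmlEscape {
    private HtmlEscape() {}

    /** Replaces the characters that are significant in HTML with entities. */
    static String escape(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Since the servlet adds its own br tags to some messages, the helper would be applied to the extracted title and the exception text only, not to the whole string stored in "data".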

Build and run the application:
After the application builds and starts successfully, it loads index.jsp, where we enter a URL. The application then fetches the title of the page at that URL and displays it in success.jsp.
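
To exercise the servlet without the browser form, the same POST body can be built by hand and sent with any HTTP client. The sketch below only builds that body. The class name FormBody is hypothetical; the field name inputURL is taken from index.jsp, and the target path /getTitle from web.xml.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the tutorial's project.
final class FormBody {
    private FormBody() {}

    /** Builds the form-encoded body that the index.jsp form would submit. */
    static String forUrl(String pageUrl) {
        try {
            return "inputURL=" + URLEncoder.encode(pageUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting string is sent to the application's /getTitle path with the content type application/x-www-form-urlencoded.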
