Using Lucene Text Indexing and Searching with Codenvy

Apache Lucene™ is an open-source, high-performance, full-featured text search engine library.
Lucene can be used to index many document types (doc, pdf, html, jsp, txt and so on), once their text has been extracted, as well as database objects.
In this tutorial we look at how to index plain-text (.txt) files.
We are going to do two things with Lucene:

  • Indexing
  • Searching the index

Indexing the Documents:

The IndexWriter class is used to index the documents. Here we index the .txt files into a folder named index.

The index folder holds flat files with information about the indexed documents. That information is written through Document objects: each Document stores its data as name-value pairs (fields), and these Documents are what ends up in the index folder.
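To make the idea of documents as name-value pairs more concrete, here is a minimal sketch in plain Java, without Lucene. The class ToyInvertedIndex and its methods are hypothetical names chosen for illustration; Lucene's real on-disk format and analysis are far more elaborate. The sketch only shows the general principle: each document keeps its fields, and a term-to-documents map answers "which files contain this word?".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model only (hypothetical class): not how Lucene stores its index.
final class ToyInvertedIndex {

    // Each "document" is a set of name-value pairs, e.g. path -> "txtFiles/a.txt".
    private final List<Map<String, String>> documents = new ArrayList<>();
    // term -> ids of the documents whose "contents" field contains that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void addDocument(String path, String contents) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("path", path);
        doc.put("contents", contents);
        int docId = documents.size();
        documents.add(doc);
        // Very crude analysis: lowercase, then split on anything but a-z and 0-9.
        for (String term : contents.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the "path" field of every document containing the term.
    List<String> search(String term) {
        List<String> paths = new ArrayList<>();
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                paths.add(documents.get(id).get("path"));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("txtFiles/a.txt", "Codenvy is a cloud IDE");
        index.addDocument("txtFiles/b.txt", "Lucene is a search library");
        System.out.println(index.search("codenvy")); // prints [txtFiles/a.txt]
    }
}
```

The lookup never rescans the text files; it only consults the map built at indexing time, which is why the tutorial separates indexing from searching.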

TextIndexer.java:

package com.codenvy.text;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TextIndexer {

	public static void main(String[] args) {
		String indexPath = "index";
		String docsPath = "txtFiles";
		boolean create = true;
		createIndex(indexPath, docsPath, create);
	}

	public static boolean createIndex(String indexPath, String docsPath,
			boolean create) {
		final File docDir = new File(docsPath);
		Date start = new Date();
		try {
			System.out.println("Indexing to directory '" + indexPath + "'...");

			Directory dir = FSDirectory.open(new File(indexPath));
			Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_43);
			IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,analyzer);
			if (create) {
				iwc.setOpenMode(OpenMode.CREATE);
			} else {
				// update index
				iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
			}
			iwc.setRAMBufferSizeMB(256.0);
			// creating the index writer
			IndexWriter writer;
			try {
				writer = new IndexWriter(dir, iwc);
			} catch (IOException e) {
				File indexDir = new File(indexPath);
				if (indexDir.exists()) {
					// the folder is there, so the failure has another cause
					throw e;
				}
				// the index folder was missing: create it and retry
				indexDir.mkdirs();
				writer = new IndexWriter(dir, iwc);
			}

			indexDocs(writer, docDir);
			// writer.forceMerge();
			writer.close();
			Date end = new Date();
			System.out.println(end.getTime() - start.getTime()
					+ " total milliseconds");
			return true;
		} catch (IOException e) {
			e.printStackTrace();
			return false;
		}
	}

	private static void indexDocs(IndexWriter writer, File file)
			throws IOException {
		if (file.canRead()) {
			if (file.isDirectory()) {
				String[] files = file.list();
				if (files != null) {
					for (int i = 0; i < files.length; i++) {
						indexDocs(writer, new File(file, files[i]));
					}
				}
			} else {
				FileInputStream fis;
				try {
					fis = new FileInputStream(file);
				} catch (FileNotFoundException fnfe) {
					// the file cannot be opened: report it and skip it
					fnfe.printStackTrace();
					return;
				}
				try {
					// creating the Document object to store in the index
					Document doc = new Document();
					// path: indexed as a single token and stored, so it can be printed from search results
					doc.add(new StringField("path", file.getPath(), Field.Store.YES));
					// modified: last-modified time as a numeric field, not stored
					doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
					// contents: tokenized and indexed from the reader, not stored
					doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

					if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
						// new index: no older copy of the file can exist, so just add it
						System.out.println("adding " + file);
						writer.addDocument(doc);
					} else {
						// existing index: replace any older copy of the file with the same path
						System.out.println("updating " + file);
						writer.updateDocument(new Term("path", file.getPath()), doc);
					}
				} finally {
					fis.close();
				}
			}
		}
	}
}
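The indexer calls addDocument for a fresh index and updateDocument, keyed on the path term, for an existing one. The difference can be sketched in plain Java as follows. ToyPathStore is a hypothetical stand-in for the index, not a Lucene class; it only models the documented behaviour that updateDocument deletes the documents matching the term and then adds the new one.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index (hypothetical class), keyed by file path.
final class ToyPathStore {

    // Each entry is {path, contents}.
    private final List<String[]> docs = new ArrayList<>();

    // Like addDocument: always appends, so indexing a file twice duplicates it.
    void add(String path, String contents) {
        docs.add(new String[] {path, contents});
    }

    // Like updateDocument on the "path" term: delete matching entries, then add.
    void update(String path, String contents) {
        docs.removeIf(doc -> doc[0].equals(path));
        add(path, contents);
    }

    // Number of entries currently held for a path.
    int count(String path) {
        int n = 0;
        for (String[] doc : docs) {
            if (doc[0].equals(path)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        ToyPathStore store = new ToyPathStore();
        store.add("txtFiles/a.txt", "first version");
        store.add("txtFiles/a.txt", "second version");
        System.out.println(store.count("txtFiles/a.txt")); // prints 2

        store.update("txtFiles/a.txt", "third version");
        System.out.println(store.count("txtFiles/a.txt")); // prints 1
    }
}
```

This is why running the indexer with create set to false does not produce duplicate hits for a file that was already indexed.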

Searching for a query:

After indexing the .txt documents (in the txtFiles folder), we search the index for a query (search key).
In other words, we look up a word and find out which .txt documents in the folder contain it: simply a word search over a folder.

The flat files in the index folder hold data in the form produced by the Analyzer implementation classes.
Here we use the StandardAnalyzer class to search the index folder. Note that TextIndexer above indexes with SimpleAnalyzer; both analyzers produce the same token for a simple word such as "codenvy", but using the same analyzer for indexing and searching is generally the safer choice.

There are many analyzers available in Lucene 4.3.1:

ArabicAnalyzer, ArmenianAnalyzer, BasqueAnalyzer, BrazilianAnalyzer, BulgarianAnalyzer, CatalanAnalyzer, CJKAnalyzer, ClassicAnalyzer, CzechAnalyzer, DanishAnalyzer, EnglishAnalyzer, FinnishAnalyzer, FrenchAnalyzer, GalicianAnalyzer, GermanAnalyzer, GreekAnalyzer, HindiAnalyzer, HungarianAnalyzer, IndonesianAnalyzer, IrishAnalyzer, ItalianAnalyzer, LatvianAnalyzer, NorwegianAnalyzer, PersianAnalyzer, PortugueseAnalyzer, RomanianAnalyzer, RussianAnalyzer, SpanishAnalyzer, StandardAnalyzer, StopAnalyzer, SwedishAnalyzer, ThaiAnalyzer, TurkishAnalyzer, UAX29URLEmailAnalyzer

Analyzers turn text into streams of tokens. Analysis is applied to the documents at indexing time and to the query string at search time, so that the query terms can be matched against the terms stored in the index.
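As a rough illustration of what an analyzer does, the sketch below mimics the documented behaviour of SimpleAnalyzer (split the text at non-letters, then lowercase) in plain Java. ToyLetterAnalyzer is a hypothetical name and this is not Lucene's implementation; other analyzers apply different rules, for example around digits, stop words or language-specific stemming.

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer (hypothetical class): splits at non-letters and lowercases.
final class ToyLetterAnalyzer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // letters extend the current token, lowercased
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // any non-letter ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [codenvy, s, cloud, ide, v]
        System.out.println(tokenize("Codenvy's Cloud-IDE, v2!"));
    }
}
```

A query for "IDE" only matches if it is reduced to the same token, "ide", that was produced when the file was indexed.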

SearchForTextFiles.java

package com.codenvy.text;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchForTextFiles {

	public static boolean searchFiles(String indexPath, String queryStr,
			int maxHits) {
		String field = "contents";
		IndexReader reader;
		try {
			reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
			IndexSearcher searcher = new IndexSearcher(reader);
			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
			QueryParser parser = new QueryParser(Version.LUCENE_43, field,analyzer);
			Query query = parser.parse(queryStr);
			TopDocs topDocs = searcher.search(query, maxHits);
			ScoreDoc[] hits = topDocs.scoreDocs;
			for (int i = 0; i < hits.length; i++) {
				int docId = hits[i].doc;
				Document d = searcher.doc(docId);
				System.out.println(d.get("path"));
			}
			System.out.println("Found " + hits.length);
			// release the index files held by the reader
			reader.close();
			return true;
		} catch (IOException e) {
			// the index could not be opened or read
			e.printStackTrace();
			return false;
		} catch (ParseException e) {
			// the query string is not valid query syntax
			e.printStackTrace();
			return false;
		}

	}

	public static void main(String[] args) {
		// index folder
		String indexPath = "index";
		// query to search
		String queryStr = "codenvy";
		// maximum number of hits to return
		int maxHits = 100;
		searchFiles(indexPath, queryStr, maxHits);
	}

}
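The search above asks for at most maxHits results and receives them best match first. The plain-Java sketch below illustrates that idea with a deliberately naive score, the number of times the term occurs in a file. ToyTopHits and its methods are hypothetical, and Lucene's actual scoring model is considerably more sophisticated; the sketch also assumes maxHits is not negative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy ranked search (hypothetical class): score = occurrences of the term.
final class ToyTopHits {

    static int termCount(String contents, String term) {
        int count = 0;
        for (String token : contents.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // files maps path -> contents; returns at most maxHits paths, best first.
    static List<String> search(Map<String, String> files, String term, int maxHits) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : files.entrySet()) {
            int score = termCount(entry.getValue(), needle);
            if (score > 0) {
                scores.put(entry.getKey(), score);
            }
        }
        List<String> hits = new ArrayList<>(scores.keySet());
        // highest score first; the sort is stable, so ties keep insertion order
        hits.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(hits.subList(0, Math.min(maxHits, hits.size())));
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("txtFiles/b.txt", "codenvy");
        files.put("txtFiles/a.txt", "codenvy codenvy lucene");
        files.put("txtFiles/c.txt", "nothing here");
        System.out.println(search(files, "codenvy", 100)); // prints [txtFiles/a.txt, txtFiles/b.txt]
        System.out.println(search(files, "codenvy", 1));   // prints [txtFiles/a.txt]
    }
}
```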

Note: Here the index and txtFiles folders are part of my IDE project. Change the paths to suit your own setup.

Test cases:

ATextIndexerTest.java
package com.codenvy.text;

import junit.framework.Assert;
import junit.framework.TestCase;

/**
 * @author anils
 *
 */
public class ATextIndexerTest extends TestCase {

	/**
	 * Test method for {@link com.codenvy.text.TextIndexer#createIndex(java.lang.String, java.lang.String, boolean)}.
	 */
	public void testCreateIndex() {
		String indexPath = "index";
		String docsPath = "txtFiles";
		boolean create = true;
		Assert.assertTrue(TextIndexer.createIndex(indexPath, docsPath, create));

	}

}

SearchForTextFilesTest.java
package com.codenvy.text;
import junit.framework.Assert;
import junit.framework.TestCase;
public class SearchForTextFilesTest extends TestCase {
	public void testSearchFiles() {
		String indexPath = "index";
		String queryStr = "code";
		int maxHits = 100;
		Assert.assertTrue(SearchForTextFiles.searchFiles(indexPath, queryStr, maxHits));
	}
}

pom.xml (creates the jar file for TextIndexer.java)

To build a jar for SearchForTextFiles.java instead, change this line

<mainClass>com.codenvy.text.TextIndexer</mainClass>

to

<mainClass>com.codenvy.text.SearchForTextFiles</mainClass>

and change these lines

<groupId>lucence-index</groupId>
<artifactId>lucence-index</artifactId>

to

<groupId>lucence-search</groupId>
<artifactId>lucence-search</artifactId>

pom.xml content:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>lucence-index</groupId>
	<artifactId>lucence-index</artifactId>
	<version>1.0-SNAPSHOT</version>
	<packaging>jar</packaging>

	<name>lucence</name>
	<url>http://maven.apache.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<dependencies>

		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>3.8.1</version>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-core</artifactId>
			<version>4.3.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queries</artifactId>
			<version>4.3.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queryparser</artifactId>
			<version>4.3.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-analyzers-common</artifactId>
			<version>4.3.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-sandbox</artifactId>
			<version>4.3.1</version>
		</dependency>

		<dependency>
			<groupId>jakarta-regexp</groupId>
			<artifactId>jakarta-regexp</artifactId>
			<version>1.4</version>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>2.3.2</version>
				<configuration>
					<source>1.6</source>
					<target>1.6</target>
				</configuration>
			</plugin>

			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-jar-plugin</artifactId>
				<version>2.4</version>
				<configuration>
					<archive>
						<manifest>
							<addClasspath>true</addClasspath>
							<mainClass>com.codenvy.text.TextIndexer</mainClass>
							<classpathPrefix>lib/</classpathPrefix>
						</manifest>
					</archive>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-dependency-plugin</artifactId>
				<version>2.5.1</version>
				<executions>
					<execution>
						<id>copy-dependencies</id>
						<phase>package</phase>
						<goals>
							<goal>copy-dependencies</goal>
						</goals>
						<configuration>
							<includeGroupIds>org.apache.lucene</includeGroupIds>
							<outputDirectory>${project.build.directory}/lib/</outputDirectory>
						</configuration>
					</execution>
				</executions>
			</plugin>

		</plugins>
	</build>

</project>

Lucene features (details taken from the Apache Lucene documentation):

Scalable, High-Performance Indexing

• over 150GB/hour on modern hardware
• small RAM requirements — only 1MB heap
• incremental indexing as fast as batch indexing
• index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms

• ranked searching — best results returned first
• many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
• fielded searching (e.g. title, author, contents)
• sorting by any field
• multiple-index searching with merged results
• allows simultaneous update and searching
• flexible faceting, highlighting, joins and result grouping
• fast, memory-efficient and typo-tolerant suggesters
• pluggable ranking models, including the Vector Space Model and Okapi BM25
• configurable storage engine (codecs)

Cross-Platform Solution

• Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
• 100%-pure Java
• Implementations in other programming languages available that are index-compatible