Thursday, 29 November 2012

Search text in PDF files using Java (Apache Lucene and Apache PDFBox)


I recently came across a requirement to find out whether a specific word is present in a PDF file. At first I thought this was very simple, so I wrote a small Java application that extracted the text from the PDF files and then did a linear character match such as mystring.contains(mysearchterm). It gave me the expected output, but linear matching is suitable only when the content being searched is small. Otherwise it is expensive: every query rescans the whole text, so the cost is roughly O(np), where n is the number of words in the content and p is the number of search terms.
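
A minimal sketch of that first, naive approach, assuming the PDF text has already been extracted into a String. The NaiveSearch class and the containsTerm method are illustrative names, not part of the original application, and the case-insensitive comparison is an addition of this sketch.

```java
import java.util.Locale;

// Hypothetical helper illustrating the naive approach:
// every query scans the whole extracted text again.
final class NaiveSearch {

    // Returns true if the term occurs anywhere in the text, ignoring case.
    static boolean containsTerm(String text, String term) {
        return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Stands in for text that was extracted from a PDF file.
        String extracted = "Hello World by PDFBox";
        System.out.println(containsTerm(extracted, "hello"));
        System.out.println(containsTerm(extracted, "lucene"));
    }
}
```

This works for a single small document, but each additional search term costs another full pass over the text.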


A better solution is to use a simple search engine, which first parses all your data into tokens to build an index and then lets you query that index to retrieve matching results. The whole content is broken down into terms, and each term points to the content that contains it. For example, consider the raw data,

1,hello world
2,god is good all the time
3,all is well
4,the big bang theory

The search engine will create an index like this,

all -> 2,3
hello -> 1
is -> 2,3
good -> 2
world -> 1
the -> 2,4
god -> 2
time -> 2
well -> 3
big -> 4
bang -> 4
theory -> 4
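
The same idea can be sketched in a few lines of plain Java. This is only an illustration of an inverted index built over the four sample records, not how Lucene stores its index internally. The TinyInvertedIndex class and its methods are hypothetical names, and splitting on whitespace is a deliberate simplification of real tokenization.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index: each term maps to the ids of the records containing it.
final class TinyInvertedIndex {

    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Splits the text on whitespace, lowercases it, and records the id under every term.
    void add(int id, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
        }
    }

    // Returns the sorted ids of the records containing the term (empty if there are none).
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    // Builds the index for the sample data used in this post.
    static TinyInvertedIndex sample() {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "hello world");
        index.add(2, "god is good all the time");
        index.add(3, "all is well");
        index.add(4, "the big bang theory");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(sample().lookup("all")); // prints [2, 3]
    }
}
```

A lookup is now a single map access per term, no matter how much text was indexed.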

This is what full-text search engines do: they search large volumes of unstructured text quickly and effectively. There are many other things you can do with a search engine, but I am not going to cover them in this post. The aim is to show you how to create a simple Java application that searches for a particular keyword in a PDF document and tells you whether the document contains it.

The open source full-text search engine I am going to use is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Apache Lucene cannot extract text from PDF files. All it does is create an index from text and let us query that index to retrieve matching results. To extract text from PDF documents, let us use Apache PDFBox, an open source Java library that extracts the content of PDF documents so that it can be fed to Lucene for indexing.

Let's get started by downloading the required libraries. Please stick to the versions that I am using, since later versions may require a different implementation.

1. Download Apache Lucene 3.6.1 from here. Unzip the archive and find lucene-core-3.6.1.jar.

2. Download Apache PDFBox 0.7.3 from here. Unzip it and find pdfbox-0.7.3.jar.

3. Download fontbox-0.1.0.jar from here. The project fails at runtime with a class-not-found error if this library is missing.

The next step is to create a Java project in Eclipse. Right-click the project in Project Explorer, go to Build Path -> Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.
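
If you want to confirm that the three JARs are really on the build path before going further, a small check like the following can help. It is only a sketch: the ClasspathCheck class is a hypothetical helper, the Lucene and PDFBox class names are taken from the imports used later in this post, and org.fontbox.cmap.CMap is my assumption about a class shipped in fontbox-0.1.0.jar.

```java
// Hypothetical helper that reports whether the classes the project needs can be found.
final class ClasspathCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found, but one of its own dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.lucene.index.IndexWriter", // lucene-core-3.6.1.jar
            "org.pdfbox.pdmodel.PDDocument",       // pdfbox-0.7.3.jar
            "org.fontbox.cmap.CMap"                // fontbox-0.1.0.jar (assumed class name)
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```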

4. Create a class named SimplePDFSearch.java. This is the main class that performs each action in turn. Copy and paste the code below into this class, and change the package name to the package in which you are creating the class.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.queryParser.ParseException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;


public class SimplePDFSearch {
    // location where the index will be stored
    private static final String INDEX_DIR = "src/main/resources/index";
    private static final int DEFAULT_RESULT_SIZE = 100;

    public static void main(String[] args) throws IOException, ParseException {

        File pdfFile = new File("src/resources/SamplePDF.pdf");
        IndexItem pdfIndexItem = index(pdfFile);

        // create an instance of the Indexer class and index the item
        Indexer indexer = new Indexer(INDEX_DIR);
        indexer.index(pdfIndexItem);
        indexer.close();

        // create an instance of the Searcher class to query the index
        Searcher searcher = new Searcher(INDEX_DIR);
        int result = searcher.findByContent("Hello", DEFAULT_RESULT_SIZE);
        print(result);
        searcher.close();
    }

    // Extract text from the PDF document
    public static IndexItem index(File file) throws IOException {
        PDDocument doc = PDDocument.load(file);
        try {
            String content = new PDFTextStripper().getText(doc);
            return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
        } finally {
            // always release the document, even if text extraction fails
            doc.close();
        }
    }

    // Print the result
    private static void print(int result) {
        if (result == 1) {
            System.out.println("The document contains the search keyword");
        } else {
            System.out.println("The document does not contain the search keyword");
        }
    }
}
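
One detail of index(File) is worth noting: the ID is derived from the file name alone, so two files with the same name in different folders get the same ID, and the Indexer would then replace one with the other. A minimal sketch of the effect follows, together with one possible alternative (hashing the absolute path). The DocumentIds class and both method names are hypothetical, and a 32-bit hash can still collide in principle.

```java
import java.io.File;

// Hypothetical helper comparing two ways of deriving a document ID from a file.
final class DocumentIds {

    // Mirrors the approach in this post: the ID is the hash of the file name only.
    static long idFromName(File file) {
        return (long) file.getName().hashCode();
    }

    // Possible alternative: hash the absolute path, so folders are taken into account.
    static long idFromPath(File file) {
        return (long) file.getAbsolutePath().hashCode();
    }

    public static void main(String[] args) {
        File first = new File("reports/SamplePDF.pdf");
        File second = new File("archive/SamplePDF.pdf");

        // Same file name, so the name-based IDs are identical.
        System.out.println(idFromName(first) == idFromName(second));
        // Path-based IDs, printed for comparison.
        System.out.println(idFromPath(first) + " vs " + idFromPath(second));
    }
}
```

For the single-file example in this post the name-based ID is sufficient. The distinction matters only once several folders are indexed into the same index.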


5. Next, create a class that holds the items to be indexed from a PDF file. Create a class named IndexItem.java and paste the code below into it. It tells the search engine which parts of the PDF file to store and retrieve: a unique ID, the file name and the contents (text) of the file.


package com.programmingfree.simplepdfsearch;

public class IndexItem {
    private Long id;
    private String title;
    private String content;

    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", content='" + content + '\'' +
                '}';
    }

}


6. The next step is to create a class that indexes the contents of the PDF documents. Create a new class named Indexer.java, matching the name used in the main class, and paste the code below into it.


package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Indexer {
    private final IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // open the index in indexDir, creating it if it does not exist yet
        writer = new IndexWriter(FSDirectory.open(new File(indexDir)),
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
    }

    /** 
      * This method will add the items into index
      */
    public void index(IndexItem indexItem) throws IOException {

        // deleting the item, if already exists
        writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));

        Document doc = new Document();

        doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));

        // add the document to the index
        writer.addDocument(doc);
    }

    /**
      * Closing the index
      */
    public void close() throws IOException {
        writer.close();
    }
}

7. The last step is to create a class that queries the index created by the Indexer class. Create a class named Searcher.java and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Searcher {

    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final QueryParser contentQueryParser;

    public Searcher(String indexDir) throws IOException {
        // open the index directory to search
        reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
        searcher = new IndexSearcher(reader);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // define the query parser used to search items by the content field
        contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);
    }

    /**
     * Finds indexed items by their content.
     * @param queryString the query string to search for
     * @param numOfResults the maximum number of hits to collect
     * @return 1 if at least one document matches, 0 otherwise
     */
    public int findByContent(String queryString, int numOfResults) throws ParseException, IOException {
        // create a query from the incoming query string
        Query query = contentQueryParser.parse(queryString);
        // execute the query and get the results
        ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;

        return queryResults.length > 0 ? 1 : 0;
    }

    public void close() throws IOException {
        searcher.close();
        // an IndexSearcher built from an IndexReader does not close that reader
        reader.close();
    }
}

That is all we need before running the program to find out, quickly and efficiently, whether a word is present in a PDF file. Note that in the main class (SimplePDFSearch.java) I have used a field named INDEX_DIR, which contains the path where the index will be stored. The index is kept between runs: each time the program runs, the Indexer deletes any existing entry with the same ID before adding the document again, so the file is not indexed twice. I have used a sample PDF document that contains the following text,

"Hello World by PDFBox"

I am searching for the word "Hello", which is passed as a parameter to the findByContent method of the Searcher class, and the output is,

The document contains the search keyword
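
The example above indexes a single file. To cover every PDF in a folder, one could first collect the files and then pass each of them through index(File) and Indexer.index(...) in turn. The sketch below shows only the collection step using the standard library. The PdfFolder class and the listPdfFiles method are hypothetical names, and the default folder is simply the one used for the sample PDF in this post.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that collects the PDF files found directly inside a folder.
final class PdfFolder {

    // Returns the regular files whose names end in ".pdf" (any case), sorted by path.
    static List<File> listPdfFiles(File folder) {
        File[] found = folder.listFiles((dir, name) ->
                name.toLowerCase(Locale.ROOT).endsWith(".pdf") && new File(dir, name).isFile());

        List<File> result = new ArrayList<>();
        // listFiles returns null when the folder does not exist or cannot be read.
        if (found != null) {
            Arrays.sort(found);
            result.addAll(Arrays.asList(found));
        }
        return result;
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : "src/resources");
        for (File pdf : listPdfFiles(folder)) {
            // Each file could be handed to SimplePDFSearch.index(pdf) at this point.
            System.out.println(pdf.getPath());
        }
    }
}
```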

Download the source code from here and practice it yourself to understand this better.


Please leave your comments and questions about this post in the comments section so that I can improve my writing and publish more useful posts. Thanks for reading!!


10 comments:

  1. Hello, I'm trying to use PhraseQuery to search for the exact phrase 'Hello World'. Can you help me?

    Replies
    1. Hello Luciano,

      You should use the PhraseQuery class instead of the Query class.

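      // search for documents that have "foo bar" in them
      String sentence = "foo bar";
      IndexSearcher searcher = new IndexSearcher(directory);
      PhraseQuery query = new PhraseQuery();
      // terms must match the indexed tokens, which StandardAnalyzer lowercases
      String[] words = sentence.toLowerCase().split(" ");
      for (String word : words) {
          // use the field name from this post ("content")
          query.add(new Term(IndexItem.CONTENT, word));
      }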


      Check out these links for more working examples,
      http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene
      http://www.avajava.com/tutorials/lessons/how-do-i-query-for-words-near-each-other-with-a-phrase-query.html
      http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/

      Hope this helps!

  2. Hello Priya,
    I am trying to write a Java program to search for a word in the first page (or paragraph) of a PDF file. Finding the word and its number of occurrences is enough. Please advise. Thanks in advance.

  3. Hi,
    I have multiple PDF files in one folder. The task is that the software will have two input boxes:
    Browsing: this will browse to that folder
    Name: the name of any person you want to find (search for in the PDFs)

    When we click on search, it should check all the PDFs available in that folder for the name, and when it finds it, it should show the output below:
    PDF file name: first output
    PDF file page no.: second
    Person name: the name we searched for
    Father's name: the searched person's father's name
    Sex: M or F
    Age:
    PLEASE HELP...

  4. Hi Priya,

    My application was returning the error org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: EI
    PDDocument doc = PDDocument.load(file);
    String content = new PDFTextStripper().getText(doc);

    Can you help me?
    Thanks

    Replies
    1. Hi luciano,

      What content do you have inside your PDF file? Does it contain any text? You might encounter this error when the PDF file contains only images.

      This post extracts text from PDF files, and if it finds no text you might get this error. Please try with PDF files that have some text content in them.

      Thanks,
      Priya

  5. Priya, thanks for everything.
    I have one more question: I'm trying to ignore accents in the search, so that words match after removing special characters such as accents ("ANDRÉ" equals "ANDRE").
    I found the class ICUTokenizer, but I got the error NoSuchMethodError: com.ibm.icu.text.UnicodeSet.freeze.
    http://lucene.apache.org/core/4_2_0/analyzers-icu/index.html


  6. This comment has been removed by the author.

  7. Hi Priya..
    I work in testing. I need to write code to verify a PDF.

    I need to search for some words in a PDF that contains data in a tabular format.
    Here are my questions:
    How do I retrieve the position of a word if it is found in the PDF?
    If the word is found, how do I read that entire line?

    Please help me.
    Thanks in advance.
    Ani

  8. What changes would I have to make for content-based search on the following file types: txt, doc, pdf, xls and csv?
