Search text in PDF files using Java (Apache Lucene and Apache PDFBox)




I recently came across a requirement to find whether a specific word is present in a PDF file. At first I thought this was a very simple requirement and wrote a simple Java application that extracts the text from the PDF files and then does a linear character match such as mystring.contains(mysearchterm). It gave me the expected output, but linear character matching is suitable only when the content you are searching is very small. Otherwise it is expensive: every query has to scan the whole text again, so a naive match can cost up to O(n·m) per search term, where n is the length of the text and m is the length of the search term, and that cost is paid again for every term and every document.
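To make the linear approach concrete, here is a minimal sketch in plain Java. The class name LinearScanSketch and the method matchingDocuments are hypothetical names chosen for illustration, and lower-casing both sides is my assumption so that "hello" also matches "Hello". The point to notice is that every search walks through every document again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: every search scans the full text of every document.
class LinearScanSketch {

    // Returns the 0-based positions of the documents whose text contains the term.
    static List<Integer> matchingDocuments(List<String> documents, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            // contains() compares characters position by position
            if (documents.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("Hello World by PDFBox", "all is well");
        System.out.println(matchingDocuments(documents, "hello")); // prints [0]
    }
}
```

This is fine for a handful of short documents, but the work grows with the total amount of text on every single query.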


A better solution is to use a simple search engine that first parses all your data into tokens to create an index and then lets us query the index to retrieve matching results. This means the whole content is first broken down into terms, and each term then points to the content that contains it. For example, consider the raw data,

1,hello world
2,god is good all the time
3,all is well
4,the big bang theory

The search engine will create an index like this,

all -> 2,3
bang -> 4
big -> 4
god -> 2
good -> 2
hello -> 1
is -> 2,3
the -> 2,4
theory -> 4
time -> 2
well -> 3
world -> 1
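The idea behind such an index can be sketched in a few lines of plain Java. This is only an illustration of the concept, not how Lucene stores its index internally; the class name InvertedIndexSketch and the method build are hypothetical, and splitting on whitespace is a simplifying assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: maps each term to the ids of the documents that contain it.
class InvertedIndexSketch {

    static Map<String, SortedSet<Integer>> build(Map<Integer, String> documents) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> entry : documents.entrySet()) {
            // break the content into lower-cased terms
            for (String term : entry.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    // each term points back to the document it came from
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(entry.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> documents = new LinkedHashMap<>();
        documents.put(1, "hello world");
        documents.put(2, "god is good all the time");
        documents.put(3, "all is well");
        documents.put(4, "the big bang theory");

        Map<String, SortedSet<Integer>> index = build(documents);
        System.out.println("all -> " + index.get("all")); // prints all -> [2, 3]
        System.out.println("the -> " + index.get("the")); // prints the -> [2, 4]
    }
}
```

Once the index is built, answering "which documents contain this word?" is a single map lookup instead of a scan over all the text.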

Full-text search engines are what I am referring to here; they search large volumes of unstructured text quickly and effectively. There are many other things you can do with a search engine, but I am not going to deal with any of them in this post. The aim is to show how to create a simple Java application that searches for a particular keyword in PDF documents and tells you whether a document contains that keyword.

The open source full-text search engine I am going to use for this purpose is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Apache Lucene cannot extract text from PDF files. All it does is create an index from text and then let us query that index to retrieve matching results. To extract text from PDF documents, let us use Apache PDFBox, an open source Java library that extracts content from PDF documents, which can then be fed to Lucene for indexing.

Let's get started by downloading the required libraries. Please stick to the versions I am using, since later versions may require a different implementation.

1. Download Apache Lucene 3.6.1 from here. Unzip the archive and find lucene-core-3.6.1.jar.

2. Download Apache PDFBox 0.7.3 from here. Unzip it and find pdfbox-0.7.3.jar.

3. Download fontbox-0.1.0.jar from here. The project will fail with a class-not-found error at run time if this library is not present.

The next step is to create a Java project in Eclipse. Right-click the project in Project Explorer, go to Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.

4. Create a class named SimplePDFSearch.java. This is the main class that performs each action one by one. Copy and paste the code below into this class, and change the package name to the package in which you are creating the class.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.queryParser.ParseException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;


 public class SimplePDFSearch {
     // location where the index will be stored.
     private static final String INDEX_DIR = "src/main/resources/index";
     private static final int DEFAULT_RESULT_SIZE = 100;

     public static void main(String[] args) throws IOException, ParseException {

         File pdfFile = new File("src/resources/SamplePDF.pdf");
         IndexItem pdfIndexItem = index(pdfFile);

         // creating an instance of the indexer class and indexing the items
         Indexer indexer = new Indexer(INDEX_DIR);
         indexer.index(pdfIndexItem);
         indexer.close();

         // creating an instance of the Searcher class to query the index
         Searcher searcher = new Searcher(INDEX_DIR);
         int result = searcher.findByContent("Hello", DEFAULT_RESULT_SIZE);
         print(result);
         searcher.close();
     }
     
     // Extract text from the PDF document
     public static IndexItem index(File file) throws IOException {
         PDDocument doc = PDDocument.load(file);
         try {
             String content = new PDFTextStripper().getText(doc);
             return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
         } finally {
             // always release the PDF, even if text extraction fails
             doc.close();
         }
     }

     // Print the results
     private static void print(int result) {
         if (result == 1) {
             System.out.println("The document contains the search keyword");
         } else {
             System.out.println("The document does not contain the search keyword");
         }
     }
 }


5. We have to create a class that holds the items to be indexed from a PDF file. Create a class named IndexItem.java and paste the code below into it. It carries the three values we want the search engine to store and retrieve for each PDF file: a unique ID, the file name and the contents (text) of the file.


package com.programmingfree.simplepdfsearch;

public class IndexItem {
    private Long id;
    private String title;
    private String content;

    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", content='" + content + '\'' +
                '}';
    }

}


6. The next step is to create a class that indexes the contents of the PDF documents. Create a new class named Indexer.java, as referred to in the main class, and paste the code below into it,


package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // open the index in the given directory, creating it if it does not exist
        writer = new IndexWriter(FSDirectory.open(new File(indexDir)),
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
    }

    /** 
      * This method will add the items into index
      */
    public void index(IndexItem indexItem) throws IOException {

        // deleting the item, if already exists
        writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));

        Document doc = new Document();

        doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));

        // add the document to the index
        writer.addDocument(doc);
    }

    /**
      * Closing the index
      */
    public void close() throws IOException {
        writer.close();
    }
}

7. The last step is to create a class that queries the index created by the Indexer class. Create a class named Searcher.java and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Searcher {
 
    private IndexSearcher searcher;
    private QueryParser contentQueryParser;

    public Searcher(String indexDir) throws IOException {
        // open the index directory to search
        searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File(indexDir))));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // defining the query parser to search items by content field.
        contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);
    }

    
    /**
      * This method is used to find the indexed items by the content.
      * @param queryString - the query string to search for
      */
    public int findByContent(String queryString, int numOfResults) throws ParseException, IOException {
        // create query from the incoming query string.
        Query query = contentQueryParser.parse(queryString);
         // execute the query and get the results
        ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;
        
        // 1 if at least one document matched, 0 otherwise
        return queryResults.length > 0 ? 1 : 0;
        
    }

    public void close() throws IOException {
        searcher.close();
    }
}

That is all we need before running the program to find whether a word is present in a PDF file quickly and efficiently. Note that the main class (SimplePDFSearch.java) has a field named INDEX_DIR, which contains the path where the index will be stored. The index is kept between runs; each time the program runs, the Indexer deletes any existing entry with the same ID before adding the document again, so the same file is not indexed twice. I have used a sample PDF document that contains the following text,

"Hello World by PDFBox"

I am searching for the word "Hello", which is passed as a parameter to the findByContent method of the Searcher class, and the output is,

The document contains the search keyword

Download the source code (use the download button at the beginning of this article) and practice it yourself to understand this better.


Please leave your comments and queries about this post in the comment sections in order for me to improve my writing skills and to showcase more useful posts. Thanks for reading!!


Post a Comment

  1. Hello, I'm trying to use PhraseQuery to search for the exact phrase 'Hello World'. Can you help me?

    1. Hello Luciano,

      You should use the PhraseQuery class instead of the Query class.

      // search for documents that have "foo bar" in them
      String sentence = "foo bar";
      IndexSearcher searcher = new IndexSearcher(directory);
      PhraseQuery query = new PhraseQuery();
      String[] words = sentence.split(" ");
      for (String word : words) {
          // use the field name you indexed under (IndexItem.CONTENT in this post)
          query.add(new Term("contents", word));
      }
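
      To see what an exact phrase match means, here is a small plain-Java sketch that does not use Lucene at all. It only illustrates the idea that the words of the phrase must appear next to each other and in the same order; PhraseMatchSketch, tokenize and containsPhrase are hypothetical names, and the lower-casing and splitting rules are my assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative only: a phrase matches when its words appear consecutively, in order.
class PhraseMatchSketch {

    // Lower-cases the text and splits it on runs of non-word characters.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\W+"));
    }

    static boolean containsPhrase(String text, String phrase) {
        // indexOfSubList returns -1 when the phrase tokens do not occur as a consecutive run
        return Collections.indexOfSubList(tokenize(text), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String text = "Hello World by PDFBox";
        System.out.println(containsPhrase(text, "Hello World")); // prints true
        System.out.println(containsPhrase(text, "World Hello")); // prints false
    }
}
```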


      Check out these links for more working examples,
      http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene
      http://www.avajava.com/tutorials/lessons/how-do-i-query-for-words-near-each-other-with-a-phrase-query.html
      http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/

      Hope this helps!

  2. Hello Priya,
    I am trying to write a Java program to search for a word in the first page (or paragraph) of a PDF file. Finding the word and its count of occurrences is enough. Please advise. Thanks in advance.

    1. Hi,

      As explained in the post, we convert the content of the whole PDF file to text using PDFBox and then index it. So, with the code as written, your first requirement of analyzing the first page or paragraph alone is not possible. Next, you can very well find the number of times the word occurs in the index if you build your index with content from only the one PDF that is of interest to you.

      If you have more than one PDF file, the count will include occurrences of the search term in all of them. The above post is just a sample that shows how to use Lucene to search PDF files. I recommend going through the official documentation to understand which analyzer and QueryParser best suit your requirement.
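
      If a count for a single document is all that is needed, a sketch like the following, run on the text that PDFBox returns, may be enough. It is plain Java rather than Lucene; TermCountSketch and countOccurrences are hypothetical names, and matching whole words case-insensitively is my assumption.

```java
import java.util.Locale;

// Illustrative only: counts whole-word, case-insensitive occurrences in extracted text.
class TermCountSketch {

    static int countOccurrences(String text, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        int count = 0;
        // split on runs of non-word characters so punctuation does not stick to words
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("Hello World, hello PDFBox", "hello")); // prints 2
    }
}
```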

      Thanks,
      Priya

    2. I want to search text across multiple pages of a PDF file. When I try this code it works only for a single page, so please help me.

  3. Hi
    i have multiple pdf files in one folder ...so task is that in software there will be 2 input box
    for
    Browsing :- this will browse to that folder
    name:- name of any person which you want to find (search in pdf)

    and then when we will click on search it will check all the pdf available in that folder and then will check the name inside all pdf when it will get it should show the output below...
    pdf file name :- first output
    pdf file page no.:-second
    person name:-which we searched
    father's name:- searched person's father's name
    sex:-M or F
    Age:-
    PLEASE HELP...

  4. Hi Pryia,

    My application was returning error org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: EI
    PDDocument doc = PDDocument.load(file);
    String content = new PDFTextStripper().getText(doc);

    Can help me?
    Thanks

    1. Hi luciano,

      What content do you have inside your PDF file? Does it contain any text? You might encounter this error when you have only images in your PDF files.

      This post extracts text from PDF files, and if it finds no text you might get this error. Please try with PDF files that have some text content in them.

      Thanks,
      Priya

  5. Pryia thanks for all,
    I have one more question, i'm trying to remove the accents in the search, find words removing special characters such as accents ("ANDRÉ" equals "ANDRE").
    I found the class ICUTokenizer but got the error NoSuchMethodError: com.ibm.icu.text.UnicodeSet.freeze.
    http://lucene.apache.org/core/4_2_0/analyzers-icu/index.html


  6. Hi Priya..
    I am into Testing. I need to write a code to verify a PDF.

    I need to Search for Somewords in a PDF which contain data in a Tabular format.
    Here are my queries..
    How to retrieve the position of a word if it is found in the PDF.
    If the word found, How to read that entire line

    Please help me
    Thanks in advance..
    Ani

  7. what changes would i have to do for content based search on the following files
    txt,doc,pdf,xls and csv

  8. if i search for "by" then it says doesnt contain the keyword

    1. Hi,

      Do you mean that searching for the other words (hello/world/pdfbox) works in the sample application I provided, and only the word 'by' fails? Or is it failing in your own application, which you implemented by following the tutorial above?

      Thanks,
      Priya

    2. Sorry for incomplete information...
      I downloaded the source file you have given...
      Configure the jar files (i got lucene-core-3.6.2 as the link you have provided was broken)
      when i run the program with hello , world or pdfbox in the queryString
      it gives "The document contains the search keyword"
      but i give by in the queryString
      it gives "The document does not contain the search keyword"

      I am using the same pdf you have provided.
      PDF when opened by adobe reader shows "Hello World by PDFBox"

    3. Hi Akshit,

      I know it's too late for you to find this response useful, but for others who have the same question, this is why you don't get any result when you search for the word 'by'. I have used "StandardAnalyzer" in this example to index the PDF file's text content. By default, StandardAnalyzer has a set of stop words that are omitted from the index, and 'by' is one of them.

      You can find a list of all the words that are filtered out by default here along with a solution to stop this behavior if you wish,

      http://stackoverflow.com/questions/4871709/stop-words-in-sitecore
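
      To see the effect without Lucene, here is a minimal plain-Java sketch of stop-word filtering. The stop list below is a small illustrative subset that I picked for this example, not StandardAnalyzer's full default set, and StopWordSketch and indexableTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: stop words are dropped before terms reach the index.
class StopWordSketch {

    // Assumption: a short sample stop list, not Lucene's complete default set.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "by", "is", "of", "the", "to");

    static List<String> indexableTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "by" never reaches the index, so a search for it finds nothing
        System.out.println(indexableTerms("Hello World by PDFBox")); // prints [hello, world, pdfbox]
    }
}
```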

      Thanks,
      Priya

    4. Okay thanks.
      My project was complete but I was still wondering the answer.
      Thanks for the reply.

  9. Hi All, Please find the below requirement and suggest any solution for this.
    --> I have a PDF document in my Local drive.
    --> There is a table in the document and i need to find the exact value under a column name. e.g., i have 3 columns such as 'User ID', 'Password' and 'Type of User'. Now, i will provide the User ID and i need to get the Type of User for that ID.
    Can anyone suggest if this is possible using Vb-Script or Java? If yes, please publish your thoughts. Thanks in advance.

  10. Does it support Arabic PDF files, as I have Arabic pdf files and I want to search for specific words inside it?

    1. Hi,

      Yeah, certainly. You have to use ArabicAnalyzer for this. Check this out,

      http://lucene.apache.org/core/3_0_3/api/contrib-analyzers/org/apache/lucene/analysis/ar/ArabicAnalyzer.html

      http://stackoverflow.com/questions/2938564/lucene-2-2-arabic-analyzer

      Thanks,
      Priya

  11. Hi Frd,

    Above code works fine with single word like "Hello" or "World". If i try to search "Hello World"..Its says not available in the document... Pls help me to resolve this!!!

    1. Hi,

      The above example explains how to search for a single word only. You should use 'PhraseQuery' to do an exact phrase search. There is a lot to learn in Lucene. Go through the official documentation to find out which analyzer and query class best suit you. For a quick solution, refer to this,

      http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene

      Thanks,
      Priya

    2. Thanks for ur reply,

      i gone through this link
      'http://stackoverflow.com/questions/9066347/lucene-multi-word-phrases-as-search-terms?rq=1',
      and I've done some changes in the downloaded source file from here. Its looks like working correctly!!

      I've shared my project with updates Download Link: http://www.mediafire.com/?is7rq3rob400mq4.

      Can u pls check and tell me... what i'd done is correct..

  12. Hi Priya,

    Very nice introductory tutorial! Thanks for putting it up... It really is helpful for someone new to PDFBox and Lucene.

    I am having the following scenario which I could not find in the comments before me:

    I have a set of pdf documents (say 2000) created using Actuate Reports. There are scatterred key value pairs in every PDF with format like "customer=1234". Again on some other page it could be "customer=1456", etc.

    I want to parse every pdf and fetch all the customer values from inside my java program.

    Using the code above and modifying it as per my requirement, I think I will be able to get all the occurrences of "customer=" string and then through some String processing getting the next token before space and after "customer=" string as the value I want.

    My questions are as below:

    1) Is this way of getting the value is correct ? Or is there any option present in Lucene which directly fetches the value given a key as is my case.

    2) My pdf documents will be around 2 to 5 pages long. So will it be ok to parse thousands of pdf at a time or will there be a performance issue ? Is there a way you can guide to improve performance ?

    3) If the pdf background is white and the string "customer=1234" is also written in white color fonts (which means the will be physically invisible), then in that case will PDFBox be able to fetch the text such that I can search through lucene later ?

    Thanks in advance for your help! Meanwhile, I will try to work with your program to get answers to my questions.

    Keep up the good work!

    Regards,
    Nik


    1. Hi Nik,

      First of all, thanks for reading this article.

      Lucene is a full text search engine, which provides quick search results when queried against a huge search index. Please post your question at stackoverflow.com after doing proper analysis on your requirements and all possible ways of implementing it.

      Thanks,
      Priya

  13. Thanks PRIYA! dont have enoguh words to thank you!

  14. I did a indexing of files like pdf,ppt,docs. It display the file containing the particular word. Now I need to show the line in which the particular word occurs. Any idea on how to do that?

  15. Hi Priya, How do I get the Coordinate location of the searched text? How do we use PrintTextLocations or TextPositions or some custom class? Can you pls help?

  16. Hi Priya,
    I am Karthik, been a tester for 7+ years, now i am asked to work on elasticsearch (lucene under the hood), i have spent few weeks on this and all that i did was

    1) Copied few 100s of XML files into a folder
    2) Converted each of them into JSON object and indexed it as Documents
    ( using http://www.json.org/java/index.html)
    3) As part of elastisearch mapping (schema of XML) , all the elements of xml became fields with their correspoding type (String, Long , int , Date etc)

    could please send me an email (karthikbm1809@gmail.com) ,need to ask you on indexing of PDF,HTML and XML files in the actual way

    Regds
    Karthik

    1. Hi Karthik,

      I am no elasticsearch expert. All I would suggest is to go through required documents or get help from elasticsearch forum to proceed in the right way. Search is very interesting as always and I hope you find it easy after you are done with the exploration. Good luck!

