Search text in PDF files using Java (Apache Lucene and Apache PDFBox)

I recently came across a requirement to find out whether a specific word is present in a PDF file. At first I thought this was trivial and wrote a small Java application that extracted the text from the PDF file and then did a linear character match such as mystring.contains(mysearchterm). It gave me the expected output, but linear matching is suitable only when the content you are searching is small. On larger content it gets expensive: the cost grows as O(np), where n is the number of words to search through and p is the number of search terms.
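For reference, here is a rough sketch of that first attempt. The class and method names are made up for illustration, the text is hard-coded instead of being extracted from a PDF, and the comparison is lowercased, which is my assumption rather than part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LinearScanSketch {

    // Returns the terms that occur in the text; each term triggers its own scan of the text.
    static List<String> findTerms(String text, List<String> terms) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String term : terms) {
            if (haystack.contains(term.toLowerCase(Locale.ROOT))) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hard-coded stand-in for text extracted from a PDF
        String text = "Hello World by PDFBox";
        System.out.println(findTerms(text, Arrays.asList("Hello", "Lucene", "pdfbox")));
    }
}
```

Every extra search term means another pass over the whole text, which is where the n times p cost comes from.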

A better solution is a simple search engine that first parses all your data into tokens to build an index, and then lets you query that index to retrieve matching results. In other words, the whole content is broken down into terms, and each term points back to the content it came from. For example, consider the raw data,

1,hello world

2,god is good all the time

3,all is well

4,the big bang theory

The search engine will create an index like this,

all -> 2,3

hello -> 1

is -> 2,3

good -> 2

world -> 1

the -> 2,4

god -> 2

big -> 4

time -> 2

well -> 3

bang -> 4

theory -> 4
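To make the idea concrete, here is a minimal sketch that builds such an index for the four lines of raw data above, using only the Java standard library. The class name and the whitespace-based splitting are my simplifications; a real search engine such as Lucene does far more work than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Maps each lowercase term to the ascending list of document ids that contain it.
    static Map<String, List<Integer>> build(Map<Integer, String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        // Iterate in ascending id order so that each id list comes out sorted
        for (Map.Entry<Integer, String> doc : new TreeMap<>(docs).entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = index.computeIfAbsent(term, k -> new ArrayList<>());
                if (!ids.contains(doc.getKey())) {
                    ids.add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "hello world");
        docs.put(2, "god is good all the time");
        docs.put(3, "all is well");
        docs.put(4, "the big bang theory");

        Map<String, List<Integer>> index = build(docs);
        System.out.println("all -> " + index.get("all"));
        System.out.println("the -> " + index.get("the"));
        System.out.println("contains 'lucene'? " + index.containsKey("lucene"));
    }
}
```

Once the index is built, answering "which documents contain this word" is a single map lookup instead of a scan over every document.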

Full-text search engines are what I am referring to here; they search large volumes of unstructured text quickly and effectively. There are many other things you can do with a search engine, but I am not going to cover them in this post. The aim is to show you how to create a simple Java application that searches for a particular keyword in a PDF document and tells you whether the document contains it. The open source full-text search engine I am going to use is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Lucene cannot extract text from PDF files. All it does is create an index from text and let us query that index to retrieve matching results. To extract the text from PDF documents, let us use Apache PDFBox, an open source Java library; the extracted content can then be fed to Lucene for indexing.

Let's get started by downloading the required libraries. Please stick to the versions I am using, since later versions may require a different implementation.

1. Download Apache Lucene 3.6.1. Unzip the archive and find lucene-core-3.6.1.jar.

2. Download Apache PDFBox 0.7.3. Unzip it and find pdfbox-0.7.3.jar.

3. Download fontbox-0.1.0.jar. If this library is missing from the classpath, the project fails at run time with a NoClassDefFoundError (for example, for org/fontbox/cmap/CMapParser).

The next step is to create a Java project in Eclipse. Right-click the project in the Project Explorer, go to Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.

4. Create a class named SimplePDFSearch.java. This is the main class, which performs each action in turn. Copy and paste the code below into it, and change the package name to the package in which you created the class.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.queryParser.ParseException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;


public class SimplePDFSearch {
    // location where the index will be stored
    private static final String INDEX_DIR = "src/main/resources/index";
    private static final int DEFAULT_RESULT_SIZE = 100;

    public static void main(String[] args) throws IOException, ParseException {

        File pdfFile = new File("src/resources/SamplePDF.pdf");
        IndexItem pdfIndexItem = index(pdfFile);

        // create an Indexer and add the extracted item to the index
        Indexer indexer = new Indexer(INDEX_DIR);
        indexer.index(pdfIndexItem);
        indexer.close();

        // create a Searcher to query the index
        Searcher searcher = new Searcher(INDEX_DIR);
        int result = searcher.findByContent("Hello", DEFAULT_RESULT_SIZE);
        print(result);
        searcher.close();
    }

    // Extract the text from a PDF document and wrap it in an IndexItem
    public static IndexItem index(File file) throws IOException {
        PDDocument doc = PDDocument.load(file);
        try {
            String content = new PDFTextStripper().getText(doc);
            return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
        } finally {
            // release the document even if text extraction fails
            doc.close();
        }
    }

    // Print the result
    private static void print(int result) {
        if (result == 1) {
            System.out.println("The document contains the search keyword");
        } else {
            System.out.println("The document does not contain the search keyword");
        }
    }
}

5. Next we need a class that holds the items to be indexed from a PDF file. Create a class named IndexItem.java and paste the code below into it. It tells the search engine which contents of the PDF file to store and retrieve: a unique ID, the file name and the contents (text) of the file.

package com.programmingfree.simplepdfsearch;

public class IndexItem {
    private Long id;
    private String title;
    private String content;

    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", content='" + content + '\'' +
                '}';
    }

}
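One caveat about the "unique" ID: SimplePDFSearch derives it from file.getName().hashCode(), and String.hashCode() can return the same value for two different names. The sketch below only demonstrates that collision; the class name, helper and file names are hypothetical.

```java
public class IdCollisionSketch {

    // Same derivation as in SimplePDFSearch.index(): a long built from the name's hashCode.
    static long hashId(String fileName) {
        return (long) fileName.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have equal hash codes, and the shared ".pdf" suffix keeps them equal
        System.out.println(hashId("Aa.pdf") == hashId("BB.pdf"));
    }
}
```

For a single PDF this does not matter. If you index many files, though, the Indexer in the next step deletes any existing document with the same ID before adding a new one, so a collision would silently replace one file with another. Using the file path itself as the ID string would be one way around that.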

6. The next step is to create a class that indexes the contents of the PDF documents. Create a new class named Indexer.java, as referred to in the main class, and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // open the index directory, creating the index if it does not exist yet
        writer = new IndexWriter(FSDirectory.open(new File(indexDir)),
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
    }

    /** 
      * This method will add the items into index
      */
    public void index(IndexItem indexItem) throws IOException {

        // deleting the item, if already exists
        writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));

        Document doc = new Document();

        doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));

        // add the document to the index
        writer.addDocument(doc);
    }

    /**
      * Closing the index
      */
    public void close() throws IOException {
        writer.close();
    }
}

7. The last step is to create a class that queries the index built by the Indexer class. Create a class named Searcher.java and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Searcher {

    private IndexReader reader;
    private IndexSearcher searcher;
    private QueryParser contentQueryParser;

    public Searcher(String indexDir) throws IOException {
        // open the index directory to search
        reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
        searcher = new IndexSearcher(reader);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // query parser that searches the content field by default
        contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);
    }

    /**
      * Finds indexed items by their content.
      * @param queryString - the query string to search for
      * @param numOfResults - the maximum number of hits to collect
      * @return 1 if at least one document matches, 0 otherwise
      */
    public int findByContent(String queryString, int numOfResults) throws ParseException, IOException {
        // create a query from the incoming query string
        Query query = contentQueryParser.parse(queryString);
        // execute the query and get the results
        ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;
        return queryResults.length > 0 ? 1 : 0;
    }

    public void close() throws IOException {
        searcher.close();
        // the searcher does not close a reader that was passed to its constructor
        reader.close();
    }
}

That is all we have to do before running the program to find out, quickly and efficiently, whether a word is present in a PDF file. Note that the main class (SimplePDFSearch.java) has a field named INDEX_DIR, which holds the path where the index is stored. The index is kept between runs; before adding a document, the Indexer deletes any existing entry with the same ID, so running the program again replaces the old entry instead of duplicating it. I have used a sample PDF document that contains the following text,

"Hello World by PDFBox"

I am searching for the word "Hello", which is passed as a parameter to the findByContent method of the Searcher class, and the output is,

The document contains the search keyword
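The search matches regardless of case because the same StandardAnalyzer processes both the indexed text and the query. The sketch below is a much simplified stand-in for what an analyzer does, not Lucene's actual code: the class name, the splitting rule and the short stop-word list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {

    // Small assumed sample of stop words, for illustration only
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "by", "is", "the"));

    // Lowercases the text, splits on non-alphanumeric characters and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Hello World by PDFBox");
        System.out.println(indexed);
        System.out.println(indexed.containsAll(analyze("HELLO")));
        System.out.println(analyze("by"));
    }
}
```

StandardAnalyzer also removes common English stop words by default, so a search for a word like "by" finds nothing even though the word appears in the document.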

Download the source code (use the download button at the beginning of this article) and try it yourself to understand this better.

Please leave your comments and questions about this post in the comment section, so that I can improve my writing and publish more useful posts. Thanks for reading!

Posted by Unknown

luciano, 16 January 2013 at 06:54
Hello, i'm trying use Phrasequery to search exact phrase 'Hello World'. Can help me?

Unknown, 19 January 2013 at 11:35
Hello Priya,
I am trying to write a java program to search a word from first page(or paragraph) of a pdf file. Searching a word and its count of ocurance is enough. Advice please. Thanks in advance.

Vidyasagar Prasad, 24 January 2013 at 01:49
Hi
i have multiple pdf files in one folder ...so task is that in software there will be 2 input box
for
Browsing :- this will browse to that folder
name:- name of any person which you want to find (search in pdf)

and then when we will click on search it will check all the pdf available in that folder and then will check the name inside all pdf when it will get it should show the output below...
pdf file name :- first output
pdf file page no.:-second
person name:-which we searched
father's name:- searched person's father's name
sex:-M or F
Age:-
PLEASE HELP...

luciano, 1 April 2013 at 12:00
Hi Pryia,

My application was returning error org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
PDDocument doc = PDDocument.load(file);
String content = new PDFTextStripper().getText(doc);

Can help me?
Thanks

luciano, 3 April 2013 at 10:32
Pryia thanks for all,
I have one more question, i'm trying to remove the accents in the search, find words removing special characters such as accents ("ANDRÉ" equals "ANDRE").
I found the class ICUTokenizer but got the error NoSuchMethodError: com.ibm.icu.text.UnicodeSet.freeze.
http://lucene.apache.org/core/4_2_0/analyzers-icu/index.html

My Notes, 9 April 2013 at 03:11
Hi Priya..
I am into Testing. I need to write a code to verify a PDF.

I need to Search for Somewords in a PDF which contain data in a Tabular format.
Here are my queries..
How to retrieve the position of a word if it is found in the PDF.
If the word found, How to read that entire line

Please help me
Thanks in advance..
Ani

Unknown, 24 April 2013 at 01:46
what changes would i have to do for content based search on the following files
txt,doc,pdf,xls and csv

Unknown, 12 June 2013 at 04:31
if i search for "by" then it says doesnt contain the keyword

Unknown, 21 July 2013 at 01:48
Hi All, Please find the below requirement and suggest any solution for this.
--> I have a PDF document in my Local drive.
--> There is a table in the document and i need to find the exact value under a column name. e.g., i have 3 columns such as 'User ID', 'Password' and 'Type of User'. Now, i will provide the User ID and i need to get the Type of User for that ID.
Can anyone suggest if this is possible using Vb-Script or Java? If yes, please publish your thoughts. Thanks in advance.

Unknown, 18 September 2013 at 04:24
Does it support Arabic PDF files, as I have Arabic pdf files and I want to search for specific words inside it?

Demo, 23 September 2013 at 06:37
Hi Frd,

Above code works fine with single word like "Hello" or "World". If i try to search "Hello World"..Its says not available in the document... Pls help me to resolve this!!!

nikunj chauhan, 16 October 2013 at 03:29
Hi Priya,

Very nice introductory tutorial! Thanks for putting it up... It really is helpful for someone new to PDFBox and Lucene.

I am having the following scenario which I could not find in the comments before me:

I have a set of pdf documents (say 2000) created using Actuate Reports. There are scatterred key value pairs in every PDF with format like "customer=1234". Again on some other page it could be "customer=1456", etc.

I want to parse every pdf and fetch all the customer values from inside my java program.

Using the code above and modifying it as per my requirement, I think I will be able to get all the occurrences of "customer=" string and then through some String processing getting the next token before space and after "customer=" string as the value I want.

My questions are as below:

1) Is this way of getting the value is correct ? Or is there any option present in Lucene which directly fetches the value given a key as is my case.

2) My pdf documents will be around 2 to 5 pages long. So will it be ok to parse thousands of pdf at a time or will there be a performance issue ? Is there a way you can guide to improve performance ?

3) If the pdf background is white and the string "customer=1234" is also written in white color fonts (which means the will be physically invisible), then in that case will PDFBox be able to fetch the text such that I can search through lucene later ?

Thanks in advance for your help! Meanwhile, I will try to work with your program to get answers to my questions.

Keep up the good work!

Regards,
Nik

Unknown, 21 October 2013 at 03:50
Thanks PRIYA! dont have enoguh words to thank you!

Unknown, 11 February 2014 at 23:21
I did a indexing of files like pdf,ppt,docs. It display the file containing the particular word. Now I need to show the line in which the particular word occurs. Any idea on how to do that?

Siri, 2 April 2014 at 15:14
Hi Priya, How do I get the Coordinate location of the searched text? How do we use PrintTextLocations or TextPositions or some custom class? Can you pls help?

Karthik b matha, 26 April 2014 at 11:23
Hi Priya,
I am Karthik, been a tester for 7+ years, now i am asked to work on elasticsearch (lucene under the hood), i have spent few weeks on this and all that i did was

1) Copied few 100s of XML files into a folder
2) Converted each of them into JSON object and indexed it as Documents
( using http://www.json.org/java/index.html)
3) As part of elastisearch mapping (schema of XML) , all the elements of xml became fields with their correspoding type (String, Long , int , Date etc)

could please send me an email (karthikbm1809@gmail.com) ,need to ask you on indexing of PDF,HTML and XML files in the actual way

Regds
Karthik

Unknown, 19 January 2015 at 02:32
How to get the word count in the pdf file

Unknown, 10 January 2016 at 22:40
Thanks Thanks a lot..
it really helped me.
Keep doing good work like this..
All the best :)

Unknown, 4 March 2016 at 18:05
Hi Priya, I am not able download the file through the given link. it is showing error. Can you please provide the alternate link to download.

satya, 27 June 2016 at 03:15
Hi Priya,

Thanks for this very good post.

However my requirement for a POC on concepts like classification and indexing documents(PDF, word doc, XML, text..etc) and search among them. When I am using lucene library to do though indexing is working with simple API for pdf and xml files, but when i am executing search the correct result is not coming as output. Could you please suggest some thoughts on this ?

M.R.ANIRUDH, 16 July 2016 at 21:52
Is this your website too ? http://geekonjava.blogspot.com/2015/08/search-text-in-pdf-using-java-apache.html I don't see your name as the author here.

Unknown, 28 September 2016 at 01:53
Hello priya,

Thanks for your advice.

I am having 8 number of pdf files and I want to search a word in all these 8 pdf but I want the output only the pdf files which contains that my given searching word.

Please advice me how to do it in java and if you have any related link for that please post here.

Once again thank you.

velkris, 18 June 2017 at 02:45
Nice tutorial to get started with Lucene and PDF box

Mitul Vaghela, 25 November 2017 at 05:57
Hello,

Is it possible to find the page number of the string being searched?

Unknown, 28 December 2017 at 05:16
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/cmap/CMapParser
at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:534)

How can i fix this error while running the project?
Thanks in advance.

xcod4r, 9 January 2019 at 01:36
Learning Examples

zan, 22 March 2019 at 13:45
Hello, I don't know java but I need to research in a file pdf (an electronics topografic) a list of words (R1, C1, L1 etc). Your program can be used for this?

gmk, 10 May 2019 at 00:01
This article is so much helpful. I followed the steps and exactly got what I wanted ! Many Many Thanks !!