Evotec PDF OCR IFilter

Evotec PDF OCR IFilter lets you search within scanned PDF documents by using OCR to recognize their text.

The main use cases where this functionality is especially useful are:

  • Scan vendor invoices and find them by product, serial number, VAT number, etc.
  • Scan signed contracts, where you cannot keep the original Office document, only the scanned copy.
  • Scan IDs, documents, passports, and find them by name, ID, etc.

Main features

OCR Engine Features

We use the well-known OCR engine “tesseract-ocr” to convert images to text within PDF documents. The engine was developed at HP Labs between 1985 and 1995; in 2006 Google took over the project and has continued its development. The whole project is open source (Apache License).

More information can be found here:

http://code.google.com/p/tesseract-ocr/

SharePoint Search Integration

Other solutions on the market take a different approach to enabling OCR-based search in PDFs: they modify the original PDF by adding a hidden text layer that contains the OCR output. This approach has some interesting advantages, but it also has a significant drawback. In some cases it cannot be applied at all, as with signed PDFs, and it causes problems for auditing and for existing SharePoint workflows in which documents must not be modified. In those scenarios you have to provide a separate document library for the OCR'd copies, which duplicates information and confuses users.

So we took a different approach: an IFilter that integrates with Search and is completely transparent to existing content and custom developments, because it does not interact with them.

Adobe IFilter 64-bit integration

Our PDF OCR IFilter integrates with the Adobe IFilter. On the first pass, every PDF document is indexed by the original Adobe PDF IFilter, which crawls all internal metadata and text. If the extracted text is longer than a configured length (this is customizable), the document is not passed through OCR, because it is probably already a text-based document. If, however, the Adobe IFilter does not find a significant amount of text, our OCR engine takes over and extracts all the text it can from the document.

This feature is optional: the product automatically detects whether the Adobe IFilter is installed and uses it if so.
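The fallback decision described above can be sketched roughly as follows. This is only an illustration, not the product's code: the function names, the `adobe_extract` and `run_ocr` callables, the default threshold of 200 characters, and the choice to keep any short embedded text alongside the OCR output are all assumptions.

```python
from typing import Callable, Optional

# Hypothetical default; the product exposes the real threshold as a setting.
MIN_TEXT_LENGTH = 200


def extract_text(
    pdf_bytes: bytes,
    run_ocr: Callable[[bytes], str],
    adobe_extract: Optional[Callable[[bytes], str]] = None,
    min_text_length: int = MIN_TEXT_LENGTH,
) -> str:
    """Return indexable text, falling back to OCR for image-only PDFs."""
    # adobe_extract is None when the Adobe IFilter is not installed.
    embedded = adobe_extract(pdf_bytes) if adobe_extract is not None else ""

    # Enough embedded text: the PDF is probably not a scan, so skip OCR.
    if len(embedded.strip()) > min_text_length:
        return embedded

    # Little or no embedded text: run OCR and keep whatever was already found
    # (combining both is an assumption made for this sketch).
    ocr_text = run_ocr(pdf_bytes)
    return (embedded + "\n" + ocr_text).strip()
```

In this sketch the expensive OCR step only runs for documents that fall below the threshold, which is the point of trying the Adobe IFilter first.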


High performance and large scale environments

Evotec PDF OCR IFilter uses a lot of CPU while performing OCR, which can be a significant issue in large-scale deployments. To address this, it implements a central cache location so that each document is OCR'd only once. Duplicate documents are also detected and processed only once.
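One common way to build such a cache is to key each entry on a hash of the file contents, so identical documents map to the same entry regardless of name or location. The sketch below assumes that design; the source does not say how the product detects duplicates, and `cached_ocr`, `cache_dir`, and `run_ocr` are hypothetical names.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def cached_ocr(pdf_bytes: bytes, cache_dir: Path, run_ocr: Callable[[bytes], str]) -> str:
    """Return OCR text for a document, running OCR only on a cache miss."""
    # Key on the content hash so duplicate files share a single cache entry.
    key = hashlib.sha256(pdf_bytes).hexdigest()
    entry = cache_dir / f"{key}.txt"

    if entry.exists():
        return entry.read_text(encoding="utf-8")

    text = run_ocr(pdf_bytes)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file and rename it, so a crawl server sharing
    # this location never reads a half-written entry.
    tmp = entry.with_name(f"{key}.{os.getpid()}.tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(entry)
    return text
```

If `cache_dir` points to a shared network location, several crawl servers can reuse each other's results.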

We can also deploy additional crawl servers in the SharePoint topology, all pointing to the same cache location, so the process can be scaled out as far as the hardware allows.

It can also be configured to limit the number of concurrent OCR threads, so servers with many CPUs can run several OCR jobs in parallel.
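As a rough illustration of capping OCR concurrency, a bounded worker pool runs at most a fixed number of jobs at a time and queues the rest. This is a generic sketch, not the product's mechanism; `ocr_all`, `run_ocr`, and `max_workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ocr_all(
    documents: List[bytes],
    run_ocr: Callable[[bytes], str],
    max_workers: int = 2,
) -> List[str]:
    """OCR every document, running at most max_workers jobs at once."""
    # The pool never runs more than max_workers tasks concurrently;
    # remaining documents wait in its internal queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input documents.
        return list(pool.map(run_ocr, documents))
```

Raising `max_workers` on a server with many cores increases throughput, while a low value leaves CPU available for other crawl work.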

Other uses: not only for SharePoint

Because this module implements the standard Microsoft IFilter interface, the same component can also be used to index documents in Desktop Search, as well as BLOB columns in SQL Server through full-text indexing.

SharePoint compatible versions

This component is version independent (you only need to configure the Windows registry to enable it in SharePoint):

  • SharePoint Services 3.0, SharePoint 2007 Portal, Search Server 2007
  • SharePoint Foundation 2010, Search Express 2010, SharePoint Server 2010 and SharePoint 2013


© Evotec Consulting S.L. Gestión de Sistemas Informáticos