SharePoint Optical Character Recognition (OCR) Solution for Image Only PDFs

Posted in Digital Workplace Solutions, SharePoint

Summary

DMC's consulting services team implemented our SharePoint OCR Solution to convert Image Only PDF documents to searchable text for an established law firm based in Chicago, Illinois. The solution automatically scanned every document stored in the SharePoint Document Management System, identified Image Only PDF files, added a text layer to those files via optical character recognition, and re-saved them to the SharePoint Document Management System, where they could be indexed by SharePoint's Enterprise Search engine.

Solution

Approximately 60% of the law firm's files are PDF files, and one third of these PDFs are Image Only. Because an Image Only PDF contains no text layer, its content cannot be searched.

The law firm asked DMC to scan the 700,000+ files in its existing SharePoint document repository and convert the Image Only PDF documents to searchable documents using optical character recognition (OCR).

To help the law firm's staff quickly locate key documents, DMC built an application that first scanned all documents already in SharePoint to determine which were Image Only PDFs. These documents were then processed by an OCR module built upon the Aquaforest OCR SDK, which made their textual content searchable via SharePoint. The law firm's SharePoint document repository of 700,000+ files was scanned and converted in approximately 45 days, with a 96% success rate of adding a searchable text layer to image-only PDF files.
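The first pass described above, deciding which documents need OCR at all, can be sketched roughly as follows. This is a minimal illustration and not DMC's implementation: the `Document` record and the `is_image_only` and `needs_ocr` functions are hypothetical names, and the per-page text is assumed to have been extracted already by whatever PDF library is in use. The OCR step itself, which the original solution delegated to the Aquaforest OCR SDK, is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Document:
    """Hypothetical record for one file in the document library."""
    url: str                # server-relative URL of the file
    page_texts: List[str]   # text already extracted per page (assumed input)


def is_image_only(page_texts: List[str], min_chars: int = 1) -> bool:
    """Return True when no page holds at least min_chars of non-whitespace text."""
    if not page_texts:
        # A file with no readable pages is not treated as an OCR candidate.
        return False
    return all(len("".join(text.split())) < min_chars for text in page_texts)


def needs_ocr(documents: Iterable[Document]) -> List[str]:
    """Return the URLs of PDF files that appear to have no text layer."""
    return [
        doc.url
        for doc in documents
        if doc.url.lower().endswith(".pdf") and is_image_only(doc.page_texts)
    ]


if __name__ == "__main__":
    library = [
        Document("/matters/scan-001.pdf", ["", "  \n"]),        # scanned pages only
        Document("/matters/brief.pdf", ["Motion to dismiss"]),  # already has text
        Document("/matters/notes.docx", [""]),                  # not a PDF
    ]
    print(needs_ocr(library))
```

Raising `min_chars` is one way to also catch scans that carry only a few stray characters of text, such as a stamped page number; whether that is appropriate depends on the document collection.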

A simple SharePoint keyword search now instantly retrieves a list of all files containing the specified keyword(s), providing quick access to the information in all of the client's document files, saving vital time for their employees and customers.

Since implementing the original SharePoint OCR application, DMC has upgraded the application for compatibility with SharePoint 2010, 2013, 2016, and Office 365 SharePoint Online. Features have also been added to identify newly uploaded PDF files and OCR them multiple times daily, as well as the ability to re-scan specific sites and libraries.
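One common way to pick up newly uploaded PDF files on a recurring schedule is to remember when the last run happened and select only files modified since then. The sketch below illustrates that idea under stated assumptions: `FileInfo` and `select_new_pdfs` are hypothetical names, the file listing is assumed to come from the document library, and nothing here is taken from DMC's actual application.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple, Tuple


class FileInfo(NamedTuple):
    """Hypothetical listing entry for one file in a library."""
    url: str
    modified: datetime


def select_new_pdfs(
    files: Iterable[FileInfo], last_run: datetime
) -> Tuple[List[str], datetime]:
    """Return URLs of PDFs modified after last_run, plus the next watermark."""
    selected = [
        f for f in files
        if f.url.lower().endswith(".pdf") and f.modified > last_run
    ]
    # Advance the watermark to the newest selected file; keep it if none matched.
    watermark = max((f.modified for f in selected), default=last_run)
    return [f.url for f in selected], watermark


if __name__ == "__main__":
    last_run = datetime(2020, 1, 1, 6, 0, tzinfo=timezone.utc)
    listing = [
        FileInfo("/site/docs/new-scan.pdf", last_run + timedelta(hours=2)),
        FileInfo("/site/docs/old-scan.pdf", last_run - timedelta(days=3)),
        FileInfo("/site/docs/memo.docx", last_run + timedelta(hours=1)),
    ]
    urls, next_watermark = select_new_pdfs(listing, last_run)
    print(urls, next_watermark.isoformat())
```

If re-saving a file after OCR updates its modified time, the next run may select it again; a check for an existing text layer would then skip it, so the extra pass costs a scan but not a second OCR.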

For more information on our SharePoint OCR Solution, please Contact Us.

Customer Benefits

PDF files can now be indexed by SharePoint Enterprise Search and instantly searched from SharePoint, allowing the law firm's staff to quickly locate documents using a simple keyword search
Automation of the OCR process saved at least 4,000 hours of staff time that would have been required to convert each PDF file individually
At least $150,000 was saved by implementing a custom solution rather than packaged OCR software applications, which are typically priced at $1+ per OCR'd page
Achieved a 96% success rate of adding a searchable text layer to image-only PDF files
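
The figures quoted in this case study can be combined into a rough back-of-envelope estimate of the workload. The sketch below is illustrative only: `estimate_workload` is a hypothetical helper, the inputs are the rounded values stated in the text (700,000 files, 60% PDFs, one third Image Only, 96% success, 45 days, 4,000 hours), and the derived numbers are approximations rather than reported results.

```python
def estimate_workload(
    total_files: int = 700_000,     # repository size quoted in the text
    pdf_percent: int = 60,          # share of files that are PDFs
    image_only_divisor: int = 3,    # one third of PDFs are Image Only
    success_percent: int = 96,      # quoted OCR success rate
    days: int = 45,                 # quoted processing time
    hours_saved: int = 4_000,       # quoted staff time saved
) -> dict:
    """Derive approximate workload figures from the quoted numbers."""
    pdfs = total_files * pdf_percent // 100
    image_only = pdfs // image_only_divisor
    converted = image_only * success_percent // 100
    return {
        "pdf_files": pdfs,
        "image_only_pdfs": image_only,
        "converted_pdfs": converted,
        # Average number of Image Only PDFs handled per day of the run.
        "image_only_per_day": round(image_only / days),
        # Manual effort per file implied by the staff-time estimate.
        "manual_minutes_per_pdf": round(hours_saved * 60 / image_only, 1),
    }


if __name__ == "__main__":
    for name, value in estimate_workload().items():
        print(f"{name}: {value}")
```

Under these assumptions, roughly 140,000 of the files would have been Image Only PDFs, and the staff-time estimate corresponds to a little under two minutes of manual work per file.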

Technologies

Microsoft SharePoint 2010
Microsoft Office SharePoint Server (MOSS) 2007
Microsoft .NET 3.5 Framework
Microsoft SQL Server 2008 R2
Microsoft Windows Server 2008 R2
Aquaforest OCR SDK
