Optical Character Recognition (OCR) Solution with SharePoint

The benefits of combining OCR and SharePoint are clear. SharePoint search cannot index image-only PDFs, so many documents that should match a search request are missed. DMC's OCR solution solves this problem by checking PDFs uploaded to SharePoint and running them through OCR if needed. Once a PDF has been processed by OCR, it contains a text layer that search can pick up.

We recently added a number of features to our OCR solution and I’d like to detail them here. But first, here’s a quick reminder of how our OCR solution works:

Our solution is configured to monitor any number of content sources. A content source can range in size from a single document library to an entire site collection. Either on a schedule or on demand, the solution searches all content sources for newly uploaded PDF files. It then downloads the files it found, processes them with the AquaForest OCR engine, and uploads them to SharePoint at the same location. The solution can be configured to archive the original files. The latest version of our solution supports SharePoint 2010, SharePoint 2013, and SharePoint Online (Office 365).
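The find, download, process, and upload cycle described above can be sketched in a few lines of Python. This is only an illustration and not DMC's implementation: the `Library` class is a hypothetical in-memory stand-in for a SharePoint content source, and `fake_ocr` is a placeholder for the real OCR engine call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Library:
    """Hypothetical stand-in for one content source (path -> file bytes)."""
    files: Dict[str, bytes] = field(default_factory=dict)
    archive: Dict[str, bytes] = field(default_factory=dict)
    processed: set = field(default_factory=set)


def find_new_pdfs(library: Library) -> List[str]:
    """Return the paths of PDF files that have not been processed yet."""
    return sorted(
        path for path in library.files
        if path.lower().endswith(".pdf") and path not in library.processed
    )


def process_library(library: Library,
                    ocr: Callable[[bytes], bytes],
                    keep_original: bool = True) -> List[str]:
    """Run one crawl: OCR each new PDF and write it back to the same path."""
    done = []
    for path in find_new_pdfs(library):
        original = library.files[path]        # "download" the file
        searchable = ocr(original)            # add the text layer
        if keep_original:
            library.archive[path] = original  # optional archive copy
        library.files[path] = searchable      # "upload" to the same location
        library.processed.add(path)
        done.append(path)
    return done


def fake_ocr(pdf_bytes: bytes) -> bytes:
    """Placeholder for the OCR engine; it only tags the bytes."""
    return pdf_bytes + b"[text layer]"
```

Calling `process_library(lib, fake_ocr)` a second time returns an empty list, because every PDF found in the first pass is recorded as processed.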

So what’s new? Let’s take a look:

Run as a Windows Service
Our solution can now be installed as a Windows service. This means that it no longer requires an active user session to run, and that it starts up again automatically when the computer it is running on is rebooted.

Support for Advanced SDK
In our experience, AquaForest’s Advanced SDK performs much better for customers that have low-quality scans. For this reason, we have made sure that our solution now also supports the Advanced SDK.

Recrawl
We added a “Recrawl” button to support fully reprocessing all PDFs in a given site (and its subsites) or library. This can be useful, for example, after upgrading to the Advanced SDK, in order to reprocess PDFs that were previously processed with the Standard SDK.

Priority
We can assign a priority to PDFs as they are found, so that more important files are processed first. The priority is assigned by content source. One use case is a bulk upload that might take weeks to fully process. In that case it may be desirable to keep processing newly added files while the large batch is still being worked through. The priority system allows this by setting a low priority for the large batch and a higher priority for newer files.
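One common way to get this behavior is a priority queue keyed by content source. The sketch below assumes that a lower number means a higher priority and that files within the same priority are handled in the order they were found. The class name, the source names, and the default priority are all hypothetical and are not taken from the product.

```python
import heapq
import itertools
from typing import Dict, List, Tuple


class OcrQueue:
    """Pending PDFs ordered by the priority of their content source."""

    def __init__(self, source_priority: Dict[str, int],
                 default_priority: int = 5) -> None:
        self._source_priority = source_priority
        self._default = default_priority
        self._heap: List[Tuple[int, int, str]] = []
        # The counter breaks ties so equal priorities stay first in, first out.
        self._counter = itertools.count()

    def add(self, source: str, path: str) -> None:
        """Queue a file using the priority configured for its source."""
        priority = self._source_priority.get(source, self._default)
        heapq.heappush(self._heap, (priority, next(self._counter), path))

    def pop(self) -> str:
        """Return the next file to process (lowest priority number first)."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With `OcrQueue({"backlog": 9, "inbox": 1})`, a file added from `inbox` is returned by `pop()` before any file still waiting from `backlog`, even if the backlog files were queued first.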

New Image-Only check
Simply testing whether a PDF has a text layer can misclassify documents, because some scanners add a text layer that contains only page numbers. With the previous check, these documents were treated as already searchable and skipped. Our new logic runs the PDF through OCR and then compares the before and after results. Because it requires more processing time, this new logic can be disabled.
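The post does not say how the before and after results are compared, so the following is only one plausible sketch. It assumes the text has already been extracted from the PDF before and after OCR, and it treats a document as effectively image-only when OCR yields at least `min_gain` times as many non-whitespace characters as the existing text layer. Both the function name and the threshold are assumptions.

```python
def needs_ocr(text_before: str, text_after: str, min_gain: float = 2.0) -> bool:
    """Decide whether the OCR result should replace the original PDF.

    text_before: text extracted from the PDF as uploaded
    text_after:  text extracted after running the PDF through OCR
    min_gain:    assumed ratio above which the original counts as image-only
    """
    # Count non-whitespace characters so layout differences do not matter.
    before = len("".join(text_before.split()))
    after = len("".join(text_after.split()))
    if after == 0:
        return False  # OCR found nothing, so there is nothing to gain
    if before == 0:
        return True   # no text layer at all: a plain image-only PDF
    return after / before >= min_gain
```

A scan whose only text layer is the page numbers `"1 2 3"` passes the check once OCR recovers the body text, while a PDF that already contains its full text does not.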

If this sounds useful to you, please contact us. We can also customize this solution to match your needs and integrate it into your workflows.

Learn more about DMC's Microsoft Consulting Services.

Comments

ContCentric IT Services Pvt Ltd
Thank you for sharing such nice information on OCR.
OCR is an important aspect of ECM. If you are a developer and want to know more about the programming side of the same, you can refer this blog post by us: http://www.contcentric.com/configuring-ocr-in-alfresco/
