[#] Optical Character Recognition and You: A Primer and Handy Guide
02:07am EST - 11/13/2008

But what’s OCR? Why should I care if my PDFs are tagged with (OCR)? Well, fair reader, let me explain.

Optical Character Recognition, or OCR, is a method of recognizing the letters in a scanned PDF in order to make it searchable. By stringing letters together to form words, the program (Adobe Acrobat, in this case) is able to turn what was a JPEG or similar image into clickable, highlightable, searchable text. Sounds pretty handy, doesn’t it? Why yes, yes it is! OCR technology makes reading a PDF that much simpler. In addition to making the text interactive, it also allows the PDF to be compressed. Take your average Dark Heresy scan. It’s about 200 megs, give or take a few depending on which scan you have. Once run through an OCR converter, that size can shrink all the way down to under 50 megs. The same process works for any other scanned book out there.
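To put that size difference in perspective, here is a quick back-of-the-envelope sketch. The function name `size_reduction` is made up for illustration, and the 200 and 50 meg figures are just the rough numbers quoted above, not measurements of any particular file.

```python
def size_reduction(before_mb: float, after_mb: float) -> float:
    """Return the fraction of the original file size that was saved (0.0 to 1.0)."""
    if before_mb <= 0:
        raise ValueError("before_mb must be positive")
    return (before_mb - after_mb) / before_mb


# Rough figures from the post: a ~200 meg scan shrinking to ~50 megs.
saving = size_reduction(200, 50)
print(f"{saving:.0%} smaller")  # prints "75% smaller"
```

Since the OCR'd file comes out at under 50 megs, 75% is the low end of the saving for that book.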

“But if OCR is so good, why aren’t all books OCR’d?” one might wonder. Well, there is a slight loss of quality involved in the process. The same Dark Heresy book remained an excellent scan, but it did lose some quality during OCR. The other problem is the necessary software, and not just any program will do: Adobe Acrobat is the only one I’ve found so far that is able to perform the process. Along with the program, a certain amount of time is needed, usually a few hours for the OCR to finish. During this period, the program uses a large amount of system resources, making other endeavors nigh on impossible.

However, this brave author has started OCRing books in a venture he calls UNLIMITED OCR WORKS. Through this program, he’s OCR’d a fair number of books, and the library grows often. Though the process is intensive, I have a number of books that I have OCR’d personally. The catalog will be updated and posted shortly, along with handy links.

If you have a particular or special request, please send it in. It’s kind of fun to OCR books and stuff, so let me know over IRC or something.



1 PurpleXVI
03:09am UTC - 11/13/2008
Isn't there also the issue of the OCR software not always perfectly recognizing characters, especially if some parts of a scan are imperfect or if it uses an unusual font?

I know I've occasionally encountered some "bugs" in PDFs which could easily be attributed to that, and it's pretty annoying.

2 Fatum
05:52pm UTC - 11/13/2008
Well, Purple, typos are annoying, but having to clear one or two is still better than typing those pesky ten lines of rules.

And when it comes to requests, Ishallcallu, we sure could use the 5th edition WH40K codexes, or the DH core book and supplements like the Inquisitor's Guide.

3 Ishallcallu
11:50pm UTC - 11/13/2008
@Purple: Yeah, some funny things do happen. When copypasta'ing stuff out, not quite all the characters make it. I can't remember which particular string, but I think it's "if." It returns "? f", which is weird as hell, but hey. Shit ton better than typing out your lines of book.

@Fatum: I have the DH Inquisitor's Handbook ready to go. I'll up it right now. I might have 5th WH40K, but only SM. DH core I need to check.

4 PurpleXVI
05:09am UTC - 11/14/2008
Oh, definitely, just pointing out that it isn't entirely infallible.

If you ever decide to OCR some 2nd edition stuff, like maybe some Planescape, or some Dark Sun, I've got a pile of that.


