Tesseract (software)

Tesseract
Screenshot	Tesseract 2.03 running on Gnome Terminal 2.26. "scanneddoc.tif" is the input document, which is rendered as "outputfile.txt" by Tesseract.
Original author(s)	Ray Smith, Hewlett-Packard^[1]
Developer(s)	Google
Stable release	2.04 / June 30, 2009^[1]
Preview release	3.00 - Revision 319 (svn) / December 1, 2009^[1]
Written in	C and C++
Operating system	Ubuntu 6.06 & 6.10 (32 & 64-bit), Windows (32-bit), and, unofficially, Mac OS X (x86) & Linux (32 & 64-bit)
Available in	Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Turkish, Ukrainian & Vietnamese (more can be added using included training files)
Development status	Active
Type	Optical character recognition
License	Apache License v2.0
Website	http://code.google.com/p/tesseract-ocr

In computer software, Tesseract is a free optical character recognition (OCR) engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 and 1995. After ten years without any development, Hewlett-Packard and the University of Nevada, Las Vegas (UNLV) released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0.^[1]^[2]^[3]

Tesseract is considered one of the most accurate free software OCR engines currently available.^[3]^[4]

About the Tesseract OCR Engine

Tesseract is a raw OCR engine: it has no document layout analysis, no output formatting, and no graphical user interface. The only input it accepts is a TIFF image of a single column of text, from which it produces text. Compressed TIFF files are supported only if libtiff is installed. It can detect whether text is monospaced or proportional. The engine was among the top three in character accuracy in the 1995 UNLV accuracy test. It compiles and runs on Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu Linux are rigorously tested by the developers.^[2]^[3]

Tesseract can process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch, and can be trained to work in other languages.^[3]

Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus. Further integration with programs such as OCRopus, to better support complicated layouts, is planned. Likewise, frontends such as FreeOCR can add a GUI to make the software easier to use for manual tasks.^[5]

History

The Tesseract engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado, between 1985 and 1994, with further changes made in 1996 to port it to Windows and a partial migration from C to C++ in 1998. Much of the code was originally written in C, with later parts written in C++. Since then, all of the code has been converted so that it at least compiles with a C++ compiler.^[2]

Currently, Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a macro-based list system. This system predates the C++ Standard Template Library and may be more efficient than Standard Template Library lists, but is reportedly harder to debug in the event of a segmentation fault. Another side effect of the C/C++ split is that C++ data structures must be converted to C data structures before the low-level C code is called. The migration to C++ is a step towards eliminating this conversion, though it is not yet complete.

Usage

Tesseract is an OCR engine, and it does not have a graphical user interface. It runs from the command line, and may be called with the command:^[6]

    tesseract image.tif output [options]
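
When Tesseract is used as a backend, the command above is typically issued by another program. The following Python sketch shows one way this might be done. The function names build_tesseract_command and run_tesseract are hypothetical and not part of Tesseract; the sketch also assumes that the tesseract executable is on the PATH and that the recognized text is written to the output name with ".txt" appended, as in the "outputfile.txt" example.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, options=()):
    """Return the argument list for `tesseract image.tif output [options]`."""
    return ["tesseract", str(image), str(output_base), *options]


def run_tesseract(image, output_base, options=()):
    """Run Tesseract on one image and return the recognized text.

    Assumptions (not confirmed by the article): the `tesseract` executable
    is on PATH, the result is written to `<output_base>.txt`, and that file
    is UTF-8 encoded.
    """
    command = build_tesseract_command(image, output_base, options)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Under these assumptions, run_tesseract("scanneddoc.tif", "outputfile") would execute "tesseract scanneddoc.tif outputfile" and return the contents of "outputfile.txt".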

Tesseract handles image files in TIFF format (with filename extension .tif);^[6] other file formats need to be converted to TIFF before being submitted to Tesseract.
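
Because only TIFF input is accepted, a calling script may want to check a file before submitting it. The following is a minimal Python sketch, not part of Tesseract: the helper name looks_like_tiff is hypothetical, and the check only inspects the four-byte header of a classic TIFF file (a byte-order mark followed by the number 42), so it does not verify that the rest of the file is valid.

```python
# Classic TIFF files start with a byte-order mark ("II" for little-endian,
# "MM" for big-endian) followed by the number 42 in that byte order.
TIFF_HEADERS = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file at `path` starts with a classic TIFF header."""
    with open(path, "rb") as handle:
        return handle.read(4) in TIFF_HEADERS
```

A file that fails this check would need to be converted to TIFF with an image tool before being passed to Tesseract.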

Tesseract does not support layout analysis, which means that it cannot interpret multi-column text, images, or equations; for such input it produces garbled text output.^[3]

References

1. Google (2008). "tesseract-ocr". http://code.google.com/p/tesseract-ocr/. Retrieved 2008-07-12.
2. Vincent, Luc (August 2006). "Announcing Tesseract OCR". http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html. Retrieved 2008-06-26.
3. Canonical Ltd. (June 2008). "OCR". https://help.ubuntu.com/community/OCR. Retrieved 2008-07-12.
4. Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". http://www.linux.com/articles/57222. Retrieved 2008-07-18.
5. Softi Software (2008). "FreeOCR.net V2.4 Free OCR Software". http://softi.co.uk/freeocr.htm. Retrieved 2008-06-26.
6. Google. "Tesseract Readme". Google Code. http://code.google.com/p/tesseract-ocr/wiki/ReadMe.

External links

Tesseract OCR Project page on Google Code
Information Science Research Institute at the University of Nevada, Las Vegas
http://tesseract-ocr.repairfaq.org/ - C/C++ structure of Tesseract extracted from Doxyfied source code (based on Tesseract V1.03)
Archivista Box - A complete GPL document management system based on Tesseract and Linux.
Tesseract - Summary - some patches for training on a 64-bit machine.
Tesseract OCR Engine What it is, where it came from, where it is going.
VietOCR - Java/.NET GUI frontend for Tesseract OCR engine
