GOT-OCR 2.0: A New Era of OCR with Multi-Lingual, Multi-Modal Capabilities
Beijing, China – A new open-source end-to-end OCR model, GOT-OCR 2.0, has been released, promising to usher in a new era of optical character recognition with its advanced capabilities. Developed by researchers at the University of Chinese Academy of Sciences, GOT-OCR 2.0 boasts multi-lingual, multi-modal recognition and diverse input/output formats, making it a powerful tool for various applications.
GOT-OCR 2.0 stands out for its ability to handle a wide range of optical characters, including text, mathematical formulas, molecular formulas, charts, musical scores, and geometric figures. This versatility allows for a broader range of applications compared to traditional OCR models. The model also supports multiple languages, particularly Chinese and English, and can output results in various formats like Markdown and LaTeX.
One of the key features of GOT-OCR 2.0 is its interactive OCR functionality. This allows users to guide the model through region-level recognition using coordinates or color cues, providing a more flexible user experience. The model also incorporates a dynamic resolution strategy, enabling it to handle ultra-high-resolution images like large posters or stitched PDF pages while maintaining accuracy.
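The article does not spell out how the dynamic resolution strategy works internally. One common way to handle an oversized image is to cut it into fixed-size tiles and recognize each tile separately. The following is a minimal sketch of that idea only, not the model's actual method: the `tile_boxes` helper is hypothetical, and the 1024-pixel tile side is an assumption borrowed from the 1024×1024 encoder input mentioned elsewhere in this article.

```python
from math import ceil

# Assumed tile side, chosen to match the 1024x1024 encoder input
# described in the article; the real strategy may differ.
TILE = 1024


def tile_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    cols = ceil(width / tile)
    rows = ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


# A 3000x2000 poster needs 3 columns x 2 rows of tiles.
boxes = tile_boxes(3000, 2000)
print(len(boxes))
```

Each box could then be cropped from the image and passed to the recognizer, with the per-tile results stitched back together in reading order.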
For efficient processing of large documents, GOT-OCR 2.0 includes multi-page OCR technology. This allows for batch processing of multi-page documents, significantly improving the efficiency of handling long PDF files or multiple image documents.
Technical Architecture and Training Strategy:
GOT-OCR 2.0 is built on an encoder-decoder architecture. The encoder compresses the input image into a sequence of image tokens, capturing visual information. The decoder then receives these tokens and converts them into text output. The decoder supports long contexts, enabling it to handle long text documents.
The model uses a high-compression encoder that turns a 1024×1024 pixel image into 256 image tokens, each a 1024-dimensional vector (a 256×1024 output), which makes high-resolution images practical to process. The long-context decoder supports sequences of up to 8K tokens, allowing it to process documents containing large amounts of text.
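To make these figures concrete, the short calculation below works out how strongly the encoder compresses an image and how much of the decoder's context is left for text. It is a back-of-the-envelope sketch: it assumes "8K" means 8192 tokens and that the encoder output is 256 tokens, each of width 1024.

```python
IMAGE_SIDE = 1024         # input resolution stated in the article
NUM_IMAGE_TOKENS = 256    # assumed: number of tokens the encoder emits
TOKEN_DIM = 1024          # assumed: width of each token vector
CONTEXT_LEN = 8 * 1024    # assumed: "8K" means 8192 tokens

# How many input pixels each image token has to summarize.
pixels_per_token = (IMAGE_SIDE * IMAGE_SIDE) // NUM_IMAGE_TOKENS

# Decoder positions left for output text after one image is fed in.
text_budget = CONTEXT_LEN - NUM_IMAGE_TOKENS

print(pixels_per_token)  # 4096, the area of a 64x64 pixel patch
print(text_budget)       # 7936
```

Under these assumptions a single page consumes only about 3% of the context window, which is why the decoder has room for long text output.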
GOT-OCR 2.0 employs a multi-stage training strategy. In the pre-training phase, the encoder is pre-trained on a large amount of text data to learn visual representations of text. During the joint training phase, the encoder is trained alongside a new decoder to adapt to a broader range of OCR tasks. Finally, the decoder undergoes further training in the post-training phase to support advanced features like fine-grained OCR, dynamic resolution, and multi-page OCR.
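The three phases can be summarized as a small schedule that records which component is updated in each one. This is an illustrative sketch based only on the description above, not the project's training code: the `Stage` class and `trainable` helper are hypothetical names, and treating the encoder as frozen during post-training is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_encoder: bool
    train_decoder: bool
    goal: str


# Flags follow the article's description; the frozen encoder in the
# last stage is an assumption, since only the decoder is mentioned.
STAGES = [
    Stage("pre-training", True, False, "learn visual representations of text"),
    Stage("joint training", True, True, "adapt to a broader range of OCR tasks"),
    Stage("post-training", False, True, "fine-grained, dynamic-resolution, multi-page OCR"),
]


def trainable(stage):
    """List the components whose weights are updated in a stage."""
    parts = []
    if stage.train_encoder:
        parts.append("encoder")
    if stage.train_decoder:
        parts.append("decoder")
    return parts


for stage in STAGES:
    print(f"{stage.name}: {', '.join(trainable(stage))} ({stage.goal})")
```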
Applications of GOT-OCR 2.0:
The versatility of GOT-OCR 2.0 makes it suitable for a wide range of applications, including:
- Document digitization: Converting paper documents like books, manuscripts, legal documents, and academic papers into electronic formats for easier storage, retrieval, and editing.
- Scene text recognition: Identifying and extracting text in natural scenes like street signs, billboards, and menus.
- Invoice processing: Automating the recognition and extraction of text information from invoices, receipts, and bills, streamlining financial and accounting processes.
- Identity verification and security: Recognizing information from passports, ID cards, or driver’s licenses in scenarios requiring personal identity verification, such as banking transactions or airport security checks.
- Logistics and transportation: Automating the recognition of barcodes and address information on packages, improving the efficiency of logistics sorting and delivery.
- Medical record management: Recognizing and digitizing handwritten prescriptions, medical records, and other medical documents.
Availability and Future Prospects:
GOT-OCR 2.0 is available on GitHub and HuggingFace Model Hub, making it accessible to developers and researchers worldwide. The model’s open-source nature encourages collaboration and innovation in the OCR field.
The release of GOT-OCR 2.0 marks a significant advancement in OCR technology, offering a powerful and versatile tool for various applications. As the model continues to evolve, it is expected to further enhance its capabilities and drive innovation in fields like document digitization, scene text recognition, and automated data extraction.