Ashari Abidin's Developer Docs

Myanmar OCR OpenClaw

๐Ÿ‰ OpenClaw ยท Myanmar OCR Pipeline

Tesseract 5 + OpenCV + Hybrid AI | Enterprise-grade Burmese language recognition

Core integration: OpenClaw + Myanmar (Burmese) OCR

Tesseract mya+eng

To add Myanmar (Burmese) language recognition to an OpenClaw deployment, you mainly need to:

  • Install Tesseract with Myanmar language data
  • Configure OpenClaw OCR pipeline
  • Optimize preprocessing for Myanmar script
  • Handle Unicode normalization and mixed-language text

Tesseract already supports Myanmar through the mya language model (mya.traineddata) and the script-level Myanmar.traineddata model.

โš™๏ธ 1. Install Tesseract OCR

sudo apt update && sudo apt install -y \
 tesseract-ocr \
 libtesseract-dev \
 tesseract-ocr-mya

Verify the installation with tesseract --list-langs; the output should include mya and eng. If mya is missing, download the model manually:

sudo mkdir -p /usr/share/tesseract-ocr/5/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/main/mya.traineddata \
 -O /usr/share/tesseract-ocr/5/tessdata/mya.traineddata
# Optional script model:
sudo wget https://github.com/tesseract-ocr/tessdata/raw/main/script/Myanmar.traineddata \
 -O /usr/share/tesseract-ocr/5/tessdata/Myanmar.traineddata
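
The language check can also be scripted. The following is a minimal sketch, assuming the usual `tesseract --list-langs` output format (a "List of available languages" header followed by one language code per line); the function name `missing_languages` is hypothetical and not part of Tesseract or OpenClaw.

```python
def missing_languages(list_langs_output, required=("mya", "eng")):
    """Return the required language codes absent from `tesseract --list-langs` output."""
    installed = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line printed before the codes.
        if not line or line.startswith("List of"):
            continue
        installed.add(line)
    return [lang for lang in required if lang not in installed]


# Sample output as it might look when only the default models are installed.
sample = 'List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):\neng\nosd\n'
print(missing_languages(sample))
```

In a real check, the string would come from running the command, for example through `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True).stdout`.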

๐Ÿ” 2. Basic OCR Test

tesseract sample-myanmar.png stdout -l mya
# Mixed Burmese + English
tesseract sample.png stdout -l mya+eng

For OpenClaw integrations, mya+eng is usually better because Myanmar documents often contain English names, numbers, Latin abbreviations, and technical terms.
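
One way to sanity-check mixed-language output is to count how many recognized characters fall in the Myanmar Unicode block (U+1000 to U+109F) versus basic Latin letters. This is a sketch only; `script_counts` is a hypothetical helper, and the Myanmar count includes the digits and punctuation that live in that block.

```python
def script_counts(text):
    """Count Myanmar-block code points and ASCII letters in OCR output."""
    counts = {"myanmar": 0, "latin": 0}
    for ch in text:
        if "\u1000" <= ch <= "\u109f":
            counts["myanmar"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    return counts


# "\u1019\u103c\u1014\u103a\u1019\u102c" spells the word "Myanmar" in Burmese script.
sample_text = "\u1019\u103c\u1014\u103a\u1019\u102c ABC 123"
print(script_counts(sample_text))
```

A Burmese document that yields zero Myanmar code points usually points to a missing mya model or a wrong `-l` argument rather than to a recognition problem.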

๐Ÿ 3. Python Integration Example

import pytesseract
from PIL import Image
img = Image.open("invoice_mm.png")
text = pytesseract.image_to_string(img, lang="mya+eng", config="--oem 1 --psm 6")
print(text)

Recommended settings: --oem 1 (LSTM engine), --psm 6 (single uniform block of text) or --psm 11 (sparse text), and lang="mya+eng".

4. OpenClaw OCR Pipeline Integration

Typical architecture: Document Upload → Image Preprocessing → Tesseract OCR (mya+eng) → Unicode Cleanup → NER / Extraction → Structured JSON Output.

import pytesseract

class MyanmarOCR:
    def extract(self, image_path):
        # pytesseract accepts a file path as well as a PIL image
        return pytesseract.image_to_string(
            image_path, lang="mya+eng", config="--oem 1 --psm 6"
        )
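
The stage order in the architecture above can be sketched as a small pipeline object with one callable per stage. This is an illustration under assumptions, not OpenClaw's actual interface: the class name `OcrPipeline` and its four fields are hypothetical, and the demo uses stub stages so it runs without Tesseract installed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class OcrPipeline:
    preprocess: Callable[[Any], Any]   # image -> cleaned image
    recognize: Callable[[Any], str]    # cleaned image -> raw text (e.g. Tesseract mya+eng)
    cleanup: Callable[[str], str]      # raw text -> normalized Unicode text
    extract: Callable[[str], dict]     # normalized text -> structured fields

    def run(self, image: Any) -> dict:
        """Run the stages in order and return a JSON-serializable result."""
        cleaned = self.preprocess(image)
        raw = self.recognize(cleaned)
        text = self.cleanup(raw)
        return {"text": text, "fields": self.extract(text)}


# Stub stages stand in for OpenCV, Tesseract, and the extraction layer.
demo = OcrPipeline(
    preprocess=lambda img: img,
    recognize=lambda img: "  Invoice No: 42 \n",
    cleanup=str.strip,
    extract=lambda text: {"invoice_no": text.split(":")[1].strip()},
)
result = demo.run("fake-image")
print(result)
```

Keeping each stage as a plain callable makes it easy to swap the recognizer or the cleanup step and to test the stages separately.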

โš ๏ธ 5. Important Myanmar OCR Challenges

  • Character stacking: multiple glyph layers
  • Complex ligatures & combined characters
  • Encoding inconsistency: Zawgyi-encoded vs. Unicode-compliant fonts
  • Low-resource datasets & historical document quality

Work on Myanmar OCR often reports gains from hybrid pipelines that combine preprocessing, an LSTM recognizer, and post-correction.

6. Zawgyi vs Unicode Detection

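# Shell: pip install myanmartools rabbit
# (the Python client of Google's myanmar-tools is published as "myanmartools")
import rabbit
from myanmartools import ZawgyiDetector

detector = ZawgyiDetector()
score = detector.get_zawgyi_probability(text)  # near 0 = Unicode, near 1 = Zawgyi
# Convert only when Zawgyi is likely; the 0.95 cutoff is an assumed value to tune
unicode_text = rabbit.zg2uni(text) if score > 0.95 else text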

Normalizing everything to Unicode is crucial for accurate extraction: search, NER, and field matching downstream of Tesseract all assume a single encoding.
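
Beyond encoding conversion, a light Unicode cleanup pass over the OCR output is a common extra step. The sketch below assumes NFC normalization and removal of zero-width characters are acceptable for the downstream task (zero-width spaces are sometimes used deliberately as line-break hints in Burmese text, so this is a choice to review); `clean_ocr_text` is a hypothetical helper name.

```python
import re
import unicodedata

# Code points to delete: zero-width space and byte-order mark.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\ufeff"))


def clean_ocr_text(text: str) -> str:
    """Normalize to NFC, drop zero-width characters, and tidy whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)  # code points mapped to None are removed
    # splitlines() also splits on the form feed Tesseract emits between pages.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(clean_ocr_text("e\u0301  x\u200b\n\n  y \n\x0c"))
```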

7. OpenCV Preprocessing (Highly Recommended)

import cv2
img = cv2.imread("scan.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
cv2.imwrite("clean.png", thresh)

Recommended: deskew, noise reduction, adaptive thresholding, DPI scaling to 300+, contrast enhancement.
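
The cv2.THRESH_OTSU flag used in the snippet earlier in this section picks the threshold automatically. To show what that selection does, here is a pure-Python sketch of Otsu's method on a flat list of 0-255 grayscale values; it is an illustration of the technique, not OpenCV's implementation, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no dark class yet
        if w1 == 0:
            break     # no bright class left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]


scan = [10, 12, 11, 200, 198, 201]  # toy "image": dark ink and bright paper
t = otsu_threshold(scan)
print(t, binarize(scan, t))
```

Because the threshold comes from the whole image's histogram, Otsu works best on evenly lit scans; for uneven lighting, the adaptive thresholding mentioned above is the usual alternative.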

๐Ÿณ 8. Docker Deployment Example

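FROM ubuntu:24.04
RUN apt-get update && apt-get install -y tesseract-ocr tesseract-ocr-mya python3-pip \
 && rm -rf /var/lib/apt/lists/*
# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so pip needs --break-system-packages (or a virtual environment) here.
# The headless OpenCV build avoids GUI libraries (libGL) missing from the base image.
RUN pip install --break-system-packages pytesseract pillow opencv-python-headless
WORKDIR /app
COPY . .
CMD ["python3", "app.py"]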

๐ŸŒ 9. REST API Example (FastAPI)

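from fastapi import FastAPI, UploadFile
import pytesseract
from PIL import Image

app = FastAPI()

# File uploads require the python-multipart package to be installed.
@app.post("/ocr")
def ocr(file: UploadFile):
    # A plain def lets FastAPI run the blocking OCR call in its thread pool.
    image = Image.open(file.file)
    text = pytesseract.image_to_string(image, lang="mya+eng", config="--oem 1 --psm 6")
    return {"text": text}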

๐Ÿญ 10. Recommended Production Stack

  • OCR Engine: Tesseract 5
  • Language: mya+eng
  • Preprocessing: OpenCV
  • Encoding normalization: Rabbit + Myanmar Tools
  • Layout detection: YOLOv8 / Detectron2
  • NLP extraction: spaCy / Transformers
  • API: FastAPI
  • Queue: Celery/Redis

11. Accuracy Improvement Strategy

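Fine-tune the mya model on in-domain data: render training lines with custom Myanmar fonts and add samples from datasets such as NRC cards, hospital records, and government forms. Tesseract 4 shipped tesstrain.sh for this; for Tesseract 5 the Makefile-based tesseract-ocr/tesstrain repository is the usual route. Fine-tuning for Myanmar OCR remains an area of active research.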
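
Any fine-tuning effort needs a way to measure whether accuracy actually improved. A common metric is character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is an assumption about how one might score results, not something the pipeline above prescribes; it counts Unicode code points, so a stacked Burmese syllable contributes several characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length, in code points."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

Scoring the stock mya model and a fine-tuned model on the same held-out pages gives a like-for-like comparison.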

๐ŸŽ›๏ธ 12. Recommended OCR Modes (PSM)

  • Printed document → --psm 6
  • Sparse text → --psm 11
  • Single line → --psm 7
  • ID cards → --psm 4
  • Forms → --psm 6
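
The mapping above can be kept in one place so every caller builds its Tesseract config the same way. This is a minimal sketch: the dictionary keys and the `tesseract_config` helper are hypothetical names, and falling back to --psm 6 for unknown document types is an assumption.

```python
# Document type -> page segmentation mode, mirroring the list above.
PSM_BY_DOCUMENT = {
    "printed_document": 6,
    "sparse_text": 11,
    "single_line": 7,
    "id_card": 4,
    "form": 6,
}


def tesseract_config(document_type, oem=1):
    """Build the config string passed to pytesseract for a document type."""
    psm = PSM_BY_DOCUMENT.get(document_type, 6)  # assumed default: --psm 6
    return f"--oem {oem} --psm {psm}"


print(tesseract_config("id_card"))
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string`.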

๐Ÿ›๏ธ 13. Enterprise Use Cases

Typical use cases include Myanmar NRC (National Registration Card) extraction, hospital records, pharmacy prescriptions, government forms, banking KYC, logistics manifests, education certificates, and invoice digitization.

14. Recommended Hybrid Architecture

YOLO Layout Detection → OpenCV Cleanup → Tesseract OCR (mya+eng) → Unicode Normalization → LLM Correction Layer → Structured Extraction

This hybrid pattern is becoming common for low-resource OCR languages (Myanmar, Khmer, etc.).