
OCR

OCR, in general

OCR = Optical Character Recognition

Image containing text -> (formatted) text

Evaluation

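CER (character error rate): character-level edit distance between the OCR output and the reference text, divided by the number of reference characters

WER (word error rate): the same measure at word level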
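A minimal sketch of how both metrics can be computed, assuming the usual definition (Levenshtein edit distance divided by the reference length). The function names `edit_distance`, `cer` and `wer` are illustrative and do not come from any particular library; splitting words on whitespace is also an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items reaches the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, comparing the reference "hello world" with the OCR output "hallo world" gives one wrong character out of 11 (CER = 1/11) and one wrong word out of 2 (WER = 0.5).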

Preprocessing

Borderless Tables Detection with Deep Learning and OpenCV - by Volodymyr Holomb - Towards Data Science

Improving the quality of the output - tessdoc

OCR Preprocessing overall

Problems for prediction:

Solution:

thresholding

OpenCV: Image Thresholding


import cv2 as cv
from matplotlib import pyplot as plt

# Load as grayscale; cv.imread returns None if the file cannot be read
img = cv.imread('gradient.png', cv.IMREAD_GRAYSCALE)
assert img is not None, "file could not be read, check the path"

# Apply each simple thresholding type with threshold 127 and max value 255
ret, thresh1 = cv.threshold(img, 127, 255, cv.THRESH_BINARY)
ret, thresh2 = cv.threshold(img, 127, 255, cv.THRESH_BINARY_INV)
ret, thresh3 = cv.threshold(img, 127, 255, cv.THRESH_TRUNC)
ret, thresh4 = cv.threshold(img, 127, 255, cv.THRESH_TOZERO)
ret, thresh5 = cv.threshold(img, 127, 255, cv.THRESH_TOZERO_INV)

# Show the original and the five results in a 2x3 grid
titles = ['Original Image', 'BINARY', 'BINARY_INV', 'TRUNC', 'TOZERO', 'TOZERO_INV']
images = [img, thresh1, thresh2, thresh3, thresh4, thresh5]
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.imshow(images[i], 'gray', vmin=0, vmax=255)
    plt.title(titles[i])
    plt.xticks([])
    plt.yticks([])
plt.show()
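To make the five threshold types concrete, here is a small NumPy sketch of the per-pixel rule each one applies. It is an illustration of the rules, not OpenCV's implementation; `simple_threshold` and its string `kind` argument are hypothetical names.

```python
import numpy as np


def simple_threshold(src, thresh, maxval, kind):
    """NumPy sketch of the five simple thresholding rules (illustrative only)."""
    above = src > thresh  # True where the pixel is strictly above the threshold
    # All five results are computed for brevity; only the requested one is returned.
    rules = {
        "BINARY": np.where(above, maxval, 0),      # maxval above, 0 otherwise
        "BINARY_INV": np.where(above, 0, maxval),  # 0 above, maxval otherwise
        "TRUNC": np.where(above, thresh, src),     # clip to thresh, keep the rest
        "TOZERO": np.where(above, src, 0),         # keep pixels above, zero the rest
        "TOZERO_INV": np.where(above, 0, src),     # zero pixels above, keep the rest
    }
    return rules[kind].astype(src.dtype)
```

With the row `[0, 100, 127, 128, 255]` and `thresh=127`, `BINARY` gives `[0, 0, 0, 255, 255]` and `TRUNC` gives `[0, 100, 127, 127, 127]`.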


Adaptive Thresholding

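The adaptiveMethod decides how the threshold value is calculated:

  • cv.ADAPTIVE_THRESH_MEAN_C: the threshold is the mean of the neighbourhood area minus the constant C
  • cv.ADAPTIVE_THRESH_GAUSSIAN_C: the threshold is a Gaussian-weighted sum of the neighbourhood values minus the constant C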

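# Global threshold for comparison: a single value (127) for the whole image
ret, th1 = cv.threshold(img, 127, 255, cv.THRESH_BINARY)
# Adaptive thresholds: blockSize=11 (neighbourhood size, must be odd), C=2
th2 = cv.adaptiveThreshold(img, 255, cv.ADAPTIVE_THRESH_MEAN_C,
                           cv.THRESH_BINARY, 11, 2)
th3 = cv.adaptiveThreshold(img, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                           cv.THRESH_BINARY, 11, 2)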
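As a rough illustration of what the mean-based method computes, the sketch below re-implements it in plain NumPy: each pixel is compared against the mean of its blockSize x blockSize neighbourhood minus C. The function name `adaptive_mean_threshold` is hypothetical, edge pixels are handled here by replicating the border (an assumption), and the result may differ slightly from OpenCV because of rounding.

```python
import numpy as np


def adaptive_mean_threshold(img, max_value=255, block_size=11, c=2):
    """Illustrative mean-based adaptive threshold for a 2-D grayscale array."""
    if block_size < 3 or block_size % 2 != 1:
        raise ValueError("block_size must be an odd number >= 3")
    h, w = img.shape
    k = block_size
    # Pad by replicating edge pixels so every pixel has a full k x k window
    padded = np.pad(img.astype(np.float64), k // 2, mode="edge")
    # Summed-area table with a leading row and column of zeros
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    # Sum of each k x k window from four table lookups
    window_sum = (sat[k:k + h, k:k + w] - sat[:h, k:k + w]
                  - sat[k:k + h, :w] + sat[:h, :w])
    local_threshold = window_sum / (k * k) - c
    # THRESH_BINARY rule: max_value where the pixel exceeds its local threshold
    return np.where(img > local_threshold, max_value, 0).astype(np.uint8)
```

Because the threshold follows the local mean, a dark character on an unevenly lit background is still separated from its surroundings, which is the reason adaptive thresholding is preferred over a single global value for scanned pages.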

[Figure: ada_threshold.jpg, adaptive thresholding results]

structure recognition

OpenCV: Morphological Transformations

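border enhancing (prep)

Idea: generate an auxiliary image that helps border recognition: fill entirely blank rows/columns with black, and lighten the dark digits.

Detecting blank rows/columns: maintain a dict counter that counts the number of pixels for each value

  1. Iterate over all columns and record the blank ones

    • After the row pass is finished, paint each blank column and the column before it black
    • (so that row detection is not disturbed)
  2. Iterate over all rows and paint blank rows black

    2.1 For incomplete rows (e.g. sub-structure inside a single cell, short rows): if a row contains a run of a fixed length that is entirely black, paint the whole row and the previous row black

    2.2 Paint blank rows black

    2.3 Lighten the dark digits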
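The steps above can be sketched roughly as follows. This is only one reading of the notes, assuming a grayscale uint8 image with a white background; all names and parameter values (`blank_min`, `black_max`, `run_length`, `light_value`, `max_noise`) are hypothetical and would need tuning on real scans.

```python
from collections import Counter

import numpy as np


def is_blank(line, blank_min=250, max_noise=0):
    """A row/column is blank if at most max_noise pixels are darker than blank_min."""
    counts = Counter(line.tolist())  # pixel value -> number of pixels with that value
    dark = sum(n for value, n in counts.items() if value < blank_min)
    return dark <= max_noise


def has_black_run(line, run_length, black_max=50):
    """True if the line contains run_length consecutive pixels <= black_max."""
    run = 0
    for value in line.tolist():
        run = run + 1 if value <= black_max else 0
        if run >= run_length:
            return True
    return False


def enhance_borders(img, run_length=20, blank_min=250, black_max=50, light_value=200):
    """Build an auxiliary image in which table borders are easier to detect."""
    h, w = img.shape
    # Step 1: record blank columns on the untouched image (painted last)
    blank_cols = [x for x in range(w) if is_blank(img[:, x], blank_min)]
    # Step 2: find blank rows and rows holding a partial rule (long black run)
    blank_rows = [y for y in range(h) if is_blank(img[y, :], blank_min)]
    partial_rows = [y for y in range(h)
                    if has_black_run(img[y, :], run_length, black_max)]

    out = img.copy()
    # 2.3: lighten dark pixels first, so the black paint below is not lightened
    out[img <= black_max] = light_value
    # 2.1: extend partial rules to the full row, together with the previous row
    for y in partial_rows:
        out[max(y - 1, 0):y + 1, :] = 0
    # 2.2: paint blank rows black
    out[blank_rows, :] = 0
    # Step 1, continued: paint blank columns and the column before each one
    for x in blank_cols:
        out[:, max(x - 1, 0):x + 1] = 0
    return out
```

Detecting rows and columns on the original image and painting only at the end keeps the column paint from interfering with row detection, which is the constraint stated in step 1.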

structure/border recognition

Methods:

Tesseract OCR

Tesseract OCR in Python with Pytesseract & OpenCV

What is Tesseract and how it works? - by BytePace - Medium

Architecture

Legacy Tesseract engine:

[Figure: architecture of the legacy Tesseract engine]

LSTMs

How Tesseract uses LSTM model presentation
