Correctly extract text from image using Tesseract OCR

Question

I have been trying to extract the bold white text from this image, but I cannot get it to work correctly: the 9 is read as a 3 and the I as a 1.

I have looked at various sites with code for improving image quality, but I have not got any of it to work. Can anyone help me with this one? The desired output is "I6M-9U".

import cv2
import pytesseract


def get_text_from_image(image: cv2.Mat) -> str:
    pytesseract.pytesseract.tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe'

    # Crop the image to the region I am interested in
    top, left, height, width = 25, 170, 40, 250
    crop_img = image[top:top + height, left:left + width]

    # Target size: 1500% of the cropped size
    resize_scaling = 1500
    resize_width = int(crop_img.shape[1] * resize_scaling / 100)
    resize_height = int(crop_img.shape[0] * resize_scaling / 100)
    resized_dimensions = (resize_width, resize_height)

    # Resize it
    crop_img = cv2.resize(crop_img, resized_dimensions, interpolation=cv2.INTER_CUBIC)

    # --psm 6: treat the image as a single uniform block of text
    return str(pytesseract.image_to_string(crop_img, config="--psm 6"))

UPDATED CODE

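import cv2 as cv

# With THRESH_OTSU the threshold is computed automatically, so the 120 passed
# here is ignored; the input should be a single-channel (grayscale) image.
ret, thresh1 = cv.threshold(image, 120, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)

cv.imshow("image", thresh1)
cv.waitKey(0)

This now has all the background artifacts removed, but it still reads the first letter I as 1 and the 9 as 3.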

What are you trying? Show us your code so we know what not to do.
– Tino D, Commented May 8, 2024 at 17:07
I have tesseract.js working locally, and I cannot get it. The best accuracy outputs IEM-9U every time. I reduced the brightness, raised the contrast slightly, and applied a hard sharpen. Image. Now it's IBM-9U, consistently. I've pulled way worse text. You have a unique case, for sure.
– JayCravens, Commented May 8, 2024 at 17:15
I got it. The image is full of artifacts. If you open it in GIMP and do an 'equalize', you'll see all of them. Use this posterizer tool and select 2 colors. Tesseract got it on the first try.
– JayCravens, Commented May 8, 2024 at 17:24
Perhaps just use an Otsu threshold. That might clean it up enough for Tesseract to get the correct result. Sorry, my OpenCV is not working at the moment, so I cannot test that.
– fmw42, Commented May 8, 2024 at 18:02

Hermann12 · Accepted Answer · 2024-05-13 08:58:18Z

0

The I looks like a 1 to me. If you don't want it read as a 1, remove 1 from the character whitelist:

import cv2
import pytesseract

img = cv2.imread('16M.png', cv2.IMREAD_UNCHANGED)

# Binarize with a fixed threshold of 63
(thresh, blackAndWhiteImage) = cv2.threshold(img, 63, 255, cv2.THRESH_BINARY)

# Resize image
scale_percent = 4  # percent of original size
width = int(blackAndWhiteImage.shape[1] * scale_percent / 100)
height = int(blackAndWhiteImage.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(blackAndWhiteImage, dim, interpolation=cv2.INTER_AREA)

# OCR the resized black-and-white image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Remove 1 from the whitelist, then you will get I instead
custom_config = r'--psm 6 --oem 3 -c tessedit_char_whitelist=-.ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

tex = pytesseract.image_to_string(resized, config=custom_config)
print(tex)

# Display the resized black-and-white image
cv2.imshow("Image", resized)

cv2.waitKey(0)
cv2.destroyAllWindows()

Output:

16M-9U-0.0

With "1" removed from the whitelist:

I6M-9U-0.0
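
Rather than editing the whitelist string by hand, the config could be assembled programmatically. The following is a minimal sketch; `build_tesseract_config` and its parameters are hypothetical names introduced for illustration, and only the `--psm`, `--oem` and `tessedit_char_whitelist` options come from the answer's code.

```python
import string


def build_tesseract_config(exclude: str = "", psm: int = 6, oem: int = 3) -> str:
    """Build a config string whose whitelist omits the characters in `exclude`."""
    # Same character set as the answer's whitelist: "-", ".", A-Z, 0-9
    allowed = "-." + string.ascii_uppercase + string.digits
    whitelist = "".join(ch for ch in allowed if ch not in exclude)
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


# Whitelist without "1", so an ambiguous stroke cannot be reported as "1"
config_without_one = build_tesseract_config(exclude="1")
```

The string would then be passed as `pytesseract.image_to_string(resized, config=config_without_one)`. This only suits images known to contain no digit 1.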

1 Comment

Andreas Ellsen Over a year ago

That would work, except there are multiple images I need to parse, and some of them contain a 1. The image above has an I in it, not a 1.
