Image Extractor
A month ago, I took up a project to analyse images in textbooks. It sounded like a good idea at the time because I was interested in learning the Google Vision API and computer vision in general, and I was looking for a project to put them into practice.
I later found out that the API, specifically the label detection function, labels images based on their context rather than their content. My first diagnosis of the problem was that I needed to extract the images from the documents and analyse them separately using the API. That was naive in hindsight, but alas, it was the birth of this side project.
![]()
This, for instance, was labelled as "friendship" (77%), but what I needed was labels for the five humans and the one cake. I later found that even this doesn't work, and I am currently working on a different method using object detection.
Initially, I was resistant to embark on this project, mainly because I could not find a way to do it in R. I came across this solution in C++, but I was already knee-deep in R and picking up another language was not on my agenda. Frustrated, I decided to roll up my sleeves and try to translate the code to Python, a language that I had learned but never used.
It took me a week to translate the code and another two weeks to tweak it to my purpose. By the end of the project, I had developed two different scripts: one that analyses an individual page and extracts the identified bounding boxes, and another that analyses all images in a folder. I will discuss the latter but not the former here.
The code
You can find the full code here. Let's get started:
import cv2
import os
import numpy as np
import argparse
import logging
I coded this using Python 3.6 (32-bit) and OpenCV version 3.4.0. You might not need the logging package.
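If you want to confirm your environment matches, a quick sanity check (not part of the original script; cv2 is imported above):

import sys
print(sys.version)      # expect 3.6.x
print(cv2.__version__)  # expect 3.4.0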
First, we need to write the load-input function, which requires the path of the folder containing the target files. The function only works with png files, but you may change the extension to suit your purpose. The getName helper that follows simply returns a file's name without its extension; it is used below to name the output files.
def loadInput(path):
    imagePaths = []
    if not os.path.isdir(path):
        raise IOError("The folder " + path + " doesn't exist.")
    for root, dirs, files in os.walk(path):
        for filename in (x for x in files if x.endswith('.png')):
            imagePaths.append(os.path.join(root, filename))
    return imagePaths

def getName(path):
    base = os.path.basename(path)
    return os.path.splitext(base)[0]
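For instance, with a hypothetical folder C:\scans containing a file page-01.png (the names here are made up):

paths = loadInput("C:\\scans")   # e.g. ['C:\\scans\\page-01.png']
print(getName(paths[0]))         # 'page-01'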
The next function was translated from the C++ solution mentioned earlier, with minor tweaks to suit my purpose. I realised that the original 8 by 8 kernel was too big for my documents because some of the images have no borders, which left more target images undetected. The smaller kernel is a compromise: the masking is noisier, but more images are captured.
def preProcess(image):
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    gray = cv2.dilate(image, kernel, iterations=1)  # dilate to remove text
    gray = cv2.erode(gray, kernel, iterations=1)    # erode to restore the dilation
    ret, gray = cv2.threshold(gray, 254, 255, cv2.THRESH_TOZERO)    # change white background to black
    ret, gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV)  # invert the binary image for easier processing
    gray = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)  # try to fill image rectangles and remove noise
    gray = cv2.morphologyEx(gray, cv2.MORPH_OPEN, kernel)
    return gray
Here I also added an erode step to enlarge images without clear borders. In the future, it might be necessary to include a function that determines the kernel size, or a switch for whether to apply the dilate-erode combination at all. I also fiddled with a scipy local-thresholding function to deal with noisier input images; this might be included in future iterations as well.
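A rough sketch of what local thresholding could look like, using OpenCV's adaptiveThreshold rather than the scipy function mentioned above; treat it as one possible direction, not part of the current script:

def preProcessLocal(image, blockSize=51, C=10):
    # threshold each pixel against the mean of its blockSize x blockSize
    # neighbourhood minus C, instead of a single global cut-off
    binary = cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, blockSize, C)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # fill gaps, as before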
if __name__ == "__main__":
    ap = argparse.ArgumentParser(description='Scanned image extractor.')
    ap.add_argument('-p', '--path', required=True,
                    help="Path to the image folder e.g. .\images")
    ap.add_argument('-o', '--output', required=False, default=None,
                    help="Path for output images e.g. ..\roi")
    ap.add_argument('-s', '--saveoriginal', required=False, default=None,
                    help="Save original document with bounding boxes")
    args = vars(ap.parse_args())

    if args["output"] is None:
        output_dir = args["path"]
    else:
        output_dir = os.path.realpath(args["output"])
    if not os.path.isdir(output_dir):
        logging.error("Output directory %s does not exist" % args["output"])
    else:
        logging.info("Output will be saved to " + output_dir)
Next, we need to write the argparse arguments so the script can be run from the command prompt. Currently there are only three: the path to the image folder; the path to the output folder, which defaults to the input folder when left unspecified; and a flag to save the original images with their bounding boxes drawn. The originals are saved in the output folder (I realised while writing this that the third argument might overwrite the original files if no output folder was specified).
The following lines are the crux of the script.
    imgInput = loadInput(args["path"])
    for i in imgInput:
        filename = getName(i)
        colour = cv2.imread(i)
        imgSize = colour.shape[1] * colour.shape[0]
        gray = cv2.cvtColor(colour, cv2.COLOR_BGR2GRAY)
        img = preProcess(gray)
        # find contours and approximate them with bounding rectangles
        image, contours, hierarchy = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
The first line takes the folder path from argparse and passes it through the loadInput function. Then, for every file in the folder, we get the file name using getName, load the file, and convert it to grayscale for pre-processing. We then find the contours in the image and approximate each with a bounding rectangle.
        idx = 0
        rect = []
        for c in contours:
            if cv2.contourArea(c) > 1500:
                rect.append(cv2.boundingRect(c))
        sortBoxes = sorted(rect, key=lambda x: x[1]*3000 + x[0])  # sort top left to bottom right
        try:
            final = combineBoxes(sortBoxes)
        except IndexError:
            logging.debug("Boxes in " + filename + " can't be combined.")
            final = sortBoxes  # fall back to the uncombined boxes so the saving loop still runs
In order to reduce the noise, only bounding boxes with an area of more than 1500 square pixels are processed further. The boxes are then sorted from top left to bottom right for further analysis.
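The sort key x[1]*3000 + x[0] orders boxes roughly row by row; the 3000 multiplier assumes pages are narrower than 3000 px. A made-up example:

boxes = [(500, 40, 100, 80), (20, 40, 100, 80), (20, 400, 100, 80)]  # (x, y, w, h)
print(sorted(boxes, key=lambda b: b[1]*3000 + b[0]))
# [(20, 40, 100, 80), (500, 40, 100, 80), (20, 400, 100, 80)]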
I tried several ways to handle overlapping boxes. So far, only the combineBoxes method found here has worked, with minor tweaks.
def union(a, b):
    x = min(a[0], b[0])
    y = min(a[1], b[1])
    w = max(a[0]+a[2], b[0]+b[2]) - x
    h = max(a[1]+a[3], b[1]+b[3]) - y
    return (x, y, w, h)

def intersection(a, b):
    x = max(a[0], b[0])
    y = max(a[1], b[1])
    w = min(a[0]+a[2], b[0]+b[2]) - x
    h = min(a[1]+a[3], b[1]+b[3]) - y
    if w < 0 or h < 0:
        return ()  # no overlap; an empty tuple is falsy
    return (x, y, w, h)
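For example, with two made-up boxes in (x, y, w, h) form:

a = (10, 10, 100, 100)
b = (50, 50, 100, 100)
print(intersection(a, b))  # (50, 50, 60, 60) -- the overlapping region
print(union(a, b))         # (10, 10, 140, 140) -- the smallest box covering both
print(intersection(a, (200, 200, 10, 10)))  # () -- falsy, so no overlap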
Here we first define the union and intersection functions. The combineBoxes method below is not Pythonic, and I still can't figure out the best way to do it with itertools, as recommended in the same Stack Overflow answer, to handle boxes after they have been combined. I would rather not run the function more than once to merge all the subsequent boxes, so I chose this method for now.
For that reason I still keep the bounding boxes sorted in case I figure out how to combine the intersecting boxes in the future.
def combineBoxes(boxes):
    noIntersectLoop = False
    noIntersectMain = False
    posIndex = 0
    # keep looping until we have completed a full pass over each rectangle
    # and checked it does not overlap with any other rectangle
    while noIntersectMain == False:
        noIntersectMain = True
        posIndex = 0
        # start with the first rectangle in the list; once the first
        # rectangle has been unioned with every other rectangle,
        # repeat for the second until done
        while posIndex < len(boxes):
            noIntersectLoop = False
            while noIntersectLoop == False and len(boxes) > 1 and posIndex < len(boxes):  # added posIndex < len(boxes) to prevent IndexError
                a = boxes[posIndex]
                listBoxes = np.delete(boxes, posIndex, 0)
                index = 0
                for b in listBoxes:
                    # if there is an intersection, the boxes overlap
                    if intersection(a, b):
                        newBox = union(a, b)
                        listBoxes[index] = newBox
                        boxes = listBoxes
                        noIntersectLoop = False
                        noIntersectMain = False
                        index = index + 1
                    else:  # changed break to else
                        noIntersectLoop = True
                        index = index + 1
            posIndex = posIndex + 1
    return np.array(boxes).astype("int")
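A quick sanity check with made-up boxes: the first two overlap and get merged, while the third is left alone:

boxes = [(10, 10, 100, 100), (50, 50, 100, 100), (300, 300, 50, 50)]
print(combineBoxes(boxes))
# [[ 10  10 140 140]
#  [300 300  50  50]]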
Lastly, we need to save the selected bounding boxes. Only boxes that are smaller than the full page and have both a width and a height of more than 50 px are saved. This is necessary because pages containing only text tend to produce a single bounding box covering the whole page, and noise can produce boxes too narrow to contain anything of interest after the overlapping boxes are combined. The lines before the last save the original image with its bounding boxes drawn, if that option was specified.
        for x, y, w, h in final:
            if w*h == imgSize:
                logging.debug("Can't find significant image in " + filename)
            elif w < 50:
                logging.debug("Width too narrow for box in " + filename)
            elif h < 50:
                logging.debug("Height too short for box in " + filename)
            else:
                idx += 1
                roi = colour[y:y+h, x:x+w]  # region of interest
                cv2.imwrite(os.path.join(output_dir, filename + '_roi-' + str(idx) + '.png'), roi)
                cv2.rectangle(colour, (x, y), (x+w, y+h), (0, 0, 255), 2)  # draw the box after saving so the crop stays clean
        if args["saveoriginal"] is not None:
            cv2.imwrite(os.path.join(output_dir, filename + '.png'), colour)
    cv2.destroyAllWindows()
So in order to run the script, just open the command prompt and enter:
python imageExtractor-cd.py --path C:\[path] --output C:\[path]\roi
The output is not perfect. For instance, the number at the top left is treated as an image because it has a solid background. Additionally, the first two images are captured as one because one image's border blends into the background, which caused the bounding box of the left image to intersect with the right one. These are inconsistencies that I still could not control. Perhaps it is a matter of garbage in, garbage out, and better image pre-processing is needed. The border problem could be solved by using Canny edge detection and then finding the bounding boxes. This shall be explored in future iterations.
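A rough sketch of that Canny idea, reusing the OpenCV 3.4 calls from above on the grayscale page gray (untested, just a direction):

edges = cv2.Canny(gray, 50, 150)  # edges show up even where borders blend into the background
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
edges = cv2.dilate(edges, kernel, iterations=1)  # close small gaps in the edge map
image, contours, hierarchy = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 1500]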
Conclusion
This side project was born out of the desire to extract images from scanned documents for further analysis. In hindsight, the project was unnecessary, and I cannot find any other use for it now. However, I am glad that I started it, as it was good exposure to OpenCV and computer vision in general. The references and knowledge gleaned from this experience will definitely be useful for future projects.