PDF to Excel using advanced Python NLP and Computer Vision, AKA Document AI - Part 1






If the data is tabular, libraries such as Camelot, tabula, or pdftotext can help turn it into a data frame. Camelot and tabula return detected tables as pandas DataFrames directly, while pdftotext returns plain text that still has to be parsed. A table can be located by looking for the vertical and horizontal ruling lines that form its grid, or by letting these libraries detect it for us. Another useful PDF trick is to get the coordinates of the table's columns, header, and footer, and then assign each piece of text to a column based on where it falls relative to the column areas.
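The coordinate trick described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `words_to_rows`, the `(x, y, text)` word tuples, the column edges, and the header and footer limits are all hypothetical, and in practice the word positions would come from whichever PDF text extractor you use. It assumes y is measured downward from the top of the page.

```python
from bisect import bisect_right


def words_to_rows(words, column_edges, header_bottom, footer_top, row_tol=2.0):
    """Turn positioned words into table rows (hypothetical helper).

    words        : iterable of (x, y, text); x is the word's left edge and
                   y its distance from the top of the page (assumed layout).
    column_edges : sorted x positions; column i spans edges[i] <= x < edges[i+1].
    header_bottom, footer_top : y limits; only words strictly between are kept.
    row_tol      : a word joins the current row if its y is within this
                   distance of the row's first word.
    """
    n_cols = len(column_edges) - 1
    body = [
        w for w in words
        if header_bottom < w[1] < footer_top
        and column_edges[0] <= w[0] < column_edges[-1]
    ]

    # Group words into rows by y, comparing against the first word of the row.
    grouped = []
    for w in sorted(body, key=lambda w: w[1]):
        if not grouped or w[1] - grouped[-1][0] > row_tol:
            grouped.append((w[1], []))
        grouped[-1][1].append(w)

    # Inside each row, read left to right and drop each word into its column.
    rows = []
    for _, row_words in grouped:
        cells = [""] * n_cols
        for x, _, text in sorted(row_words, key=lambda w: w[0]):
            col = bisect_right(column_edges, x) - 1
            cells[col] = f"{cells[col]} {text}".strip()
        rows.append(cells)
    return rows


# Made-up word positions for a two-column statement page.
sample_words = [
    (20, 50, "Description"), (120, 50, "Amount"),   # header, above y=60
    (20, 100, "01/01"), (120, 100.5, "250.00"),
    (20, 130, "Opening"), (120, 130, "5.00"), (55, 130.3, "balance"),
    (20, 800, "Page 1"),                            # footer, below y=790
]
rows = words_to_rows(sample_words, column_edges=[10, 100, 200],
                     header_bottom=60, footer_top=790)
for row in rows:
    print(row)
# ['01/01', '250.00']
# ['Opening balance', '5.00']
```

The resulting list of rows can be passed straight to `pandas.DataFrame(rows, columns=[...])` once the column names are known.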

Installation (the brew line is for macOS; tabula-py also needs a Java runtime):
pip install matplotlib
pip install "camelot-py[cv]"
pip install tabula-py PyPDF2
brew install ghostscript tcl-tk


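import camelot
import matplotlib.pyplot as plt
import pandas as pd
import tabula  # from the tabula-py package (pip install tabula-py)
from PyPDF2 import PdfReader

# Let Camelot detect tables (it reads page 1 by default) and plot what it
# found: 'text' shows the text elements, 'grid' shows the detected table grid.
tables = camelot.read_pdf('icic.pdf')
camelot.plot(tables[0], kind='text')
plt.show()
camelot.plot(tables[0], kind='grid')
plt.show()

# Count the pages. PdfFileReader/getNumPages were removed in PyPDF2 3.0,
# so use PdfReader and len(reader.pages) instead.
with open("icic.pdf", mode='rb') as f:
    n = len(PdfReader(f).pages)

# Read every page with tabula. On page 1 the table only fills part of the
# page, so restrict the search to area=(top, left, bottom, right) in points.
# Each entry of df is whatever tabula returned for that page.
df = []
for page in range(1, n + 1):
    if page == 1:
        df.append(tabula.read_pdf("icic.pdf", area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(tabula.read_pdf("icic.pdf", pages=page))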
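The `area` passed for page 1 follows tabula's convention: (top, left, bottom, right) in points, measured from the top of the page. Camelot's `table_areas` option instead takes "x1,y1,x2,y2" strings in PDF coordinate space, where the origin is the bottom-left corner and (x1, y1) is the table's top-left. The sketch below converts one into the other. The helper name `tabula_area_to_camelot` is hypothetical, and the page height of 792 points (US Letter) is an assumption, since the real height of `icic.pdf` is not stated here.

```python
def tabula_area_to_camelot(area, page_height):
    """Convert a tabula area (top, left, bottom, right), measured from the
    top of the page, into a Camelot "x1,y1,x2,y2" string measured from the
    bottom-left corner of the page."""
    top, left, bottom, right = area
    x1, y1 = left, page_height - top       # top-left corner of the table
    x2, y2 = right, page_height - bottom   # bottom-right corner of the table
    return f"{x1:g},{y1:g},{x2:g},{y2:g}"


PAGE_HEIGHT = 792  # assumption: US Letter; check the actual page size first
table_area = tabula_area_to_camelot((530, 12.75, 790.5, 561), PAGE_HEIGHT)
print(table_area)  # 12.75,262,561,1.5

# With Camelot installed, the string could then be used like this:
# tables = camelot.read_pdf("icic.pdf", pages="1", flavor="stream",
#                           table_areas=[table_area])
```

If the page turns out to be A4, the same call with a height of 841.89 gives the matching region.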

