PDF to Excel Using Advanced Python NLP and Computer Vision, AKA Document AI - Part 1

If the data is tabular, we can use libraries such as Camelot or tabula to load it into a pandas data frame, or pdftotext to pull out the raw text. To extract a table, we can either detect it ourselves from the vertical and horizontal ruling lines on the page or let one of these libraries find it for us. Another useful trick is to locate the table's columns, header, and footer by their page coordinates, then decode each piece of text according to where it falls relative to the column areas.
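
The coordinate-based trick can be sketched in plain Python. This is only an illustration, not how Camelot or tabula work internally: the function name `words_to_table`, the `(text, x, y)` input format, and the column edges are assumptions made for the example, with `y` measured downward from the top of the page.

```python
from bisect import bisect_right


def words_to_table(words, column_edges, row_tolerance=2.0):
    """Group positioned words into rows and columns.

    words: iterable of (text, x, y) tuples (hypothetical format, units in points).
    column_edges: sorted x positions that bound the columns, e.g. [0, 100, 200].
    row_tolerance: maximum vertical distance for words to share a row.
    """
    n_cols = len(column_edges) - 1

    # Step 1: cluster words into rows by vertical position.
    row_groups = []
    for word in sorted(words, key=lambda w: w[2]):
        if row_groups and abs(word[2] - row_groups[-1][0][2]) <= row_tolerance:
            row_groups[-1].append(word)
        else:
            row_groups.append([word])

    # Step 2: inside each row, place words left to right into their column.
    table = []
    for group in row_groups:
        cells = [[] for _ in range(n_cols)]
        for text, x, _ in sorted(group, key=lambda w: w[1]):
            col = bisect_right(column_edges, x) - 1
            if 0 <= col < n_cols:  # ignore words outside the table area
                cells[col].append(text)
        table.append([" ".join(cell) for cell in cells])
    return table


# Made-up word positions for a three-column statement table.
words = [
    ("Date", 15, 100), ("Description", 110, 100), ("Amount", 210, 100.5),
    ("01-Jan", 15, 120), ("Opening", 110, 120), ("balance", 150, 120.3),
    ("250.00", 212, 119.6),
]
for row in words_to_table(words, column_edges=[0, 100, 200, 300]):
    print(row)
```

In practice the word positions would come from a PDF text extractor and the column edges from inspecting the page, so both need tuning per document.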

Installation (the brew command is for macOS; tabula-py also needs a Java runtime):

pip install matplotlib
pip install "camelot-py[cv]"
pip install tabula-py PyPDF2
brew install ghostscript tcl-tk

import camelot
# import camelot.io as camelot
import pandas as pd
import PyPDF2 as pyPdf
import matplotlib.pyplot as plt
# import tabula

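# Parse the tables in the PDF (Camelot reads only page 1 by default).
tables = camelot.read_pdf('icic.pdf')

# Plot the text elements on the page that holds the first table.
camelot.plot(tables[0], kind='text')
plt.show()

# Plot the table grid that Camelot detected.
camelot.plot(tables[0], kind='grid')
plt.show()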
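
To make the idea behind the grid plot concrete, here is a minimal sketch of how ruling lines turn into table cells. It assumes a complete grid with no merged cells, and that the line positions are already known. Camelot's lattice parser finds the lines with image processing and handles far more cases, so the function name `cells_from_lines` and its inputs are purely illustrative.

```python
def cells_from_lines(horizontal_ys, vertical_xs):
    """Build cell bounding boxes from the positions of ruling lines.

    horizontal_ys: y positions of the horizontal lines.
    vertical_xs: x positions of the vertical lines.
    Returns a list of rows, each a list of (x1, y1, x2, y2) boxes.
    """
    ys = sorted(set(horizontal_ys))
    xs = sorted(set(vertical_xs))
    grid = []
    # Each pair of neighbouring horizontal lines bounds one row,
    # each pair of neighbouring vertical lines bounds one column.
    for top, bottom in zip(ys, ys[1:]):
        grid.append([(left, top, right, bottom) for left, right in zip(xs, xs[1:])])
    return grid


# Three horizontal and three vertical lines give a 2 x 2 grid of cells.
for row in cells_from_lines([0, 20, 40], [0, 50, 150]):
    print(row)
```

Once the boxes are known, any text whose coordinates fall inside a box belongs to that cell.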

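from PyPDF2 import PdfReader
from tabula import read_pdf  # provided by the tabula-py package

# Count the pages so that each one can be read separately.
reader = PdfReader("icic.pdf")
n = len(reader.pages)

df = []
for page in range(1, n + 1):
    if page == 1:
        # On page 1 the table fills only part of the page, so restrict
        # extraction to that area: (top, left, bottom, right) in points.
        df.append(read_pdf("icic.pdf", area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf("icic.pdf", pages=page))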
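
The same area can be reused with Camelot, but the two libraries describe it differently. tabula takes (top, left, bottom, right) measured from the top-left corner of the page, while Camelot's `table_areas` expects "x1,y1,x2,y2" strings in PDF coordinate space, whose origin is the bottom-left corner. The helper below is a small sketch of that conversion. Its name and the page height of 842 points (roughly A4) are assumptions for the example, so check the real page size of your file.

```python
def tabula_area_to_camelot(area, page_height):
    """Convert a tabula area to a Camelot table_areas string.

    area: (top, left, bottom, right) in points, measured from the top-left.
    page_height: page height in points (assumed known, e.g. about 842 for A4).
    """
    top, left, bottom, right = area
    # Flip the vertical axis: PDF coordinates grow upward from the bottom.
    x1, y1 = left, page_height - top      # left-top corner
    x2, y2 = right, page_height - bottom  # right-bottom corner
    return ",".join(f"{value:.2f}" for value in (x1, y1, x2, y2))


area_string = tabula_area_to_camelot((530, 12.75, 790.5, 561), page_height=842.0)
print(area_string)
# Possible use with Camelot (not run here):
# camelot.read_pdf("icic.pdf", pages="1", flavor="stream", table_areas=[area_string])
```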
