
Tabular Data Extraction from Invoice Documents


Extracting information from tables is a long-standing problem in machine learning and image processing. Although the latest advances in deep learning have seen a lot of success, tabular data extraction remains a challenge because of the vast number of ways in which tables are represented, both visually and structurally. Below are some examples:

Figs. 1-5: sample tables illustrating the variety of visual and structural layouts.

Invoice Documents

Many companies process their bills in the form of invoices, which contain tables holding information about the items along with their prices and quantities. This information generally needs to be stored in a database while the invoices are processed.

Traditionally, this information is entered into database software by hand; however, this approach has some drawbacks:

1. The whole process is time-consuming.

2. Errors may be introduced during data entry.

3. Manual data entry adds extra cost.

An invoice automation system can be deployed to address these shortcomings. The idea is that the user uploads the invoice document and the system reads it and generates the tabular information in digital format, making the whole process faster and more cost-effective for companies.

Fig. 6

Fig. 6 shows a sample invoice that contains regular invoice details such as the invoice number, invoice date, and company details, along with two tables holding transaction information. Our goal is to extract the information present in these two tables.

Tabular Information

The problem of extracting tables from invoices can be condensed into two main subtasks:

1. Table Detection

2. Table Structure Extraction

What is Table Detection?

Table Detection is the process of identifying and locating the tables present in a document, usually an image. There are multiple ways to detect tables in an image. Some approaches make use of image processing toolkits like OpenCV, while others apply statistical models to features extracted from the documents, such as text position and text characteristics. More recently, deep learning approaches have been used to detect tables with trained neural networks, similar to those used in object detection.
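To make the classical, non-learned approach concrete, below is a minimal OpenCV sketch that looks for a bordered table by extracting its ruling lines. The file name, kernel lengths, and size thresholds are all assumptions and would need tuning per document:

    # Classical table detection sketch: find horizontal and vertical ruling
    # lines with morphological opening, then take large contours of their
    # union as candidate table regions.
    import cv2

    img = cv2.imread("invoice.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
    binary = cv2.adaptiveThreshold(~img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 15, -2)

    # Elongated kernels keep only long horizontal/vertical strokes.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

    # Regions rich in both line types are likely bordered tables.
    mask = cv2.bitwise_or(horizontal, vertical)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 100 and h > 50:  # assumed minimum table size
            print("candidate table region:", x, y, w, h)

Note that this only works for tables with visible borders, which is exactly the limitation that motivates the learned approach described below.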

What is Table Structure Extraction?

Table Structure Extraction is the process of extracting the tabular information once the boundaries of the table have been detected through Table Detection. The information within the rows and columns is then extracted and transferred to the desired format, usually a CSV or Excel file.

Table Detection using Faster RCNN

Faster RCNN is a neural network model from the RCNN family. It is the successor of Fast RCNN, created by Ross Girshick in 2015. The name signifies an improvement over the previous model in both training speed and detection speed.

To read more about the model framework, one can access the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

There are many other object detection model architectures available for use today. Each comes with certain advantages and disadvantages in terms of prediction accuracy, model parameter size, inference speed, and so on.

For the task of detecting tables in invoice documents, we select the Faster RCNN model with an FPN (Feature Pyramid Network) as the feature extraction network. The model uses a ResNet-101 backbone pre-trained on the ImageNet corpus, a public dataset consisting of more than 20,000 image categories of everyday objects. We use the PyTorch framework to train and test the model.

The above-mentioned model gives us a fast inference time and a high mean average precision. It is preferred for cases where quick, real-time detection is desired.

First, the model is trained on public table detection datasets such as Marmot and UNLV. Next, we further fine-tune it on our custom labeled dataset. For labeling, we follow the COCO annotation format.
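As a concrete illustration, here is a minimal training sketch assuming Detectron2, a PyTorch-based detection library. The dataset name, file paths, and solver settings are hypothetical, and the model-zoo checkpoint used here is COCO-pretrained rather than the ImageNet-pretrained backbone described above:

    # Minimal Detectron2 training sketch for a one-class "table" detector.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances
    from detectron2.engine import DefaultTrainer

    # Register the custom invoice dataset (COCO-format JSON + image folder).
    register_coco_instances("invoice_tables_train", {},
                            "annotations/train.json", "images/train")

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml")  # pre-trained weights
    cfg.DATASETS.TRAIN = ("invoice_tables_train",)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "table" class
    cfg.SOLVER.MAX_ITER = 3000           # assumed; tune to the dataset size

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()  # checkpoints are written to cfg.OUTPUT_DIR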

Once trained, the model achieved an accuracy close to 86% on our custom dataset. There are certain scenarios where the model fails to locate tables, such as cases containing watermarks and/or overlapping text. Tables without borders are also missed in a few instances. However, the model has shown its ability to learn from examples and detect tables across many different invoice documents.

Fig. 7

After running inference on the sample invoice from Fig. 6, we can see the two table boundaries detected by the model in Fig. 7. The first table is detected with 100% confidence and the second with 99% confidence.
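A matching inference sketch, reusing the cfg object from the training sketch above; the image path, weights file location, and score threshold are assumptions:

    # Run the trained detector on a new invoice image.
    import cv2
    from detectron2.engine import DefaultPredictor

    cfg.MODEL.WEIGHTS = "output/model_final.pth"   # written by DefaultTrainer
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7    # assumed confidence cutoff
    predictor = DefaultPredictor(cfg)

    outputs = predictor(cv2.imread("sample_invoice.png"))
    instances = outputs["instances"].to("cpu")
    for box, score in zip(instances.pred_boxes, instances.scores):
        # box is (x1, y1, x2, y2); score is the detection confidence
        print(f"table at {box.tolist()} with confidence {score:.2f}")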

Table Structure Extraction

Once the boundaries of a table are detected by the model, an OCR (Optical Character Recognition) engine is used to extract the text within those boundaries. The extracted text is then processed together with its positional information to reconstruct the structure of the table.

We were able to extract the correct structure of the table, including its headers and line items, using heuristics derived from the invoice layouts; a minimal sketch of one such heuristic follows. The difficulty of this process depends on the type of invoice format at hand.
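Below is one minimal way such a heuristic might look, assuming Tesseract (via pytesseract) as the OCR engine, which is not named above. Words are grouped into rows by their vertical position and written to a CSV file; the crop coordinates and row tolerance are assumptions:

    # Structure-extraction sketch: OCR a detected table crop, group words
    # into rows by y-coordinate, and write the rows to CSV.
    import csv

    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread("sample_invoice.png")
    x1, y1, x2, y2 = 50, 400, 1100, 700      # assumed box from the detector
    table_crop = image[y1:y2, x1:x2]

    def extract_rows(crop, row_tolerance=10):
        """OCR a table crop and group words into rows by vertical position."""
        data = pytesseract.image_to_data(crop, output_type=Output.DICT)
        rows = {}
        for text, top, left in zip(data["text"], data["top"], data["left"]):
            if not text.strip():
                continue
            key = top // row_tolerance  # words with similar 'top' share a row
            rows.setdefault(key, []).append((left, text))
        # Sort rows top-to-bottom and words left-to-right within each row.
        return [[w for _, w in sorted(words)] for _, words in sorted(rows.items())]

    with open("table_1.csv", "w", newline="") as f:
        csv.writer(f).writerows(extract_rows(table_crop))

A real implementation would also have to infer column boundaries, which is exactly where the challenges listed below come in.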

There are multiple challenges that one may encounter while building an algorithm to extract structure. Some of them are:

  1. The spans of some table columns may overlap, making it difficult to determine the boundaries between columns.
  2. The fonts and sizes within tables may vary from one table to another; the algorithm should be able to accommodate this variation.
  3. A table may be split across two pages, and detecting the continuation of a table can be challenging.

Certain deep learning approaches to determining the structure of a table have also been published recently. However, training them on custom datasets remains a challenge.

Fig. 8

The final result is then stored in a CSV file that can be edited or stored as convenient, as shown in Fig. 8, which displays the information from the first table.

Conclusion

The deep learning approach to extracting information from structured documents is a step in the right direction. With high accuracy and low running time, these systems will only learn to perform better as they see more data. Recent and upcoming advances in computer vision have made processes such as invoice automation significantly more accessible and robust.

About the author:

Prateek Sethi is a Data Scientist working at Mantra Labs. His work involves leveraging Artificial Intelligence to create data-driven solutions. Apart from his work, he takes a keen interest in football and exploring the outdoors.


Silent Drains: How Poor Data Observability Costs Enterprises Millions

Let’s rewind the clock for a moment. Thousands of years ago, humans had a simple way of keeping tabs on things, quite literally: they carved marks into clay tablets to track grain harvests or seal trade agreements. These ancient scribes kickstarted what would later become one of humanity’s greatest pursuits: organizing and understanding data.

Now, here’s the kicker: we’ve gone from storing data on clay to storing it in the cloud, yet one age-old problem still nags at us. How healthy is that data? Can we trust it?

Think about it: records from centuries ago survived and still make sense today because someone cared enough to store them and keep them in good shape. That’s essentially what data observability does for our modern world. It’s like a health monitor for your data systems, ensuring they’re reliable, accurate, and ready for action. Below is how it works, along with some of its real-world wins.

How Data Observability Works

Data observability involves monitoring, analyzing, and ensuring the health of your data systems in real time. Here’s how it functions (a toy sketch of the first two points follows the list):

  1. Data Monitoring: Continuously tracks metrics like data volume, freshness, and schema consistency to spot anomalies early.
  2. Automated Alerts: Notifies teams of irregularities, such as unexpected data spikes or pipeline failures, before they escalate.
  3. Root Cause Analysis: Pinpoints the source of issues using lineage tracking, making problem-solving faster and more efficient.
  4. Proactive Maintenance: Predicts potential failures by analyzing historical trends, helping enterprises stay ahead of disruptions.
  5. Collaboration Tools: Bridges gaps between data engineering, analytics, and operations teams with a shared understanding of system health.
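As a toy illustration of the first two points, here is a sketch of a freshness and volume check; the SLA, threshold, and inputs are all hypothetical, and a production system would derive them from historical trends:

    # Toy freshness/volume monitor for a single table.
    from datetime import datetime, timedelta, timezone

    FRESHNESS_SLA = timedelta(hours=1)   # assumed update SLA
    VOLUME_DROP_RATIO = 0.5              # alert below 50% of the trailing mean

    def check_table(last_loaded_at, recent_row_counts, latest_row_count):
        """Return alert messages for stale or unexpectedly small loads."""
        alerts = []
        if datetime.now(timezone.utc) - last_loaded_at > FRESHNESS_SLA:
            alerts.append("freshness: table has not updated within the SLA")
        baseline = sum(recent_row_counts) / len(recent_row_counts)
        if latest_row_count < baseline * VOLUME_DROP_RATIO:
            alerts.append(f"volume: {latest_row_count} rows vs ~{baseline:.0f} expected")
        return alerts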

Real-World Wins with Data Observability

1. Preventing Retail Chaos

A global retailer was struggling with the complexity of scaling data operations across diverse regions; faced with a vast and complex system, manual oversight had become unsustainable. Rakuten provided data observability solutions that leveraged real-time monitoring and integrated ITSM tooling with a unified data health dashboard, enabling the retailer to prevent costly downtime and ensure seamless data operations. The result? Enhanced data lineage tracking and reduced operational overhead.

2. Fixing Silent Pipeline Failures

Monte Carlo’s data observability solutions have saved organizations from silent data pipeline failures. For example, a Salesforce password expiry caused updates to stop in the salesforce_accounts_created table. Monte Carlo flagged the issue, allowing the team to resolve it before it caught executives’ attention. Similarly, an authorization issue with Google Ads integrations was detected and fixed, avoiding significant data loss.

3. Forbes Optimizes Performance

To ensure its website performs optimally, Forbes turned to Datadog for data observability. Previously, siloed data and limited access slowed down troubleshooting. With Datadog, Forbes unified observability across teams, reducing homepage load times by 37% and maintaining operational efficiency during high-traffic events like Black Friday.

4. Lenovo Maintains Uptime

Lenovo leveraged observability, provided by Splunk, to monitor its infrastructure during critical periods. Despite a 300% increase in web traffic on Black Friday, Lenovo maintained 100% uptime and reduced mean time to resolution (MTTR) by 83%, ensuring a flawless user experience.

Why Every Enterprise Needs Data Observability Today

1. Prevent Costly Downtime

Data downtime can cost enterprises up to $9,000 per minute. Imagine a retail giant facing data pipeline failures during peak sales—inventory mismatches lead to missed opportunities and unhappy customers. Data observability proactively detects anomalies, like sudden drops in data volume, preventing disruptions before they escalate.

2. Boost Confidence in Data

Poor data quality costs the U.S. economy $3.1 trillion annually. For enterprises, accurate, observable data ensures reliable decision-making and better AI outcomes. For instance, an insurance company can avoid processing errors by identifying schema changes or inconsistencies in real-time.

3. Enhance Collaboration

When data pipelines fail, teams often waste hours diagnosing issues. Data observability simplifies this by providing clear insights into pipeline health, enabling seamless collaboration across data engineering, data analytics, and data operations teams. This reduces finger-pointing and accelerates problem-solving.

4. Stay Agile Amid Complexity

As enterprises scale, data sources multiply, making data pipeline monitoring and management more complex. Data observability acts as a compass, pinpointing where and why issues occur and allowing organizations to adapt quickly without compromising operational efficiency.

The Bigger Picture:

Are you relying on broken roads in your data metropolis, or are you ready to embrace a system that keeps your operations smooth and your outcomes predictable?

Just as humanity evolved from carving records on clay tablets to storing data in the cloud, the way we manage and interpret data must evolve too. Data observability is not just a tool for keeping your data clean; it’s a strategic necessity to future-proof your business in a world where insights are the cornerstone of success. 

At Mantra Labs, we understand this deeply. With our partnership with Rakuten, we empower enterprises with advanced data observability solutions tailored to their unique challenges. Let us help you turn your data into an invaluable asset that ensures smooth operations and drives impactful outcomes.
