Table Recognition and Extraction from PDF Documents
Table recognition and extraction (TREX) has recently received considerable attention, to the point that it can be considered a research field "per se". As a result, a large body of work on approaches and systems for recognizing and extracting tables from documents with different internal encodings is now available in the literature.
An important limitation of TREX as a "per se" research field is the lack of standard datasets, which hinders objective and complete comparisons among existing approaches.
The PDF-TREX dataset, freely available on this page, aims to contribute to the definition of standard datasets in the TREX field. The dataset contains 100 documents and 164 tables with different layouts.
Download the PDF-TREX dataset here