Extract Data from PDF using REST API in Node.js

We can easily parse PDF documents in the cloud and extract specific data, including fields and table data, using a user-defined template. In this article, we will learn how to extract data from PDF using a REST API in Node.js.

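The following topics are covered in this article:

  1. REST API and Node.js SDK to Extract Data from PDF
  2. Extract Data using JSON based Template File in Node.js
  3. Extract Information From PDF using Template Object in Node.js
  4. Parse Document Inside Container using Template in Node.js
  5. Try Online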

REST API and Node.js SDK to Extract Data from PDF

To parse PDF documents and extract data based on a template, we will use the Node.js SDK of the GroupDocs.Parser Cloud API. It can also parse other supported document types and extract text, images, and template-based data from them. Install it using the following command in the console:

npm install groupdocs-parser-cloud

Please get your Client ID and Secret from the dashboard before following the steps below. Once you have them, add them to the code as shown below:

// This code example demonstrates how to add your Client ID and Secret.
const groupdocs_parser_cloud = require("groupdocs-parser-cloud");
global.clientId = "112f0f38-9dae-42d5-b4fc-cc84ae644972";
global.clientSecret = "16ad3fe0bdc39c910f57d2fd48a5d618";
global.myStorage = "";
const configuration = new groupdocs_parser_cloud.Configuration(clientId, clientSecret);
configuration.apiBaseUrl = "https://api.groupdocs.cloud";
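
Hard-coded credentials are fine for a quick demo, but they are easy to leak once the code is shared. As a minimal sketch, the Client ID and Secret could be read from environment variables instead. The helper name loadCredentials and the variable names GROUPDOCS_CLIENT_ID and GROUPDOCS_CLIENT_SECRET are assumptions made for this example; the SDK does not require them.

```javascript
// Hypothetical helper: reads the Client ID and Secret from environment variables.
// GROUPDOCS_CLIENT_ID and GROUPDOCS_CLIENT_SECRET are names chosen for this sketch,
// not names defined by the SDK.
function loadCredentials(env = process.env) {
  const clientId = env.GROUPDOCS_CLIENT_ID;
  const clientSecret = env.GROUPDOCS_CLIENT_SECRET;
  if (!clientId || !clientSecret) {
    throw new Error("Set GROUPDOCS_CLIENT_ID and GROUPDOCS_CLIENT_SECRET first.");
  }
  return { clientId, clientSecret };
}

// Possible usage with the Configuration class from the SDK:
// const { clientId, clientSecret } = loadCredentials();
// const configuration = new groupdocs_parser_cloud.Configuration(clientId, clientSecret);
```

Failing early with a clear message is a design choice here: a missing variable is reported before any request is sent to the API.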

Extract Data using JSON based Template File in Node.js

We can extract data from PDF documents using a template by following the simple steps given below:

Upload the Document

First, we will upload the PDF document to the cloud using the code sample given below:

// This code example demonstrates how to upload a PDF document to the cloud.
const fs = require("fs");
// Construct FileApi
let fileApi = groupdocs_parser_cloud.FileApi.fromConfig(configuration);
let file = 'C:\\Files\\companies.pdf';
// Read file
fs.readFile(file, (err, fileStream) => {
  // Stop if the local file cannot be read
  if (err) {
    console.error("Could not read file: " + err.message);
    return;
  }
  // Upload file request
  let request = new groupdocs_parser_cloud.UploadFileRequest("companies.pdf", fileStream, myStorage);
  // Upload file and report the outcome
  fileApi.uploadFile(request)
    .then(() => console.log("File uploaded."))
    .catch(uploadErr => console.error("Upload failed: " + uploadErr.message));
});

As a result, the uploaded PDF file will be available in the files section of the dashboard on the cloud.

Extract Data from PDF using JSON based Template File

We can parse the PDF document and extract data using a JSON-based template file by following the steps given below:

  1. Create an instance of the ParseApi.
  2. Provide the uploaded PDF file path.
  3. Set the path to the template JSON file.
  4. Finally, parse the document and extract the data.

The following code sample shows how to extract data according to the template provided in the JSON file using a REST API.

// This code example demonstrates how to parse a PDF document using a JSON-based template.
// The code runs inside an async function so that "await" is valid in a CommonJS script.
async function parseByTemplateFile() {
  // Create an instance of the API
  let parseApi = groupdocs_parser_cloud.ParseApi.fromConfig(configuration);
  // Input file path
  let fileInfo = new groupdocs_parser_cloud.FileInfo();
  fileInfo.filePath = "companies.pdf";
  // Create Parse options
  let options = new groupdocs_parser_cloud.ParseOptions();
  options.fileInfo = fileInfo;
  options.templatePath = "template.json";
  // Create Parse request
  let request = new groupdocs_parser_cloud.ParseRequest(options);
  // Parse the document
  let response = await parseApi.parse(request);
  // Display output
  response.fieldsData.forEach(data => {
    if (data.pageArea.pageTextArea != null) {
      console.log("Field name: " + data.name + ". Text: " + data.pageArea.pageTextArea.text);
    }
    if (data.pageArea.pageTableArea != null) {
      console.log("Table name: " + data.name);
      data.pageArea.pageTableArea.pageTableAreaCells.forEach(cell => {
        console.log("Table cell. Row " + cell.rowIndex + " column " + cell.columnIndex + ". Text: " + cell.pageArea.pageTextArea.text);
      });
    }
  });
}
parseByTemplateFile().catch(console.error);
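
Printing each field is useful for a first look, but application code usually needs the values in a structure it can query. The following is a minimal sketch of one way to do that. The helper collectResults is hypothetical, and it assumes the response shape used in the sample above (name, pageArea.pageTextArea, and pageArea.pageTableArea.pageTableAreaCells with rowIndex and columnIndex). The sample data is made up so that the sketch runs without calling the API.

```javascript
// Hypothetical helper: turns the fieldsData array into plain lookup objects.
// Row and column indexes are used exactly as they arrive in the response.
function collectResults(fieldsData) {
  const fields = {};
  const tables = {};
  for (const data of fieldsData) {
    const area = data.pageArea || {};
    if (area.pageTextArea != null) {
      fields[data.name] = area.pageTextArea.text;
    }
    if (area.pageTableArea != null) {
      const rows = [];
      for (const cell of area.pageTableArea.pageTableAreaCells || []) {
        const textArea = cell.pageArea && cell.pageArea.pageTextArea;
        if (!rows[cell.rowIndex]) rows[cell.rowIndex] = [];
        // Cells without text are stored as empty strings
        rows[cell.rowIndex][cell.columnIndex] = textArea ? textArea.text : "";
      }
      tables[data.name] = rows;
    }
  }
  return { fields, tables };
}

// Mock data shaped like the response in the sample above (values are invented)
const sample = [
  { name: "CompanyName", pageArea: { pageTextArea: { text: "Acme" } } },
  {
    name: "Companies",
    pageArea: {
      pageTableArea: {
        pageTableAreaCells: [
          { rowIndex: 0, columnIndex: 0, pageArea: { pageTextArea: { text: "Name" } } },
          { rowIndex: 0, columnIndex: 1, pageArea: { pageTextArea: { text: "City" } } },
        ],
      },
    },
  },
];
console.log(JSON.stringify(collectResults(sample), null, 2));
```

With a real response, the same function could be called as collectResults(response.fieldsData).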

Please find below the template in JSON format.

{
  "Fields": [
    {
      "FieldName": "Address",
      "FieldPosition": {
        "FieldPositionType": "Regex",
        "Regex": "Company address:"
      }
    },
    {
      "FieldName": "CompanyAddress",
      "FieldPosition": {
        "FieldPositionType": "Linked",
        "LinkedFieldName": "ADDRESS",
        "IsRightLinked": true,
        "SearchArea": {
          "Height": 10.0,
          "Width": 100.0
        },
        "AutoScale": true
      }
    },
    {
      "FieldName": "Company",
      "FieldPosition": {
        "FieldPositionType": "Regex",
        "Regex": "Company name:"
      }
    },
    {
      "FieldName": "CompanyName",
      "FieldPosition": {
        "FieldPositionType": "Linked",
        "LinkedFieldName": "Company",
        "IsRightLinked": true,
        "SearchArea": {
          "Height": 10.0,
          "Width": 100.0
        },
        "AutoScale": true
      }
    }
  ],
  "Tables": [
    {
      "TableName": "Companies",
      "DetectorParameters": {
        "Rectangle": {
          "Position": {
            "X": 77.0,
            "Y": 279.0
          },
          "Size": {
            "Height": 41.0,
            "Width": 480.0
          }
        }
      }
    }
  ]
}
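
A linked field only makes sense if the field it points to exists in the same template. As a rough sketch, a template could be checked locally before it is used. The function findUnresolvedLinks is a hypothetical helper, not part of the SDK. It compares names case-insensitively, which is an assumption of this sketch based on the template above linking "ADDRESS" to a field named "Address".

```javascript
// Hypothetical check: returns the names of Linked fields whose LinkedFieldName
// does not match any FieldName in the template.
// Case-insensitive matching is an assumption of this sketch.
function findUnresolvedLinks(template) {
  const fields = template.Fields || [];
  const known = new Set(fields.map(f => String(f.FieldName).toLowerCase()));
  return fields
    .filter(f => f.FieldPosition && f.FieldPosition.FieldPositionType === "Linked")
    .filter(f => !known.has(String(f.FieldPosition.LinkedFieldName).toLowerCase()))
    .map(f => f.FieldName);
}

// Small template with one deliberately broken link ("Company" is not defined)
const sampleTemplate = {
  Fields: [
    { FieldName: "Address", FieldPosition: { FieldPositionType: "Regex", Regex: "Company address:" } },
    { FieldName: "CompanyAddress", FieldPosition: { FieldPositionType: "Linked", LinkedFieldName: "ADDRESS" } },
    { FieldName: "CompanyName", FieldPosition: { FieldPositionType: "Linked", LinkedFieldName: "Company" } },
  ],
};
console.log(findUnresolvedLinks(sampleTemplate));
```

For a template stored on disk, the object could be loaded with JSON.parse(fs.readFileSync("template.json", "utf8")) and passed to the same function.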

Extract Information From PDF using Template Object in Node.js

We can extract data from a PDF file based on the template defined as an object by following the steps given below:

  1. Create an instance of the ParseApi.
  2. Provide the uploaded PDF file path.
  3. Initialize a Template as an object.
  4. Finally, parse the document and extract the data.

The following code sample shows how to extract data according to the defined template from a PDF document using a REST API. Please follow the steps mentioned earlier to upload the file.

// This code example demonstrates how to parse a PDF document using a template object.
// The code runs inside an async function so that "await" is valid in a CommonJS script.
async function parseByTemplateObject() {
  // API initialization
  let parseApi = groupdocs_parser_cloud.ParseApi.fromConfig(configuration);
  // Input file
  let fileInfo = new groupdocs_parser_cloud.FileInfo();
  fileInfo.filePath = "companies.pdf";
  // Define parse options
  let options = new groupdocs_parser_cloud.ParseOptions();
  options.fileInfo = fileInfo;
  // Get template object
  options.template = GetTemplate();
  // Create parse request
  let request = new groupdocs_parser_cloud.ParseRequest(options);
  // Parse the document
  let result = await parseApi.parse(request);
  // Show results
  result.fieldsData.forEach(data => {
    if (data.pageArea.pageTextArea != null) {
      console.log("Field name: " + data.name + ". Text: " + data.pageArea.pageTextArea.text);
    }
    if (data.pageArea.pageTableArea != null) {
      console.log("Table name: " + data.name);
      data.pageArea.pageTableArea.pageTableAreaCells.forEach(cell => {
        console.log("Table cell. Row " + cell.rowIndex + " column " + cell.columnIndex + ". Text: " + cell.pageArea.pageTextArea.text);
      });
    }
  });
}
parseByTemplateObject().catch(console.error);

Please find below the template object created according to the PDF document. It is returned by the GetTemplate() function used in the code sample above.

// This code example demonstrates a template object.
function GetTemplate() {
  // Field 1: the "Company address:" label, located by a regular expression
  let field1 = new groupdocs_parser_cloud.Field();
  field1.fieldName = "Address";
  let fieldPosition1 = new groupdocs_parser_cloud.FieldPosition();
  fieldPosition1.fieldPositionType = "Regex";
  fieldPosition1.regex = "Company address:";
  field1.fieldPosition = fieldPosition1;
  // Field 2: the address value, linked to the right of field 1
  let field2 = new groupdocs_parser_cloud.Field();
  field2.fieldName = "CompanyAddress";
  let fieldPosition2 = new groupdocs_parser_cloud.FieldPosition();
  fieldPosition2.fieldPositionType = "Linked";
  fieldPosition2.linkedFieldName = "ADDRESS";
  fieldPosition2.isRightLinked = true;
  let size2 = new groupdocs_parser_cloud.Size();
  size2.width = 100;
  size2.height = 10;
  fieldPosition2.searchArea = size2;
  fieldPosition2.autoScale = true;
  field2.fieldPosition = fieldPosition2;
  // Field 3: the "Company name:" label, located by a regular expression
  let field3 = new groupdocs_parser_cloud.Field();
  field3.fieldName = "Company";
  let fieldPosition3 = new groupdocs_parser_cloud.FieldPosition();
  fieldPosition3.fieldPositionType = "Regex";
  fieldPosition3.regex = "Company name:";
  field3.fieldPosition = fieldPosition3;
  // Field 4: the company name value, linked to the right of field 3
  let field4 = new groupdocs_parser_cloud.Field();
  field4.fieldName = "CompanyName";
  let fieldPosition4 = new groupdocs_parser_cloud.FieldPosition();
  fieldPosition4.fieldPositionType = "Linked";
  fieldPosition4.linkedFieldName = "Company";
  fieldPosition4.isRightLinked = true;
  let size4 = new groupdocs_parser_cloud.Size();
  size4.width = 100;
  size4.height = 10;
  fieldPosition4.searchArea = size4;
  fieldPosition4.autoScale = true;
  field4.fieldPosition = fieldPosition4;
  // Table: detected inside a rectangle on the page
  let table = new groupdocs_parser_cloud.Table();
  table.tableName = "Companies";
  let detectorparams = new groupdocs_parser_cloud.DetectorParameters();
  let rect = new groupdocs_parser_cloud.Rectangle();
  let size = new groupdocs_parser_cloud.Size();
  size.height = 60;
  size.width = 480;
  let position = new groupdocs_parser_cloud.Point();
  position.x = 77;
  position.y = 279;
  rect.size = size;
  rect.position = position;
  detectorparams.rectangle = rect;
  table.detectorParameters = detectorparams;
  // Assemble the template
  let fields = [field1, field2, field3, field4];
  let tables = [table];
  let template = new groupdocs_parser_cloud.Template();
  template.fields = fields;
  template.tables = tables;
  return template;
}
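
The template object repeats the same few assignments for every field, so small factory functions can shorten it. The sketch below shows one possible shape. The helpers makeRegexField and makeLinkedField are hypothetical, and the sdk parameter stands for the groupdocs_parser_cloud module; the stand-in classes exist only so the example runs without the SDK installed.

```javascript
// Hypothetical factory: a field located by a regular expression.
// `sdk` is expected to expose Field, FieldPosition and Size constructors,
// as the groupdocs_parser_cloud module does in the samples above.
function makeRegexField(sdk, name, regex) {
  const position = new sdk.FieldPosition();
  position.fieldPositionType = "Regex";
  position.regex = regex;
  const field = new sdk.Field();
  field.fieldName = name;
  field.fieldPosition = position;
  return field;
}

// Hypothetical factory: a field linked to the right of another field.
function makeLinkedField(sdk, name, linkedTo, width, height) {
  const size = new sdk.Size();
  size.width = width;
  size.height = height;
  const position = new sdk.FieldPosition();
  position.fieldPositionType = "Linked";
  position.linkedFieldName = linkedTo;
  position.isRightLinked = true;
  position.searchArea = size;
  position.autoScale = true;
  const field = new sdk.Field();
  field.fieldName = name;
  field.fieldPosition = position;
  return field;
}

// Stand-in classes so the sketch runs on its own; pass the real module instead.
const fakeSdk = { Field: class {}, FieldPosition: class {}, Size: class {} };
const demoFields = [
  makeRegexField(fakeSdk, "Company", "Company name:"),
  makeLinkedField(fakeSdk, "CompanyName", "Company", 100, 10),
];
console.log(demoFields.map(f => f.fieldName));
```

Passing the module in as a parameter keeps the helpers free of a hard dependency, which also makes them easy to exercise in a unit test.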

Parse Document Inside Container using Template in Node.js

We can also parse a PDF document stored inside a container, such as a ZIP archive, and extract data using the template object. Please follow the steps mentioned below to parse a document inside a container.

  1. Create an instance of the ParseApi.
  2. Provide the uploaded archive file path.
  3. Initialize a Template as an object.
  4. Provide the container item.
  5. Finally, parse the document and extract the data.

The following code sample shows how to parse a PDF document inside a ZIP archive using a REST API. Please follow the steps mentioned earlier to upload the archive.

// This code example demonstrates how to parse a PDF document available inside a container.
// The code runs inside an async function so that "await" is valid in a CommonJS script.
async function parseInsideContainer() {
  // API initialization
  let parseApi = groupdocs_parser_cloud.ParseApi.fromConfig(configuration);
  // Input file path
  let fileInfo = new groupdocs_parser_cloud.FileInfo();
  fileInfo.filePath = "archive.zip";
  // Create parse options
  let options = new groupdocs_parser_cloud.ParseOptions();
  options.fileInfo = fileInfo;
  // Get template object
  options.template = GetTemplate();
  // Container item info
  let containerItemInfo = new groupdocs_parser_cloud.ContainerItemInfo();
  containerItemInfo.relativePath = "companies.pdf";
  options.containerItemInfo = containerItemInfo;
  // Create parse request
  let request = new groupdocs_parser_cloud.ParseRequest(options);
  // Parse the document
  let response = await parseApi.parse(request);
  // Display output
  response.fieldsData.forEach(data => {
    if (data.pageArea.pageTextArea != null) {
      console.log("Field name: " + data.name + ". Text: " + data.pageArea.pageTextArea.text);
    }
    if (data.pageArea.pageTableArea != null) {
      console.log("Table name: " + data.name);
      data.pageArea.pageTableArea.pageTableAreaCells.forEach(cell => {
        console.log("Table cell. Row " + cell.rowIndex + " column " + cell.columnIndex + ". Text: " + cell.pageArea.pageTextArea.text);
      });
    }
  });
}
parseInsideContainer().catch(console.error);

Try Online

Please try the following free online PDF parsing tool, which is developed using the above API. https://products.groupdocs.app/parser/pdf

Conclusion

In this article, we have learned how to extract data from PDF documents in the cloud according to a provided template. We have also seen how to define the template either as an object or as a JSON file, and how to parse a PDF document stored inside a ZIP archive. Besides, you can learn more about GroupDocs.Parser Cloud API using the documentation. We also provide an API Reference section that lets you visualize and interact with our APIs directly through the browser. In case of any ambiguity, please feel free to contact us on the forum.
