Parse Word Documents using REST API in Python

We often need to parse Word documents to extract their images or text, for example to analyze the text or to reuse the content in other documents. We can parse DOC or DOCX files and extract all of their images and text programmatically on the cloud. In this article, we will learn how to parse Word documents using a REST API in Python.

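The following topics are covered in this article:

  • Word Document Parser REST API and Python SDK
  • Parse Word Documents and Extract Images using REST API in Python
  • Extract Text from Word Documents using REST API in Python
  • Try Online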

Word Document Parser REST API and Python SDK

For parsing Word documents, we will be using the Python SDK of GroupDocs.Parser Cloud API. Please install it using the following command in the console:

pip install groupdocs_parser_cloud

Please get your Client ID and Secret from the dashboard before following the steps below. Once you have them, add them to your code as shown below:
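A minimal sketch of this setup is given here. It assumes the SDK provides a Configuration class that accepts the Client ID and Secret and exposes an api_base_url attribute, which is how we understand the Python SDK to work; please verify this against the version you installed. The helper name build_configuration is our own and is not part of the SDK.

```python
def build_configuration(client_id, client_secret):
    """Return an SDK configuration object for the given credentials."""
    # Imported inside the helper so the SDK is only required when it is called.
    import groupdocs_parser_cloud

    # Assumption: Configuration(client_id, client_secret) and api_base_url
    # match the installed SDK version; check its documentation if they differ.
    configuration = groupdocs_parser_cloud.Configuration(client_id, client_secret)
    configuration.api_base_url = "https://api.groupdocs.cloud"
    return configuration
```

We would call it as build_configuration("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET") and pass the returned object to the API classes used in the later steps.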

Parse Word Documents and Extract Images using REST API in Python

We can parse Word documents and extract images programmatically by following the steps given below:

Upload the Document

First, we upload the Word document (DOCX) to the cloud using the code example given below:
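The upload step can be sketched as follows. The sketch assumes that the SDK offers FileApi.from_config(), an UploadFileRequest taking the storage path, the local file path and the storage name, and an upload_file() method. The helper name upload_document and its parameters are hypothetical, and configuration stands for the object built from the Client ID and Secret.

```python
def upload_document(configuration, local_path, storage_path, storage_name=""):
    """Upload a local DOCX file to cloud storage and return the API response."""
    # Imported inside the helper so the SDK is only required when it is called.
    import groupdocs_parser_cloud

    # Assumption: FileApi.from_config and UploadFileRequest(path, file, storage)
    # exist with these signatures in the installed SDK version.
    file_api = groupdocs_parser_cloud.FileApi.from_config(configuration)
    request = groupdocs_parser_cloud.UploadFileRequest(
        storage_path,   # where the file will live in cloud storage
        local_path,     # the DOCX file on the local disk
        storage_name,   # empty string selects the default storage
    )
    return file_api.upload_file(request)
```

For example, upload_document(configuration, "C:\\Files\\sample.docx", "sample.docx") would place the file at the root of the default storage.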

As a result, the uploaded DOCX file will be available in the files section of the dashboard on the cloud.

Extract Images from Word Documents using Python

We can easily extract all the images from Word documents programmatically by following the steps given below.

  • Firstly, create an instance of the ParseApi.
  • Next, create an instance of the FileInfo.
  • Then, set the path to the input DOCX file.
  • Next, create an instance of the ImagesOptions.
  • Then, assign the FileInfo to the ImagesOptions.
  • After that, create an ImagesRequest with the ImagesOptions as its argument.
  • Finally, extract the images by calling the ParseApi.images() method with the ImagesRequest.

The following code sample shows how to extract images from a DOCX file using a REST API in Python.
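A minimal sketch of these steps is shown here, under a few assumptions: the options class is written as ImagesOptions, which is the name we believe the SDK uses, and the result is assumed to expose an images list whose items carry a path attribute. The helper name extract_images is hypothetical, and configuration stands for the object built from the Client ID and Secret.

```python
def extract_images(configuration, storage_path):
    """Extract all images from a stored DOCX file and return their storage paths."""
    # Imported inside the helper so the SDK is only required when it is called.
    import groupdocs_parser_cloud

    # Create the ParseApi instance.
    parse_api = groupdocs_parser_cloud.ParseApi.from_config(configuration)

    # Point FileInfo at the input DOCX file in cloud storage.
    file_info = groupdocs_parser_cloud.FileInfo()
    file_info.file_path = storage_path

    # Assumption: the options class is named ImagesOptions in the SDK.
    options = groupdocs_parser_cloud.ImagesOptions()
    options.file_info = file_info

    # Build the request and call the images() method.
    request = groupdocs_parser_cloud.ImagesRequest(options)
    result = parse_api.images(request)

    # Assumption: each returned image has a 'path' attribute.
    return [image.path for image in result.images]
```

Calling extract_images(configuration, "sample.docx") would return the storage paths of the extracted images, which can then be passed to the download step.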

Download Extracted Images

The above code sample will save the extracted images on the cloud. We can download these images using the code example given below:
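The download step might look like the sketch below. It assumes that the SDK provides a DownloadFileRequest taking the storage path and storage name, and that download_file() returns the path of a temporary local file; both points should be checked against the installed SDK. The helper name download_image and its parameters are hypothetical.

```python
import shutil


def download_image(configuration, storage_path, target_dir, storage_name=""):
    """Download one extracted image and move it into target_dir."""
    # Imported inside the helper so the SDK is only required when it is called.
    import groupdocs_parser_cloud

    file_api = groupdocs_parser_cloud.FileApi.from_config(configuration)
    request = groupdocs_parser_cloud.DownloadFileRequest(storage_path, storage_name)

    # Assumption: download_file returns the path of a temporary local file.
    temp_path = file_api.download_file(request)

    # Move the temporary file into the target directory and return its new path.
    return shutil.move(temp_path, target_dir)
```

Here target_dir must be an existing directory; the file keeps the name of the temporary file returned by the API.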

Extract Text from Word Documents using REST API in Python

We can easily extract all the text from Word documents programmatically by following the steps given below.

  • Firstly, create an instance of the ParseApi.
  • Next, create an instance of the FileInfo.
  • Then, set the path to the input DOCX file.
  • Next, create an instance of the TextOptions.
  • Then, assign the FileInfo to the TextOptions.
  • After that, create a TextRequest with the TextOptions as its argument.
  • Finally, get the results by calling the ParseApi.text() method with the TextRequest.

The following code example shows how to extract text from a DOCX file using a REST API.
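The steps above can be sketched as follows. The sketch assumes that the result object exposes the extracted content through a text attribute, as we understand the SDK to do. The helper name extract_text is hypothetical, and configuration stands for the object built from the Client ID and Secret.

```python
def extract_text(configuration, storage_path):
    """Extract all text from a stored DOCX file and return it as a string."""
    # Imported inside the helper so the SDK is only required when it is called.
    import groupdocs_parser_cloud

    # Create the ParseApi instance.
    parse_api = groupdocs_parser_cloud.ParseApi.from_config(configuration)

    # Point FileInfo at the input DOCX file in cloud storage.
    file_info = groupdocs_parser_cloud.FileInfo()
    file_info.file_path = storage_path

    # Assign the FileInfo to the TextOptions.
    options = groupdocs_parser_cloud.TextOptions()
    options.file_info = file_info

    # Build the request and call the text() method.
    request = groupdocs_parser_cloud.TextRequest(options)
    result = parse_api.text(request)

    # Assumption: the extracted text is available as result.text.
    return result.text
```

For example, print(extract_text(configuration, "sample.docx")) would print the text of the uploaded document.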

Try Online

Please try the following free online DOCX parsing tool, which was developed using the above API: https://products.groupdocs.app/parser/docx

Conclusion

In this article, we have learned how to parse Word documents on the cloud. We have also seen how to extract images and text from DOCX files using a REST API in Python, how to upload a DOCX file to the cloud programmatically, and how to download the extracted images from the cloud. You can learn more about the GroupDocs.Parser Cloud API from the documentation. We also provide an API Reference section that lets you visualize and interact with our APIs directly through the browser. If anything is unclear, please feel free to contact us on the forum.