Extract Metadata From PDF File in 2 Ways

Eva Mendis | Modified: 2022-06-13T05:39:31+00:00|PDF Updates | 3 Minutes Reading

Do you want to know how to extract metadata from PDF files? In this blog, we will show you how to do it using Python and the automated PDF Extractor software by SysTools. You can also use this tool for PDF forensics.

But first, let’s understand what metadata means in a PDF.

PDF metadata is structured information that can be used to identify what the document is about. It consists of components such as Title, Author, Subject, Keywords, and Document language.

You can view the metadata in Adobe Acrobat Pro: just go to File > Properties > Description.
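To make these fields concrete, here is a minimal sketch of how some of them can appear inside a PDF file. Title, Author, Subject, and Keywords are normally stored as entries of the document information dictionary. The SAMPLE fragment and the peek_info_entries helper are hypothetical, written only for illustration. The regular expression handles only simple, uncompressed literal strings, so this is not a substitute for a real PDF parser.

```python
import re

# Standard keys of a PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords")

# Matches entries such as "/Title (Quarterly Report)". Only plain literal
# strings are handled; hex strings and nested or escaped parentheses are not.
_ENTRY = re.compile(rb"/(" + "|".join(INFO_KEYS).encode() + rb")\s*\(([^()]*)\)")


def peek_info_entries(raw: bytes) -> dict:
    """Return the Info entries that appear as plain text in raw PDF bytes."""
    found = {}
    for key, value in _ENTRY.findall(raw):
        # Keep the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Hypothetical fragment, written by hand for illustration only.
SAMPLE = (
    b"<< /Title (Quarterly Report) /Author (Jane Doe) "
    b"/Subject (Sales) /Keywords (q1, sales) >>"
)

if __name__ == "__main__":
    print(peek_info_entries(SAMPLE))
```

Real files often compress or encrypt these entries, which is why a dedicated tool or library is the safer choice.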

Extract Metadata From PDF File Using SysTools PDF Extractor

SysTools offers software that extracts components from PDF documents, including metadata. Its interface is quite simple; just follow the steps below to automatically extract metadata from a PDF file:

Step-1: Insert PDF files using Add File(s) / Add Folder, then click Next.

Step-2: Choose what you want to extract from the PDF; in this case, click the Metadata option.

Step-3: The software lets you save the extracted metadata in PDF, DOC, or DOCX file format.

Step-4: You can also apply page settings to choose which pages to take information from: a specific page number, a page range, all odd pages, or all even pages.

Step-5: Hit the Extract button.

You can also extract other components from a PDF with this software, such as attachments and portfolios, text, images, any type of rich media, bookmarks, hyperlinks, highlighted text, and comments.

How to Extract Metadata From PDF Files Using Python

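Well, if you are a techie (and even if you are not), you can use Python to extract the metadata of any PDF. All you have to do is install the PyPDF2 package. The script below uses the older PdfFileReader API, which was removed in PyPDF2 3.0, so install a release before 3.0 (for example, with pip install "PyPDF2<3").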

Let’s see how you can achieve this.

The PDF sample used here is called “reportlab-sample.pdf”.

Here’s the code:

# get_doc_info.py
# Uses the legacy PyPDF2 API (PyPDF2 releases before 3.0).

from PyPDF2 import PdfFileReader


def get_info(path):
    # Open the PDF in read-only binary mode
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    print(info)

    # Individual metadata fields
    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title


if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    get_info(path)

As you can see, we import the PdfFileReader class from PyPDF2. This class can read a PDF and extract data from it. Next, we create our own get_info function, which takes a PDF file path as its argument. Then we open the file in read-only binary mode. Finally, we pass that file handle into PdfFileReader to create an instance of it.

We can use the getDocumentInfo method to extract useful information from PDF files. It returns an instance of PyPDF2.pdf.DocumentInformation, which exposes useful attributes such as author, creator, producer, subject, and title. The getNumPages method returns the number of pages in the document.
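The script above only assigns the attributes to local variables. As a minimal sketch, the same fields can be collected into a plain dictionary so that they are easy to print or save. The info_to_dict and read_metadata helpers are hypothetical names introduced here. read_metadata assumes the newer pypdf package is installed (pip install pypdf), where PdfReader, the metadata property, and len(reader.pages) take the place of PdfFileReader, getDocumentInfo, and getNumPages.

```python
from types import SimpleNamespace

# Attribute names discussed in this article.
FIELDS = ("author", "creator", "producer", "subject", "title")


def info_to_dict(info):
    """Copy the common metadata attributes into a plain dict.

    Fields the PDF does not define (or a missing info object) become None.
    """
    return {name: getattr(info, name, None) for name in FIELDS}


def read_metadata(path):
    """Read metadata with pypdf (assumption: pypdf is installed)."""
    # Imported lazily so info_to_dict can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    result = info_to_dict(reader.metadata)
    result["pages"] = len(reader.pages)
    return result


if __name__ == "__main__":
    # Stand-in object for illustration; a real run would call
    # read_metadata("reportlab-sample.pdf") instead.
    fake_info = SimpleNamespace(author="Jane Doe", title="Sample")
    print(info_to_dict(fake_info))
```

Returning None for absent fields keeps the function from failing on PDFs that define only some of the metadata entries.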

Conclusion

In this blog, you learned two methods to extract metadata from PDF files. First, we used the automated software by SysTools. It is an all-in-one tool that extracts many components from PDF files, such as hyperlinks, text, rich media, inline images, highlighted text, bookmarks, comments, attachments, and portfolios. It’s very easy to run, even for non-technical users. You can download the free version of the tool and check it yourself.

Second, we used Python with the PyPDF2 package to extract PDF metadata.

Go ahead with whichever method you are comfortable with!
