How To Extract Text From PDF In Python

PDF files are widely used for document sharing and storage, but extracting text from them can sometimes be a challenging task. Fortunately, Python provides various libraries that make this process easier. In this guide, we’ll demonstrate how to extract text from PDF files using Python modules PyPDF2, textract, and nltk. Additionally, we’ll address common errors that may occur during execution.

1. Install Python Modules PyPDF2, textract, and nltk.

Open a terminal and run the below commands to install these Python libraries.

pip install PyPDF2
pip install textract
pip install nltk

When installing textract, you may encounter the below error message. This happens because the textract installation needs the swig tool, which is not installed on your OS. Install swig first (you can refer to How To Install Swig On macOS, Linux, And Windows to learn more), then rerun the pip install command.

unable to execute 'swig': No such file or directory

2. Python PDF Text Extract Example.

Open VSCode and create a Python file with the name how-to-extract-text-from-pdf-in-python.py. Copy and paste the below Python code into the file. There are two functions in this file: the first function extracts the PDF text, and the second function splits the text into tokens and removes stop words and punctuation.

'''
This example shows how to extract text content from a pdf file.
'''

import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# This function will extract and return the pdf file text content.
def extract_pdf_text(filePath=''):

    # Open the pdf file in read binary mode.
    fileObject = open(filePath, 'rb')

    # Create a pdf reader.
    pdfReader = PyPDF2.PdfReader(fileObject)

    # Get total pdf page number.
    totalPageNumber = len(pdfReader.pages)

    # Print pdf total page number.
    print('This pdf file contains ' + str(totalPageNumber) + ' pages in total.')

    currentPageNumber = 0
    text = ''

    # Loop in all the pdf pages.
    while currentPageNumber < totalPageNumber:

        # Get the specified pdf page object.
        pdfPage = pdfReader.pages[currentPageNumber]

        # Get pdf page text.
        text = text + pdfPage.extract_text()

        # Process next page.
        currentPageNumber += 1

    if text == '':
        # If no text could be extracted, the pdf is probably a scanned document,
        # so use textract with the tesseract OCR method instead.
        # textract.process returns bytes, so decode it to a string.
        text = textract.process(filePath, method='tesseract', encoding='utf-8').decode('utf-8')

    fileObject.close()
    return text

# This function will remove all stop words and punctuations in the text and return a list of keywords.
def extract_keywords(text):
    # Split the text words into tokens
    wordTokens = word_tokenize(text)

    # Remove the below punctuation marks from the token list.
    punctuations = ['(',')',';',':','[',']',',']

    # Get all stop words in english.
    stopWords = stopwords.words('english')

    # Below list comprehension will return only keywords that are not in stop words and punctuations.
    keywords = [word for word in wordTokens if word not in stopWords and word not in punctuations]
   
    return keywords

if __name__ == '__main__': 

    pdfFilePath = '/Users/songzhao/Documents/WorkSpace/Pro Office 365 Development, 2nd Edition.pdf'
   
    pdfText = extract_pdf_text(pdfFilePath)
    print('There are ' + str(len(pdfText)) + ' characters in the pdf file.')
    #print(pdfText)

    keywords = extract_keywords(pdfText)
    print('There are ' + str(len(keywords)) + ' keywords in the pdf file.')
    #print(keywords)

Right-click in the source code editor and select the Run Python File menu item, or run the script from a terminal with the python command. Then you can get the below output in the console.

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
This pdf file contains 347 pages in total.
There are 481318 characters in the pdf file.
There are 53212 keywords in the pdf file.

3. Handling Execution Errors.

When you run the example, you may encounter some errors. Below are the common errors and how to fix them.

3.1 nltk punkt not found error.

This error occurs when nltk.tokenize.word_tokenize is called without the punkt data installed. Below is the error message.

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt')
  Searched in:
    - '/Users/zhaosong/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
    - ''
**********************************************************************

When seeing the above error message, run the below commands in a Python interactive shell to download the nltk punkt data.

>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to /Users/zhaosong/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
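Once punkt is available, word_tokenize splits a sentence into word and punctuation tokens. As a rough illustration of what the tokenizer produces (a stdlib-only sketch using a regular expression, not nltk's actual algorithm), you can approximate the split like this:

```python
import re

# Rough approximation of word tokenization: match runs of word
# characters, or any single non-space punctuation character.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

sentence = 'Extract text from PDF files, then tokenize it.'
print(simple_tokenize(sentence))
# → ['Extract', 'text', 'from', 'PDF', 'files', ',', 'then', 'tokenize', 'it', '.']
```

Note that punctuation marks become their own tokens, which is why the extract_keywords function filters them out afterwards.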

3.2 nltk stopwords not found error.

This error occurs when nltk.corpus.stopwords is used without the stopwords data installed. Below is the error message.

  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('stopwords')
  Searched in:
    - '/Users/zhaosong/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
**********************************************************************

Run the below commands in a Python interactive shell to fix the error.

>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zhaosong/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
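After the stopwords data is downloaded, the filtering step in extract_keywords removes both stop words and punctuation from the token list. The idea can be sketched without nltk using a tiny hand-picked stop-word set (an illustrative subset; the real list comes from nltk.corpus.stopwords.words('english') and contains far more words):

```python
# Tiny hand-picked subset of English stop words, for illustration only.
stop_words = {'the', 'a', 'is', 'from', 'it', 'then'}
punctuations = ['(', ')', ';', ':', '[', ']', ',']

tokens = ['Extract', 'text', 'from', 'PDF', 'files', ',', 'then', 'tokenize', 'it', '.']

# Same list comprehension as extract_keywords: keep only tokens that
# are neither stop words nor punctuation.
keywords = [w for w in tokens if w not in stop_words and w not in punctuations]
print(keywords)
# → ['Extract', 'text', 'PDF', 'files', 'tokenize', '.']
```

Notice the trailing '.' survives because it is not in the punctuations list; you can extend that list with any other marks you want to strip.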

4. Conclusion.

Extracting text from PDF files using Python is a valuable skill for various applications, including data analysis, information retrieval, and natural language processing tasks.

By leveraging PyPDF2, textract, and nltk libraries, you can efficiently extract text content from PDF documents and further process it as needed. With the steps outlined in this guide, you can seamlessly integrate PDF text extraction into your Python projects.
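As a hint at that further processing, one hypothetical follow-on step is counting keyword frequencies with the standard library's collections.Counter (the keyword list below is a made-up example standing in for the output of extract_keywords):

```python
from collections import Counter

# Hypothetical keyword list, as extract_keywords might return.
keywords = ['pdf', 'text', 'python', 'pdf', 'extract', 'text', 'pdf']

# Count occurrences and show the most frequent keywords.
counts = Counter(keywords)
print(counts.most_common(2))
# → [('pdf', 3), ('text', 2)]
```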
