{"id":20110,"date":"2025-02-11T16:08:41","date_gmt":"2025-02-11T10:38:41","guid":{"rendered":"https:\/\/opstree.com\/blog\/?p=20110"},"modified":"2025-02-24T13:28:55","modified_gmt":"2025-02-24T07:58:55","slug":"extract-text-from-pdf-using-pymupdf-fitz","status":"publish","type":"post","link":"https:\/\/opstree.com\/blog\/extract-text-from-pdf-using-pymupdf-fitz\/","title":{"rendered":"Extract Text from PDF using PyMuPDF (fitz)"},"content":{"rendered":"<h1><b><span data-contrast=\"auto\">Objective<\/span><\/b><span data-contrast=\"auto\">:<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:240}\">\u00a0<\/span><\/h1>\n<p><span data-contrast=\"auto\">This Python script demonstrates how to extract text from a PDF document using the <\/span><b><span data-contrast=\"auto\">PyMuPDF<\/span><\/b><span data-contrast=\"auto\"> (also known as <\/span><b><span data-contrast=\"auto\">fitz<\/span><\/b><span data-contrast=\"auto\">) library. PyMuPDF is a lightweight and efficient library for working with PDF documents, XPS files, and eBooks. It provides functions to extract text, images, and metadata, enabling developers to manipulate and analyze PDF documents with ease.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h3><b><span data-contrast=\"none\">Requirements<\/span><\/b><span data-contrast=\"none\">:<\/span><\/h3>\n<p><span class=\"TextRun SCXW201125029 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW201125029 BCX0\">To use this script, you need to have <\/span><\/span><span class=\"TextRun SCXW201125029 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SpellingErrorV2Themed SCXW201125029 BCX0\">PyMuPDF<\/span><\/span><span class=\"TextRun SCXW201125029 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW201125029 BCX0\"> installed in your Python environment. <\/span><\/span><span class=\"TextRun SCXW201125029 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW201125029 BCX0\">You can install it using the following command:<\/span><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-20116 size-full\" src=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-4.png\" alt=\"\" width=\"567\" height=\"114\" srcset=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-4.png 567w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-4-300x60.png 300w\" sizes=\"auto, (max-width: 567px) 100vw, 567px\" \/><\/p>\n<p><!--more--><\/p>\n<h3><span class=\"TextRun SCXW142760821 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW142760821 BCX0\">Python Script for Extracting Text from a PDF Document:<\/span><\/span><\/h3>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-20112 size-full\" src=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image.png\" alt=\"\" width=\"546\" height=\"89\" srcset=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image.png 546w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-300x49.png 300w\" sizes=\"auto, (max-width: 546px) 100vw, 546px\" \/><\/p>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">import fitz makes the <\/span><b><span data-contrast=\"auto\">PyMuPDF<\/span><\/b><span data-contrast=\"auto\"> library available in your Python code, allowing you to manipulate PDFs, XPS, and eBooks.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"auto\">The comment # PyMuPDF simply reminds the reader that fitz refers to the <\/span><b><span data-contrast=\"auto\">PyMuPDF<\/span><\/b><span data-contrast=\"auto\"> library.<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-20113 size-full\" src=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-1.png\" alt=\"\" width=\"749\" height=\"426\" srcset=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-1.png 749w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-1-300x171.png 300w\" sizes=\"auto, (max-width: 749px) 100vw, 749px\" \/><\/p>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Extract Text from the Page<\/span><\/b><span data-contrast=\"auto\">: Page text contains the text extracted from the current PDF page.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Add New Line<\/span><\/b><span data-contrast=\"auto\">: &#8220;\\n&#8221; adds a line break after the text from the page, ensuring that the content from different pages is separated.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Append to extracted text<\/span><\/b><span data-contrast=\"auto\">: += appends the extracted text and newline to the existing extracted text string, accumulating text from each page in the PDF.<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-20117 size-full\" src=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-5.png\" alt=\"\" width=\"713\" height=\"209\" srcset=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-5.png 713w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-5-300x88.png 300w\" sizes=\"auto, (max-width: 713px) 100vw, 713px\" \/><\/p>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"auto\">This code defines the path to a PDF file and uses the extract_text_from_pdf function to extract its text. It then prints the extracted text to the console.<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-20115\" src=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-3.png\" alt=\"\" width=\"600\" height=\"391\" srcset=\"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-3.png 1006w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-3-300x195.png 300w, https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/image-3-768x500.png 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\"><span class=\"TextRun SCXW200242495 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW200242495 BCX0\">The output will be the extracted text from the PDF.<\/span><\/span><\/span><\/li>\n<\/ul>\n<p><strong>[ Find More about: <a href=\"https:\/\/opstree.com\/blog\/2024\/10\/08\/etl-vs-elt-which-data-integration-approach-is-right-for-you\/\">ETL Data Integration<\/a>]<\/strong><\/p>\n<h3><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\"><span class=\"TextRun SCXW200242495 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW200242495 BCX0\"><span class=\"TextRun SCXW38483271 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW38483271 BCX0\">What this Script Achieves:<\/span><\/span><\/span><\/span><\/span><\/h3>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"5\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Automated PDF Text Extraction:<\/span><\/b><span data-contrast=\"auto\"> PyMuPDF allows efficient, automated text extraction from PDFs, saving time on manual copy-pasting.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:240}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"5\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Versatile and Customizable:<\/span><\/b><span data-contrast=\"auto\"> The script is adaptable for various use cases, from <a href=\"https:\/\/opstree.com\/services\/middleware-database-and-data-engineering\/\"><em><strong>data analysis<\/strong><\/em><\/a> to document processing, and can be easily integrated and customized.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:240,&quot;335559739&quot;:240}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\uf076\" data-font=\"Wingdings\" data-listid=\"5\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf076&quot;,&quot;469777815&quot;:&quot;multilevel&quot;,&quot;469778510&quot;:&quot;bullet&quot;}\" data-aria-posinset=\"3\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Scalable for Large Documents<\/span><\/b><span data-contrast=\"auto\">: The script efficiently handles large PDF files with multiple pages, making it ideal for processing extensive documents without compromising performance.<\/span><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This Python script demonstrates how to extract text from a PDF document using the PyMuPDF (also known as fitz) library. PyMuPDF is a lightweight and efficient library for working with PDF documents, XPS files, and eBooks.<\/p>\n","protected":false},"author":244582693,"featured_media":20118,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[28070474],"tags":[768739465,768739464,768739463],"class_list":["post-20110","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops","tag-automated-pdf-text-extraction","tag-fitz","tag-pymupdf"],"blocksy_meta":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/02\/OCR-extraction-words-from-pdf-or-image.jpg","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pfDBOm-5em","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/20110","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/users\/244582693"}],"replies":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/comments?post=20110"}],"version-history":[{"count":9,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/20110\/revisions"}],"predecessor-version":[{"id":20564,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/20110\/revisions\/20564"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media\/20118"}],"wp:attachment":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media?parent=20110"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/categories?post=20110"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/tags?post=20110"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}