Extract original images from PDF files

1024mb
Posts: 205
Joined: 14 Dec 2018 23:26

Extract original images from PDF files

Post by 1024mb »

I never post scripts, but this time is the exception because it was so damn painful to get this working. I've looked everywhere for an application that can extract the images from PDF files without re-converting them and found none except mutool. The problem is that its extract command doesn't offer an output path option and doesn't name the files by their pages, so I had to do this manually. I started with an overcomplicated XYplorer script and batch file, but found another way with the help of sebras from mutool: thanks to him I realized it's possible to execute JS scripts with mutool.

This tiny script needs 2 applications, mutool and exiftool, plus an optional third one, sounder. You have to download them from their official sites.
They must be on your PATH; the simplest way is to copy the executables to "C:\Windows". The exiftool executable must be renamed to "exiftool.exe".

If sounder is not installed, the script will simply not play a sound when there is an error (see point 5 below).
The sound was downloaded from freesound.org; I don't remember which one it was.

The script also uses 2 extra files: the JavaScript script and a batch file that executes mutool and exiftool. All the files are easy to read, or at least I tried.
  • Extracts all the images from all the currently listed PDFs to a subfolder named after the PDF file, in the current path.
  • Number padding is automatically detected based on the number of pages and images per page.
  • Copies the dates from the PDF file to the images, and then the dates from the metadata.
  • Extracts the original JPG files and all the other formats as PNG files, so there is no quality loss.
  • Warns if any page in any PDF has no image.
A directory with the same base name as the PDF file must either not exist or be empty; otherwise exiftool will change the dates of all the files inside.
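
For readers wondering how the padding detection could work, here is a minimal sketch in Python. It is not the code from the attached script: the function names and the `page_index.ext` naming pattern are made up for illustration. It only shows the idea of deriving the zero-padding width from the largest number that has to be printed.

```python
def pad_width(count: int, minimum: int = 1) -> int:
    """Return how many digits are needed to print every number from 1 to count."""
    return max(minimum, len(str(max(count, 1))))


def image_name(page: int, index: int, page_count: int,
               max_images_per_page: int, ext: str) -> str:
    """Build a zero-padded file name (hypothetical pattern: <page>_<index>.<ext>)."""
    page_part = str(page).zfill(pad_width(page_count))
    index_part = str(index).zfill(pad_width(max_images_per_page))
    return f"{page_part}_{index_part}.{ext}"


if __name__ == "__main__":
    # A 120-page PDF whose busiest page holds 12 images
    print(image_name(7, 2, page_count=120, max_images_per_page=12, ext="jpg"))
```

With padding derived this way, the extracted files sort correctly by name in any file manager, whatever the size of the PDF.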

If you want to use the JS script in other ways, it accepts 3 arguments.
  1. PDF file
  2. Output path
  3. Anything; this is only there to make the script output an error if at least one page has no image.
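
As a rough illustration of calling the JS script outside XYplorer, the sketch below builds a `mutool run` command line from the three arguments listed above. The helper name `build_command` and the example paths are assumptions, and the value passed as the third argument is arbitrary, since the post says anything works.

```python
import subprocess
from typing import List


def build_command(script: str, pdf: str, out_dir: str,
                  require_images: bool = False) -> List[str]:
    """Assemble: mutool run <script> <pdf> <output path> [anything]."""
    cmd = ["mutool", "run", script, pdf, out_dir]
    if require_images:
        # Any value will do; its presence asks the script to report
        # an error when at least one page has no image.
        cmd.append("1")
    return cmd


def extract(script: str, pdf: str, out_dir: str,
            require_images: bool = False) -> int:
    """Run mutool (must be on PATH) and return its exit code."""
    return subprocess.run(build_command(script, pdf, out_dir, require_images)).returncode


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the argument order
    print(build_command("extract.js", "C:\\docs\\book.pdf", "C:\\docs\\book", True))
```

Passing the arguments as a list avoids quoting problems with spaces or "%" in paths, which the batch-file route has to escape by hand.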
Updated:

The script didn't successfully extract images other than DCT-encoded ones (JPG). Now it extracts the original JPG files and extracts all the other formats to PNG without any quality loss.

As far as I know there is no way to extract TIF images as TIF files, WEBP images as WEBP files, and so on. Once they are included in a PDF file they become raw data which mutool has to interpret, and in this case it does so by outputting PNG. Some formats, such as JBIG2, can be extracted as is, but they are useless as there is no viewer.
Extract_PDF_Images_v0.1.7z
(44.18 KiB) Downloaded 47 times

Extract_PDF_Images.7z
OLD - Don't use.
(43.89 KiB) Downloaded 45 times
Last edited by 1024mb on 07 Mar 2023 00:05, edited 1 time in total.

highend
Posts: 13033
Joined: 06 Feb 2011 00:33

Re: Extract original images from PDF files

Post by highend »

If you don't mind...

First part of the script could look like this...
No need for a "complicated" way to get the pdf files, plus the ability to define the location of the additional required components (I would even include a way to modify the .cmd to be able to use portable versions of mutool / exiftool)...

Code: Select all

    // Helper files the script depends on; abort early if either one is missing
    $mutool_js   = "<xyscripts>\_mutool-extract_named.js";
    $extract_cmd = "<xyscripts>\_Extract_images_PDF_JS.cmd";
    end (exists($mutool_js)   != 1), quote($mutool_js)   . " not found, aborted!";
    end (exists($extract_cmd) != 1), quote($extract_cmd) . " not found, aborted!";

    // Timestamp for the temporary batch file name, and the .pdf files in the current list
    $date      = <date yy-mm-dd_hh_nn_ss>;
    $curpath   = replace(<curpath>, "%", "%%");
    $pdf_files = listpane(, "*.pdf", , <crlf>);
    end (!$pdf_files), "No .pdf file(s) found, aborted!";

    // Double every "%" so the paths survive being written into a batch file
    $path_script     = replace($mutool_js,   "%", "%%");
    $path_worker     = replace($extract_cmd, "%", "%%");
    $batch_path      = replace("<xydata>\Temp\extract_pdf_$date.cmd", "%", "%%");
    $batch_path_orig = "<xydata>\Temp\extract_pdf_$date.cmd";
    $batch           = "";
    // Append one CALL block per PDF; an ERRORLEVEL of 2 flags the PDF and logs its name
    foreach($path, $pdf_files, <crlf>, "e") {
        $base_name  = gpc($path, "base");
        $base_name  = replace($base_name, "%", "%%");
        $path       = replace($path,      "%", "%%");
        $batch     .= <<<EOD
            CALL "$path_worker" "$path_script" "$path" "$curpath\$base_name"
            IF %ERRORLEVEL% EQU 2 (
                SET "_error=1"
                ECHO "$base_name.pdf" >> "$batch_path.log"
            )
            ECHO:
        EOD;
    }
    // Strip 16 leading spaces of indentation from the heredoc lines
    $batch = regexreplace($batch, "^[ ]{16}");
Btw, you could do the same via poppler => pdfimages.exe (https://github.com/oschwartz10612/poppler-windows)
All through XY scripting (no javascript required)
One of my scripts helped you out? Please donate via Paypal or paypal_donate (at) stdmail (dot) de

1024mb
Posts: 205
Joined: 14 Dec 2018 23:26

Re: Extract original images from PDF files

Post by 1024mb »

highend wrote: 03 Feb 2023 09:11 If you don't mind...

First part of the script could look like this...
No need for a "complicated" way to get the pdf files, plus the ability to define the location of the additional required components (I would even include a way to modify the .cmd to be able to use portable versions of mutool / exiftool)...
Thanks! That looks much better. You should have seen how it was before, it was a gigantic mess :lol:
highend wrote: 03 Feb 2023 09:11 Btw, you could do the same via poppler => pdfimages.exe (https://github.com/oschwartz10612/poppler-windows)
All through XY scripting (no javascript required)
Ah, yes, I also found that after posting this, it only needs a better way to name the images to have it all automatic.

highend
Posts: 13033
Joined: 06 Feb 2011 00:33

Re: Extract original images from PDF files

Post by highend »

I've written a different implementation with pdfimages.exe here: viewtopic.php?t=25780
1024mb wrote: it only needs a better way to name the images to have it all automatic.
The script takes care of this.

pdfimages.exe names them: page-<page number>-<incrementing ID>.<ext>

The cleanup part of the script renames them with:

Code: Select all

    $frontName  = "pg. ";
    $middleName = " - #";
to:
pg. <formatted page number> - #<incrementing ID per page>.<ext>

E.g.:

Code: Select all

pg. 01 - #001.jpg
pg. 02 - #001.png
pg. 04 - #001.jpg
pg. 04 - #002.jpg
pg. 05 - #001.png
pg. 05 - #002.jpg
pg. 06 - #001.jpg
pg. 06 - #002.jpg
pg. 07 - #001.jpg
pg. 07 - #002.jpg
pg. 08 - #001.png
pg. 08 - #002.png
pg. 08 - #003.png
pg. 08 - #004.png
pg. 08 - #005.jpg
pg. 08 - #006.jpg
pg. 08 - #007.jpg
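
To make the renaming step concrete outside XYplorer, here is a small Python sketch of the same idea. It is not the linked script: the function name `plan_renames` and the fixed padding widths are assumptions. It parses names of the form page-<page number>-<incrementing ID>.<ext>, restarts the counter on every page, and returns the old-to-new mapping without touching any file.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches e.g. "page-004-002.jpg" -> page 4, running ID 2, extension "jpg"
NAME_RE = re.compile(r"^page-(\d+)-(\d+)\.(\w+)$")


def plan_renames(names: Iterable[str], front: str = "pg. ", middle: str = " - #",
                 page_width: int = 2, id_width: int = 3) -> Dict[str, str]:
    """Map each pdfimages-style name to 'pg. <page> - #<ID per page>.<ext>'."""
    parsed = []
    for name in names:
        match = NAME_RE.match(name)
        if match:  # silently skip files that do not follow the pattern
            parsed.append((int(match.group(1)), int(match.group(2)), match.group(3), name))
    parsed.sort()  # order by page number, then by the original running ID

    per_page = defaultdict(int)
    plan = {}
    for page, _running_id, ext, name in parsed:
        per_page[page] += 1  # counter restarts at 1 for every page
        plan[name] = f"{front}{page:0{page_width}d}{middle}{per_page[page]:0{id_width}d}.{ext}"
    return plan


if __name__ == "__main__":
    files = ["page-001-000.jpg", "page-002-001.png", "page-004-002.jpg", "page-004-003.jpg"]
    for old, new in plan_renames(files).items():
        print(old, "->", new)
```

Returning a plan instead of renaming right away makes it easy to check the result before applying it, for example with `os.rename` in a loop.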

klownboy
Posts: 4052
Joined: 28 Feb 2012 19:27

Re: Extract original images from PDF files

Post by klownboy »

Thanks highend and 1024mb for these scripts. I figured there must be a way to do it, but never looked into it. :tup:
Windows 11, 22H2 Build 22621.1555 at 100% 2560x1440

FunkyFinn
Posts: 1
Joined: 08 Jun 2023 18:30

Re: Extract original images from PDF files

Post by FunkyFinn »

Does this tool, i.e. "Extract_PDF_Images_v0.1", rip the PDF into raw format, or does it just read the metadata in the original? How does one use it on a single image within a PDF file?
Kind of a newbie at this.
