Extract original images from PDF files

Post by **1024mb** » 03 Feb 2023 01:14

I never post scripts, but this time is the exception because it was so damn painful to get this working. I looked everywhere for an application that can extract the images from PDF files without re-converting them and found none except mutool. The problem is that its extract command doesn't offer an output path option and doesn't name the files by their pages, so I had to do this manually. I started with an overcomplicated XYplorer script and batch file, but found another way with the help of sebras from mutool: thanks to him I realized it's possible to execute JS scripts with mutool.

This tiny script needs 2 applications (the last one is optional) that you have to download from their official sites:

They must be included in your PATH; you can do this by simply copying the executables to "C:\Windows". The exiftool executable must be renamed to "exiftool.exe".

If sounder is not included, the script will silently fail to play a sound when there is an error (see point 5 below).
The sound was downloaded from freesound.org; I don't remember which one it was.

The script also makes use of 2 extra files: the JavaScript script and one batch file that executes mutool and exiftool. All the files are easy to read, or at least I tried.

1. Extracts all the images from all the currently listed PDFs to a subfolder named after the PDF file, in the current path.
2. Number padding is detected automatically based on the number of pages and images per page.
3. Copies the dates from the PDF file to the images, and then the dates from the metadata.
4. Extracts the original JPG files and all the other formats as PNG files, so there is no quality loss.
5. Warns if there is any page in any PDF that has no image.
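
To illustrate the number padding mentioned above, here is a minimal Python sketch (not the actual script from the attachment). It assumes the pad width is simply the number of digits in the largest value; the function names and the `page_image.ext` naming pattern are hypothetical.

```python
def pad_width(count: int) -> int:
    """Number of digits needed to write any value from 1 up to `count`."""
    return len(str(max(count, 1)))


def image_name(page: int, index: int, page_count: int,
               max_images_per_page: int, ext: str) -> str:
    """Build a zero-padded file name such as '003_02.jpg'.

    The 'page_index.ext' pattern is an assumption for illustration only.
    """
    page_part = str(page).zfill(pad_width(page_count))
    image_part = str(index).zfill(pad_width(max_images_per_page))
    return f"{page_part}_{image_part}.{ext}"
```

With 120 pages and at most 15 images on any page, page numbers get 3 digits and image numbers get 2, so the names sort correctly in any file manager.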

Either there must be no directory with the same base name as the PDF file or it must be empty, otherwise exiftool will change the dates of all the files inside.
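
If you drive the extraction from your own code, this precondition can be checked up front. A minimal Python sketch, assuming the output folder is the PDF path with the extension removed (as described above); the function name is hypothetical:

```python
from pathlib import Path


def output_dir_is_safe(pdf_path: str) -> bool:
    """True if the folder named after the PDF is missing or empty."""
    target = Path(pdf_path).with_suffix("")  # "C:/docs/book.pdf" -> "C:/docs/book"
    if not target.exists():
        return True
    # An existing folder is only safe when it contains nothing at all.
    return target.is_dir() and not any(target.iterdir())
```

Skipping any PDF for which this returns `False` avoids having exiftool touch the dates of unrelated files.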

If you want to use the JS script in other ways, it accepts 3 arguments:

1. PDF file
2. Output path
3. Anything; this is just to make the script output an error if there is at least one page that has no image.
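
As a rough illustration of calling the JS script directly, the Python sketch below builds a `mutool run` command line with the three arguments listed above. The helper name and the value `"1"` used for the third argument are assumptions; the script file name has to be whatever the JS file is called on your system.

```python
def build_mutool_command(js_script: str, pdf_file: str, output_dir: str,
                         warn_on_empty_pages: bool = False) -> list[str]:
    """Return the argument list for 'mutool run <script> <pdf> <outdir> [flag]'."""
    cmd = ["mutool", "run", js_script, pdf_file, output_dir]
    if warn_on_empty_pages:
        # Per the post, any value works as the third script argument.
        cmd.append("1")
    return cmd
```

The list can be passed to `subprocess.run(cmd)`; passing a list instead of a single string avoids quoting problems with spaces in paths.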

Updated:

Previously the script didn't successfully extract images other than DCT-encoded ones (JPG). Now it extracts the original JPG files and extracts all the other formats to PNG without any quality loss.

As far as I know, there is no way to extract TIF images as TIF files, WEBP images as WEBP files, and so on. Once they are included in a PDF file they become raw data which mutool has to interpret, and in this case it does so by outputting PNG. Some formats, such as JBIG2, can be extracted as is, but they are useless as there is no viewer.
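
The rule described here can be summarized in a few lines. This Python sketch only restates the behavior from the post (it is not mutool's or the script's actual code); `DCTDecode` is the PDF filter name for JPEG data, and the function name is hypothetical.

```python
def output_extension(pdf_filter: str) -> str:
    """Pick the output format for an embedded image by its PDF stream filter.

    JPEG (DCTDecode) streams are written out unchanged; everything else is
    decoded and saved as lossless PNG, as described in the post.
    """
    return "jpg" if pdf_filter == "DCTDecode" else "png"
```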

Extract_PDF_Images_v0.1.7z: (44.18 KiB)

Extract_PDF_Images.7z: OLD - Don't use. (43.89 KiB)

Post by **highend** » 03 Feb 2023 09:11

If you don't mind...

First part of the script could look like this...
No need for a "complicated" way to get the PDF files, plus the ability to define the destination for the additional required components (I would even include a way to modify the .cmd to be able to use portable versions of mutool / exiftool)...

Code:

    $mutool_js   = "<xyscripts>\_mutool-extract_named.js";
    $extract_cmd = "<xyscripts>\_Extract_images_PDF_JS.cmd";
    end (exists($mutool_js)   != 1), quote($mutool_js)   . " not found, aborted!";
    end (exists($extract_cmd) != 1), quote($extract_cmd) . " not found, aborted!";

    $date      = <date yy-mm-dd_hh_nn_ss>;
    $curpath   = replace(<curpath>, "%", "%%");
    $pdf_files = listpane(, "*.pdf", , <crlf>);
    end (!$pdf_files), "No .pdf file(s) found, aborted!";

    $path_script     = replace($mutool_js,   "%", "%%");
    $path_worker     = replace($extract_cmd, "%", "%%");
    $batch_path      = replace("<xydata>\Temp\extract_pdf_$date.cmd", "%", "%%");
    $batch_path_orig = "<xydata>\Temp\extract_pdf_$date.cmd";
    $batch           = "";
    foreach($path, $pdf_files, <crlf>, "e") {
        $base_name  = gpc($path, "base");
        $base_name  = replace($base_name, "%", "%%");
        $path       = replace($path,      "%", "%%");
        $batch     .= <<<EOD
            CALL "$path_worker" "$path_script" "$path" "$curpath\$base_name"
            IF %ERRORLEVEL% EQU 2 (
                SET "_error=1"
                ECHO "$base_name.pdf" >> "$batch_path.log"
            )
            ECHO:
        EOD;
    }
    $batch = regexreplace($batch, "^[ ]{16}");
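
The repeated `replace(..., "%", "%%")` calls in the snippet above exist because a literal `%` inside a batch file has to be doubled. For readers who generate the batch file from another language, here is a minimal Python sketch of the same idea; both function names are hypothetical.

```python
def escape_for_batch(text: str) -> str:
    """Double every percent sign so cmd.exe treats it as a literal '%'."""
    return text.replace("%", "%%")


def batch_call_line(worker: str, script: str, pdf: str, out_dir: str) -> str:
    """Build one quoted CALL line, mirroring the heredoc in the XYplorer script."""
    parts = [escape_for_batch(p) for p in (worker, script, pdf, out_dir)]
    return "CALL " + " ".join(f'"{p}"' for p in parts)
```

Without the doubling, a file name like `50% off.pdf` would be mangled when the batch file is parsed.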

Btw, you could do the same via poppler => pdfimages.exe (https://github.com/oschwartz10612/poppler-windows)
All through XY scripting (no javascript required)

Post by **1024mb** » 03 Feb 2023 23:51

highend wrote: ↑03 Feb 2023 09:11 If you don't mind...

First part of the script could look like this...
No need for a "complicated" way to get the PDF files, plus the ability to define the destination for the additional required components (I would even include a way to modify the .cmd to be able to use portable versions of mutool / exiftool)...

Thanks! That looks much better. You should have seen how it was before; it was a gigantic mess.

highend wrote: ↑03 Feb 2023 09:11 Btw, you could do the same via poppler => pdfimages.exe (https://github.com/oschwartz10612/poppler-windows)
All through XY scripting (no javascript required)

Ah, yes, I also found that after posting this, it only needs a better way to name the images to have it all automatic.

Post by **highend** » 06 Feb 2023 09:36

I've written a different implementation with pdfimages.exe here: viewtopic.php?t=25780

1024mb wrote: it only needs a better way to name the images to have it all automatic.

The script takes care of this.

pdfimages.exe names them: page-<page number>-<incrementing ID>.<ext>

The cleanup part of the script renames them with:

Code:

    $frontName  = "pg. ";
    $middleName = " - #";

to:
pg. <formatted page number> - #<incrementing ID per page>.<ext>

E.g.:

Code:

pg. 01 - #001.jpg
pg. 02 - #001.png
pg. 04 - #001.jpg
pg. 04 - #002.jpg
pg. 05 - #001.png
pg. 05 - #002.jpg
pg. 06 - #001.jpg
pg. 06 - #002.jpg
pg. 07 - #001.jpg
pg. 07 - #002.jpg
pg. 08 - #001.png
pg. 08 - #002.png
pg. 08 - #003.png
pg. 08 - #004.png
pg. 08 - #005.jpg
pg. 08 - #006.jpg
pg. 08 - #007.jpg
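
For anyone who wants the same renaming outside XYplorer, here is a rough Python sketch of the idea (not the linked script). It assumes input names of the form `page-<page number>-<incrementing ID>.<ext>` and restarts the ID on every page; the function name and the default widths (2 digits for pages, 3 for IDs, taken from the example listing) are assumptions.

```python
import re
from collections import defaultdict

# Matches names such as "page-004-012.jpg".
NAME_PATTERN = re.compile(r"^page-(\d+)-(\d+)\.(\w+)$")


def plan_renames(names, front="pg. ", middle=" - #",
                 page_width=2, id_width=3):
    """Map each matching name to 'pg. <page> - #<per-page ID>.<ext>'."""
    parsed = []
    for name in names:
        match = NAME_PATTERN.match(name)
        if match:
            page, image_id, ext = match.groups()
            parsed.append((int(page), int(image_id), ext, name))
    parsed.sort()  # by page first, then by the original incrementing ID

    per_page = defaultdict(int)  # running image counter for each page
    renames = {}
    for page, _image_id, ext, name in parsed:
        per_page[page] += 1
        new_name = (f"{front}{str(page).zfill(page_width)}"
                    f"{middle}{str(per_page[page]).zfill(id_width)}.{ext}")
        renames[name] = new_name
    return renames
```

The function only plans the renames and returns a dictionary, so the result can be reviewed before calling `os.rename` on each pair.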

Post by **klownboy** » 06 Feb 2023 15:23

Thanks highend and 1024mb for these scripts. I figured there must be a way to do it, but never looked into it.

Post by **FunkyFinn** » 08 Jun 2023 18:49

Does this tool, i.e. "Extract_PDF_Images_v0.1", rip the PDF into raw format, or does it just read the metadata in the original? How does one use it on a single image within a PDF file?
Kind of a newbie at this.
