06 February 2009

Perl: Convert PDF Files to Text

The CAM::PDF package makes it easy to extract the text from PDF files. The following script finds every PDF file in a directory, extracts the text of each page, and writes it to a text file named after the PDF.

This script is also available at dba4Life.



use IO::Handle;

use strict;
use warnings;

use CAM::PDF;
use CAM::PDF::PageText;

my $PDFDIR = "./SomeSubDirectory";
my $pdf;

opendir(DDL, $PDFDIR) or die "Error in opening PDF directory $PDFDIR: $!\n";

while((my $filename = readdir(DDL)))
{
# Skip non-PDF files
next if ($filename !~ /\.pdf$/);

$filename = $PDFDIR . '/' . $filename;

# Skip anything that is not a regular file
if(!-f $filename) { print "Could not load $filename\n"; next; }

# Name output file same as the PDF
my $output = $filename;
$output =~ s/\.pdf$/.txt/;

print "Creating $output...\n";
open(TXTFILE, '>', $output) or die "Error in opening $output: $!\n";

# Load the PDF
$pdf = CAM::PDF->new($filename);
if(!$pdf) { print "Could not parse $filename: $CAM::PDF::errstr\n"; close(TXTFILE); next; }

# Total number of pages within the PDF
my $pages = $pdf->numPages;

# Get the text for each page
for(my $x = 1; $x <= $pages; $x++)
{
print TXTFILE text_from_page($x);
}

close(TXTFILE);
}

closedir DDL;

# Render the text of one page of the currently loaded PDF
sub text_from_page
{
my $pg_num = shift;

return CAM::PDF::PageText->render($pdf->getPageContentTree($pg_num));
}
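
One detail worth illustrating separately is the error handling around opendir. Written as `opendir DIR, $dir || die ...`, the `||` applies to `$dir` alone, so die never runs and a missing directory only shows up later as "readdir() attempted on invalid dirhandle" warnings. The following is a minimal sketch of the directory-scanning step using `or die` and a lexical directory handle. It uses core Perl only; the helper names `pdf_files_in` and `txt_name_for` are hypothetical and not part of CAM::PDF, and the case-insensitive match on the extension is an assumption rather than the script's behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: map "report.pdf" to "report.txt".
# The anchored pattern rewrites only a trailing extension.
sub txt_name_for {
    my ($pdf_name) = @_;
    (my $txt_name = $pdf_name) =~ s/\.pdf$/.txt/i;
    return $txt_name;
}

# Hypothetical helper: return the sorted paths of all PDF files in $dir.
sub pdf_files_in {
    my ($dir) = @_;

    # "or" has lower precedence than the opendir call, so die runs
    # whenever the directory cannot be opened.
    opendir(my $dh, $dir) or die "Error in opening PDF directory $dir: $!\n";

    # Keep regular files whose names end in .pdf (any letter case)
    my @pdfs = sort grep { /\.pdf$/i && -f "$dir/$_" } readdir($dh);
    closedir($dh);

    return map { "$dir/$_" } @pdfs;
}
```

A loop such as `for my $path (pdf_files_in('./SomeSubDirectory')) { ... }` could then pass each path to CAM::PDF->new and write the page text to `txt_name_for($path)`.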

2 comments:

Chankey Pathak said...

Got errors:

readdir() attempted on invalid dirhandle DDL at test.plx line **.
closedir() attempted on invalid dirhandle DDL at test.plx line **.

Chankey Pathak said...

Sorry, I made a mistake: I forgot to give the directory path. It is working now. Thank you :)