dba4life: Perl: Convert PDF Files to Text

Using the CAM::PDF package, it is easy to extract the text from PDF files. The following script finds every PDF file in a directory, extracts the text of each page, and writes it to a text file with the same base name next to the PDF.

This script is also available at dba4Life.


use strict;
use warnings;

use CAM::PDF;
use CAM::PDF::PageText;

my $PDFDIR = "./SomeSubDirectory";
my $pdf;

# "or" binds more loosely than "||", so die fires when opendir fails
opendir(DDL, $PDFDIR) or die "Error in opening PDF directory $PDFDIR: $!\n";

while (defined(my $filename = readdir(DDL)))
{
    # Skip non-PDF files
    next if ($filename !~ /\.pdf$/);

    $filename = $PDFDIR . '/' . $filename;

    # Skip anything that is not a regular file
    if (!-f $filename) { print "Could not load $filename\n"; next; }

    # Name output file same as the PDF
    my $output = $filename;
    $output =~ s/\.pdf$/.txt/;

    # Load the PDF before creating the output file, so a bad PDF leaves no empty .txt
    $pdf = CAM::PDF->new($filename);
    if (!$pdf) { print "Could not parse $filename: $CAM::PDF::errstr\n"; next; }

    print "Creating $output...\n";
    open(TXTFILE, '>', $output) or die "Could not open $output: $!\n";

    # Total number of pages within the PDF
    my $pages = $pdf->numPages;

    # Get the text for each page
    for (my $x = 1; $x <= $pages; $x++)
    {
        print TXTFILE text_from_page($x);
    }

    close(TXTFILE);
}

closedir DDL;

# Return the rendered text of one page of the currently loaded PDF
sub text_from_page
{
    my $pg_num = shift;
    return CAM::PDF::PageText->render($pdf->getPageContentTree($pg_num));
}
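
The directory scan and the output naming in the script can also be pulled out into small helpers, which makes them easy to try without any PDF at hand. This is only a sketch: the names output_name and list_pdfs are hypothetical and not part of CAM::PDF, and it assumes PDF files end in a lowercase ".pdf" as in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: map "report.pdf" to "report.txt".
# Only a trailing ".pdf" is replaced, so "my.pdf.notes.pdf"
# becomes "my.pdf.notes.txt".
sub output_name {
    my ($pdf_path) = @_;
    (my $txt_path = $pdf_path) =~ s/\.pdf$/.txt/;
    return $txt_path;
}

# Hypothetical helper: return the sorted paths of the regular files
# ending in ".pdf" inside a directory. Dies with the system error
# message if the directory cannot be opened.
sub list_pdfs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!\n";
    my @pdfs = sort grep { -f $_ } map { "$dir/$_" } grep { /\.pdf$/ } readdir($dh);
    closedir($dh);
    return @pdfs;
}
```

With these in place, a loop such as "for my $file (list_pdfs($PDFDIR)) { ... output_name($file) ... }" could replace the readdir loop. A missing or mistyped directory then stops the script immediately with the reason, rather than surfacing later as readdir warnings.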

2 comments:

Unknown said...: Got errors:

readdir() attempted on invalid dirhandle DDL at test.plx line **.
closedir() attempted on invalid dirhandle DDL at test.plx line **.; December 21, 2011 at 2:01 AM
Unknown said...: Sorry, I made a mistake: I forgot to give the directory path. It is working now. Thank you :); December 21, 2011 at 2:04 AM


06 February 2009
