Rename a file according to its PDF title - Perl

I am trying to write a Perl script that renames files, to reduce manual effort. At the moment I open each PDF file by hand, copy the title, and rename the file to match the title.
I wrote the code below to rename each PDF according to its title. For example, if the title is SPE-180024-MS, the PDF should be renamed to that.
According to my logic it should rename the file, but the output is not what I expect.
#!/usr/bin/perl
use strict;
#use warnings;
use Cwd;
use File::Basename;
#use File::Copy;
use File::Find;
use PDF::API2;
use CAM::PDF;
my $path1 = getcwd;
open( F6, ">Ref.txt" );
opendir( DIR, $path1 ) or die $!;
my @dots = grep /(.*?)\-(MS)$/, readdir(DIR);
closedir(DIR);
my @file;
my @files;
my $check;
my $err_1;
my $err_2;
my $err_3;
foreach my $file (@dots) {
#print F6 $file."\n";
opendir DIR1, $file or die "Can't open $file: $!";
my @files = sort grep { -f "$file/$_" } readdir DIR1;
my $data1 = join( ",", <@files> );
closedir DIR1;
#print F6 @files."\n";
my $a = @files;
if ($data1 =~ m#(((\w+)\-(\d+)\-MS)\.(pdf))#
#&& $data1=~m#((\w+)\-(\d+)\-MS\.(xml))#) #((.*?)\.xml)#
) {
my $check = $2;
#print F6 $1."\n";
if ( $data1 =~ m#(((\w+)\-(\d+)\-MS)\.(xml))# ) {
my $check1 = $2;
my $first = $1;
if ( $check eq $file || $check1 eq $file ) {
}
else {
#print F6 $file."\tDIFFERENT FILE PRESENT\n";
}
}
}
foreach my $f1 ( glob("$file/*.xml") ) {
#print F6 $f1."\n";
open( FH, '<', $f1 ) or die "Cannot open file: $f1";
my $data2 = join( "", <FH> );
#print F6 $data2."\n";
close FH;
if ( $data2 =~ m#(<page-count count="(\d+)"/>)# ) {
my $page = $2;
#print F6 $f1."\t".$1."\n";
if ( $f1 =~ m#(.*?)-MS/((.*?)-MS)#s
#SPE-173391-MS/SPE-173393-MS #(.*?)\.(.*?)$/s)
) {
my $f11 = $2;
#print F6 $f11."\n";
if ( $file eq $f11 ) {
}
else {
$err_1
= $err_1
. $file . "\t"
. $f11
. "\tDIFFERENT XML FILE PRESENT\n";
#print F6 $file."\t".$f11."\tDIFFERENT XML FILE PRESENT\n";
#print F6 $file."\tDIFFERENT XML FILE PRESENT\n";
}
foreach my $f2 ( glob("$file/*.pdf") ) {
open( F2, "<$f2" ) or die "Cannot open file: $f2";
my $data = join( "", <F2> );
close F2;
my $xml_list = $data;
my $pdf = PDF::API2->open($f2);
my $pages = $pdf->pages;
#print F6 $f2."\t".$pages."\n";
if ($f2 =~ m#(.*?)-MS/((.*?)-MS)#
#/(.*?)\.(.*?)$/s
) {
my $f21 = $2;
if ( $file eq $f21 ) {
}
else {
$err_2
= $err_2
. $file . "\t"
. $f21
. "\tDIFFERENT PDF FILE PRESENT\n";
#print F6 $file."\t".$f21."\tDIFFERENT PDF FILE PRESENT\n";
}
while ( $f11 =~ m/$f21/gs ) {
if ( $page !~ m#$pages#s ) {
$err_3
= $err_3
. $f1 . "\t"
. $page . "\t"
. $f2 . "\t"
. $pages . "\n";
#print F6 $f1."\t".$page."\t".$f2."\t".$pages."\n";
$data2 =~ s#<page-count count="$page"\/>#<page-count count="$pages"\/>#gs;
open( FH, '>', $f1 ) or die "Cannot open file: $f1";
print FH $data2 . "\n";
close FH;
}
}
}
}
}
}
}
}
close F6;
This is the document. The marked heading is what I want.

You cannot just open a PDF file and operate on it as if it were text. A PDF is a structured format, so it has to be parsed.
You can use CAM::PDF. It can extract the content of a page, which you can then analyse to get the title.
The links provided above cover enough to get the job done. I am reproducing the relevant parts here:
use CAM::PDF;
my $pdf = CAM::PDF->new('test1.pdf');
my $pageNum = 1;
# getPageContent returns the page's raw content stream;
# CAM::PDF also provides getPageText, which returns plain text
my $page1 = $pdf->getPageContent($pageNum);
The variable $page1 will hold the contents of the page specified by $pageNum. The rest is a matter of extracting the required information.
If you need to convert the entire PDF to text, you can use getpdftext.pl, which is part of CAM::PDF; however, that is inefficient compared to reading a single page.

PDFs usually carry a bunch of metadata, including the document title. If you're lucky, you will find the desired PDF title in there. A Perl example using PDF::API2 and its info method:
use autodie;
use Modern::Perl;
use PDF::API2;
my $file = '/your/sample/file.pdf';
my $pdf = PDF::API2->open( $file );
my %pdf_info = $pdf->info;
my $title = $pdf_info{Title};
my $renamed_dir = '/some/where/else/';
if ( $title ) {
    my $new_name = $renamed_dir . $title;
    if ( -f $new_name ) {
        warn "File $new_name already exists, move it out of the way!";
    } else {
        $pdf->saveas( $new_name );
    }
} else {
    warn "No title found in document info.";
}
If you need to use some part of the text, then you should convert it to text first. Since you failed to mention any OS restrictions you get a Debian/Ubuntu solution for that. First, install the package poppler-utils. Then use the freshly installed tool pdftotext to extract all the text from the PDF. It might be a good idea to use pdftotext -layout. From the resulting text you will have to grep/parse the line with your "title", and then use that to rename (or much safer: copy) the PDF.
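A minimal sketch of that last step, assuming the title follows the pattern from the question (letters, digits, then -MS, as in SPE-180024-MS) and that pdftotext is on the PATH. The helper names extract_title and pdf_title are made up for illustration:

```perl
use strict;
use warnings;

# Return the first token that looks like SPE-180024-MS, or undef if none is found.
# The pattern is an assumption based on the example title in the question.
sub extract_title {
    my ($text) = @_;
    return $text =~ /\b([A-Z]+-\d+-MS)\b/ ? $1 : undef;
}

# Run pdftotext (from poppler-utils) on one file and search its text output.
# Passing "-" as the output file makes pdftotext write to standard output.
sub pdf_title {
    my ($pdf_file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf_file, '-'
        or die "Can't run pdftotext on '$pdf_file': $!";
    my $text = do { local $/; <$fh> };   # slurp all of the extracted text
    close $fh;
    return extract_title($text // '');
}
```

Once pdf_title($file) returns a value, copying is safer than renaming, for example with copy($file, "$title.pdf") from File::Copy.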

Related

Nested if statements: Swapping headers and sequences in fasta files

I am opening a directory and processing each file. A sample file looks like this when opened:
>AAAAA
TTTTTTTTTTTAAAAATTTTTTTTTT
>BBBBB
TTTTTTTTTTTTTTTTTTBBBBBTTT
>CCCCC
TTTTTTTTTTTTTTTTCCCCCTTTTT
For the above sample file, I am trying to make them look like this:
>TAAAAAT
AAAAA
>TBBBBBT
BBBBB
>TCCCCCT
CCCCC
I need to find the "header" in next line sequence, take flanks on either side of the match, and then flip them. I want to print each file's worth of contents to another separate file.
Here is my code so far. It runs without errors, but doesn't generate any output. My guess is this is probably related to the nested if statements. I have never worked with those before.
#!/usr/bin/perl
use strict;
use warnings;
my ($directory) = @ARGV;
my $dir = "$directory";
my @ArrayofFiles = glob "$dir/*";
my $count = 0;
open(OUT, ">", "/path/to/output_$count.txt") or die $!;
foreach my $file(@ArrayofFiles){
open(my $fastas, $file) or die $!;
while (my $line = <$fastas>){
$count++;
if ($line =~ m/(^>)([a-z]{5})/i){
my $header = $2;
if ($line !~ /^>/){
my $sequence .= $line;
if ($sequence =~ m/(([a-z]{1})($header)([a-z]{1}))/i){
my $matchplusflanks = $1;
print OUT ">", $matchplusflanks, "\n", $header, "\n";
}
}
}
}
}
How can I fix this code? Thanks.
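Try this:
foreach my $file (@ArrayofFiles) {
    open my $fh, '<', $file or die "Error opening $file: $!\n";
    while (my $head = <$fh>) {
        chomp $head;
        $head =~ s/^>//;
        my $next_line = <$fh>;
        last unless defined $next_line;
        # capture the header plus one flanking character on each side
        my ($extract) = $next_line =~ /(.\Q$head\E.)/;
        print ">$extract\n$head\n" if defined $extract;
    }
    close $fh;
}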
There are several mistakes in your code but the main problem is:
if ($line =~ m/(^>)([a-z]{5})/i) {
my $header = $2;
if ($line !~ /^>/) {
# here you write to the output file
Because the same line can't start and not start with > at the same time, your output files are never written. The second if statement always fails and its block is never executed.
open(OUT, ">", "/path/to/output_$count.txt") or die $!; and $count++ are misplaced. Since you want to produce an output file (with a new name) for each input file, you need to put them in the foreach block, not outside or in the while loop.
Example:
#!/usr/bin/perl
use strict;
use warnings;
my ($dir) = @ARGV;
my @files = glob "$dir/*";
my $count;
my $format = ">%s\n%s\n";
foreach my $file (@files) {
    open my $fhi, '<', $file
        or die "Can't open file '$file': $!";
    $count++;
    my $output_path = "/path/to/output_$count.txt";
    open my $fho, '>', $output_path
        or die "Can't open file '$output_path': $!";
    my ($header, $seq);
    while (<$fhi>) {
        chomp;
        if (/^>([a-z]{5})/i) {
            # a new header: flush the previous record first
            if ($seq) { printf $fho $format, $seq =~ /([a-z]\Q$header\E[a-z])/i, $header; }
            ($header, $seq) = ($1, '');
        } else { $seq .= $_; }
    }
    # flush the last record
    if ($seq) { printf $fho $format, $seq =~ /([a-z]\Q$header\E[a-z])/i, $header; }
    # the handles are lexical to this block, so close them before it ends
    close $fhi;
    close $fho;
}

Output .Resx From .CS using perl script

My .cs file contains strings within double quotes, and I am trying to extract these strings into a .resx file.
The existing code outputs the .resx, but with only one string, whereas the .cs file contains more than one quoted string.
Can you please provide any reference on how to achieve this?
use strict;
use warnings;
use File::Find;
use XML::Writer;
use Cwd;
#user input: [Directory]
my $wrkdir = getcwd;
system "attrib -r /s";
print "Processing $wrkdir\n";
find( \&recurse_src_path, $wrkdir );
sub recurse_src_path
{
my $file = $File::Find::name;
my $fname = $_;
my @lines;
my $line;
if ( ( -f $file ) && ( $file =~ /.*\.cs$/i ) )
{
print "..";
open( FILE, $file ) || die "Cannot open $file:\n$!";
while ( $line = <FILE> )
{
if ( $line =~ s/\"(.*?)\"/$1/m )
{
chomp $line;
push( @lines, $line );
my $nl = '0';
my $dataIndent;
my $output = new IO::File(">Test.resx");
#binmode( $output, ":encoding(utf-8)" );
my $writer = XML::Writer->new(
OUTPUT => $output,
DATA_MODE => 1,
DATA_INDENT => 2
);
$writer->xmlDecl("utf-8");
$writer->startTag('root');
foreach my $r ($line)
{
print "$1\n";
$writer->startTag( 'data', name => $_ );
$writer->startTag('value');
$writer->characters($1);
$writer->endTag('value');
$writer->startTag('comment');
$writer->characters($1);
$writer->endTag('comment');
$writer->endTag('data');
}
$writer->endTag('root');
$writer->end;
$output->close();
}
}
close FILE;
}
}
Use the /g regex modifier. For example:
use strict;
use warnings;
my $cs_string = '
// Imagine this is .cs code here
system "attrib -r /s";
print "Processing $wrkdir\n";
find( \&recurse_src_path, $wrkdir );
';
# the non-greedy .*? stops at the next quote, so two strings on one line stay separate
while ($cs_string =~ /"(.*?)"/g) {
    print "Found quoted string: '$1'\n";
}
See also: http://perldoc.perl.org/perlrequick.html#Matching-repetitions
You might also want to look at File::Slurp to read your .cs code into a single Perl scalar, provided your .cs file is not too large.
Finally, combine this with your existing code to get the .resx output format.
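One way to combine the two, as a sketch: collect every quoted string from the file first, and only then open the .resx once and write one data element per string. The code in the question reopens Test.resx inside the loop, so each match overwrites the previous one. The helper name quoted_strings is made up, and the pattern assumes ordinary double-quoted literals with backslash escapes, not C# verbatim strings:

```perl
use strict;
use warnings;

# Return every double-quoted string literal found in $source, in order.
# Escaped characters such as \" are kept exactly as written.
sub quoted_strings {
    my ($source) = @_;
    my @found;
    while ($source =~ /"((?:[^"\\]|\\.)*)"/g) {
        push @found, $1;
    }
    return @found;
}

# Usage sketch, where $cs_source holds the text of one .cs file:
# print "$_\n" for quoted_strings($cs_source);
```

The returned list can then be fed to the existing XML::Writer loop, with the startTag('root') and endTag('root') calls moved outside it.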

Failed to open GLOB error

I've noticed that when you drag & drop a file into the OS X Terminal and any part of the path contains a space (for example in a folder name), the space is escaped with a backslash (\ ).
This then leads to an error when opening files in my script:
use strict;
use warnings;
use File::Basename;
my $in;
my $filename = $ARGV[0];
unless ($filename){
print "\n\nPlease drag and drop a FASTA/FA or plain-text file containing your sequence into the prompt window and hit ENTER. Alternatively, manually specify the file-pathway:\n";
$filename = <STDIN>;
chomp $filename;
}
open($in, $filename) or die "Failed to open $in: $!";
my $DNA = read_fasta($in);
my $len = length($DNA);
print "\nFASTA/Sequence Length is: $len bp \n";
print "\nPlease enter restriction sites (degeneracy characters permitted) seperated by a single space: ";
my $sites=<STDIN>;
chomp $sites;
my @pats = split ' ', $sites;
for (@pats) {
s/K/[GT]/g;
s/M/[AC]/g;
s/Y/[CT]/g;
s/S/[CG]/g;
s/W/[AT]/g;
s/B/[CGT]/g;
s/V/[ACG]/g;
s/H/[ACT]/g;
s/D/[AGT]/g;
s/X/[AGCT]/g;
s/R/[AG]/g;
s/N/[AGCT]/g;
}
for (@pats) {
my $m = () = $DNA =~ /$_/gi;
print "\nTotal DNA matches to $_ are: $m \n";
}
my $DIR = dirname($filename);
my $name = basename($filename);
(my $extrem = $name) =~ s/\.[^.]+$//;
open my $out, ">$DIR/$extrem $sites.txt";
my $pat=join("|",@pats);
my @cutarr = split(/$pat/, $DNA);
for (@cutarr) {
my $len = length($_);
print $out "$len \n";
}
print "\nYour results are located at: $DIR/$extrem $sites.txt\n\n";
close($out);
close($in);
#Subfunction - Reading formatted FASTA/FA files
sub read_fasta {
my ($fh) = @_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}
It will open files if the pathway contains no spaces. Is there a better way I can go about opening the file to avoid this occurring?
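Try removing the backslashes from your file name. Reporting $filename rather than the handle in the error message also gives you a readable path instead of GLOB(...):
$filename =~ tr|\\||d;
open(my $in, '<', $filename) or die "Failed to open $filename: $!";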
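A slightly more defensive variant, as a sketch: the shell-style escaping may cover other characters besides spaces, and a dropped path may arrive with trailing whitespace, so it can help to undo every backslash escape and trim the ends. The helper name clean_dropped_path is made up for illustration:

```perl
use strict;
use warnings;

# Undo shell-style backslash escapes (e.g. "My\ Folder" -> "My Folder")
# and strip surrounding whitespace from a pasted or dropped path.
sub clean_dropped_path {
    my ($path) = @_;
    $path =~ s/^\s+//;            # leading whitespace
    $path =~ s/(?<!\\)\s+$//;     # trailing whitespace that is not itself escaped
    $path =~ s/\\(.)/$1/g;        # "\x" -> "x" for any escaped character
    return $path;
}
```

The cleaned value can then be passed to a three-argument open, which treats the name literally.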

perl + read multiple csv files + manipulate files + provide output_files

Apologies if this is a bit long-winded, but I would really appreciate an answer here as I am having difficulty getting this to work.
Building on this question, I have a script that works on a CSV file (orig.csv) and produces the CSV file I want (format.csv). What I want is to make it more generic, so that it accepts any number of '.csv' files and produces an output CSV for each input file. Can anyone help?
#!/usr/bin/perl
use strict;
use warnings;
open my $orig_fh, '<', 'orig.csv' or die $!;
open my $format_fh, '>', 'format.csv' or die $!;
print $format_fh scalar <$orig_fh>; # Copy header line
my %data;
my @labels;
while (<$orig_fh>) {
chomp;
my @fields = split /,/, $_, -1;
my ($label, $max_val) = @fields[1,12];
if ( exists $data{$label} ) {
my $prev_max_val = $data{$label}[12] || 0;
$data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
}
else {
$data{$label} = \@fields;
push @labels, $label;
}
}
for my $label (@labels) {
print $format_fh join(',', @{ $data{$label} }), "\n";
}
I was hoping to use the script below, but I am having great difficulty putting the two together:
#!/usr/bin/perl
use strict;
use warnings;
#If you want to open a new output file for every input file
#Do it in your loop, not here.
#my $outfile = "KAC.pdb";
#open( my $fh, '>>', $outfile );
opendir( DIR, "/data/tmp" ) or die "$!";
my @files = readdir(DIR);
closedir DIR;
foreach my $file (@files) {
open( FH, "/data/tmp/$file" ) or die "$!";
my $outfile = "output_$file"; #Add a prefix (anything, doesn't have to say 'output')
open(my $fh, '>', $outfile);
while (<FH>) {
my ($line) = $_;
chomp($line);
if ( $line =~ m/KAC 50/ ) {
print $fh $_;
}
}
close($fh);
}
The script reads all the files in the directory, finds the lines containing the string 'KAC 50', and writes those lines to an output_$file for that input file, so there will be one output_$file for every input file that is read.
Issues with this script that I have noted and was looking to fix:
- it reads the '.' and '..' entries in the directory and produces 'output_.' and 'output_..' files
- it does the same with the script file itself.
I was also trying to make it dynamic by getting this script to work in any directory it is run in by adding this code:
use Cwd qw();
my $path = Cwd::cwd();
print "$path\n";
and
opendir( DIR, $path ) or die "$!"; # open the current directory
open( FH, "$path/$file" ) or die "$!"; #open the file
EDIT: I have tried combining the two versions but am getting errors. Advice greatly appreciated.
UserName@wabcl13 ~/Perl
$ perl formatfile_QforStackOverflow.pl
Parentheses missing around "my" list at formatfile_QforStackOverflow.pl line 13.
source dir -> /home/UserName/Perl
Can't use string ("/home/UserName/Perl/format_or"...) as a symbol ref while "strict refs" in use at formatfile_QforStackOverflow.pl line 28.
combined code::
use strict;
use warnings;
use autodie; # this is used for the multiple files part...
#START::Getting current working directory
use Cwd qw();
my $source_dir = Cwd::cwd();
#END::Getting current working directory
print "source dir -> $source_dir\n";
my $output_prefix = 'format_';
opendir my $dh, $source_dir; #Changing this to work on current directory; changing back
for my $file (readdir($dh)) {
next if $file !~ /\.csv$/;
next if $file =~ /^\Q$output_prefix\E/;
my $orig_file = "$source_dir/$file";
my $format_file = "$source_dir/$output_prefix$file";
# .... old processing code here ...
## Start:: This part works on one file edited for this script ##
#open my $orig_fh, '<', 'orig.csv' or die $!; #line 14 and 15 above already do this!!
#open my $format_fh, '>', 'format.csv' or die $!;
#print $format_fh scalar <$orig_fh>; # Copy header line #orig needs changeing
print $format_file scalar <$orig_file>; # Copy header line
my %data;
my @labels;
#while (<$orig_fh>) { #orig needs changing
while (<$orig_file>) {
chomp;
my @fields = split /,/, $_, -1;
my ($label, $max_val) = @fields[1,12];
if ( exists $data{$label} ) {
my $prev_max_val = $data{$label}[12] || 0;
$data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
}
else {
$data{$label} = \@fields;
push @labels, $label;
}
}
for my $label (@labels) {
#print $format_fh join(',', @{ $data{$label} }), "\n"; #orig needs changing
print $format_file join(',', @{ $data{$label} }), "\n";
}
## END:: This part works on one file edited for this script ##
}
How do you plan to supply the list of files to process and their output destinations? One option is a fixed directory in which every CSV file is processed, with a prefix added to each result.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $source_dir = '/some/dir/with/cvs/files';
my $output_prefix = 'format_';
opendir my $dh, $source_dir;
for my $file (readdir($dh)) {
    next if $file !~ /\.csv$/;               # skips '.', '..' and non-CSV files
    next if $file =~ /^\Q$output_prefix\E/;  # skips output from an earlier run
    my $orig_file = "$source_dir/$file";
    my $format_file = "$source_dir/$output_prefix$file";
    # ... old processing code here, after opening $orig_file for reading
    # and $format_file for writing ...
}
closedir $dh;
Alternatively, you could just have an output directory instead of prefixing the files. Either way, this should get you on your way.
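The errors in the combined code come from using the path strings $orig_file and $format_file where filehandles are expected; the files still have to be opened. A sketch of how the two pieces could fit together, assuming the same column layout as the question (label in column 1, value in column 12, counting from 0). The sub names format_csv and format_all_csv are made up for illustration:

```perl
use strict;
use warnings;

# For each label, keep the row with the largest value in column 12,
# preserving the order in which labels were first seen.
sub format_csv {
    my ($orig_file, $format_file) = @_;
    open my $orig_fh,   '<', $orig_file   or die "Can't read '$orig_file': $!";
    open my $format_fh, '>', $format_file or die "Can't write '$format_file': $!";

    my $header = <$orig_fh>;                      # copy the header line
    print {$format_fh} $header if defined $header;

    my (%data, @labels);
    while (my $line = <$orig_fh>) {
        chomp $line;
        my @fields = split /,/, $line, -1;
        my ($label, $max_val) = @fields[1, 12];
        if (exists $data{$label}) {
            my $prev_max_val = $data{$label}[12] || 0;
            $data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
        }
        else {
            $data{$label} = \@fields;
            push @labels, $label;
        }
    }
    print {$format_fh} join(',', @{ $data{$_} }), "\n" for @labels;

    close $orig_fh;
    close $format_fh or die "Can't close '$format_file': $!";
    return scalar @labels;
}

# Process every .csv file in $source_dir, skipping earlier output files.
# Returns the number of files processed.
sub format_all_csv {
    my ($source_dir, $output_prefix) = @_;
    opendir my $dh, $source_dir or die "Can't open directory '$source_dir': $!";
    my @csv_files = grep { /\.csv$/ && !/^\Q$output_prefix\E/ } readdir $dh;
    closedir $dh;
    format_csv("$source_dir/$_", "$source_dir/$output_prefix$_") for @csv_files;
    return scalar @csv_files;
}
```

To run it on the current directory, as the question attempts, call format_all_csv(Cwd::cwd(), 'format_') after loading Cwd.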

Open a file, search for a string and fill a string in another file

I have 2 text files. I'm writing a Perl script in which I need to find the string "Unable to parse" in one text file and extract the whole line containing it. I then extract the part of that line after a "/" and store it in a variable.
Then I need to open the other text file, find the stored string in it, and replace it.
my $ldir = "/Android";
$RESULTS_FILE = $ldir.'/'.'results.html';
open OUT, ">>", $RESULTS_FILE;
open(IN,"<logcat.txt");
while(<IN>)
{
chomp;
if( $_ =~ m/Unable to parse/ )
{
my @string = split('/',$_);
print @string;
my $stream_name = $string[4];
while $srch(<OUT>)
{
chomp;
if( $srch =~ m/$stream_name/ )
{
// How to replace the line here?
}
}
}
}
Please help.
Regards,
Ramki
You need to read the file, replace all occurrences of your target string with something, then save that file back out again:
#!/usr/bin/perl
use strict;
sub replace_string_in_file {
    my ( $source, $target, $filename ) = @_;
    my $filecontents = do {
        open my $fd, "<", $filename or die "Cannot read $filename: $!";
        local $/;    # slurp mode: read the whole file at once
        <$fd>;
    };
    # \Q...\E makes $source match literally rather than as a regex
    $filecontents =~ s/\Q$source\E/$target/g;
    open my $fd, ">", $filename or die "Cannot write $filename: $!";
    print $fd $filecontents;
    close($fd);
}
my $ldir = "/Android";
my $RESULTS_FILE = $ldir.'/'.'results.html';
open(IN, "<", "logcat.txt") or die "Cannot read logcat.txt: $!";
while(<IN>)
{
    chomp;
    if( $_ =~ m/Unable to parse/ )
    {
        my @string = split('/',$_);
        print @string;
        my $stream_name = $string[4];
        replace_string_in_file( $stream_name, "THIS IS THE REPLACEMENT", $RESULTS_FILE );
    }
}
close(IN);