RTF to text conversion using Perl

Can somebody tell me how to convert an RTF file into text, keeping all the tags, tables and formatted data, using Perl?
@Ahmad Bilal, @petersergeant: I have been using the code below for RTF to TXT conversion, and it does produce text. The problem is that it does not capture tables or images, and not all the entities in the input file are captured either.
use 5.8.0;
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;
use RTF::HTMLConverter;
#-------------------------------------------------------------------
#Variable Declarations
#-------------------------------------------------------------------
my $tempfile = "";
my $Outfile = "";
my $txtfile = "";
my $URL = "";
my $Format = "";
my $TreeBuilder = "";
my $Parsed = "";
my $line = "";
my %opts;
GetOptions(
"help|h|?" => \$opts{help},
"man|m" => \$opts{man},
"dom=s" => \$opts{dom},
"noimages|n" => \$opts{noimages},
"imagedir|d=s" => \$opts{imagedir},
"imageuri|u=s" => \$opts{imageuri},
"encoding|e=s" => \$opts{encoding},
"indented|i=i" => \$opts{indented},
);
pod2usage(-verbose => 1, -exitval => 0) if $opts{help};
pod2usage(-verbose => 2, -exitval => 0) if $opts{man};
my %params;
if($opts{dom}){
eval "require $opts{dom}";
die $@ if $@;
$params{DOMImplementation} = $opts{dom};
}else{
eval { require XML::GDOME };
if($@){
eval { require XML::DOM };
die "Can't load either XML::GDOME or XML::DOM\n" if $@;
$params{DOMImplementation} = 'XML::DOM';
}
}
if($opts{noimages}){
$params{discard_images} = 1;
}else{
$params{image_dir} = $opts{imagedir} if defined $opts{imagedir};
$params{image_uri} = $opts{imageuri} if defined $opts{imageuri};
}
$params{codepage} = $opts{encoding} if $opts{encoding};
$params{formatting} = $opts{indented} if defined $opts{indented};
#-----------------------------------------------
# Converting RTF to HTML
#-----------------------------------------------
if(defined $ARGV[0]){
open(FR, "< $ARGV[0]") or die "Can't open '$ARGV[0]': $!!\n";
$params{in} = \*FR;
$tempfile = $ARGV[0];
$tempfile =~ /^(.*?)rtf/;
$Outfile = $1."html";
$txtfile = $1."txt";
open(FW, "> $Outfile") or die "Can't open '$Outfile': $!!\n";
$params{out} = \*FW;
print "\n$Outfile - HTML Created\n"
}
my $parser = RTF::HTMLConverter->new(%params);
$parser->parse();
close FW;
#-----------------------------------------------
# Opening HTML and TXT files
#-----------------------------------------------
open (FILE1, ">$txtfile") or die "Can't open '$txtfile': $!!\n";
open (FILE2, "$Outfile") or die "Can't open '$Outfile': $!!\n";
#-----------------------------------------------
# Converting HTML to TXT file
#-----------------------------------------------
local $/ = undef;
while ($line = <FILE2>) {
$line =~ s/\n//g;
$line =~ s/(<!DOCTYPE HTML.*><html><head>.*<\/style>)/<sectd>/;
$line =~ s/<font.*?>//g;
$line =~ s/<\/font>//g;
$line =~ s/<table .*?>/\n<table>\n/g;
$line =~ s/<\/table>/\n<\/table>/g;
$line =~ s/<td .*?>/\n<td>/g;
$line =~ s/<tr>/\n<tr>/g;
$line =~ s/<\/tr>/\n<\/tr>/g;
$line =~ s/<ul.*?>/\n<ul>/g;
$line =~ s/<li.*?>/\n<li>/g;
$line =~ s/<\/ul>/\n<\/ul>/g;
$line =~ s/<\/body><\/html>//g;
$line =~ s/<p.*?>/\n<p>/g;
$line =~ s/<p>(&nbsp;|\*|\s)+<\/p>//g;
$line =~ s/&nbsp;//g;
$line =~ s/(<sectd>\n?.*?)<\/head><body>/$1/g;
#-------------------
# Entity Conversion
#-------------------
$line =~ s/&rsquo;/’/g;
$line =~ s/&ldquo;/“/g;
$line =~ s/&rdquo;/”/g;
$line =~ s/&para;/¶/g;
print FILE1 $line;
}
print "$txtfile - TXT file Created \n";
close FILE1;
close FILE2;
unlink ("$Outfile");

I am the author of the linked module. Don't use it. If at all possible, shell out to a real RTF-to-text converter such as Pandoc.
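A minimal sketch of what shelling out could look like. It assumes a Pandoc release that includes an RTF reader is installed and on the PATH; the sub names pandoc_command and rtf_to_text are made up for this example and are not part of any module.

```perl
use strict;
use warnings;

# Build the argument list for a Pandoc call that reads RTF and writes plain text.
# Assumption: the installed Pandoc supports "-f rtf".
sub pandoc_command {
    my ($rtf_path, $txt_path) = @_;
    return ('pandoc', '-f', 'rtf', '-t', 'plain', '-o', $txt_path, $rtf_path);
}

# Run the conversion. The list form of system() bypasses the shell,
# so spaces in file names need no extra quoting.
sub rtf_to_text {
    my ($rtf_path, $txt_path) = @_;
    my @cmd = pandoc_command($rtf_path, $txt_path);
    system(@cmd) == 0
        or die "pandoc failed for '$rtf_path' (exit status " . ($? >> 8) . ")\n";
    return $txt_path;
}

# Convert only when a file name is given on the command line.
rtf_to_text($ARGV[0], "$ARGV[0].txt") if @ARGV;
```

How well tables and images survive depends on Pandoc's plain-text writer, so it is worth checking the output on a sample document first.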

You need to use a module like this:
http://search.cpan.org/~sargie/RTF-Parser-1.12/lib/RTF/TEXT/Converter.pm

Related

Nested if statements: Swapping headers and sequences in fasta files

I am opening a directory and processing each file. A sample file looks like this when opened:
>AAAAA
TTTTTTTTTTTAAAAATTTTTTTTTT
>BBBBB
TTTTTTTTTTTTTTTTTTBBBBBTTT
>CCCCC
TTTTTTTTTTTTTTTTCCCCCTTTTT
For the above sample file, I am trying to make them look like this:
>TAAAAAT
AAAAA
>TBBBBBT
BBBBB
>TCCCCCT
CCCCC
I need to find the "header" in next line sequence, take flanks on either side of the match, and then flip them. I want to print each file's worth of contents to another separate file.
Here is my code so far. It runs without errors, but doesn't generate any output. My guess is this is probably related to the nested if statements. I have never worked with those before.
#!/usr/bin/perl
use strict;
use warnings;
my ($directory) = @ARGV;
my $dir = "$directory";
my @ArrayofFiles = glob "$dir/*";
my $count = 0;
open(OUT, ">", "/path/to/output_$count.txt") or die $!;
foreach my $file(@ArrayofFiles){
open(my $fastas, $file) or die $!;
while (my $line = <$fastas>){
$count++;
if ($line =~ m/(^>)([a-z]{5})/i){
my $header = $2;
if ($line !~ /^>/){
my $sequence .= $line;
if ($sequence =~ m/(([a-z]{1})($header)([a-z]{1}))/i){
my $matchplusflanks = $1;
print OUT ">", $matchplusflanks, "\n", $header, "\n";
}
}
}
}
}
How can I fix this code? Thanks.
Try this
foreach my $file(@ArrayofFiles)
{
open my $fh, "<", $file or die "error opening $file: $!\n";
while(my $head=<$fh>)
{
chomp $head;
$head=~s/>//;
my $next_line = <$fh>;
my($extract) = $next_line =~m/(.$head.)/;
print ">$extract\n$head\n";
}
}
There are several mistakes in your code but the main problem is:
if ($line =~ m/(^>)([a-z]{5})/i) {
my $header = $2;
if ($line !~ /^>/) {
# here you write to the output file
Because the same line can't start and not start with > at the same time, your output files are never written. The second if statement always fails and its block is never executed.
open(OUT, ">", "/path/to/output_$count.txt") or die $!; and $count++ are misplaced. Since you want to produce an output file (with a new name) for each input file, you need to put them in the foreach block, not outside or in the while loop.
Example:
#!/usr/bin/perl
use strict;
use warnings;
my ($dir) = @ARGV;
my @files = glob "$dir/*";
my $count;
my $format = ">%s\n%s\n";
foreach my $file (@files) {
open my $fhi, '<', $file
or die "Can't open file '$file': $!";
$count++;
my $output_path = "/path/to/output_$count.txt";
open my $fho, '>', $output_path
or die "Can't open file '$output_path': $!";
my ($header, $seq);
while(<$fhi>) {
chomp;
if (/^>([a-z]{5})/i) {
if ($seq) { printf $fho $format, $seq =~ /([a-z]${header}[a-z])/i, $header; }
($header, $seq) = ($1, '');
} else { $seq .= $_; }
}
if ($seq) { printf $fho $format, $seq =~ /([a-z]${header}[a-z])/i, $header; }
# Close both handles inside the loop, where they are still in scope.
close $fhi;
close $fho;
}

How to encrypt and decrypt a specific column in a file by using Perl?

I have a log file like the one below:
NAME ID LOCATION
aa 12 in
bb 13 freak
cc 14 test
I want to encrypt, and also decrypt, the field "LOCATION". How can I do that using an encryption module in Perl?
I'm new to Perl, so any help is appreciated.
Use Crypt::CBC.
Here is a sample script that encrypts or decrypts a whole file:
#!/usr/bin/perl
use strict;
use Crypt::CBC;
unless (scalar @ARGV == 3) {
die "Usage: $0 encrypt|decrypt|en|de \$mysecretkey \$file_to_encrypt_or_decrypt";
}
my $type = shift @ARGV;
my $key = shift @ARGV;
my $file = shift @ARGV;
die "The first ARGV should be one of de, en, encrypt, decrypt" if ($type !~ /^(en|de)(crypt)?$/);
die "the file $file does not exist" unless (-f $file);
my $DEBUG = 1;
print "type is $type, key is $key, file is $file\n" if $DEBUG;
my $cipher = Crypt::CBC->new(
-key => $key,
-cipher => 'Blowfish'
);
local $/;
open(FH, $file) or die $!;
flock(FH, 2);
my $data = <FH>;
close(FH);
my ($save_data, $save_file);
if ($type =~ /^en(crypt)?$/) {
$save_data = $cipher->encrypt($data);
$save_file = $file . '.encrypt';
} else {
$save_data = $cipher->decrypt($data);
$save_file = $file . '.decrypt';
}
open(FH, '>', $save_file) or die $!;
print FH $save_data;
close(FH);
if (-e $save_file) {
print "$type file $file to $save_file OK\n";
} else {
print "failed without reason\n";
}
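The script above works on the whole file, while the question asks about a single column. A rough sketch of the per-column part follows, assuming whitespace-separated fields and a header line as in the sample log. The name transform_column and the reversing placeholder are made up for illustration; the placeholder is not encryption.

```perl
use strict;
use warnings;

# Apply $transform to one whitespace-separated column of every data row.
# The first line is treated as the header and used to locate the column.
sub transform_column {
    my ($lines, $column_name, $transform) = @_;
    my ($header, @rows) = @$lines;
    my @names = split ' ', $header;
    my ($index) = grep { $names[$_] eq $column_name } 0 .. $#names;
    die "No column named '$column_name'\n" unless defined $index;

    my @out = ($header);
    for my $row (@rows) {
        my @fields = split ' ', $row;
        $fields[$index] = $transform->($fields[$index]) if defined $fields[$index];
        push @out, join ' ', @fields;
    }
    return @out;
}

# Placeholder for illustration only: reverses the string, it is NOT encryption.
my $fake_encrypt = sub { scalar reverse $_[0] };

my @log = ('NAME ID LOCATION', 'aa 12 in', 'bb 13 freak', 'cc 14 test');
print "$_\n" for transform_column(\@log, 'LOCATION', $fake_encrypt);
```

With the $cipher object from the script above, the callback could be sub { $cipher->encrypt_hex($_[0]) } for encrypting and sub { $cipher->decrypt_hex($_[0]) } for decrypting. Hex output keeps the encrypted field free of whitespace, so the file can still be split into columns afterwards.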

Failed to open GLOB error

I've noticed that when you drag and drop a file into the OS X Terminal and any part of the path contains a space (for example in a folder name), the space is escaped with a backslash (\ ).
This then leads to an error when my script opens the file:
use strict;
use warnings;
use File::Basename;
my $in;
my $filename = $ARGV[0];
unless ($filename){
print "\n\nPlease drag and drop a FASTA/FA or plain-text file containing your sequence into the prompt window and hit ENTER. Alternatively, manually specify the file-pathway:\n";
$filename = <STDIN>;
chomp $filename;
}
open($in, $filename) or die "Failed to open $in: $!";
my $DNA = read_fasta($in);
my $len = length($DNA);
print "\nFASTA/Sequence Length is: $len bp \n";
print "\nPlease enter restriction sites (degeneracy characters permitted) separated by a single space: ";
my $sites=<STDIN>;
chomp $sites;
my @pats = split ' ', $sites;
for (@pats) {
s/K/[GT]/g;
s/M/[AC]/g;
s/Y/[CT]/g;
s/S/[CG]/g;
s/W/[AT]/g;
s/B/[CGT]/g;
s/V/[ACG]/g;
s/H/[ACT]/g;
s/D/[AGT]/g;
s/X/[AGCT]/g;
s/R/[AG]/g;
s/N/[AGCT]/g;
}
for (@pats) {
my $m = () = $DNA =~ /$_/gi;
print "\nTotal DNA matches to $_ are: $m \n";
}
my $DIR = dirname($filename);
my $name = basename($filename);
(my $extrem = $name) =~ s/\.[^.]+$//;
open my $out, ">$DIR/$extrem $sites.txt";
my $pat=join("|",@pats);
my @cutarr = split(/$pat/, $DNA);
for (@cutarr) {
my $len = length($_);
print $out "$len \n";
}
print "\nYour results are located at: $DIR/$extrem $sites.txt\n\n";
close($out);
close($in);
#Subfunction - Reading formatted FASTA/FA files
sub read_fasta {
my ($fh) = @_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}
It will open files if the pathway contains no spaces. Is there a better way I can go about opening the file to avoid this occurring?
Try removing the backslashes from your file name:
$filename =~ tr|\\||d;
open(my $in, $filename) or die $!;
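A slightly more careful variant, as a sketch: instead of deleting every backslash, keep the character that follows each one, so an escaped backslash survives. It also trims the trailing space that a dragged-in path often carries. The sub name unescape_dropped_path and the sample path are made up for this example.

```perl
use strict;
use warnings;

# Undo the shell-style escaping that Terminal adds to dragged-in paths:
# "\ " becomes " ", "\(" becomes "(", and so on.
sub unescape_dropped_path {
    my ($path) = @_;
    $path =~ s/^\s+|\s+$//g;   # drop the newline and any padding around the path
    $path =~ s/\\(.)/$1/g;     # keep the character that follows each backslash
    return $path;
}

# Hypothetical input, as it might arrive from <STDIN> after a drag and drop.
my $filename = unescape_dropped_path("/Users/me/My\\ Folder/seq.fasta \n");
print "$filename\n";
```

The cleaned name can then be passed to a three-argument open, such as open(my $in, '<', $filename), which treats the name literally. One limitation: a file name that really ends in a space would lose it to the trimming step.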

Extracting multiple lines of record/data using a subroutine or functions

Can you show me how to create a subroutine or function using this code?
Basically I want to make my code into a subroutine so I'll be able to re-use it without making my script too long.
Here is my script:
#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Carp qw(croak);
my @fields;
my ($tmp_var, $rec_type, $country, $header, $Combline, $records, $line);
my $filename = 'data5.txt';
open (my $input_fh, '<', $filename ) or croak "Can't open $filename: $!";
open my $OUTPUTA, ">", 'drp1.txt' or die $!;
open my $OUTPUTB, ">", 'drp2.txt' or die $!;
while (<$input_fh>) {
$line = _trim($_);
@fields = split (/\|/, $line);
$rec_type = $fields[0];
$country = $fields[1];
my $string = substr $fields[1], 0, 1;
$header = $line if(/^INVHDR/);
if ($rec_type eq 'INVDET') {
if ($string eq 'I') {
$records = $header . $line;
print $OUTPUTA $records, scalar <$input_fh>;
}
else {
$records = $header . $line;
print $OUTPUTB $records, scalar <$input_fh>;
}
}
}
close $OUTPUTA or die $!;
close $OUTPUTB or die $!;
sub _trim {
my $word = shift;
if ( $word ) {
$word =~ s/\s*\|/\|/g; #remove trailing spaces
$word =~ s/"//g; #remove double quotes
}
return $word;
}
This is the part of the script that I wanted to put in a subroutine or function:
$line = _trim($_);
@fields = split (/\|/, $line);
$rec_type = $fields[0];
$country = $fields[1];
my $string = substr $fields[1], 0, 1;
$header = $line if (/^INVHDR/);
if ($rec_type eq 'INVDET') {
if ($string eq 'I') {
$records = $header . $line;
print $OUTPUTA $records, scalar <$input_fh>;
}
else {
$records = $header . $line;
print $OUTPUTB $records, scalar <$input_fh>;
}
}
I would suggest breaking it out a little differently and expanding on your _trim function, turning it into a parse function:
use strict;
use warnings;
my $filename = 'data5.txt';
open( my $input_fh, '<', $filename ) or die "Can't open $filename: $!";
open( my $OUTPUTA, '>', 'drp1.txt' ) or die $!;
open( my $OUTPUTB, '>', 'drp2.txt' ) or die $!;
my $header = '';
while (<$input_fh>) {
if ($_ =~ /^INVHDR/) {
$header = $_;
}
if ($_ =~ /^INVDET/) {
my @data = parse($_);
my $line = $header . join('|', @data);
# scalar <$input_fh> is almost certainly not doing what you expect,
# though I'm not sure what you're trying to accomplish with it
if ( $data[1] =~ /^I/ ) {
print $OUTPUTA $line;
} else {
print $OUTPUTB $line;
}
}
}
sub parse {
my $input = shift || return;
$input =~ s/"//g; # remove double quotes
# Here I've combined the removal of trailing spaces with the split.
my @fields = split( m{\s*\|}, $input );
return @fields;
}
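If the goal is also to move the INVDET routing decision into its own subroutine, one possible shape is sketched below. The names parse_record and route_detail, the 'A'/'B' return values and the sample records are all made up for illustration; the rule itself (second field starting with "I") is taken from the code above.

```perl
use strict;
use warnings;

# Split one pipe-delimited record into clean fields
# (double quotes removed, spaces before each "|" dropped).
sub parse_record {
    my ($line) = @_;
    $line =~ s/"//g;
    return split /\s*\|/, $line;
}

# Decide which output an INVDET line belongs to: 'A' when the second
# field starts with "I", 'B' otherwise; undef for any other record type.
sub route_detail {
    my ($line) = @_;
    my @fields = parse_record($line);
    return unless @fields >= 2 && $fields[0] eq 'INVDET';
    return $fields[1] =~ /^I/ ? 'A' : 'B';
}

# Made-up sample records, only for demonstration.
for my $line ('INVDET|"IN  "|100', 'INVDET|US|200', 'INVHDR|X|1') {
    my $target = route_detail($line) // 'skip';
    print "$target: $line\n";
}
```

The main loop would then only need to pick $OUTPUTA or $OUTPUTB based on the returned value, which keeps the file handling out of the reusable part.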

Using perl, how do I search a text file for _NN (at the end of a word) and print the word in front?

This gives the whole line:
#!/usr/bin/perl
$file = 'output.txt';
open(txt, $file);
while($line = <txt>) {
print "$line" if $line =~ /_NN/;
}
close(txt);
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
binmode(STDOUT, ":utf8") || die;
my $file = "output.txt";
open(TEXT, "< :utf8", $file) || die "Can't open $file: $!";
while(<TEXT>) {
print "$1\n" while /(\w+)_NN\b/g;
}
close(TEXT) || die "Can't close $file: $!";
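To see what the global match in that loop picks up, here is a small self-contained sketch. The sub name nn_words and the tagged sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Return every word that carries the _NN tag, without the tag itself.
# The \b after _NN rejects longer tags such as _NNS or _NNP.
sub nn_words {
    my ($line) = @_;
    return $line =~ /(\w+)_NN\b/g;   # list context: one capture per match
}

# Made-up tagged sentence, only for demonstration.
my $sample = 'the_DT cat_NN chased_VBD dogs_NNS near_IN the_DT barn_NN';
print "$_\n" for nn_words($sample);
```

Dropping the \b would also match the plural and proper-noun tags, which may or may not be wanted depending on the tag set.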
Your answer script reads a bit awkwardly, and has a couple of potential errors. I'd rewrite the main logic loop like so:
foreach my $line (grep { /expend_VB/ } @sentences) {
my @nouns = grep { /_NN/ } split /\s+/, $line;
foreach my $word (@nouns) {
$word =~ s/_NN//;
print "$word\n";
}
print "$line\n" if scalar(@nouns);
}
You need to put the my declaration inside the loop - otherwise it will persist longer than you want it to, and could conceivably cause problems later.
foreach is a more common Perl idiom for iterating over a list.
print "$1" if $line =~ /(\S+)_NN/;
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
my $search_key = "expend"; ## CHANGE "..." to <>
open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;
my @sentences = <$tag_corpus>; # This reads each line into a list element
my @words;
for (my $i=0; $i <= @sentences; $i++) {
if ( defined( $sentences[$i] ) and $sentences[$i] =~ /($search_key)_VB.*/i) {
@words = split /\s/,$sentences[$i]; ## \s is a whitespace
for (my $j=0; $j <= @words; $j++) {
#FILTER if word is noun:
if ( defined( $words[$j] ) and $words[$j] =~ /_NN/) {
#PRINT word and sentence:
print "**",split(/_\S+/,$words[$j]),"**", "\n";
print split(/_\S+/,$sentences[$i]), "\n"
}
} ## put print sentences here to print each sentence after all the nouns inside
}
}
close $tag_corpus || die "Can't close $tag_corpus: $!";