Extract data from file - perl

I have data like
"scott
E -45 COLLEGE LANE
BENGALI MARKET
xyz -785698."
"Tomm
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
I wrote one Perl program to extract names i.e;
open(my$Fh, '<', 'printable address.txt') or die "!S";
open(my$F, '>', 'names.csv') or die "!S";
while (my#line =<$Fh> ) {
for(my$i =0;$i<=13655;$i++){
if ($line[$i]=~/^"/) {
print $F $line[$i];
}
}
}
It works fine and it extracts names exactly .Now my aim is to extract address that is like
BENGALI MARKET
xyz -785698."
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
In CSV file. How to do this please tell me

There are a lot of flaws with your original problem. Should address those before suggesting any enhancements:
Always have use strict; and use warnings; at the top of every script.
Your or die "!S" statements are broken. The error code is actually in $!. However, you can skip the need to do that by just having use autodie;
Give your filehandles more meaningful names. $Fh and $F say nothing about what those are for. At minimum label them as $infh and $outfh.
The while (my #line = <$Fh>) { is flawed as that can just be reduced to my #line = <$Fh>;. Because you're going readline in a list context it will slurp the entire file, and the next loop it will exit. Instead, assign it to a scalar, and you don't even need the next for loop.
If you wanted to slurp your entire file into #line, your use of for(my$i =0;$i<=13655;$i++){ is also flawed. You should iterate to the last index of #line, which is $#line.
if ($line[$i]=~/^"/) { is also flawed as you leave the quote character " at the beginning of your names that you're trying to match. Instead add a capture group to pull the name.
With the suggested changes, the code reduces to:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'names.csv';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $outfh "$1\n";
}
}
Now if you also want to isolate the address, you can use a similar method as you did with the name. I'm going to assume that you might want to build the whole address in a variable so you can do something more complicated with it than throwing them blindly at a file. However, mirroring the file setup for now:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $namefh, '>', 'names.csv';
open my $addressfh, '>', 'address.dat';
my $address = '';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $namefh "$1\n";
} elsif ($line =~ /(.*)"$/) {
$address .= $1;
print $addressfh "$address\n";
$address = '';
} else {
$address .= $line;
}
}
Ultimately, no matter what you want to use your data for, your best solution is probably to output it to a real CSV file using Text::CSV. That way it can be imported into a spreadsheet or some other system very easily, and you won't have to parse it again.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'address.csv';
my #data;
while (my $line = <$infh>) {
# Name Field
if ($line =~ /^"(.*)/) {
#data = ($1, '');
# End of Address
} elsif ($line =~ /(.*)"$/) {
$data[1] .= $1;
$csv->print($outfh, \#data);
# Address lines
} else {
$data[1] .= $line;
}
}

Related

Print variable after closing the file in Perl

Below code works fine but I want $ip to be printed after closing the file.
use strict;
use warnings;
use POSIX;
my $file = "/tmp/example";
open(FILE, "<$file") or die $!;
while ( <FILE> ) {
my $lines = $_;
if ( $lines =~ m/address/ ) {
my ($string, $ip) = (split ' ', $lines);
print "IP address is: $ip\n";
}
}
close(FILE);
sample data in /tmp/example file
$cat /tmp/example
country us
ip_address 192.168.1.1
server dell
This solution looks for the first line that contains ip_address followed by some space and a sequence of digits and dots
Wrapping the search in a block makes perl delete the lexical variable $fh. Because it is a file handle, that handle will also be automatically closed
Note that I've used autodie to avoid the need to explicitly check the status of the open call
This algorithm will find the first occurrence of ip_address and stop reading the file immediately
use strict;
use warnings 'all';
use autodie;
my $file = '/tmp/example';
my $ip;
{
open my $fh, '<', $file;
while ( <$fh> ) {
if ( /ip_address\h+([\d.]+)/ ) {
$ip = $1;
last;
}
}
}
print $ip // 'undef', "\n";
output
192.168.1.1
Store all ips in an array and you'll then have it for later processing.
The shown code can also be simplified a lot. This assumes a four-number ip and data like that shown in the sample
use warnings;
use strict;
use feature 'say';
my $file = '/tmp/example';
open my $fh, '<', $file or die "Can't open $file: $!";
my #ips;
while (<$fh>) {
if (my ($ip) = /ip_address\s*(\d+\.\d+\.\d+\.\d+)/) {
push #ips, $ip;
}
}
close $fh;
say for #ips;
Or, once you open the file, process all lines with a map
my #ips = map { /ip_address\s*(\d+\.\d+\.\d+\.\d+)/ } <$fh>;
The filehandle is here read in a list context, imposed by map, so all lines from the file are returned. The block in map applies to each in turn, and map returns a flattened list with results.
Some notes
Use three-argument open, it is better
Don't assign $_ to a variable. To work with a lexical use while (my $line = <$fh>)
You can use split but here regex is more direct and it allows you to assign its match so that it is scoped. If there is no match the if fails and nothing goes onto the array
use warnings;
use strict;
my $file = "test";
my ( $string,$ip);
open my $FH, "<",$file) or die $!;
while (my $lines = <FH>) {
if ($lines =~ m/address/){
($string, $ip) = (split ' ', $lines);
}
}
print "IP address is: $ip\n";
This will give you the output you needed. But fails in the case of multiple IP match lines in the input file overwrites the last $ip variable.

Want to add random string to identifier line in fasta file

I want to add random string to existing identifier line in fasta file.
So I get:
MMETSP0259|AmphidiniumcarteCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Then the sequence on the next lines as normal. I am have problem with i think in the format output. This is what I get:
MMETSP0259|AmphidiniumCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
CTTCATCGCACATGGATAACTGTGTACCTGACTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
TCTGGGAAAGGTTGCTATCATGAGTCATAGAATaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
It's added to every line. (I altered length to fit here.) I want just to add to the identifier line.
This is what i have so far:
use strict;
use warnings;
my $currentId = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
my $header_line;
my $seq;
my $uniqueID;
open (my $fh,"$ARGV[0]") or die "Failed to open file: $!\n";
open (my $out_fh, ">$ARGV[0]_longer_ID_MMETSP.fasta");
while( <$fh> ){
if ($_ =~ m/^(\S+)\s+(.*)/) {
$header_line = $1;
$seq = $2;
$uniqueID = $currentId++;
print $out_fh "$header_line$uniqueID\n$seq";
} # if
} # while
close $fh;
close $out_fh;
Thanks very much, any ideas will be greatly appreciated.
Your program isn't working because the regex ^(\S+)\s+(.*) matches every line in the input file. For instance, \S+ matches CTTCATCGCACATGGATAACTGTGTACCTGACT; the newline at the end of the line matches \s+; and nothing matches .*.
Here's how I would encode your solution. It simply appends $current_id to the end of any line that contains a pipe | character
use strict;
use warnings;
use 5.010;
use autodie;
my ($filename) = #ARGV;
my $current_id = 'a' x 57;
open my $in_fh, '<', $filename;
open my $out_fh, '>', "${filename}_longer_ID_MMETSP.fasta";
while ( my $line = <$in_fh> ) {
chomp $line;
$line .= $current_id if $line =~ tr/|//;
print $line, "\n";
}
close $out_fh;
output
MMETSP0259|AmphidiniumCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
CTTCATCGCACATGGATAACTGTGTACCTGACT
TCTGGGAAAGGTTGCTATCATGAGTCATAGAAT

match first columns in two files

I have two files of unequal sizes. First file has two columns and second has only one column. I want to match the column in second file to the first column in first file and if they match, print the whole line from the first file. Pretty simple but I am stuck. Here's what I did after opening and storing the contents of both the files in arrays
foreach(#q) #second file
{
$line=$_;
foreach(#gs) #first file
{
$line1=$_;
if ( $line1=~ /$line/ )
{
print $line1;
}
}
}
This doesnt give an output.
I suspect you might be getting tripped up by line endings for one or both of your files. Regardless, it's not necessary to slurp both your files, just the 2nd one. And a regex is most likely overkill, a simple equality check is sufficient, and more likely what you intend.
The following is probably what you intend:
use strict;
use warnings;
use autodie;
my $file1 = 'foo.txt';
my $file2 = 'bar.txt';
open my $fh2, '<', $file2;
my #keys = <$fh2>;
chomp(#keys);
open my $fh1, '<', $file1;
while (my $line = <$fh1>) {
my $fields = split ' ', $line;
if (grep {$fields[0] eq $_} #keys) {
print $line;
}
}
use strict;
use warnings;
my $file2 = 'foo.txt';
my $file1 = 'bar.txt';
my #line1;
open FF,$file2;
while(<FF>)
{
unshift(#line1,$_);
}
close(FF);
open FH,$file1;
while(<FH>)
{
my $se=$_;
chomp($se);
foreach my $data (#line1)
{
if($data=~m/^\s*$se\s*\t/is)
{
print $data."\n";
}
}
}
close(FH);
Try This....

Perl - empty rows while writing CSV from Excel

I want to convert excel-files to csv-files with Perl. For convenience I like to use the module File::Slurp for read/write operations. I need it in a subfunction.
While printing out to the screen, the program generates the desired output, the generated csv-files unfortunately just contain one row with semicolons, field are empty.
Here is the code:
#!/usr/bin/perl
use File::Copy;
use v5.14;
use Cwd;
use File::Slurp;
use Spreadsheet::ParseExcel;
sub xls2csv {
my $currentPath = getcwd();
my #files = <$currentPath/stage0/*.xls>;
for my $sourcename (#files) {
print "Now working on $sourcename\n";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/g;
print "Output CSV-File: ".$outFile."\n";
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";"; # correct output!
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" ); # only one row of semicolons with empty fields!
}
}
print "\n";
}
}
}
}
xls2csv();
I know it has something to do with the parameter passing in the write_file function, but couldn't manage to fix it.
Has anybody an idea?
Thank you very much in advance.
write_file will overwrite the file unless the append => 1 option is given. So this:
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" );
Will write a new file for each new cell value. It does however not match your description of "only one row of semi-colons of empty fields", as it should only be one semi-colon, and one value.
I am doubtful towards this sentiment from you: "For convenience I like to use the module File::Slurp". While the print statement works as it should, using File::Slurp does not. So how is that convenient?
What you should do, if you still want to use write_file is to gather all the lines to print, and then print them all at once at the end of the loop. E.g.:
$line .= $source_cell->Value . ";"; # use concatenation to build the line
...
push #out, "$line\n"; # store in array
...
write_file(...., \#out); # print the array
Another simple option would be to use join, or to use the Text::CSV module.
Well, in this particular case, File::Slurp was indeed complicating this for me. I just wanted to avoid to repeat myself, which I did in the following clumsy working solution:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy;
use v5.14;
use Cwd;
use File::Basename;
use File::Slurp;
use Tie::File;
use Spreadsheet::ParseExcel;
use open qw/:std :utf8/;
# ... other functions
sub xls2csv {
my $currentPath = getcwd();
my #files = <$currentPath/stage0/*.xls>;
my $fh;
for my $sourcename (#files) {
say "Now working on $sourcename";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/gi;
if ( -e $outFile ) {
unlink($outFile) or die "Error: $!";
print "Old $outFile deleted.";
}
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh $source_cell->Value, ";";
close $fh;
}
}
print "\n";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh "\n";
close $fh;
}
}
}
}
xls2csv();
I'm actually NOT happy with it, since I'm opening and closing the files so often (I have many files with many lines). That's not very clever in terms of performance.
Currently I still don't know how to use the split or Text:CSV in this case, in order to put everything into an array and to open, write and close each file only once.
Thank you for your answer TLP.

Print email addresses to a file in Perl

I have been scouring this site and others to find the best way to do what I need to do but to no avail. Basically I have a text file with some names and email addresses. Each name and email address is on its own line. I need to get the email addresses and print them to another text file. So far all I have been able to print is the "no email addresses found" message. Any thoughts? Thanks!!
#!/usr/bin/perl
open(IN, "<contacts.txt") || die("file not found");
#chooses the file to read
open(OUT, ">emailaddresses.txt");
#prints file
$none = "No emails found!";
$line = <IN>;
for ($line)
{
if ($line =~ /[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}/g)
{
print (OUT $line);
}
else
{
print (OUT $none);
}
}
close(IN);
close(OUT);
First, always use strict; use warnings. This helps writing correct scripts, and is an invaluable aid when debugging.
Also, use a three-arg-open:
open my $fh, "<", $filename or die qq(Can't open "$filename": $!);
I included the reason for failure ($!), which is a good practice too.
The idiom to read files (on an open filehandle) is:
while (<$fh>) {
chomp;
# The line is in $_;
}
or
while (defined(my $line = <$fh>)) { chomp $line; ... }
What you did was to read one line into $line, and loop over that one item in the for loop.
(Perl has a notion of context. Operators like <$fh> behave differently depending on context. Generally, using a scalar variable ($ sigil) forces scalar context, and #, the sigil for arrays, causes list context. This is quite unlike PHP.)
I'd rewrite your code like:
use strict; use warnings;
use feature 'say';
my $regex = qr/[A-Z0-9._%+-]+\#[A-Z0-9.-]+\.[A-Z]{2,4}/i; # emails are case insensitive
my $found = 0;
while (<>) { # use special ARGV filehandle, which usually is STDIN
while (/($regex)/g) {
$found++;
say $1;
}
}
die "No emails found\n" unless $found;
Invoked like perl script.pl <contacts.txt >emailaddresses.txt. The shell is your friend, and creating programs that can be piped from and to is good design.
Update
If you want to hardcode the filenames, we would combine the above script with the three-arg open I have shown:
use strict; use warnings; use feature 'say';
use autodie; # does `... or die "Can't open $file: $!"` for me
my $regex = qr/[A-Z0-9._%+-]+\#[A-Z0-9.-]+\.[A-Z]{2,4}/i;
my $found = 0;
my $contact_file = "contacts.txt";
my $email_file = "emailaddresses.txt";
open my $contact, "<", $contact_file;
open my $email, ">", $email_file;
while (<$contact>) { # read from the $contact filehandle
while (/($regex)/g) { # the /g is optional if there is max one address per line
$found++;
say {$email} $1; # print to the $email file handle. {curlies} are optional.
}
}
die "No emails found\n" unless $found; # error message goes to STDERR, not to the file