Perl grep through large file to match a string - perl

I have an array (@array) that holds a list of elements. I need to check whether each of these elements exists in a master file. If an element exists in the master file, the same line must also contain the string YES (in the fifth position), and the element should then be stored in a different array.
My script currently uses two grep shell commands to achieve this. How can I write the same thing in pure Perl?
...
use Data::Dumper;
my @new_array;
my @array = ('RT0AC1', 'WG3RA3');
print Dumper(\@array);
foreach ( @array ){
my $line = `grep $_ "master_file.csv" | grep -i yes`;
next unless($line);
push( @new_array, $_ );
}
print Dumper(@new_array);
...
where master_file.csv looks like this:
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
So here $line gets the value 102,RT0AC1,CONNECTED,WORKING,YES, and the element RT0AC1 is stored in @new_array.
How can I avoid using backticks (`) and two greps to achieve this? I am trying to do this in pure Perl. Also, master_file.csv contains millions of records.

Since all the words you're looking for are in the same location, it's easy to just split up the current line on commas and see if the second column exists in a hash table, and if the fifth column is equal to "YES":
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
my $filename = shift // "master_file.csv"; # Default filename if not given on command line
my @array = qw/RT0AC1 WG3RA3/; # Words you're looking for
my %words = map { $_ => 1 } @array; # Store them in a hash for fast lookup
my @new_array;
# Use Text::CSV_XS for non-trivial CSV files
open my $csv, "<", $filename;
while (<$csv>) {
chomp;
my @F = split /,/;
push @new_array, $F[1] if exists $words{$F[1]} && $F[4] eq "YES";
}
print Dumper(\@new_array);
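One difference from the original grep loop is worth noting: the loop above pushes a name once per matching line, so a name with several YES rows would appear several times. A minimal sketch that records each name only once, assuming the same five-column layout; the helper name names_with_yes and the in-memory sample (including row 106) are illustrative and not from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: return the wanted names that have at least one
# row whose fifth column is YES, each name reported once.
sub names_with_yes {
    my ($fh, @wanted) = @_;
    my %pending = map { $_ => 1 } @wanted;   # names not yet confirmed
    my @found;
    while (my $line = <$fh>) {
        chomp $line;
        my @F = split /,/, $line;
        next unless @F >= 5;                 # skip short or blank lines
        # delete() returns the stored true value only the first time
        if ($F[4] eq 'YES' && delete $pending{ $F[1] }) {
            push @found, $F[1];
            last unless %pending;            # stop early once all are found
        }
    }
    return @found;
}

# In-memory sample standing in for master_file.csv
my $sample = join "\n",
    '101,RT0AC1,CONNECTED,FAULTY,NO',
    '102,RT0AC1,CONNECTED,WORKING,YES',
    '106,RT0AC1,CONNECTED,WORKING,YES',
    '105,WG3RA3,CONNECTED,WORKING,NO',
    '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join(',', names_with_yes($fh, qw/RT0AC1 WG3RA3/)), "\n";
close $fh;
```

Stopping as soon as every name has been confirmed can save a lot of reading on a file with millions of records, though in the worst case the whole file is still scanned.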

Build a regex that matches the records of interest, split each matching line into fields, and compare field 5 to YES. If it matches, increment a count for field 2 in the %match hash.
Once the file has been processed, %match will have field 2 of each matched record as a key, and the value will show how many times that field appeared with YES in the file.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
while(<DATA>) {
chomp;
next unless /$re_filter/;
my @data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
Output
$VAR1 = {
'RT0AC1' => 1
};
Remove the __DATA__ section to get the final code, and give the filename on the command line to process the file with the data of interest:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
while(<>) {
chomp;
next unless /$re_filter/;
my @data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
An alternative version based on a regular expression, without using split:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
my $regex = qr/^\d+,($re_filter),[^,]+,[^,]+,YES$/;
/$regex/ && $match{$1}++ for <DATA>;
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
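If the names to look for could ever contain regex metacharacters (such as . or +), interpolating them directly into the pattern would change its meaning. The sketch below builds the same kind of filter with quotemeta, assuming the five-column layout above. The helper name build_filter and the A.B / AxB sample rows are hypothetical and only there to show the difference:

```perl
use strict;
use warnings;

# Hypothetical helper: compile one anchored regex from literal names.
sub build_filter {
    my @names = @_;
    # quotemeta escapes characters that would otherwise act as regex syntax
    my $alternation = join '|', map { quotemeta($_) } @names;
    return qr/^[^,]*,($alternation),[^,]*,[^,]*,YES$/;
}

my $regex = build_filter(qw(RT0AC1 WG3RA3 A.B));
my %match;
for my $line (
    '102,RT0AC1,CONNECTED,WORKING,YES',
    '105,WG3RA3,CONNECTED,WORKING,NO',
    '107,AxB,CONNECTED,WORKING,YES',     # must not match the literal name "A.B"
    '108,A.B,CONNECTED,WORKING,YES',
) {
    $match{$1}++ if $line =~ $regex;
}
print "$_ => $match{$_}\n" for sort keys %match;
```

Without quotemeta, the row with AxB would be counted under the pattern for A.B, because an unescaped dot matches any character.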

Related

find the only columns with empty values in entire file

With the below csv data
name,place,animal
a,,
b,,
a,,
,b,
The name field is present in 3 rows but missing in 1 row.
The place field is present in 1 row but missing in 3 rows.
The animal field is empty in all the rows -> get these column names.
I would like to get the column names only if they are empty in all the rows.
I am trying to write a Perl script for this but am not sure how to attack the problem.
Step 1: Check all the columns in the first row; if any column is not empty, don't search it in the next row.
Step 2: Keep repeating step 1 in a loop and finally we will get the output. This also brings down the complexity, as we don't care about columns that have a value even once.
I will implement the code and post it here.
But if you have any new ideas, please advise me.
Thanks
For CSV files with no quotes and escaping, just keep a hash of empty columns so far. Reading the file line by line, remove any non-empty column from the hash:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
chomp( my @column_names = split /,/, <> );
my %empty;
@empty{ @column_names } = ();
while (<>) {
chomp;
my @columns = split /,/;
for my $i (0 .. $#columns) {
delete $empty{ $column_names[$i] } if length $columns[$i];
}
}
say for keys %empty;
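Note that keys %empty returns the names in no particular order. A sketch of the same idea that reports the empty columns in header order and stops reading once no candidates remain, assuming simple CSV without quoting; the helper name empty_columns and the in-memory input are illustrative:

```perl
use strict;
use warnings;

# Hypothetical helper: names of columns that are empty in every data row,
# reported in header order. Assumes simple CSV without quotes or escapes.
sub empty_columns {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;
    chomp $header;
    my @names = split /,/, $header;
    my %empty = map { $_ => 1 } @names;     # every column starts as a candidate
    while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /,/, $line, -1;    # -1 keeps trailing empty fields
        for my $i (0 .. $#cols) {
            next if $i > $#names;           # ignore fields beyond the header
            delete $empty{ $names[$i] } if length $cols[$i];
        }
        last unless %empty;                 # nothing left to rule out
    }
    return grep { $empty{$_} } @names;
}

my $data = "name,place,animal\na,,\nb,,\na,,\n,b,\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for empty_columns($fh);
close $fh;
```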
For real CSV files, use Text::CSV_XS, but the method is the same: populate a hash by column names, then remove the non empty ones:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Text::CSV_XS qw{ csv };
my %empty;
csv(in => shift,
out => \ 'skip',
headers => sub { undef $empty{ $_[0] }; $_[0] },
on_in => sub {
my (undef, $columns) = @_;
delete @empty{ grep length $columns->{$_}, keys %$columns }
},
);
say for keys %empty;
As rows are processed, update an ancillary array that keeps track of each field's truth value.
If any field in a new row is non-empty, the corresponding element of the array flips to true; otherwise it stays false. In the end, the indices of the array's false elements identify the empty columns.
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = 'cols.csv';
my $csv = Text::CSV->new( { binary => 1 } )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, '<', $file or die "Can't open $file: $!";
my @col_names = @{ $csv->getline($fh) };
my @mask;
while (my $line = $csv->getline($fh)) {
@mask = map { $mask[$_] || $line->[$_] ne '' } (0..$#$line);
}
for (0..$#mask) {
say "Column \"$col_names[$_]\" is empty" if not $mask[$_];
}
Syntax: $#$line is the index of the last element of the arrayref $line (like $#ary is for @ary).

how to remove last single line available in file using perl

How can I remove the last line of a file using Perl?
My data looks like this:
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
How can I remove the last line? I am summing all the numbers but get a null value at the end.
I tried using chomp but it did not work.
Here is the code currently being used:
while (<data>) {
chomp(my @row = (split ',' , $_ , -1));
say sum @row[1 .. $#row];
}
Try this (shell one-liner) :
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);
my $csv = Text::CSV->new({binary => 1})
or die "Cannot use CSV: " . Text::CSV->error_diag;
while(my $row = $csv->getline($fh)) {
next unless ($row->[0] || '') =~ m/\w/; # Reject rows that don't start with an identifier.
my $sum = sum(@$row[1..$#$row]);
print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(@row) && length $row[0]; # Skip empty rows.
The point of this line is to detect when a row is empty: it has no elements, or its elements were empty after the chomp.
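Put together with the asker's loop, the check might look like the sketch below. It assumes a plain comma-separated layout with the label in the first field and numbers after it; row_sum is a hypothetical helper, and the input lines are an in-memory sample rather than the asker's file:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: sum the numeric fields after the label,
# or return undef for a blank line.
sub row_sum {
    my ($line) = @_;
    chomp $line;
    my @row = split /,/, $line;     # without -1, trailing empty fields are dropped
    return unless @row && length $row[0];   # skip empty rows
    return sum(@row[1 .. $#row]) // 0;      # sum() of an empty list is undef
}

for my $line (qq{"A",1,-2,-1,-4,\n}, qq{"B",3,-5,-2,-5,\n}, "\n") {
    my $sum = row_sum($line);
    print "$sum\n" if defined $sum;
}
```

Dropping the -1 limit from split is what removes the empty trailing field caused by the comma at the end of each line.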
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
# Chomp the input before splitting it.
chomp;
# Remove the -1 from your call to split().
# This automatically removes any empty trailing fields.
my @row = split /,/;
# Skip lines that are empty.
# 1/ Ensure there is data in @row.
# 2/ Ensure at least one element in @row contains
# non-whitespace data.
next unless @row and grep { /\S/ } @row;
# Ensure that all of the data you pass to sum()
# looks like numbers.
say sum grep { looks_like_number $_ } @row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,

Opening, splitting and sorting into an array in Perl

I am a beginner programmer who has been given a week-long assignment to build a complex program, but I am having a difficult time getting started. I have been given a set of data, and the goal is to separate it into two arrays by the second column, based on whether the letter is M or F.
This is the code I have so far:
#!/usr/local/bin/perl
open (FILE, "ssbn1898.txt");
$x=<FILE>;
split/[,]/$x;
@array1=$y;
if @array1[2]="M";
print @array2;
else;
print @array3;
close (FILE);
How do I fix this? Please use the simplest terms possible; I started coding last week!
Thank you
First off - you split on comma, so I'm going to assume your data looks something like this:
one,M
two,F
three,M
four,M
five,F
six,M
There are a few problems with your code:
Turn on strict and warnings. They warn you about possible problems with your code.
open is better written as open ( my $input, "<", $filename ) or die $!;
You only read one line from <FILE>, because assigning it to a scalar $x reads just one line.
You don't actually insert your values into either array.
So to do what you're basically trying to do:
#!/usr/local/bin/perl
use strict;
use warnings;
#define your arrays.
my @M_array;
my @F_array;
#open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;
#read file one at a time - this sets the implicit variable $_ each loop,
#which is what we use for the split.
while ( <$input> ) {
#remove linefeeds
chomp;
#capture values from either side of the comma.
my ( $name, $id ) = split ( /,/ );
#test if id is M. We _assume_ that if it's not, it must be F.
if ( $id eq "M" ) {
#insert it into our list.
push ( @M_array, $name );
}
else {
push ( @F_array, $name );
}
}
close ( $input );
#print the results
print "M: @M_array\n";
print "F: @F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
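As a rough illustration of the hash idea, the sketch below groups names under whatever value appears in the second column, so no if/else per letter is needed. It assumes the name,letter layout guessed above; group_by_second_column is a hypothetical helper and the sample lines are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: group names by the value of the second column.
sub group_by_second_column {
    my @lines = @_;
    my %names_for;
    for my $line (@lines) {
        chomp $line;
        my ($name, $id) = split /,/, $line;
        next unless defined $id;            # skip malformed lines
        push @{ $names_for{$id} }, $name;   # the array is created on first use
    }
    return %names_for;
}

my %names_for = group_by_second_column("one,M\n", "two,F\n", "three,M\n");
print "$_: @{ $names_for{$_} }\n" for sort keys %names_for;
```

Each key (here M or F) maps to a reference to an array of names, which is the key-value association mentioned above.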
There's a part function in List::MoreUtils that does exactly what you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'part';
my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;
say "M: @$m";
say "F: @$f";
__END__
one,M,foo
two,F,bar
three,M,baz
four,M,foo
five,F,bar
six,M,baz
The output is:
M: one,M,foo
three,M,baz
four,M,foo
six,M,baz
F: two,F,bar
five,F,bar
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my @boys=();
my @girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar
open (FIN,"< $fname")
or die "$fname:$!";
while ( my $line=<FIN> ) {
chomp $line;
my @f=split(",",$line);
push @boys,$f[0] if $f[1]=~ m/[mM]/;
push @girls,$f[0] if $f[1]=~ m/[fF]/; # store the name, and match F as in the data
}
print Dumper(\@boys);
print Dumper(\@girls);
exit 0;
# Caveats:
# Code is not tested but should work and definitely shows the concepts
#
In fact the same thing...
#!/usr/bin/perl
use strict;
my (@m,@f);
while(<>){
push (@m,$1) if(/(.*),M/);
push (@f,$1) if(/(.*),F/);
}
print "M=@m\nF=@f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (@m,$1) if(/(.*),M/);
push (@f,$1) if(/(.*),F/);
END { print "M=@m\nF=@f\n";}

How to get the maximum number of columns present in a file in perl

I have a test.csv file which has data something like this.
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
I want to fetch the maximum number of columns in this file (i.e. 6 in this case) and then store it in a variable.
Like
Variable=6
Can you give me some suggestions on how to proceed?
Try using Text::CSV
Read each line through, parse through this module, and compare the number of fields to your variable.
#!/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
my $max = 0;
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
my $count = scalar @$row;
$max = $count > $max ? $count : $max;
}
One of the main reasons given why people use split on a CSV file rather than Text::CSV is that Text::CSV isn't a standard Perl module, so it might not be available.
Then use Text::ParseWords. This is a standard module and should be readily available:
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use Text::ParseWords qw(quotewords);
my $keep = 0;
for my $line ( <DATA> ) {
chomp $line;
my @columns = quotewords ('\s*,\s*', $keep, $line ); # single quotes so \s reaches the regex intact
say "<" . join( "> <", @columns ) . ">";
}
__DATA__
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
"D","can, can, can","23-Nov-2011","123"
This produces:
<a> <usa> <24-Nov-2011> <100.98> <Extra1> <Extra2>
<B> <zim> <23-Nov-2011> <123> <Extra22>
<C> <can> <23-Nov-2011> <123>
<D> <can, can, can> <23-Nov-2011> <123>
Note that the commas inside the quotes didn't throw off the parsing. Now, there are no more excuses for using split.
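The example above prints the parsed fields but does not compute the maximum the question asks for. A minimal sketch of that last step with the same module, assuming each record sits on a single line; max_columns is a hypothetical helper and the lines are passed in directly instead of being read from test.csv:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper: largest number of fields found on any line.
sub max_columns {
    my @lines = @_;
    my $max = 0;
    for my $line (@lines) {
        chomp $line;
        my @columns = quotewords(',', 0, $line);   # 0 = strip the quotes
        $max = @columns if @columns > $max;        # array in numeric context = count
    }
    return $max;
}

my $variable = max_columns(
    qq{"a","usa","24-Nov-2011","100.98","Extra1","Extra2"\n},
    qq{"B","zim","23-Nov-2011","123","Extra22"\n},
    qq{"D","can, can, can","23-Nov-2011","123"\n},
);
print "Variable=$variable\n";
```

The quoted field containing commas counts as one column, so the last sample line contributes 4 fields, not 6.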

How to split each entry of an array on each whitespace?

I have a file like this:
This is is my "test"
file with a lot
words separeted by whitespace.
Now I want to split this into an array in which each element contains one word and all duplicate words are removed.
the desired array:
This
is
my
test
etc...
I read the file into an array, but I do not know how to split a whole array so that the result is a new array. And how can I remove the duplicate words?
#!/usr/bin/perl
package catalogs;
use Log::Log4perl;
Log::Log4perl->init("log4perl.properties");
open(FILE, "<Source.txt") || die "file Sources.txt could not be opened";
my @fileContent = <FILE>;
close FILE;
my $log = Log::Log4perl->get_logger("catalogs");
@fileContent = split(" ");
To extract the words, you could use
my @words = $str =~ /\w+/g;
As for removing duplicates,
use List::MoreUtils qw( uniq );
my @uniq_words = uniq @words;
or
my %seen;
my @uniq_words = grep !$seen{$_}++, @words;
You're loading the text of the file into an array, but it may make more sense to load the file into a single string. This would enable you to take advantage of the solution @ikegami provided. To bring it all together, try the following.
use List::MoreUtils qw( uniq );
my $filecontent = do
{
local $/ = undef;
<STDIN>;
};
my @words = $filecontent =~ /\w+/g;
my @uniqword = uniq(@words);
my $log = Log::Log4perl->get_logger("catalogs");
@fileContent = split(/\s+/, $log);
@filecontent = uniq(@filecontent);
To make the words unique, you can use the uniq subroutine or map them into a hash. Since the keys of a hash are always unique, duplicates will be overwritten.
use strict;
use warnings;
use Data::Dumper;
my @a = (1,1,1,2,3,4,4);
my %hash = ();
%hash = map { $_ => 1 } @a; # braces needed: "map $_=>'1', @a" parses as map($_, '1', @a)
my @new = keys(%hash );
print Dumper(\@new);
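One caveat with the hash approach is that keys returns the words in no particular order. If the order of first appearance matters, as in the desired array in the question, the %seen idiom shown earlier keeps it. A small sketch combining word extraction and de-duplication; unique_words is a hypothetical helper and the text is an in-memory sample based on the question's file:

```perl
use strict;
use warnings;

# Hypothetical helper: unique words in order of first appearance.
sub unique_words {
    my ($text) = @_;
    my %seen;
    # The match in list context returns every word; grep keeps a word
    # only the first time it is seen.
    return grep { !$seen{$_}++ } $text =~ /\w+/g;
}

my $text = qq{This is is my "test"\nfile with a lot\n};
my @words = unique_words($text);
print "$_\n" for @words;
```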