How to extract multiple columns from a CSV file using Perl

I'm pretty new to Perl and was hoping someone could help me with this issue. I need to extract a few columns from a CSV file with embedded commas. This is what the format looks like:
"ID","URL","DATE","XXID","DATE-LONGFORMAT"
I need to extract the DATE column, the XXID column, and the column immediately after XXID. Note that each line doesn't necessarily have the same number of columns.
The XXID column contains a two-letter prefix and doesn't always start with the same letter; it can be pretty much any letter of the alphabet. The length is always the same.
Finally, once these three columns are extracted, I need to sort on the XXID column and get a count of duplicates.

I published a module called Tie::Array::CSV which lets Perl interact with your CSV file as a native Perl nested array. If you use it, you can apply your search logic just as if your data were already in an array of array references. Take a look!
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp;
use Tie::Array::CSV;
use List::MoreUtils qw/first_index/;
use Data::Dumper;
# this builds a temporary file from DATA
# normally you would just make $file the filename
my $file = File::Temp->new;
print $file <DATA>;
#########
tie my @csv, 'Tie::Array::CSV', $file;
# find the column from the data in the first row
my $colnum = first_index { /^\w.{6}$/ } @{ $csv[0] };
print "Using column: $colnum\n";
# extract that column
my @column = map { $csv[$_][$colnum] } (0 .. $#csv);
# build a hash of repetitions
my %reps;
$reps{$_}++ for @column;
print Dumper \%reps;
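The question also asks for the results sorted on the XXID column; a short follow-up loop over the %reps hash built above could provide that (a sketch):
# Sketch: list the IDs in sorted order, each with its duplicate count.
for my $id (sort keys %reps) {
    print "$id occurs $reps{$id} time(s)\n";
}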

Here's a sample script using the Text::CSV module to parse your CSV data. Consult the module's documentation for the proper settings for your data.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 });
while (my $row = $csv->getline(\*DATA)) {
    print "Date: $row->[2]\n";
    print "Col#1: $row->[3]\n";
    print "Col#2: $row->[4]\n";
}
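To cover the sort-and-count part of the question as well, here is a minimal sketch building on the same Text::CSV approach (input.csv is a placeholder filename, and column index 3 for XXID is assumed from the sample header):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# input.csv is a placeholder; substitute your real filename.
open my $fh, '<', 'input.csv' or die "Cannot open input.csv: $!";
my $csv = Text::CSV->new({ binary => 1 });

my %count;
while (my $row = $csv->getline($fh)) {
    $count{ $row->[3] }++ if defined $row->[3];    # column 3 = XXID in the sample header
}

# Sorted XXIDs with their duplicate counts.
for my $xxid (sort keys %count) {
    print "$xxid: $count{$xxid}\n";
}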

You definitely want to use a CPAN library for parsing CSV, as you will never account for all the quirks of the format.
Please see: How can I parse quoted CSV in Perl with a regex?
Please see: How do I efficiently parse a CSV file in Perl?
However, here is a very naive and non-idiomatic solution for that particular string you provided:
use strict;
use warnings;
my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';
my @words  = ();
my $word   = "";
my $quotec = '"';
my $quoted = 0;

foreach my $c (split //, $string)
{
    if ($quoted)
    {
        if ($c eq $quotec)
        {
            # Closing quote: the field is complete.
            $quoted = 0;
            push @words, $word;
            $word = "";
        }
        else
        {
            $word .= $c;
        }
    }
    elsif ($c eq $quotec)
    {
        # Opening quote: start collecting a field.
        $quoted = 1;
    }
}

for (my $i = 0; $i < scalar @words; ++$i)
{
    print "column " . ($i + 1) . " = $words[$i]\n";
}
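For the sample string, this prints:
column 1 = ID
column 2 = URL
column 3 = DATE
column 4 = XXID
column 5 = DATE-LONGFORMAT
Note that unquoted fields would be silently skipped by this state machine, which is one more reason to prefer a real CSV parser.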

Related

Perl grep through large file to match a string

I have an array (@array) which has a list of elements. I need to check whether each of these elements exists in a master file or not. If an element exists in the master file, then the string YES (in the 5th position) should also exist in the same line, and that element should be stored in a different array.
Currently my script uses two shell grep commands to achieve this. How can I write the same thing in pure Perl?
...
use Data::Dumper;

my @new_array;
my @array = ('RT0AC1', 'WG3RA3');
print Dumper(\@array);

foreach ( @array ){
    my $line = `grep $_ "master_file.csv" | grep -i yes`;
    next unless($line);
    push( @new_array, $_ );
}
print Dumper(@new_array);
...
where master_file.csv looks like this:
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
So here I am getting the $line value 102,RT0AC1,CONNECTED,WORKING,YES, and the element RT0AC1 is stored in @new_array.
How can I avoid using backticks (`) and two greps to achieve this? I am trying to do this in pure Perl. Also, master_file.csv contains millions of records.
Since all the words you're looking for are in the same location, it's easy to split the current line on commas and check whether the second column exists in a hash table and whether the fifth column is equal to "YES":
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
my $filename = shift // "master_file.csv";    # Default filename if not given on command line
my @array = qw/RT0AC1 WG3RA3/;                # Words you're looking for
my %words = map { $_ => 1 } @array;           # Store them in a hash for fast lookup
my @new_array;

# Use Text::CSV_XS for non-trivial CSV files
open my $csv, "<", $filename;

while (<$csv>) {
    chomp;
    my @F = split /,/;
    push @new_array, $F[1] if exists $words{$F[1]} && $F[4] eq "YES";
}
print Dumper(\@new_array);
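As the comment in the code says, a real CSV parser is safer once fields can contain quotes or embedded commas. Here is a sketch of the same filter using Text::CSV_XS (the constructor options shown are common choices; consult the module's docs for your data):
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use Data::Dumper;

my $filename = shift // "master_file.csv";
my %words = map { $_ => 1 } qw/RT0AC1 WG3RA3/;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, "<", $filename;

my @new_array;
while (my $row = $csv->getline($fh)) {
    # Same test as above: second field known, fifth field YES.
    push @new_array, $row->[1]
        if exists $words{ $row->[1] } && ($row->[4] // '') eq "YES";
}
print Dumper(\@new_array);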
Form a regex to match the records of interest, split each line into fields, and compare field #5 to YES. If there is a match, increase a count for field #2 in the %match hash.
Once the file is processed, the %match hash will have each matched record's field #2 as a key, and the value will reflect how many times that field was matched with YES in the file.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);

while(<DATA>) {
    chomp;
    next unless /$re_filter/;
    my @data = split(',', $_);
    $match{$data[1]}++ if $data[4] eq 'YES';
}

say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
Output
$VAR1 = {
          'RT0AC1' => 1
        };
Remove the DATA section to get the final code, and give the filename on the command line to process a file with the data of interest:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);

while(<>) {
    chomp;
    next unless /$re_filter/;
    my @data = split(',', $_);
    $match{$data[1]}++ if $data[4] eq 'YES';
}

say Dumper(\%match);
An alternative version, based on a regular expression without using split:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);
my $regex     = qr/^\d+,($re_filter),[^,]+,[^,]+,YES$/;

/$regex/ && $match{$1}++ for <DATA>;

say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO

How to remove the last line of a file using Perl

How do I remove the last line of a file using Perl?
I have my data like below:
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
How do I remove the last line? I am summing all the numbers but receiving a null value at the end.
I tried using chomp but it did not work.
Here is the code currently being used:
while (<data>) {
    chomp(my @row = split ',', $_, -1);
    say sum @row[1 .. $#row];
}
Try this (shell one-liner):
perl -lne '!eof() and print' file
or as part of a script:
while (defined($_ = readline ARGV)) {
    # eof() with empty parentheses is true only at the end of the very
    # last input file, so every line except the final one gets printed.
    print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma-separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);

my $csv = Text::CSV->new({binary => 1})
    or die "Cannot use CSV: " . Text::CSV->error_diag;

# $fh is assumed to be a filehandle already opened on your input file.
while (my $row = $csv->getline($fh)) {
    next unless ($row->[0] || '') =~ m/\w/;    # Reject rows that don't start with an identifier.
    my $sum = sum(@$row[1 .. $#$row]);
    print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(@row) && length $row[0]; # Skip empty rows.
The point of this line is to detect when a row is empty: it has no elements, or its elements were empty after the chomp.
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input, when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
    # Chomp the input before splitting it.
    chomp;

    # Remove the -1 from your call to split().
    # This automatically removes any empty trailing fields.
    my @row = split /,/;

    # Skip lines that are empty:
    # 1/ Ensure there is data in @row.
    # 2/ Ensure at least one element in @row contains
    #    non-whitespace data.
    next unless @row and grep { /\S/ } @row;

    # Ensure that all of the data you pass to sum()
    # looks like numbers.
    say sum grep { looks_like_number $_ } @row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes file(s) from the command line and prints out the number of occurrences of a word across all files, and in which file each word occurs. The first part, finding the number of occurrences of a word, seems to work well.
However, I am struggling with the second part: finding in which file (i.e. which file name) the word occurs. I am thinking of using an array that stores the words, but I don't know if this is the best way, or what the best way is.
This is the code I have so far; it seems to work well for the part that counts the number of times a word occurs in the given file(s):
use strict;
use warnings;
my %count;
while (<>) {
    my $casefoldstr = lc $_;
    foreach my $str ($casefoldstr =~ /\w+/g) {
        $count{$str}++;
    }
}

foreach my $str (sort keys %count) {
    printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use it to build a nested hash with the word and filename as keys:
use strict;
use warnings;
use List::Util 'sum';

my %count;
while (<>) {
    for my $word ( map +lc, /\w+/g ) {
        $count{$word}{$ARGV}++;
    }
}

foreach my $word ( keys %count ) {
    my @files = keys %{ $count{$word} };    # All files containing $word
    print "Total word count for '$word': ", sum( @{ $count{$word} }{ @files } ), "\n";
    for my $file ( @files ) {
        print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
    }
}
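Assuming the script is saved as wordcount.pl (a hypothetical name), it would be run as:
perl wordcount.pl file1.txt file2.txt
and print the total for each word followed by one line per file in which that word was seen.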
Using an array seems reasonable if you don't visit any file more than once: then you can always just check the last value stored in the array. Otherwise, use a hash (a sketch of that variant follows the code below).
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;

while (<>) {
    my $casefoldstr = lc;
    for my $str ($casefoldstr =~ /\w+/g) {
        ++$count{$str};
        push @{ $in_file{$str} }, $ARGV
            unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
    }
}

foreach my $str (sort keys %count) {
    print "$str $count{$str}: @{ $in_file{$str} }\n";
}
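If a file may be visited more than once, a hash-of-hashes variant avoids duplicate filenames; here is a minimal sketch (%in_file simply becomes word => { filename => 1 }):
#!/usr/bin/perl
use warnings;
use strict;

my %count;
my %in_file;    # word => { filename => 1 }

while (<>) {
    my $casefoldstr = lc;
    for my $str ($casefoldstr =~ /\w+/g) {
        ++$count{$str};
        $in_file{$str}{$ARGV} = 1;    # hash keys stay unique even on revisits
    }
}

foreach my $str (sort keys %count) {
    print "$str $count{$str}: ", join(' ', sort keys %{ $in_file{$str} }), "\n";
}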

Perl: Manipulating while(<>) loop in file reading

My question is regarding the while loop that reads lines from files. I want to store values from the next line, or the entire next line, while the while(<FILEHANDLE>) loop is working on the present line ($_). What is the way to address this problem? Is there a specific function or module that does this?
If you want to process four lines at a time, and each set of lines is separated by #FCC, then you need to change Perl's input record separator.
In your script, put:
$/ = "\#FCC";
This means that when you read with <>, each record you get in $_ is now four lines of your file.
use warnings;
use strict;

local $/ = "\#FCC";

while (<>) {
    chomp;
    # Each time we iterate, $_ is now all four lines of each record.
}
Edit
You'll need to backslash the #
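Here is a runnable sketch of the record-separator approach (the records under __DATA__ are invented for illustration):
use strict;
use warnings;

local $/ = "\#FCC";    # one read now returns one whole record

while (my $record = <DATA>) {
    chomp $record;                  # chomp strips the trailing "#FCC"
    next unless $record =~ /\S/;    # skip the empty chunk after the last separator
    print "record:\n$record---\n";
}

__DATA__
line1
line2
#FCC
line3
line4
#FCC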
You can read from <> anywhere, not just in the head of the loop, e.g.:
while (my $line = <>) {
    chomp $line;
    # Note: $another_line will be undef if the input has an odd number of lines.
    my $another_line = <>;
    chomp $another_line;
    print "$line followed by $another_line\n";
}
Assuming your file is smallish (perhaps less than 1 GB), you could just stuff it into an array and walk it:
use warnings;
use strict;
my @lines;
while (<>) {
    chomp;
    push @lines, $_;
}

my $num_lines = @lines;    # An array in scalar context gives its length.

# Don't do the last line (there is no next one).
$num_lines -= 1;

foreach (my $i = 0; $i < $num_lines; $i++) {
    my $next_line = $i + 1;
    print "line $i plus $next_line:", $lines[$i], $lines[$i+1], "\n";
}
Note that the semantics of my solution are a bit different from the answer above: it prints everything except the first and last lines twice. If you wanted everything printed once, the above solution might make more sense.
If you want to read n lines at a time from a file, you can use Tie::File and an array slice to reference n elements at a time, like this:
use strict;
use warnings;
use Tie::File;
my $filename = 'path_to_your_file';
tie my @array, 'Tie::File', $filename or die 'Unable to open file';

my $index = 0;
my $size  = @array;

while (1) {
    last if ($index > $size);    # Be careful! Try to do a better check than this!
    print @array[$index .. $index+3];
    print "----\n";
    $index += 4;
}
(This is just an example; try to write better code.)
As the documentation says, the file is not loaded into memory all at once, so it will work even for large files.
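For the boundary check the comments warn about, a tighter loop might look like this (a sketch reusing the tied @array from above):
# Iterate in strides of four without running past the end of the array.
for (my $i = 0; $i <= $#array; $i += 4) {
    my $end = $i + 3 <= $#array ? $i + 3 : $#array;    # clamp the slice to the last index
    print @array[$i .. $end];
    print "----\n";
}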

How to split each entry of an array on each whitespace?

I have a file like this:
This is is my "test"
file with a lot
words separeted by whitespace.
Now I want to split this so that I create an array where each element contains one word, with all duplicate words deleted.
The desired array:
This
is
my
test
etc...
I read the file into an array, but I do not know how to split a whole array so that the result is a new array. And how can I remove the duplicate words?
#!/usr/bin/perl
package catalogs;

use Log::Log4perl;
Log::Log4perl->init("log4perl.properties");

open(FILE, "<Source.txt") || die "file Source.txt could not be opened";
my @fileContent = <FILE>;
close FILE;

my $log = Log::Log4perl->get_logger("catalogs");
@fileContent = split(" ");
To extract the words, you could use
my @words = $str =~ /\w+/g;
As for removing duplicates,
use List::MoreUtils qw( uniq );
my @uniq_words = uniq @words;
or
my %seen;
my @uniq_words = grep !$seen{$_}++, @words;
You're loading the text of the file into an array, but it may make more sense to load the file into a single string. This would enable you to take advantage of the solution @ikegami provided. To bring it all together, try the following.
use List::MoreUtils qw( uniq );
my $filecontent = do
{
    local $/ = undef;    # slurp mode: read the whole file at once
    <STDIN>;
};

my @words    = $filecontent =~ /\w+/g;
my @uniqword = uniq(@words);
To make the words unique, you can use the uniq subroutine or map them into a hash. Since the keys of a hash are always unique, duplicates will be overwritten.
use strict;
use warnings;
use Data::Dumper;
my @a = (1,1,1,2,3,4,4);
my %hash = map { $_ => 1 } @a;    # hash keys are unique, so duplicates collapse
my @new = keys %hash;
print Dumper(@new);