My mind seems to be missing a few screws today. I have an issue that I'm baffled by, but to be fair, I'm new to Perl scripting.
I am opening a csv file and need to look for duplicate values in one column, and where there are duplicates in this column, I need to add all values from another column for each duplicate together and print it on a new line in a new file.
open(my $feed, '<', $rawFile) or die "Could not locate '$rawFile'\n";
open(OUTPUT, '>', $newFile) or die "Could not locate '$newFile'\n";
while(my $line = <$feed>) {
chomp $line;
my @columns = split /,/, $line;
$Address= $columns[1];
$forSale= $columns[3];
}
I understand how to open the file and read it line by line, and I know how to print results to a new file. What I'm having trouble with is building the logic to say, "For each Address in this extract that has duplicates, add up all of their forSale values and print the Address in the new file with the summed forSale value." I hope this makes sense. Any assistance at all is appreciated.
The tool you need for this job is a hash.
This will allow you to 'key' things by Address:
my %sum_of;
while(my $line = <$feed>) {
chomp $line;
my @columns = split /,/, $line;
my $Address = $columns[1];
my $forSale = $columns[3];
$sum_of{$Address} += $forSale;
}
foreach my $address ( sort keys %sum_of ) {
print "$address => $sum_of{$address}\n";
}
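The question also asks for only the addresses that occur more than once. A minimal sketch of that variation follows, assuming the same column positions as in the question (address in column 1, forSale in column 3); the sub name sum_duplicates and the sample rows are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum column 3 per address (column 1) and count how often each address occurs.
# Column positions follow the question; the sub name is hypothetical.
sub sum_duplicates {
    my @lines = @_;
    my ( %sum_of, %count_of );
    for my $line (@lines) {
        chomp $line;
        my @columns = split /,/, $line;
        my ( $address, $for_sale ) = @columns[ 1, 3 ];
        $sum_of{$address} += $for_sale;
        $count_of{$address}++;
    }

    # keep only the addresses that appeared more than once
    my @out;
    for my $address ( sort keys %sum_of ) {
        next unless $count_of{$address} > 1;
        push @out, "$address,$sum_of{$address}";
    }
    return @out;
}

# made-up sample rows standing in for the CSV file
my @rows = (
    "1,12 Oak St,x,2\n",
    "2,9 Elm Rd,x,5\n",
    "3,12 Oak St,x,1\n",
);
print "$_\n" for sum_duplicates(@rows);
```

To write the result to the new file instead of the terminal, print to the output filehandle opened in the question.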
Hello Chris Simmons,
I would like to add a few minor modifications to the perfect answer that Sobrique provided.
You can open a file the way you did, but you can also pass multiple files on the command line, e.g. test.pl sample1.csv sample2.csv; you can read about it under eof.
I would also check whether each line contains a comma (,) and otherwise print a warning on the terminal that the line cannot be parsed.
Next, after splitting the values into the array, I would trim leading and trailing white space from each string.
Having said all that, see the solution below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (<>) {
chomp;
if (index($_, ',') != -1) {
my @fields = split(/,/);
# remove leading and trailing white space
s{^\s+|\s+$}{}g foreach @fields;
$hash{$fields[0]} += $fields[3];
}
else {
warn "Line could not be parsed: $_\n";
}
} continue {
close ARGV if eof;
}
print Dumper \%hash;
__END__
$ perl test.pl sample.csv
$VAR1 = {
'123 6th St.' => 3,
'71 Pilgrim Avenue' => 5
};
__DATA__
123 6th St., Melbourne, FL 32904, 2
71 Pilgrim Avenue, Chevy Chase, MD 20815, 5
123 6th St., Melbourne, CT 06074, 1
Since you did not provide a sample of the input data, I created my own.
Another possible way is to use the module Text::CSV, as ikegami proposed. Sample code with the same checks that I mentioned earlier is below:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
my $csv = Text::CSV->new({ sep_char => ',' });
my %hash;
while (<>) {
chomp;
if ($csv->parse($_)) {
my @fields = $csv->fields();
# remove leading and trailing white space
s{^\s+|\s+$}{}g foreach @fields;
$hash{$fields[0]} += $fields[3];
} else {
warn "Line could not be parsed: $_\n";
}
} continue {
close ARGV if eof;
}
print Dumper \%hash;
__END__
$ perl test.pl sample.csv
$VAR1 = {
'123 6th St.' => 3,
'71 Pilgrim Avenue' => 5
};
__DATA__
123 6th St., Melbourne, FL 32904, 2
71 Pilgrim Avenue, Chevy Chase, MD 20815, 5
123 6th St., Melbourne, CT 06074, 1
Hope this helps.
BR / Thanos
Related
I am trying to build a table from the content of an input file, but it constantly gives me an error
Use of uninitialized value $ac[3] in concatenation (.) or string at table.pl
line 58 (#1)
and
Use of uninitialized value $or[2] in concatenation (.) or string at table.pl
line 61 (#1)
and although I have tried almost every possible change, it still gives me an error and does not print properly.
This is what my input file looks like:
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
EXPDTA SOLID-STATE NMR
REMARK 2 RESOLUTION. NOT APPLICABLE.
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
FORMUL 2 CU CU 2+
FORMUL 2 ZN ZN 2+
END
This is a script I am using:
#!/usr/bin/env perl
use strict;
use warnings;
use diagnostics;
#my $testfile=shift;
open(INPUT, "$ARGV[0]") or die 'Cannot make it';
my @file=<INPUT>;
close INPUT;
my @ac=();
my @dr=();
my @os=();
my @or=();
my @fo=();
for (my $line=0;$line<=$#file;$line++)
{
chomp($file[$line]);
if ($file[$line] =~ /^HEADER/)
{
print( (split '\s+', $file[$line])[-1]);
print "\t";
while ($file[$line] !~ /^END /)
{
$line++;
if ($file[$line]=~/^EXPDTA/)
{
$file[$line]=~s/^EXPDTA//;
@os=(@os,split '\s+', $file[$line]);
}
if ($file[$line] =~ /^REMARK 2 RESOLUTION./)
{
$file[$line]=~s/^REMARK 2 RESOLUTION.//;
@ac = (@ac,split'\s+',$file[$line]);
}
if ($file[$line] =~ /^HETNAM/)
{
$file[$line]=~s/^HETNAM//;
$file[$line] =~ s/\s+//;
push @dr, $file[$line];
}
if ($file[$line] =~ /^SOURCE 2 ORGANISM_SCIENTIFIC/)
{
$file[$line]=~s/^SOURCE 2 ORGANISM_SCIENTIFIC//;
@or = (@or,split'\s+',$file[$line]);
}
if ($file[$line] =~ /^FORMUL/)
{
$file[$line]=~s/^FORMUL//;
$file[$line] =~ s/\s+//;
push @fo, $file[$line];
}
}
print "$os[1] $os[2]\t";
print "\t";
@os=();
print "$ac[3] $ac[4]\t" or die "Cannot be printed"; #line 58
print "\t";
@ac=();
print "$or[2] $or[3]\t" or die "Cannot be printed"; #line 61
print "\t";
@or=();
foreach (@dr)
{
print "$_";
print "\t\t\t\t\t";
}
@dr=();
print "\n";
}
}
And this is the output it gives me, but it doesn't seem to print well and I am really not sure why:
2LU5 SOLID-STATE NMR CU COPPER (II) ION
Desired output that I am expecting is :
HEADER EXPDTA REMARK HETNAM FORMUL
OXIDOREDUCTASE 2LU5 SOLID-STATE NMR RESOLUTION. NOT APPLICABLE. COPPER (II) ION (here better to say last column because certain diversity exists before "copper") CU 2+
ZN ZINC ZN 2+
The root of your error is that:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @ac = ();
my $str = "REMARK 2 RESOLUTION. NOT APPLICABLE. ";
$str =~ s/^REMARK 2 RESOLUTION.//;
@ac = ( @ac, split '\s+', $str );
print Dumper \@ac;
The contents of @ac is:
$VAR1 = [
'',
'NOT',
'APPLICABLE.'
];
There is no $ac[3], you only have elements 0,1,2 in there.
With your @or error, you don't have any lines matching: /^SOURCE 2 ORGANISM_SCIENTIFIC/
So that array is empty, and that too, means you've got no $or[2] to print.
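The empty first element comes from the leading whitespace left behind by the substitution. As a small sketch (the sample string is a shortened stand-in for the REMARK line), compare a regex pattern with the special ' ' pattern of split, which skips leading whitespace before splitting:

```perl
use strict;
use warnings;

# what is left of the REMARK line after the prefix has been removed
my $str = " NOT APPLICABLE. ";

# a regex pattern keeps the empty field in front of the leading whitespace ...
my @with_regex = split /\s+/, $str;

# ... while the special pattern ' ' skips leading whitespace first
my @with_space = split ' ', $str;

print scalar(@with_regex), " fields: [", join( '|', @with_regex ), "]\n";
print scalar(@with_space), " fields: [", join( '|', @with_space ), "]\n";
```

Either way, it is worth checking how many elements the array holds before indexing into it.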
More generally - what you're doing here is actually really quite clunky, and there's a much cleaner solution.
How about:
#!/usr/bin/env perl
use strict;
use warnings;
#set the text "END" as our record separator
local $/ = 'END';
#define the fields to print out.
my @field_order = qw ( HEADER DATE STRUCT EXPDTA REMARK HETNAM FORMUL );
print join ( ",", @field_order), "\n"; #print header row
#iterate STDIN or file named on command line.
#just like you're doing with open (FILE, $ARGV[0])
while ( <> ) {
#select key value pairs into a hash - first word on the line is the 'key'
#and the value is 'anything else'.
my %this_entry = m/^(\w+)\s+(.*)$/gm;
next unless $this_entry{'HEADER'}; #check we have a header.
s/\s+/ /g for values %this_entry; #strip repeated spaces from fields;
s/\s+$//g for values %this_entry; #strip trailing whitespace.
#split 'header' row into separate subfields
#this is an example of how you could transform other fields.
($this_entry{'HEADER'}, $this_entry{'DATE'}, $this_entry{'STRUCT'} ) = split ' ', $this_entry{'HEADER'};
print join (",", @this_entry{@field_order} ), "\n";
}
This will - given your input - print:
HEADER,DATE,STRUCT,EXPDTA,REMARK,HETNAM,FORMUL
OXIDOREDUCTASE,08-JUN-12,2LU5,SOLID-STATE NMR,2 RESOLUTION. NOT APPLICABLE.,CU COPPER (II) ION,2 CU CU 2+
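That isn't quite the output you asked for, but hopefully it illustrates how much simpler this task could be. Note that because the fields are stored in a hash, a key that appears more than once in a record (HETNAM, FORMUL) keeps only one of its values.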
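Since a hash holds one value per key, record types that repeat (such as HETNAM and FORMUL in the sample input) need a list per key if every line is to be kept. A rough sketch of that idea, using a hash of arrays; the sub name parse_record is hypothetical and the sample is a shortened copy of the input shown in the question:

```perl
use strict;
use warnings;

# Collect every value for a record type instead of letting later lines
# overwrite earlier ones. Returns a reference to a hash of arrays.
sub parse_record {
    my ($text) = @_;
    my %entry;
    for my $line ( split /\n/, $text ) {
        # first word is the key, the rest (minus trailing spaces) is the value;
        # lines without a value, such as END, are skipped
        my ( $key, $value ) = $line =~ /^(\w+)\s+(.*?)\s*$/ or next;
        push @{ $entry{$key} }, $value;
    }
    return \%entry;
}

my $record = <<'END_RECORD';
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
END
END_RECORD

my $entry = parse_record($record);
for my $key ( sort keys %$entry ) {
    print $key, ': ', join( '; ', @{ $entry->{$key} } ), "\n";
}
```

Each key then maps to all of its lines, so both HETNAM rows can be printed in whatever layout the table needs.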
I want to join the 1st to 16th words, the 17th to 32nd, and so on, from an array into one space-separated line each, but I do not know why the code does not work. I hope to get help here. Thanks.
my @file = <FILE>;
for ( $i=0; $i<=$#file; $i+=16 ){
my $string = join ( " ", @file[$i..$i+15] );
print FILE1 "$string\n";
}
Below is part of my file.
1
2
3
...
What I want to print is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21....
I wouldn't do it the way you've done it.
Instead I would:
open ( my $input, '<', "your_file_name" ) or die $!;
chomp ( my @file = <$input> );
print join (" ",splice (@file, 0, 16)),"\n" while @file;
Note - I've used a lexical file handle with a 3 argument open, because that's better style.
splice removes the first 16 elements from @file each iteration, and continues until @file is empty.
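The same splice idea can be wrapped in a small helper so it can be tried without a file. This is only a sketch; the sub name join_in_chunks is made up for illustration.

```perl
use strict;
use warnings;

# Split a list into rows of at most $size items, each row joined by spaces.
sub join_in_chunks {
    my ( $size, @items ) = @_;
    my @rows;

    # splice removes up to $size items from the front on every pass,
    # so the loop ends once @items is empty
    push @rows, join( ' ', splice( @items, 0, $size ) ) while @items;
    return @rows;
}

print "$_\n" for join_in_chunks( 16, 1 .. 40 );
```

When fewer than $size items remain, splice simply returns what is left, so the last row is shorter rather than padded.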
Your lines have newlines attached to them. Remove them with chomp. Then loop over the array, remove 16 items and print them.
my @file = <FILE>;
chomp @file;
while (@file) {
my @temp;
INNER: for ( 0 .. 15 ) {
push @temp, shift @file || last INNER; # not or
}
print join( q{ }, @temp ), "\n";
}
This is the long implementation of the splice solution Sobrique suggested in the comments. It does the same thing, just far more verbosely.
This is the old answer before the edit:
If you only want the first 16, this is far more efficient.
my $string = join q{ }, map { my $line = <FILE>; chomp $line; $line } 1 .. 16;
This reads 16 lines and chomps each of them, then joins them.
You might also want to use a lexical file handle such as $fh instead of the bareword handle FILE.
open my $fh, '<', $path_to_file or die $!;
If you are reading from a file, don't store the whole file in an array. Instead, loop through it line by line and check the line number with the special variable $.
use warnings;
use strict;
open my $fh,"<","input.txt" or die $!;
my $line = "";
while (<$fh>)
{
chomp;
$line.= $_." ";
print "$line\n" and $line="" if($. % 16 == 0);
}
# print whatever is left when the line count is not a multiple of 16
print "$line\n" if length $line;
Or this will also work:
use warnings;
use strict;
open my $wh,"<","input.txt";
my $line;
foreach (;;)
{
my $data = join " ",(map { my $m=<$wh> || ""; chomp($m); $m} (0..15));
last if ($data =~m/^\s+$/);
print $data,"\n";
}
Assuming that you have FILE and FILE1 descriptors open, try:
$.%16?s!\s+! !:1 and print FILE1 $_ while <FILE>;
while ($line = <IN>){
...
print OUT "$line";
print OUT1 "$line";
}
As far as I know my while loop only reads from my input file one line at a time. How can I adjust this so that it reads 2 lines at a time?
Suppose a 2-line chunk looks like this
%line1
THISISLINE2
I want my while loop to copy the first line and paste it after the second line (but replace % with #). I also want to add a line of 11 characters of A as line 4. Essentially I want the output to be
%line1
THISISLINE2
#line1
AAAAAAAAAAA
How can I write a while loop to do this?
I am going to make a guess that you've got multi-line records like this:
%line1
something something line1
%lineB
something to do with lineb
I would suggest in this scenario - rather than reading two lines at a time, you instead set your record separator via $/.
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "%";
while (<DATA>) {
chomp;
my @lines = split "\n";
next unless @lines;
print '%', join( "\n", @lines ), "\n";
print $lines[0] =~ s/^/\#/r, "\n";
print "Something else to do with record $.\n";
print "---END---\n";
}
__DATA__
%line1
something something line1
%lineB
something to do with lineb
This means that each iteration of the while loop - it reads until the next % symbol. As a result, the first iteration is empty, but subsequent records will work fine.
This prints:
%line1
something something line1
#line1
Something else to do with record 2
---END---
%lineB
something to do with lineb
#lineB
Something else to do with record 3
---END---
Here is one option for a loop that gets two lines at once:
my $l1;
my $l2;
while (defined($l1=<DATA>) and defined($l2=<DATA>))
{
print "line 1: $l1\n";
print "line 2: $l2\n";
}
__DATA__
line1
line2
line3
line4
line5
This does not require reading the whole file into an array first.
It also ignores a single line at the end of the file. Simply switching to or would not change that: or short-circuits, so the second read would be skipped whenever the first one succeeds.
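One way to handle a trailing unpaired line explicitly is to read the second line inside the loop body. The sketch below also applies the transformation described in the question (copy line 1 with % replaced by #, then add a line of 11 A characters). The sub name expand_pair and the in-memory sample input are assumptions for illustration.

```perl
use strict;
use warnings;

# Turn one two-line chunk into the four output lines described in the question.
sub expand_pair {
    my ( $first, $second ) = @_;
    ( my $copy = $first ) =~ s/^%/#/;    # copy of line 1 with a leading % replaced by #
    return ( $first, $second, $copy, 'A' x 11 );
}

# in-memory sample input, used here instead of a real file
my $input = "%line1\nTHISISLINE2\n%line3\nTHISISLINE4\n";
open my $in, '<', \$input or die "Cannot open in-memory file: $!";

while ( defined( my $first = <$in> ) ) {
    my $second = <$in>;
    if ( !defined $second ) {
        warn "Odd number of lines; the last line has no partner\n";
        last;
    }
    chomp( $first, $second );
    print "$_\n" for expand_pair( $first, $second );
}
close $in;
```

Reading the second line inside the body keeps the loop condition simple and makes the odd-line case an explicit decision rather than a silent drop.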
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "test.txt") or die $!;
open (my $op, ">", "output.txt") or die $!;
my @slurp = <$fh>;
while(my @lines = splice(@slurp, 0, 2)){
my ($line1, $line2) = @lines;
print $op $line1;
print $op $line2;
if($line1 =~ s/%/#/){
print $op $line1;
if($line2 =~ tr/.*/A/c){
print $op $line2."\n";
}
}
}
Can you use a for loop instead of while? Remember that this approach requires the whole file to be read into memory, but unless you have very high performance requirements and a very big data file, it shouldn't be a problem.
open IN,"<",$file;
my @lines = <IN>;
for (my $i = 0;$i <= $#lines; $i = $i+2) {
my $first_line = $lines[$i];
my $second_line = $lines[$i+1];
}
Your question looks like a possible use case for a simple finite-state machine:
use strict;
use warnings;
my $state = 1;
my $first_line;
while (<>) {
if ($state == 1) {
$first_line = $_;
$state = 2;
} elsif ($state == 2) {
# do whatever you want with $_ and $first_line
$state = 1;
} else {
die "Unknown state '$state', not sure how we got here";
};
};
I am trying to use Parse::CSV to parse through a simple CSV file with a header and 2 columns. The second column may contain commas but I want to ignore them. Is there any way to limit how many times it splits on commas? Here is what I have so far
#!/usr/bin/perl
use Parse::CSV;
my $csv = Parse::CSV->new(file => 'file.csv');
while (my $row = $csv->fetch) {
print $row->[0] . "\t" . $row->[1] . "\n";
}
Here is an example of what my data looks like:
1234,text1,text2
5678,text3
90,text4,text5
This would return
1234 text1,text2
5678 text3
90 text4,text5
If you're really wed to Parse::CSV, you can do this using a filter:
use strict;
use warnings;
use 5.010;
use Parse::CSV;
my $parser = Parse::CSV->new(
file => 'input.csv',
filter => sub { return [ shift @$_, join(',', @$_) ] }
);
while ( my $row = $parser->fetch ) {
say join("\t", @$row);
}
die $parser->errstr if $parser->errstr;
Output:
1234 text1,text2
5678 text3
90 text4,text5
Note that performance will be poor because Parse::CSV is splitting the columns for you, but then you immediately join them back together again.
However, since it appears that you're not working with a true CSV (columns containing the delimiter aren't quoted or escaped in any way), why not just use split with a third argument to specify the maximum number of fields?
use strict;
use warnings;
use 5.010;
open my $fh, '<', 'input.csv' or die $!;
while (<$fh>) {
chomp;
my @fields = split(',', $_, 2);
say join("\t", @fields);
}
close $fh;
I'm a novice at using perl. What I want to do is compare two files. One is my index file that I am calling "temp." I am attempting to use this to search through a main file that I am calling "array." The index file has only numbers in it. There are lines in my array that have those numbers. I've been trying to find the intersection between those two files, but my code is not working. Here's what I've been trying to do.
#!/usr/bin/perl
print "Enter the input file:";
my $filename=<STDIN>;
open (FILE, "$filename") || die "Cannot open file: $!";
my @array=<FILE>;
close(FILE);
print "Enter the index file:";
my $temp=<STDIN>;
open (TEMP, "$temp") || die "Cannot open file: $!";
my @temp=<TEMP>;
close(TEMP);
my %seen= ();
foreach (@array) {
$seen{$_}=1;
}
my @intersection=grep($seen{$_}, @temp);
foreach (@intersection) {
print "$_\n";
}
If I can't use intersection, then what else can I do to move each line that has a match between the two files?
For those of you asking for the main file and the index file:
Main file:
1 CP TRT
...
14 C1 MPE
15 C2 MPE
...
20 CA1 MPE
Index file
20
24
22
17
18
...
I want to put those lines that contain one of the numbers in my index file into a new array. So using this example, only
20 CA1 MPE would be placed into a new array.
My main file and index file are both longer than what I've shown, but that hopefully gives you an idea on what I'm trying to do.
I am assuming something like this?
use strict;
use warnings;
use Data::Dumper;
# creating arrays instead of reading from file just for demo
# based on the assumption that your files are 1 number per line
# and no need for any particular parsing
my @array = qw/1 2 3 20 60 50 4 5 6 7/;
my @index = qw/10 12 5 3 2/;
my @intersection = ();
my %hash1 = map{$_ => 1} @array;
foreach (@index)
{
if (defined $hash1{$_})
{
push @intersection, $_;
}
}
print Dumper(\@intersection);
==== Out ====
$VAR1 = [
'5',
'3',
'2'
];
A few things:
Always have use strict; and use warnings; in your program. This will catch a lot of possible errors.
Always chomp after reading input. Lines read from a file or STDIN keep their trailing \n; chomp removes it.
Learn a more modern form of Perl.
Use mnemonic variable names. $temp doesn't cut it.
Use spaces to help make your code more readable.
You never stated the errors you were getting. I assume it has to do with the fact that the input from your main file doesn't match your index file.
I use a hash to create an index that the index file can use via my ($index) = split /\s+/, $line;:
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
print "Input file name: ";
my $input_file = <STDIN>;
chomp $input_file; # Chomp Input!
print "Index file name: ";
my $index_file = <STDIN>;
chomp $index_file; # Chomp Input!
open my $input_fh, "<", $input_file;
my %hash;
while ( my $line = <$input_fh> ) {
chomp $line;
#
# Using split to find the item to index on
#
my ($index) = split /\s+/, $line;
$hash{$index} = $line;
}
close $input_fh;
open my $index_fh, "<", $index_file;
while ( my $index = <$index_fh> ) {
chomp $index;
#
# Now index can look up lines
#
if( exists $hash{$index} ) {
say qq(Index: $index Line: "$hash{$index}");
}
else {
say qq(Index "$index" doesn't exist in file.);
}
}
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
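@ARGV = 'main_file';
open(my $fh_idx, '<', 'index_file');
chomp(my @idx = <$fh_idx>);
close($fh_idx);
while (defined(my $r = <>)) {
# \Q...\E quotes the index value; \b stops 2 from matching a line that starts with 20
print $r if grep { $r =~ /^[ \t]*\Q$_\E\b/ } @idx;
}
You may wish to replace those hardcoded file names with names read from <STDIN>.
FYI: The defined call inside the while condition is optional here, because Perl adds that test implicitly for while (my $r = <>).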
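For larger files, running a regex for every index value against every line gets slow. A sketch of a hash-based alternative follows, assuming (as in the sample data) that the number is the first whitespace-separated field of each main-file line. The sub name select_lines and the in-memory sample arrays are made up for illustration.

```perl
use strict;
use warnings;

# Return the lines of @$main whose first whitespace-separated field
# is one of the values listed in @$index.
sub select_lines {
    my ( $main, $index ) = @_;
    my %wanted = map { $_ => 1 } @$index;
    return grep {
        my ($number) = split ' ', $_;    # ' ' also skips leading whitespace
        defined $number && $wanted{$number};
    } @$main;
}

# made-up stand-ins for the chomped main file and index file
my @main  = ( ' 14 C1 MPE', ' 15 C2 MPE', ' 20 CA1 MPE' );
my @index = ( 20, 24, 22, 17, 18 );

my @matches = select_lines( \@main, \@index );
print "$_\n" for @matches;
```

Each main-file line then costs one hash lookup, and the comparison is an exact match on the number, so an index value of 2 cannot pick up line 20.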