I have two large CSV files which I need to compare on a daily base:
file1:
"Record number";"€ price"
"000001";"€ 19,95"
"000002";"€ 20,50"
file2:
record number;price;date
000001;18,95;01-01-2014
000002;21,50;02-02-2014
file1 contains line feeds after every record
file2 contains line feed and carriage return after every record.
I'm looking for a way in Perl to compare file1 with file2 based on record number, and print differences on the price column in a seperate file. (record number, price and date).
Hope anyone can give some advice on how to approach this?
Thanks in advance!
I am sure there are quicker and more efficient ways to do this, but below should do the job you describe. I have compared the two files by creating a hash therefore I have assumed that the record numbers are unique. I have commented the code to explain each step
#!/usr/bin/perl
use strict;
use warnings;
#declare hashes and arrays
my %file1_price;
my %file2_price;
my %file2_date;
my #file1_records;
#open the two files with the data
open (FILE1,'</path/to/file1.txt');
open (FILE2,'</path/to/file2.txt');
#open a third file to write the data out to
open (FILE3,'>/write/to/new_file.txt');
#get rid of the headers in both files
my #file1 = (<FILE1>);
my $file1_header = shift(#file1);
my #file2 = (<FILE2>);
my $file2_header = shift(#file2);
#start a loop in file 1 and change the data format to compare to file 2
foreach my $file1_data (#file1){
chomp $file1_data;
my ($file1_record, $file1_price) = split(/;/,$file1_data);
# get rid of the quotes in both variables
$file1_record =~s/"//g;
$file1_price =~s/"//g;
#get rid of the € and substitute the comma for a decimal point
$file1_price =~s/€//g;
$file1_price =~s/,/./g;
#create a hash for the price for later comparison
$file1_price{$file1_record} = $file1_price;
push #file1_records, $file1_record;
}
#start a loop in file 2 similar to the above loop ready to compare to file 1
foreach my $file2_data (#file2){
chomp $file2_data;
my ($file2_record, $file2_price, $file2_date) = split(/;/,$file2_data);
#substitute the comma for a decimal point
$file2_price =~s/,/./g;
#create hashes of the data for final comparison below
$file2_price{$file2_record} = $file2_price;
$file2_date{$file2_record} = $file2_date;
}
#final loop to compare both files' information
foreach my $record (#file1_records){
#find the difference between the two prices
my $difference = $file1_price{$record} - $file2_price{$record};
#print out to file
print FILE3 "$record, $file1_price{$record}, $file2_price{$record}, $difference,$file2_date{$record}\n";
}
Related
I have the following command in my perl script:
my #files = `find $basedir/ -type f -iname '$sampleid*.summary.csv'`; #there are multiple summary.csv files in my basedir. I store them in an array
my $summary = `tail -n 1 $files[0]`; #Each summary.csv contains a header line and a line with data. I fetch here the last line.
chomp($summary);
my #sp = split(/,/,$summary); # I split based on ','
my $gender = $sp[11]; # the values from column 11 are stored in $gender
my $qc = $sp[2]; # the values from column 2 are stored in $gender
Now, I'm experiencing the situation where my *summary.csv files don't have the same number of columns. They do all have 2 lines, where the first line represents the header.
What I want now is not storing the values from column 11 in gender, but I want to store the values from the column 'Gender' in $gender.
How can I achieve this?
First try at solution:
my %hash = ();
my $header = `head -n 1 $files[0]`; #reading the header
chomp ($header);
my #colnames = split (/,/,$header);
my $keyfield = $colnames[#here should be the column with the name 'Gender']
push #{ $hash{$keyfield} };
my $gender = $sp[$keyfield]
You will have to read the header line as well as the data to know what column holds which information. This is done easiest by writing actual Perl code instead of shelling out to various command line utilities. See further below for that solution.
Fixing your solution also requires a hash. You need to read the header line first, store the header fields in an array (as you've already done), and then read the data line. The data needs to be a hash, not an array. A hash is a map of keys and values.
# read the header and create a list of header fields
my $header = `head -n 1 $files[0]`;
chomp ($header);
my #colnames = split (/,/,$header);
# read the data line
my $summary = `tail -n 1 $files[0]`;
chomp($summary);
my %sp; # use a hash for the data, not an array
# use a hash slice to fill in the columns
#sp{#colnames} = split(/,/,$summary);
my $gender = $sp{Gender};
The tricky part here is this line.
#sp{#colnames} = split(/,/,$summary);
We have declared %sp as a hash, but we now access it with a # sigil. That's because we are taking a hash slice, as indicated by the curly braces {}. The slice we take is all elements with the names of the values in #colnames. There is more than one value, so the return value is not a scalar (with a $) any more. There is a list of return values, so the sigil turns to #. Now we use that list on the left hand side (that's called an LVALUE), and assign the result of the split to that list.
Doing it with modern Perl
The following program will use File::Find::Rule to replace your find command, and Text::CSV to read the CSV file. It grabs all the files, then opens one at a time. The header line will be read first, and fed into the Text::CSV object, so that it can then give back a hash reference, which you can use to access every field by name.
I've written it in a way that it will only read one line for each file, as you said there are only two lines per file. You can easily extend that to be a loop.
use strict;
use warnings;
use File::Find::Rule;
use Text::CSV;
my $sampleid;
my $basedir;
my $csv = Text::CSV->new(
{
binary => 1,
sep => ',',
}
) or die "Cannot use CSV: " . Text::CSV->error_diag;
my #files = File::Find::Rule->file()->name("$sampleid*.summary.csv")->in($basedir);
foreach my $file (#files) {
open my $fh, '<', $file or die "Can't open $file: $!";
# get the headers
my #cols = #{ $csv->getline($fh) };
$csv->column_names(#cols);
# read the first line
my $row = $csv->getline_hr($fh);
# do whatever you you want with the row
print "$file: ", $row->{gender};
}
Please note that I have not tested this program.
I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line has comments that users have put in, and sometimes they use commas within their comments. If there are commas input, that field will be contained in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available in the server I am using.
Here is part of my code in parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in #fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -
my #data;
# Read the lines one by one.
while ( $line = <$FH> ) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my #fields = split(/,/, $line);
unshift #fields, join '_', #fields[0..2];
# remove user input commas in fields[7,19]
$_ = for fields[7,19];
# format DATE and DATETIME fields for MySQL/sqlbatch60
$_ = join '-', (split /\//)[2,0,1] for #fields[14,20,23];
$_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for #fields[38,39];
# write the parsed record back to the file
push #data, \#fields;
}
If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way
Suppose the total number of fields is always N
Split the line on commas ,
Separate and store the first six fields
Separate and store the last n fields, where n is N-8
Rejoin what remains with commas ,. This now forms field 8
and then do what ever you like to do with it. For example, write it to a proper CSV file
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case when the file need be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;
my $num_fields = 39; # what should be, using the example
my $ibad = 6; # index of the malformed field-to-be
my #last = (-($num_fields-$ibad-1)..-1); # index-range, rest of fields
my $file = "file.csv";
open my $fh, '<', $file;
while (my $line = <$fh>) { # chomp it if needed
my #fields = split ',', $line;
if (#fields != $num_fields) {
# join extra elements by space
my $fixed = join ' ', #fields[$ibad..$ibad+#fields-$num_fields];
# overwrite array by good fields
#fields = (#fields[0..$ibad-1], $fixed, #fields[#last]);
}
# Process #fields normally
print "#fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.
while (my $line = <$fh>) {
# Number of fields: commas + 1 (tr|,|| counts number of ",")
my $have_fields = $line =~ tr|,|| + 1;
if ($have_fields != $num_fields) {
# Get indices of commas delimiting the bad field
my ($beg, $end) = map {
my $p = '[^,]*,' x $_;
$line =~ /^$p/ and $+[0]-1;
} ($ibad, $ibad+$have_fields-$num_fields);
# Replace extra commas and overwrite that part of the string
my $bad_field = substr($line, $beg+1, $end-$beg-1);
(my $fixed = $bad_field) =~ tr/,/ /;
substr($line, $beg+1, $end-$beg-1) = $fixed;
}
# Perhaps write the line out, for a corrected .csv file
print $line;
}
In the last line the bad part of $line is overwritten by assigning to substr, what this function allows. The new substring $fixed is constructed with commas changed (or removed, if desired), and used to overwrite the bad part of the $line. See docs.
If quotes are known to be there a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
$line =~ s/."([^"]+)"/join ' ', split(',', $1)/eg; # "
# process the line. note that double quotes are removed
}
If the quotes are to be kept move them inside parenthesis, to be captured as well.
This one line is all that need be done after while (...) { to clean up data.
The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.
All code has been tested with multiple lines and multiple commas in the bad field.
I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
my #fields = split /,/;
if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
print "Row#$.,$_";
}
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to
1 at the start of each file. Without it it just continues from 1 upwards across all the files.
You would run this like
rob#Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
#ARGV = map { glob qq{"$_"} } #ARGV; # For Windows
while (<>) {
if ($. == 1) {
print "\n\nFile: $ARGV\n\n";
next;
}
my #fields = split /,/;
unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
printf "Row#%d,%s", $., $_;
}
close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones
I want to write a script that does the following: the user can choose to import a .txt file (for this I have written the code)(here $list1). This file consists of only one column with names on each line. If the user imported a file which is not empty, than I want to compare the names from a column from another file (here $file2) with the names in the imported file. I there is a match, then the whole line of this original file ($file2) should be placed in a new file ($filter1).
This is what I have so far:
my $list1;
if (prompt_yn("Do you want to import a genelist for filtering?")){
my $genelist1 = prompt("Give the name of the first genelist file:\n");
open($list1,'+<',$genelist1) or die "Could not open file $genelist1 $!";
}
open(my $filter1,'+>',"filter1.txt") || die "Can't write new file: $!";
my %hash1=();
while(<$list1>){ # $list1 is the variable from the imported .txt file
chomp;
next unless -z $_;
my $keyfield= $_; # this imported file contains only one column
$hash1{$keyfield}++;
}
seek $file2,0,0; #cursor resetting
while(<$file2>){ # this is the other file with multiple columns
my #line=split(/\t/); # split on tabs
my $keyfield=$line[2]; # values to compare are in column 3
if (exists($hash1{$keyfield})){
print $filter1 $_;
}
}
When running this script my output filter1.txt is empty. Which is not correct because there are definitely matches between the columns.
Because you have declared the $list1 filehandle as a lexical ( "my" ) variable inside a block, it is only visible in that block.
So the later lines in your script can't see $list1 and it gives the error message mentioned
To fix this, declare $list1 before the if.. block that opens the file
As the script stands, doesn't set keys or values in %hash1
Your spec is fuzzy, but what you might be intending is loading hash1 keys from file1
while(<$list1>){ # $list1 is the variable from the imported .txt file
chomp; # remove newlines
my $keyfield=$_; # this imported file contains only one column
$hash1{$keyfield}++; # this sets a key in %hash1
}
Then when going through file2
while(<$file2>){ # this is the other file with multiple columns
my #line=split(/\t/); # split on tabs
my $keyfield=$line[2]; # values to compare are in column "2"
if (exists($hash1{$keyfield}) ){ # do hash lookup for exact match
print $_; # output entire line
}
Incidentally, $line[2] is actually column 3, the first column is $line[0], the second $line[1] etc
If you actually want to do a partial or pattern match (like a grep) then using a hash isn't appropriate
Finally, you will have to amend the print $_; # output entire line to output to a file, if this is what you require. I removed the reference to $filter1 as this isn't declared in the script fragment shown
I've got a bunch of data in a CSV file, first row is all strings (all text and underscores), all subsequent rows are filled with numbers relating to said strings.
I'm trying to parse through the first line and find particular strings, remember which column that string was in, and then go through the rest of the file and get the data in the same column. I need to do this to three strings.
I've been using Text::CSV but I can't figure out how to get it to increment a counter until it finds the string in the first line and then go to the next line, get the data from that same column, etc. etc. Here's what I've tried so far:
while (<CSV>) {
if ($csv->parse($data)) {
my #field = $csv->fields;
my $count = 0;
for $column (#field) {
print ++$count, " => ", $column, "\n";
}
} else {
my $err = $csv->error_input;
print "Failed to parse line: $err";
}
}
Since $data is in line 1, it prints "1 $data" 25 times (# of lines in CSV file). How do I get it to remember which column it found $data in? Also, since I know all of the strings are in line 1, how do I get it to only parse through line 1, find all of the strings in #data, and then parse through the rest of the file, grabbing data from the necessary columns and putting it into a matrix or array of arrays?
Thanks for the help!
edit: I realized my questions were a bit poorly phrased. I don't know how to get the column number from CSV. How is this done?
Also, once I've got the column number, how do I tell it CSV to run through the subsequent lines and grab data from only that column?
Try something like this:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({binary=>1});
my $thing_to_match = "blah";
my $matched_index;
my #stored_data = ();
while(my $row= $csv->getline(*DATA)) #grabs lines below __DATA__
#(near the end of the script)
{
my #fields = #$row;
#If we haven't found the matched index, yet, search for it.
if(not defined $matched_index)
{
foreach my $i(0..$#fields)
{
$matched_index = $i if($fields[$i] eq $thing_to_match);
}
}
#NOTE: We're pushing a *reference* to an array!
#Look at perldoc perldata
push #stored_data,\#fields;
}
die "Column for '$thing_to_match' not found!" unless defined $matched_index;
foreach my $row(#stored_data)
{
print $row->[$matched_index] . "\n";
}
__DATA__
stuff,more stuff,yet more stuff
"yes, this thing, is one item",blah,blarg
1,2,3
The output is:
more stuff
blah
2
I don't have time to write up a full example, but I wrote a module that might help you do this. Tie::Array::CSV uses some magic to make your csv file act like a Perl array of arrayrefs. In this way you can use your knowledge of Perl to interact with the file.
A word of warning though! One benefit of my module is that it is read/write. Since you only want read, be careful not to assign to it!