Using Perl or PowerShell, how to compare 2 CSV files and get only the new rows?

I am comparing two large comma-delimited CSV files, File1.csv and File2.csv, using the
Text::Diff Perl module.
The Perl program is called from a .bat file, and I put the result in a third file, Diff.csv.
Perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Diff;
my $diffs = diff $ARGV[0] => $ARGV[1];
$diffs =~ s/^(?:[^\n]*+\n){2}//;
$diffs =~ s/^(?:[\# ][^\n]*+)?+\n//mg;
print $diffs;
This is how I call the Perl script:
perl "C:\diffBetweenTwoFiles.pl" "C:\File1.csv" "C:\File2.csv" > "C:\Diff.csv"
One of the columns in the CSV file is Name.
Currently the result lists all rows whose value in any column changed, but I want to list only the rows with new Names.
For example:
File1.csv
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","1 Research Rd"
File2.csv
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
Currently, the result lists these (it includes "Two" because its Address changed):
"Name","DOB","Address"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
But, I only want the result to list the new "Name" like this:
"Name","DOB","Address"
"Three","1/3/03","3 Bold Rd"
How can I do that in a Perl or PowerShell script?

Use Text::CSV in Perl
use warnings;
use strict;
use feature 'say';
use Text::CSV;

my ($file_old, $file_new, $file_diff) =
    map { $_ . '.csv' } qw(File1 File2 Diff);

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

my ($old, $header) = get_lines($csv, $file_old, 1);
my $new            = get_lines($csv, $file_new);

my @lines_with_new_names = @{ new_names($old, $new) };

open my $fh, '>', $file_diff or die "Can't open $file_diff: $!";
$csv->say($fh, $header);
$csv->say($fh, $_) for @lines_with_new_names;  # or print, with eol set

sub new_names {
    my ($old, $new) = @_;
    my %old = map { $_->[0] => 1 } @$old;
    return [ map { (!exists $old{$_->[0]}) ? $_ : () } @$new ];
}

sub get_lines {
    my ($csv, $file, $return_header) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $header = $csv->getline($fh);  # remove the header line
    return ($return_header)
        ? ( $csv->getline_all($fh), $header )
        : $csv->getline_all($fh);
}
This prints the correct difference with the provided samples.
Variable names tagged with old relate to the file with fewer lines, the other one being new. The "Name" column is taken to be the first one.
Comments
The getline_all method returns an arrayref for all lines, where each is an arrayref with all fields. This is done from a sub, with an option to return the header line as well.
The optional return of another variable here determines whether a single scalar or a list is returned. This can also be handled with the wantarray builtin
return wantarray ? ( LIST ) : scalar;
which returns true if the sub is called in list context. The caller then decides by invoking the sub in either list or scalar context, my ($v1, $v2) = f(...) or my $v = f(...), and no flag is needed in the call. I opted for the more explicit way.
The list of new names is produced in the new_names sub. First a lookup hash is built from all names in the "old" arrayref. Then the lines in the "new" arrayref are filtered, keeping those whose name is not in the "old" file (no such key in the hash), and returned in an arrayref [].
Such use of a hash is a standard technique for finding differences between arrays.
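As a self-contained sketch of that technique (the sample names below are made up for illustration):

```perl
use strict;
use warnings;

# Names from the "old" and "new" files (hypothetical sample data)
my @old_names = qw(One Two);
my @new_names = qw(One Two Three);

# Build a lookup hash from the old names: each name becomes a key,
# so membership can be tested in O(1) per element
my %seen = map { $_ => 1 } @old_names;

# Keep only the names that never appeared in the old list
my @added = grep { !exists $seen{$_} } @new_names;

print "@added\n";  # prints "Three"
```

The grep here plays the same role as the map-with-empty-list filter in the answer above; both keep an element only when its key is absent from the lookup hash.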
The documented method say used for printing doesn't work on my older version of the module with which this is tested. In that case use print and set eol in the constructor.

Since you're working with large files that are stressing your memory limit, you can try:
Read the first CSV file one line at a time, and use a hashtable to store the file's Name entries.
Read the second CSV file one line at a time and compare its Name entries against the first.
(UPDATED based on comments) A simple example in PowerShell:
$output = New-Object System.Text.StringBuilder;
$file1 = @{};
$header = $null;
# $filePaths is a two-element array with the full paths to the CSV files
for ($i = 0; $i -lt $filePaths.Length; ++$i) {
    $reader = New-Object System.IO.StreamReader($filePaths[$i]);
    while (($line = $reader.ReadLine()) -ne $null) {
        if ($line -match '\S') {
            if ($header -eq $null) {
                $header = $line;
                $output.AppendLine($line) | Out-Null;
            }
            $name = ($line -split ',')[0];
            switch ($i) {
                0 { $file1.Add($name, $null); }
                1 {
                    if (!$file1.ContainsKey($name)) {
                        $output.AppendLine($line) | Out-Null;
                    }
                }
            }
        }
    }
    $reader.Dispose();
}
$output.ToString() | Out-File -FilePath $outPath;

Related

Parsing a GenBank file

I am trying to parse a GenBank file so I can get the accession number, the definition, the sequence length, and the DNA sequence.
Is there a way to modify my code to make it shorter, declaring all the variables at once as they do in the book, and parsing the file in one or two blocks of code?
If you have access to BioPerl, you might use a solution such as the following.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -file   => "input.txt",
                          -format => 'GenBank' );

while ( my $seq = $in->next_seq ) {
    my $acc        = $seq->accession;
    my $length     = $seq->length;
    my $definition = $seq->desc;
    my $type       = $seq->molecule;
    my $organism   = $seq->species->binomial;

    if (   $type eq 'mRNA'
        && $organism =~ /homo sapiens/i
        && $acc =~ /[A-Za-z]{2}_[0-9]{6,}/ )
    {
        print "$acc | $definition | $length\n";
        print $seq->seq, "\n";
        print "\n";
    }
}
I was able to capture the five variables from a sample GenBank file I have (input.txt). This should simplify your code.

Excel::Writer::XLSX append text on a cell

I looked at the documentation but couldn't find it so far. I'm using Excel::Writer::XLSX to write text to Excel.
The question is: how can I append text to a cell? Suppose cell A1 already contains "abc"; how can I append, say, "def" with a delimiter, so that cell A1 ends up with "abc-def"?
Currently it overwrites the old data and only shows "def".
The file contains, say:
Hostname name abc
Random Lines
Hostname name def
Random Data
open my $FH, '<', $filenames or die $!;
while (<$FH>) {
    if ($_ =~ /^Hostname+\s+name+\s(.*)/) {
        my $hostname = $1;
        print "\nHostname : $1\n";
        $worksheet->write(0, 0, $hostname);
    }
}
Now if you look at the code: when the regex first matches, it writes "abc" to the cell. When the regex matches the next time, it deletes "abc" and writes "def". I want some way to append it instead.
Thanks in advance.
Excel::Writer::XLSX does not expose functionality to read back the temporary file while you create it. You need to accumulate the value outside the loop:
use strict;
use warnings;

my $hostname  = '';
my $delimiter = '-';

open my $FH, '<', $filenames or die $!;
while (<$FH>) {
    if ($_ =~ /^Hostname+\s+name+\s(.*)/) {
        $hostname .= $delimiter if $hostname;
        $hostname .= $1;
        print "\nHostname : $hostname\n";
    }
}
$worksheet->write(0, 0, $hostname);
If you want to hack the internal data structure to get at a temporary value, this is how the module stores it:
# Write a shared string or an in-line string based on optimisation level.
if ( $self->{_optimization} == 0 ) {
    $index = $self->_get_shared_string_index( $str );
}
else {
    $index = $str;
}
$self->{_table}->{$row}->{$col} = [ $type, $index, $xf ];
So to read the string back, without optimisation:
my $value = $self->{_table}->{$row}->{$col}->[1];

Comparing a CSV file with another file and finding the matches is not working

I am comparing a CSV file with another plain-text file. Both files contain a lot of similar words (fields), but nothing matches.
use strict;
use warnings;
use Text::CSV;
use File::Slurp qw(read_file);

my $file = "sample.csv";
open my $fh, "<", $file or die "$file: $!";

my $csv1 = Text::CSV->new({
    binary    => 1,  # Allow special characters. Always set this
    auto_diag => 1,  # Report irregularities immediately
});

my @lines = read_file("brand1.txt");
my $count = 0;
while (my $row = $csv1->getline($fh)) {
    $count = $count + 1;
    foreach my $line (@lines) {
        my $che = $row->[4];
        print $count;
        if ($line eq $che) {
            print $line . "\t" . $che;
        }
    }
}
This code gives me blank output in the terminal.
But comparing two plain files (no CSV involved) works with the same script.
The best thing one can do when trying to figure out why two things aren't equal is to print those two things.
print "<<<$line>>>\n>>>$che<<<\n";
This will show you visually what the differences are, which most of the time will make it obvious.
In your case the issue is that read_file doesn't chomp its input, so each line in @lines has a \n at the end. However, your parsed CSV fields do not.
If you do this:
chomp @lines;
It should work fine.

Perl: matching elements between two lists

I am trying to grab the list of files Jenkins has updated between the last build and the latest build, and store it in a Perl array.
I also have a list of files and folders in the source code that are considered sensitive to changes, like XXXX\yyy/., XXX/TTTT/FFF.txt, ... in FILE.TXT.
I want the script to tell me if any of these sensitive files were part of my changed files and, if yes, list their names, so that we can double-check the change with the development team before we trigger the build.
How should I achieve this, and how do I compare multiple files under one path with the changed files?
I have written the script below, which is called inside Jenkins with %workspace% as an argument.
It is not giving any matching result.
use warnings;
use Array::Utils qw(:all);

$url = `curl -s "http://localhost:8080/job/Rev-number/lastStableBuild/" | findstr "started by"`;
$url =~ /([0-9]+)/;
chdir $ARGV[1];    # note: system("cd ...") would only change directory in a subshell
@difffiles = `svn diff -r $1:HEAD --summarize`;
chomp @difffiles;
foreach $f (@difffiles) {
    $f = substr( $f, 8 );
    $f = "$f\n";
}
open FILE, '/path/to/file'
    or die "Can't open file: $!\n";
@array = <FILE>;
@isect = intersect( @difffiles, @array );
print @isect;
I have managed to solve this issue using the Perl script below:
sub Verifysensitivefileschanges()
{
    $count1 = 0;
    @isect = intersect(@difffiles, @sensitive);
    #print "ISECT=@isect\n";
    foreach $t (@isect)
    {
        if (@isect) { print "Matching element found -- $t\n"; $count1 = 1; }
    }
    return $count1;
}

sub Verifysensitivedirschanges()
{
    $count2 = 0;
    foreach $g (@difffiles)
    {
        $dirs     = dirname($g);
        $filename = basename($g);
        #print "$dirs\n";
        foreach $j (@array)
        {
            if ( "$j" =~ /\Q$dirs/ )
            {
                print "Sensitive path files changed under path $j and file name is $filename\n";
                $count2 = 1;
            }
        }
    }
    return $count2;
}

Random element order in XML document using XML::LibXML

I have a Perl script that reads a simple .csv file like below-
"header1","header2","header3","header4"
"12","12-JUL-2012","Active","Processed"
"13","11-JUL-2012","In Process","Pending"
"32","10-JUL-2012","Active","Processed"
"24","08-JUL-2012","Active","Processed"
.....
The aim is to convert this .csv to an .xml file something like below-
<ORDERS>
<LIST_G_ROWS>
<G_ROWS>
<header1>12</header1>
<header2>12-JUL-2012</header2>
<header3>Active</header3>
<header4>Processed</header4>
</G_ROWS>
<G_ROWS>
<header1>13</header1>
<header2>11-JUL-2012</header2>
<header3>In Process</header3>
<header4>Pending</header4>
</G_ROWS>
....
....
</LIST_G_ROWS>
</ORDERS>
I know that there is XML::CSV available on CPAN which would make my life easier, but I want to use the already installed XML::LibXML to create the XML instead of installing XML::CSV. I was able to read the CSV and create the XML file as above without any issues, but I am getting a random order of the elements in the XML, i.e. something like below. I need the order of the elements (child nodes) to be in sync with the .csv file as shown above, but I am not quite sure how to go about that. I am using a hash, and sort()ing the hash didn't solve the problem either.
<ORDERS>
<LIST_G_ROWS>
<G_ROWS>
<header3>Active</header3>
<header1>12</header1>
<header4>Processed</header4>
<header2>12-JUL-2012</header2>
</G_ROWS>
......
and so on. Below is the snippet from my perl code
use XML::LibXML;
use strict;

my $outcsv  = "/path/to/data.csv";
my $xmlFile = "/path/to/data.xml";
my $headers = 0;
my (@keys, @vals);

my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement("ORDERS");
my $list = $doc->createElement("LIST_G_ROWS");
$root->appendChild($list);

open(IN, "$outcsv") || die "can not open $outcsv: $!\n";
while (<IN>) {
    chomp($_);
    if ($headers == 0)
    {
        $_ =~ s/^\"//g;                  # remove starting (")
        $_ =~ s/\"$//g;                  # remove trailing (")
        @keys = split(/\",\"/, $_);      # split per ","
        s{^\s+|\s+$}{}g foreach @keys;   # remove leading and trailing spaces from each field
        $headers = 1;
    }
    else {
        $_ =~ s/^\"//g;                  # remove starting (")
        $_ =~ s/\"$//g;                  # remove trailing (")
        @vals = split(/\",\"/, $_);      # split per ","
        s{^\s+|\s+$}{}g foreach @vals;   # remove leading and trailing spaces from each field
        my %tags = map { $keys[$_] => $vals[$_] } (0 .. @keys - 1);
        my $row = $doc->createElement("G_ROWS");
        $list->appendChild($row);
        for my $name (keys %tags) {
            my $tag   = $doc->createElement($name);
            my $value = $tags{$name};
            $tag->appendTextNode($value);
            $row->appendChild($tag);
        }
    }
}
close(IN);

$doc->setDocumentElement($root);
open(OUT, ">$xmlFile") || die "can not open $xmlFile: $!\n";
print OUT $doc->toString();
close(OUT);
You could forget the %tags hash entirely. Instead, loop over the indices of @keys:
for my $i (0 .. @keys - 1) {
    my $key   = $keys[$i];
    my $value = $vals[$i];
    my $tag   = $doc->createElement($key);
    $tag->appendTextNode($value);
    $row->appendChild($tag);
}
That way, the ordering of your keys is preserved. When a hash is used, the ordering is indeterminate.
Your program is far more involved than it needs to be. For convenience and reliability you should use Text::CSV to parse your CSV file.
The program below does what you need.
use strict;
use warnings;

use Text::CSV;
use XML::LibXML;

open my $csv_fh, '<', '/path/to/data.csv' or die $!;
my $csv = Text::CSV->new;
my $headers = $csv->getline($csv_fh);

my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');

my $orders = $doc->createElement('ORDERS');
$doc->setDocumentElement($orders);
my $list = $orders->appendChild($doc->createElement('LIST_G_ROWS'));

while ( my $data = $csv->getline($csv_fh) ) {
    my $rows = $list->appendChild($doc->createElement('G_ROWS'));
    for my $i (0 .. $#$data) {
        $rows->appendTextChild($headers->[$i], $data->[$i]);
    }
}

print $doc->toFile('/path/to/data.xml', 1);
output
<?xml version="1.0" encoding="UTF-8"?>
<ORDERS>
  <LIST_G_ROWS>
    <G_ROWS>
      <header1>12</header1>
      <header2>12-JUL-2012</header2>
      <header3>Active</header3>
      <header4>Processed</header4>
    </G_ROWS>
    <G_ROWS>
      <header1>13</header1>
      <header2>11-JUL-2012</header2>
      <header3>In Process</header3>
      <header4>Pending</header4>
    </G_ROWS>
    <G_ROWS>
      <header1>32</header1>
      <header2>10-JUL-2012</header2>
      <header3>Active</header3>
      <header4>Processed</header4>
    </G_ROWS>
    <G_ROWS>
      <header1>24</header1>
      <header2>08-JUL-2012</header2>
      <header3>Active</header3>
      <header4>Processed</header4>
    </G_ROWS>
  </LIST_G_ROWS>
</ORDERS>
Update
Without the exotic options that Text::CSV provides, its functionality is fairly simple to reproduce when the options are fixed. This alternative provides a subroutine csv_getline to replace the Text::CSV method getline. It works mostly in the same way as the module.
The output of this program is identical to that above.
use strict;
use warnings;

use XML::LibXML;

open my $csv_fh, '<', '/path/to/data.csv' or die $!;

my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');

my $orders = $doc->createElement('ORDERS');
$doc->setDocumentElement($orders);
my $list = $orders->appendChild($doc->createElement('LIST_G_ROWS'));

my $headers = csv_getline($csv_fh);

while ( my $data = csv_getline($csv_fh) ) {
    my $rows = $list->appendChild($doc->createElement('G_ROWS'));
    for my $i (0 .. $#$data) {
        $rows->appendTextChild($headers->[$i], $data->[$i]);
    }
}

print $doc->toFile('/path/to/data.xml', 1);

sub csv_getline {
    my $fh = shift;
    defined( my $line = <$fh> ) or return;
    $line =~ s/\s*\z/,/;
    [ map { /"(.*)"/ ? $1 : $_ } $line =~ /( " [^"]* " | [^,]* ) , /gx ];
}
This seems like something XML::LibXML is overkill for. Just use XML::Simple: build a hash describing the XML structure, then dump it with XMLout to an XML file.
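A minimal sketch of that idea, assuming XML::Simple is installed. The data structure below is hand-built for illustration; in the real script it would be filled from the parsed CSV rows. One caveat: XMLout emits hash keys in sorted order, so the output only matches the CSV column order when the header names happen to sort that way (as header1..header4 do), which is why a hash-based module does not fully solve the asker's ordering problem.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLout);

# Hand-built stand-in for the parsed CSV rows
my $data = {
    LIST_G_ROWS => {
        G_ROWS => [
            { header1 => 12, header2 => '12-JUL-2012',
              header3 => 'Active',     header4 => 'Processed' },
            { header1 => 13, header2 => '11-JUL-2012',
              header3 => 'In Process', header4 => 'Pending' },
        ],
    },
};

# NoAttr forces plain child elements instead of XML attributes;
# each hashref in the G_ROWS arrayref becomes one <G_ROWS> element
print XMLout($data, RootName => 'ORDERS', NoAttr => 1, XMLDecl => 1);
```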