PowerShell / Perl: Merging multiple CSV files into one?

I have the following CSV files, which I want to merge into a single CSV:
01.csv
apples,48,12,7
pear,17,16,2
orange,22,6,1
02.csv
apples,51,8,6
grape,87,42,12
pear,22,3,7
03.csv
apples,11,12,13
grape,81,5,8
pear,11,5,6
04.csv
apples,14,12,8
orange,5,7,9
Desired output:
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,87,42,12,81,5,8,,,
pear,17,16,2,22,3,7,11,5,6,,,
orange,22,6,1,,,,,,5,7,9
Can anyone provide guidance on how to achieve this? Preferably using PowerShell, but I'm open to alternatives like Perl if that's easier.
Thanks Pantik, your code's output is close to what I want:
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,87,42,12,81,5,8
orange,22,6,1,5,7,9
pear,17,16,2,22,3,7,11,5,6
Unfortunately I need "placeholder" commas for when the entry is NOT present in a CSV file, e.g. orange,22,6,1,,,,,,5,7,9 rather than orange,22,6,1,5,7,9
UPDATE: I would like these parsed in order of the filenames, e.g.:
$myFiles = @(gci *.csv) | sort Name
foreach ($file in $myFiles){
regards
ted

Here is my Perl version:
use strict;
use warnings;

my $filenum = 0;
my ( %fruits, %data );
foreach my $file ( sort glob("*.csv") ) {
    $filenum++;
    open my $fh, "<", $file or die $!;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $fruit, @values ) = split /,/, $line;
        $fruits{$fruit} = 1;
        $data{$filenum}{$fruit} = \@values;
    }
    close $fh;
}
foreach my $fruit ( sort keys %fruits ) {
    # the ",," element stands in for the three missing values of a fruit absent from a file
    print $fruit, ",", join( ",", map { $data{$_}{$fruit} ? @{ $data{$_}{$fruit} } : ",," } 1 .. $filenum ), "\n";
}
Which gives me:
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,
So do you have a typo for grape, or have I misunderstood something?

OK, gangabass's solution works and is cooler than mine, but I'll add mine anyway. It is slightly stricter, and it preserves a data structure that can be used for other things as well. So, enjoy. ;)
use strict;
use warnings;

opendir my $dir, '.' or die $!;
my @csv = grep { /^\d+\.csv$/i } readdir $dir;
closedir $dir;
# sorting numerically based on leading digits in filename
@csv = sort { ($a =~ /^(\d+)/)[0] <=> ($b =~ /^(\d+)/)[0] } @csv;

my %data;
# To print empty records we first need to know all the names
for my $file (@csv) {
    open my $fh, '<', $file or die $!;
    while (<$fh>) {
        if (m/^([^,]+),/) {
            @{ $data{$1} } = ();
        }
    }
    close $fh;
}

# Now we can fill in values
for my $file (@csv) {
    open my $fh, '<', $file or die $!;
    my %tmp;
    while (<$fh>) {
        chomp;
        next if /^\s*$/;
        my ($tag, @values) = split /,/;
        $tmp{$tag} = \@values;
    }
    for my $key (keys %data) {
        unless (defined $tmp{$key}) {
            # Fill in empty values
            @{ $tmp{$key} } = ("", "", "");
        }
        push @{ $data{$key} }, @{ $tmp{$key} };
    }
}

myreport();

sub myreport {
    for my $key (sort keys %data) {
        print "$key," . (join ',', @{ $data{$key} }), "\n";
    }
}

Powershell:
$produce = "apples","grape","orange","pear"
$produce_hash = #{}
$produce | foreach-object {$produce_hash[$_] = #(,$_)}
$myFiles = #(gci *.csv) | sort Name
foreach ($file in $myFiles){
$file_hash = #{}
$produce | foreach-object {$file_hash[$_] = #($null,$null,$null)}
get-content $file | foreach-object{
$line = $_.split(",")
$file_hash[$line[0]] = $line[1..3]
}
$produce | foreach-object {
$produce_hash[$_] += $file_hash[$_]
}
}
$ofs = ","
$out = #()
$produce | foreach-object {
$out += [string]$produce_hash[$_]
}
$out | out-file "outputfile.csv"
gc outputfile.csv
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,
Should be easy to modify for additional items. Just add them to the $produce array.
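For example, if you also needed to track a hypothetical kiwi item, only the first line would change:
$produce = "apples","grape","orange","pear","kiwi"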

Second PowerShell solution (as requested):
$produce = @()
$produce_hash = @{}
$file_count = -1
$myFiles = @(gci 0*.csv) | sort Name
foreach ($file in $myFiles){
    $file_count ++
    $file_hash = @{}
    get-content $file | foreach-object{
        $line = $_.split(",")
        if ($produce -contains $line[0]){
            $file_hash[$line[0]] += $line[1..3]
        }
        else {
            $produce += $line[0]
            $file_hash[$line[0]] = @(,$line[0]) + (@($null) * 3 * $file_count) + $line[1..3]
        }
    }
    $produce | foreach-object {
        if ($file_hash[$_]){$produce_hash[$_] += $file_hash[$_]}
        else {$produce_hash[$_] += @(,$null) * 3}
    }
}
$ofs = ","
$out = @()
$produce_hash.keys | foreach-object {
    $out += [string]$produce_hash[$_]
}
$out | out-file "outputfile.csv"
gc outputfile.csv
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

You have to parse the files; I don't see an easier way to do it.
Solution in PowerShell:
UPDATE: OK, adjusted a bit - hopefully it's understandable.
$items = @{}
$colCount = 0 # total amount of columns
# loop through all files
foreach ($file in (gci *.csv | sort Name))
{
    $content = Get-Content $file
    $itemsToAdd = 0; # columns added by this file
    foreach ($line in $content)
    {
        if ($line -match "^(?<group>\w+),(?<value>.*)")
        {
            $group = $matches["group"]
            if (-not $items.ContainsKey($group))
            { # in case the row doesn't exist yet, add it and fill with empty columns
                $items.Add($group, @())
                for($i = 0; $i -lt $colCount; $i++) { $items[$group] += "" }
            }
            # add new values to the correct row
            $matches["value"].Split(",") | foreach { $items[$group] += $_ }
            $itemsToAdd = ($matches["value"].Split(",") | measure).Count # saves col count
        }
    }
    # in case that file didn't contain some row, add empty cols for those rows
    $colCount += $itemsToAdd
    $toAddEmpty = @()
    $items.Keys | ? { (($items[$_] | measure).Count -lt $colCount) } | foreach { $toAddEmpty += $_ }
    foreach ($key in $toAddEmpty)
    {
        for($i = 0; $i -lt $itemsToAdd; $i++) { $items[$key] += "" }
    }
}
# output
Remove-Item "output.csv" -ea 0
foreach ($key in $items.Keys)
{
    "$key,{0}" -f [string]::Join(",", $items[$key]) | Add-Content "output.csv"
}
Output:
apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

Here is a more concise way to do it. However, it still doesn't add the placeholder commas when an item is missing from a file.
Get-ChildItem D:\temp\a\ *.csv |
    Get-Content |
    ForEach-Object -begin { $result = @{} } -process {
        $name, $otherCols = $_ -split '(?<=\w+),'
        if (!$result[$name]) { $result[$name] = @() }
        $result[$name] += $otherCols
    } -end {
        $result.GetEnumerator() | % {
            "{0},{1}" -f $_.Key, ($_.Value -join ",")
        }
    } | Sort
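One possible way to get the placeholder commas with this grouping approach (a sketch only, not tested beyond the sample data above, and assuming every data row carries exactly three value columns) is to keep a running column count and pad every group after each file has been read:
# Sketch: pad each group to the running column width after every file,
# assuming three value columns per row as in the sample files.
$result = @{}
$width = 0
foreach ($file in (Get-ChildItem *.csv | Sort-Object Name)) {
    foreach ($line in (Get-Content $file)) {
        $name, $rest = $line -split ',', 2
        if (-not $result.ContainsKey($name)) {
            # back-fill empty columns for the files already processed
            $result[$name] = @()
            for ($i = 0; $i -lt $width; $i++) { $result[$name] += "" }
        }
        $result[$name] += $rest -split ','
    }
    $width += 3
    # any name missing from this file gets three empty columns appended
    foreach ($name in @($result.Keys)) {
        while ($result[$name].Count -lt $width) { $result[$name] += "" }
    }
}
$result.Keys | Sort-Object | ForEach-Object { "$_,$($result[$_] -join ',')" }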

Related

Blank result for String concatenation in foreach

This might not be the best way, but what is wrong with this code? Why do I get a blank result from $str in the foreach loop, whereas if I try to concatenate individual cells I get the right result?
$csv = import-csv -Path "C:\Users\abc\csv4script\TEST.csv" -Header 'IPs'
$str = ''
$x = 1
foreach ($cell in $csv.){
    if ($x -le 4000){
        $str = $str + ", " + $cell.IPs
        if ($x -eq 4000){
            $x = 1}
        $str = ''
    }
    $x = $x + 1
}
$str
# $str = $str + $csv[1].IPs + ", " + $csv[2].IPs
$str
It doesn't have a value because you misplaced a '}' character.
$x = 1}
Move it to after $str = ''
$x = 1
$str = ''}
Also, as Thomas pointed out, remove the period after the $csv variable in your foreach statement; it should be: foreach ($cell in $csv)
Also, if you want to increment a value,
$x++ is quicker and easier to read than $x = $x + 1
Lastly, as you have it, if your list has more than 4000 IPs, the $str variable will be emptied out.
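Putting those fixes together, a corrected sketch of the loop might look like this (keeping the path and the 4000-item chunking from the question; presumably you would do something with each completed chunk before resetting $str):
$csv = Import-Csv -Path "C:\Users\abc\csv4script\TEST.csv" -Header 'IPs'
$str = ''
$x = 1
foreach ($cell in $csv) {              # no stray period after $csv
    if ($x -le 4000) {
        $str = $str + ", " + $cell.IPs
        if ($x -eq 4000) {
            # a completed chunk of 4000 IPs is in $str at this point -
            # write it out or process it before resetting
            $x = 1
            $str = ''
        }
    }
    $x++                               # instead of $x = $x + 1
}
$str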

Reading 2000 files and building hashes in perl

I have code that parses 2000 CSV files and builds hashes from them.
The code runs fast until it has read ~100 files; after that it slows to a snail's pace.
Memory consumed is ~1.8 GB (uncompressed).
The goal is to build the global hash %_hist from the CSV files.
File sizes range from 20 KB to 30 MB.
The OS is Mac with 12 GB of RAM, running 64-bit Perl 5.18.
I have declared every variable in the functions with "my", expecting it to be released after the function exits.
The only persistent global variable is %_hist
Is there a way to improve performance?
foreach my $file (@files){
    iLog("Checking $file");
    $| = 1; #flush io
    return error("File $file doesn't exist") if not -e $file;
    my @records = readCSVFile($file); #reads csv file to 2d array and returns the array
    my @formatted_recs;
    foreach $rec ( @records ){
        my ($time,$c,$user_dst,$client,$ip_src,$first_seen,$last_seen,$first_seen_time,$last_seen_time,$device_ip,$country,$org,$user_agent) = @$rec;
        my @newrec = ($time,$c,$client,$first_seen,$last_seen,$ip_src,$user_agent,$device_ip,$country,$org);
        next if $time =~ /time/i; #Ignore first record
        push(@formatted_recs, \@newrec);
    }
    baselineHistRecords(@formatted_recs);
}

sub readCSVFile{
    my $file = shift;
    my @data;
    open my $fh, '<', $file or return error("Could not open $file: $!");
    my $line = <$fh>; #Read header line
    my $sep_char = ',';
    $sep_char = ';' if $line =~ /;"/;
    $sep_char = '|' if $line =~ /\|/;
    my $csv = Text::CSV->new({ sep_char => "$sep_char" });
    push (@data, split(/$sep_char/, $line) );
    while( my $row = $csv->getline( $fh ) ) {
        push @data, $row;
    }
    close $fh;
    return @data;
}
sub baselineHistRecords{
    my @recs = @_;
    undef $_ for ($time,$c,$client,$first_seen,$last_seen,$ip_src,$user_agent,$device_ip,$country,$org) ;
    undef $_ for (%device_count, %ua_count, %location_count, %org_count );
    my ($time,$c,$client,$first_seen,$last_seen,$ip_src,$user_agent,$device_ip,$country,$org) ;
    my %loc = {}; my %loc2rec = {};
    my %device_count = {}; my %ua_count = {}; my %location_count = {}; my %sorg_count = {};
    my $hits = 0;
    my @suspicious_hits = ();
    foreach $rec (@recs){
        my $devtag = ''; my $os = '';
        my @row = @{$rec};
        ($time,$c,$client,$first_seen,$last_seen,$ip_src,$ua,$device_ip,$country,$org) = @row;
        veryverbose("\n$time,$c,$client,$first_seen,$last_seen,$ip_src,$user_agent,$device_ip,$country,$org");
        next if not is_ipv4($ip_src);
        ###### 1. Enrich IP
        my $org = getOrgForIP($ip_src);
        my ($country_code,$region,$city) = getGeoForIP($ip_src);
        my $isp = getISPForIP($ip_src);
        my $loc = join(" > ",($country_code, $region));
        my $city = join(" > ",($country_code, $region, $city));
        my $cidr = $ip_src; $cidr =~ s/\d+\.\d+$/0\.0\/16/; #Removing last octet
        # my $packetmail = getPacketmailRep($ip_src);
        # push (@suspicious_hits, "$time $c $client $ip_src $ua / $packetmail") if $packetmail !~ /NOTFOUND/;
        ##### 2. SANITIZE
        $ua = cannonize($ua);
        $devtag = $& if $ua =~ /\([^\)]+\)/;
        @tokens = split(/;/, $devtag);
        $os = $tokens[0];
        $os =~ s/\+/ /g; $os =~ s/\(//g; $os =~ s/\)//g;
        $os = 'Android' if $os !~ /Android/i and $devtag =~ /Android/i;
        $os = "Windows NT" if $os =~ /compatible/i or $os =~ /Windows NT/i;
        $_hist{$client}{"isp"}{$isp}{c} += 1;
        $_hist{$client}{"os"}{$os}{c} += 1;
        $_hist{$client}{"ua"}{$ua}{c} += 1 if not is_empty ($ua);
        $_hist{$client}{"ua"}{c} += 1 if not is_empty ($ua); #An exception marked since all logs doesn't have UA values
        $_hist{$client}{"loc"}{$loc}{c} += 1;
        $_hist{$client}{"org"}{$org}{c} += 1;
        $_hist{$client}{"cidr"}{$cidr}{c} += 1;
        $_hist{$client}{"city"}{$city}{c} += 1;
        $_hist{$client}{"c"} += 1;
        $hits = $hits + 1;
        print "." if $hits%100==0;
        debug( "\n$ip_src : $os $loc $isp $org $ua: ".$_hist{$client}{"os"}{$os}{c} );
    }
    print "\nHITS: $hits";
    return if ($hits==0); #return if empty
    printf("\n######(( BASELINE for $client (".$_hist{$client}{c} ." records) ))#######################\n");
    foreach my $item (qw/os org isp loc ua cidr/){
        debug( sprintf ("\n\n--(( %s: %s ))-------------------------------- ",$client,uc($item)) );
        ## COMPUTE Usage Percent
        my @item_values = sort { $_hist{$client}{$item}{$b}{c} <=> $_hist{$client}{$item}{$a}{c} } keys %{ $_hist{$client}{$item} };
        my @cvalues = ();
        foreach my $key ( @item_values ){
            my $count = $_hist{$client}{$item}{$key}{c};
            my $total = $_hist{$client}{c};
            $total = $_hist{$client}{"ua"}{c} if $item =~ /^ua|os$/i and $_hist{$client}{"ua"}{c}; #Over for User_agent and OS determination as all logs doesn't have them
            my $pc = ceil(( $count / $total ) * 100) ;
            debug ("Ignoring empty value") if is_empty($key); # Ignoring Empty values
            next if is_empty($key);
            $_hist{$client}{$item}{$key}{p} = $pc ;
            push (@cvalues, $pc);
            #printf("\n%3d \% : %s",$pc,$key) if $pc>0;
        }
        ## COMPUTE Cluster Centers
        my @clustercenters = getClusterCenters(3, @cvalues);
        my ($low, $medium, $high) = @clustercenters;
        $_hist{$client}{$item}{low} = $low;
        $_hist{$client}{$item}{medium} = $medium;
        $_hist{$client}{$item}{high} = $high;
        my %tags = ( $low    => "rare",
                     $medium => "normal",
                     $high   => "most common",
                   );
        debug ("\n(Cluster Centers) : $low \t$medium \t $high\n");
        foreach my $key ( @item_values ){
            next if is_empty($key);
            my $pc = $_hist{$client}{$item}{$key}{p};
            $_hist{$client}{$item}{$key}{tag} = $tags{ closest($pc, @clustercenters) };
            debug( sprintf("\n%3d \% : %s : %s",$pc, $_hist{$client}{$item}{$key}{tag} , $key) );
        }
    }
    printf("\n\n###################################\n");
    saveHistBaselines();
}
Thanks,
Uma
This is more of a question for Code Review.
There's a ton of completely unnecessary copying in the code. For example: why do you copy data from @$rec to @newrec, and then from $rec to @row? Why do you return a plain list of lines from readCSVFile instead of a reference?
You don't really need to read the entire file into memory and then process it - you can process the data line by line and throw each line away immediately after you are done with it.
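As a rough illustration of both points (a sketch only, not a drop-in replacement - baselineOneRecord() is a hypothetical per-row version of baselineHistRecords(), and the separator auto-detection from the original is omitted for brevity), the file can be handed to Text::CSV row by row so nothing but the global %_hist has to stay in memory:
use Text::CSV;

# Sketch: stream the file instead of building @records / @formatted_recs,
# and update the histogram one row at a time.
sub processCSVFile {
    my ($file) = @_;
    open my $fh, '<', $file or return error("Could not open $file: $!");
    my $header = <$fh>;                          # skip the header line
    my $csv = Text::CSV->new({ binary => 1, sep_char => ',' });
    while ( my $row = $csv->getline($fh) ) {
        baselineOneRecord($row);                 # hypothetical per-row handler
    }
    close $fh;
    return 1;
}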

Printing the search path taken to find item during BFS

I am trying to solve the doublets puzzle problem using Perl. This is one of my first times using Perl so please excuse the messy code.
I have everything working, I believe, but am having an issue printing the shortest path. Using a queue and BFS I am able to find the target word but not the actual path taken.
Does anyone have any suggestions? I have been told to keep track of the parents of each element but it is not working.
#!/usr/bin/perl
use strict;

my $file = 'test';
#my $file = 'wordlist';
open(my $fh, $file);

my $len = length($ARGV[0]);
my $source = $ARGV[0];
my $target = $ARGV[1];
my @words;

# Creates new array of correct length words
while (my $row = <$fh>) {
    chomp $row;
    my $rowlen = length($row);
    if ($rowlen == $len) {
        push @words, $row;
    }
}

my %wordHash;
# Creates graph for word variations using dictionary
foreach my $word (@words) {
    my $wordArr = [];
    for (my $i = 0; $i < $len; $i++) {
        my $begin = substr($word, 0, $i);
        my $end = substr($word, $i+1, $len);
        my $key = "$begin" . "_" . "$end";
        my $Arr = [];
        my $regex = "$begin" . "[a-z]" . "$end";
        foreach my $wordTest (@words) {
            if ("$wordTest" =~ $regex && "$wordTest" ne "$word") {
                push $wordArr, "$wordTest";
            }
        }
    }
    $wordHash{"$word"} = $wordArr;
}

my @queue;
push(@queue, "$source");
my $next = $source;
my %visited;
my %parents;
my @path;

# Finds path using BFS and Queue
while ("$next" ne "$target") {
    print "$next: ";
    foreach my $variation (@{$wordHash{$next}}) {
        push(@queue, "$variation");
        $parents{"$variation"} = $next;
        print "$variation | ";
    }
    print "\n-----------------\n";
    $visited{"$next"} = 1;
    push(@path, "$next");
    $next = shift(@queue);
    while ($visited{"$next"} == 1) {
        $next = shift(@queue);
    }
}

print "FOUND: $next\n\n";
print "Path the BFS took: ";
print "@path\n\n";
print "Value -> Parent: \n";
for my $key (keys %parents) {
    print "$key -> $parents{$key}\n";
}
Before you accept a word from the @queue to be $next, you test to ensure that it has not been %visited. By then, though, the damage has been done. The test ensures a visited word won't become the focus again and hence prevents loops, but the earlier code has already updated %parents whether the word had been %visited or not.
If a word has been %visited, you not only want to avoid it becoming the $next candidate, you want to avoid it being considered as a $variation at all, as that will screw up %parents. I don't have a word dictionary to test with and you haven't given an example of the failure, but I think you can fix this up by shifting the %visited guard into the inner loop where variations are considered:
foreach my $variation (@{$wordHash{$next}}) {
    next if $visited{ $variation } ;
    push(@queue, "$variation");
    ... etc ...
This will protect the integrity of your %parents hash as well as stop loops. On a small note, you don't need to use double quotes when indexing into a hash; as I've done above, just state the scalar variable - using quotes just interpolates the value of the variable, which produces the same result.
Your code, IMHO, is excellent for a beginner, BTW.
Update
I've since got a word dictionary, and the problem above does exist, as well as one other. The code does move one letter at a time from the source, but in a near-random direction - not necessarily closer to the target. To correct that, I changed the regex you use to build your graph so that the corresponding letter from the target replaces the generic [a-z]. There are also a couple of minor changes - mostly style related. The updated code looks like this:
use v5.12;

my $file = 'wordlist.txt';
#my $file = 'wordlist';
open(my $fh, $file);

my $len = length($ARGV[0]);
my $source = $ARGV[0];
my $target = $ARGV[1];
chomp $target;
my @target = split('', $target);
my @words;

# Creates new array of correct length words
while (my $row = <$fh>) {
    $row =~ s/[\r\n]+$//;
    my $rowlen = length($row);
    if ($rowlen == $len) {
        push @words, $row;
    }
}

my %wordHash;
# Creates graph for word variations using dictionary
foreach my $word (@words) {
    my $wordArr = [];
    for (my $i = 0; $i < $len; $i++) {
        my $begin = substr($word, 0, $i);
        my $end = substr($word, $i+1, $len);
        my $key = "$begin" . "_" . "$end";
        my $Arr = [];
        # my $re_str = "$begin[a-z]$end";
        my $regex = $begin . $target[$i] . $end ;
        foreach my $wordTest (@words) {
            if ($wordTest =~ / ^ $regex $ /x ) {
                next if $wordTest eq $word ;
                push $wordArr, "$wordTest";
            }
        }
    }
    $wordHash{"$word"} = $wordArr;
}

my @queue;
push(@queue, "$source");
my $next = $source;
my %visited;
my %parents;
my @path;

# Finds path using BFS and Queue
while ($next ne $target) {
    print "$next: ";
    $visited{$next} = 1;
    foreach my $variation (@{$wordHash{$next}}) {
        next if $visited{ $variation } ;
        push(@queue, $variation);
        $parents{$variation} = $next;
        print "$variation | ";
    }
    print "\n-----------------\n";
    push(@path, $next);
    while ( $visited{$next} ) {
        $next = shift @queue ;
    }
}
push @path, $target ;

print "FOUND: $next\n\n";
print "Path the BFS took: @path\n\n";
print "Value -> Parent: \n";
for my $key (keys %parents) {
    print "$key -> $parents{$key}\n";
}
and when run, produces:
./words.pl head tail | more
head: heal |
-----------------
heal: teal | heil |
-----------------
teal:
-----------------
heil: hail |
-----------------
hail: tail |
-----------------
FOUND: tail
Path the BFS took: head heal teal heil hail tail
Value -> Parent:
hail -> heil
heil -> heal
teal -> heal
tail -> hail
heal -> head
You could probably remove the printing of the %parents hash - as hash keys come out in random order, it doesn't tell you much.
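Since %parents maps each word back to the word it was reached from, the actual word chain from source to target can be printed by walking %parents backwards once the target has been found - a small sketch using the variables from the code above, placed after the main while loop:
# Walk %parents back from the target to the source, then print the chain.
my @chain = ($target);
my $word = $target;
while ( $word ne $source and defined $parents{$word} ) {
    $word = $parents{$word};
    unshift @chain, $word;
}
print "Shortest chain found: @chain\n";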

Powershell output file missing lines from input file

Why would the output file produced by the following script not contain all of the lines that were in the original file? I'm doing replacement logic line by line, and I'm not explicitly removing any lines.
[regex]$r = "\$\%\$##(.+)\$\%\$##";
(Get-Content $inputFile) |
    Foreach-Object {
        $line = $_;
        $find = $r.matches($line);
        if ($find[0].Success) {
            foreach ($match in $find) {
                $found = $match.value
                $replace = $found -replace "[^A-Za-z0-9\s]", "";
                $line.Replace($found, $replace);
            }
        }
    } |
    Set-Content $outputFile
You're only outputting content to the pipe if it finds a match, at this line:
$line.Replace($found, $replace)
If there was not a match found, then you need to output the line without doing any replacement:
[regex]$r = "\$\%\$##(.+)\$\%\$##";
(Get-Content $inputFile) |
    Foreach-Object {
        $line = $_;
        $find = $r.matches($line);
        if ($find[0].Success) {
            foreach ($match in $find) {
                $found = $match.value
                $replace = $found -replace "[^A-Za-z0-9\s]", "";
                $line.Replace($found, $replace);
            }
        }
        Else { $line }
    } |
    Set-Content $outputFile

the analysis of my $file = ${$chainro->{$ro}->{$id}}[$i];

I have a two-level hash %chainro; each key of $chainro{$ro}{$id} points to an array. The following code iterates through the first level of the hash, $chainro->{$ro}. I can guess what
my $file = ${$chainro->{$ro}->{$id}}[$i]; aims to do. However, I do not know why ${$chainro->{$ro}->{$id}} was written this way. Specifically, why do we need to add ${ } to wrap $chainro->{$ro}->{$id}?
foreach my $id (keys %{$chainro->{$ro}})
{
    $size = $#{$chainro->{$ro}->{$id}};
    for ($i=0; $i<$size; $i++)
    {
        my $file = ${$chainro->{$ro}->{$id}}[$i];
    }
}
${ EXPR1 }[ EXPR2 ]
is an alternate way of writing
EXPR1->[ EXPR2 ]
so
${ $chainro->{$ro}->{$id} }[$i]
can be written as
$chainro->{$ro}->{$id}->[$i]
or even as
$chainro->{$ro}{$id}[$i]
Cleaned up:
for my $id (keys %{ $chainro->{$ro} }) {
    my $files = $chainro->{$ro}{$id};
    for my $i (0..$#$files) {
        my $file = $files->[$i];
        ...
    }
}
Or if you don't need $i:
for my $id (keys %{ $chainro->{$ro} }) {
    my $files = $chainro->{$ro}{$id};
    for my $file (@$files) {
        ...
    }
}
It is to dereference a reference to something.
The something is an array here.
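A tiny self-contained illustration of the same equivalence, using made-up data:
use strict;
use warnings;

my $chainro = { ro1 => { id1 => [ 'a.txt', 'b.txt' ] } };   # made-up structure

# These three lines all access the same array element:
print ${ $chainro->{ro1}->{id1} }[0], "\n";    # a.txt
print $chainro->{ro1}->{id1}->[0], "\n";       # a.txt
print $chainro->{ro1}{id1}[0], "\n";           # a.txt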