Compare two files and write matching data from the first file using Perl

First file
FirstName:LastName:Location:Country:ID
FirstName1:LastName1:Location1:Country1:ID1
FirstName2:LastName2:Location2:Country2:ID2
FirstName3:LastName3:Location3:Country3:ID3
FirstName4:LastName4:Location4:Country4:ID4
Second file
FirstName:LastName:Location:Country:Old_ID
FirstName2:LastName2:Location2:Country2:Old_ID2
FirstName4:LastName4:Location4:Country4:Old_ID4
I have to compare the first and second files and print the matching rows using the data from the first file, which has the new IDs.
The script below fetches the Old_IDs from the second file instead of the new ones from the first file:
use warnings;
use strict;
my $details = 'file2.txt';
my $old_details = 'file1.txt';
my %names;
open my $data, '<', $details or die $!;
while (<$data>)
{
my ($name, @ids) = split;
push @{ $names{$_} }, $name for @ids;
}
open my $old_data, '<', $old_details or die $!;
while (<$old_data>)
{
chomp;
print @{ $names{$_} // [$_] }, "\n";
}
Output:
FirstName:LastName:Location:Country:Old_ID
FirstName2:LastName2:Location2:Country2:Old_ID2
FirstName4:LastName4:Location4:Country4:Old_ID4
Expected output:
FirstName:LastName:Location:Country:ID
FirstName2:LastName2:Location2:Country2:ID2
FirstName4:LastName4:Location4:Country4:ID4

Just try this way:
use strict; # Use strict Pragma
use warnings;
my ($file1, $filecnt1, $file2, $filecnt2) = ('') x 4; # Declare the variables as empty strings
$file1 = "a1.txt"; $file2 = "b1.txt"; #Sample files
readFileinString($file1, \$filecnt1); # Reading first file
readFileinString($file2, \$filecnt2); # Reading second file
$filecnt2=~s/\:Old\_ID/\:ID/g; # Replacing that difference content
my @firstfle = split "\n", $filecnt1; # Move content to array variable to compare
my @secndfle = split "\n", $filecnt2;
my %firstfle = map { $_ => 1 } @firstfle; #Mapping the array into hash variable
my @scdcmp = grep { $firstfle{$_} } @secndfle;
print join "\n", @scdcmp;
#---------------> File reading
sub readFileinString
#--------------->
{
my $File = shift;
my $string = shift;
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
#---------------> File Writing
sub writeFileinString
#--------------->
{
my $File = shift;
my $string = shift;
my @cDir = split(/\\/, $File);
my $tmp = "";
for(my $i = 0; $i < $#cDir; $i++)
{
$tmp = $tmp . "$cDir[$i]\\";
mkdir "$tmp";
}
if(-f $File){
unlink($File);
}
open(FILE, ">$File") or die "\n\nFailed File Open for Writing: [$File]\n\nReason: $!\n";
print FILE $$string;
close(FILE);
}
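As an alternative to rewriting Old_ID textually, here is a minimal sketch that keys each row on every field except the last one and prints the first file's rows whose key also occurs in the second file. It assumes the ID is always the final colon-separated field; row_key and matching_rows are hypothetical helper names, and the rows are inlined here instead of being read from the two files.

```perl
use strict;
use warnings;

# Key a row by every field except the trailing ID (assumed to be last).
sub row_key {
    my ($line) = @_;
    my @fields = split /:/, $line, -1;
    pop @fields;                          # drop the ID column
    return join ':', @fields;
}

# Return the rows of @$new_rows whose key also appears in @$old_rows.
sub matching_rows {
    my ($new_rows, $old_rows) = @_;
    my %wanted;
    $wanted{ row_key($_) } = 1 for @$old_rows;
    return grep { $wanted{ row_key($_) } } @$new_rows;
}

# Sample rows from the question (first file has the new IDs).
my @file1 = (
    'FirstName:LastName:Location:Country:ID',
    'FirstName1:LastName1:Location1:Country1:ID1',
    'FirstName2:LastName2:Location2:Country2:ID2',
    'FirstName3:LastName3:Location3:Country3:ID3',
    'FirstName4:LastName4:Location4:Country4:ID4',
);
my @file2 = (
    'FirstName:LastName:Location:Country:Old_ID',
    'FirstName2:LastName2:Location2:Country2:Old_ID2',
    'FirstName4:LastName4:Location4:Country4:Old_ID4',
);

print "$_\n" for matching_rows(\@file1, \@file2);
```

Because the comparison ignores the ID column entirely, this does not depend on the old IDs following any particular naming pattern.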

Related

How to filter columns from CSV file based on names of columns

I am using CSV data like the sample below. I don't want to keep the user and timestamp columns from the CSV file, and I may add or delete columns later.
I didn't find any suitable method in Text::CSV.
Please let me know if a suitable method or module is available.
UniqueId, Name, description, user,timestamp
1,jana,testing,janardar,12-10-2018:00:
sub _filter_common_columns_from_csv{
my $csvfile = shift;
my $CSV = Text::CSV_XS->new(
{
binary => 1,
auto_diag => 3,
allow_quotes => 0,
eol => $/
});
my $_columns ||= do {
open(my $fh, '<', $csvfile) or die $!;
my @cols = @{ $CSV->getline($fh) };
close $fh or die $!;
for (@cols) { s/^\s+//; s/\s+$//; }
\@cols;
};
my @columns = @{ $_columns };
my %deleted;
my @regexes = qw(user timestamp);
foreach my $regex (@regexes) {
foreach my $i (0 .. ($#columns - 1)) {
my $col = $columns[$i];
$deleted{$i} = $col if $col =~ /$regex/;
}
}
my @wanted_columns = grep { !$deleted{$_} } 0 .. $#columns - 1;
my $input_temp = "$ENV{HOME}/output/temp_test.csv";
open my $tem, ">",$input_temp or die "$input_temp: $!";
open(my $fh, '<', $csvfile) or die $!;
while (my $row = $CSV->getline($fh)) {
my @fields = @$row;
$CSV->print($tem, [ @fields[@wanted_columns] ]) or $CSV->error_diag;
}
close $fh or die $!;
close $tem or die $!;
return $input_temp;
}
See getline_hr
use warnings;
use strict;
use feature 'say';
use List::MoreUtils qw(any);
use Text::CSV;
my $file = shift @ARGV || die "Usage: $0 filename\n";
my @exclude_cols = qw(user timestamp);
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, '<', $file or die "Can't open $file: $!";
my @cols = @{ $csv->getline($fh) };
my @wanted_cols = grep {
my $name = $_;
not any { $name eq $_ } @exclude_cols;
} @cols;
my $row = {};
$csv->bind_columns (\@{$row}{@cols});
while ($csv->getline($fh)) {
my @wanted_fields = @$row{ @wanted_cols };
say "@wanted_fields";
}
The syntax @$row{@wanted_cols} is for a hash slice, which returns a list of values for the keys in @wanted_cols from the hashref $row.
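The hash slice can be tried in isolation. The sketch below uses only core Perl; the $row hashref is a hypothetical stand-in for what the CSV parser would fill in, with values copied from the sample row in the question.

```perl
use strict;
use warnings;

# Hypothetical row, standing in for the hashref the CSV parser fills.
my $row = {
    UniqueId    => 1,
    Name        => 'jana',
    description => 'testing',
    user        => 'janardar',
    timestamp   => '12-10-2018:00:',
};
my @wanted_cols = qw(UniqueId Name description);

# Hash slice on a hash reference: one value per key, in key-list order.
my @wanted_fields = @$row{@wanted_cols};

# Two equivalent, more explicit spellings of the same slice.
my @same  = @{$row}{@wanted_cols};
my @arrow = map { $row->{$_} } @wanted_cols;

print join(',', @wanted_fields), "\n";
```

This prints 1,jana,testing; the user and timestamp values are simply never requested.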
Actual example using Text::AutoCSV to remove given named columns from arbitrary CSV files like in your posted code (More complicated than the examples in the documentation for only writing specific columns):
#!/usr/bin/perl
use warnings;
use strict;
use Text::AutoCSV qw/remove_accents/;
sub remove_columns {
my ($infile, $outfile, $drop) = @_;
my $csv = Text::AutoCSV->new(in_file => $infile, out_file => $outfile);
# Normalize column names the same way that Text::AutoCSV does
my %drops = map { my $h = remove_accents $_;
$h =~ s/[^[:alnum:]_]//gi;
$h = uc $h;
$h => 1 } @$drop;
my @cols = grep { not exists $drops{$_} } $csv->get_fields_names;
# Hack to avoid reading the file twice.
$csv->{out_fields} = \@cols;
$csv->write();
}
remove_columns "in.csv", "out.csv", [ "user", "timestamp" ];
If you want to modify your CSV in other ways, too, and if SQL would be convenient for those modifications, then consider using DBD::CSV.
You can then open a database handle on your CSV file, select the desired columns with a SELECT query, and write the results with Text::CSV or Text::CSV_XS.
For more details, see the DBD::CSV documentation or e.g. this simple wrapper script for querying CSV files.

perl + read multiple csv files + manipulate files + provide output_files

Apologies if this is a bit long-winded, but I would really appreciate an answer, as I am having difficulty getting this to work.
Building on this question here, I have a script that works on one CSV file (orig.csv) and produces the CSV file that I want (format.csv). I want to make it more generic so that it accepts any number of '.csv' files and produces an 'output_csv' for each input file. Can anyone help?
#!/usr/bin/perl
use strict;
use warnings;
open my $orig_fh, '<', 'orig.csv' or die $!;
open my $format_fh, '>', 'format.csv' or die $!;
print $format_fh scalar <$orig_fh>; # Copy header line
my %data;
my @labels;
while (<$orig_fh>) {
chomp;
my @fields = split /,/, $_, -1;
my ($label, $max_val) = @fields[1,12];
if ( exists $data{$label} ) {
my $prev_max_val = $data{$label}[12] || 0;
$data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
}
else {
$data{$label} = \@fields;
push @labels, $label;
}
}
for my $label (@labels) {
print $format_fh join(',', @{ $data{$label} }), "\n";
}
I was hoping to use this script from here, but I am having great difficulty putting the two together:
#!/usr/bin/perl
use strict;
use warnings;
#If you want to open a new output file for every input file
#Do it in your loop, not here.
#my $outfile = "KAC.pdb";
#open( my $fh, '>>', $outfile );
opendir( DIR, "/data/tmp" ) or die "$!";
my @files = readdir(DIR);
closedir DIR;
foreach my $file (@files) {
open( FH, "/data/tmp/$file" ) or die "$!";
my $outfile = "output_$file"; #Add a prefix (anything, doesn't have to say 'output')
open(my $fh, '>', $outfile);
while (<FH>) {
my ($line) = $_;
chomp($line);
if ( $line =~ m/KAC 50/ ) {
print $fh $_;
}
}
close($fh);
}
The script reads all the files in the directory, finds the lines containing the string 'KAC 50', and writes those lines to an output_$file for that input file, so there is one output_$file for every input file that is read.
Issues with this script that I have noted and was looking to fix:
- it reads the '.' and '..' entries in the directory and produces
'output_.' and 'output_..' files
- it also does the same with the script file itself.
I was also trying to make it dynamic, so that the script works in whatever directory it is run from, by adding this code:
use Cwd qw();
my $path = Cwd::cwd();
print "$path\n";
and
opendir( DIR, $path ) or die "$!"; # open the current directory
open( FH, "$path/$file" ) or die "$!"; #open the file
EDIT: I have tried combining the two versions but am getting errors. Advice greatly appreciated.
UserName@wabcl13 ~/Perl
$ perl formatfile_QforStackOverflow.pl
Parentheses missing around "my" list at formatfile_QforStackOverflow.pl line 13.
source dir -> /home/UserName/Perl
Can't use string ("/home/UserName/Perl/format_or"...) as a symbol ref while "strict refs" in use at formatfile_QforStackOverflow.pl line 28.
Combined code:
use strict;
use warnings;
use autodie; # this is used for the multiple files part...
#START::Getting current working directory
use Cwd qw();
my $source_dir = Cwd::cwd();
#END::Getting current working directory
print "source dir -> $source_dir\n";
my $output_prefix = 'format_';
opendir my $dh, $source_dir; #Changing this to work on current directory; changing back
for my $file (readdir($dh)) {
next if $file !~ /\.csv$/;
next if $file =~ /^\Q$output_prefix\E/;
my $orig_file = "$source_dir/$file";
my $format_file = "$source_dir/$output_prefix$file";
# .... old processing code here ...
## Start:: This part works on one file edited for this script ##
#open my $orig_fh, '<', 'orig.csv' or die $!; #line 14 and 15 above already do this!!
#open my $format_fh, '>', 'format.csv' or die $!;
#print $format_fh scalar <$orig_fh>; # Copy header line #orig needs changeing
print $format_file scalar <$orig_file>; # Copy header line
my %data;
my @labels;
#while (<$orig_fh>) { #orig needs changing
while (<$orig_file>) {
chomp;
my @fields = split /,/, $_, -1;
my ($label, $max_val) = @fields[1,12];
if ( exists $data{$label} ) {
my $prev_max_val = $data{$label}[12] || 0;
$data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
}
else {
$data{$label} = \@fields;
push @labels, $label;
}
}
for my $label (@labels) {
#print $format_fh join(',', @{ $data{$label} }), "\n"; #orig needs changing
print $format_file join(',', @{ $data{$label} }), "\n";
}
## END:: This part works on one file edited for this script ##
}
How do you plan to supply the list of files to process and their output destinations? One option is a fixed directory in which you process all the CSV files and write each result under a prefixed name.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $source_dir = '/some/dir/with/cvs/files';
my $output_prefix = 'format_';
opendir my $dh, $source_dir;
for my $file (readdir($dh)) {
next if $file !~ /\.csv$/;
next if $file =~ /^\Q$output_prefix\E/;
my $orig_file = "$source_dir/$file";
my $format_file = "$source_dir/$output_prefix$file";
.... old processing code here ...
}
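Alternatively, you could just have an output directory instead of prefixing the files. Note that $orig_file and $format_file hold file names, not filehandles, so the old processing code still has to open each of them and read from or print to the resulting handles; printing to the name itself is what triggers the "Can't use string ... as a symbol ref" error. Either way, this should get you on your way.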
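A minimal sketch of how the two pieces might fit together, assuming the per-file logic from the question is kept unchanged. The names format_csv and format_all_csv are hypothetical; the important part is that each file name is opened into a lexical filehandle before anything is read or printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open/close/opendir die with a message on failure

# Per label (field 1), keep the row with the largest value in field 12.
sub format_csv {
    my ($orig_file, $format_file) = @_;
    open my $orig_fh,   '<', $orig_file;
    open my $format_fh, '>', $format_file;

    my $header = <$orig_fh>;                   # copy header line
    print {$format_fh} $header if defined $header;

    my (%data, @labels);
    while (my $line = <$orig_fh>) {
        chomp $line;
        my @fields = split /,/, $line, -1;
        my ($label, $max_val) = @fields[1, 12];
        if (exists $data{$label}) {
            my $prev_max_val = $data{$label}[12] || 0;
            $data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
        }
        else {
            $data{$label} = \@fields;
            push @labels, $label;
        }
    }
    for my $label (@labels) {
        print {$format_fh} join(',', @{ $data{$label} }), "\n";
    }
    close $format_fh;
    close $orig_fh;
    return;
}

# Run format_csv on every *.csv in $source_dir, skipping earlier output files.
sub format_all_csv {
    my ($source_dir, $output_prefix) = @_;
    opendir my $dh, $source_dir;
    my @files = grep { /\.csv$/ && !/^\Q$output_prefix\E/ } readdir $dh;
    closedir $dh;
    for my $file (@files) {
        format_csv("$source_dir/$file", "$source_dir/$output_prefix$file");
    }
    return @files;
}

# Usage: perl script.pl /some/dir   (does nothing without an argument)
format_all_csv($ARGV[0], 'format_') if @ARGV;
```

Filtering on the .csv extension also takes care of the '.' and '..' entries and of the script file itself, since none of them end in .csv.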

Can I collect the output of find(\&wanted, @directories) in an array

I am writing a script that traverses a directory (including its subdirectories) and pushes each desired file onto an array so that I can work on each file.
Here is my code:
use strict;
use warnings;
use File::Find;
my $path = $ARGV[0];
find({ wanted => \&GetappropriateFile }, $path);
sub GetappropriateFile
{
my $file = $_;
my @all_file;
# print "$file\n";
if ( -f and /traces[_d+]/)
{
#print "$file\n";
open(my $fh, "<", $file) or die "cannot open file:$!\n";
while( my $line = <$fh>){
$line =~ /Cmd\sline:\s+com.android*/;
push(@all_file,$file);
#print "$file\n";
}
close($fh);
#print"@all_file\n";
}
}
Problem area: my $file = $_;
Instead of the scalar $file, if I could use an array here, I could easily read those files one by one and filter them.
What I am trying to do is this: I have to open each file and check for the string "Cmd line: com.android". As soon as I find this string in a file, I have to push that file onto an array and start reading the next file.
It would be better to avoid global vars.
use strict;
use warnings;
use File::Find qw( find );
sub IsAppropriateFile {
my ($file) = @_;
if (-f $file && $file =~ /traces[_d+]/) {
open(my $fh, "<", $file) or die "cannot open file:$!\n";
while ( my $line = <$fh> ) {
if ($line =~ /Cmd\sline:\s+com.android*/) {
return 1;
}
}
}
return 0;
}
{
my $path = $ARGV[0];
my @matching_files;
find({
wanted => sub {
push @matching_files, $_ if IsAppropriateFile($_);
},
}, $path);
print("$_\n") for @matching_files; # Or whatever.
}
Put the declaration of @all_file outside the function, and use it after find() finishes:
my @all_file;
sub GetappropriateFile
{
..
}
You could also stop reading the file after a successful match:
if ($line =~ /Cmd\sline:\s+com.android*/) {
push(@all_file, $file);
last;
}
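One detail worth keeping in mind: by default File::Find changes into each directory it visits, so inside the wanted callback $_ holds only the base name, while $File::Find::name holds the path including the starting directory. A minimal sketch that collects the latter, so the array can be used after find() returns, might look like this. The helper names are hypothetical, and the two regexes are assumptions about the intent (an underscore or digit after "traces", and a literal dot in "com.android").

```perl
use strict;
use warnings;
use File::Find qw(find);

# True if the file contains the "Cmd line: com.android" marker.
sub mentions_android {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;
    while (my $line = <$fh>) {
        return 1 if $line =~ /Cmd\sline:\s+com\.android/;   # stop at first hit
    }
    return 0;
}

# Collect the paths of matching trace files below $path.
sub matching_trace_files {
    my ($path) = @_;
    my @matching_files;
    find(sub {
        # $_ is the base name here; find() has chdir'ed into its directory.
        return unless -f $_ && /traces[_\d]/;                # assumed pattern
        push @matching_files, $File::Find::name if mentions_android($_);
    }, $path);
    return @matching_files;
}

# Usage: perl script.pl /path/to/logs
if (@ARGV) {
    print "$_\n" for matching_trace_files($ARGV[0]);
}
```

The array is a lexical inside matching_trace_files and is captured by the anonymous sub, so no global variable is needed.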

Reading and comparing lines in Perl

I am having trouble getting my Perl script to work. The issue might be related to reading the Extract file line by line within the while loop; any help would be appreciated. There are two files:
A Bad file that contains a list of bad IDs (100s of IDs)
2
3
An Extract file that contains delimited data with the ID in field 1 (millions of rows)
1|data|data|data
2|data|data|data
2|data|data|data
2|data|data|data
3|data|data|data
4|data|data|data
5|data|data|data
I am trying to remove all the rows from the large extract file where the IDs match. There can be multiple rows where the ID matches. The extract is sorted.
#use strict;
#use warnning;
$SourceFile = $ARGV[0];
$ToRemove = $ARGV[1];
$FieldNum = $ARGV[2];
$NewFile = $ARGV[3];
$LargeRecords = $ARGV[4];
open(INFILE, $SourceFile) or die "Can't open source file: $SourceFile \n";
open(REMOVE, $ToRemove) or die "Can't open toRemove file: $ToRemove \n";
open(OutGood, "> $NewFile") or die "Can't open good output file \n";
open(OutLarge, "> $LargeRecords") or die "Can't open Large Records output file \n";
#Read in the list of bad IDs into array
@array = <REMOVE>;
#Loop through each bad record
foreach (@array)
{
$badID = $_;
#read the extract line by line
while(<INFILE>)
{
#take the line and split it into
@fields = split /\|/, $_;
my $extractID = $fields[$FieldNum];
#print "Here's what we got: $badID and $extractID\n";
while($extractID == $badID)
{
#Write out bad large records
print OutLarge join '|', @fields;
#Get the next line in the extract file
@fields = split /\|/, <INFILE>;
my $extractID = $fields[$FieldNum];
$found = 1; #true
#print " We got a match!!";
#remove item after it has been found
my $input_remove = $badID;
@array = grep {!/$input_remove/} @array;
}
print OutGood join '|', @fields;
}
}
Try this:
$ perl -F'\|' -nae 'BEGIN {while(<>){chomp; $bad{$_}++;last if eof;}} print unless $bad{$F[0]};' bad good
First, you are lucky: the number of bad IDs is small. That means you can read the list of bad IDs once and store them in a hash without running into memory problems. Once you have them in a hash, you just read the big data file line by line, skipping output for bad IDs.
#!/usr/bin/env perl
use strict;
use warnings;
# hardwired for convenience
my $bad_id_file = 'bad.txt';
my $data_file = 'data.txt';
my $bad_ids = read_bad_ids($bad_id_file);
remove_data_with_bad_ids($data_file, $bad_ids);
sub remove_data_with_bad_ids {
my $file = shift;
my $bad = shift;
open my $in, '<', $file
or die "Cannot open '$file': $!";
while (my $line = <$in>) {
if (my ($id) = extract_id(\$line)) {
exists $bad->{ $id } or print $line;
}
}
close $in
or die "Cannot close '$file': $!";
return;
}
sub read_bad_ids {
my $file = shift;
open my $in, '<', $file
or die "Cannot open '$file': $!";
my %bad;
while (my $line = <$in>) {
if (my ($id) = extract_id(\$line)) {
$bad{ $id } = undef;
}
}
close $in
or die "Cannot close '$file': $!";
return \%bad;
}
sub extract_id {
my $string_ref = shift;
if (my ($id) = ($$string_ref =~ m{\A ([0-9]+) }x)) {
return $id;
}
return;
}
I'd use a hash as follows:
use warnings;
use strict;
my @bad = qw(2 3);
my %bad;
$bad{$_} = 1 foreach @bad;
my @file = qw (1|data|data|data 2|data|data|data 2|data|data|data 2|data|data|data 3|data|data|data 4|data|data|data 5|data|data|data);
my %hash;
foreach (@file){
my @split = split(/\|/);
$hash{$split[0]} = $_;
}
foreach (sort keys %hash){
print "$hash{$_}\n" unless exists $bad{$_};
}
Which gives:
   
1|data|data|data
4|data|data|data
5|data|data|data
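The question also wrote the removed rows to a second file, which the hash-based answers above leave out. Here is a minimal sketch of that variant, assuming the ID field index is passed in as in the question's $FieldNum. partition_by_id is a hypothetical name, and in-memory filehandles stand in for the real input and output files.

```perl
use strict;
use warnings;

# Send each line to $removed_fh if its ID is in %$bad, otherwise to $good_fh.
# Every line is written exactly once, in input order, duplicates included.
sub partition_by_id {
    my ($in_fh, $bad, $field_num, $good_fh, $removed_fh) = @_;
    while (my $line = <$in_fh>) {
        chomp(my $record = $line);
        my $id  = (split /\|/, $record, -1)[$field_num] // '';
        my $out = exists $bad->{$id} ? $removed_fh : $good_fh;
        print {$out} $line;
    }
    return;
}

# Demo with in-memory files; real code would open the files named on @ARGV.
my %bad     = map { $_ => 1 } qw(2 3);
my $extract = "1|data\n2|data\n2|data\n3|data\n4|data\n";

open my $in,      '<', \$extract          or die $!;
open my $good,    '>', \my $good_text     or die $!;
open my $removed, '>', \my $removed_text  or die $!;
partition_by_id($in, \%bad, 0, $good, $removed);
close $_ for $in, $good, $removed;

print "kept:\n$good_text", "removed:\n$removed_text";
```

The extract is read once and never held in memory as a whole, so the approach should scale to millions of rows as long as the bad-ID list fits in a hash.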

<DATA> prevents foreach loop from being executed, why? :)

I have two nested foreach loops. If I use this code:
foreach (@directories) {
my $actual_directory = $_;
print "\nactual directory: ".$actual_directory."\n";
foreach (@files) {
my $file_name = $_;
my $actual_file = $actual_directory.$file_name;
print $actual_file."\n";
open(DATA, $actual_file) or die "Cannot open source file: $!\n";
my $line_number = 0;
# while (<DATA>){
# my @znaky = split(' ',$_);
# my $poradi = $znaky[0]; # nucleotide position
# my $hodnota = $znaky[1]; # value
# my @temp = $files_to_sum_of_lines{$actual_file};
# $temp[$line_number] += $hodnota;
# $files_to_sum_of_lines{$actual_file} = @temp;
# $line_number+=1;
# }
# close(DATA);
}
}
I got this output:
actual directory: /home/n/Plocha/counting_files/1/
/home/n/Plocha/counting_files/1/a.txt
/home/n/Plocha/counting_files/1/b.txt
actual directory: /home/n/Plocha/counting_files/2/
/home/n/Plocha/counting_files/2/a.txt
/home/n/Plocha/counting_files/2/b.txt
However, if I uncomment "while (<DATA>){ }", I lose a.txt and b.txt, so the output looks like this:
actual directory: /home/n/Plocha/counting_files/1/
/home/n/Plocha/counting_files/1/a.txt
/home/n/Plocha/counting_files/1/b.txt
actual directory: /home/n/Plocha/counting_files/2/
/home/n/Plocha/counting_files/2/
/home/n/Plocha/counting_files/2/
How can this while (<DATA>) prevent my foreach from being executed?
Any help will be appreciated. Thanks a lot.
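In addition to not using DATA, use lexical loop variables and lexical filehandles. while (<DATA>) assigns each line to the global $_, and inside foreach (@files) that $_ is an alias for the current element of @files, so reading a file to the end overwrites the file name with undef; that is why the names are gone on the second pass through @files. Also, Perl's built-in $. keeps track of line numbers for you.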
for my $actual_directory (@directories) {
print "\nactual directory: ".$actual_directory."\n";
foreach my $file_name (@files) {
my $actual_file = $actual_directory.$file_name;
print $actual_file."\n";
open my $INPUT, '<', $actual_file
or die "Cannot open source file: $!\n";
while (my $line = <$INPUT>) {
my @znaky = split(' ', $line);
my $poradi = $znaky[0]; # nucleotide position
my $hodnota = $znaky[1]; # value
@temp = $files_to_sum_of_lines{$actual_file};
$temp[ $. ] += $hodnota;
$files_to_sum_of_lines{$actual_file} = @temp;
}
close $INPUT;
}
}
On the other hand, I can't quite tell if there is a logic error in there. Something like the following might be useful:
#!/usr/bin/perl
use warnings; use strict;
use Carp;
use File::Find;
use File::Spec::Functions qw( catfile canonpath );
my %counts;
find(\&count_lines_in_files, @ARGV);
for my $dir (sort keys %counts) {
print "$dir\n";
my $dircounts = $counts{ $dir };
for my $file (sort keys %{ $dircounts }) {
printf "\t%s: %d\n", $file, $dircounts->{ $file };
}
}
sub count_lines_in_files {
my $file = canonpath $_;
my $dir = canonpath $File::Find::dir;
my $path = canonpath $File::Find::name;
return unless -f $path;
$counts{ $dir }{ $file } = count_lines_in_file($path);
}
sub count_lines_in_file {
my ($path) = @_;
my $ret = open my $fh, '<', $path;
unless ($ret) {
carp "Cannot open '$path': $!";
return;
}
1 while <$fh>;
my $n_lines = $.;
close $fh
or croak "Cannot close '$path': $!";
return $n_lines;
}
Perl uses __DATA__ to make a pseudo-data file at the end of the package. You can access that using the filehandle DATA, e.g. <DATA>. Is it possible that your filehandle is conflicting? Try changing the filehandle to something else and see if it works better.
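A minimal sketch of why the file names vanish on the second pass, independent of the DATA handle: a bare while (<$fh>) assigns every line to the global $_, and foreach without a loop variable has aliased $_ to the current array element, so the array itself gets overwritten. The in-memory file and the array contents below are made up for the demonstration.

```perl
use strict;
use warnings;

my $text = "line 1\nline 2\n";           # stands in for a file's contents

# Implicit $_ in both loops: the inner loop writes into @files.
my @files = ('a.txt', 'b.txt');
for (@files) {                           # $_ aliases the current element
    open my $fh, '<', \$text or die $!;
    while (<$fh>) { }                    # each read assigns to $_; undef at EOF
    close $fh;
}
my @clobbered = map { defined $_ ? $_ : 'undef' } @files;
print "after implicit \$_:   @clobbered\n";

# Lexical loop variables: the array is left alone.
my @names = ('a.txt', 'b.txt');
for my $name (@names) {
    open my $fh, '<', \$text or die $!;
    while (my $line = <$fh>) { }
    close $fh;
}
print "after lexical vars: @names\n";
```

The first loop leaves both elements undefined (printed as "undef undef"), while the second still prints "a.txt b.txt".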