File 1
A11;F1;BMW
A23;F2;BMW
B12;F3;BMW
H11;F4;JBW
File 2
P01;A1;0;0--00 ;123;456;150
P01;A11;0;0--00 ;123;444;208
P01;B12;0;0--00 ;123;111;36
P01;V11;0;0--00 ;123;787;33.9
Output
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
I tried:
awk 'FNR==NR {a[$2] = $0; next }{ if($1 in a) {p=$1;$1="";print a[p],$0}}' File1 File2
But it didn't work.
Basically I want to get the details from File 1 and compare them with File 2 (the master list).
Example:
A1 in File 2 was not available in File 1, so in the output file we have "-" for the first three fields and the rest from File 2. Now, we have A11 and we have its details in File 1, so we write the details of A11 from both File 1 and File 2.
I would do this in Perl, personally, but since everyone and their mother is giving you a Perl solution, here's an alternative:
Provided that the records in each file have a consistent number of fields, and provided that the records in each file are sorted by the "join" field in lexicographic order, you can use join:
join -1 1 -2 2 -t ';' -e - -o '1.1 1.2 1.3 2.1 2.2 2.3 2.4 2.5 2.6 2.7' -a 2 File1 File2
Explanation of options:
-1 1 and -2 2 mean that the "join" field (A11, A23, etc.) is the first field in File1 and the second field in File2.
-t ';' means that fields are separated by ;
-e - means that empty fields should be replaced by -
-o '1.1 1.2 1.3 2.1 2.2 2.3 2.4 2.5 2.6 2.7' means that you want each output-line to consist of the first three fields from File1, followed by the first seven fields from File2. (This is why this approach requires that the records in each file have a consistent number of fields.)
-a 2 means that you want to include every line from File2 in the output, even if there's no corresponding line from File1. (Otherwise it would only output lines that have a match in both files.)
The usual Perl way: use a hash to remember the master list:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
open my $MASTER, '<', 'File1' or die $!;
while (<$MASTER>) {
chomp;
my @columns = split /;/;
$hash{$columns[0]} = [@columns[1 .. $#columns]];
}
close $MASTER;
open my $DETAIL, '<', 'File2' or die $!;
while (<$DETAIL>) {
my @columns = split /;/;
if (exists $hash{$columns[1]}) {
print join ';', $columns[1], @{ $hash{$columns[1]} }, q();
} else {
print '-;-;-;';
}
print;
}
close $DETAIL;
With Perl:
use warnings;
use strict;
my %file1;
open (my $f1, "<", "file1") or die();
while (<$f1>) {
chomp;
my @v = (split(/;/))[0];
$file1{$v[0]} = $_;
}
close ($f1);
open (my $f2, "<", "file2") or die();
while (<$f2>) {
chomp;
my $v = (split(/;/))[1];
if (defined $file1{$v}) {
print "$file1{$v};$_\n";
} else {
print "-;-;-;$_\n";
}
}
close ($f2);
A Perl solution might use the very nice module Text::CSV. If so, you might extract the values into a hash and later use that hash for lookup. When looking up values, you would insert the blank values -;-;-; for any undefined values in the lookup hash.
use strict;
use warnings;
use Text::CSV;
my $lookup = "file1.csv"; # whatever file is used to look up fields 0-2
my $master = "file2.csv"; # the file controlling the printing
my $csv = Text::CSV->new({
sep_char => ";",
eol => $/, # to add newline to $csv->print()
quote_space => 0, # to avoid adding quotes
});
my %lookup;
open my $fh, "<", $lookup or die $!;
while (my $row = $csv->getline($fh)) {
$lookup{$row->[0]} = $row; # add entire row to specific key
}
open $fh, "<", $master or die $!; # new $fh needs no close
while (my $row = $csv->getline($fh)) {
my $extra = $lookup{$row->[1]} // [ qw(- - -) ]; # blank row if undef
unshift @$row, @$extra; # add the new values
$csv->print(*STDOUT, $row); # then print them
}
Output:
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
This can't be done conveniently in a one-line program as it involves reading two input files, but the problem isn't difficult.
This program reads all the lines from file1 and uses the first field as a key to store each line in a hash.
Then all the lines from file2 are read, and the second field is used as a key to access the hash. The // defined-or operator prints either the hash element's value, if it exists, or the default string if not.
Finally the current line from file2 is printed.
use strict;
use warnings;
my %data;
open my $fh, '<', 'file1' or die $!;
while (<$fh>) {
chomp;
my $key = (split /;/)[0];
$data{$key} = $_;
}
open $fh, '<', 'file2' or die $!;
while (<$fh>) {
my $key = (split /;/)[1];
print +($data{$key} // '-;-;-'), ';', $_;
}
Output:
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
Related
I am new to Perl and am trying to read a file with columns and create arrays.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item present in the first column, with its values taken from the second column.
e.g.:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly, some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use the empty file input operator (<>), which will return data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : @{ $data{$_} }";
}
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my @elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
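If it helps, here is a minimal sketch of how you might then walk the resulting hash; it assumes the %seen hash built by the loop above:
# Print each key with its collected values, e.g. "A: 15 20 33"
for my $key (sort keys %seen) {
    print "$key: @{ $seen{$key} }\n";
}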
I am looking to avoid using an array, for memory's sake, but still get the number of items returned by the split function on each pass of the while loop.
The ultimate goal is to filter the output files according to the number of their sequences, which could be deduced from the number of rows the file has, the number of carets ('>') that appear, the number of line breaks, etc.
Below is my code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "AGTCATGCTTTATCGCGATCGAT",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
foreach my $sequence (split /\t/, $line){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
The input file, "Clustered_Barcodes.txt" when opened, looks like the following:
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
There will be three output files from the code, "Clustered_Barcode_1.txt", "Clustered_Barcode_2.txt", and "Clustered_Barcode_3.txt". An example of what the output files would look like could be the 3rd and final file, which would look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
I need some way to modify my code to identify the number of rows, carets, or sequences that appear in the file and work that into the title of the file. The new title for the above sequence could be something like "Clustered_Barcode_Number_3_1_Sequence.txt".
PS: I made the hash in the above code manually in an attempt to make things simpler. If you want to see the original code, here it is. The input file format is something like:
>TAGCTAGC
GCTAAGCGATGCTACGGCTATTAGCTAGCCGGTA
Here is the code for setting up the hash:
my $dir = ("~/Documents/Sequences");
open(INFILE, "<", "~/Documents/Clustered_Barcodes.txt") or die $!;
my %hash = ();
my @ArrayofFiles = glob "$dir/*"; #put all files from the specified directory into an array
#print join("\n", @ArrayofFiles), "\n"; #this is a diagnostic test print statement
foreach my $file (@ArrayofFiles){ #make hash of barcodes and sequences
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>) {
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
#print $seq;
$seq =~ m/(CATCAT|TACTAC)([TAGC]{16})([TAGC]+)([TAGC]{16})(CATCAT|TACTAC)/;
$hash{$2} = $3;
}
}
}
while(<INFILE>){
etc
You can use regex to get the count:
my $delimiter = "\t";
my $line = "zyz pqr abc xyz";
my $count = () = $line =~ /$delimiter/g; # $count is now 3
print $count;
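To tie this back to the file naming you asked about, here is a hypothetical sketch (the $n_sequences and $line_num names are invented for illustration) that folds the count into the open call inside your loop:
# N tab delimiters means N+1 barcodes on the line
my $count = () = $line =~ /\t/g;
my $n_sequences = $count + 1;
my $line_num = $.; # current input line number
open my $out, '>', "Clustered_Barcode_Number_${line_num}_${n_sequences}_Sequence.txt" or die $!;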
Your hash structure is not right for your problem, as you have multiple entries for the same IDs. For example, the TTTATAA key has 2 entries in your %hash.
To solve this, use hash of array to create the hash.
Change your hash creation code from
$hash{$2} = $3;
to
push(@{$hash{$2}}, $3);
Now change your code in the while loop to:
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
my %id_list;
foreach my $sequence (split /\t/, $line){
$id_list{$sequence}=1;
}
foreach my $sequence(keys %id_list)
{
foreach my $val (#{$hash{$sequence}})
{
print $out ">$sequence\n$val\n";
}
}
}
I have assumed that:
The first digit in the output file name is the input file line number
The second digit in the output file name is the input file column number
That the input hash is a hash of arrays to cover the case of several sequences "matching" the one barcode as mentioned in the comments
When a barcode has a match in the hash, that the output file will lists all the sequences in the array, one per line.
The simplest way to do this that I can see is to build the output file using a temporary filename and then rename it when you have all the data. According to the Perl Cookbook, the easiest way to create temporary files is with the module File::Temp.
The key to this solution is to move through the list of barcodes that appear on a line by column index, rather than the usual Perl way of simply iterating over the list itself. To get the actual barcodes, the column number $col is used to index back into @barcodes, which is created by splitting the line on whitespace. (Note that splitting on a single space is special-cased by Perl to emulate the behaviour of one of its predecessors, awk: leading whitespace is removed and the split is on whitespace, not a single space.)
This way we have the column number (indexed from 1) and the line number, which the Perl special variable $. provides. We can then use these to rename the file using the built-in rename().
use warnings;
use strict;
use diagnostics;
use feature 'say'; # enable say(), used below
use File::Temp qw(tempfile);
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => [ "TATAGCGCTTTATGCTAGCTAGC" ],
"TTTATGG" => [ "TAGCTAGCTTTATGGGCTAGCTA" ],
"TTTATCC" => [ "GCTAGCTATTTATCCGCTAGCTA" ],
"TTTATCG" => [ "AGTCATGCTTTATCGCGATCGAT" ],
"TTTATAA" => [ "TAGCTAGCTTTATAATAGCTAGC", "ATCGATCGTTTATAACGATCGAT" ],
"TTTATAT" => [ "TCGATCGATTTATATTAGCTAGC", "TAGCTAGCTTTATATGCTAGCTA" ],
"TTTATTA" => [ "GCTAGCTATTTATTATAGCTAGC" ],
"CTTGTAA" => [ "ATCGATCGCTTGTAACGATTAGC" ]
);
my $cbn = "Clustered_Barcode_Number";
my $trailer = "Sequence.txt";
while (my $line = <INFILE>) {
chomp $line ;
my $line_num = $. ;
my @barcodes = split " ", $line ;
for my $col ( 1 .. @barcodes ) {
my $barcode = $barcodes[ $col - 1 ]; # arrays indexed from 0
# skip this one if its not in the hash
next unless exists $hash{$barcode} ;
my @sequences = @{ $hash{$barcode} } ;
# Have a hit - create temp file and output sequences
my ($out, $temp_filename) = tempfile();
say $out ">$barcode" ;
say $out $_ for (@sequences) ;
close $out ;
# Rename based on input line and column
my $new_name = join "_", $cbn, $line_num, $col, $trailer ;
rename ($temp_filename, $new_name) or
warn "Couldn't rename $temp_filename to $new_name: $!\n" ;
}
}
close INFILE;
All of the barcodes in your sample input data have a match in the hash, so when I run this, I get 4 files for line 1, 5 for line 2 and 1 for line 3.
Clustered_Barcode_Number_1_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt
Clustered_Barcode_Number_1_3_Sequence.txt
Clustered_Barcode_Number_1_4_Sequence.txt
Clustered_Barcode_Number_2_1_Sequence.txt
Clustered_Barcode_Number_2_2_Sequence.txt
Clustered_Barcode_Number_2_3_Sequence.txt
Clustered_Barcode_Number_2_4_Sequence.txt
Clustered_Barcode_Number_2_5_Sequence.txt
Clustered_Barcode_Number_3_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt for example has:
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
and Clustered_Barcode_Number_2_5_Sequence.txt has:
>TTTATTA
GCTAGCTATTTATTATAGCTAGC
Clustered_Barcode_Number_2_3_Sequence.txt - which matched a hash key with two sequences - had the following;
>TTTATAT
TCGATCGATTTATATTAGCTAGC
TAGCTAGCTTTATATGCTAGCTA
I was speculating here about what you wanted when a supplied barcode had two matches. Hope that helps.
I have two files, both delimited by a pipe.
First file:
It has maybe around 10 columns, but I am interested in the first two columns, which are useful for updating a column value in the second file.
First file detail:
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
Second file detail:
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12
My requirement here is to find the unique values in the 3rd column and also to replace the 4th column from the last. The 4th column from the last may or may not contain a number. This number also appears in the first field of the first file. I need to replace this number (in the second file) with the corresponding value that appears in the second column of the first file.
expected output:
unique string : ray kate bob
a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
I am able to pick the unique strings using the command below:
awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename
I don't want to read the second file twice (first to pick the unique strings and a second time to replace the 4th column from the last), as the file size is huge. It would be around 500 MB, and there are many such files.
Currently I am using the Perl Text::CSV module to read the first file (this file is small) and load the first two columns into a hash, considering the first column as the key and the second as the value, then reading the second file and replacing the n-4 column with the hash value. But this seems to be time consuming, as Text::CSV parsing seems to be slow.
Any awk/perl solution keeping speed in mind would be really helpful :)
Note: ignore the ** asterisks around the text; they are just for highlighting and are not part of the data.
UPDATE: Code
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
$hash{$fields[0]} = $fields[1];
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data);
$csv = Text::CSV->new({ sep_char => '|' , blank_is_undef => 1 , eol => "\n"});
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";
open ( my $fh,'>','/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while (my $line = <$data2>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
if (defined($fields[-4]) && looks_like_number($fields[-4]))
{
$fields[-4] = $hash{$fields[-4]};
}
$csv->print($fh, \@fields);
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data2);
close($fh);
Here's an option that doesn't use Text::CSV:
use strict;
use warnings;
@ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';
my ( %hash, %seen );
local $" = '|';
while (<>) {
my ( $key, $val ) = split /\|/, $_, 3;
$hash{$key} = $val;
last if eof;
}
open my $outFH, '>', pop or die $!;
while (<>) {
my @F = split /\|/;
$seen{ $F[2] } = undef;
$F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
print $outFH "@F";
}
close $outFH;
print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";
Command-line usage: perl firstFile secondFile outFile
Contents of outFile from your datasets (asterisks removed):
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT:
unique string : ray kate bob
Hope this helps!
Use getline instead of parse, it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::CSV;
my $csv = Text::CSV->new({
auto_diag => 1,
binary => 1,
blank_is_undef => 1,
eol => $/,
sep_char => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;
open my $map_fh, '<', 'map.csv' or die "map.csv: $!";
my %mapping;
while (my $row = $csv->getline($map_fh)) {
$mapping{ $row->[0] } = $row->[1];
}
close $map_fh;
open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";
my %seen;
while (my $row = $csv->getline($in_fh)) {
$seen{ $row->[2] } = 1;
my $key = $row->[-4];
$row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
$csv->print($out_fh, $row);
}
close $in_fh;
close $out_fh;
say join ',', keys %seen;
map.csv
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
input.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
output.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT
kate,bob,ray
This awk should work.
$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ !unique[$3]++ }
{ $(NF-3) = (a[$(NF-3)]) ? a[$(NF-3)] : $(NF-3) }1
END {
for(n in unique) print n > "unique.txt"
}' file1 file2 > output.txt
Explanation:
We set the input and output field separators to |.
We iterate through the first file, creating an array that stores column one as the key with column two as its value.
Once the first file is loaded in memory, we create another array while reading the second file. This array stores the unique values from column three of the second file.
While reading the file, we check whether the fourth value from the last is present in our array from the first file. If it is, we replace it with the value from the array; if not, we leave the existing value as is.
In the END block we iterate through our unique array and print it to a file called unique.txt. This holds all the unique entries seen in column three of the second file.
The entire output of the second file is redirected to output.txt, which now has the modified fourth column from the last.
$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
$ cat unique.txt
kate
bob
ray
Based on a mapping file, I need to search for a string and, if it is found, append the replacement string to the end of the line.
I'm traversing the mapping file line by line and using the Perl one-liner below to append the strings.
Issues:
1. Huge number of find & replace entries: the mapping file has a huge number of entries (~7000), and the Perl one-liner takes ~1 second for each entry, which boils down to ~1 hour to complete the entire replacement.
2. Not a simple find and replace: it is not a simple find & replace; if the string is found, the replacement string is appended to the end of the line.
If there is no efficient way to process this, I would even consider replacing rather than appending.
Mine is a Windows 7 64-bit environment and I'm using ActivePerl. No *nix support.
File Samples
Map.csv
findStr1,RplStr1
findStr2,RplStr2
findStr3,RplStr3
.....
findStr7000,RplStr7000
input.csv
col1,col2,col3,findStr1,....col-N
col1,col2,col3,findStr2,....col-N
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
output.csv (Expected Output)
col1,col2,col3,findStr1,....col-N,**RplStr1**
col1,col2,col3,findStr2,....col-N,**RplStr2**
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
Perl Code Snippet
One-Liner
perl -pe '/findStr/ && s/$/RplStr/' file.csv
open( INFILE, $MarketMapFile ) or die "Error occurred: $!";
my @data = <INFILE>;
my $cnt = 1;
foreach my $line (@data) {
    eval {
        # Remove end of line character.
        $line =~ s/\n//g;
        my ( $eNodeBID, $MarketName ) = split( ',', $line );
        my $exeCmd = 'perl -i.bak -p -e "/'.$eNodeBID.'\(M\)/ && s/$/,'.$MarketName.'/;" '.$CSVFile;
        print "\n $cnt Replacing $eNodeBID with $MarketName and cmd is $exeCmd";
        system($exeCmd);
        $cnt++;
    };
}
close(INFILE);
To do this in a single pass through your input CSV, it's easiest to store your mapping in a hash. 7000 entries is not particularly huge, but if you're worried about storing all of that in memory you can use Tie::File::AsHash.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Tie::File::AsHash;
tie my %replace, 'Tie::File::AsHash', 'map.csv', split => ',' or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/ })
or die Text::CSV->error_diag;
open my $in_fh, '<', 'input.csv' or die $!;
open my $out_fh, '>', 'output.csv' or die $!;
while (my $row = $csv->getline($in_fh)) {
push #$row, $replace{$row->[3]};
$csv->print($out_fh, $row);
}
untie %replace;
close $in_fh;
close $out_fh;
map.csv
foo,bar
apple,orange
pony,unicorn
input.csv
field1,field2,field3,pony,field5,field6
field1,field2,field3,banana,field5,field6
field1,field2,field3,apple,field5,field6
output.csv
field1,field2,field3,pony,field5,field6,unicorn
field1,field2,field3,banana,field5,field6,
field1,field2,field3,apple,field5,field6,orange
I don't recommend screwing up your CSV format by only appending fields to matching lines, so I add an empty field if a match isn't found.
To use a regular hash instead of Tie::File::AsHash, simply replace the tie statement with
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
This is untested code / pseudo-Perl; you'll need to polish it (strict, warnings, etc.):
# load the search and replace strings into memory
open( my $mapfh, "<", "mapfile" ) or die $!;
my %maplines;
while ( my $mapline = <$mapfh> ) {
    chomp $mapline;
    my ($findstr, $replstr) = split(/,/, $mapline);
    $maplines{$findstr} = $replstr;
}
close $mapfh;
open( my $ifh, "<", "inputfile" ) or die $!;
while ( my $inputline = <$ifh> ) {      # read an input line
    my @input = split(/,/, $inputline); # split it into a list
    if (exists $maplines{$input[3]}) {  # does this line match?
        chomp $input[-1];               # remove the newline
        push @input, $maplines{$input[3]} . "\n"; # add the replace str to the end
    }
    print join(',', @input); # or print to an output file
}
close($ifh);
I have two text files, text1.txt and text2.txt, like below.
text1
ac
abc
abcd
abcde
text2
ab
abc
acd
abcd
output
ac
abcde
I need to compare the two files and remove the content from text1 when there is a match in the second file.
I want the code in Perl. Currently I am trying the code below.
#!/usr/bin/perl
use strict;
use warnings;
open (GEN, "text1.txt") || die ("cannot open general.txt");
open (SEA, "text2.txt") || die ("cannot open search.txt");
open (OUT,">> output.txt") || die ("cannot open intflist.txt");
open (LOG, ">> logfile.txt");
undef $/;
foreach (<GEN>) {
my $gen = $_;
chomp ($gen);
print LOG $gen;
foreach (<SEA>) {
my $sea = $_;
chomp($sea);
print LOG $sea;
if($gen ne $sea) {
print OUT $gen;
}
}
}
With this I am getting all the content from text1, not the unmatched content. Please help me out.
I think you should read text2 into an array and then use that array in the second foreach:
@b = <SEA>;
Otherwise, in the second loop the file pointer would already be at the end.
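For example, a rough sketch of that fix, reusing the question's filehandles (and assuming the question's undef $/ line is removed so the files are read line by line):
my @b = <SEA>; # slurp text2 once
chomp @b;
while (my $gen = <GEN>) {
    chomp $gen;
    # print the line only if it matches nothing in text2
    print OUT "$gen\n" unless grep { $_ eq $gen } @b;
}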
One way:
#!/usr/bin/perl
use strict;
use warnings;
$\="\n";
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $out, '>', 'file3' or die $!;
chomp(my #arr1=<$fh1>);
chomp(my #arr2=<$fh2>);
foreach my $x (#arr1){
print $out $x if (!grep (/^\Q$x\E$/,@arr2));
}
close $fh1;
close $fh2;
close $out;
After executing the above, the file 'file3' contains:
$ cat file3
ac
abcde
This is my plan:
Read the contents of the first file into a hash, with each line as a key. For example, working with your data you get:
%lines = ( 'ac' => 1,
'abc' => 1,
'abcd' => 1,
'abcde' => 1);
Read the second file, deleting the key from the hash %lines if it exists.
Print the keys of %lines to the desired file.
Example:
use strict;
open my $fh1, '<', 'text1' or die $!;
open my $fh2, '<', 'text2' or die $!;
open my $out, '>', 'output' or die $!;
my %lines = ();
while( my $key = <$fh1> ) {
chomp $key;
$lines{$key} = 1;
}
while( my $key = <$fh2> ) {
chomp $key;
delete $lines{$key};
}
foreach my $key(keys %lines){
print $out $key, "\n";
}
close $fh1;
close $fh2;
close $out;
Your main problem is that you have undefined the input record separator $/. That means the whole file will be read as a single string, and all you can do is say that the two files are different.
Remove undef $/ and things will work a whole lot better. However, the inner foreach loop will read and print all the lines in file2 that don't match the first line of file1. The second time this loop is encountered, all the data has been read from the file, so the body of the loop won't be executed at all. You must either open file2 inside the outer loop or read the file into an array and loop over that instead.
Then again, do you really want to print all lines from file2 that aren't equal to each line in file1?
Update
As I wrote in my comment, it sounds like you want to output the lines in text1 that don't appear anywhere in text2. That is easily achieved using a hash:
use strict;
use warnings;
my %exclude;
open my $fh, '<', 'text2.txt' or die $!;
while (<$fh>) {
chomp;
$exclude{$_}++;
}
open $fh, '<', 'text1.txt' or die $!;
while (<$fh>) {
chomp;
print "$_\n" unless $exclude{$_};
}
With the data you show in your question, that produces this output
ac
abcde
I would like to view your problem like this:
You have a set S of strings in file.txt.
You have a set F of forbidden strings in forbidden.txt.
You want the strings that are allowed, so S \ F (set minus).
There is a data structure in Perl that implements a set of strings: The hash. (It can also map to scalars, but that is secondary here).
So first we create the set of the lines we have. We let all the strings in that file map to undef, as we don't need that value:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my %Set = map {$_ => undef} <$FILE>;
We create the forbidden set the same way:
open my $FORBIDDEN, "<", "forbidden.txt" or die "Can't open forbidden.txt: $!";
my %Forbidden = map {$_ => undef} <$FORBIDDEN>;
The set minus can be computed in either of these ways:
For each element x in S, x is in the result set R iff x isn't in F.
my %Result = map {$_ => $Set{$_}} grep {not exists $Forbidden{$_}} keys %Set;
The result set R initially is S. For each element in F, we delete that item from R:
my %Result = %Set; # make a copy
delete $Result{$_} for keys %Forbidden;
(the keys function accesses the elements in the set of strings)
We can then print out all the keys: print keys %Result.
But what if we want to preserve the order? Entries in a hash can also carry an associated value, so why not the line number? We create the set S like this:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my $line_no = 1;
my %Set = map {$_ => $line_no++} <$FILE>;
Now, this value is carried around with the string, and we can access it at the end. Specifically, we sort the keys in the hash after their line number:
my #sorted_keys = sort { $Result{$a} <=> $Result{$b} } keys %Result;
print @sorted_keys;
Note: All of this assumes that the files are terminated by newline. Else, you would have to chomp.
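For instance, a minimal sketch of the chomped variant of the set construction (same file as above):
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
# chomp each line before using it as a set member, in case the files
# are not newline-terminated
my %Set = map { chomp; $_ => undef } <$FILE>;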