I have an input file as follows. I need to break it into multiple files based on columns 2, 3 & 5. The file has more columns, but I have used the cut command to get only the required columns.
12,Accounts,India,free,Internal
13,Finance,China,used,Internal
16,Finance,China,free,Internal
12,HR,India,free,External
19,HR,China,used,Internal
33,Finance,Japan,free,Internal
39,Accounts,US,used,External
14,Accounts,Japan,used,External
11,Finance,India,used,External
11,HR,US,used,External
10,HR,India,used,External
Output files:
Accounts_India_Internal --
12,Accounts,India,free,Internal
Finance_China_Internal --
13,Finance,China,used,Internal
16,Finance,China,free,Internal
HR_India_External --
12,HR,India,free,External
10,HR,India,used,External
HR_China_Internal --
19,HR,China,used,Internal
and so on..
Please let me know how to achieve this.
As of now, I am thinking of sorting the file on these columns (2, 3, 5) and then looping over each record, creating files as I go: if a file does not exist, create it and add the record; otherwise, open the existing file and append the record.
Is it possible to do this using shell scripting (bash)?
If you simply want to split the files based on fields 2, 3 and 5 you can do that quickly with awk:
awk -F, '{print >> $2"_"$3"_"$5}' infile.txt
That appends each line to a file whose name is made up of fields 2, 3 and 5.
Example:
[me@home]$ awk -F, '{print >> $2"_"$3"_"$5}' infile.txt
[me@home]$ cat Accounts_India_Internal
12,Accounts,India,free,Internal
[me@home]$ cat Finance_China_Internal
13,Finance,China,used,Internal
16,Finance,China,free,Internal
If you do want output sorted, you can first run the file through sort.
sort -k2,3 -k5,5 -t, infile.txt | awk -F, '{print >> $2"_"$3"_"$5}'
That sorts the lines on fields 2, 3, and 5 before passing them on to the awk command.
Do note that we're appending to the files, so if you repeat the command without deleting the output files first, you'll end up with duplicate data in them. To address this, as well as your additional requirement from the chat (using the first line as a header for all new files), see this solution.
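That linked solution isn't reproduced here, but a minimal Perl sketch of the idea might look like the following, assuming the header is the first line of infile.txt (the details are illustrative, not the actual linked answer):
#!/usr/bin/perl
use strict;
use warnings;

my %fh;
my $header;

while (<>) {
    if ($. == 1) {                      # assume line 1 of the input is the header
        $header = $_;
        next;
    }
    my $name = join '_', (split /,/)[1, 2, 4];
    if (not $fh{$name}) {
        open $fh{$name}, '>', $name     # '>' truncates any stale output file
            or die "Unable to open '$name': $!";
        print { $fh{$name} } $header;   # copy the header into each new file
    }
    print { $fh{$name} } $_;
}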
I suggest you keep a hash of file handles keyed by their corresponding file names.
This program demonstrates the technique. The input file is expected as a parameter on the command line.
use strict;
use warnings;

my %fh;

while (<>) {
    chomp;
    my $filename = join '_', (split /,/)[1,2,4];
    if (not $fh{$filename}) {
        open $fh{$filename}, '>', $filename or die "Unable to open '$filename' for output: $!";
        print "$filename created\n";
    }
    print { $fh{$filename} } $_, "\n";
}
output
Accounts_India_Internal created
Finance_China_Internal created
HR_India_External created
HR_China_Internal created
Finance_Japan_Internal created
Accounts_US_External created
Accounts_Japan_External created
Finance_India_External created
HR_US_External created
Note: To use the code, simply change <DATA> to <> and pass the file name as an argument. The Data::Dumper print is there only for demonstration purposes and can be removed.
use strict;
use warnings;
use Data::Dumper;

my %h;
while (<DATA>) {
    chomp;
    my @data = split /,/;
    my $file = join "_", @data[1,2,4];
    push @{$h{$file}}, $_;
}
print Dumper \%h;
__DATA__
12,Accounts,India,free,Internal
13,Finance,China,used,Internal
16,Finance,China,free,Internal
12,HR,India,free,External
19,HR,China,used,Internal
33,Finance,Japan,free,Internal
39,Accounts,US,used,External
14,Accounts,Japan,used,External
11,Finance,India,used,External
11,HR,US,used,External
10,HR,India,used,External
To print the files, you could use a subroutine like so:
for my $key (keys %h) {
    print_file($key, $h{$key});
}

sub print_file {
    my ($file, $data) = @_;
    open my $fh, ">", $file or die $!;
    print $fh "$_\n" for @$data;
}
Save the input text as foo, then:
cat foo | perl -nle '$k = join "_", (split ",", $_)[1,2,4]; $t{$k} = [@{$t{$k}}, $_]; END{for (keys %t){print join "\n", "$_ --", @{$t{$_}}, undef }}' | csplit -sz - '/^$/' {*}
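Unrolled into a plain script, the same group-then-print idea might look like this sketch, which writes each group straight to a file named after its key instead of piping through csplit:
#!/usr/bin/perl
use strict;
use warnings;

my %group;
while (<>) {
    chomp;
    my $key = join '_', (split /,/)[1, 2, 4];   # fields 2, 3 and 5
    push @{ $group{$key} }, $_;
}

for my $key (sort keys %group) {
    open my $out, '>', $key or die "Unable to open '$key': $!";
    print {$out} "$_\n" for @{ $group{$key} };
    close $out;
}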
In AWK, it is common to see this kind of structure for a script that runs on two files:
awk 'NR==FNR { print "first file"; next } { print "second file" }' file1 file2
This uses the fact that there are two variables defined: FNR, which is the line number in the current file, and NR, which is the global line count (equivalent to Perl's $.).
Is there something similar to this in Perl? I suppose that I could maybe use eof and a counter variable:
perl -nE 'if (! $fn) { say "first file" } else { say "second file" } ++$fn if eof' file1 file2
This works but it feels like I might be missing something.
To provide some context, I wrote this answer in which I manually define a hash but instead, I would like to populate the hash from the values in the first file, then do the substitutions on the second file. I suspect that there is a neat, idiomatic way of doing this in Perl.
Unfortunately, Perl doesn't have a similar NR==FNR construct to differentiate between two files. What you can do is use the BEGIN block to process one file and the main body to process the other.
For example, to process a file with the following:
map.txt
a=apple
b=ball
c=cat
d=dog
alpha.txt
f
a
b
d
You can do:
perl -lne'
    BEGIN {
        $x = pop;
        %h = map { chomp; ($k,$v) = split /=/; $k => $v } <>;
        @ARGV = $x;
    }
    print join ":", $_, $h{$_} //= "Not Found"
' map.txt alpha.txt
f:Not Found
a:apple
b:ball
d:dog
Update:
I gave a pretty simple example, and now when I look at that, I can only say TIMTOWDI since you can do:
perl -F'=' -lane'
    if (@F == 2) { $h{$F[0]} = $F[1]; next }
    print join ":", $_, $h{$_} //= "Not Found"
' map.txt alpha.txt
f:Not Found
a:apple
b:ball
d:dog
However, I can say for sure that there is no NR==FNR construct in Perl, and you can process the files in various different ways depending on what they contain.
It looks like what you're aiming for is to use the same loop for reading both files, with a conditional inside the loop that chooses what to do with the data. I would avoid that idea because you are hiding two distinct processes in the same stretch of code, making it less than clear what is going on.
But, in the case of just two files, you could compare the current file with the first element of @ARGV, like this
perl -nE 'if ($ARGV eq $ARGV[0]) { say "first file" } else { say "second file" }' file1 file2
Forgetting about one-line programs, which I hate with a passion, I would just explicitly open $ARGV[0] and $ARGV[1]. Perhaps naming them like this
use strict;
use warnings;
use 5.010;
use autodie;

my ($definitions, $data) = @ARGV;

open my $fh, '<', $definitions;
while (<$fh>) {
    # Build hash
}

open $fh, '<', $data;
while (<$fh>) {
    # Process file
}
But if you want to avail yourself of the automatic opening facilities then you can mess with @ARGV like this
use strict;
use warnings;

my ($definitions, $data) = @ARGV;

@ARGV = ($definitions);
while (<>) {
    # Build hash
}

@ARGV = ($data);
while (<>) {
    # Process file
}
You can also create your own $fnr and compare it to $..
Given:
var='first line
second line'
echo "$var" >f1
echo "$var" >f2
echo "$var" >f3
You can create a pseudo FNR by setting a variable in the BEGIN block and resetting at each eof:
perl -lnE 'BEGIN{$fnr=1;}
    if ($fnr==$.) {
        say "first file: $ARGV, $fnr, $. $_";
    }
    else {
        say "$ARGV, $fnr, $. $_";
    }
    eof ? $fnr=1 : $fnr++;' f{1..3}
Prints:
first file: f1, 1, 1 first line
first file: f1, 2, 2 second line
f2, 1, 3 first line
f2, 2, 4 second line
f3, 1, 5 first line
f3, 2, 6 second line
Definitely not as elegant as awk but it works.
Note that Ruby has support for FNR==NR type logic.
I want to match two different files for similar lines in both files, without using a hash, but I have no clue how to do it. Can you please help?
Here is my input:
file1:
delhi
bombay
kolkata
shimla
ghuhati
File2:
london
delhi
jammu
punjab
shimla
output:
delhi
shimla
Stub of code:
#!/usr/bin/perl
use strict;
use warnings;
my @line1 = <file1>;
my @line2 = <file2>;
while (<file1>) {
    do stuff;
}
You really need a very good reason not to use a hash to solve this problem. Is it a homework assignment?
This will do what you ask. It reads all of file1 into an array, then reads through file2 one line at a time, using grep to check whether the name appears in the array.
use strict;
use warnings;
use autodie;

my @file1 = do {
    open my $fh, '<', 'file1.txt';
    <$fh>;
};
chomp @file1;

open my $fh, '<', 'file2.txt';
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n" if grep { $_ eq $line } @file1;
}
output
delhi
shimla
Of course you can do it without using a hash, in a very inefficient way:
open my $file1, '<', 'file1' or die $!;
open my $file2, '<', 'file2' or die $!;
my @line1 = <$file1>;
my @line2 = <$file2>;

foreach (@line1) {
    if (isElement($_, \@line2)) {
        print;
    }
}

sub isElement {
    my ($ele, $arr) = @_;
    foreach (@$arr) {
        if ($ele eq $_) {
            return 1;
        }
    }
    return 0;
}
You don't want to use hashes? Can I use modules?
The easiest way may be to avoid using Perl, but use the Unix sort and comm utilities:
$ sort -u file1.txt > file1.sort.txt
$ sort -u file2.txt > file2.sort.txt
# comm -12 file1.sort.txt file2.sort.txt
The comm utility prints three columns: lines found only in the first file, lines found only in the second file, and lines found in both files. The -12 option suppresses columns 1 and 2, giving you only the lines common to both files.
You could use a double loop where you compare each value against every other, as Lee Duhem did, but that's pretty inefficient: the time taken grows with the square of the number of items.
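For comparison, the hash-based approach the question rules out takes linear time; a minimal sketch using the question's file names:
use strict;
use warnings;

# Build a lookup set from file1, then test each line of file2 against it
my %seen;
open my $fh, '<', 'file1' or die $!;
while (<$fh>) {
    chomp;
    $seen{$_} = 1;
}
close $fh;

open $fh, '<', 'file2' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n" if $seen{$line};
}
close $fh;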
I have a csv file that contains below lines:
23000747,,2015582,-375080.2254,-375080,-375080
23000749,,SA1555,"-30,448,276","-30,448,456","-30,448,239"
I'd like to remove the double quotes and commas from all the quoted columns so that the result will be something like below:
23000747,,2015582,-375080.2254,-375080,-375080
23000749,,SA1555,-30448276,-30448456,-30448239
I have managed to locate the parts where I want to remove the commas using the command below, but I couldn't figure out how to apply s/,//g and s/"//g to \1.
sed 's/\("[-,0-9]*"\)/#\1#/g' 1.txt
23000747,,2015582,-375080.2254,-375080,-375080
23000749,,SA1555,#"-30,448,276"#,#"-30,448,456"#,#"-30,448,239"#
I'd really appreciate it if anyone can help here.
Jack
sed is not appropriate for this job. You could use Perl and the Text::CSV module, but if you have GNU awk (gawk 4.0 or later) you can use the FPAT variable:
awk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS="," } { for (i=1; i<=NF; i++) gsub(/[\",]/,"", $i) }1'
Results:
23000747,,2015582,-375080.2254,-375080,-375080
23000749,,SA1555,-30448276,-30448456,-30448239
For this specific task, the shell is limited. An advanced text-manipulation language like Perl together with a CSV parser is more suitable; see:
use strict;
use warnings;
use feature qw/say/;
use Text::CSV;

my $file = "/path/to/file.csv";

my $csv = Text::CSV->new()
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<:encoding(utf8)", $file
    or die "$file: $!";

while (my $row = $csv->getline($fh)) {
    map { tr/,//d } @$row;    # the 'd' flag actually deletes the commas
    say join ",", @$row;
}

$csv->eof or $csv->error_diag();
close $fh;
If you need to remove commas only in particular columns, replace
map { tr/,//d } @$row;
with
map { tr/,//d } @$row[3..5]; # array slice (column N is index N-1)
File 1
A11;F1;BMW
A23;F2;BMW
B12;F3;BMW
H11;F4;JBW
File 2
P01;A1;0;0--00 ;123;456;150
P01;A11;0;0--00 ;123;444;208
P01;B12;0;0--00 ;123;111;36
P01;V11;0;0--00 ;123;787;33.9
Output
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
I tried:
awk 'FNR==NR {a[$2] = $0; next }{ if($1 in a) {p=$1;$1="";print a[p],$0}}' File1 File2
But it didn't work.
Basically I want to get the details from File1 and compare them with File2 (the master list).
Example:
A1 in File2 was not available in File1, so in the output we have "-" for the first three fields and the rest from File2. Next, A11 is available in File1, so we write the details of A11 from both File1 and File2.
I would do this in Perl, personally, but since everyone and their mother is giving you a Perl solution, here's an alternative:
Provided that the records in each file have a consistent number of fields, and provided that the records in each file are sorted by the "join" field in lexicographic order, you can use join:
join -1 1 -2 2 -t ';' -e - -o '1.1 1.2 1.3 2.1 2.2 2.3 2.4 2.5 2.6 2.7' -a 2 File1 File2
Explanation of options:
-1 1 and -2 2 mean that the "join" field (A11, A23, etc.) is the first field in File1 and the second field in File2.
-t ';' means that fields are separated by ;
-e - means that empty fields should be replaced by -
-o '1.1 1.2 1.3 2.1 2.2 2.3 2.4 2.5 2.6 2.7' means that you want each output-line to consist of the first three fields from File1, followed by the first seven fields from File2. (This is why this approach requires that the records in each file have a consistent number of fields.)
-a 2 means that you want to include every line from File2 in the output, even if there's no corresponding line from File1. (Otherwise it would only output lines that have a match in both files.)
The usual Perl way: use a hash to remember the master list:
#!/usr/bin/perl
use warnings;
use strict;

my %hash;

open my $MASTER, '<', 'File1' or die $!;
while (<$MASTER>) {
    chomp;
    my @columns = split /;/;
    $hash{$columns[0]} = [@columns[1 .. $#columns]];
}
close $MASTER;

open my $DETAIL, '<', 'File2' or die $!;
while (<$DETAIL>) {
    my @columns = split /;/;
    if (exists $hash{$columns[1]}) {
        print join ';', $columns[1], @{ $hash{$columns[1]} }, q();
    } else {
        print '-;-;-;';
    }
    print;
}
close $DETAIL;
With Perl:
use warnings;
use strict;

my %file1;

open (my $f1, "<", "file1") or die();
while (<$f1>) {
    chomp;
    my @v = (split(/;/))[0];
    $file1{$v[0]} = $_;
}
close ($f1);

open (my $f2, "<", "file2") or die();
while (<$f2>) {
    chomp;
    my $v = (split(/;/))[1];
    if (defined $file1{$v}) {
        print "$file1{$v};$_\n";
    } else {
        print "-;-;-;$_\n";
    }
}
close ($f2);
A perl solution might include the very nice module Text::CSV. If so, you might extract the values into a hash, and later use that hash for lookup. When looking up values, you would insert blank values -;-;-; for any undefined values in the lookup hash.
use strict;
use warnings;
use Text::CSV;
my $lookup = "file1.csv"; # whatever file is used to look up fields 0-2
my $master = "file2.csv"; # the file controlling the printing
my $csv = Text::CSV->new({
    sep_char    => ";",
    eol         => $/,   # to add newline to $csv->print()
    quote_space => 0,    # to avoid adding quotes
});

my %lookup;
open my $fh, "<", $lookup or die $!;
while (my $row = $csv->getline($fh)) {
    $lookup{$row->[0]} = $row;                        # add entire row to specific key
}

open $fh, "<", $master or die $!;                     # new $fh needs no close
while (my $row = $csv->getline($fh)) {
    my $extra = $lookup{$row->[1]} // [ qw(- - -) ];  # blank row if undef
    unshift @$row, @$extra;                           # add the new values
    $csv->print(*STDOUT, $row);                       # then print them
}
Output:
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
This can't be done conveniently in a one-line program as it involves reading two input files, but the problem isn't difficult.
This program reads all the lines from file1 and uses the first field as a key to store each line in a hash.
Then all the lines from file2 are read, with the second field used as a key into the hash. The // defined-or operator prints either the stored value if it exists, or the default string if not.
Finally the current line from file2 is printed.
use strict;
use warnings;

my %data;

open my $fh, '<', 'file1' or die $!;
while (<$fh>) {
    chomp;
    my $key = (split /;/)[0];
    $data{$key} = "$_;";    # keep a trailing ';' as the field separator
}

open $fh, '<', 'file2' or die $!;
while (<$fh>) {
    my $key = (split /;/)[1];
    print $data{$key} // '-;-;-;', $_;
}
output
-;-;-;P01;A1;0;0--00 ;123;456;150
A11;F1;BMW;P01;A11;0;0--00 ;123;444;208
B12;F3;BMW;P01;B12;0;0--00 ;123;111;36
-;-;-;P01;V11;0;0--00 ;123;787;33.9
I have several text files, that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy, once I get them into a usable form. The first file, "keys.text" is just a list of labels, inconsistently formatted. Like:
Sa 1 #
Sa 2
U 328 #*
It's always letter(s), [space], number(s), [space], and sometimes symbol(s). The text files that match these keys start the same way, followed by a line of text, also delimited by a space.
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
What I'm trying to do in the code below is match each key from "keys.text" with the same key in the .txt files, and put a tab between the key and the text. I'm sure I'm overlooking something very basic, but the result I'm getting looks identical to the source .txt file.
Thanks in advance for any leads or assistance!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

open(IN1, "keys.text");

my $key;
# Read each line one at a time
while ($key = <IN1>) {
    # For each txt file in the current directory
    foreach my $file (<*.txt>) {
        open(IN, $file) or die("Cannot open TXT file for reading: $!");
        open(OUT, ">temp.txt") or die("Cannot open output file: $!");
        # Add temp modified file into directory
        my $newFilename = "modified\/keyed_" . $file;
        my $line;
        # Read each line one at a time
        while ($line = <IN>) {
            $line =~ s/"\$key"/"\$key" . "\/t"/;
            print(OUT "$line");
        }
        rename("temp.txt", "$newFilename");
    }
}
EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
The regex seems quoted rather oddly to me. Wouldn't
$line =~ s/$key/$key\t/;
work better?
Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.
And don't put parentheses around your print arguments, especially when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.
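Putting those fixes together (chomp the keys, protect them with \Q...\E so symbols like # and * aren't treated as regex metacharacters, and insert a literal \t), the loop might look like this sketch; file names follow the question, and the modified/ directory is assumed to exist:
#!/usr/bin/perl
use strict;
use warnings;

open my $keys_fh, '<', 'keys.text' or die "Cannot open keys.text: $!";
chomp(my @keys = <$keys_fh>);
close $keys_fh;

for my $file (glob '*.txt') {
    open my $in,  '<', $file                  or die "Cannot open $file: $!";
    open my $out, '>', "modified/keyed_$file" or die "Cannot open output: $!";
    while (my $line = <$in>) {
        for my $key (@keys) {
            # \Q...\E quotes regex metacharacters in the key
            last if $line =~ s/^\Q$key\E/$key\t/;
        }
        print $out $line;
    }
    close $in;
    close $out;
}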
If Perl is not a must, you can use this awk one-liner:
$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*
$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.
Then for all files named on the command line, available in the special Perl array @ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone; otherwise we insert a TAB between the key and the text.
Note that we edit the files in-place thanks to Perl's handy -i option:
-i[extension]
specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …
The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.
#! /usr/bin/perl -i.bak
use warnings;
use strict;

sub usage { "Usage: $0 text-file\n" }

sub read_keys {
    my $path = "keys.text";
    open my $fh, "<", $path
        or die "$0: open $path: $!";
    my %key;
    while (<$fh>) {
        my($text,$num) = split;
        ++$key{$text}{$num} if defined $text && defined $num;
    }
    wantarray ? %key : \%key;
}

die usage unless @ARGV;

my %key = read_keys;

while (<>) {
    my($text,$num,$line) = split " ", $_, 3;
    $_ = "$text $num\t$line" if defined $text &&
                                defined $num  &&
                                $key{$text}{$num};
    print;
}
Sample run:
$ ./add-tab input
$ diff -u input.bak input
--- input.bak 2010-07-20 20:47:38.688916978 -0500
+++ input 2010-07-20 21:00:21.119531937 -0500
@@ -1,3 +1,3 @@
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1 # Random line of text follows.
+Sa 2 This text is just as random.
+U 328 #* Continuing text...
Fun answers:
$line =~ s/(?<=$key)/\t/;
Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.
And:
$line =~ s/$key/$key . "\t"/e;
Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.
Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)
How about doing two separate slurps, one per file? For the first file you open the keys and create a preliminary hash. For the second file, all you then need to do is add the text to the hash.
use strict;
use warnings;

my $keys_file    = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file  = "path to output.txt";

my %hash = ();
# letters, number, and an optional trailing symbol
my $keys_regex = '^([a-zA-Z]+)\s*(\d+)\s*([^\da-zA-Z\s]*)';

open my $fh, '<', $keys_file or die "could not open $keys_file";
while (<$fh>) {
    my $line = $_;
    if ($line =~ /$keys_regex/) {
        my $key    = $1;
        my $number = $2;
        my $symbol = $3;
        # note: keys sharing the same letter part (Sa 1, Sa 2) will collide here
        $hash{$key}{'number'} = $number;
        $hash{$key}{'symbol'} = $symbol;
    }
}
close $fh;

open $fh, '<', $content_file or die "could not open $content_file";
while (<$fh>) {
    my $line = $_;
    if ($line =~ /^([a-zA-Z]+)/) {
        my $key = $1;
        # strip the key/number/symbol from the line to leave only the text
        $line =~ s/^$key//;
        $line =~ s/\s*$hash{$key}{'number'}//;
        $line =~ s/\s*\Q$hash{$key}{'symbol'}\E//;
        $line =~ s/^\s+//;
        chomp $line;
        $hash{$key}{'text'} = $line;
    }
}
close $fh;

open $fh, '>', $output_file or die "could not open $output_file";
for my $key (keys %hash) {
    print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;
I haven't had a chance to test it yet, and the solution seems a little hacky with all the regexes, but it might give you an idea of something else you can try.
This looks like the perfect place for Perl's map function! Read the entire text file into an array, then apply map across the whole array. The only other thing you might want to do is use the quotemeta function to escape any possible regular-expression metacharacters in your keys.
Using map is very convenient here. I also read the keys into an array in order not to keep opening and closing the keys file in my loop. It's an O(n²) algorithm, but if your keys aren't that big, it shouldn't be too bad.
#! /usr/bin/env perl
use strict;
use warnings;

open (KEYS, "keys.text")
    or die "Cannot open 'keys.text' for reading\n";
my @keys = <KEYS>;
close (KEYS);

foreach my $file (glob("*.txt")) {
    open (TEXT, "$file")
        or die "Cannot open '$file' for reading\n";
    my @textArray = <TEXT>;
    close (TEXT);

    foreach my $line (@keys) {
        chomp $line;
        # \Q...\E applies quotemeta so symbols like # and * match literally
        map($_ =~ s/^\Q$line\E/$line\t/, @textArray);
    }

    open (NEW_TEXT, ">$file.new") or
        die qq(Can't open file "$file.new" for writing\n);
    print NEW_TEXT @textArray;    # lines still carry their own newlines
    close (NEW_TEXT);
}