Required suggestions to optimize a piece of unix ksh code - sed

I'm new to shell scripting and am looking for some guidance on how to optimize the following piece of code to avoid unnecessary loops.
The file "DD.$BUS_DT.dat" is a pipe-delimited file containing 4 columns. Sample data in DD.2015-05-19.dat is as follows:
cust portal|10|10|0
sys-b|10|10|0
Code
i=0;
sed 's/|//g;s/[0-9]//g' ./DD.$BUS_DT.dat > ./temp-processed.dat
set -A sourceList
while read line
do
#echo $line
case $line in
'cust portal') sourceList[$i]=custportal;;
*) sourceList[$i]=${line};;
esac
(( i += 1));
done < ./temp-processed.dat;
echo ${sourceList[@]};
i=0;
while [[ i -lt ${#sourceList[@]} ]]; do
print ${sourceList[i]} >> ./processed-$BUS_DT.dat
(( i += 1))
done
My goal is to read the data from the first column of the file without spaces, so that the output looks like this:
custportal
sys-b
Your help will be appreciated.

I haven't gone through all of your script, but if you just want to get the first column of the |-separated columns, stripping any spaces it may contain, you can use awk like this:
$ awk -F"|" '{sub(" ","",$1); print $1}' file
custportal
sys-b
This uses | as the field separator and replaces the space in the first field with an empty string (use gsub instead of sub if a field can contain more than one space). Then it prints the field.
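If that is all you need, the whole script above (the sed, both while loops, and the array) can collapse into a single command. A minimal sketch, assuming the only goal is the space-stripped first column written to processed-$BUS_DT.dat:
# strip spaces from column 1 and write the result in one pass
awk -F'|' '{gsub(/ /, "", $1); print $1}' ./DD.$BUS_DT.dat > ./processed-$BUS_DT.dat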

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
tried using sed: sed 's/://3g' test.txt
Unfortunately, the g flag combined with the occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, you can use gensub; please try the following. This is based entirely on your shown samples, where you want to remove : from the 3rd occurrence onwards. gensub is used to split each matched line into two parts, and all colons are then removed from the second part (everything from the 3rd colon onwards).
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
  firstPart=restPart=""
  firstPart=gensub(regex,"\\1\\2","1",$0)
  restPart=gensub(regex,"\\3","1",$0)
  gsub(/:/,"",restPart)
  print firstPart restPart
}
' Input_file
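Run against the sample data shown in the question (saved as Input_file), this should print:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus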
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ( undef, $first, @rest ) = split /:/;
    print ":$first:", join( "", @rest ), "\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (see the if /^:account_id:/ condition).
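If the intention is to modify test.txt in place rather than print to standard output, the same one-liner can be combined with perl's -i switch (here keeping a .bak backup of the original), for example:
perl -i.bak -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt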
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result - iterates over the fields; if the field number is greater than 2, append just the current field value to result, otherwise append the value plus the output field separator; then print the result.
I would use GNU AWK the following way, if n is fixed and equal to 2. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator; this by itself removes all the : characters, so I add back the ones that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested this solely against this data, so if you want to use it, first test it with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from the third-occurrence marker (the newline) in the copy to the same marker in the amended line.
N.B. A newline is the best delimiter to use in the case of sed, as the lines presented to sed's commands are initially devoid of newlines. However, the important property of the delimiter is that it is unique, so it can be any character, as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
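Run against the question's sample file, either command should produce:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus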
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

while loop provide just one block of result using awk

I am processing my text file using awk. I have written the code below:
#!/bin/bash
l=1
while [ $l -lt 5 ]
do
echo $l
awk -v L=$l '/^BS[0-5]|^FG[2-7]/ && length<10 {i++}i==L {print}'
l=$(expr $l + 1)
done <input.txt
But once I run the code, I just get the first awk output.
Would you please let me know how can I fix this code?
The while loop is taking its input from the file input.txt. awk is inheriting its input from the while loop, and it reads all of the input on the first iteration of the loop. In the 2nd iteration, awk gets no data. If you want to read from that file on each iteration, pass input.txt as an argument to awk.
Better yet, skip the while loop and do the whole thing in awk:
awk '/^BS[0-5]|^FG[2-7]/ && length<10 && i++ < 5' input.txt
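If you do want to keep the loop (for example, while debugging), a minimal sketch of the fix described above, passing input.txt to awk on every iteration instead of redirecting it into the while loop:
#!/bin/bash
l=1
while [ "$l" -lt 5 ]
do
    echo "$l"
    # awk reads the file itself now, so every iteration sees the full input
    awk -v L="$l" '/^BS[0-5]|^FG[2-7]/ && length<10 {i++} i==L {print}' input.txt
    l=$((l + 1))
done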
You do not read the lines of your file. Do:
while read line && [ $l -lt 5 ]
do
...

Change numbering according to field value by bash script

I have a tab delimited file like this (without the headers and in the example I use the pipe character as delimiter for clarity)
ID1|ID2|VAL1|
1|1|3
1|1|4
1|2|3
1|2|5
2|2|6
I want to add a new field to this file that changes whenever ID1 or ID2 changes. Like this:
1|1|3|1
1|1|4|1
1|2|3|2
1|2|5|2
2|2|6|3
Is this possible with a one-liner in sed, awk, perl, etc., or should I use a standard programming language (Java) for this task? Thanks in advance for your time.
Here is an awk solution:
awk -F\| '$1$2!=a {f++} {print $0,f;a=$1$2}' OFS=\| file
1|1|3|1
1|1|4|1
1|2|3|2
1|2|5|2
2|2|6|3
Simple enough with bash, though I'm sure you could figure out a 1-line awk
#!/bin/bash
count=1
while IFS='|' read -r id1 id2 val1; do
#Can remove next 3 lines if you're sure you won't have extraneous whitespace
id1="${id1//[[:space:]]/}"
id2="${id2//[[:space:]]/}"
val1="${val1//[[:space:]]/}"
[[ ( -n $old1 && $old1 -ne $id1 ) || ( -n $old2 && $old2 -ne $id2 ) ]] && ((count+=1))
echo "$id1|$id2|$val1|$count"
old1="$id1" && old2="$id2"
done < file
For example
> cat file
1|1|3
1|1|4
1|2|3
1|2|5
2|2|6
> ./abovescript
1|1|3|1
1|1|4|1
1|2|3|2
1|2|5|2
2|2|6|3
Replace IFS='|' with IFS=$'\t' for tab delimited
Using awk
awk 'FNR>1{print $0 FS (++a[$1$2]=="1"?++i:i)}' FS=\| file

Insert a string/number into a specific cell of a csv file

Basically right now I have a for loop running that runs a series of tests. Once the tests pass I input the results into a csv file:
for (( some statement ))
do
if [[ something ]]; then
input this value into a specific row and column
fi
done
What I can't figure out right now is how to input a specific value into a specific cell in the csv file. I know in awk you can read a cell with this command:
awk -v "row=2" -F'#' 'NR == row { print $2 }' some.csv and this will print the cell in the 2nd row and 2nd column. I need something similar to this except it can input a value into a specific cell instead of read it. Is there a function that does this?
You can use the following:
awk -v value=$value -v row=$row -v col=$col 'BEGIN{FS=OFS="#"} NR==row {$col=value}1' file
And set the bash values $value, $row and $col. Then you can redirect and move to the original:
awk ... file > new_file && mv new_file file
The && means that the second command will be performed only if the first command (awk ...) is executed successfully.
Explanation
-v value=$value -v row=$row -v col=$col pass the bash variables to awk. Note value, row and col could be other names, I just used the same as bash to make it easier to understand.
BEGIN{FS=OFS="#"} set the Field Separator and Output Field Separator to be #. The OFS="#" is not necessary here, but can be useful in case you do some print.
NR==row {$col=value} when the number of record (number of line here) is equal to row, then set the col column with value value.
1 perform the default awk action: {print $0}.
Example
$ cat a
hello#how#are#you
i#am#fine#thanks
hoho#haha#hehe
$ row=2
$ col=3
$ value="XXX"
$ awk -v value=$value -v row=$row -v col=$col 'BEGIN{FS=OFS="#"} NR==row {$col=value}1' a
hello#how#are#you
i#am#XXX#thanks
hoho#haha#hehe
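To apply the redirect-and-move step described above to this example (a.tmp is just an arbitrary temporary name), you could then run:
$ awk -v value="$value" -v row="$row" -v col="$col" 'BEGIN{FS=OFS="#"} NR==row {$col=value}1' a > a.tmp && mv a.tmp a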
Your question has a 'perl' tag so here is a way to do it using Tie::Array::CSV which allows you to treat the CSV file as an array of arrays and use standard array operations:
use strict;
use warnings;
use Tie::Array::CSV;
my $row = 2;
my $col = 3;
my $value = 'value';
my $filename = '/path/to/file.csv';
tie my @file, 'Tie::Array::CSV', $filename, sep_char => '#';
$file[$row][$col] = $value;
untie @file;
Using sed:
row=2 # define the row number
col=3 # define the column number
value="value" # define the value you need change.
sed "$row s/[^#]\{1,\}/$value/$col" file.csv # use shell variable in sed to find row number first, then replace any word between #, and only replace the nominate column.
# So above sed command is converted to sed "2 s/[^#]\{1,\}/value/3" file.csv
If the above command is fine, and your sed command support the option -i, then run the command to change the content directly in file.csv
sed -i "$row s/[^#]\{1,\}/$value/$col" file.csv
Otherwise, you need to write to a temp file and then rename it back.
sed "$row s/[^#]\{1,\}/$value/$col" file.csv > temp.csv
mv temp.csv file.csv
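For reference, applying this to the same example file a used in the awk answer above (row=2, col=3, value=XXX) should give:
hello#how#are#you
i#am#XXX#thanks
hoho#haha#hehe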

Awk: Match data in 2 files with duplicate keys

I have 2 files
file1
a^b=-123
a^3=-124
c^b=-129
a^b=-130
and file2
a^b=-523
a^3=-524
a^b=-530
I want to lookup the key using '=' as delimiter and get the following output
a^b^-123^-523
a^b^-130^-530
a^3^-124^-524
When there were no duplicate keys, it was easy to do in awk by mapping the first file and looping over the second; however, with the duplicates, it's slightly more difficult. I tried something like this:
awk -F"=" '
FNR == NR {
arr[$1 "^" $2] = $2;
next;
}
FNR < NR {
for (i in arr) {
match(i, , /^(.*\^.*)\^([-0-9]*)$/, , ar);
if ($1 == ar[1]) {
if ($2 in load == 0) {
if (ar[2] in l2 == 0) {
l2[ar[2]] = ar[2];
load[$2] = $2;
print i "^" $2
}
}
}
}
}
' file1 file2
This works just fine; however, not surprisingly, it's extremely slow. On a file with about 600K records, it ran for 4 hours.
Is there a better and more efficient way to do this in awk or perl? If possible, a one-liner would be a great help.
Thanks.
You might want to look at the join command which does something very much like you're doing here, but generates a full database-style join. For example, assuming file1 and file2 contain the data you show above, then the commands
$ sort -o file1.out -t = -k 1,1 file1
$ sort -o file2.out -t = -k 1,1 file2
$ join -t = file1.out file2.out
produces the output
a^3=-124=-524
a^b=-123=-523
a^b=-123=-530
a^b=-130=-523
a^b=-130=-530
The sorts are necessary because, to be efficient, join requires the input file to be sorted on the keys being compared. Note though that this generates the full cross-product join, which appears not to be what you want.
(Note: The following is a very shell-heavy solution, but you could cast it fairly easily into any programming language with dynamic arrays and a built-in sort primitive. Unfortunately, awk isn't one of those but perl and python are, as are I'm sure just about every newer scripting language.)
It seems that you really want each instance of a key to be consumed the first time it's emitted in any output. You can get this as follows, again starting with the original contents of file1 and file2.
$ nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
$ nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
This decorates each line with the original line number so that we can recover the original order later, and then sorts them on the key for join. The remainder of the work is a short pipeline, which I've broken up into multiple blocks so it can be explained as we go.
join -t = -1 2 -2 2 file1.out file2.out |
This command joins on the key names, now in field two, and emits records like those shown from the earlier output of join, except that each line now includes the line number where the key was found in file1 and file2. Next, we want to re-establish the search order your original algorithm would have used, so we continue the pipeline with
sort -t = -k 2,2 -k 4,4 |
which sorts first on the file1 line number and then on the file2 line numbers. Finally, we need to efficiently emulate the assumption that a particular key, once consumed, cannot be re-used, in order to eliminate the unwanted matches in the original join output.
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'
This ignores every line that references a previously scanned key in either file, and otherwise prints the following
a^b=-123=-523
a^3=-124=-524
a^b=-130=-530
This should be uniformly efficient even for quite large inputs, because the sorts are O(n log n), and everything else is O(n).
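Putting the pieces together as one script (nothing new here, just the commands shown above assembled; file1.out and file2.out are the decorated, sorted temporaries):
nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
join -t = -1 2 -2 2 file1.out file2.out |
sort -t = -k 2,2 -k 4,4 |
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'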
Try this awk code and see if it is faster than yours (it could be a one-liner if you joined all the lines, but I think with formatting it is easier to read):
awk -F'=' -v OFS="^" 'NR==FNR{sub(/=/,"^");a[NR]=$0;t=NR;next}
{ s=$1
sub(/\^/,"\\^",s)
for(i=1;i<=t;i++){
if(a[i]~s){
print a[i],$2
delete a[i]
break
}
}
}' file1 file2
With your example, it outputs the expected result:
a^b^-123^-523
a^3^-124^-524
a^b^-130^-530
But I think the key here is performance, so give it a try.