Merging newline-separated strings - sh

Let's say I have two variables, foo and bar, containing the same number of newline-separated strings, for instance:
$ echo "$foo"
a
b
c
$ echo "$bar"
x
y
z
What is the simplest way to merge foo and bar to get the output below?
a x
b y
c z
If foo and bar were files, I could do paste -d ' ' foo bar, but in this case they are strings.

You can use process substitution in Bash to do this (not POSIX compliant):
foo=$'a\nb\nc'
bar=$'x\ny\nz'
paste -d ' ' <(printf '%s\n' "$foo") <(printf '%s\n' "$bar")
Outputs:
a x
b y
c z
An sh-compliant (POSIX) way is a little more convoluted; $'...', read -u and <<< are all Bash extensions, so here-documents and numbered file descriptors are used instead:
foo=$(printf 'a\nb\nc')
bar=$(printf 'x\ny\nz')
res=$(while IFS= read -r f1 <&3 && IFS= read -r f2 <&4; do
printf '%s %s\n' "$f1" "$f2"
done 3<<EOF_FOO 4<<EOF_BAR
$foo
EOF_FOO
$bar
EOF_BAR
)
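As a quick check, printing res reproduces the merged output:
$ printf '%s\n' "$res"
a x
b y
c z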


Iterate over "$@" stored in another variable in another function

How can I iterate over "$@" after it has been stored in another variable in another function?
Note this is about the sh shell, not bash.
My code (super simplified):
#! /bin/sh
set -- a b "c d"
args=
argv() {
shift # pretend handling options
args="$#" # remaining arguments
}
fun() {
for arg in "$args"; do
echo "+$arg+"
done
}
argv "$#"
fun
Output:
+b c d+
I want:
+b+
+c d+
The special parameter "$@" stores argv preserving whitespace. A for loop over "$@" also preserves whitespace.
set -- a b "c d"
for arg in "$@"; do
echo "+$arg+"
done
Output:
+a+
+b+
+c d+
But once "$@" is assigned to another variable, the whitespace preservation is gone.
set -- a b "c d"
args="$@"
for arg in "$args"; do
echo "+$arg+"
done
Output:
+a b c d+
Without quotes:
for arg in $args; do
echo "+$arg+"
done
Output:
+a+
+b+
+c+
+d+
In bash it can be done using arrays.
set -- a b "c d"
args=("$#")
for arg in "${args[#]}"; do
echo "+$arg+"
done
Output:
+a+
+b+
+c d+
Can that be done in the sh shell?
You could shift again inside fun if you record how far argv shifted (here in the global variable shifted) and pass the original "$@" to both functions.
#! /bin/sh
set -- a b "c d"
args=
argv() {
shifted=1 # pretend handling options
shift $shifted
}
fun() {
[ -n "$shifted" ] && shift $shifted
for arg; do
echo "+$arg+"
done
}
argv "$#"
fun "$#"
Output:
+b+
+c d+
Here are two workarounds. Both have caveats.
First workaround: put newlines between arguments then use read.
set -- a b " c d "
args=
argv() {
shift
for arg in "$#"; do
args="$args$arg\n"
done
}
fun() {
printf "$args" | while IFS= read -r arg; do
echo "+$arg+"
done
}
argv "$#"
fun
Output:
+b+
+ c d +
Note that even the spaces before and after are preserved.
Caveat: if the arguments contain newlines you are screwed.
Second workaround: put quotes around arguments then use eval.
set -- a b " c d "
args=
argv() {
shift
for arg in "$#"; do
args="$args \"$arg\""
done
}
fun() {
for arg in "$#"; do
echo "+$arg+"
done
}
argv "$#"
eval fun "$args"
Caveat: if the arguments contain quotes you are screwed.
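A sketch of a more robust variant (the helper name quote is my own): each argument is rewritten as a single-quoted word with any embedded single quotes escaped as '\'', so eval can reconstruct the list even when arguments contain quotes. With fun defined as in the second workaround:
quote() {
# wrap in single quotes; turn every embedded ' into '\''
printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}
argv() {
shift
args=
for arg in "$@"; do
args="$args $(quote "$arg")"
done
}
argv "$@"
eval "fun $args"
Caveat: command substitution strips trailing newlines, so arguments that end in newlines still lose them.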

sed: replace spaces within quotes with underscores

I have input (for example, from ifconfig run0 scan on OpenBSD) whose fields are separated by spaces, but some of the fields themselves contain spaces (luckily, such fields are always enclosed in quotes).
I need to distinguish between the spaces within the quotes, and the separator spaces. The idea is to replace spaces within quotes with underscores.
Sample data:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime
This doesn't end up processed the way I want, since I haven't replaced the spaces within the quotes with underscores yet:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M
Websense 6 00:22:7f:xx:xx:xx 59dB 54M
For a sed-only solution (which I don't necessarily advocate), try:
echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"
Translation:
Start at the beginning of the line.
Look for the pattern junk"junk"junk, repeated zero or more times (where junk contains no quotes), followed by junk"junk and a space.
Replace the final space with _.
If successful, jump back to the beginning.
try this:
awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file
It works for multiple quoted parts in a line:
echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\""
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz
EDIT: alternative form:
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
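For example, applied to the OP's quoted sample line, only the spaces inside the quotes are replaced:
$ echo 'nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime' | awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1'
nwid "myTouch_4G_Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime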
Another awk to try:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"
Removing the quotes:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
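For example, on the earlier sample line:
$ echo 'a b "c d e" f g "h i"' | awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
a b c_d_e f g h_i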
Some additional testing with a triple-size test file, further to the earlier tests done by @steve. I had to transform the sed statement a little bit so that non-GNU seds could process it as well. I included awk (the BWK awk), gawk3, gawk4 and mawk:
$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m27.802s
user 0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m6.565s
user 0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m21.486s
user 0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m14.270s
user 0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.251s
user 0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m13.229s
user 0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m33.965s
user 0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m15.437s
user 0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.002s
user 0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 5m14.008s
user 5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 4m11.026s
user 4m10.318s
sys 0m0.463s
mawk rendered the fastest results...
You'd be better off with perl. The code is much more readable and maintainable:
perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'
With your input, the results are:
a b "c_d_e" f g "h_i"
Explanation:
-p # enable printing
-e # the following expression...
s # begin a substitution
: # the first substitution delimiter
"[^"]*" # match a double quote followed by anything not a double quote any
# number of times followed by a double quote
: # the second substitution delimiter
($x=$&)=~s/ /_/g; # copy the pattern match ($&) into a variable ($x), then
# substitute a space for an underscore globally on $x. The
# variable $x is needed because capture groups and
# patterns are read-only variables.
$x # return $x as the replacement.
: # the last delimiter
g # perform the nested substitution globally
e # make sure that the replacement is handled as an expression
Some testing:
for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done
time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m8.301s
user 0m8.273s
sys 0m0.020s
time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.967s
user 0m4.924s
sys 0m0.036s
time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.336s
user 0m4.244s
sys 0m0.056s
time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null
real 2m26.101s
user 2m25.925s
sys 0m0.100s
NOT AN ANSWER, just posting the awk equivalent code for @steve's perl code in case anyone's interested (and to help me remember this in future):
@steve posted:
perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'
and from reading @steve's explanation, the briefest awk equivalent to that perl code (NOT the preferred awk solution - see @Kent's answer for that) would be this GNU awk:
gawk '{
head = ""
while ( match($0,"\"[^\"]*\"") ) {
head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}'
which we get to by starting from a POSIX awk solution with more variables:
awk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
x = substr(tail,RSTART,RLENGTH)
gsub(/ /,"_",x)
head = head substr(tail,1,RSTART-1) x
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and saving a line with GNU awk's gensub():
gawk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
head = head substr(tail,1,RSTART-1) x
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and then getting rid of the variable x:
gawk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and then getting rid of the variable "tail" if you don't need $0, NF, etc, left hanging around after the loop:
gawk '{
head = ""
while ( match($0,"\"[^\"]*\"") ) {
head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}'
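For example, piping the sample line from earlier through the last (gensub-based) variant reproduces @steve's perl output:
$ echo 'a b "c d e" f g "h i"' | gawk '{
head = ""
while ( match($0,"\"[^\"]*\"") ) {
head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}'
a b "c_d_e" f g "h_i"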

Extract text with different delimiters

My text file looks like this:
foo.en 14 :: xyz 1;foo bar 2;foofoo 5;bar 9
bar.es 18 :: foo bar 4;kjp bar 2;bar 6;barbar 8
Ignoring the text before the :: delimiter, is there a one-liner Unix command (many pipes allowed) or a one-liner Perl script that extracts the text to yield the unique phrases delimited by ;, as below?
xyz
foo bar
foofoo
bar
kjp bar
barbar
I've tried looping through the text file with a Python script, but I'm looking for a one-liner for the task.
ans = set()
for line in open(textfile):
    for field in line.partition(" :: ")[2].split(";"):
        ans.add(field.rsplit(" ", 1)[0])  # drop the trailing count
for a in ans:
    print a
With Perl:
perl -nle 's/.*?::\s*//;!$s{$_}++ and print for split /\s*\d+;?/' input
Description:
s/.*?::\s*//; # delete up to the first '::'
This part:
!$s{$_}++ and print for split /\s*\d+;?/
can be rewritten like this:
foreach my $word (split /\s*\d+;?/) { # for split /\s*\d+;?/
if (!$seen{$word}) { # !$s{$_}
print $word; # and print
}
$seen{$word}++; # $s{$_}++
}
Since the increment in !$s{$_}++ is a post-increment, Perl first tests the (negated) old value and only then does the increment. An undefined hash value counts as 0, i.e. false. If the test fails, i.e. $s{$_} was previously incremented, the and part is skipped due to short-circuiting.
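As a quick check (with the sample lines above saved as input), the one-liner prints each phrase once, in order of first appearance:
$ perl -nle 's/.*?::\s*//;!$s{$_}++ and print for split /\s*\d+;?/' input
xyz
foo bar
foofoo
bar
kjp bar
barbar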
cat textfile | sed 's/.*:://g' | tr '[0-9];' '\n' | sort -u
Explanation:
sed 's/.*:://g' Take everything up to and including `::` and replace it with nothing
tr '[0-9];' '\n' Replace numbers and semicolon with newlines
sort -u Sort, and return unique instances
it does result in a sorted output, I believe...
You can try this:
$ awk -F ' :: ' '{print $2}' input.txt | grep -oP '[^0-9;]+' | sort -u
bar
barbar
foo bar
foofoo
kjp bar
xyz
If your phrases contain numbers, try this perl regex: '[^;]+?(?=\s+\d+(;|$))'
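For example, plugging that regex into the same pipeline (assuming the same input.txt) keeps digits that are part of a phrase while still stripping the trailing counts:
$ awk -F ' :: ' '{print $2}' input.txt | grep -oP '[^;]+?(?=\s+\d+(;|$))' | sort -u
With the sample data this should produce the same six phrases as above.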
With only awk:
$ awk -F' :: ' '{
gsub(/[0-9]+/, "")
split($2, arr, /;/ )
for (a in arr) arr2[arr[a]]=""
}
END{
for (i in arr2) print i
}' textfile.txt
And a one-liner version:
awk -F' :: ' '{gsub(/[0-9]+/, "");split($2, arr, /;/ );for (a in arr) arr2[arr[a]]="";}END{for (i in arr2) print i}' textfile.txt

deleting lines from text files based on the last character of entries in another file, using awk or sed

I have a file, xx.txt, like this.
1PPYA
2PPYB
1GBND
1CVHA
The first line of this file is "1PPYA". I would like to
Read the last character of "1PPYA". In this example, it's "A".
Find "1PPY.txt" (named after the first four characters) in the "yy" directory.
Delete the lines that start with "csh" and contain the "A" character.
Given the following "1PPY.txt" in the "yy" directory:
csh 1 A 1 27.704 6.347
csh 2 A 1 28.832 5.553
csh 3 A 1 28.324 4.589
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
csh 6 A 1 28.378 4.899
The required output would be:
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
Assuming your shell is bash:
while read word; do
if [[ $word =~ ^(....)(.)$ ]]; then
filename="yy/${BASH_REMATCH[1]}.txt"
letter=${BASH_REMATCH[2]}
if [[ -f "$filename" ]]; then
sed "/^csh.*$letter/d" "$filename"
fi
fi
done < xx.txt
As you've tagged the question with awk:
awk '{
filename = "yy/" substr($1,1,4) ".txt"
letter = substr($1,5)
while ((getline < filename) > 0)
if (! match($0, "^csh.*" letter))
print
close(filename)
}' xx.txt
This might work for you:
sed 's|^ *\(.*\)\(.\)$|sed -i.bak "/^ *csh.*\2/d" yy/\1.txt|' xx.txt | sh
N.B. I added a file backup. If this is not needed amend the -i.bak to -i
You can use this bash script:
while read f l
do
[[ -f $f ]] && awk -v l="$l" '$3 != l' "$f"
done < <(awk '{len=length($0);l=substr($0,len);f=substr($0,1,len-1);print "yy/" f ".txt", l;}' xx.txt)
I posted this because you are a new user; however, it would be much better to show us what you have tried and where you're stuck.
TXR:
#(next "xx.txt")
#(collect)
#*prefix#{suffix /./}
# (next `yy/#prefix.txt`)
# (collect)
# (all)
#{whole-line}
# (and)
# (none)
#shell #num #suffix #(skip)
# (end)
# (end)
# (do (put-string whole-line) (put-string "\n"))
# (end)
#(end)
Run:
$ txr del.txr
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
txr: unhandled exception of type file_error:
txr: (del.txr:5) could not open yy/2PPY.txt (error 2/No such file or directory)
Because of the outer #(collect)/#(end) (easily removed) this processes all of the lines from xx.txt, not just the first line, and so it blows up because I don't have 2PPY.txt.

sed/awk: replace Nth occurrence

Is it possible to change the Nth occurrence (for example, the second occurrence) in a file using a one-line sed/awk, other than a method like this?:
line_num=`awk '/WHAT_TO_CHANGE/ {c++; if (c>=2) {c=NR;exit}}END {print c}' INPUT_FILE` && sed "$line_num,$ s/WHAT_TO_CHANGE/REPLACE_TO/g" INPUT_FILE > OUTPUT_FILE
Thanks
To change the Nth occurrence in a line you can use this:
$ echo foo bar foo bar foo bar foo bar | sed 's/foo/FOO/2'
foo bar FOO bar foo bar foo bar
So all you have to do is turn your text into a single line, e.g. using tr:
tr '\n' ';'
do your replacement, and then convert it back to multiple lines again using
tr ';' '\n'
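Putting the pieces together as a sketch (this assumes the join character, here ';', does not already occur in the file):
tr '\n' ';' < INPUT_FILE | sed 's/WHAT_TO_CHANGE/REPLACE_TO/2' | tr ';' '\n' > OUTPUT_FILE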
This awk solution assumes that WHAT_TO_CHANGE occurs only once per line. The following replaces the second 'one' with 'TWO':
awk -v n=2 '/one/ { if (++count == n) sub(/one/, "TWO"); } 1' file.txt
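For example, with three lines each containing 'one', only the second match is changed:
$ printf 'foo one\nbar one\nbaz one\n' | awk -v n=2 '/one/ { if (++count == n) sub(/one/, "TWO"); } 1'
foo one
bar TWO
baz one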