Result split on a specific character using ksh/sh

I have several input files that look like this, and before looping over the processing for all the files I would like to get the first column of each file on a single line, split with ||.
Input.txt
aa ,DEC
bb ,CHAR
cc ,CHAR
dd ,DEC
ee ,DEC
ff ,CHAR
gg ,DEC
Here is my attempt:
cat "$1" | while read line
do
    cle=`echo $line | cut -d"," -f1`
    for elem in $cle
    do
        echo -n "$elem||"
    done
done
But the problem is that I get a trailing || at the end of the output.
Here is the result I'm looking for, on one line:
aa || bb || cc || dd || ee || ff || gg

Probably use Awk instead.
awk -F ',' '{ printf "%s%s", sep, $1; sep = "||"; } END { printf "\n" }' "$1"
If you really wanted to use the shell, you can do pretty much the same thing, but it will typically be both clunkier and slower. Definitely prefer the Awk version for any real system.
sep=''
while IFS=',' read -r cle _; do
    printf "%s%s" "$sep" "$cle"
    sep="||"
done <"$1"
printf "\n"
Notice the absence of a useless cat and how the read command itself is perfectly able to split on whatever IFS is set to. (Your example looks like maybe you want to split on whitespace instead, which is the default behavior of both Awk and the shell. Drop the -F ',' or remove the IFS=',', respectively.) You obviously don't need a for loop to iterate over a single value, either. And always quote your variables.
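For example, the whitespace-splitting Awk variant of the same one-liner (a sketch run against the sample Input.txt, where the default FS also drops the stray space before each comma) would be:
awk '{ printf "%s%s", sep, $1; sep = "||" } END { printf "\n" }' Input.txt
which prints aa||bb||cc||dd||ee||ff||gg.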
If you want a space after the delimiter, set it to "|| " instead of just "||". Your example is not entirely consistent (or maybe the markup here hides some of your formatting).

Awk: Match data in 2 files with duplicate keys

I have 2 files
file1
a^b=-123
a^3=-124
c^b=-129
a^b=-130
and file2
a^b=-523
a^3=-524
a^b=-530
I want to look up the key using '=' as the delimiter and get the following output:
a^b^-123^-523
a^b^-130^-530
a^3^-124^-524
When there were no duplicate keys, it was easy to do in awk by mapping the first file and looping over the second; with duplicates, however, it's a bit harder. I tried something like this:
awk -F"=" '
FNR == NR {
arr[$1 "^" $2] = $2;
next;
}
FNR < NR {
for (i in arr) {
match(i, , /^(.*\^.*)\^([-0-9]*)$/, , ar);
if ($1 == ar[1]) {
if ($2 in load == 0) {
if (ar[2] in l2 == 0) {
l2[ar[2]] = ar[2];
load[$2] = $2;
print i "^" $2
}
}
}
}
}
' file1 file2
This works just fine, however, not surprisingly it's extremely slow. On a file with about 600K records, it ran for 4 hours.
Is there a better, more efficient way to do this in awk or perl? If possible, a one-liner would be a great help.
Thanks.
You might want to look at the join command which does something very much like you're doing here, but generates a full database-style join. For example, assuming file1 and file2 contain the data you show above, then the commands
$ sort -o file1.out -t = -k 1,1 file1
$ sort -o file2.out -t = -k 1,1 file2
$ join -t = file1.out file2.out
produces the output
a^3=-124=-524
a^b=-123=-523
a^b=-123=-530
a^b=-130=-523
a^b=-130=-530
The sorts are necessary because, to be efficient, join requires the input file to be sorted on the keys being compared. Note though that this generates the full cross-product join, which appears not to be what you want.
(Note: The following is a very shell-heavy solution, but you could cast it fairly easily into any programming language with dynamic arrays and a built-in sort primitive. Unfortunately, awk isn't one of those, but perl and python are, as is, I'm sure, just about every newer scripting language.)
It seems that you really want each instance of a key to be consumed the first time it's emitted in any output. You can get this as follows, again starting with the original contents of file1 and file2.
$ nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
$ nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
This decorates each line with the original line number so that we can recover the original order later, and then sorts them on the key for join. The remainder of the work is a short pipeline, which I've broken up into multiple blocks so it can be explained as we go.
join -t = -1 2 -2 2 file1.out file2.out |
This command joins on the key names, now in field two, and emits records like those shown from the earlier output of join, except that each line now includes the line number where the key was found in file1 and file2. Next, we want to re-establish the search order your original algorithm would have used, so we continue the pipeline with
sort -t = -k 2,2 -k 4,4 |
which sorts first on the file1 line number and then on the file2 line numbers. Finally, we need to efficiently emulate the assumption that a particular key, once consumed, cannot be re-used, in order to eliminate the unwanted matches in the original join output.
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'
This ignores every line that references a previously scanned key in either file, and otherwise prints the following
a^b=-123=-523
a^3=-124=-524
a^b=-130=-530
This should be uniformly efficient even for quite large inputs, because the sorts are O(n log n), and everything else is O(n).
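Assembled into a single runnable pipeline (the same commands as above, just put together):
nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
join -t = -1 2 -2 2 file1.out file2.out |
sort -t = -k 2,2 -k 4,4 |
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'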
Try this awk code and see if it is faster than yours. (It could be a one-liner if you joined all the lines, but with formatting I think it is easier to read.)
awk -F'=' -v OFS="^" 'NR==FNR { sub(/=/,"^"); a[NR]=$0; t=NR; next }
{
    s = $1
    sub(/\^/, "\\^", s)
    for (i=1; i<=t; i++) {
        if (a[i] ~ s) {
            print a[i], $2
            delete a[i]
            break
        }
    }
}' file1 file2
With your example, it outputs the expected result:
a^b^-123^-523
a^3^-124^-524
a^b^-130^-530
But I think the key here is performance, so give it a try.

sed/awk/cut/grep - Best way to extract string

I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file extracting the YouTube URL ID given in the last characters of each line, like "8aOdRACuXiU".
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note that I appended &hd=1 to the string I am trying to produce. I have tried using Linux rev and cut, but rev munges my data. The hard part here is that each line in my text file has entries with parentheses, and I only care about getting the data between the last set of parentheses. Each line has a variable length, so that isn't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Using awk:
awk '{
    v = substr( $NF, 2, length( $NF ) - 2 )
    printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
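If you have GNU awk, the same last-parentheses capture can be written with its three-argument match() (a sketch; assumes gawk):
gawk 'match($0, /\(([^()]+)\)[[:space:]]*$/, m) {
    print "http://www.youtube.com/watch?v=" m[1] "&hd=1"
}' results.txt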
Using a Perl one-liner:
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or the multi-line version:
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt
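Here, [^\(]+(?=\)$) matches a run of characters containing no opening parenthesis that is immediately followed by a closing parenthesis at the end of the line, i.e. the contents of the final parentheses; $& holds the matched ID.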

sed — joining a range of selected lines

I'm a beginner to sed. I know that it's possible to apply a command (or a set of commands) to a certain range of lines like so
sed '/[begin]/,/[end]/ [some command]'
where [begin] is a regular expression that designates the beginning line of the range and [end] is a regular expression that designates the ending line of the range (which is itself included in the range).
I'm trying to use this to specify a range of lines in a file and join them all into one line. Here's my best try, which didn't work:
sed '/[begin]/,/[end]/ {
    N
    s/\n//
}
'
I'm able to select the set of lines I want without any problem, but I just can't seem to merge them all into one line. If anyone could point me in the right direction, I would be really grateful.
One way using GNU sed:
sed -n '/begin/,/end/ { H;g; s/^\n//; /end/s/\n/ /gp }' file.txt
This is straightforward if you want to select some lines and join them. Use Steve's answer or my pipe-to-tr alternative:
sed -n '/begin/,/end/p' | tr -d '\n'
It becomes a bit trickier if you want to keep the other lines as well. Here is how I would do it (with GNU sed):
join.sed
/\[begin\]/ {
:a
/\[end\]/! { N; ba }
s/\n/ /g
}
So the logic here is:
When a [begin] line is encountered, start collecting lines into the pattern space with a loop.
When [end] is found, stop collecting and join the lines.
Example:
seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | sed -f join.sed
Output:
1
2
[begin] 3 4 5 6 [end]
7
8
9
I like your question. I also like Sed. Regrettably, I do not know how to answer your question in Sed; so, like you, I am watching here for the answer.
Since no Sed answer has yet appeared here, here is how to do it in Perl:
perl -we 'my $flag = 0; while (<>) { chomp; if (/\[begin\]/) { $flag = 1; } print if $flag; if (/\[end\]/) { print "\n" if $flag; $flag = 0; } } print "\n" if $flag;'

Swap two columns - awk, sed, python, perl

I've got data in a large file (280 columns wide, 7 million lines long!) and I need to swap the first two columns. I think I could do this with some kind of awk for loop, to print $2, $1, then a range to the end of the file - but I don't know how to do the range part, and I can't print $2, $1, $3...$280! Most of the column swap answers I've seen here are specific to small files with a manageable number of columns, so I need something that doesn't depend on specifying every column number.
The file is tab delimited:
Affy-id chr 0 pos NA06984 NA06985 NA06986 NA06989
You can do this by swapping values of the first two fields:
awk ' { t = $1; $1 = $2; $2 = t; print; } ' input_file
I tried perreal's answer with Cygwin on a Windows system with a tab-separated file. It didn't work, because the default separator is whitespace.
If you encounter the same problem, try this instead:
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file
The input separator is defined by -F $'\t' and the output separator by OFS=$'\t'.
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file > output_file
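With the sample header line from the question, the first line of the swapped output would be (fields still tab-separated):
chr	Affy-id	0	pos	NA06984	NA06985	NA06986	NA06989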
Try this, more relevant to your question:
awk '{printf("%s\t%s\n", $2, $1)}' inputfile
This might work for you (GNU sed):
sed -i 's/^\([^\t]*\t\)\([^\t]*\t\)/\2\1/' file
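The two groups capture the first and second tab-terminated fields (each including its trailing tab), and \2\1 writes them back in reverse order; -i edits the file in place.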
Have you tried using the cut command? E.g.
cut -c10-20,1-9,21- myhugefile > myrearrangedhugefile
Note, though, that cut emits the selected ranges in their original file order regardless of how they are listed, so it cannot actually reorder columns.
This is also easy in perl:
perl -pe 's/^(\S+)\t(\S+)/$2\t$1/;' file > outputfile
You could do this in Perl:
perl -F\\t -nlae 'print join("\t", @F[1,0,2..$#F])' inputfile
The -F specifies the delimiter. In most shells you need to precede the backslash with another backslash to escape it. In recent versions of Perl, -F implies -a and -a implies -n, so those switches can be dropped.
For your problem you wouldn't need -l, because the last column appears last in the output. But in a different situation, where the last column needs to appear between other columns, the newline character must be removed. The -l switch takes care of this.
The "\t" in join can be changed to anything else to produce a different delimiter in the output.
2..$#F specifies a range from 2 until the last column. As you might have guessed, inside the square brackets, you can put any single column or range of columns in the desired order.
No need to call anything else but your shell:
bash> while read col1 col2 rest; do
echo $col2 $col1 $rest
done <input_file
Test:
bash> echo "first second a c d e f g" |
while read col1 col2 rest; do
echo $col2 $col1 $rest
done
second first a b c d e f g
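Note that read with the default IFS squeezes runs of whitespace, and the unquoted echo re-splits its arguments, so the tabs in your file would come out as single spaces. A tab-preserving variant (a sketch, assuming bash for the $'\t' quoting):
while IFS=$'\t' read -r col1 col2 rest; do
    printf '%s\t%s\t%s\n' "$col2" "$col1" "$rest"
done < input_file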
Maybe even with "inlined" Python - as in a Python script within a shell script - but only if you want to do some more scripting with Bash beforehand or afterwards... Otherwise it is unnecessarily complex.
Content of script file process.sh:
#!/bin/bash
# inline Python script
read -r -d '' PYSCR << EOSCR
from __future__ import print_function
import codecs
import sys
encoding = "utf-8"
fn_in = sys.argv[1]
fn_out = sys.argv[2]
# print("Input:", fn_in)
# print("Output:", fn_out)
with codecs.open(fn_in, "r", encoding) as fp_in, \
     codecs.open(fn_out, "w", encoding) as fp_out:
    for line in fp_in:
        # split into two columns and rest
        col1, col2, rest = line.split("\t", 2)
        # swap columns in output
        fp_out.write("{}\t{}\t{}".format(col2, col1, rest))
EOSCR
# ---------------------
# do setup work?
# e. g. list files for processing
# call python script with params
python3 -c "$PYSCR" "$inputfile" "$outputfile"
# do some more processing
# e. g. rename outputfile to inputfile, ...
If you only need to swap the columns for a single file, then you can also just create a single Python script and statically define the filenames. Or just use an answer above.
Awk swapping sans temp variable:
echo '777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541' |
mawk '1; ($1 = $2 substr(_, ($2 = $1)^_))^_' FS=':' OFS=':'
777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541
317 647 14423 262927714037 :777777744444444464449: 0x2A29D5A1BAA7A95541

Perform action on line range in sed/awk

How can I extract certain variables from a specific range of lines in sed/awk?
Example: I want to extract the host and port from .tnsnames.ora
from this section that starts at line 105.
DB_CONNECTION=
(description=
(address=
(protocol=tcp)
(host=127.0.0.1)
(port=1234)
)
(connect_data=
(sid=ABCD)
(sdu=4321)
)
gawk can use a regular expression as the field separator (FS).
The final pattern '$0=$2' assigns $2 to $0 and is true whenever $2 is non-empty, so the script automatically prints $2.
$ gawk -F'[()]' 'NR>105&&NR<115&&(/host/||/port/)&&$0=$2' .tnsnames.ora
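With the sample section from the question (and assuming it sits inside the line range), this prints:
host=127.0.0.1
port=1234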
use:
sed '105,$ <whatever sed code you want here>'
If you specifically want the host and the port you can do something like:
sed -n '105,115p' .tnsnames.ora | grep -e 'host=' -e 'port='
You can use address ranges to specify to which section to apply the regular expressions. If you leave the end line address out (keep the comma) it will match to the end of file. You can also 'chain' multiple expressions by using '-e' multiple times. The following expression will just print the port and host value to standard out. It uses back references (\1) in order to just print the matching parts.
sed -n -e '105,115s/(port=\([0-9].*\))/\1/p' -e '105,115s/(host=\([0-9\.].*\))/\1/p' tnsnames.ora
#lk, to address the answer you posted:
You can write awk code like C, but it's more succinctly expressed as "pattern {action}" pairs.
If you have gawk or nawk, the field separator is an ERE, as Hirofumi Saito said:
gawk -F'[()=]' '
NR < 105 {next}
NR > 115 {exit}
$2 == "host" || $2 == "port" {
# do stuff with $2 and $3
print $2 "=" $3
}
'
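With the sample section from the question (again assuming it falls inside the 105-115 range), this prints:
host=127.0.0.1
port=1234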