insert quotes for each field using awk - perl

I am looking for the output shown below, based on the sample provided.
Sample :
eno~ename~address~zip
123~abc~~560000~"a~b~c"
245~"abc ~ def"~hyd~560102
333~"ghi~jkl"~pub~560103
Expected output :
"eno"~"ename"~"address"~"zip"
"123"~"abc"~""~"560000"~"a~b~c"
"245"~"abc ~ def"~"hyd"~"560102"
"333"~"ghi~jkl"~"pub"~"560103"
The awk command I tried doesn't work if the data contains the delimiter. If there are any alternative suggestions with perl/sed/awk, please suggest them.
Below is the command : awk '{for (i=1;i<=NF;i++) $i="\""$i"\""}1' FS="~" OFS="~" sample

Could you please try the following (tested with the provided samples only).
awk 'BEGIN{s1="\"";FS=OFS="~"} {for(i=1;i<=NF;i++){if($i!~/^\"|\"$/){$i=s1 $i s1}}} 1' Input_file
Output will be as follows.
"eno"~"ename"~"address"~"zip"
"123"~"abc"~""~"560000"
"245"~"abc ~ def"~"hyd"~"560102"
"333"~"ghi~jkl"~"pub"~"560103"
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section of awk program here.
s1="\"" ##Setting variable s1 to " here.
FS=OFS="~" ##Setting value of FS and OFS as ~ here.
} ##Closing BEGIN block of awk code here.
{
for(i=1;i<=NF;i++){ ##Starting for loop here from i=1 to till value of NF here.
if($i!~/^\"|\"$/){ ##Checking that the current field neither starts nor ends with a double quote.
$i=s1 $i s1 ##Adding s1 variable before and after the value of $i.
} ##Closing block for if condition.
} ##Closing block for for loop here.
} ##Closing main block here.
1 ##Mentioning 1 will print the lines of Input_file.
' Input_file ##mentioning Input_file name here.
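One caveat worth checking before relying on this: the line is still split on every ~, so a quoted field that contains two or more delimiters has a middle piece that neither starts nor ends with a quote, and that piece gets wrapped on its own. A small sketch with a made-up input line (not taken from the question's sample) shows the effect:

```shell
# Hypothetical input: the quoted field holds two delimiters.
broken=$(printf '%s\n' '1~"a~b~c"' |
  awk 'BEGIN{s1="\"";FS=OFS="~"} {for(i=1;i<=NF;i++){if($i!~/^"|"$/){$i=s1 $i s1}}} 1')
printf '%s\n' "$broken"   # the middle piece b ends up quoted separately
```

So this approach seems safe only when a quoted field contains at most one delimiter.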

Here you can use FPAT with GNU awk:
awk -v FPAT='([^~]*)|("[^"]+")' -v OFS="~" '{for (i=1;i<=NF;i++) if ($i!~/^\"/) $i="\""$i"\""} 1' file
"eno"~"ename"~"address"~"zip"
"123"~"abc"~""~"560000"
"245"~"abc ~ def"~"hyd"~"560102"
"333"~"ghi~jkl"~"pub"~"560103"
Instead of describing what the field separator looks like, we describe what a field looks like. Then we test whether the field starts with a double quote; if it does not, we add the quotes.
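If gawk is not available, FPAT isn't either. A rough equivalent in plain awk walks the line one field at a time. This is only a sketch: it assumes a single-character delimiter, that a quoted field is always followed by the delimiter or the end of the line, and that a quoted field never contains an embedded double quote. The function name quote_fields and the variable d are made up for this example.

```shell
# quote_fields: hypothetical helper; reads lines on stdin, writes quoted lines.
quote_fields() {
  awk -v d='~' '
  {
    line = $0; out = ""
    while (1) {
      if (match(line, /^"[^"]*"/)) {      # field is already quoted: keep it as is
        len = RLENGTH
        field = substr(line, 1, len)
      } else {                            # bare field: runs up to the next delimiter
        n = index(line, d)
        len = (n ? n - 1 : length(line))
        field = "\"" substr(line, 1, len) "\""
      }
      out = out field
      line = substr(line, len + 1)        # drop the consumed field
      if (line == "") break
      out = out d
      line = substr(line, 2)              # drop the delimiter itself
    }
    print out
  }'
}

quoted=$(quote_fields <<'EOF'
eno~ename~address~zip
123~abc~~560000~"a~b~c"
245~"abc ~ def"~hyd~560102
333~"ghi~jkl"~pub~560103
EOF
)
printf '%s\n' "$quoted"
```

Unlike splitting on FS, this keeps a quoted field such as "a~b~c" in one piece.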
With FPAT you can then easily change the output field separator if you like:
awk -v FPAT='([^~]*)|("[^"]+")' -v OFS="," '{for (i=1;i<=NF;i++) if ($i!~/^\"/) $i="\""$i"\""} 1' file
"eno","ename","address","zip"
"123","abc","","560000"
"245","abc ~ def","hyd","560102"
"333","ghi~jkl","pub","560103"

Related

How to apply one command into another sed command?

I have one command which is used to extract lines between two string patterns 'string1' and 'string2'. This is stored in variable called 'var1'.
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
This command works well and the output is a set of lines.
Do you hear the people sing?
Singing a song of angry men?
It is the music of a people
Who will not be slaves again
I want the output of the above command to be inserted after a string pattern 'string3' in another file called stat.txt. I used sed as follows
sed '/string3/a'$var1'' stat.txt
I am having trouble getting the new output. Here, the $var1 seems to be working partially i.e. only one line -
string3
Do you hear the people sing?
Any other suggestions to solve this?
I would be tempted to use sed to extract the lines, and awk to insert them into the other text:
lines=$(sed -n '/string1/,/string2/ p' text.txt)
awk -v new="$lines" '{print} /string3/ {print new}' stat.txt
or perhaps both tasks in a single awk call
awk '
NR == FNR && /string1/ {flag = 1}
NR == FNR && /string2/ {flag = 0}
NR == FNR && flag {lines = lines $0 ORS}
NR == FNR {next}
{print}
/string3/ {printf "%s", lines} # it already ends with a newline
' text.txt stat.txt
It's a data format problem...
Appending a multi-line block of text with the sed append command requires every line of the block, except the last one, to end with a \. The variable also has to be double-quoted where it is spliced into the sed script, otherwise the shell splits it into separate words. So if we take the two lines of code that didn't work in the question, reformat the text as required by the append command, and quote the variable, the code should work as expected:
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
var1="$(sed '$!s/$/\\/' <<< "$var1")"
sed '/string3/a'"$var1" stat.txt
Note that the 2nd line above contains a bashism. A more portable version would be:
var1="$(echo "$var1" | sed '$!s/$/\\/')"
Either variant would convert $var1 to:
Do you hear the people sing?\
Singing a song of angry men?\
It is the music of a people\
Who will not be slaves again
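Another option that sidesteps the escaping entirely is sed's r command, which reads a file and queues its contents for output after the matching line. A minimal sketch, assuming the extracted block can be written to a temporary file first (the file contents and the names workdir, block.txt and result are made up for illustration):

```shell
workdir=$(mktemp -d)

# Hypothetical stand-ins for text.txt and stat.txt.
printf '%s\n' 'intro' 'string1' 'Do you hear the people sing?' \
  'Singing a song of angry men?' 'string2' 'outro' > "$workdir/text.txt"
printf '%s\n' 'before' 'string3' 'after' > "$workdir/stat.txt"

# Same extraction as in the question, written to a file instead of a variable.
awk '/string1/{flag=1; next} /string2/{flag=0} flag' "$workdir/text.txt" > "$workdir/block.txt"

# r appends the contents of block.txt after every line matching string3.
result=$(sed "/string3/r $workdir/block.txt" "$workdir/stat.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because the block never passes through the sed script itself, its lines need no trailing backslashes.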

bash while loop failing while using sed

I am facing an issue with sed in a while-loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string is matched, I want to replace the matched string of file1 with the file2 string.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file_1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. The while-loop reads each line of file1, which is N operations. For every one of those lines, sed reprocesses the whole of file1, making it an N*N process. On top of that, every sed call runs grep over file2. If file2 has M lines, this becomes an N*N*M process.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update writes nothing to standard output, so file3 will be empty.
You are reading file1 with the while-loop while updating file1 with sed at the same time. I don't know how this will behave, but I don't believe it is healthy.
If $b is not in file2 you would, according to your logic, have a line with only a single column. This is not what you expect.
A fix of your script, would be this:
while read -r a b; do
c=$(grep "$b" file2)
[[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
This is still not efficient, but it is already down to M*N. The best way is to use awk.
note: as a novice, always parse your script with http://www.shellcheck.net
note: as a professional, always parse your script with http://www.shellcheck.net
Could you please try following.
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one liner form of solution:
awk '
FNR==NR{
a[$2]=$1
next
}
{
for(i in a){
if(match($0,"^"i)){
print a[i],$0
continue
}
}
}
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk code from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
a[$2]=$1 ##Creating array a whose index is $2 and value is $1.
next ##next will skip all further statements from here.
}
{ ##Statements from here will run for 2nd Input_file only.
for(i in a){ ##Traversing through array a all elements here.
if(match($0,"^"i)){ ##Checking condition if current line matches index of current item from array a then do following.
print a[i],$0 ##Printing array a whose index is i and current line here.
continue ##Again take cursor to for loop.
}
}
}
' Input_file1 Input_file2 ##Mentioning all Input_file names here.
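For a quick check against the sample from the question, here is a small variant of the same idea. It is only a sketch with two deliberate changes: index($0, k) == 1 does a literal prefix test, so keys containing regex metacharacters are not interpreted as patterns, and break leaves the loop after the first hit. The names tmp, key and merged are made up for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1 1234' '2 8765' > "$tmp/file1"
printf '%s\n' '12345' '34567' '87654' > "$tmp/file2"

merged=$(awk '
  FNR == NR { key[$2] = $1; next }      # first file: remember column 2 -> column 1
  {
    for (k in key)
      if (index($0, k) == 1) {          # line of file2 starts with the key
        print key[k], $0
        break
      }
  }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$merged"

rm -rf "$tmp"
```

The output follows the line order of file2, which happens to match the expected output for this sample.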

Merging Lines using sed

I have text file that consists of 45999 lines. Each line has a word (unigram). I want to create two-sequential words (bigrams). For example:
apple
pie
red
vine
I want 'apple pie', 'pie red', 'red vine'. I tried with sed 'N;s/\n/ /' but it creates just 'apple pie' and 'red vine'. How can I solve this problem? Thank you..
Could you please try following if you are ok with awk.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
print s1 $(i-1) s1, s1 $i s1
}
}' Input_file
Output will be as follows.
'apple','pie'
'pie','red'
'red','vine'
2nd solution: since the OP's expected output is not clear, adding this one too.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
val=(val?val OFS:"")s1 $(i-1) s1 OFS s1 $i s1
}
}
END{
print val
}' Input_file
Output will be as follows.
'apple','pie','pie','red','red','vine'
This might work for you (GNU sed):
sed -nE 'N;s/\n(.*)/ \1&/;P;D' file
Append the next line to the current line, then replace the newline by a space and append the second line again. Print/delete the first line and repeat.
N.B. This does not print the last line as it is not a pair, if the last line is needed use:
sed -E 'N;s/\n(.*)/ \1&/;P;D' file
If the output is to be printed as a single line with each pair surrounded by single quotes and separated by a comma, use:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/ (\S+)$/ '\''\1'\''/' file
Or:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/, \S+$//' file
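If plain 'apple pie', 'pie red', 'red vine' lines are all that is needed, remembering the previous line in awk may be the simplest route. A minimal sketch, using the four words from the question as input (the variable names prev and bigrams are just for illustration):

```shell
bigrams=$(printf '%s\n' apple pie red vine |
  awk 'NR > 1 { print prev, $0 } { prev = $0 }')   # pair each line with the one before it
printf '%s\n' "$bigrams"
```

This reads the file once and keeps only one line in memory, so 45999 lines should not be a problem.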

SED code for removing newline

I am looking for sed command which will transform following line:
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGT
CATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAG
TGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
into
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAGTGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
In other words, the newline at the end of a line starting with > should stay unchanged, while all other newlines should be removed so that the sequence lines are joined.
I have tried with the following line, but it is not working:
sed s/^!>\n$// <in.fasta>out.fasta
I have a 28MB fasta file which I need to transform.
sed is not a particularly good tool for this.
awk '/^>/ { if(prev) printf "\n"; print; next }
{ printf "%s", $0; prev = 1; }
END { if(prev) printf "\n" }' in.fasta >out.fasta
Using awk:
awk '/^>/{print (l?l ORS:"") $0;l="";next}{l=l $0}END{print l}' file
The buffered sequence is printed when a line starting with > or the end of the file is reached; otherwise the current line is appended to the buffer variable l.
The following awk may also help you here. It is a solution that does not use any arrays or variables.
awk 'BEGIN{ORS=""} /^>/{if(FNR==1){print $0 RS} else {print RS $0 RS};next}1' Input_file
OR
awk 'BEGIN{ORS=""} /^>/{printf("%s",FNR==1?$0 RS:RS $0 RS);next}1' Input_file
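To see the buffering idea at work on more than one record, here is a sketch run against a tiny made-up FASTA snippet (the headers and sequences are hypothetical, and the names seq and joined are just for the example). It prints a buffered sequence only when it is non-empty, so the output always ends with a newline and an empty input produces no blank line.

```shell
joined=$(awk '
  /^>/ { if (seq != "") print seq; seq = ""; print; next }   # flush buffer, then header
  { seq = seq $0 }                                           # glue sequence lines together
  END { if (seq != "") print seq }                           # flush the last record
' <<'EOF'
>rec1 hypothetical header
AGAC
CCGG
>rec2 hypothetical header
TTAA
GG
EOF
)
printf '%s\n' "$joined"
```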

Replace a character with # (hash symbol) only in 5th & 6th field

I am trying to replace a character with # (hash symbol) only in the 5th & 6th field.
e.g. I have to replace 'Z' with '#' only in the 5th & 6th field (using a perl or awk script). The remaining fields containing the 'Z' symbol should not be affected.
(I'm just updating the post to replace the double quote (") instead of Z with #. Can I achieve this? Thanks for the precious help.)
eg: i/p file:
aa",bb,ccc,ddd,eee",ddd",fff
aa1",ba1,ccc1,"ddd1,eee"1,ddd1,fff1
z,aa2,bb2",ccc2,ddd2","eee2",ddd2,fff2"
Expected O/p file:
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,#ddd1,eee#1,ddd1,fff1
aa2,bb2",ccc2,ddd2#,#eee2#,ddd2,fff2"
Thanks.
$ awk 'BEGIN{FS=OFS=","} {for (i=5;i<=6;i++) gsub(/Z/,"#",$i)} 1' file
x,aaZ,bb,ccc,ddd,eee#,dddZ,fff
y,aa1Z,ba1,ccc1,#ddd1,eee#1,ddd1,fff1
z,aa2,bb2Z,ccc2,ddd2#,#eee2,ddd2,fff2Z
Since it's only two fields, the loop can be omitted.
awk -F, -v OFS=, '{gsub(/Z/,"#",$5);gsub(/Z/,"#",$6)} 1' file
x,aaZ,bb,ccc,ddd,eee#,dddZ,fff
y,aa1Z,ba1,ccc1,#ddd1,eee#1,ddd1,fff1
z,aa2,bb2Z,ccc2,ddd2#,#eee2,ddd2,fff2Z
To replace " in fifth and sixth field:
awk -F, -v OFS=, '{gsub(/\"/,"#",$5);gsub(/\"/,"#",$6)} 1' file
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,"ddd1,eee#1,ddd1,fff1
z,aa2,bb2",ccc2,ddd2#,#eee2#,ddd2,fff2"
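If the list of fields may change later, the field numbers can be passed in rather than hard-coded. A sketch under that assumption (the function name replace_in_cols and the cols variable are made up here); it also skips field numbers beyond NF, so that short lines are not padded with empty fields:

```shell
# replace_in_cols: hypothetical helper.
# $1 is a comma-separated list of field numbers; input is read from stdin.
replace_in_cols() {
  awk -F, -v OFS=, -v cols="$1" '
    BEGIN { n = split(cols, c, ",") }
    {
      for (i = 1; i <= n; i++)
        if (c[i] + 0 <= NF)             # skip fields the line does not have
          gsub(/"/, "#", $(c[i]))
    }
    1'
}

replaced=$(replace_in_cols 5,6 <<'EOF'
aa",bb,ccc,ddd,eee",ddd",fff
aa1",ba1,ccc1,"ddd1,eee"1,ddd1,fff1
z,aa2,bb2",ccc2,ddd2","eee2",ddd2,fff2"
EOF
)
printf '%s\n' "$replaced"
```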
Here is a Perl way to do the job:
perl -anF, -e '$"=","; s/Z/#/ for (@F)[4,5];print"@F";' < in1.txt
If you have multiple Z in a field, you could use:
perl -anF, -e '$"=","; s/Z/#/g for (@F)[4,5];print"@F";' < in1.txt
Output:
aaZ,bb,ccc,ddd,eee#,ddd#,fff
aa1Z,ba1,ccc1,Zddd1,eee#1,ddd1,fff1
aa2,bb2Z,ccc2,ddd2Z,#eee2,ddd2,fff2Z
Edit according to comment:
in1.txt
aa",bb,ccc,ddd,eee",ddd",fff
aa1",ba1,ccc1,"ddd1,eee"1,ddd1,fff1
aa2,bb2",ccc2,ddd2","eee2,ddd2,fff2"
Command:
perl -anF'','' -e '$"=",";s/"/#/ for (@F)[4,5];print"@F";' < in1.txt
result:
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,"ddd1,eee#1,ddd1,fff1
aa2,bb2",ccc2,ddd2",#eee2,ddd2,fff2"