while loop provides just one block of results using awk - command-line

I am processing my text file using awk. I have written the code below:
#!/bin/bash
l=1
while [ $l -lt 5 ]
do
echo $l
awk -v L=$l '/^BS[0-5]|^FG[2-7]/ && length<10 {i++}i==L {print}'
l=$(expr $l + 1)
done <input.txt
But once I run the code, I only get the first awk output.
Would you please let me know how I can fix this code?

The while loop is taking its input from the file input.txt. awk inherits its input from the while loop and reads all of it on the first iteration, so on the second iteration awk gets no data. If you want to read from that file on each iteration, pass input.txt as an argument to awk.
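A minimal sketch of the fixed loop; the sample input.txt below is hypothetical, since the real file isn't shown in the question:

```shell
# hypothetical sample data matching the patterns in the question
printf 'BS1 a\nFG3 b\nBS2 c\nFG5 d\nBS3 e\n' > input.txt

l=1
while [ "$l" -lt 5 ]; do
  # give awk the file explicitly so each iteration re-reads it
  awk -v L="$l" '/^BS[0-5]|^FG[2-7]/ && length<10 {i++} i==L' input.txt
  l=$((l + 1))
done
```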
Better yet, skip the while loop and do the whole thing in awk:
awk '/^BS[0-5]|^FG[2-7]/ && length<10 && ++i < 5' input.txt

You do not read the lines of your file. Do:
while read line && [ $l -lt 5 ]
do
...

Related

Validate if a text file contains identical records at specific line numbers?

My command looks like:
for i in *.fasta ; do
parallel -j 10 python script.py $i > $i.out
done
I want to add a test condition to this loop so that it only executes the parallel python script if there are no identical lines in the .fasta file.
An example .fasta file is below:
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
AAAAAAAAACGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGTTGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
An example .fasta file that I would like excluded, because lines 2 and 4 are identical:
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
The input files always have 4 lines exactly, and lines 2 and 4 are always the lines to be compared.
I've been using sort file.fasta | uniq -c to see if there are identical lines, but I don't know how to incorporate this into my bash loop.
EDIT:
command:
for i in read_00.fasta ; do lines=$(awk 'NR % 4 == 2' $i | sort | uniq -c | awk '$1 > 1'); if [ -z "$lines" ]; then echo $i >> not.identical.txt; fi; done
read_00.fasta:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
>mut_1_2964_0
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
Verify the content of those specific lines with the awk below, exiting with failure when the lines are identical and with success otherwise (instead of exiting, you can print or do whatever else you need):
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "./$yourFile"
or, to print the file name instead when the 2nd and 4th lines differ (FNR is used here so the line count resets for each file in the glob):
awk 'FNR==2{ prev=$0 } FNR==4{ if(prev!=$0) print FILENAME }' ./*.fasta
Using the exit status of the first command, you can then easily run your second command conditionally, like:
for file in ./*.fasta; do
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$file" &&
{ parallel -j 10 python script.py "$file" > "$file.out"; }
done
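As a quick sanity check, here is a hypothetical demo with one good and one bad file, using echo as a stand-in for the parallel call:

```shell
# hypothetical 4-line fasta files: dup.fasta has identical lines 2 and 4
printf '>ref2\nAAAA\n>mut\nAAAA\n' > dup.fasta
printf '>ref2\nAAAA\n>mut\nCCCC\n' > ok.fasta

for file in dup.fasta ok.fasta; do
  awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$file" &&
    echo "would process $file"   # stand-in for the parallel command
done
```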

How to apply the output of one command inside another sed command?

I have a command that extracts the lines between two string patterns, 'string1' and 'string2'. The result is stored in a variable called 'var1'.
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
This command works well and the output is a set of lines.
Do you hear the people sing?
Singing a song of angry men?
It is the music of a people
Who will not be slaves again
I want the output of the above command to be inserted after a string pattern 'string3' in another file called stat.txt. I used sed as follows
sed '/string3/a'$var1'' stat.txt
I am having trouble getting the new output. Here, $var1 seems to be inserted only partially, i.e. only one line:
string3
Do you hear the people sing?
Any other suggestions to solve this?
I would be tempted to use sed to extract the lines, and awk to insert them into the other text:
lines=$(sed -n '/string1/,/string2/ p' text.txt)
awk -v new="$lines" '{print} /string3/ {print new}' stat.txt
or perhaps both tasks in a single awk call
awk '
NR == FNR && /string1/ {flag = 1}
NR == FNR && /string2/ {flag = 0}
NR == FNR && flag {lines = lines $0 ORS}
NR == FNR {next}
{print}
/string3/ {printf "%s", lines} # it already ends with a newline
' text.txt stat.txt
It's a data format problem...
Appending a multi-line block of text with the sed append command requires that every line in the block ends with a \ -- except for the last line of the block. So if we take the two lines of code that didn't work in the question and reformat the text as required by the append command, the original approach works as expected:
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
var1="$(sed '$!s/$/\\/' <<< "$var1")"
sed '/string3/a '"$var1" stat.txt
Note that the 2nd line above contains a bashism. A more portable version would be:
var1="$(echo "$var1" | sed '$!s/$/\\/')"
Either variant would convert $var1 to:
Do you hear the people sing?\
Singing a song of angry men?\
It is the music of a people\
Who will not be slaves again
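Another option that sidesteps the quoting entirely is sed's r command, which reads the block to insert from a file. A sketch assuming the same marker layout as the question (the sample text.txt and stat.txt here are hypothetical):

```shell
# hypothetical sample files mirroring the question
printf 'string1\nDo you hear the people sing?\nSinging a song of angry men?\nstring2\n' > text.txt
printf 'before\nstring3\nafter\n' > stat.txt

# extract the block (excluding the marker lines) into a temp file,
# then let sed's r command insert its contents after string3
sed -n '/string1/,/string2/{/string1/d;/string2/d;p;}' text.txt > block.tmp
sed '/string3/r block.tmp' stat.txt
```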

bash while loop failing while using sed

I am facing an issue using sed in a while loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string matches, replace the matched string in file1 with the string from file2.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. The while loop reads each of the N lines of file1. For every one of those lines, sed reprocesses the full file1 and the command substitution runs grep over file2. If file2 has M lines, this makes it roughly an N*(N+M) process, which is very inefficient.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update writes no output to stdout, so file3 will be empty.
You are reading file1 with the while loop and at the same time updating file1 with sed. I don't know exactly how this will behave, but I don't believe it is healthy.
If $b is not in file2 you would, according to your logic, get a line with only a single column. This is not what you expect.
A fix of your script, would be this:
while read -r a b; do
c=$(grep "$b" file2)
[[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
which is still not efficient, but it is already M*N. The best way is to use awk.
note: as a novice, always parse your script with http://www.shellcheck.net
note: as a professional, always parse your script with http://www.shellcheck.net
Could you please try the following.
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one-liner form of the solution:
awk '
FNR==NR{
a[$2]=$1
next
}
{
for(i in a){
if(match($0,"^"i)){
print a[i],$0
continue
}
}
}
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk code from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
a[$2]=$1 ##Creating array a whose index is $2 and value is $1.
next ##next will skip all further statements from here.
}
{ ##Statements from here will run for 2nd Input_file only.
for(i in a){ ##Traversing through array a all elements here.
if(match($0,"^"i)){ ##Checking condition if current line matches index of current item from array a then do following.
print a[i],$0 ##Printing array a whose index is i and current line here.
continue ##Again take cursor to for loop.
}
}
}
' Input_file1 Input_file2 ##Mentioning all Input_file names here.
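With the sample file1 and file2 from the question, the one-liner produces the expected output:

```shell
# sample data taken from the question
printf '1 1234\n2 8765\n' > file1
printf '12345\n34567\n87654\n' > file2

# for each file2 line, print the file1 key whose 2nd column is a prefix of it
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
# → 1 12345
#   2 87654
```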

Required suggestions to optimize a piece of unix ksh code

I'm new to shell scripting and am hoping for some guidance on how to optimize the following piece of code to avoid unnecessary loops.
The file "DD.$BUS_DT.dat" is a pipe delimited file and contains 4 columns. Sample data in DD.2015-05-19.dat will be as follows
cust portal|10|10|0
sys-b|10|10|0
Code
i=0;
sed 's///g;s/[0-9]//g' ./DD.$BUS_DT.dat > ./temp-processed.dat
set -A sourceList
while read line
do
#echo $line
case $line in
'cust portal') sourceList[$i]=custportal;;
*) sourceList[$i]=${line};;
esac
(( i += 1));
done < ./temp-processed.dat;
echo ${sourceList[@]};
i=0;
while [[ i -lt ${#sourceList[@]} ]]; do
print ${sourceList[i]} >> ./processed-$BUS_DT.dat
(( i += 1))
done
My goal is to read the data from the first column of the file without spaces so that the output should be like ...
custportal
sys-b
Your help will be appreciated.
I haven't gone through all of your script, but if you just want to get the first column of the |-separated columns, stripping any spaces it may contain, you can use awk like this:
$ awk -F"|" '{gsub(" ","",$1); print $1}' file
custportal
sys-b
This uses | as the field separator, replaces all spaces in the first field with an empty string, and prints the result.

Any way to find if two adjacent new lines start with certain words?

Say I have a file like so:
+jaklfjdskalfjkdsaj
fkldsjafkljdkaljfsd
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
I only want to find and count the number of times in this file that a line starting with - is immediately followed by a line starting with +.
Rules:
No external scripts
Must be done from within a bash script
Must be inline
I could figure out how to do this in a Python script, for instance, but I've never had to do something this extensive in Bash.
Could anyone help me out? I figure it'll end up being grep, perl, or maybe a talented sed line -- but these are things I'm still learning.
Thank you all!
grep -A1 "^-" "$file" | grep "^+" | wc -l
The first grep finds all of the lines starting with -, and the -A1 causes it to also output the line after the match too.
We then grep that output for any lines starting with +. Logically:
We know the output of the first grep is only the -XXX lines and the following lines
We know that a +xxx line cannot also be a -xxx line
Therefore, any +xxx lines must be following lines, and should be counted, which we do with wc -l
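The same count can also be done in a single awk pass, remembering whether the previous line started with -. A small sketch (sample.txt is hypothetical data):

```shell
# hypothetical sample: three "-" lines immediately followed by "+" lines
printf -- '-a\n+b\nx\n-c\n+d\n-e\n-f\n+g\n' > sample.txt

# p remembers whether the previous line started with "-";
# count a hit whenever the current line starts with "+" and p is set
awk '/^\+/ && p {c++} {p = /^-/} END {print c+0}' sample.txt
# → 3
```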
Easy in Perl:
perl -lne '$c++ if $p and /^\+/; $p = /^-/ }{ print $c' FILE
awk one-liner:
awk -v FS='' '{x=x sprintf("%s", $1)}END{print gsub(/-\+/,"",x)}' file
e.g.
kent$ cat file
+jaklfjdskalfjkdsaj
fkldsjafkljdkaljfsd
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
-
-
-
+
-
+
foo
+
kent$ awk -v FS='' '{x=x sprintf("%s", $1)}END{print gsub(/-\+/,"",x)}' file
3
Another Perl example. Not as terse as choroba's, but more transparent in how it works:
perl -e'while (<>) { $last = $cur; $cur = $_; print $last, $cur if substr($last, 0, 1) eq "-" && substr($cur, 0, 1) eq "+" }' < infile
Output:
-jslakflkdsalfkdls;
+sdjafkdjsakfjdskal
Pure bash:
unset c p
while read line ; do
[[ $line == +* && $p == 0 ]] && (( c++ ))
[[ $line == -* ]]
p=$?
done < FILE
echo $c