How to print some free text in addition to SED extract - sed

The well-known sed command to extract the first line and append it to another file
sed -n '1 p' /p/raw.txt | cat >> /p/001.txt ;
gives an output in /p/001.txt like
John Doe
How can I modify the command above to add some free text so that the output looks like, for example,
Name: John Doe
Thanks for any hint to try.

You can do that in a single command (and no sub-shells):
sed 's/^/Name: /;q' /p/raw.txt >> /p/001.txt
This prefixes "Name: " in front of the first line, prints it, then quits so you don't process additional lines. Add a line number before the q to print all lines up to (and including) that number. The output is appended to /p/001.txt just like your original code.
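A quick way to try this on throwaway data is sketched below. The scratch files and the sample names are made up for illustration; they stand in for /p/raw.txt and /p/001.txt.

```shell
# Scratch files stand in for /p/raw.txt and /p/001.txt (hypothetical sample data).
raw=$(mktemp)
out=$(mktemp)
printf '%s\n' 'John Doe' 'Jane Roe' 'Sam Poe' > "$raw"

# Prefix the first line only, then quit.
sed 's/^/Name: /;q' "$raw" >> "$out"

# Prefix and print the first two lines, then quit after line 2.
sed 's/^/Name: /;2q' "$raw" >> "$out"

cat "$out"
```

Because both commands append, the output file ends up with three prefixed lines: the first name once from the first command, then the first two names from the second.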
If you want a range of lines:
sed -n '3,9{s/^/Name: /;p;};9q' /p/raw.txt >> /p/001.txt
This reads from lines 3-9, performs the substitution, prints, then quits after line 9.
If you want specific lines, I recommend awk:
awk 'NR==3 || NR==9 { print "Name: " $0 } NR>=9 { exit }' /p/raw.txt >> /p/001.txt
This has two clauses. One says the number of record (line number) is either 3 or 9, in which case we print the prefix and the line. The other tells us to stop reading the file after the 9th record.
Here are two more commands to show how awk can act on just the first line(s) or a given range:
awk '{ print "Name: " $0 } NR >= 1 { exit }' /p/raw.txt >> /p/001.txt
awk '3 <= NR { print "Name: " $0 } NR >= 9 { exit }' /p/raw.txt >> /p/001.txt
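To see what the specific-line and range variants select, here is a small sketch run against ten numbered lines. The input is hypothetical and only stands in for /p/raw.txt.

```shell
# Hypothetical input: ten numbered lines instead of /p/raw.txt.
raw=$(mktemp)
seq 1 10 | sed 's/^/line /' > "$raw"

# Specific lines (3 and 9), stopping after line 9.
picked=$(awk 'NR==3 || NR==9 { print "Name: " $0 } NR>=9 { exit }' "$raw")

# A contiguous range (3 through 9); count how many lines come out.
ranged=$(awk '3 <= NR { print "Name: " $0 } NR >= 9 { exit }' "$raw" | wc -l | tr -d ' ')

printf '%s\n' "$picked"
echo "lines in range: $ranged"
```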
It appears you're continuously building one file from the other. Consider:
tail -Fn0 /p/raw.txt |sed 's/^/Name: /' >> /p/001.txt
This will run continuously, adding only new entries (added after the command is run) to /p/001.txt
Perhaps you have lots of duplicates to resolve?
awk 'NR != FNR { $0 = "Name: " $0 } !s[$0]++' \
/p/001.txt /p/raw.txt > /tmp/001.txt && mv /tmp/001.txt /p/001.txt
This folds together the previously saved names with any new names, printing each name only once. The pattern !s[$0]++ is true when s[$0] is zero (its default state); after the evaluation it is incremented to one, so the pattern is false on the second and later occurrences. When a pattern has no action, the line is printed. Because we're reading the output file, we need a temporary output file; once awk completes successfully, we move it over the target output file.
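A minimal sketch of that merge on scratch files, assuming made-up names (the temporary files stand in for /p/001.txt and /p/raw.txt):

```shell
# Hypothetical scratch files standing in for /p/001.txt and /p/raw.txt.
saved=$(mktemp); raw=$(mktemp); tmp=$(mktemp)
printf '%s\n' 'Name: John Doe' > "$saved"
printf '%s\n' 'John Doe' 'Jane Roe' 'Jane Roe' > "$raw"

# Lines from the second file get the prefix; !s[$0]++ keeps first occurrences only.
awk 'NR != FNR { $0 = "Name: " $0 } !s[$0]++' "$saved" "$raw" > "$tmp" &&
    mv "$tmp" "$saved"

cat "$saved"
```

John Doe is already saved, so only Jane Roe is added, and only once.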

printf "Name: %s\n" "$(sed -n '1p;q' /p/raw.txt)" >> /p/001.txt
should do it. Alternatively, with echo:
echo -e "Name: $(sed -n '1p;q' /p/raw.txt)" >> /p/001.txt
Note
The q command makes sed quit without processing any more commands or input.
The -e option tells echo to interpret backslash escape sequences. It is not needed for this output, and its behavior varies between shells, so printf is the more portable choice.

Related

Validate if a text file contains identical records at specific line's number?

my command looks like:
for i in *.fasta ; do
parallel -j 10 python script.py $i > $i.out
done
I want to add a test condition to this loop where it only executes the parallel python script if there are no identical lines in the .fasta file
an example .fasta file below:
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
AAAAAAAAACGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGTTGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
an example .fasta file that I would like excluded because lines 2 and 4 are identical.
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
The input files always have 4 lines exactly, and lines 2 and 4 are always the lines to be compared.
I've been using sort file.fasta | uniq -c to see if there are identical lines, but I don't know how to incorporate this into my bash loop.
EDIT:
command:
for i in read_00.fasta ; do lines=$(awk 'NR % 4 == 2' $i | sort | uniq -c | awk '$1 > 1'); if [ -z "$lines" ]; then echo $i >> not.identical.txt; fi; done
read_00.fasta:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
>mut_1_2964_0
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
The awk command below checks those specific lines: it exits with a failure status when lines 2 and 4 are identical and with success otherwise (instead of exiting, you can print or do whatever you need):
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "./$yourFile"
Or, to print the file name when the 2nd and 4th lines differ (FNR restarts at 1 for each file, so all files are checked in a single call):
awk 'FNR==2{ prev=$0 } FNR==4 && prev!=$0 { print FILENAME }' ./*.fasta
Using the exit status of the first command, you can then run your second command only for the files that pass:
for file in ./*.fasta; do
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$file" &&
{ parallel -j 10 python script.py "$file" > "$file.out"; }
done
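The exit-status test can be tried without parallel or the Python script. The sketch below uses two tiny made-up FASTA files in a scratch directory and only records which files would be processed.

```shell
# Two hypothetical four-line FASTA files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '>ref' 'ACGT' '>mut' 'ACGA' > "$dir/differ.fasta"
printf '%s\n' '>ref' 'ACGT' '>mut' 'ACGT' > "$dir/same.fasta"

kept=""
for file in "$dir"/*.fasta; do
    # awk exits 0 when lines 2 and 4 differ, 1 when they are identical.
    if awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$file"; then
        kept="$kept ${file##*/}"
    fi
done
echo "would process:$kept"
```

Only differ.fasta passes the check; same.fasta is skipped because its second and fourth lines match.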

How to apply one command into another sed command?

I have one command which is used to extract lines between two string patterns 'string1' and 'string2'. This is stored in variable called 'var1'.
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
This command works well and the output is a set of lines.
Do you hear the people sing?
Singing a song of angry men?
It is the music of a people
Who will not be slaves again
I want the output of the above command to be inserted after a string pattern 'string3' in another file called stat.txt. I used sed as follows
sed '/string3/a'$var1'' stat.txt
I am having trouble getting the new output. Here, the $var1 seems to be working partially i.e. only one line -
string3
Do you hear the people sing?
Any other suggestions to solve this?
I would be tempted to use sed to extract the lines (note that this range also includes the string1 and string2 lines themselves), and awk to insert them into the other text:
lines=$(sed -n '/string1/,/string2/ p' text.txt)
awk -v new="$lines" '{print} /string3/ {print new}' stat.txt
or perhaps both tasks in a single awk call
awk '
NR == FNR && /string1/ {flag = 1; next} # skip the marker line itself
NR == FNR && /string2/ {flag = 0}
NR == FNR && flag {lines = lines $0 ORS}
NR == FNR {next}
{print}
/string3/ {printf "%s", lines} # it already ends with a newline
' text.txt stat.txt
It's a data format problem...
Appending a multi-line block of text with the sed append command requires every line in the block to end with a \, except the last line of that block. The variable must also be double-quoted so the shell does not split it into separate words. So if we take the two lines of code from the question and reformat the text as the append command requires, the code works as expected (the one-line "a text" form used here is a GNU sed extension):
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
var1="$(sed '$!s/$/\\/' <<< "$var1")"
sed "/string3/a $var1" stat.txt
Note that the 2nd line above contains a bashism. A more portable version would be:
var1="$(echo "$var1" | sed '$!s/$/\\/')"
Either variant would convert $var1 to:
Do you hear the people sing?\
Singing a song of angry men?\
It is the music of a people\
Who will not be slaves again
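A different technique that avoids the escaping altogether is sed's r command, which reads a file and emits it after each matching line. This is a sketch on made-up sample files, not the approach from the answers above; the scratch directory and the block.txt name are assumptions for illustration.

```shell
# Hypothetical sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'intro' 'string1' 'Do you hear the people sing?' \
    'Singing a song of angry men?' 'string2' 'outro' > "$dir/text.txt"
printf '%s\n' 'before' 'string3' 'after' > "$dir/stat.txt"

# Extract the lines strictly between the two markers into a file...
awk '/string1/{flag=1; next} /string2/{flag=0} flag' "$dir/text.txt" > "$dir/block.txt"

# ...and let sed read that file after every line matching string3.
result=$(sed "/string3/r $dir/block.txt" "$dir/stat.txt")
printf '%s\n' "$result"
```

Since the block is read from a file, backslashes and line breaks in it need no special treatment.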

Conditional substitution of patterns in bash strings depending on the beginning of a string

I am new to bash, so excuse me if I do not use the right terms.
I need to substitute certain patterns of six characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.
This is an example of input:
chr1:123-123 5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3
chr1:456-456 5TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG3
chr1:789-789 5GGGCTAGGGTTAGGGTTAGGGTTA3
chr1:123-123 etc is the name of the string, they are separated from the string I need to work with by a tab. The string I need to work with is delimited by characters 5 and 3, but I can change them.
I want all patterns containing T, A, G in any one of these orders to be substituted with X: TTAGGG, TAGGGT, AGGGTT, GGGTTA, GGTTAG, GTTAGG.
Similarly, patterns containing CTAGGG, like row 3, in orders similar to the previous one will be substituted with a different character.
The game is repeated with some specific differences for all the 6 characters composing each pattern.
I started writing something like this:
#!/bin/bash
NORMAL=`echo "\033[m"`
RED=`echo "\033[31m"` #red
#read filename for the input file and create a copy and a folder for the output
read -p "Insert name for INPUT file: " INPUT
echo "Creating OUTPUT file " "${RED}"$INPUT"_sub.txt${NORMAL}"
mkdir -p ./"$INPUT"_OUTPUT
cp $INPUT.txt ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
echo
#start the first set of instructions
perfrep
#starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism
Instructions are
perfrep() {
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGGT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGGTT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGTTAG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
# starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism(){
sed -i -e 's/[GCA]TAGGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/G[GCA]TAGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GG[GCA]TAG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGG[GCA]TA/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGG[GCA]T/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGG[GCA]/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
I will need to repeat also with T[GCA]AGGG, TT[TCG]GGG, TTA[ACT]GG, TTAG[ACT]G and TTAGG[ACT].
Using this procedure, I get these results for the inputs shown
5GGGXXXXTTA3
5XXXXX3
5GGGLXXTTA3
From my point of view, for my job, the first and second strings are both made of X repeated five times, and the order of characters is just slightly different. On the other hand, the third one could be masked like this:
5LXXX3
How do I tell the script that, if the string starts with 5GGGTTA instead of 5TTAGGG, it must start substituting with
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
instead of
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
?
I will need to repeat with all cases; for instance, if the string starts with GTTAGG I will need to start with
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
and so on, and add a couple of variation of my pattern.
I need to repeat the substitution with TTAGGG and the variations for all the rows of my input file.
Sorry for the very long question. Thank you all.
Adding information asked by Varun.
Patterns of 6 characters would be TTAGGG , [GCA]TAGGG , T[GCA]AGGG , TT[TCG]GGG , TTA[ACT]GG , TTAG[ACT]G , TTAGG[ACT].
Each one must be checked for a different frame, for instance for TTAGGG we have 6 frames TTAGGG , GTTAGG , GGTTAG, GGGTTA , AGGGTT , TAGGGT.
The same frames must be applied to the pattern containing a variable position.
I will have a total of 42 patterns to check, divided in 7 groups: one containing TTAGGG and derivative frames, 6 with the patterns with a variable position and their derivatives.
TTAGGG and derivatives are the most important and need to be checked first.
#! /usr/bin/awk -f
# generate a "frame" by moving the first char to the end
function rotate(base){ return substr(base,2) substr(base,1,1) }
# Unfortunately awk arrays do not store regexps
# so I am generating the list of derivative strings to match
function generate_derivative(frame,arr, j,k,head,read,tail) {
for(j=1; j<=length(frame); j++) {
head=substr(frame,1,j-1);
read=substr(frame,j,1);
tail=substr(frame,j+1);
for( k=1; k<=3; k++) {
# use a global index to simplify
arr[++Z]= head substr(snp[read],k,1) tail
}
}
}
BEGIN{
FS = OFS = "\t"; # awk variable names are case-sensitive; OFS keeps the tab when $2 is modified
# alternatives to a base
snp["A"]="TCG"; snp["T"]="ACG"; snp["G"]="ATC"; snp["C"]="ATG";
# the primary target
frame="TTAGGG";
Z=1; # warning GLOBAL
X[Z] = frame;
# primary derivatives
generate_derivative(frame, X);
xn = Z;
# secondary shifted targets and their derivatives
for(i=1; i<length(frame); i++){
frame = rotate(frame);
L[++Z] = frame;
generate_derivative(frame, L);
}
}
/^chr[0-9:-]*\t5[ACTG]*3$/ {
# because we care about the order of the primary matches
for (i=1; i<=xn; i++) {gsub(X[i],"X",$2)}
# since we don't care about the order of the secondary matches
for (hit in L) {gsub(L[hit],"L",$2)}
print
}
END{
# print the matches in the order they are generated
#for (i=1; i<=xn; i++) {print X[i]};
#print ""
#for (i=1+xn; i<=Z; i++) {print L[i]};
}
If you can generate a static matching order you can live with, then something like the above awk script could work. But you say the primary patterns should take precedence and also that a secondary rule would be better applied first in some cases; a fixed order cannot do both.
If you need more flexible matching, I would suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars".
But then you are not in a bash shell anymore.
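To check which frames the rotate() helper produces, it can be run on its own. This is a small sketch that lists the six rotations of TTAGGG; the surrounding shell wrapper and the variable names are assumptions for illustration.

```shell
# List every rotation ("frame") of the primary target, in generation order.
frames=$(awk '
function rotate(s) { return substr(s, 2) substr(s, 1, 1) }
BEGIN {
    frame = "TTAGGG"
    n = length(frame)
    for (i = 1; i <= n; i++) {
        printf "%s%s", frame, (i < n ? " " : "\n")
        frame = rotate(frame)   # move the first character to the end
    }
}')
echo "$frames"
```

These are the same six frames listed in the question, generated in a different order.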

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or awk one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed; note that it inserts a space after every four characters rather than dividing the line into four equal parts:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
(.{4}) captures each run of four characters
\1 refers to the captured group (the part surrounded by the parentheses), and the space after it in the replacement adds a space behind the group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a POSIX-compliant awk you can omit the --posix. GNU awk before version 4.0 needs --posix (or --re-interval) to enable interval expressions such as .{5}, and since gawk seems to be the most commonly used implementation I've given the solution in terms of gawk.
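If interval expressions are a concern, the same split can be done with substr alone. This is a minimal sketch, assuming a hypothetical helper named split4 and a chunk width equal to the line length divided by the number of parts, rounded up; very short lines can therefore yield fewer or more than four chunks.

```shell
# split4: hypothetical helper that cuts each input line into chunks of
# ceil(length / parts) characters, separated by single spaces.
split4() {
    awk -v parts=4 '{
        len = length($0)
        width = int((len + parts - 1) / parts)   # ceiling of len/parts
        out = ""
        for (pos = 1; pos <= len; pos += width)
            out = out (pos > 1 ? " " : "") substr($0, pos, width)
        print out
    }'
}

echo "ATGCATHLMNPHLNTPLML" | split4
```

For the 19-character sample the width is 5, giving three chunks of five characters and a final chunk of four, with no trailing space.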
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop label
/^\n/bb if we reach a newline we are done and branch to the b label
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = int((length() + $ENV{cols} - 1) / $ENV{cols}) || 1; while (/(.{1,$fw})/g) { print $1 . " " } print "\n"'
This recalculates the field width (the line length divided by cols, rounded up) for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "($len + $cols - 1) / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

How to delete multiple empty lines with SED?

I'm trying to compress a text document by deleting duplicated empty lines with sed. This is what I'm doing (to no avail):
sed -i -E 's/\n{3,}/\n/g' file.txt
I understand that it's not correct, according to this manual, but I can't figure out how to do it correctly. Thanks.
I think you want to replace spans of multiple blank lines with a single blank line, even though your example replaces multiple runs of \n with a single \n instead of \n\n. With that in mind, here are two solutions:
sed '/^$/{ :l
N; s/^\n$//; t l
p; d; }' input
In many implementations of sed, that can be all on one line, with the embedded newlines replaced by ;.
awk 't || !/^$/; { t = !/^$/ }'
As tripleee suggested above, I'm using Perl instead of sed:
perl -0777pi -e 's/\n{3,}/\n\n/g' file.txt
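Use tr (translate):
tr -s '\n' < file.txt
The -s (--squeeze-repeats) option reduces each run of a repeated character to a single instance. Note that squeezing newlines this way removes blank lines altogether rather than leaving one blank line behind.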
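The difference between squeezing newline characters and squeezing blank lines is easy to see on a small sample. A minimal sketch, with cat -s (supported by GNU and BSD cat) shown for comparison; the sample text is made up:

```shell
# Sample: "a", three blank lines, "b".
squeezed_tr=$(printf 'a\n\n\n\nb\n' | tr -s '\n')
squeezed_cat=$(printf 'a\n\n\n\nb\n' | cat -s)

echo "tr -s leaves $(printf '%s\n' "$squeezed_tr" | wc -l | tr -d ' ') lines"
echo "cat -s leaves $(printf '%s\n' "$squeezed_cat" | wc -l | tr -d ' ') lines"
```

tr -s collapses the run of newlines completely, so no blank line survives, while cat -s keeps exactly one blank line between a and b.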
This is much better handled by tr -s '\n' or cat -s, but if you insist on sed, here's an example from section 4.17 of the GNU sed manual:
#!/usr/bin/sed -f
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ {
N
bx
}
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
I am not sure this is what the OP wanted but using the awk solution by William Pursell here is the approach if you want to delete ALL empty lines in the file:
awk '!/^$/' file.txt
Explanation:
The awk pattern
'!/^$/'
tests whether the current line is not empty: the regular expression /^$/ matches a line consisting only of the beginning of a line (symbolised by '^') immediately followed by the end of a line (symbolised by '$'), and the leading ! negates the match.
If this pattern is true, awk applies its default action and prints the current line.
HTH
I think OP wants to compress empty lines, e.g. where there are 9 consecutive empty lines, he wants to have just three.
I have written a little bash script that does just that:
#!/bin/bash
file=file.txt
tmp=temp.txt
: > "$tmp"    # start with an empty temp file

# Succeeds if line $1 of $file is empty, a single space, or past the end of the file.
is_empty() {
    local line
    line=$(sed -n "${1}p" "$file")
    [[ $line == "" || $line == " " ]]
}

total=$(wc -l < "$file")
current=1
while [ "$current" -le "$total" ]
do
    # skip a line only when it and the two following lines are all empty
    if is_empty "$current" && is_empty $((current + 1)) && is_empty $((current + 2))
    then
        # do not copy the line to the temp file
        echo "Skipping line $current"
    else
        sed -n "${current}p" "$file" >> "$tmp"
        echo "Writing line $current"
    fi
    ((current++))
done
cat "$tmp" > "$file"
rm "$tmp"
final=$(wc -l < "$file")
echo "Deleted $((total - final)) empty lines."