my command looks like:
for i in *.fasta ; do
parallel -j 10 python script.py $i > $i.out
done
I want to add a test condition to this loop so that it only runs the parallel python command if there are no identical lines in the .fasta file.
An example .fasta file is below:
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
AAAAAAAAACGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGTTGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
An example .fasta file that I would like excluded, because lines 2 and 4 are identical:
>ref2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
>mut_1_2964_0
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAATCC
CCAAAGTCAAGGAGTAGTAGAATCTATGCGGAAAGAATTAAAGAAAATTATAGGACAGGT
AAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGC
The input files always have 4 lines exactly, and lines 2 and 4 are always the lines to be compared.
I've been using sort file.fasta | uniq -c to see if there are identical lines, but I don't know how to incorporate this into my bash loop.
EDIT:
command:
for i in read_00.fasta ; do lines=$(awk 'NR % 2 == 0' "$i" | sort | uniq -c | awk '$1 > 1'); if [ -z "$lines" ]; then echo "$i" >> not.identical.txt; fi; done
read_00.fasta:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
>mut_1_2964_0
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAACCACAGAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTGCCCATACAAAAGGAAACATGGGAAACATGGTGGACAGAGTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTTAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGA
Verify the content of those specific lines with the awk below, exiting with failure when the lines are identical and with success otherwise (instead of exiting, you can print or do whatever else suits you):
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "./$yourFile"
or, to print the file name instead when the 2nd and 4th lines differ:
awk 'FNR==2{ prev=$0 } FNR==4{ if(prev!=$0) print FILENAME; nextfile }' ./*.fasta
Using the exit status of the first command, you can then easily chain your second command, like:
for file in ./*.fasta; do
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$file" &&
{ parallel -j 10 python script.py "$file" > "$file.out"; }
done
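A hedged refinement: rather than wrapping parallel inside the loop, you could filter the files first and hand every surviving name to a single parallel invocation, so that -j 10 actually runs ten jobs at once (assumes GNU parallel; script.py and the *.fasta layout are as in the question):
for f in ./*.fasta; do
awk 'NR==2{ prev=$0 } NR==4{ if(prev==$0) exit 1; else exit }' "$f" && printf '%s\n' "$f"
done | parallel -j 10 'python script.py {} > {}.out'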
I am writing a Tcl script which inserts some text into a file after a matched line. The following is the basic code in the script:
set test_lists [list "test_1"\
"test_2"\
"test_3"\
"test_4"\
"test_5"
]
foreach test $test_lists {
set content "
'some_data/$test'
"
exec sed -i "/dog/a$content" /Users/l/Documents/Codes/TCL/file.txt
}
However, when I run this script, it always shows me this error:
dyn-078192:TCL l$ tclsh test.tcl
sed: -e expression #1, char 12: unknown command: `''
while executing
"exec sed -i "/dog/a$content" /Users/l/Documents/Codes/TCL/file.txt"
("foreach" body line 5)
invoked from within
"foreach test $test_lists {
set content "
'some_data/$test'
"
exec sed -i "/dog/a$content" /Users/l/Documents/Codes/TCL/file.txt
}"
(file "test.tcl" line 8)
Somehow it always tries to evaluate the first word in $content as a command.
Any idea what I should do here to make this work?
Thanks.
You should first decide exactly what characters need to be processed by sed. (See https://unix.stackexchange.com/questions/445531/how-to-chain-sed-append-commands for why this can matter…) They might possibly be:
/dog/a\
'some_data/test_1'
which would turn a file like:
abc
dog
hij
into
abc
dog
'some_data/test_1'
hij
If that's what you want, you can then proceed to the second stage: getting those characters from Tcl into sed.
# NB: *no* newline here!
set content "'some_data/$test'"
# NB: there's a quoted backslash and two quoted newlines here
exec sed -i "/dog/a\\\n$content\n" /Users/l/Documents/Codes/TCL/file.txt
One of the few places where you need to be careful with quoting in Tcl is when you have backslashes and newlines in close proximity.
Why not perform the text transformation directly in Tcl itself? This might reverse the order of the inserted lines compared to the original code; you can fix that by applying lreverse to the list at a convenient time (a minimal sketch follows the code below), and perhaps you will also want to do further massaging of the text to insert. Those are all refinements...
set test_lists [list "'some_data/test_1'"\
"'some_data/test_2'"\
"'some_data/test_3'"\
"'some_data/test_4'"\
"'some_data/test_5'"
]
set filename /Users/l/Documents/Codes/TCL/file.txt
set REGEXP "dog"
# Read in the data; this is good even for pretty large files
set f [open $filename]
set lines [split [read $f] "\n"]
close $f
# Search for first matching line by regular expression
set idx [lsearch -regexp $lines $REGEXP]
if {$idx >= 0} {
# Found something, so do the insert in the list of lines
set lines [linsert $lines [expr {$idx + 1}] {*}$test_lists]
# Write back to the file as we've made changes
set f [open $filename "w"]
puts -nonewline $f [join $lines "\n"]
close $f
}
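For instance, if you do want to reproduce the reversed insertion order of the original sed loop, a minimal sketch of the lreverse refinement mentioned above (same variables as in the code) would be:
set lines [linsert $lines [expr {$idx + 1}] {*}[lreverse $test_lists]]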
(an extended comment, not an answer)
Running this in the shell to clarify your desired output: is this what you want?
$ cat file.txt
foo
dog A
dog B
dog C
dog D
dog E
bar
$ for test in test_{1..5}; do content="some_data/$test"; sed -i "/dog/a$content" file.txt; done
$ cat file.txt
foo
dog A
some_data/test_5
some_data/test_4
some_data/test_3
some_data/test_2
some_data/test_1
dog B
some_data/test_5
some_data/test_4
some_data/test_3
some_data/test_2
some_data/test_1
dog C
some_data/test_5
some_data/test_4
some_data/test_3
some_data/test_2
some_data/test_1
dog D
some_data/test_5
some_data/test_4
some_data/test_3
some_data/test_2
some_data/test_1
dog E
some_data/test_5
some_data/test_4
some_data/test_3
some_data/test_2
some_data/test_1
bar
Let's say I have two variables foo and bar containing the same number of newline-separated strings, for instance:
$ echo $foo
a
b
c
$ echo $bar
x
y
z
What is the simplest way to merge foo and bar to get the output below?
a x
b y
c z
If foo and bar were files I could do paste -d ' ' foo bar but in this case they are strings.
You can use process substitution in Bash to do this (not POSIX compliant):
foo=$'a\nb\nc'
bar=$'x\ny\nz'
paste -d ' ' <(printf '%s\n' "$foo") <(printf '%s\n' "$bar")
Outputs:
a x
b y
c z
A way that avoids process substitution is a little more convoluted (note that $'…', read -u and here-strings are still bashisms, so this is not strictly sh-compliant):
foo=$'a\nb\nc'
bar=$'x\ny\nz'
res=$(while IFS=$'\n' read -u 3 -r f1 && IFS=$'\n' read -u 4 -r f2; do
printf '%s' "$f1"
printf ' %s\n' "$f2"
done 3<<<"$foo" 4<<<"$bar"
)
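If you really need something a plain POSIX sh can run, a hedged sketch that swaps the here-strings and read -u for here-documents attached to numbered file descriptors (same foo and bar contents as above) is:
foo='a
b
c'
bar='x
y
z'
while IFS= read -r f1 <&3 && IFS= read -r f2 <&4; do
printf '%s %s\n' "$f1" "$f2"
done 3<<EOF_FOO 4<<EOF_BAR
$foo
EOF_FOO
$bar
EOF_BAR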
I have input (for example, from ifconfig run0 scan on OpenBSD) that has some fields that are separated by spaces, but some of the fields themselves contain spaces (luckily, such fields that contain spaces are always enclosed in quotes).
I need to distinguish between the spaces within the quotes, and the separator spaces. The idea is to replace spaces within quotes with underscores.
Sample data:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime
Which doesn't end up processed the way I want, since I haven't replaced the spaces within the quotes with the underscores yet:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M
Websense 6 00:22:7f:xx:xx:xx 59dB 54M
For a sed-only solution (which I don't necessarily advocate), try:
echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"
Translation:
Start at the beginning of the line.
Look for the pattern junk"junk"junk, repeated zero or more times (where junk contains no quotes), followed by junk"junk and then a space.
Replace the final space with _.
If successful, jump back to the beginning.
Try this:
awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file
It works with multiple quoted parts in a line:
echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\""
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz
EDIT: an alternative form:
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
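For what it's worth, a hedged sketch of dropping that straight into the pipeline from the question (same file and field numbers as there) would look like:
cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' |\
cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
which should now keep the quoted SSID together as one field, e.g. "myTouch_4G_Hotspot" 11 d8:b3:77:xx:xx:xx 49dB 54M.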
Another awk to try:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"
Removing the quotes:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
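A quick check of the quote-removing form against the sample line used earlier should give:
echo 'a b "c d e" f g "h i"' | awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
a b c_d_e f g h_i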
Some additional testing with a triple-size test file, further to the earlier tests done by #steve. I had to transform the sed statement a little so that non-GNU seds could process it as well. I included awk (bwk), gawk3, gawk4 and mawk:
$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m27.802s
user 0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m6.565s
user 0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m21.486s
user 0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m14.270s
user 0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.251s
user 0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m13.229s
user 0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m33.965s
user 0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m15.437s
user 0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.002s
user 0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 5m14.008s
user 5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 4m11.026s
user 4m10.318s
sys 0m0.463s
mawk rendered the fastest results...
You'd be better off with perl. The code is much more readable and maintainable:
perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'
With your input, the results are:
a b "c_d_e" f g "h_i"
Explanation:
-p # enable printing
-e # the following expression...
s # begin a substitution
: # the first substitution delimiter
"[^"]*" # match a double quote followed by anything not a double quote any
# number of times followed by a double quote
: # the second substitution delimiter
($x=$&)=~s/ /_/g; # copy the pattern match ($&) into a variable ($x), then
                  # replace every space in $x with an underscore, globally. The
# variable $x is needed because capture groups and
# patterns are read only variables.
$x # return $x as the replacement.
: # the last delimiter
g # perform the nested substitution globally
e # make sure that the replacement is handled as an expression
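A quick sanity check against one of the lines from the question should give:
echo 'nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime' |\
perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'
nwid "myTouch_4G_Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime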
Some testing:
for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done
time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m8.301s
user 0m8.273s
sys 0m0.020s
time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.967s
user 0m4.924s
sys 0m0.036s
time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.336s
user 0m4.244s
sys 0m0.056s
time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null
real 2m26.101s
user 2m25.925s
sys 0m0.100s
NOT AN ANSWER, just posting awk equivalent code for #steve's perl code in case anyone's interested (and to help me remember this in future):
#steve posted:
perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'
and from reading #steve's explanation, the briefest awk equivalent to that perl code (NOT the preferred awk solution - see #Kent's answer for that) would be this GNU awk:
gawk '{
head = ""
while ( match($0,"\"[^\"]*\"") ) {
head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}'
which we get to by starting from a POSIX awk solution with more variables:
awk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
x = substr(tail,RSTART,RLENGTH)
gsub(/ /,"_",x)
head = head substr(tail,1,RSTART-1) x
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and saving a line with GNU awk's gensub():
gawk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
head = head substr(tail,1,RSTART-1) x
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and then getting rid of the variable x:
gawk '{
head = ""
tail = $0
while ( match(tail,"\"[^\"]*\"") ) {
head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
and then getting rid of the variable "tail" if you don't need $0, NF, etc, left hanging around after the loop:
gawk '{
head = ""
while ( match($0,"\"[^\"]*\"") ) {
head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}'
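As a quick check, feeding the earlier sample line through that last version should give the same result as the perl (GNU awk needed for gensub()):
echo 'a b "c d e" f g "h i"' | gawk '{ head=""; while (match($0,"\"[^\"]*\"")) { head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH)); $0 = substr($0,RSTART+RLENGTH) } print head $0 }'
a b "c_d_e" f g "h_i"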
I have a file, xx.txt, like this:
1PPYA
2PPYB
1GBND
1CVHA
The first line of this file is "1PPYA". I would like to:
Read the last character of "1PPYA". In this example, it's "A".
Find "1PPY.txt" (the first four characters) in the "yy" directory.
Delete the lines that start with "csh" and contain the "A" character.
Given the following "1PPY.txt" in the "yy" directory:
csh 1 A 1 27.704 6.347
csh 2 A 1 28.832 5.553
csh 3 A 1 28.324 4.589
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
csh 6 A 1 28.378 4.899
The required output would be:
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
Assuming your shell is bash:
while read word; do
if [[ $word =~ ^(....)(.)$ ]]; then
filename="yy/${BASH_REMATCH[1]}.txt"
letter=${BASH_REMATCH[2]}
if [[ -f "$filename" ]]; then
sed "/^csh.*$letter/d" "$filename"
fi
fi
done < xx.txt
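If you would rather modify the yy/*.txt files than just print the filtered lines, a hedged tweak is to let sed edit in place (GNU sed syntax shown; BSD/macOS sed wants -i '' instead):
sed -i "/^csh.*$letter/d" "$filename"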
As you've tagged the question with awk:
awk '{
filename = "yy/" substr($1,1,4) ".txt"
letter = substr($1,5)
while ((getline < filename) > 0)
if (! match($0, "^csh.*" letter))
print
close(filename)
}' xx.txt
This might work for you:
sed 's|^ *\(.*\)\(.\)$|sed -i.bak "/^ *csh.*\2/d" yy/\1.txt|' xx.txt | sh
N.B. I added a file backup. If this is not needed, change the -i.bak to -i.
You can use this bash script:
while read f l
do
[[ -f $f ]] && awk -v l="$l" '$3 != l' "$f"
done < <(awk '{len=length($0);l=substr($0,len);f=substr($0,1,len-1);print "yy/" f ".txt", l;}' xx.txt)
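To see what the inner awk hands to the while loop, you can run it on its own against the sample xx.txt; it should print a file name and letter per line, which the loop then reads as f and l:
awk '{len=length($0);l=substr($0,len);f=substr($0,1,len-1);print "yy/" f ".txt", l;}' xx.txt
yy/1PPY.txt A
yy/2PPY.txt B
yy/1GBN.txt D
yy/1CVH.txt A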
I posted this because you are a new user; however, it would be much better to show us what you have tried and where you're stuck.
TXR:
#(next "xx.txt")
#(collect)
#*prefix#{suffix /./}
# (next `yy/#prefix.txt`)
# (collect)
# (all)
#{whole-line}
# (and)
# (none)
#shell #num #suffix #(skip)
# (end)
# (end)
# (do (put-string whole-line) (put-string "\n"))
# (end)
#(end)
Run:
$ txr del.txr
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
txr: unhandled exception of type file_error:
txr: (del.txr:5) could not open yy/2PPY.txt (error 2/No such file or directory)
Because of the outer #(collect)/#(end) (easily removed), this processes all of the lines from xx.txt, not just the first one, and so it blows up because I don't have 2PPY.txt.