diff between two files ignoring empty lines

I want to diff two files while ignoring the empty lines, but preserve the original line numbers in the files.
File1:
hhhh
gggg
ffff
File2:
aaa
bbb
ccc
Diff:
1,6c1,3
< hhhh
<
<
<
< gggg
< ffff
---
> aaa
> bbb
> ccc
I want: (preserve 1,6c1,3)
1,6c1,3
< hhhh
< gggg
< ffff
---
> aaa
> bbb
> ccc
I've tried diff -B and diff -I "\n", but neither works.
Does anyone know how I can do this? Thanks.

Solved with perl:
diff file1 file2 | perl -ne 'print if !/^<\s*$/'
This drops the bare "<" lines produced by the deleted empty lines while leaving the hunk header 1,6c1,3 intact.
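The same filter can be written in awk; matching both markers also drops bare ">" lines if file2 is the one containing the empty lines (a sketch, not part of the original solution):
diff file1 file2 | awk '!/^[<>][[:space:]]*$/'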

I need to print columns from two different files with different numbers of rows into one file

File1.txt
123 321 231
234 432 342
345 543 453
file2.txt
abc bca cba
def efd fed
ghi hig ihg
jkl klj lkj
mno nom onm
pqr qrp rqp
I want output file like
Outfile.txt
123 321 231 abc bca cba
234 432 342 def efd fed
345 543 453 ghi hig ihg
jkl klj lkj
mno nom onm
pqr qrp rqp
Most simply:
sed 's/$/ /' file1 | paste -d '' - file2
This appends a space to the end of each line in file1 and pastes that output together with file2 without a delimiter.
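To see what paste receives, here is the intermediate output on the sample files; each file1 line now carries a trailing space, which becomes the column separator once paste glues the lines to file2. (Note that some paste implementations want -d '\0' instead of -d '' for an empty delimiter.)
$ sed 's/$/ /' file1
123 321 231 
234 432 342 
345 543 453 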
Alternatively, if you know that file2 is longer than file1,
awk 'NR == FNR { line1[NR] = $0 " "; next } { print line1[FNR] $0 }' file1 file2
or if you don't know it,
awk 'NR == FNR { n = NR; line1[n] = $0 " "; next } { print line1[FNR] $0 } END { for(i = FNR + 1; i <= n; ++i) print line1[i]; }' file1 file2
also works.

Splitting one file into multiple files

I have a large file like the one below and I want to split it into multiple files. Each split should occur after an ENDMDL line. For the following file there will be three output files, named pose1.av, pose2.av and pose3.av.
MODEL 1
SML 170 O PRO A 17 16.893 3.030 0.799 1.00 1.00 O
SML 171 OXT PRO A 17 18.167 2.722 2.597 1.00 1.00 O
TER 172 PRO A 17
ENDMDL
MODEL 2
SML 4 CG ARG A 1 -2.171 -7.105 -4.278 1.00 1.00 C
SML 5 CD ARG A 1 -1.851 -8.581 -4.022 1.00 1.00 C
SML 113 HD1 HIS A 12 2.465 -8.206 5.062 1.00 1.00 H
TER 114 HIS A 12
ENDMDL
MODEL 3
SML 101 N HIS A 12 3.765 -3.995 7.233 1.00 1.00 N
SML 102 CA HIS A 12 2.584 -4.736 6.934 1.00 1.00 C
TER 103 HIS A 12
ENDMDL
A rather efficient one, using bash and sed:
n=0
while IFS= read -r firstline; do
{ echo "$firstline"; sed '/^ENDMDL$/q'; } > "pose$((++n)).av"
done < file
It's much more efficient than the other Bash answer: the output file is only opened once, and most of the parsing is done by sed, and not by bash.
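One caveat not raised in the answer: the loop relies on sed leaving the shared file descriptor positioned just after the ENDMDL line. GNU sed does that when the input is a seekable regular file, as here, but if you feed the loop from a pipe, sed's read-ahead buffering can swallow the following models. GNU sed's -u (unbuffered) flag avoids that; a sketch of the pipe-fed variant:
n=0
cat file | while IFS= read -r firstline; do
{ echo "$firstline"; sed -u '/^ENDMDL$/q'; } > "pose$((++n)).av"
done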
csplit can do this out of the box
csplit -z -s -f pose -b "%01d.av" file '/^ENDMDL$/+1' '{*}'
Awk is a good choice for this task:
awk '{file="pose"++i;printf "%s%s",$0,RS > file;close(file)}' RS='ENDMDL\n' file
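Note that a multi-character RS is a gawk extension. With a POSIX awk you can key off the ENDMDL line itself instead, opening a new output file at the start of each record and closing it after ENDMDL (a sketch in the same spirit):
awk '!out{out="pose" (++i) ".av"} {print > out} /^ENDMDL/{close(out); out=""}' file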
Using a perl one-liner:
perl -ne '$fh or open $fh, "> pose".++$i.".av"; print $fh $_; undef $fh if /^ENDMDL/' file.txt
In pure Bash:
cnt=1
while IFS= read -r line; do
echo "$line" >> pose${cnt}.av
[ "$line" == "ENDMDL" ] && let cnt+=1
done < filename.txt
Or with awk, starting a new output file at each MODEL line:
awk '/^MODEL/{out="pose"++cnt".av"} {print > out}' file

How do I right justify columns in a file [duplicate]

How do I right justify the columns of a file in awk, sed, or bash?
My file is currently left justified and space delimited.
Can I use printf or rev?
Here is what my file looks like :
$ cat file
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
And using rev doesn't give me the output I'm looking for.
$ rev file | column -t | rev
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
In the absence of a specific example, here is a general solution using a trick with rev:
$ cat file
a 10000.00 x
b 100 y
c 1 zzzZZ
$ rev file | column -t | rev
a  10000.00      x
b       100      y
c         1  zzzZZ
Where column -t is replaced by whatever you are trying to do.
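If you'd rather not rely on the rev trick, awk's printf can right-justify each field explicitly. This sketch assumes whitespace-separated columns and an arbitrary field width of 10; pick a width at least as wide as your widest value:
awk '{ for (i = 1; i <= NF; i++) printf "%10s", $i; print "" }' file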

Deleting lines with sed or awk

I have a file data.txt like this.
>1BN5.txt
207
208
211
>1B24.txt
88
92
I have a folder F1 that contains text files.
1BN5.txt file in F1 folder is shown below.
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C
1B24.txt file in F1 folder is shown below.
ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C
I need only the lines containing 207, 208, 211 (6th column) in the 1BN5.txt file. I want to delete the other lines in 1BN5.txt. Likewise, I need only the lines containing 88, 92 in the 1B24.txt file.
Desired output
1BN5.txt file
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
1B24.txt file
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Here's one way using GNU awk. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
file = substr($1,2)
next
}
{
a[file][$1]
}
END {
for (i in a) {
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,b)
for (j in a[i]) {
if (b[6]==j) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
If your awk is older than GNU Awk 4.0.0, you could try the following. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
file = substr($1,2)
next
}
{
a[file]=( a[file] ? a[file] SUBSEP : "") $1
}
END {
for (i in a) {
split(a[i],b,SUBSEP)
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,c)
for (j in b) {
if (c[6]==b[j]) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
Please note that this script does exactly what you describe. It expects files like 1BN5.txt and 1B24.txt to reside in the folder F1 in the current working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system() call. HTH.
Results:
Contents of F1/1BN5.txt:
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
Contents of F1/1B24.txt:
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:
awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' 1bn5.txt > output.txt
Assuming GNU awk, run this command from the directory containing data.txt:
awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt
This parses data.txt for filenames and search terms, then calls grep from inside awk to append the matches for each file and term listed in data.txt to a new file in F1 called originalfilename.txt_results.
if you want to replace the original files completely, you could then run this command:
grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;
This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1
mv F1 backup &&
mkdir F1 &&
awk '
NR==FNR {
if (sub(/>/,"")) {
file=$0
ARGV[ARGC++] = "backup/" file
}
else {
tgt[file,$0] = "F1/" file
}
next
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup
If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.." at the end (do that during testing anyway).
EDIT: FYI this is one case where you could argue that getline isn't completely incorrect, since it's parsing a first file that's totally unlike the rest of the files in structure and intent, so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:
mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
while ( (getline line < data) > 0 ) {
if (sub(/>/,"",line)) {
file=line
ARGV[ARGC++] = "backup/" file
}
else {
tgt[file,line] = "F1/" file
}
}
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' &&
rm -rf backup
but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).
This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.
awk '
BEGIN {RS=">"}
FNR == 1 {
# since the first char in data.txt is the record separator,
# there is an empty record before the real data starts
next
}
{
n = split($0, a, "\n")
file = "F1/" a[1]
newfile = file ".new"
RS="\n"
while ((getline < file) > 0) {
for (i=2; i<n; i++) {
if ($6 == a[i]) {
print > newfile
break
}
}
}
RS=">"
system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
}
' data.txt
Definitely a job for awk:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Redirect to save the output:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt
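When the list of residue numbers grows, hard-coding them in the condition gets unwieldy. A variant that reads the numbers from a separate file (a hypothetical wanted.txt, one number per line) and keeps any line whose sixth column appears in it:
awk 'NR == FNR { want[$1]; next } $6 in want' wanted.txt 1bn5.txt > output.txt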
I don't think you can do this with sed alone. You need a loop to read your file data.txt. For example, using a bash script:
#!/bin/bash
# First remove all possible "problematic" characters from data.txt, storing the result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading ">", and ".".
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt
# Next determine which lines to keep:
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
# If input starts with ">", set remainder to be the current file
file="${line:1}"
else
# If value is in sixth column, add "keep" to end of line
# Columns assumed separated by one or more spaces
# "+" is a GNU extension, so we need the -r switch
sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" $file
fi
done
# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
sed -i -n "/keep/{s/keep//g;p;}" ${line:1}
fi
done

How to delete all characters but the last

I want to parse a file and delete all leading 0's of a number using sed. (Of course, if I have something like 0000, the result should be 0.) How do I do that?
I think you may be searching for this. Here lies your answer; you will need to modify it, of course:
How to remove first/last character from a string using SED
This is probably overcomplicated, but it catches all the corner cases I tested:
sed 's/^\([^0-9]*\)0/\1\n0/;s/$/}/;s/\([^0-9\n]\)0/\1\n/g;s/\n0\+/\n/g;s/\n\([^0-9]\)/0\1/g;s/\n//g;s/}$//' inputfile
Explanation:
This uses the divide-and-conquer technique of inserting newlines to delimit segments of a line so they can be manipulated individually.
s/^\([^0-9]*\)0/\1\n0/ - insert a newline before the first zero
s/$/}/ - add a buffer character at the end
s/\([^0-9\n]\)0/\1\n/g - insert newlines before each leading zero (and remove the first)
s/\n0\+/\n/g - remove the remaining leading zeros
s/\n\([^0-9]\)/0\1/g - replace bare zeros
s/\n//g - remove the newlines
s/}$// - remove the end-of-line buffer
This file:
0 foo 1 bar 01 10 001 baz 010 100 qux 000 00 0001 0100 0010
100 | 00100
010 | 010
001 | 001
100 | 100
0 | 0
00 | 0
000 | 0
00 | 00
00 | 00
00 | 00 z
Becomes:
0 foo 1 bar 1 10 1 baz 10 100 qux 0 0 1 100 10
100 | 100
10 | 10
1 | 1
100 | 100
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0 z
If a number has leading zeroes, all you have to do is convert it to an integer. Something like this:
$ echo "000123 test " | awk '{$1=$1+0}1'
123 test
This doesn't require any regex at all, simple or overly complicated.
Similarly (Ruby 1.9+):
$ echo "000123 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
123 test
For the case of all 0's:
$ echo "0000 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
0 test
$ echo "000 test " | awk '{$1=$1+0}1'
0 test
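For the common case where the numbers stand alone as whitespace-delimited words, a much shorter sed also works (a sketch using GNU sed, since \b word boundaries are a GNU extension; it strips leading zeros but keeps a lone final zero, so 0100 becomes 100 and 0000 becomes 0):
sed -E 's/\b0+([0-9])/\1/g' file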