I am trying to go through a file and keep a consecutive group of 4 rows out of each consecutive group of 40 rows.
So in the whole file, I would keep rows 1-4, 41-44, 81-84, etc.
I tried using sed, but I was only able to remove specific rows, not select a repeating pattern like this.
Many thanks...
This awk should do:
awk 'NR%40==1 || NR%40==2 || NR%40==3 || NR%40==4' file
A loop version:
awk '{for (i=1;i<5;i++) if (NR%40==i) print $0}' file
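The same modulo test can also be written more compactly, since remainders 1 through 4 are exactly the nonzero values below 5 (a sketch, same logic as above):
awk 'NR%40 && NR%40 < 5' file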
After testing various solutions, I found that this should work:
awk 'NR%40~/^[1-4]$/' file
Test:
seq 1 100 > file
awk 'NR%40~/^[1-4]$/' file
1
2
3
4
41
42
43
44
81
82
83
84
This might work for you (GNU sed):
sed -n '1~40,+3p' file
Use a 40-line step starting at line 1 and range it over 4 lines (the starting line plus the next 3).
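If GNU sed is unavailable, the same selection can be written in portable awk (a sketch of the equivalent step/range idea):
awk '(NR - 1) % 40 < 4' file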
You might be better off with awk. This is not the most concise solution, but should get you what you want. The variable NR represents the row number.
awk '(NR - 1) % 40 == 0 || (NR - 2) % 40 == 0 || (NR - 3) % 40 == 0 || (NR - 4) % 40 == 0' Input.txt
I tested this like this:
seq 1 50 > /tmp/Input.txt
awk '(NR - 1) % 40 == 0 || (NR - 2) % 40 == 0 || (NR - 3) % 40 == 0 || (NR - 4) % 40 == 0' /tmp/Input.txt
If you want to modify the original file, then output it to a temporary file and move it back.
awk '(NR - 1) % 40 == 0 || (NR - 2) % 40 == 0 || (NR - 3) % 40 == 0 || (NR - 4) % 40 == 0' Input.txt > /tmp/TempOutput
mv /tmp/TempOutput Input.txt
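Alternatively, if your awk is GNU awk 4.1 or later, the -i inplace extension can replace the temp-file step (a sketch, assuming that gawk version):
gawk -i inplace '(NR - 1) % 40 == 0 || (NR - 2) % 40 == 0 || (NR - 3) % 40 == 0 || (NR - 4) % 40 == 0' Input.txt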
Simple, but it does what is asked (thanks to @Jotne for the remark based on the seq test):
sed -n 'N;N;N;p;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N;N' YourFile
just for fun
sed -n '
x
s/^$/ppppfffffffffffffffffffffffffffffffffff/
s/^p//
t keep
s/^f//
x
b
:keep
x
p' YourFile
There is another, more traditional way using only d, p, and N, but it is not funny at all :-).
I use a kind of template counter kept in the hold buffer (the ppppfffffffffffffffffffffffffffffffffff string): each leading p prints a line, and each leading f forgets one.
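The same counter idea, rendered in awk for comparison (a sketch; n is just an illustrative counter variable):
awk 'n == 0 { n = 40 } n-- > 36' YourFile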
I am able to delete lines matching a certain pattern, and separately delete lines shorter (sed '/^.\{,20\}$/d' -i FILE) or longer (sed '/^.\{25\}..*/d' -i FILE) than a certain length, but how do I combine pattern and length in sed?
Lines containing A should be between 20 and 25 characters
Lines containing B should be between 10 and 15 characters
Lines containing C should be between 3 and 8 characters
All other lines should be deleted from the file
1234567890 A 1234567890
12345 A 12345
1 A 1
1234567890 B 1234567890
12345 B 12345
1 B 1
1234567890 C 1234567890
12345 C 12345
1 C 1
So that the output should look like this
1234567890 A 1234567890
12345 B 12345
1 C 1
This is how you can do it with sed:
$ sed -ne '/A/ s/^\(.\{20,25\}\)$/\1/p; /B/ s/^\(.\{10,15\}\)$/\1/p; /C/ s/^\(.\{3,8\}\)$/\1/p;' file
1234567890 A 1234567890
12345 B 12345
1 C 1
How does it work:
-ne - suppress automatic printing of the pattern space
/A/ - look for pattern A
^\(.\{20,25\}\)$ - match a whole line of 20-25 characters, capturing it as group 1
/\1/p - replace the match with the captured group and print the pattern space
Use awk and you can simply write the conditions as a boolean expression; you're not stuck trying to make a condition out of a regexp:
$ awk '(/A/ && /^.{20,25}$/) || (/B/ && /^.{10,15}$/) || (/C/ && /^.{3,8}$/)' file
1234567890 A 1234567890
12345 B 12345
1 C 1
Here's an awk solution
awk '/.*A.*/ && length($0) > 19 && length($0) < 26 \
|| /.*B.*/ && length($0) > 9 && length($0) < 16 \
|| /.*C.*/ && length($0) > 2 && length($0) < 9' test1.dat
Edit:
And here's a more efficient version, where we only compute length($0) once:
awk '{len=length($0)}
/.*A.*/ && len > 19 && len < 26 \
|| /.*B.*/ && len > 9 && len < 16 \
|| /.*C.*/ && len > 2 && len < 9' test1.dat
output
1234567890 A 1234567890
12345 B 12345
1 C 1
I have incremented/decremented your boundary numbers by one to eliminate the need to test with <= and >= (which are slightly more expensive tests; on a very large file it might cost you 30 seconds (just a guess!)).
(Don't let any whitespace characters creep in after the \ at the end of those continued lines.)
(Also, you can remove those \ characters and fold this up onto one line if you need that.)
This can be enhanced to accept variable values, and I include a short example here; finishing it out to your needs can be seen as an opportunity for learning ;-)
awk -v lim1=10 -v lim2=26 '/.*A.*/ && length($0) > lim1 && length($0) < lim2 ...
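One possible way to finish it out, as a sketch (the limit-variable names below are illustrative, not from the original):
awk -v a1=19 -v a2=26 -v b1=9 -v b2=16 -v c1=2 -v c2=9 '
{len = length($0)}
/A/ && len > a1 && len < a2 \
|| /B/ && len > b1 && len < b2 \
|| /C/ && len > c1 && len < c2' test1.dat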
IHTH
I'm trying to do some calculations on the columns of a tab-delimited file using this perl one-liner:
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/e && s/$F[3]/$F[3]\/$F[4]/e}' infile
The idea is to get columns A and B divided by column C.
infile:
X Y A B C
5001 3 1.03333 0.652549 4215
6001 4 1.2 0.723137 4870
7001 2 1 0.807843 5153
8001 2 1 0.807843 5355
9001 2 1 0.807843 5389
10001 2 1 0.807843 4955
11001 7 1.7671 1.05573 4966
12001 17 8.18802 4.72554 5124
But the output is this:
X Y A B C
5001 3 0.000245155397390273 0.000154815895610913 4215
6001 4 0.000246406570841889 0.000148488090349076 4870
7000.000194061711624297 2 1 0.000156771395303707 5153
8000.000186741363211951 2 1 0.000150857703081232 5355
9000.000185563184264242 2 1 0.000149905919465578 5389
0.0002018163471241170001 2 1 0.000163035923309788 4955
11001 7 0.000355839710028192 0.000212591623036649 4966
12001 17 0.00159797423887588 0.000922236533957845 5124
What is going on on the 3rd to 6th lines? How can I manage to fix this?
Thanks.
EDIT:
I removed the /e option from the substitute command, and it seems that the substitution is being performed on the wrong column.
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/ && s/$F[3]/$F[3]\/$F[4]/}' infile
X Y A B C
5001 3 1.03333/4215 0.652549/4215 4215
6001 4 1.2/4870 0.723137/4870 4870
7001/5153 2 1 0.807843/5153 5153
8001/5355 2 1 0.807843/5355 5355
9001/5389 2 1 0.807843/5389 5389
1/49550001 2 1 0.807843/4955 4955
11001 7 1.7671/4966 1.05573/4966 4966
12001 17 8.18802/5124 4.72554/5124 5124
13001 30 13.8763/5138 8.05385/5138 5138
After substitution and evaluation, you have something like s/1/0.000194061711624297/. So the s operator looks for a 1 and finds it as part of the first column. Whoops. If we add some \b word-boundary markers, we can force the match part of the s operators to match a complete column, never just part of a column:
perl -ape 'if (/^\d/) { s/\b$F[2]\b/$F[2]\/$F[4]/e && s/\b$F[3]\b/$F[3]\/$F[4]/e}' infile
But that's still going to run into issues if it's possible for column X to equal column A or B. Better to just do the calculations and then replace the entire line by assigning to $_:
perl -ape 'if (/^\d/) { $F[2] /= $F[4]; $F[3] /= $F[4]; $_ = join(" ", @F); }'
Use sprintf instead of join if you want a particular format to the output.
Your basic problem is that you are substituting the values that are in columns 3 and 4 wherever they appear in the whole line. For row 3, for example, you are doing s/1/1\/5153/e, which affects the first occurrence of the digit 1 in the line, not necessarily the 1 that happens to be in column 3.
Try this:
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] /= $F[4]; $F[3] /= $F[4] } print join "\t", @F' infile
If you want to limit the precision, do something like $F[2] = sprintf "%f", $F[2]/$F[4]; ...
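Putting that together, a full one-liner with fixed precision might look like this (a sketch; %.6f is an arbitrary choice):
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] = sprintf "%.6f", $F[2]/$F[4]; $F[3] = sprintf "%.6f", $F[3]/$F[4] } print join "\t", @F' infile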
I have a large file consisting data in 2 columns
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in 2nd column with matching values in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'm trying to work on this using a bash script, preferably. Can someone explain how I can do this?
Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12
In a perl oneliner
perl -lane "$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s}" file.txt
You can use an associative array. The first column is the index and the second becomes what you add to it.
#!/bin/bash
declare -A columns=()
# Read each line into an array: the first field is the key, the second the value to add.
while read -r -a line ; do
    columns[${line[0]}]=$((${columns[${line[0]}]} + ${line[1]}))
done < "${1}"
# Print each key with its accumulated sum.
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done
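If the script is saved as, say, sumcols.sh (a name chosen here for illustration), it takes the input file as its first argument:
bash sumcols.sh input.txt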
Using awk and maintaining the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12
Python is my choice:
d = {}
# Assuming the input is in a file named file.txt (no name was given in the original).
with open('file.txt') as f:
    for line in f:
        key, value = line.split()
        if key not in d:
            d[key] = 0
        d[key] += int(value)  # values are read as strings, so convert
print(d)
Why would you want a bash script?
I've been trying to pull a field from a row in a file, but each row may have two or three fields more or fewer than the next; the rows aren't always equal in the number of fields.
Here is a snippet:
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45548236 1 1 22 20 71 Nov 26 8-19:36:28 00:56:18 10628 116508 0 - oracleXXX (LOCAL=NO)
A orarpp 45679190 1 1 0 20 60 Nov 28 6-23:42:20 00:37:59 10232 116112 0 - oracleXXX (LOCAL=NO)
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
In the case of the first line, I'm interested in 9-16:42:32 and the similar fields for each row.
I've tried to pull it by using ':' as the field separator and then filtering from there; however, what I am trying to accomplish is to do something when the number before the dash (in the example it's 9) is greater than one.
cat file.txt | grep oracle | awk -F: '{print substr($1, length($1)-5)}'
This is because the number of fields on either side of the actual field I need can be different from line to line.
Definitely not the most efficient, but I've been trying to do this with an awk one-liner.
Hints or a direction would be appreciated to get me moving again. I am not opposed to doing it in a better way than awk.
Thanks.
Maybe cut is the right tool for this job? For example, with your snippet:
$ cut -c 62-71 file.txt
9-16:42:32
5-03:54:42
8-19:36:28
6-23:42:20
23:08:12
9-19:54:34
The arguments tell cut to snip columns (-c) 62 through 71.
For additional processing, you can pipe it to awk.
You can also accomplish the whole thing in awk by accepting entire lines and then using substr to extract the columns you want. For example, this awk command produces the same output as the cut command above:
awk '{ print substr($0, 62, 10) }' file.txt
Whether you create a pipeline or do the processing entirely in awk is at least in part a matter of personal taste / style.
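If the character positions are not guaranteed to line up, another option is to scan the fields for the shape of the value instead of its position. A sketch, assuming the field of interest always looks like D-HH:MM:SS (day-less values such as 23:08:12 in the sample would need a looser pattern):
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+-[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) print $i }' file.txt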
Would this do?
awk -F: '/oracle/ {print substr($0,62,10)}' file.txt
9-16:42:32
8-19:36:28
6-23:42:20
23:08:12
This searches for oracle and then prints 10 characters starting at position 62.
You can grab those identifiers with one of:
grep -o '[[:digit:]]\+-[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}'
grep -oP '\d+-\d\d:\d\d:\d\d' # GNU grep
It sounds like you want to do something with the lines, not just find the ids. Please elaborate.
Using GNU awk:
gawk --re-interval '
/oracle/ && \
match($0, /([[:digit:]]+)-([[:digit:]]{2}:){2}[[:digit:]]{2}/, a) && \
a[1]>1 {
# do something with the matching line
print
}
' file
Q1: How do I make sed match the whole line, and delete the line only if it consists of nothing but the given string?
I have a file that contains many lines of number pairs like the following:
1 1
3 1
12 1
1 12
25 24
23 24
I want to delete the lines where the two numbers are the same. For that I have been using either:
sed '/1 1/d' < old.file > new.file
OR
sed -n '/1 1/!p' < old.file > new.file
Here is the main problem: if I search for the pattern '1 1', I get rid of '1 12' as well. So I want the pattern to match the whole line, and delete the line only when the whole line matches.
Q2: Automation of question 1
I am also trying to automate this problem. The range of numbers in the first column and the second column could be from 1 to 25.
So far this is what I got:
for ((i=1;i<26;i++)); do
sed "/'$i' '$i'/d" < oldfile > newfile; mv newfile oldfile;
done
This does nothing to the oldfile in the end. :(
This would be more readable with awk:
awk '$1 == $2 {next} {print}' oldfile > newfile
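The same idea can be written even more compactly, since a true pattern with no action prints the line:
awk '$1 != $2' oldfile > newfile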
Update based on comment:
If the requirement is to remove lines where the two values are within 1 of each other:
awk '{d = $1-$2; if (-1 <= d && d <= 1) next; else print}' oldfile
Unfortunately, awk does not have abs() (at least nawk and gawk don't)
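You can easily define one yourself, though; a sketch equivalent to the test above:
awk 'function abs(x) { return x < 0 ? -x : x }
abs($1 - $2) > 1' oldfile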
Just put the first number in a group (\([0-9]*\)) and then look for it with a backreference (\1). Since the line to delete should contain only the group, repeated, use the ^ to mark the beginning of line and the $ to mark the end of line. For example, for the following file:
$ cat input
1 1
3 1
12 1
1 12
12 12
12 13
13 13
25 24
23 24
...the result is:
$ sed '/^\([0-9]*\) \1$/d' input
3 1
12 1
1 12
12 13
25 24
23 24
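Note that this also answers Q2: the backreference covers every value from 1 to 25 at once, so no shell loop is needed. With GNU sed (assuming its -i option) the file can be edited in place:
sed -i '/^\([0-9]*\) \1$/d' oldfile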
You can also do it with grep:
grep -E -v '^([0-9]+)\s\1$' testfile
Anchored to the whole line: look for one or more digits and remember them, followed by a single whitespace character, followed by the digits you remembered.