Print specific lines that have two or more occurrences of a particular character - sed

I have a file with some text lines. I need to print lines 3-7 and 11 if they contain two "b" characters. I tried
sed -n '/b\{2,\}/p' file but it only printed lines where "b" occurs two times in a row.

You can use
sed -n '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' file
## that is equal to
sed -n '3,7{/b[^b]*b/p};11{//p}' file
Note that b[^b]*b matches a b, then zero or more characters other than b, and then another b. The //p in the second part reuses the most recently used pattern, i.e. it matches the same b[^b]*b regex.
Note you might also use the b.*b regex if you want, but bracket expressions tend to work faster.
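To see the difference between the pattern from the question and b[^b]*b on a small input, here is a minimal sketch. The three sample lines are made up for illustration.

```shell
sample='abba
b n b
ab'
# b\{2,\} needs two adjacent b characters, so only "abba" matches.
adjacent=$(printf '%s\n' "$sample" | sed -n '/b\{2,\}/p')
# b[^b]*b allows other characters between the two b's, so "b n b" matches too.
anywhere=$(printf '%s\n' "$sample" | sed -n '/b[^b]*b/p')
printf 'adjacent:\n%s\nanywhere:\n%s\n' "$adjacent" "$anywhere"
```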
See an online demo, tested with sed (GNU sed) 4.7:
s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'
sed -ne '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' <<< "$s"
Output:
b n b
bb
11 b nnnn bb
Only lines 3, 6 and 11 are returned.

Just use awk for simplicity, clarity, portability, maintainability, etc. Using any awk in any shell on every Unix box:
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )' file
Notice how if you need to change a range, add a range, add other line numbers, change how many bs there are, add other chars and/or strings to match, add some completely different condition, etc. it's all absolutely clear and trivial.
For example, want to print the line if there are exactly 13 or exactly 27 bs instead of 2 or more?
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") ~ /^(13|27)$/ )' file
Want to print the line if the line number is between 23 and 59 but isn't 34?
awk '( 23<=NR && NR<=59 && NR!=34 ) && ( gsub(/b/,"&") >= 2 )' file
Try making similar changes to a sed script. I'm not saying you can't force it to happen, but it's not nearly as trivial, clear, portable, etc. as it is using awk.
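As a quick check, the first awk command can be run against the same 12 sample lines used in the sed demo above. This is only a sketch; the variable names s and result are chosen here for illustration.

```shell
s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'
# gsub(/b/,"&") replaces every b with itself and returns how many it replaced,
# so it doubles as a character counter without changing the line.
result=$(printf '%s\n' "$s" |
    awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )')
printf '%s\n' "$result"
```

This should print the same three lines (3, 6 and 11) as the sed version.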

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar', but the column also contains other words and ';' separators, so using awk $25=='foobar' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] only allows ;, ' or a space directly before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines where the 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 the 25th field
== equal to
"foobar" the string to compare against
Since no action is specified, the complete line is printed, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file
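Another option is to split the column on ';' and compare each entry exactly. The following is a minimal sketch, assuming tab-separated columns (as the cut -f25 in the question implies) and ';'-separated entries inside the column. The function name match_col_word is hypothetical, and the demo uses column 2 instead of 25 to keep the sample short.

```shell
# Hypothetical helper: print lines whose column COL (tab-separated) contains
# WORD as one complete ';'-separated entry.
match_col_word() {
    awk -F'\t' -v col="$1" -v word="$2" '
    {
        n = split($col, parts, ";")
        for (i = 1; i <= n; i++) {
            entry = parts[i]
            gsub(/^[ ]+|[ ]+$/, "", entry)   # trim spaces around the entry
            if (entry == word) { print; next }
        }
    }'
}

# Demo data: three lines, the searched column is column 2.
demo=$(printf 'a\tfoobar ; muller\nb\tmax ; foobar ; john\nc\ttom ; foobar35\n' |
    match_col_word 2 foobar)
printf '%s\n' "$demo"
```

For the real file the call would be match_col_word 25 foobar < input.txt. Only the first two demo lines are printed, because foobar35 is not an exact match.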

join previous line with next depending of pattern of previous line

I have this Input:
1 a
a
2 b b
3 c
c
4 d d
5 e e
6 f
f
7 g
g
I want this output using sed command
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
I'm trying this without success
sed '/^[^0-9]/ x; N; { s/\n/ / }; n' file
Another in awk:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i+=3)print $i,$(i+1),$(i+2)}' file
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
Explained:
$ awk 'BEGIN {
RS="" # prime awk to read in a paragraph of data
}
{
for(i=1;i<=NF;i+=3) # jump forward 3 fields at a time
print $i,$(i+1),$(i+2) # print 3 fields
}' file
awk 'NR>1 && /^[0-9]/ {print substr(s,2); s=""} {s=s FS $0} END {print substr(s,2)}' file
NR>1 && /^[0-9]/: If a line is not the first and begins with a digit,
{print substr(s,2); s=""}: print "s" without the leading space, then clear it.
{s=s FS $0}: On every line, append the current line to the value of "s". FS is a space by default.
edit: Added END condition to catch last line, hated it, made a better separate answer.
Made it simpler with awk:
awk 'NF==2 {printf("%s ", $0); next} 1' file
Basically, "Don't print a newline if there are only exactly two fields."
This might work for you (GNU sed):
sed '/^[0-9]/{:a;N;s/\n\([^0-9]\)/ \1/;ta;P;D}' file
If the current line begins with an integer, append the following line. If that line does not begin with an integer, replace the newline by a space and repeat. Otherwise print/delete the first line in the pattern space and repeat.
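The same loop can be written out one command per line with comments. This is a sketch of the idea rather than the exact one-liner: it swaps in $!N for N so that the script does not rely on how a particular sed behaves when N is reached on the last line. The sample input is a shortened version of the one in the question.

```shell
input='1 a
a
2 b b
3 c
c
4 d d'
joined=$(printf '%s\n' "$input" | sed '
/^[0-9]/{
  :a
  # append the next input line, unless this is already the last one
  $!N
  # a continuation line does not start with a digit: join it with a space
  s/\n\([^0-9]\)/ \1/
  # if a line was joined, go back and look at the following line too
  ta
  # otherwise print the finished first line and restart with the remainder
  P
  D
}')
printf '%s\n' "$joined"
```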

sed — joining a range of selected lines

I'm a beginner to sed. I know that it's possible to apply a command (or a set of commands) to a certain range of lines like so
sed '/[begin]/,/[end]/ [some command]'
where [begin] is a regular expression that designates the beginning line of the range and [end] is a regular expression that designates the ending line of the range (but is included in the range).
I'm trying to use this to specify a range of lines in a file and join them all into one line. Here's my best try, which didn't work:
sed '/[begin]/,/[end]/ {
N
s/\n//
}
'
I'm able to select the set of lines I want without any problem, but I just can't seem to merge them all into one line. If anyone could point me in the right direction, I would be really grateful.
One way using GNU sed:
sed -n '/begin/,/end/ { H;g; s/^\n//; /end/s/\n/ /gp }' file.txt
This is straightforward if you only want to select some lines and join them. Use Steve's answer or my pipe-to-tr alternative:
sed -n '/begin/,/end/p' | tr -d '\n'
It becomes a bit trickier if you want to keep the other lines as well. Here is how I would do it (with GNU sed):
join.sed
/\[begin\]/ {
:a
/\[end\]/! { N; ba }
s/\n/ /g
}
So the logic here is:
When [begin] line is encountered start collecting lines into pattern space with a loop.
When [end] is found stop collecting and join the lines.
Example:
seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | sed -f join.sed
Output:
1
2
[begin] 3 4 5 6 [end]
7
8
9
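The collect-then-join logic can also be sketched in awk, which keeps the lines outside the range as well. The function name join_range is hypothetical, and the sketch assumes the literal markers [begin] and [end] as in join.sed.

```shell
# Hypothetical helper: join every [begin]..[end] range into one line,
# passing all other lines through unchanged.
join_range() {
    awk '
    /\[begin\]/ { collecting = 1; buf = "" }
    collecting {
        buf = (buf == "" ? $0 : buf " " $0)     # append with a single space
        if (/\[end\]/) { print buf; collecting = 0 }
        next
    }
    { print }
    END { if (collecting) print buf }           # range never closed: flush it
    '
}

out=$(printf '%s\n' 1 2 '[begin]' 3 4 '[end]' 5 | join_range)
printf '%s\n' "$out"
```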
I like your question. I also like Sed. Regrettably, I do not know how to answer your question in Sed; so, like you, I am watching here for the answer.
Since no Sed answer has yet appeared here, here is how to do it in Perl:
perl -we 'my $flag = 0; while (<>) { chomp; if (/[begin]/) {$flag = 1;} print if $flag; if (/[end]/) {print "\n" if $flag; $flag = 0;} } print "\n" if $flag;'

divide each line in equal part

I would be happy if anyone can suggest me command (sed or AWK one line command) to divide each line of file in equal number of part. For example divide each line in 4 part.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed (note that it splits each line into chunks of four characters rather than into four equal parts):
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a POSIX-compliant awk you can omit --posix. Older versions of GNU awk need --posix (or --re-interval) to enable interval expressions such as {5}; gawk 4.0 and later enable them by default. Since gawk seems to be the most commonly used implementation, I've given the solution in terms of gawk.
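A sketch that avoids interval expressions altogether, and so should run in any POSIX awk, is to cut the line with substr. The function name split_parts is hypothetical; the chunk width is the line length divided by the number of parts, rounded up.

```shell
# Hypothetical helper: split each input line into PARTS space-separated pieces.
split_parts() {
    awk -v parts="$1" '
    {
        len = length($0)
        width = int((len + parts - 1) / parts)    # ceiling of len / parts
        out = ""
        for (pos = 1; pos <= len; pos += width)
            out = out (pos > 1 ? " " : "") substr($0, pos, width)
        print out
    }'
}

chunks=$(echo "ATGCATHLMNPHLNTPLML" | split_parts 4)
printf '%s\n' "$chunks"
```

For the 19-character sample the width is 5, so the last piece holds the remaining 4 characters.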
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a define the loop label a
/^\n/bb if the PS now starts with a newline we are done and branch to label b
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the line length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

sed, awk or perl: Pattern range match, print 45 lines then add record delimiter

I have a file containing records delimited by the pattern /#matchee/. These records are of varying lengths ...say 45 - 75 lines. They need to ALL be 45 lines and still maintain the record delimiter. Records can be from different departments, department name is on line 2 following a blank line. So record delimiter could be thought of as simply /^#matchee/ or /^matchee/ followed by \n. There is a Deluxe edition of this problem and a Walmart edition ...
DELUXE EDITION
Pull each record by pattern range so I can sort records by department. Eg., with sed
sed -n '/^DEPARTMENT NAME/,/^#matchee/{p;}' mess-o-records.txt
Then, Print only the first 45 lines of each record in the file to conform to
the 45 line constraint.
Finally, make sure the result still has the record delimiter on line 45.
WALMART EDITION
Same as above, but instead of using a range, just use the record delimiter.
STATUS
My attempt at this might clarify what I'm trying to do.
sed -n -e '/^DEPARTMENT-A/,/^#matchee/{p;}' -e '45q' -e '$s/.*/#matchee/' mess-o-records.txt
This doesn't work, of course, because sed is operating on the entire file at each command.
I need it to operate on each range match not the whole file.
SAMPLE INPUT - 80 Lines ( truncated for space )
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
<way too many lines here>
#matchee
SAMPLE OUTPUT - now only 45 lines
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<Record now equals exactly 45 lines>
<yet record delimiter is maintained>
#matchee
CLARIFICATION UPDATE
I will never need more than the first 40 lines if this makes things easier. Maybe the process would be:
Match pattern(s)
Print first 40 lines.
Pad to appropriate length. Eg., 45 lines.
Tack delimiter back on. Eg., #matchee
I think this would be more flexible -- Ie., can handle record shorter than 45 lines.
Here's a riff based on @Borodin's Perl example below:
my $count = 0;
$/ = "#matchee";
while (<>) {
if (/^REDUNDANCY.*DEPT/) {
print;
$count = 0;
}
else {
print if $count++ < 40;
print "\r\n" x 5;
print "#matchee\r\n";
}
}
This adds 5 newlines to each record + the delimiting pattern /#matchee/. So it's wrong -- but it illustrates what I want.
Print 40 lines based on department -- pad -- tack delimiter back on.
I think I understand what you want. Not sure about the bit about pull each record by pattern range. Is #matchee always followed by a blank line and then the department line? So in fact record number 2?
This Perl fragment does what I understand you need.
If you prefer you can put the input file on the command line and drop the open call. Then the loop would have to be while (<>) { ... }.
Let us know if this is right so far, and what more you need from it.
use strict;
use warnings;
open my $fh, '<', 'mess-o-records.txt' or die $!;
my $count = 0;
while (<$fh>) {
if (/^#matchee/) {
print;
$count = 0;
}
else {
print if $count++ < 45;
}
}
I know this has already had an accepted answer, but I figured I'd post an awk example for anyone interested. It's not 100%, but it gets the job done.
Note: this numbers the lines so you can verify the script is working as expected. Remove the i, from print i, current[i] to remove the line numbers.
dep.awk
BEGIN { RS = "#matchee\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= 45; i++) {
print i, current[i];
}
print "#matchee\n"
}
In this example, you begin the script by setting the record separator (RS) to "#matchee\n\n". There are two newlines because the first ends the line on which #matchee occurs and the second is the blank line on its own.
The pattern checks that a record contains at least one letter or digit before treating it as valid. You could also check that the record starts with 'DEPARTMENT-', but this would fail if there is a stray newline, so checking the content is the safest route. Because this uses a block record (i.e., DEPARTMENT-A through #matchee), you could either pass $0 through awk or sed again, or use the awk split function and loop through 45 lines. Note that arrays filled by split in awk are indexed from 1, not 0.
The print function includes a newline, so the block ends with print "#matchee\n" only instead of the double \n in the record separator variable.
You could also drop the same awk script into a bash script and make the number of lines and the record delimiter parameters. Of course, you should add validations and whatnot, but here's the start:
dep.sh
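#!/bin/bash
# prints the first n lines within every block of text delimited by splitter
# usage: dep.sh splitter numlines inputfile
splitter=$1
numlines=$2
# pass the shell values in with -v instead of splicing them into the program
awk -v splitter="$splitter" -v numlines="$numlines" 'BEGIN { RS = splitter "\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= numlines; i++) {
print i, current[i]
}
print splitter "\n"
}' "$3"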
Make the script executable and run it.
./dep.sh '#matchee' 45 input.txt > output.txt
I added these files to a gist so you could also verify the output
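The steps from the question's clarification (keep the first 40 lines, pad to 45, put the delimiter back) can also be sketched line by line, without a multi-character RS, which not every awk supports. This is a sketch under assumptions: the function name fit_records is hypothetical, the delimiter is expected to be a whole line on its own, and keep must be smaller than total.

```shell
# Hypothetical helper: fit_records KEEP TOTAL DELIM
# Keeps at most KEEP lines of each record, pads with blank lines, and puts
# DELIM on line TOTAL of every record.
fit_records() {
    awk -v keep="$1" -v total="$2" -v delim="$3" '
    $0 == delim {
        for (; n < total - 1; n++) print ""   # pad short or truncated records
        print delim
        n = 0
        next
    }
    n < keep { print; n++ }                   # lines beyond KEEP are dropped
    '
}

# Small demo: keep 2 lines, make every record 4 lines long.
fitted=$(printf '%s\n' a b c d '#m' e '#m' | fit_records 2 4 '#m')
printf '%s\n' "$fitted"
```

For the file in the question the call would be fit_records 40 45 '#matchee' < mess-o-records.txt.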
This might work for you:
D="DEPARTMENT-A" M="#matchee"
sed '/'"$D/,/$M"'/{/'"$D"'/{h;d};H;/'"$M"'/{x;:a;s/\n/&'"$M"'/45;tb;s/'"$M"'/\n&/;ta;:b;s/\('"$M"'\).*/\1/;p};d}' file
Explanation:
Focus on range of lines /DEPARTMENT/,/#matchee/
At start of range move pattern space (PS) to hold space (HS) and delete PS /DEPARTMENT/{h;d}
All subsequent lines in the range append to HS and delete H....;d
At end of range:/#matchee/
Swap to HS x
Test for 45 lines in range and if successful append #matchee at the 45th line s/\n/&#matchee/45
If previous substitution was successful branch to label b. tb
If previous substitution was unsuccessful insert a linefeed before #matchee s/'"$M"'/\n&/ thus lengthening a short record to 45 lines.
Branch to label a and test for 45 lines etc . ta
Replace everything from the first occurrence of #matchee to the end of the pattern space by #matchee itself. s/\('"$M"'\).*/\1/ thus shortening a long record to 45 lines.
Print the range of records. p
All non-range records pass through untouched.
TXR Solution ( http://www.nongnu.org/txr )
For illustration purposes using the fake data, I shorten the requirement from 40 lines to 12 lines. We find records beginning with a department name, delimited by #matchee. We dump them, chopped to no more than 12 lines, with #matchee added again.
@(collect)
@ (all)
@dept
@ (and)
@ (collect)
@line
@ (until)
#matchee
@ (end)
@ (end)
@(end)
@(output)
@ (repeat)
@{line[0..12] "\n"}
#matchee
@ (end)
@(end)
Here, the dept variable is expected to come from a -D command line option, but of course the code can be changed to accept it as an argument and put out a usage if it is missing.
Run on the sample data:
$ txr -Ddept=DEPARTMENT-A trim-extract.txr mess-o-records.txt
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
#matchee
The blank lines before DEPARTMENT-A are gone, and there are exactly 12 lines, which happen to include one line of the <too many ...> junk.
Note that the semantics of @(until) is such that the #matchee is excluded from the collected material. So it is correct to unconditionally add it in the @(output) clause. This program will work even if a record happens to be shorter than 12 lines before #matchee is found.
It will not match a record if #matchee is not found.