Count the number of occurrences of a string using sed? - sed

I have a file which contains "title" written in it many times. How can I find the number of times "title" is written in that file using the sed command provided that "title" is the first string in a line? e.g.
# title
title
title
should output the count = 2 because in first line title is not the first string.
Update
I used awk to find the total number of occurrences as:
awk '$1 ~ /title/ {++c} END {print c}' FS=: myFile.txt
But how can I tell awk to count only those lines having title the first string as explained in example above?

Never say never. Pure sed (although it may require the GNU version).
#!/bin/sed -nf
# based on a script from the sed info file (info sed)
# section 4.8 Numbering Non-blank Lines (cat -b)
# modified to count lines that begin with "title"
/^title/! be
x
/^$/ s/^.*$/0/
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
:e
$ {x;p}
Explanation:
#!/bin/sed -nf
# run sed without printing output by default (-n)
# using the following file as the sed script (-f)
/^title/! be # if the current line doesn't begin with "title" branch to label e
x # swap the counter from hold space into pattern space
/^$/ s/^.*$/0/ # if pattern space is empty start the counter at zero
/^9*$/ s/^/0/ # if pattern space starts with a nine, prepend a zero
s/.9*$/x&/ # mark the position of the last digit before a sequence of nines (if any)
h # copy the marked counter to hold space
s/^.*x// # delete everything before the marker
y/0123456789/1234567890/ # increment the digits that were after the mark
x # swap pattern space and hold space
s/x.*$// # delete everything after the marker leaving the leading digits
G # append hold space to pattern space
s/\n// # remove the newline, leaving all the digits concatenated
h # save the counter into hold space
:e # label e
$ {x;p} # if this is the last line of input, swap in the counter and print it
Here are excerpts from a trace of the script using sedsed:
$ echo -e 'title\ntitle\nfoo\ntitle\nbar\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle\ntitle' | sedsed-1.0 -d -f ./counter
PATT:title$
HOLD:$
COMM:/^title/ !b e
COMM:x
PATT:$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:0$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:0$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x0$
HOLD:title$
COMM:h
PATT:x0$
HOLD:x0$
COMM:s/^.*x//
PATT:0$
HOLD:x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:x0$
COMM:x
PATT:x0$
HOLD:1$
COMM:s/x.*$//
PATT:$
HOLD:1$
COMM:G
PATT:\n1$
HOLD:1$
COMM:s/\n//
PATT:1$
HOLD:1$
COMM:h
PATT:1$
HOLD:1$
COMM::e
COMM:$ {
PATT:1$
HOLD:1$
PATT:title$
HOLD:1$
COMM:/^title/ !b e
COMM:x
PATT:1$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:1$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:1$
HOLD:title$
COMM:s/.9*$/x&/
PATT:x1$
HOLD:title$
COMM:h
PATT:x1$
HOLD:x1$
COMM:s/^.*x//
PATT:1$
HOLD:x1$
COMM:y/0123456789/1234567890/
PATT:2$
HOLD:x1$
COMM:x
PATT:x1$
HOLD:2$
COMM:s/x.*$//
PATT:$
HOLD:2$
COMM:G
PATT:\n2$
HOLD:2$
COMM:s/\n//
PATT:2$
HOLD:2$
COMM:h
PATT:2$
HOLD:2$
COMM::e
COMM:$ {
PATT:2$
HOLD:2$
PATT:foo$
HOLD:2$
COMM:/^title/ !b e
COMM:$ {
PATT:foo$
HOLD:2$
. . .
PATT:10$
HOLD:10$
PATT:title$
HOLD:10$
COMM:/^title/ !b e
COMM:x
PATT:10$
HOLD:title$
COMM:/^$/ s/^.*$/0/
PATT:10$
HOLD:title$
COMM:/^9*$/ s/^/0/
PATT:10$
HOLD:title$
COMM:s/.9*$/x&/
PATT:1x0$
HOLD:title$
COMM:h
PATT:1x0$
HOLD:1x0$
COMM:s/^.*x//
PATT:0$
HOLD:1x0$
COMM:y/0123456789/1234567890/
PATT:1$
HOLD:1x0$
COMM:x
PATT:1x0$
HOLD:1$
COMM:s/x.*$//
PATT:1$
HOLD:1$
COMM:G
PATT:1\n1$
HOLD:1$
COMM:s/\n//
PATT:11$
HOLD:1$
COMM:h
PATT:11$
HOLD:11$
COMM::e
COMM:$ {
COMM:x
PATT:11$
HOLD:11$
COMM:p
11
PATT:11$
HOLD:11$
COMM:}
PATT:11$
HOLD:11$
The ellipsis represents lines of output I omitted here. The line with "11" on it by itself is where the final count is output. That's the only output you'd get when the sedsed debugger isn't being used.

Revised answer
Succinctly, you can't - sed is not the correct tool for the job (it cannot count).
sed -n '/^title/p' file | grep -c
This looks for lines starting title and prints them, feeding the output into grep to count them. Or, equivalently:
grep -c '^title' file
Original answer - before the question was edited
Succinctly, you can't - it is not the correct tool for the job.
grep -c title file
sed -n /title/p file | wc -l
The second uses sed as a surrogate for grep and sends the output to 'wc' to count lines. Both count the number of lines containing 'title', rather than the number of occurrences of title.
You could fix that with something like:
cat file |
tr ' ' '\n' |
grep -c title
The 'tr' command converts blanks into newlines, thus putting each space separated word on its own line, and therefore grep only gets to count lines containing the word title. That works unless you have sequences such as 'title-entitlement' where there's no space separating the two occurrences of title.

I don't think sed would be appropriate, unless you use it in a pipeline to convert your file so that the word you need appears on separate lines, and then use grep -c to count the occurrences.
I like Jonathan's idea of using tr to convert spaces to newlines. The beauty of this method is that successive spaces get converted to multiple blank lines but it doesn't matter because grep will be able to count just the lines with the single word 'title'.

just one gawk command will do. Don't use grep -c because it only counts line with "title" in it, regardless of how many "title"s there are in the line.
$ more file
# title
# title
one
two
#title
title title
three
title junk title
title
four
fivetitlesixtitle
last
$ awk '!/^#.*title/{m=gsub("title","");total+=m}END{print "total: "total}' file
total: 7
if you just want "title" as the first string, use "==" instead of ~
awk '$1 == "title"{++c}END{print c}' file

sed 's/title/title\n/g' file | grep -c title

This might work for you:
sed '/^title/!d' file | sed -n '$='

Related

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between ### .
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2
Here is a single awk command to achieve this that makes ### field separator and prints each even numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex - ###(.*?)### - matches ###, then captures into Group 1 any zero o more chars other than line break chars, as few as possible, and ### then matches ###.
o1 option will output Group 1 value only.
See the regex demo online.
sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
Replacing "###" with newline, D, then conditionally branching to P if a second replacement of "###" is successful.
This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ###'s by newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar' (and ; so using awk $25=='foobar' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what #shellter and #Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] only allows ;, ' and in front and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the line with foobar as part of the 25 field.
awk '$25=="foobar"' input.txt
$25 25th filed
== equal to
"foobar"
Since no action spesified, print the complete line will be done, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file

How to print some free text in addition to SED extract

Well-known SED command to extract a first line and print to another file
sed -n '1 p' /p/raw.txt | cat >> /p/001.txt ;
gives an output in /p/001.txt like
John Doe
But how to modify this command above to add some free text and have, for example, the output like
Name: John Doe
Thanks for any hint to try.
You can do that in a single command (and no sub-shells):
sed 's/^/Name: /;q' /p/raw.txt >> /p/001.txt
This prefixes "Name: " in front of the first line, prints it, then quits so you don't process additional lines. Add a line number before the q to print all lines up to (and including) that number. The output is appended to /p/001.txt just like your original code.
If you want a range of lines:
sed -n '3,9{s/^/Name: /;p}9q' /p/raw.txt >> /p/001.txt
This reads from lines 3-9, performs the substitution, prints, then quits after line 9.
If you want specific lines, I recommend awk:
awk 'NR==3 || NR==9 { print "Name: " $0 } NR>=9 { exit }' /p/raw.txt >> /p/001.txt
This has two clauses. One says the number of record (line number) is either 3 or 9, in which case we print the prefix and the line. The other tells us to stop reading the file after the 9th record.
Here are two more commands to show how awk can act on just the first line(s) or a given range:
awk '{ print "Name: " $0 } NR >= 1 { exit }' /p/raw.txt >> /p/001.txt
awk '3 <= NR { print "Name: " $0 } NR >= 9 { exit }' /p/raw.txt >> /p/001.txt
It appears you're continuously building one file from the other. Consider:
tail -Fn0 /p/raw.txt |sed 's/^/Name: /' >> /p/001.txt
This will run continuously, adding only new entries (added after the command is run) to /p/001.txt
Perhaps you have lots of duplicates to resolve?
awk 'NR != FNR { $0 = "Name: " $0 } !s[$0]++' \
/p/001.txt /p/raw.txt > /tmp/001.txt && mv /tmp/001.txt /p/001.txt
This folds together the previously saved names with any new names, printing names only once (!s[$0]++ is true when s[$0] is zero (its default state), but after the evaluation, it increments to one, making it false on the second occurrence. When a bare clause has no action, the line is printed.) Because we're reading the output file, we need a temporary output. Upon its successful completion, we then move it atop the target output file.
printf "Name : %s\n" "$(sed -n '1p;q' /p/raw.txt)" >/p/001.txt
should do it. If sed is not a requirement do
echo -e "Name : $(sed -n '1p;q' /p/raw.txt)" >/p/001.txt
Note
The q option with the sed quits it without processing any more commands or input.
The -e option tells echo to interpret escape sequences. This is a peculiarity of bash shell.

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like all newlines starting with "_" to be placed as a second column to the line before
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed the loop is done creating a label (:a), and while not match last line ($!) append next one (N) and return to label a:
:a
$! {
N
b a
}
After this we have the whole file into memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

divide each line in equal part

I would be happy if anyone can suggest me command (sed or AWK one line command) to divide each line of file in equal number of part. For example divide each line in 4 part.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a posix compliant awk you can omit the --posix, but --posix is necessary for gnu awk and since that seems to be the most commonly used implementation I've given the solution in terms of gawk.
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the HS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop namespace
/^\n/bb if we reach a newline we are done and branch to the b namespace
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This work for any length of line, however is the line is not exactly divisible by 4 the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / 4" | bc)
cut_arg=$(paste -d- <(seq 1 $fw 19) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile