sed, xargs and stdbuf - how to get only first n matches of a pattern from a file

I have a file with patterns (1 line = 1 pattern) that I want to look for in a big text file - only one (or no) pattern will be found on each line of the infile. Once a match is found, I want to retrieve the characters immediately before the match. The first part is to acquire the patterns for sed:
cat patterns.txt | xargs -I '{}' sed -n 's/{}.*$//p' bigtext.txt
That works OK - the downside being that potentially I'll have hundreds of thousands of matches. I don't want/need all the matches - a fair representation of 1K hits would be enough. And here is where I struggle: I've read that in order to limit the number of hits from sed, I should use stdbuf (gstdbuf in my case) and pipe the whole thing through head. But I am not sure where to place the stdbuf command:
cat patterns.txt | xargs -I '{}' gstdbuf -oL -eL sed -n 's/{}.*$//p' bigtext.txt | head -n100
When I tried this, the process took as long as if it were running sed on the whole file and then taking the head of that output, whereas my wish is to stop searching after 100 or 1000 matches. Any ideas on the best way of accomplishing this?

Is the one-liner you have provided really what you wanted? Especially since you mention a fair sample. Because as it stands right now, it feeds patterns.txt into xargs... which will go ahead and invoke sed for each pattern individually, one after another. And the whole output of xargs is fed to head, which chops it off after n lines. In other words, your first pattern can already exhaust all the lines you wanted to see, even though the other patterns could have matched any number of times on lines occurring before the matches presented to you. A detailed example follows.
If I have patterns.txt of:
_Pat1
_Pat2
_Pat3
And bigtext.txt with:
1matchx_Pat1x
2matchx_Pat2x
2matchy_Pat2y
2matchz_Pat2z
3matchx_Pat3x
3matchy_Pat3y
3matchz_Pat3z
1matchy_Pat1y
1matchz_Pat1z
And I run your one-liner limited to five hits, I do not get this result (the first five matches for all three patterns as found in the file):
1matchx
2matchx
2matchy
2matchz
3matchx
But this (all 3 matches for _Pat1 plus 2 matches for _Pat2, after which I've run out of output lines):
1matchx
1matchy
1matchz
2matchx
2matchy
Now to your performance problem, which is partially related. I have to admit that I could not reproduce it. I've taken your example from the comment, blown the "big" file up to 1 GB in size by repeating the pattern, and run your one-liner:
$ time { cat patterns.txt | xargs -I '{}' stdbuf -oL sed -n 's/{}.*$//p' bigtext.txt | head -5 ; }
1aaa
2aaabbb
3aaaccc
1aaa
2aaabbb
xargs: stdbuf: terminated by signal 13
real 0m0.012s
user 0m0.013s
sys 0m0.008s
Note I've dropped the -eL; stderr is usually unbuffered (which is what you usually want) and doesn't really play any role here. Also note I've run stdbuf without the "g" prefix, which tells me you're probably on a system where the GNU tools are not the default... and that is probably also the reason why you get different behavior. I'll try to explain what is going on, venture a few guesses... and conclude with a suggestion. Also note, I really did not need to use stdbuf (manipulate buffering) at all, or rather it had no appreciable impact on the result; but again, this could be platform- and tool- (as well as scenario-) specific.
Reading the pipeline from its end: head reads standard input as it is being piped in from xargs (and, by extension, from the sed (or stdbuf-wrapped sed) runs which xargs forks - they are all attached to its writing end) until the limit of lines to print has been reached, and then head terminates. Doing so "breaks" the pipe, and xargs and sed (or the stdbuf it was wrapped in) receive a SIGPIPE signal, upon which, by default, they terminate as well (you can see that in the output of my run: xargs: stdbuf: terminated by signal 13).
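The same mechanics can be demonstrated in isolation (a minimal sketch; yes plays the role of the long-running sed):
# 'yes' would write "y" lines forever; head exits after 3 lines, the pipe
# breaks, and yes is killed by SIGPIPE (exit status 141 = 128 + 13)
yes | head -n 3
echo "${PIPESTATUS[0]}"    # bash-specific; prints 141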
What does stdbuf -oL do, and why might someone have suggested it? When not reading from/writing to a console (which would usually be line-buffered) but to pipes, I/O is usually fully buffered instead. stdbuf -oL changes that back to line buffering. Without it, the processes involved communicate in larger chunks, and it could take head longer to realize it is done and needs no further input, while sed keeps running to see if there are any further matches. As mentioned, on my system (4K buffers) and with that (repeating pattern) example, this made no real difference. Also note that while line buffering decreases the risk of not learning we are done, it increases the overhead involved in the communication between the processes.
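To make the buffering difference visible on its own (a sketch, with grep standing in for sed):
# writing to a pipe, grep is fully buffered by default: all five hits
# show up at once, only when grep exits after ~5 seconds
{ for i in 1 2 3 4 5; do echo "hit $i"; sleep 1; done; } | grep hit | cat
# with stdbuf -oL each hit is flushed as soon as its line is complete
{ for i in 1 2 3 4 5; do echo "hit $i"; sleep 1; done; } | stdbuf -oL grep hit | cat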
So why would these mechanics not yield the expected results for you? A couple of options come to mind:
Since you fork and run sed once per pattern, reading the whole file each time, you may get a series of several runs without any hits. I'd guess this is actually likely the case.
Since you give sed a file to read from, you may have a different implementation of sed that reads a lot more input before acting on the file content (mine reads 4K at a time). Not a likely cause, but in theory you could also feed sed line by line to force smaller chunks and get that SIGPIPE sooner (see the sketch below).
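A sketch of that line-by-line feeding (illustration only - the shell loop makes this slow; _Pat1 stands in for one of your patterns):
# each printf is a separate one-line write, so sed's input reads stay
# small and head's broken pipe is noticed much sooner
while IFS= read -r line; do printf '%s\n' "$line"; done < bigtext.txt |
  sed -n 's/_Pat1.*$//p' | head -n 100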
Now, assuming that sequential pattern-by-pattern matching is actually not desired, the summary of all of the above would be: process your patterns first into a single one, and then perform a single pass over the "big" file (optionally capping the output, of course). It might be worth switching from shell to something a bit more comfortable to use, or at least not keeping the one-liner format, which is likely to turn confusing.
Not true to my own recommendation, this awk script, called as follows, prints the first 5 hits and quits:
awk -v patts="$(cat patterns.txt)" -v last=5 'BEGIN{gsub(/\n/, "|", patts); patts="(" patts ")"; cnt=0;} $0~patts{sub(patts ".*", ""); print; cnt++;} cnt>=last{exit;}' bigtext.txt
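The same logic spread out for readability (functionally equivalent to the one-liner above):
awk -v patts="$(cat patterns.txt)" -v last=5 '
    BEGIN {
        gsub(/\n/, "|", patts)      # join the patterns into one alternation
        patts = "(" patts ")"
        cnt = 0
    }
    $0 ~ patts {
        sub(patts ".*", "")         # trim from the match to the end of the line
        print
        cnt++
    }
    cnt >= last { exit }            # stop after the requested number of hits
' bigtext.txt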

You can specify a file of patterns to match to the grep command with -f file. You can also specify the number of matches to find before quitting with -m count.
So this command will get you the first 5 lines that match:
grep -f patterns.txt -m 5 bigtext.txt
Now, trimming from the match to the end of the line is a bit more difficult.
Assuming you use bash, we can build a regex from the file, like this:
while IFS='' read -r line || [[ -n "$line" ]]; do
    subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
Then use this in a sed command. The resulting code becomes:
while IFS='' read -r line || [[ -n "$line" ]]; do
    subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
grep -f patterns.txt -m 5 bigtext.txt | sed "$subRegex"
The sed command is only running on the lines that have already matched from the grep, so it should be fairly performant.
Now, if you call this a lot, you could put it in a function:
function findMatches() {
    local matchCount=${1:-5} # default to 5 matches
    local subRegex
    while IFS='' read -r line || [[ -n "$line" ]]; do
        subRegex="s/$line.*//;"${subRegex}
    done < patterns.txt
    grep -f patterns.txt -m ${matchCount} bigtext.txt | sed "${subRegex}"
}
Then you can call it like this
findMatches 5
findMatches 100
Update
Based on the sample files you gave, this solution does produce the expected result:
1aaa
2aaabbb
3aaaccc
4aaa
5aaa
However, your comment says that each pattern is 120 characters long, each line of the big file is 250 characters, and the file is 10 GB in size. You didn't mention how many patterns you might have, so I tested, and it seems that the sed command built inline falls apart somewhere before 50 patterns.
(Of course, if your samples are really how the data look, then you could base the trimming of each line on non-AGCT characters rather than on the patterns file, which would be much quicker.)
But based on the original question, you can generate a sed script in a separate file from patterns.txt, like this:
sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
Then use this temp file with the sed command:
grep -f patterns.txt -m 5 bigtext.txt | sed -f temp.sed
The grep stops after finding X matches, and the sed trims those... The new function runs on my machine in a couple of seconds.
For testing, I created a 2 GB file of 250-character AGCT combos, and another file with 50+ patterns, 120 characters each, with a few of them taken from random lines of the bigtext file.
function findMatches() {
    sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
    grep -f patterns.txt -m ${1:-5} bigtext.txt | sed -f temp.sed
}

Related

Sed inside a while read loop

I have been reading a lot of questions and answers about using sed within a while loop. I think I have the command down correctly, but I seem to get no output once I put all of the pieces together. Can someone tell me what I am missing?
I have an input file with 700 variables, one on each line. I need to use each of these 700 variables within a sed command. I run the following command to verify variables are outputting correctly:
cat Input_File.txt | while read var; do echo $var; done
I then try to add in the sed command as follows:
cat Input_File.txt | while read var; do sed -n "/$var/,+10p" Multi-BLAST_5814.txt >> Multi_BLAST_Subset; done
This command leaves me with no error, just a blinking cursor, as if it were an infinite loop. It should take each of the 700 variables, find the corresponding line in Multi_BLAST_5814.txt, and output the search-term line plus the 10 lines after it into a new file, appending as it goes. I can execute the sed command alone with a manually set single-value variable successfully, and I can execute the while loop successfully using the input file. Anyone have a thought as to why this is not working?
User, that is exactly what I have done to this point.
I have a large text file (128 MB) with BLAST output. I need to search through it for a subset of results for 769 samples (out of the 5814 samples that are in the file).
I have created a .txt file with those 769 sample names.
To test grep and sed, I manually assigned a variable with one of the 769 sample names I need to search, and I can get the results I need as follows:
$ Otu="S41_Folmer_Otu96;size=12;"
$ grep $Otu -A 10 Multi_BLAST_5814.txt
OR
$ sed -n "/$Otu/,+10p" Multi_BLAST_5814.txt
The OUTPUT is exactly what I want as follows:
Query= S41_Folmer_Otu96;size=12;
Length=101
Sequences producing significant alignments: Score(Bits) E Value
gi|58397553|gb|AY830431.1| Scopelocheirus schellenbergi clone... 180 1E-41
gi|306447543|gb|HQ018876.1| Liposcelis paeta isolate CZ cytoc... 174 6E-40
gi|306447533|gb|HQ018871.1| Liposcelis decolor isolate CQ cyt... 104 9E-19
gi|1043259532|gb|KX130860.1| Batocera rufomaculata isolate Br... 99 4E-17
gi|987210821|gb|KR141076.1| Psocoptera sp. BOLD:ACO1391 vouch... 81 1E-11
To test that the input file contains the correct variables, I run the following:
$ cat Input_File.txt
$ while read Otu; do echo $Otu; done <Input_File.txt
S41_Folmer_Otu96;size=12;
S78_Folmer_Otu15;size=538;
S73_Leray_Otu52;size=6;
S66_Leray_Otu93;size=6;
S10_Folmer_Otu10;size=1612;
... All 769 variables
Again, this is exactly what I expect and is correct.
But when I do either of the following commands, nothing is printed to the screen (if I leave off the write-file/append action) or to the file I need to create.
$ cat Input_File.txt | while read Otu; do grep "$Otu" -A 10 Multi_BLAST_5814.txt >> Multi_BLAST_Subset.txt; done
$ cat Input_File.txt | while read Otu; do sed -n "/$Otu/,+10p" Multi_BLAST_5814.txt >> Multi_BLAST_Subset.txt; done
sed hangs and never closes, leaving me at a blinking cursor. grep finishes but also gives no output. I am at a loss as to why this is not working. Everything works individually, so I may be left with manually searching all 769 samples by copy/paste.
If you have access to GNU grep, there is no need for a sed command: grep "$var" -A 10 will do the same thing, and it won't break if $var contains the delimiter used in your sed command.
From man grep :
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
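As an aside (a sketch, assuming none of the sample names need regex escaping): since all 769 names are already in a file, a single grep with -f could replace the loop entirely:
# read every line of Input_File.txt as a pattern (-f) and print each
# matching line plus the 10 lines that follow it (-A 10)
grep -A 10 -f Input_File.txt Multi_BLAST_5814.txt > Multi_BLAST_Subset.txt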
Not sure whether you have already attempted it, but try breaking the problem into smaller chunks. A simple example below:
$ cat Input_File.txt
one
two
three
$
$ cat file.txt
This is line one
This is line two
This is line three
This is another four
This is another five
This is another six
This is another seven
$
$ cat Input_File.txt | while read var ; do echo $var ; sed -n "/$var/,+1p" file.txt ; done
one
This is line one
This is line two
two
This is line two
This is line three
three
This is line three
This is another four
$

Echo or preview text changed using sed -i

Using sed you can easily change text in multiple files, e.g.:
sed -i 's/cashtestUS/cheque_usd/g' *.xml
The problem is that this has tremendous power, and a complex regular expression could easily have unforeseen consequences.
Is there a simple way to do either:
1) Echo the changes made
2) Run sed in a preview mode, so that the potential changes can be previewed
Run in preview mode without the -i:
sed -e 's/cashtestUS/cheque_usd/g' *.xml
(The -e is not necessary; it just says the next argument is the sed script, or one part of the sed script.) This writes all the output to standard output. You'd probably pipe it through less (or more), or pass it through grep to see that the changed lines were those you expected. Or you might process one file at a time and run a difference:
for file in *.xml
do
    echo "$file"
    sed -e 's/cashtestUS/cheque_usd/g' "$file" | diff -u "$file" -
done
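Another lightweight preview (a sketch): have sed print only the lines on which the substitution would actually fire, using -n together with the p flag:
# -n suppresses the automatic printing; the trailing p prints a line
# only if the substitution actually changed it
sed -n 's/cashtestUS/cheque_usd/gp' *.xml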
Or …
sed has several debug/display actions:
= displays the current line number
l displays the current pattern space content with a $ at the end of the content
i and a can be used to show a trace, e.g. i\ followed by the trace text on the next line
if the hold buffer is not otherwise used, h;s/.*/Debug trace here/;l;g is useful and, unlike i or a, the trace does not appear at the end of the line's treatment
sample:
echo "line 1
and two" | sed ':a
=;h;s/.*/Before substitution/;g;l
s/..$/-/
=;l
t a'

sed with filename from pipe

In a folder I have many files with several parameters in their filenames, e.g. (just one parameter): file_a1.0.txt, file_a1.2.txt, etc.
These are generated by a C++ program, and I need to take the last one generated (in time); I don't know a priori what the value of this parameter will be when the program terminates. After that, I need to copy the 2nd line of this last file.
To copy the 2nd line of any file, I know that this sed command works:
sed -n 2p filename
I know also how to find the last generated file:
ls -rtl file_a*.txt | tail -1
Question:
how to combine these two operations? Certainly it is possible to pipe the 2nd operation into the sed operation, but I don't know how to pass the filename coming from the pipe as input to that sed command.
You can use this:
ls -rt1 file_a*.txt | tail -1 | xargs sed -n '2p'
(OR)
sed -n '2p' `ls -rt1 file_a*.txt | tail -1`
sed -n '2p' $(ls -rt1 file_a*.txt | tail -1)
Typically you can put a command in backticks to insert its output at a particular point in another command - so
sed -n 2p `ls -rt name*.txt | tail -1 `
Alternatively - and preferred, because it is easier to nest, etc.:
sed -n 2p $(ls -rt name*.txt | tail -1)
-r in ls means reverse order:
-r, --reverse
reverse order while sorting
But it is not a good idea to use it together with tail -1. With the change below (head -1, and no -r option to ls), performance will be better: you needn't wait for all files to be listed and then piped through to tail.
sed -n 2p $(ls -t1 name*.txt | head -1 )
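If the filenames might contain whitespace, parsing the output of ls is fragile; a loop over the glob avoids that (a bash sketch using the -nt "newer than" test):
# track the most recently modified file without parsing ls;
# [[ -nt ]] is also true while $newest is still empty/nonexistent
newest=
for f in file_a*.txt; do
    [[ "$f" -nt "$newest" ]] && newest=$f
done
sed -n '2p' "$newest"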
I was looking for a similar solution: taking the file names from a pipe of grep results to feed to sed. I've copied my answer here for the search & replace, but perhaps this example can help, as it calls sed for each of the names found in the pipe.
This command simply finds all the files:
grep -i -l -r foo ./*
This one excludes this_shell.sh (in case you put the command in a script called this_shell.sh), tees the output to the console so you can see what happened, and then runs sed on each file name found, replacing the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method because I didn't like having the timestamps changed on files that were not modified. Feeding in the grep result means only the files containing the target text are touched (which likely improves performance/speed as well).
Be sure to back up your files and test before using. This may not work in some environments for files with embedded spaces.
FWIW - I had some problems using the tail method; it seems that the entire dataset was generated before tail was called on just the last item.

Awk inside of qsub

I have a bash script in which I have a few qsubs. Each of them waits for a previous qsub to be done before starting.
My first qsub consists of sending files in a certain directory to a perl program and having the outfiles printed in a new directory. At the end, I echo the array with all my job names. This script works as intended.
mkdir -p /perl_files_dir
for ID_FILES in `ls Infiles_dir/*.txt`;
do
    JOB_ID=`echo "perl perl_scirpt.pl $ID_FILES" | qsub -j oe `
    JOB_ID_ARRAY="${JOB_ID_ARRAY}:$JOB_ID"
done
echo $JOB_ID_ARRAY
My second qsub is meant to sort all the files previously made with my perl script into a new outfile, and to start after all those jobs are done (about 100 jobs), with depend=afterany. Again, this part is working fine.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
SORT_ARRAY="${SORT_ARRAY}:$SORT_JOB"
My issue is that in my sorted file I have a few columns I wish to remove (2 to 6), so I came up with this last line using awk piped to sed, with another depend=afterany:
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
This last step creates final_file.txt, but leaves it empty. I added SED= before my echo because it would otherwise give me Command not found.
I tried without the pipe so it would just print everything. Unfortunately it prints nothing.
I assume it is not opening my sorted file and this is why my final file is empty after my sed. If it's the case, then why won't awk read it?
In my script, I am using variables to define my directories and files (with the correct paths). I know my issue is not about finding my files or directories, since they are perfectly defined at the beginning and used throughout the script. I tried writing out the whole path instead of using a variable and I get the same results.
for ID_FILES in `ls Infiles_dir/*.txt`
Simplify this to
for ID_FILES in Infiles_dir/*.txt
ls lists the files you pass it (except when you pass it directories, then it lists their content). Rather than telling it to display a list of files and parse the output, use the list of files you already have! This is more reliable (parsing the output of ls will fail if the file names contain whitespace or wildcard characters), clearer and faster. Don't parse the output of ls.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
You'd make your life simpler if you used the right form of quoting in the right place. Don't use backquotes, because it's difficult to know how to quote things inside. Use $(…) instead, it's exactly equivalent except that it is parsed in a sane way.
I recommend using a here document for the shell snippet that you're feeding to qsub. You have fewer quoting issues to worry about, and it's more readable.
While we're at it, always put double quotes around variable substitutions and command substitutions: "$some_variable", "$(some_command)". Annoyingly, $var in shell syntax doesn't mean “take the value of the variable var”, it means “take the value of the variable var, parse it as a list of wildcard patterns, and replace each pattern by the list of matching files if there are matching files”. This extra stuff is turned off if the substitution happens inside double quotes (or in a here document, by the way): "$var" means “take the value of the variable var”.
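A quick illustration of that difference (a sketch; run it in a directory containing .txt files):
var='*.txt'
echo $var      # unquoted: the value is glob-expanded into matching file names
echo "$var"    # quoted: prints the literal string *.txt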
SORT_JOB=$(qsub -j oe -W depend="afterany$JOB_ID_ARRAY" <<'EOF'
sort -m -n perl_files_dir/*.txt >>sorted_file.txt
EOF
)
We now get to the snippet where the quoting was actually causing a problem.
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
The string that becomes the argument to the echo command is:
awk '{$2=;$3=;$4=;$5=;$6=; print $0}' sorted_file.txt | sed 's/ //g' >final_file.txt
This is syntactically incorrect, and that's why you're not getting any output.
You didn't escape the double quotes inside what was meant to be the awk snippet. It's a lot clearer if you use a here document. Also, you don't need the SED= part. You added it because you had a command substitution (a command between backquotes), which substitutes the output of a command. But since you aren't interested in the output of the qsub command, don't take its output, just execute it.
qsub -j oe -W depend="afterany$SORT_ARRAY" <<'EOF'
awk '{$2="";$3="";$4="";$5="";$6=""; print $0}' sorted_file.txt |
sed 's/ //g' >final_file.txt
EOF
I'm not familiar with qsub, but presumably there's a way to get the error output and the return status of the commands it runs. Inspect that error output, you should have seen the errors from awk.
The version of awk that I am using does not like the character escapes:
awk --version
GNU Awk 3.1.7
spuder@cent64$ awk '{\$2="";\$3="";\$4=""; print \$0}' foo.txt
awk: {\$2="";\$3="";\$4=""; print \$0}
awk: ^ backslash not last character on line
Try the following syntax:
awk '{for(i=2;i<=6;i++) $i="";print}' foo.txt
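If the fields are separated by single spaces (as the sorted file here appears to be), a cut sketch drops the same columns without awk - though not byte-identically, since the awk|sed pipeline above also strips the remaining spaces:
# keep field 1 and fields 7 onward, dropping columns 2-6
cut -d' ' -f1,7- sorted_file.txt > final_file.txt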
As a side note, if you are using Torque 4.x you may not be able to use a comma-separated list of jobs with -W depend=; instead, you may need to create a new PBS declarative (-W) for each job, e.g.:
#Invalid syntax in newer versions of torque
qsub -W depend=foo,bar
Resources
backslash in gawk fields
Print all but the first three columns
http://docs.adaptivecomputing.com/torque/help.htm#topics/commands/qsub.htm#-W

Filter function for less +F

When watching a growing log file with e.g. "less -iS +F service.log", I want to limit the display to lines matching a certain pattern.
I tried something like
less +F service.log | grep <pattern> | less +F
which doesn't work. Also
cat < service.log | grep <pattern> | less +F
doesn't do what I want. It looks like the input is already closed and
less doesn't show changes.
How can I limit the display to lines matching a certain pattern?
This question is ages old, but I still think it's worth adding a solution.
Instead of trying to grep first and then use less, what about using filtering inside less?
In brief:
use less +F on your file
CTRL-C to temporarily break the "following" action
Type & and your pattern to enable filtering
Press F (shift-f) to re-enable the "following" action
More details on this answer on the Unix&Linux StackExchange
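As a compact recap of those keystrokes (assuming a reasonably recent less):
# inside less, while viewing the log:
#   Ctrl-C     stop following (interrupts the +F "waiting for data" state)
#   &pattern   display only the lines matching pattern (& alone clears the filter)
#   F          resume following, now restricted to the filtered lines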
I haven't yet worked out how to do this without a temp file, but here is a script that demonstrates a functional grep-filtered less +F (which cleans up its temp file). I call it lessf.
One of the key elements is the --line-buffered argument to grep, which permits the tail output to continue flowing through the pipeline (the unbuffer command that ships with expect provides similar functionality for any program).
#!/bin/sh
LOGFILE=$1
shift
PATTERN=$*    # the remaining arguments become the grep pattern (and options)
TEMP_DIR=/tmp/lesstmp
TEMP_FILE="$TEMP_DIR/$(basename "$LOGFILE")"
[ ! -d "$TEMP_DIR" ] && mkdir "$TEMP_DIR"
trap 'rm -rf "$TEMP_DIR"; exit' INT TERM EXIT
# the grep-filtered tail writes the temp file while less follows it in parallel;
# $PATTERN is deliberately unquoted so options like "-v nobar" stay separate words
( tail -f "$LOGFILE" | grep --line-buffered $PATTERN ) > "$TEMP_FILE" | less +F "$TEMP_FILE"
trap - INT TERM EXIT
Example usage:
lessf /var/log/system.log foobar
lessf /var/log/system.log -v nobar
If you don't mind spawning and tearing down a couple of processes for each line, use a while read loop:
tail -f filename.log | while read -r line; do echo "$line" | grep pattern; done
tail -f service.log | grep <pattern>
The solution seemed to be simply
LESSOPEN='|grep <pattern> %s' less +F service.log
but less doesn't continue to read new lines from the growing log file.