There are 84 patterns that need to be checked; I store them in a file named pattern.txt.
Is the silver searcher (also known as ag) able to read these patterns from pattern.txt?
grep has the -f option to read patterns from a file, but the silver searcher's man page mentions nothing similar.
No, there isn't a similar -f option in ag. The simple approach is to use a loop to pass the patterns to ag; for instance you could use a while loop to read the patterns like this:
while read -r pattern; do ag "$pattern" -G '\.txt$'; done < pattern.txt
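If the patterns are fixed strings rather than regexes, ag's -Q (--literal) flag avoids regex metacharacter surprises; a minimal variant of the same loop:
while IFS= read -r pattern; do
    ag -Q "$pattern" -G '\.txt$'    # -Q treats each line of pattern.txt as a literal string
done < pattern.txt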
I suggest the faster approach of using GNU parallel with ag. Parallel and ag work very well together:
parallel 'ag --filename --parallel --color "{}"' < pattern.txt
Here, I'm passing each pattern to parallel which in turn spawns a number of ag processes which search for their own pattern matches. Parallel is somewhat smart about how many processes to start, but you can tweak it to your heart's content (https://www.gnu.org/software/parallel/man.html). In short, you'll rip through your 84 patterns far faster with parallelization.
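For example, to cap it at four concurrent ag processes (a sketch; tune -j to your machine):
parallel -j4 'ag --filename --parallel --color "{}"' < pattern.txt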
Joining the lines in the pattern file to create a regex group:
ag "($(paste -sd "|" pattern.txt))" .
parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a space-delimited file where the first column is the pattern and the second column is the replacement.
The problem is that after I ran the command, not all patterns were replaced in file. Then I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it will work as expected (but much slower).
Is there a necessary parameter missing from my command?
Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU bound, but this one is I/O bound; you are simply creating congestion by having several processes fight over access to the disk, and in this case also over write access back to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
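A rough sketch of that length-sort hack, assuming the patterns contain no tabs: prefix each line of input with the pattern's length, sort descending, and strip the prefix again before generating the sed script (input.sorted is just an illustrative name):
awk '{ print length($1) "\t" $0 }' input | sort -rn | cut -f2- > input.sorted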
Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or you must set $TMPDIR to a directory that does.
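For example (the path is made up; point TMPDIR at any filesystem with enough room):
TMPDIR=/big/scratch parallel --pipepart -a file -k sed -f bigsed > replaced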
I know how to run a script from the shell on all files of a given type. I do:
for filePath in path/*.extension; do
    script.py "$filePath"
done
Currently I have about nine pairs of files with the same extension and very similar base names (think xxx_R1 and xxx_R2). I have a script I want to run that takes in pairs of files. How can I run a script on all those pairs using shell?
I would list the files matching one pattern, strip off the suffix to form a list of the "base" names, then re-append both suffixes. Something like this:
for base in $(ls *_R1 | sed 's/_R1$//')
do
    f1=${base}_R1
    f2=${base}_R2
    script2.py $f1 $f2
done
Alternatively, you could accomplish the same thing by letting sed do the selection as well as the stripping:
for base in $(ls | sed -n '/_R1$/s///p')
...
Both of these are somewhat simplistic, and can fall down if you have files with "funny" names, such as embedded spaces. If that's a possibility for you, you can use some more sophisticated (albeit less obvious) techniques to get around them. Several are mentioned in links @tripleee has posted. An incremental improvement I like to use, to avoid the improper word splitting that a for ... in loop can do, is to use while and read instead:
ls | sed -n '/_R1$/s///p' | while read base
do
    f1=${base}_R1
    f2=${base}_R2
    script2.py "$f1" "$f2"
done
This still isn't perfect, and will fall down if you happen to have a file with a newline in its name (although personally, I believe that if you have a file with a newline in its name, you deserve whatever miseries befall you :-) ).
Again, if you want something perfect, see the links posted elsewhere.
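A sturdier approach along those lines is to skip ls entirely and let the shell glob do the matching, so spaces and most other odd characters survive intact; a sketch, reusing script2.py from above:
for f1 in ./*_R1; do
    f2=${f1%_R1}_R2            # derive the partner's name from the _R1 file
    [ -e "$f2" ] || continue   # skip any file without an _R2 partner
    script2.py "$f1" "$f2"
done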
[Mac OS]
It seems that sed requires an input file, and that I cannot pipe grep to it. Although sed can match like grep does, having it handle both the find and the replace can make the sed expression very complex.
For example, if I wanted to remove the 3rd word of every line that started with 'T', it's much more convenient to separate the find/replace commands than to create a complex regex.
Looking through SO answers, there doesn't seem to be an elegant solution where you can pipe grep to sed without new files being involved. I did find this, which almost does what I want:
sed -i "s/$(grep 'old' input.txt)/new/g" input.txt
But it doesn't handle multiple matches well.
I'll generalize:
Is there a better way to find specific lines in a text file and modify those lines in-place? Preferably cli, or as low-level as possible.
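For the concrete example above, a sed address plus a substitution does the finding and the replacing in one pass; a sketch assuming whitespace-separated words (BSD/macOS sed needs the empty argument after -i):
# on lines starting with T, delete the third whitespace-delimited word, in place
sed -E -i '' '/^T/ s/^(([^[:space:]]+[[:space:]]+){2})[^[:space:]]+[[:space:]]*/\1/' input.txt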
I have extracted log files from servers based on my date and time requirements, and after extraction the file has hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a, b, c, d, e, f, g, etc.
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script that profiles this log file in a while loop, grepping for the different parameters a, b, c, d, e: if a request contains the respective parameter, output its value, or else TRUE or FALSE.
while read line ; do
    echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
    echo -n -e "\t" >> output.txt
    echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
    echo -n -e "\t" >> output.txt
    echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'>> output.txt
    echo -n -e "\t" >> output.txt
    echo " " >> output.txt
done < queries.csv
My question is that under Cygwin this takes a lot of time (an hour or so) to run over a log file containing 70k-80k requests. Is there a better way to write this script so that it executes as fast as possible? I'm okay with Perl too, but my concern is that the script stay flexible enough to extract the parameters.
As @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains just by switching from
while read line; do
    something >>file
    something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
    something
    something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As also suggested by @reinerpost, not hard-coding the output file would also make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
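A rough sketch of those changes applied to the original loop (read -r, one printf per line, no per-statement redirection); it still spawns three sed processes per line, which the Awk rewrite below avoids entirely:
while IFS= read -r line; do
    xyz_ms=$(sed 's/^.*XYZ:/ /;s/ms.*//' <<<"$line")
    xyz_abc=$(sed 's/^.*XYZ:/ /;s/ABC.*//' <<<"$line")
    q=$(sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//' <<<"$line")
    printf '%s\t%s\t%s\t \n' "$xyz_ms" "$xyz_abc" "$q"
done < queries.csv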
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
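A rough illustration of that hold-space shuffle (GNU sed syntax), extracting just the first two fields tab-separated; repeat the append-and-substitute step for each additional field:
sed '
  h                           # keep a copy of the whole input line in the hold space
  s/^.*XYZ:/ /; s/ms.*//      # pattern space now holds field 1
  G                           # append the saved original line after field 1
  s/\n.*XYZ:/\t/; s/ABC.*//   # reduce the appended copy to field 2, tab-separated
' queries.csv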
Because your earlier script was somewhat more involved, I suggest using Awk instead. Here is a crude skeleton for doing the things you seem to want to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
q = $0; sub("^.*[?]q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
# ....
# Demonstration of how to count something
n = split($0, _, "&"); ampersand_count = n-1;
# ...
# Done: Now print
print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big savings here is avoiding to spawn a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.
Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed in every statement, which is extremely inefficient.
You hardcode the name of the output file in your script. This is bad practice. Your script will be much more versatile if it just writes its output to stdout. Leave it to the caller to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make the script more concise and more readable (once you know the language); however, if you don't know awk yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language: it makes you far more flexible, can be even more readable (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and will probably be a lot faster, because perl compiles scripts before running them. If you'd rather invest your effort in a more popular language than Perl, try Python.
I am developing a Tomcat application and would like to be able to search for specific things and highlight them when viewing the log. I want something like an alias that takes a parameter (regex) as input and highlights the matching string.
So far, I've figured out that this works, but it's not practical to have to change a small part of it every time I want something new:
tail -n 100 -f /opt/apache-tomcat-6.0.26/logs/catalina.out | perl -pe 's/null/\e[1;31m$&\e[0m/g'
This is what I thought would work:
logColor(){
    x="'s/"
    y="/\e[1;31m$&\e[0m/g'"
    tail -n 100 -f /opt/apache-tomcat-6.0.26/logs/catalina.out | perl -pe $x$1$y
}
alias logC=logColor
I've tested that this prints out the same two lines:
logColorTest(){
    x="'s/"
    y="/\e[1;31m$&\e[0m/g'"
    echo $x$1$y
    echo "'s/null/\e[1;31m$&\e[0m/g'"
}
alias logCT=logColorTest
logCT null
So I am lost on why this does not work and would appreciate input from someone who knows how this works :)
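A likely explanation, with a hedged sketch of a fix: the single quotes stored in x and y reach perl as literal characters (quotes only act as shell syntax when typed directly, not when they come out of a variable expansion), so perl sees a mangled expression. Interpolating the pattern inside one double-quoted argument, and escaping $& so the shell leaves it for perl, should then behave like the hard-coded command above:
logColor(){
    # $1 is the regex to highlight; \$& is left for perl, and \e stays literal inside double quotes
    tail -n 100 -f /opt/apache-tomcat-6.0.26/logs/catalina.out | perl -pe "s/$1/\e[1;31m\$&\e[0m/g"
}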
The problem with grep is that you get only the matching lines; other lines are filtered out. (That's what grep is supposed to do, anyway.) Many times, however, we need all the output, with some particular strings highlighted.
I have this small bash function in my .bashrc for such requirement:
mark ()
{
local searchExpr=${1/\//\\\/};
sed "s/$searchExpr/"`echo -n -e "\e[91;1m"`'&'`echo -n -e "\e[0m"`'/gi' $2
}
Usage:
command | mark some_string # OR
mark some_string some_file
Rename to suitable function name if required.
NOTE: There is a great command called highlight. Hence I could not use that as my function name.
As @fedorqui pointed out, you can use grep to do this:
grep --colour 'null\|$'
This will match and highlight null or the end of a line, meaning all lines are shown.
Using the GREP_COLORS environment variable, you can control how different parts are highlighted, e.g. to mark matched text in yellow:
export GREP_COLORS='ms=1;33'
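Putting the two together for the tail use case above (a sketch):
export GREP_COLORS='ms=1;33'
tail -n 100 -f /opt/apache-tomcat-6.0.26/logs/catalina.out | grep --colour 'null\|$'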