Perl / xargs terrible performance with xargs -n1/-i

I have a little perl one-liner I wrote:
find . -name '*.cpp' -print0 2>/dev/null | xargs -0 -i perl -ne 'if (/\+\+\S*[cC]ursor\S*/ && !/[!=]=\s*DB_NULL_CURSOR/) {print "$ARGV:$.\n $_\n";}' {}
In the directory I'm running this, the find portion returns 5802 results.
Now, I understand xargs -i (or -n1) is going to have a performance impact, but with -i:
find . -name '*.cpp' -print0 2> /dev/null 0.33s user 1.12s system 0% cpu 3:12.57 total
xargs -0 -i perl -ne {} 4.12s user 32.80s system 16% cpu 3:42.22 total
And without:
find . -name '*.cpp' -print0 2> /dev/null 0.27s user 1.22s system 95% cpu 1.556 total
xargs -0 perl -ne 0.62s user 0.69s system 61% cpu 2.117 total
Minutes vs. a couple of seconds (the order of testing was confirmed not to matter). The actual perl results are identical apart from the line numbers, which are obviously incorrect in the second instance (with many files per invocation, $. keeps counting across files instead of resetting).
Behavior is identical in Cygwin/bash/perl5v26, and WSL Ubuntu 16.04/zsh/perl5v22. File system is NTFS in both cases. But...I'm kind of assuming the little one-liner I wrote must have some sort of bug in it and that stuff is irrelevant?
EDIT: It occurred to me that disabling sitecustomize.pl at startup with -f (an option I'd vaguely remembered seeing in perl --help) might help. It did not. Also, I'm aware that the performance impact of -i is going to be significant because perl has to compile the regex on every invocation. This still seems out of control.

With -i (or -n1), xargs invokes a new process for every argument it processes, so in your case it will be spinning up perl 5802 times, and doing this in series.
You could try running them in parallel with -P:
You might be using xargs to invoke a compute intensive command for
every line of input. Wouldn’t it be nice if xargs allowed you to take
advantage of the multiple cores in your machine? That’s what -P is
for. It allows xargs to invoke the specified command multiple times in
parallel. You might use this for example to run multiple ffmpeg
encodes in parallel. However I’m just going to show you yet another
contrived example.
Or, on the other hand, you could use sed, which is much lighter to spin up.
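For concreteness, here is a sketch of the -P suggestion (my own, not from the quoted text; it assumes GNU xargs, and -P4 is an arbitrary degree of parallelism):
find . -name '*.cpp' -print0 2>/dev/null |
    xargs -0 -i -P4 perl -ne 'if (/\+\+\S*[cC]ursor\S*/ && !/[!=]=\s*DB_NULL_CURSOR/) {print "$ARGV:$.\n $_\n";}' {}
Each perl still starts once per file and recompiles the regex; -P only overlaps that cost across cores rather than removing it.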

Okay, my fundamental misunderstanding was an assumption that the maximum command line length would be somewhere in the 2000-character range. So I was assuming a perl instance for every 20 files or so (each path being about 120 characters). This was incredibly incorrect.
getconf ARG_MAX shows you the actual acceptable length. In my case:
2097152
So, I'm looking at 1 perl instance vs. 5802 instances. The only perl solution I can think of would be to remove -n and implement the loop by hand, explicitly closing each file (which also resets $. so the line numbers come out right).
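A sketch of a single-process perl variant (my own, not from the post): keep -n but add the close ARGV if eof idiom from perldoc -f eof, which resets $. at the end of each file, so the line numbers stay correct with one perl for all 5802 files:
find . -name '*.cpp' -print0 2>/dev/null |
    xargs -0 perl -ne 'print "$ARGV:$.\n $_\n" if /\+\+\S*[cC]ursor\S*/ && !/[!=]=\s*DB_NULL_CURSOR/; close ARGV if eof;'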
Better solutions, I think, are awk:
find . -name '*.cpp' 2>/dev/null -print0 | xargs -0 awk '{if (/\+\+\S*[cC]ursor\S*/ && !/[!=]=\s*DB_NULL_CURSOR/) {print FILENAME ":" FNR " " $0}}'
or grep:
find . -name '*.cpp' 2>/dev/null -print0 | xargs -0 grep -nE '\+\+\S*[cC]ursor\S*' | grep -v '[!=]=\s*DB_NULL_CURSOR'
Both of which execute in the 2 or 3 second range.

Related

sed, xargs and stdbuf - how to get only first n matches of a pattern from a file

I have a file with patterns (1 line = 1 pattern) that I want to look for in a big text file - only one (or no) pattern will be found on each line of the infile. Once a match is found, I want to retrieve the characters immediately before the match. The first part is to acquire the patterns for sed:
cat patterns.txt | xargs -I '{}' sed -n 's/{}.*$//p' bigtext.txt
That works ok - the downside being that potentially I'll have hundreds of thousands of matches. I don't want/need all the matches - a fair representation of 1K hits would be enough. And here is where I struggle: I've read that in order to limit the number of hits of sed, I should use stdbuf (gstdbuf in my case) and pipe the whole thing through head. But I am not sure where to place the stdbuf command:
cat patterns.txt | xargs -I '{}' gstdbuf -oL -eL sed -n 's/{}.*$//p' bigtext.txt | head -n100
When I tried this, the process took as long as if it were running sed on the whole file and then taking the head of that output, whereas my wish is to stop searching after 100 or 1000 matches. Any ideas on the best way of accomplishing this?
Is the one-liner you have provided really what you wanted? Especially since you mention a fair sample. Because as it stands right now, it feeds patterns.txt into xargs... which will go ahead and invoke sed for each pattern individually, one after another, and the whole output of xargs is fed to head, which chops it off after n lines. In other words, your first pattern can already exhaust all the lines you wanted to see, even though the other patterns could have matched any number of times on lines occurring before the matches presented to you. A detailed example follows.
If I have patterns.txt of:
_Pat1
_Pat2
_Pat3
And bigtext.txt with:
1matchx_Pat1x
2matchx_Pat2x
2matchy_Pat2y
2matchz_Pat2z
3matchx_Pat3x
3matchy_Pat3y
3matchz_Pat3z
1matchy_Pat1y
1matchz_Pat1z
If I run your one-liner limited to five hits, I do not get this result (the first five matches for all three patterns as found in the file):
1matchx
2matchx
2matchy
2matchz
3matchx
But this (all 3 matches for _Pat1 plus 2 matches for _Pat2, after which I've run out of output lines):
1matchx
1matchy
1matchz
2matchx
2matchy
Now to your performance problem, which is partially related. I have to admit that I could not reproduce it. I've taken your example from the comment, blew the "big" file up to 1 GB in size by repeating the pattern, and ran your one-liner:
$ time { cat patterns.txt | xargs -I '{}' stdbuf -oL sed -n 's/{}.*$//p' bigtext.txt | head -5 ; }
1aaa
2aaabbb
3aaaccc
1aaa
2aaabbb
xargs: stdbuf: terminated by signal 13
real 0m0.012s
user 0m0.013s
sys 0m0.008s
Note I've dropped the -eL; stderr is usually unbuffered (which is what you usually want) and doesn't really play any role here. Also note I've run stdbuf without the "g" prefix, which tells me you're probably on a system where GNU tools are not the default... and that is probably the reason why you get different behavior. I'll try to explain what is going on and venture a few guesses... and conclude with a suggestion. Also note, I really did not need to use stdbuf (manipulate buffering) at all, or rather it had no appreciable impact on the result, but again, this could be platform-, tool- and scenario-specific.
Reading the pipeline from its end: head reads standard input as it is being piped in from xargs (and, by extension, from the sed (or stdbuf-wrapped sed) processes which xargs forks; they are all attached to its writing end) until the limit of lines to print has been reached, and then head terminates. Doing so "breaks" the pipe, and xargs and sed (or the stdbuf it was wrapped in) receive a SIGPIPE signal and by default terminate as well (you can see that in the output of my run: xargs: stdbuf: terminated by signal 13).
What stdbuf -oL does and why someone might have suggested it: when a process is no longer reading/writing a console (which is usually line buffered) but a pipe, its I/O is usually block buffered instead. stdbuf -oL changes that back to line buffering. Without it, the processes involved communicate in larger chunks, and it could take head longer to realize it is done and needs no further input, while sed keeps running to see if there are any further suitable matches. As mentioned, on my system (4K buffers) and with that (repeating-pattern) example, this made no real difference. Also note that while line buffering decreases the risk of not knowing we could be done, it does increase the overhead of communication between the processes.
So why would these mechanics not yield the same expected results for you? A couple of options come to mind:
Since you fork and run sed once per pattern, over the whole file each time, you could get a series of several runs without any hits. I'd guess this is actually the likely case.
Since you give sed a file to read from, you may have a different sed implementation that tries to read a lot more in before acting on the file content (mine reads 4K at a time). This is not a likely cause, but in theory you could also feed sed line by line to force smaller chunks and get that SIGPIPE sooner.
Now, assuming that sequential pattern-by-pattern matching is actually not desired, the summary of all of the above would be: process your patterns first into a single one, and then perform a single pass over the "big" file (optionally capping the output, of course). It might be worth switching from shell to something a bit more comfortable to use, or at least not keeping the one-liner format, which is likely to become confusing.
Untrue to my own recommendation, an awk script called like this prints the first 5 hits and quits:
awk -v patts="$(cat patterns.txt)" -v last=5 'BEGIN{gsub(/\n/, "|", patts); patts="(" patts ")"; cnt=1;} $0~patts{sub(patts ".*", ""); print; cnt++;} cnt>last{exit;}' bigtext.txt
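With the _Pat* sample patterns from earlier, the combined regex patts comes out as:
(_Pat1|_Pat2|_Pat3)
so a single awk pass can match any of the patterns and stop after the configured number of hits (last).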
You can give grep a file of patterns to match with -f file. You can also specify the number of matches to find before quitting with -m count.
So this command will get you the first 5 lines that match:
grep -f patterns.txt -m 5 bigtext.txt
Now, trimming from the match to the end of the line is a bit more difficult.
Assuming you use bash, we can build a regex from the file, like this:
while IFS='' read -r line || [[ -n "$line" ]]; do
subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
Then use this in a sed command. The resulting code becomes:
while IFS='' read -r line || [[ -n "$line" ]]; do
subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
grep -f patterns.txt -m 5 bigtext.txt | sed "$subRegex"
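For the _Pat* sample patterns shown earlier, subRegex would come out as (most recent pattern first, since each expression is prepended):
s/_Pat3.*//;s/_Pat2.*//;s/_Pat1.*//;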
The sed command is only running on the lines that have already matched from the grep, so it should be fairly performant.
Now if you call this a lot you could put it in a function
function findMatches() {
local matchCount=${1:-5} # default to 5 matches
local subRegex
while IFS='' read -r line || [[ -n "$line" ]]; do
subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
grep -f patterns.txt -m ${matchCount} bigtext.txt | sed "${subRegex}"
}
Then you can call it like this
findMatches 5
findMatches 100
Update
Based on the sample files you gave, this solution does produce the expected result:
1aaa
2aaabbb
3aaaccc
4aaa
5aaa
However, your comment mentions each pattern being 120 characters, each line of the big file being 250 characters, and a 10 GB file size.
You didn't mention how many patterns you might have, so I tested, and it seems that the sed command done inline falls apart somewhere before 50 patterns.
(Of course, if your samples are really how the data look, then you could trim each line based on non-AGCT characters rather than on the patterns file, which would be much quicker.)
But based on the original question, you can generate a sed script in a separate file from patterns.txt, like this:
sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
Then use this temp file with the sed command:
grep -f patterns.txt -m 5 bigtext.txt | sed -f temp.sed
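With the _Pat* sample patterns, temp.sed would contain:
s/_Pat1.*$//g
s/_Pat2.*$//g
s/_Pat3.*$//g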
The grep stops after finding X matches, and the sed trims those lines... The new function runs on my machine in a couple of seconds.
For testing, I created a 2 GB file of 250-character AGCT combos, and another file with 50+ patterns, 120 characters each, with a few of those patterns taken from random lines of the bigtext file.
function findMatches() {
sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
grep -f patterns.txt -m ${1:-5} bigtext.txt | sed -f temp.sed
}

Using xargs arguments twice

I need to check whether a local file is the same as a file on a remote host.
The file locations are as follows:
File1 at Local machine
./remotehostname/home/a/b/scripts/xyz.cpp
File2 at remote machine
remotehostname:/home/a/b/scripts/xyz.cpp
I intend to compare these 2 files, using the command
diff ./remotehostname/home/a/b/scripts/xyz.cpp remotehostname:/home/a/b/scripts/xyz.cpp
find . -type f | grep -v .svn | xargs -I % diff %
I need to change % so that the remote host version is added and the two files get compared.
I'm not sure how to apply sed to %. Or is there a better way to compare such files?
One way could be to save the list of files and then apply sed to that file, but I think there should be an even better way. Also, diff doesn't work on remote hosts; maybe I need to use the output of a dry-run rsync?
This can be done with xargs, but I prefer to use while read in bash.
xargs method
find . -type f | grep -v .svn | sed 's/.*/& remotehostname:&/' | xargs -n2 diff
The sed command duplicates the input and makes whatever modifications you need. xargs then passes the inputs to diff two at a time. This will not work if any filename contains spaces.
bash method
find . -type f | grep -v .svn | while read line; do
diff "$line" "remotehostname:$line"
done
The bash read command reads a line from stdin, places it in the named variable, $line, and returns true. You can then put whatever you like inside the loop, so you get total freedom to rewrite the filename however you need. When the input runs out, read returns false and the loop exits.
Note that piping things into loops has some interesting side effects that are not relevant here, but might bite you one day.
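If you need the remote side of the comparison to actually work (plain diff cannot open a remotehostname:/path argument), here is a sketch using ssh inside the same loop; it assumes the local ./remotehostname prefix mirrors the remote filesystem root and that ssh access is already set up:
find . -type f | grep -v .svn | while read -r line; do
    remote="${line#./remotehostname}"                      # ./remotehostname/home/... -> /home/...
    ssh -n remotehostname cat "$remote" | diff "$line" -   # -n stops ssh from eating the file list on stdin
done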
If you are interested in the actual difference (and not just whether they differ - which rsync is brilliant at telling you), then you can do this using GNU Parallel:
find . -type f | grep -v .svn |
parallel diff {} '<(ssh {= s:./::;s:/.*:: =} cat {= s:([^/]+/){2,2}::;$_=::shell_quote_scalar($_) =})'
s:./::;s:/.*:: = hostname from path
s:([^/]+/){2,2}:: = rest of path
::shell_quote_scalar = \-quote special chars as needed by the shell
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Perl: edit file, not just output to shell

I found a little one-liner of Perl code that will change the serial in my zone files on my BIND server.
However, it won't change the actual file; it just gives me the output directly in the shell.
This is what I run:
find . -type f -print0 | xargs -0 perl -e "while(<>){ s/\d+(\s*;\s*[sS]erial)/2015050466\1/; print; }"
This gives me the correct output in the shell, and if I remove the print; at the end of the perl line, nothing happens. I want it to actually change the files to match the output I got.
I'm a total noob when it comes to Perl so this might be a simple fix so any answer would be appreciated.
I am assuming you want to replace the string inside the files found by find.
The example command below will change, in place (-i), every "foo" to "bar" in all *.txt files under the current directory.
find . -type f -name '*.txt' -print0 | xargs -0 perl -p -i -e 's/foo/bar/g;'
And for your question, you should be able to get it with this command:
find . -type f -print0 | xargs -0 perl -p -i -e 's/\d+(\s*;\s*[sS]erial)/2015050466\1/;'
Note: It is a good habit to always use single quotes rather than double quotes, because inside double quotes a \, $, etc. may be processed by the shell before being passed to Perl. See the Bash manual.
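If you want a safety net while rewriting zone files, -i also accepts a backup suffix (standard perl behaviour; .bak here is just an example name):
find . -type f -print0 | xargs -0 perl -p -i.bak -e 's/\d+(\s*;\s*[sS]erial)/2015050466$1/;'
Each processed file is then kept alongside the edited one with a .bak extension.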

Change multiple files

The following command is correctly changing the contents of 2 files.
sed -i 's/abc/xyz/g' xaa1 xab1
But what I need to do is to change several such files dynamically and I do not know the file names. I want to write a command that will read all the files from current directory starting with xa* and sed should change the file contents.
I'm surprised nobody has mentioned the -exec argument to find, which is intended for this type of use-case, although it will start a process for each matching file name:
find . -type f -name 'xa*' -exec sed -i 's/asd/dsg/g' {} \;
Alternatively, one could use xargs, which will invoke fewer processes:
find . -type f -name 'xa*' | xargs sed -i 's/asd/dsg/g'
Or more simply use the + exec variant instead of ; in find to allow find to provide more than one file per subprocess call:
find . -type f -name 'xa*' -exec sed -i 's/asd/dsg/g' {} +
Better yet:
for i in xa*; do
sed -i 's/asd/dfg/g' "$i"
done
because nobody knows how many files there are, and it's easy to break command-line limits.
Here's what happens when there are too many files:
# grep -c aaa *
-bash: /bin/grep: Argument list too long
# for i in *; do grep -c aaa $i; done
0
... (output skipped)
#
You could use grep and sed together. This allows you to search subdirectories recursively.
Linux: grep -r -l <old> * | xargs sed -i 's/<old>/<new>/g'
OS X: grep -r -l <old> * | xargs sed -i '' 's/<old>/<new>/g'
For grep:
-r recursively searches subdirectories
-l prints file names that contain matches
For sed:
-i extension (Note: An argument needs to be provided on OS X)
Those commands won't work in the default sed that comes with Mac OS X.
From man 1 sed:
-i extension
Edit files in-place, saving backups with the specified
extension. If a zero-length extension is given, no backup
will be saved. It is not recommended to give a zero-length
extension when in-place editing files, as you risk corruption
or partial content in situations where disk space is exhausted, etc.
Tried
sed -i '.bak' 's/old/new/g' logfile*
and
for i in logfile*; do sed -i '.bak' 's/old/new/g' $i; done
Both work fine.
@PaulR posted this as a comment, but people should view it as an answer (and this answer works best for my needs):
sed -i 's/abc/xyz/g' xa*
This will work for a moderate number of files, probably on the order of tens, but probably not on the order of millions.
Another more versatile way is to use find:
sed -i 's/asd/dsg/g' $(find . -type f -name 'xa*')
I'm using find for similar task. It is quite simple: you have to pass it as an argument for sed like this:
sed -i 's/EXPRESSION/REPLACEMENT/g' `find -name "FILE.REGEX"`
This way you don't have to write complex loops, and it is simple to see which files you are going to change: just run find before you run sed.
You can make 'xxxx' the text you search for, and it will be replaced with 'yyyy':
grep -Rn 'xxxx' /path | awk -F: '{print $1}' | xargs sed -i 's/xxxx/yyyy/'
There are some good answers above. I thought I'd throw in one more that is succinct and parallelizable, using GNU Parallel, which I often prefer to xargs:
parallel sed -i 's/abc/xyz/g' {} ::: xa*
Combine this with the -j N option to run N jobs in parallel.
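For example (a sketch; 4 is an arbitrary job count):
parallel -j4 sed -i 's/abc/xyz/g' {} ::: xa*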
If you are able to run a script, here is what I did for a similar situation:
Using a dictionary/hashmap (associative array) and variables for the sed command, we can loop through the array to replace several strings. Including a wildcard in name_pattern allows replacing in place in files matching a pattern (this could be something like name_pattern='File*.txt') in a specific directory (source_dir).
All the changes are written to the logfile in destin_dir.
#!/bin/bash
source_dir=source_path
destin_dir=destin_path
logfile='sedOutput.txt'
name_pattern='File.txt'
echo "--Begin $(date)--" | tee -a $destin_dir/$logfile
echo "Source_DIR=$source_dir destin_DIR=$destin_dir "
declare -A pairs=(
['WHAT1']='FOR1'
['OTHER_string_to replace']='string replaced'
)
for i in "${!pairs[#]}"; do
j=${pairs[$i]}
echo "[$i]=$j"
replace_what=$i
replace_for=$j
echo " "
echo "Replace: $replace_what for: $replace_for"
find $source_dir -name $name_pattern | xargs sed -i "s/$replace_what/$replace_for/g"
find $source_dir -name $name_pattern | xargs -I{} grep -n "$replace_for" {} /dev/null | tee -a $destin_dir/$logfile
done
echo " "
echo "----End $(date)---" | tee -a $destin_dir/$logfile
First, the pairs array is declared; each pair is a replacement mapping, so WHAT1 will be replaced with FOR1, and OTHER_string_to replace will be replaced with string replaced in the file File.txt. In the loop the array is read, the first member of the pair is retrieved as replace_what=$i and the second as replace_for=$j. The find command searches the directory for the filename (which may contain a wildcard) and the sed -i command performs the replacement in the same file(s). Finally, I added a grep redirected to the logfile to log the changes made in the file(s).
This worked for me in GNU Bash 4.3 and sed 4.2.2, and is based upon VasyaNovikov's answer for Loop over tuples in bash.
The Silver Searcher Solution
I'm adding another option for those people who don't know about the amazing tool called The Silver Searcher (command line tool is ag).
Note: You can use grep and other tools to do the same thing here, but The Silver Searcher is fantastic :)
TLDR
ag -l 'abc' | xargs sed -i 's/abc/xyz/g'
Install The Silver Searcher
sudo apt install silversearcher-ag # Debian / Ubuntu
sudo pacman -S the_silver_searcher # Arch / EndeavourOS
sudo yum install epel-release the_silver_searcher # RHEL / CentOS
Demo Files
Paste the following into your terminal to create some demonstration files:
mkdir /tmp/food
cd /tmp/food
content="Everybody loves to abc this food!"
echo "$content" > ./milk
echo "$content" > ./bread
mkdir ./fastfood
echo "$content" > ./fastfood/pizza
echo "$content" > ./fastfood/burger
mkdir ./fruit
echo "$content" > ./fruit/apple
echo "$content" > ./fruit/apricot
Using 'ag'
The following ag command will recursively find all the files that contain the string 'abc'. It ignores the .git directory, .gitignore files, and other ignore files:
$ ag 'abc'
milk
1:Everybody loves to abc this food!
bread
1:Everybody loves to abc this food!
fastfood/burger
1:Everybody loves to abc this food!
fastfood/pizza
1:Everybody loves to abc this food!
fruit/apple
1:Everybody loves to abc this food!
fruit/apricot
1:Everybody loves to abc this food!
To just list the files that contain the string 'abc', use the -l switch:
$ ag -l 'abc'
bread
fastfood/burger
fastfood/pizza
fruit/apricot
milk
fruit/apple
Changing Multiple Files
Finally, using xargs and sed, we can replace the 'abc' string with another string:
ag -l 'abc' | xargs sed -i 's/abc/eat/g'
In the above command, ag lists all the files that contain the string 'abc'. xargs then takes those file names and passes them as arguments to the sed command, which does the in-place replacement.
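If you'd rather stick with stock tools, a rough grep-only equivalent of the same pipeline (my own variant, assuming GNU grep and GNU sed) is:
grep -rl 'abc' . | xargs sed -i 's/abc/eat/g'
Here grep -rl recursively lists the files containing 'abc' and xargs hands them to sed; unlike ag, it will not honour .gitignore.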

Why does "find . -name *.txt | xargs du -hc" give multiple totals?

I have a large set of directories for which I'm trying to calculate the sum total size of several hundred .txt files. I tried this, which mostly works:
find . -name *.txt | xargs du -hc
But instead of giving me one total at the end, I get several. My guess is that the pipe will only pass on so many lines of find's output at a time, and du just operates on each batch as it comes. Is there a way around this?
How about using the --files0-from option to du? You'd have to generate the null-terminated file list appropriately:
find . -name "*txt" -exec echo -n -e {}"\0" \; | du -hc --files0-from=-
works correctly on my system.
find . -iname '*.txt' -print0 | du --files0-from=-
and if you want to have several different extensions to search for it's best to do:
find . -type f -print0 | grep -azZEi '\.(te?xt|rtf|docx?|wps)$' | du --files0-from=-
The xargs program breaks things up into batches to account for the limit on the maximum length of a Unix command line. It's still more efficient than running your subcommand once per input but, for a long list of inputs, it will run the command enough times that each "run" stays short enough not to cause issues.
Because of this, you're likely seeing one total per "batch" that xargs needs to run.
Because you may find it useful/interesting, the man page can be found online here: http://unixhelp.ed.ac.uk/CGI/man-cgi?xargs
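One workaround that keeps xargs (my own sketch, not from this answer): drop -h so the per-batch totals are in du's default 1K blocks, then sum the "total" lines with awk:
find . -name '*.txt' -print0 | xargs -0 du -c | awk '$2 == "total" {sum += $1} END {print sum "K"}'
(Human-readable -h output can't be summed directly, which is why the sketch sums plain block counts instead.)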
One other thing to note (and this may be a typo in your post or my misunderstanding) is that you have the *.txt unescaped/unquoted. I.e., you have
find . -name *.txt | xargs du -hc
where you probably want
find . -name \*.txt | xargs du -hc
The difference being that the shell may expand the * into the list of matching filenames in the current directory... rather than passing the * to find, which will use it as a pattern.
Another simple solution:
find . -name *.txt -print0 | xargs -0 du -hc
One alternate solution is to use bash for loop:
for i in `find . -name '*.txt'`; do du -hc $i | grep -v 'total'; done
This is good for when you need more control of what happens in the loop.
xargs busts its input into reasonable-sized chunks - what you're seeing are totals for each of those chunks. Check the man page for xargs on ways to configure its handling of input.
One alternate solution is to use awk:
find . -name "*.txt" -exec ls -lt {} \; | awk -F " " 'BEGIN { sum=0 } { sum+=$5 } END { print sum }'