What's the correct usage of sed with parallel --jobs option?

parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a file delimited by space, where the first column is pattern and the second column is replacement.
The problem is that after I ran the command, not all patterns were replaced in file. Then I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it will work as expected (but much slower).
Is there a necessary parameter missing from my command?

Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU bound, but this one is I/O bound; you are simply creating congestion by having several processes fight over access to the bytes on disk, and in this case also over write access back to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
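A quick demonstration (not from the original answer) of why the order matters, using the s/a/b/ and s/b/c/ example above:
$ echo a | sed -e 's/a/b/' -e 's/b/c/'
c
$ echo a | sed -e 's/b/c/' -e 's/a/b/'
b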

Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
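If you do need that, here is a sketch that generates bigsed with the longest patterns first; it assumes the patterns and replacements contain no spaces, slashes, or % characters:
awk '{ print length($1), $0 }' input |
sort -rn |
sed 's%^[0-9]* \([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' > bigsed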
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or set $TMPDIR to a directory that does.
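If /tmp is too small, you can point GNU parallel at a roomier temporary directory for the run; /mnt/scratch below is a hypothetical path:
TMPDIR=/mnt/scratch parallel --pipepart -a file -k sed -f bigsed > replaced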

Related

In Shell, how to run a command on two files in a directory at once?

I know how to run a subscript in shell on all files of a similar type. I do:
for filePath in path/*.extension; do
    script.py "$filePath"
done
Currently I have about nine pairs of files with the same extension and very similar base names (think xxx_R1 and xxx_R2). I have a script I want to run that takes in pairs of files. How can I run a script on all those pairs using shell?
I would list the files matching one pattern, strip off the suffix to form a list of the "base" names, then re-append both suffixes. Something like this:
for base in $(ls *_R1 | sed 's/_R1$//')
do
    f1=${base}_R1
    f2=${base}_R2
    script2.py $f1 $f2
done
Alternatively, you could accomplish the same thing by letting sed do the selection as well as the stripping:
for base in $(ls | sed -n '/_R1$/s///p')
...
Both of these are somewhat simplistic, and can fall down if you have files with "funny" names, such as embedded spaces. If that's a possibility for you, you can use some more sophisticated (albeit less obvious) techniques to get around them. Several are mentioned in links @tripleee has posted. An incremental improvement I like to use, to avoid the improper word splitting that a for ... in loop can do, is to use while and read instead:
ls | sed -n '/_R1$/s///p' | while read base
do
    f1=${base}_R1
    f2=${base}_R2
    script2.py "$f1" "$f2"
done
This still isn't perfect, and will fall down if you happen to have a file with a newline in its name (although personally, I believe that if you have a file with a newline in its name, you deserve whatever miseries befall you :-) ).
Again, if you want something perfect, see the links posted elsewhere.
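For completeness, here is a sketch of one such sturdier approach: loop over the *_R1 files directly with a glob (no ls, no word splitting) and derive each partner name with parameter expansion; the existence check is an assumption about how to handle unpaired files:
for f1 in *_R1; do
    f2=${f1%_R1}_R2            # derive the partner name via parameter expansion
    [ -e "$f2" ] || continue   # skip any _R1 file without a matching _R2
    script2.py "$f1" "$f2"
done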

Is silver-searcher able to obtain PATTERN from file?

There are 84 patterns that need to be checked; I store them in a file named pattern.txt.
Is silver-searcher (also known as ag) able to obtain these patterns from pattern.txt?
grep has a -f option to read patterns from a file, but the man page of silver-searcher mentions nothing like it.
No, there isn't a similar -f option in ag. The simple approach is to use a loop to pass the patterns to ag; for instance you could use a while loop to read the patterns like this:
while read pattern; do ag "$pattern" -G '.*.txt' ; done < pattern.txt
I suggest the faster approach of using GNU parallel with ag. Parallel and ag work very well together:
parallel 'ag --filename --parallel --color "{}"' < pattern.txt
Here, I'm passing each pattern to parallel which in turn spawns a number of ag processes which search for their own pattern matches. Parallel is somewhat smart about how many processes to start, but you can tweak it to your heart's content (https://www.gnu.org/software/parallel/man.html). In short, you'll rip through your 84 patterns far faster with parallelization.
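For reference, GNU parallel can also read its arguments directly from a file with the :::: operator, which avoids the redirection entirely; a minimal sketch with the same pattern file:
parallel ag --color {} :::: pattern.txt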
Joining the lines in the pattern file to create a regex group:
ag "($(paste -sd "|" pattern.txt))" .

Best scripting tool to profile a log file

I have extracted log files from servers based on my date and time requirement, and after extraction each file has hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a, b, c, d, e, f, g, etc.
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script that profiles this log file in a while loop, grepping for the different parameters a, b, c, d, e: if a request contains the respective parameter, the script records its value (or TRUE or FALSE).
while read line ; do
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'>> output.txt
echo -n -e "\t" >> output.txt
echo " " >> output.txt
done < queries.csv
My question is: my Cygwin takes a lot of time (an hour or so) to run this on a log file containing 70k-80k requests. Is there a better way to write this script so that it executes faster? I'm okay with Perl too, but my concern is that the script stays flexible enough to extract the parameters.
Like @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains already by switching from
while read line; do
something >>file
something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
something
something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As also suggested by @reinerpost, not hard-coding the output file would also make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
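As a rough illustration of that hold-space shuffle, here is a sketch covering just two of the three fields (GNU sed syntax; the field patterns are copied from the original loop):
sed '
  # keep a pristine copy of the input line in the hold space
  h
  # pattern space now holds field 1
  s/^.*XYZ:/ /; s/ms.*$//
  # swap: pattern space = original line, hold space = field 1
  x
  # pattern space now holds field 2
  s/^.*XYZ:/ /; s/ABC.*$//
  # swap back and append: pattern space = field 1, newline, field 2
  x; G
  # join the two fields with a tab (GNU sed)
  s/\n/\t/
' queries.csv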
Because your earlier script was somewhat more involved, I'm suggesting to use Awk instead. Here is a crude skeleton for doing things you seem to be wanting to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
q = $0; sub("^.*\\?q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
# ....
# Demonstration of how to count something
n = split($0, _, "&"); ampersand_count = n-1;
# ...
# Done: Now print
print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big savings here is avoiding to spawn a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.
Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed in every statement, which is extremely inefficient.
You hardcode the name of the output file in your script. This is bad practice. Your script will be much more versatile if it just writes its output to stdout; leave it to the caller to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make the script more concise and more readable (once you know the language); however, if you don't know awk yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language: it makes you far more flexible, can be even more readable (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and will probably be a lot faster, because perl compiles scripts before running them. If you'd rather invest your effort in a more popular language than Perl, try Python.

how many instances did a command affect?

Is there a way, when using sed from the CLI, to find out how many lines were affected, or better yet how many instances were affected by a command that might have multiple effects per line when the global flag is used? For me, that would mean how many substitutions were made.
I guess one could output to a new file and then run a diff on the two files afterward, but my need to know how many instances a command affects is not great enough to justify that. I just wondered if there might be a feature native to sed that can be employed.
As far as I know, sed has no native feature to manipulate variables (e.g. to increase an internal counter); that is definitely one of the features awk brought that sed lacks. So my advice would be to switch to awk, where you can easily use a script such as:
BEGIN { counter = 0 }
/mypattern/ { do-whatever-you-want; counter++ }
END { print counter }
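A concrete sketch of that idea: awk's gsub() returns the number of substitutions it made, so per-instance counting comes for free. Here old, new, and newfile are placeholders for your own pattern, replacement, and output file:
awk '{ total += gsub(/old/, "new"); print > "newfile" }
     END { print total, "substitutions made" }' file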
You ask for a sed solution. Here is a pure sed approach that does some of what you want:
sed 's/old/new/;t change;b;:change w changes'
After executing this, the changed lines, if any, are written to the file changes.
How it works:
s/old/new/;
Replace this with whatever substitution you want to do.
t change;
This tells sed to jump to the label change if the preceding s command made any changes.
b;
If the preceding jump did not happen, then this b command is executed which ends the processing of this line.
:change w changes
This tells sed to write the current line, as changed by your s command, to the file changes.
The Next Step
The next step here would be to count the changes. To this end, sed can do arithmetic but it is not for the faint of heart.
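If a count of changed lines is enough (as opposed to a count of individual substitutions), the changes file written by the script above already provides it:
wc -l < changes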
OSX
As I recall, the version of sed on Mac OSX does not support chaining commands together with semicolons. Instead, try:
sed -e 's/old/new/' -e 't change' -e b -e ':change w changes'

How can I remove all non-word characters except the newline?

I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl to not include newlines in the \W? Or is there some other way?
This removes characters that don't match \w or \n:
cat file | perl -C -pe 's/[^\w\n]//g'
@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, thus it loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:
$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters
In Perl, I'd just add the -l switch, which re-adds the newline by appending it to the end of every print():
perl -ple 's/\W//g' file
Notice that you don't need the cat.
The previous response isn't echoing the "ó" character. At least in my case.
sed 's/\W//g' file
Best practice for shell scripting is to use the tr program rather than sed when you are only deleting or replacing single characters, because it is faster; use sed when you need to work with longer strings.
tr -d '[:blank:][:punct:]' < file
When run with time I get:
real 0m0.003s
user 0m0.000s
sys 0m0.004s
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
real 0m0.003s
user 0m0.004s
sys 0m0.004s
While not a "huge" difference, you'll notice the difference when running against larger data sets. Also please notice how I didn't pipe cat's output into tr, instead using I/O redirection (one less process to spawn).