Bash pipe vs input redirect vs process substitution performance - redirect

While processing some text files on one of my systems I noticed a very strange thing: somehow input redirection and process substitution are much slower than piping. Here is an example with a 1000-line, 137 KB text file:
$ time cat 1000.log | while read -r line; do echo -n ''; done
real 0m0.159s
user 0m0.040s
sys 0m0.133s
$ time while read -r line; do echo -n ''; done < 1000.log
real 2m20.143s
user 0m55.205s
sys 1m44.233s
$ time while read -r line; do echo -n ''; done < <(cat 1000.log)
real 2m10.385s
user 0m49.853s
sys 1m38.208s
This is a crazy ~88000% difference!
Another weird example:
$ time for i in {1..100}; do echo $i; done
1
2
3
...
99
100
real 0m6.773s
user 0m5.372s
sys 0m2.424s
On other hosts it takes milliseconds...
What might be wrong with that particular system (SUSE Linux Enterprise Server 12 SP2)?

Related

How to execute this command in a systemd service file?

Ok, so I have this command that turns off my touchscreen. It works when I execute it in a root shell.
So this works:
sudo su
/usr/bin/echo $(ls /sys/bus/hid/drivers/hid-multitouch | awk NR==1'{print $1}') > /sys/bus/hid/drivers/hid-multitouch/unbind
And then my touchscreen stops working, which is the result that I wanted.
Now I want to make a touchscreen.service file to execute this on every boot. So in the service file I include:
ExecStart=/usr/bin/echo $(ls /sys/bus/hid/drivers/hid-multitouch | awk NR==1'{print $1}') > /sys/bus/hid/drivers/hid-multitouch/unbind
However it isn't working, nor throwing any errors that I've been able to catch.
I do know from earlier fiddling with .service files that I might actually need to use /usr/bin/sh -c, so I have also tried:
ExecStart=/usr/bin/sh -c "/usr/bin/echo $(ls /sys/bus/hid/drivers/hid-multitouch | awk NR==1'{print $1}') > /sys/bus/hid/drivers/hid-multitouch/unbind"
Yet this also doesn't work, maybe because of the awk NR==1'{print $1}' part? I have also tried replacing it with awk NR==1'\''{print $1}'\'' but again it fails to work.
Does anyone have any ideas on how to get the command that is working in my root cli environment to also work as a systemd service?
To start with, the syntax of the awk command is just wrong: the quotes are incorrectly placed. The NR == 1 part belongs inside the awk script, where it selects the first line (record) of the input, i.e.
awk NR==1'{print $1}'
# ^^^^^^^ should be within quotes
awk 'NR == 1 { print $1 }'
Your sequence of echo, ls and the command substitution $(..) doesn't look right. You are effectively echo-ing the literal string /sys/bus/hid/drivers/hid-multitouch (if ls finds the file at that path) over to the pipe, and awk just writes that to the /sys/bus/hid/drivers/hid-multitouch/unbind file, which might not be your desired action. You just need to run the command on the file directly, as in
awk 'NR == 1 { print $1 }' /sys/bus/hid/drivers/hid-multitouch > /sys/bus/hid/drivers/hid-multitouch/unbind
Now that the awk command is fixed, you have two options to run the above command as part of systemd: either put your command in a script, or run the command directly. For putting it in a script, refer to the Unix.SE answer Where do I put scripts executed by systemd units?. As for running the command directly in ExecStart, aside from using /bin/sh, also use the path /bin/awk.
So, putting it together and using /bin/ over /usr/bin, you can do the below. This command wraps the awk script in double quotes (".."), so the $1 needs to be escaped:
ExecStart=/bin/sh -c '/bin/awk "NR == 1 { print \$1 }" /sys/bus/hid/drivers/hid-multitouch > /sys/bus/hid/drivers/hid-multitouch/unbind'
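For context, here is a minimal sketch of a complete unit file around that ExecStart line; the unit name, Type=oneshot and the install target are assumptions on my part rather than anything taken from your setup:
# /etc/systemd/system/touchscreen.service  (hypothetical location and name)
[Unit]
Description=Unbind the first hid-multitouch device (touchscreen)

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/bin/awk "NR == 1 { print \$1 }" /sys/bus/hid/drivers/hid-multitouch > /sys/bus/hid/drivers/hid-multitouch/unbind'

[Install]
WantedBy=multi-user.target
After creating it, systemctl daemon-reload followed by systemctl enable --now touchscreen.service should activate it now and on subsequent boots.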

sed, xargs and stdbuf - how to get only first n matches of a pattern from a file

I have a file with patterns (1 line = 1 pattern) that I want to look for in a big text file; only one pattern (or none) will be found on each line of the input file. Once a match is found, I want to retrieve the characters immediately before the match. The first part is to feed the patterns to sed:
cat patterns.txt | xargs -I '{}' sed -n 's/{}.*$//p' bigtext.txt
That works OK; the downside is that I could potentially get hundreds of thousands of matches. I don't want/need all of them; a fair representation of 1K hits would be enough. And here is where I struggle: I've read that in order to limit the number of hits from sed, I should use stdbuf (gstdbuf in my case) and pipe the whole thing through head. But I am not sure where to place the stdbuf command:
cat patterns.txt | xargs -I '{}' gstdbuf -oL -eL sed -n 's/{}.*$//p' bigtext.txt | head -n100
When I tried this, the process takes as long as if it were running sed on the whole file and then taking the head of that output, whereas my wish is to stop searching after 100 or 1000 matches. Any ideas on the best way of accomplishing this?
Is the one-liner you have provided really what you wanted, especially since you mention wanting a fair sample? As it stands right now, it feeds patterns.txt into xargs, which goes ahead and invokes sed for each pattern individually, one after another, and the whole output of xargs is fed to head, which chops it off after n lines. In other words, your first pattern can already exhaust all the lines you wanted to see, even though the other patterns could have matched any number of times on lines occurring before the matches presented to you. A detailed example follows.
If I have patterns.txt of:
_Pat1
_Pat2
_Pat3
And bigtext.txt with:
1matchx_Pat1x
2matchx_Pat2x
2matchy_Pat2y
2matchz_Pat2z
3matchx_Pat3x
3matchy_Pat3y
3matchz_Pat3z
1matchy_Pat1y
1matchz_Pat1z
And if I run your one-liner limited to five hits, I do not get this result (the first five matches for all three patterns as they occur in the file):
1matchx
2matchx
2matchy
2matchz
3matchx
But this (all 3 matches for _Pat1 plus 2 matches for _Pat2, after which I've run out of output lines):
1matchx
1matchy
1matchz
2matchx
2matchy
Now to your performance problem, which is partially related. I have to admit that I could not reproduce it. I took your example from the comment, blew the "big" file up to 1 GB in size by repeating the pattern, and ran your one-liner:
$ time { cat patterns.txt | xargs -I '{}' stdbuf -oL sed -n 's/{}.*$//p' bigtext.txt | head -5 ; }
1aaa
2aaabbb
3aaaccc
1aaa
2aaabbb
xargs: stdbuf: terminated by signal 13
real 0m0.012s
user 0m0.013s
sys 0m0.008s
Note I've dropped the -eL; stderr is usually unbuffered (which is what you usually want) and doesn't really play any role here. Also note I ran stdbuf without the "g" prefix, which tells me you're probably on a system where GNU tools are not the default... and that is probably the reason why you get different behavior. I'll try to explain what is going on, venture a few guesses, and conclude with a suggestion. Also note that I really did not need to use stdbuf (manipulate buffering) at all, or rather it had no appreciable impact on the result; but again, this could be platform and tool (as well as scenario) specific.
Reading the pipeline from its end: head reads standard input as it is piped in from xargs (and, by extension, from the sed (or stdbuf-wrapped sed) processes that xargs forks, which are both attached to the pipe's writing end) until its limit of lines to print has been reached, and then head terminates. Doing so "breaks" the pipe, and xargs and sed (or the stdbuf it was wrapped in) receive a SIGPIPE signal and by default terminate as well (you can see that in the output of my run: xargs: stdbuf: terminated by signal 13).
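As an aside, the SIGPIPE mechanics are easy to see in isolation with stand-in commands (yes and head here have nothing to do with your pipeline; this is just a sketch):
$ (yes; echo "writer exit status: $?" >&2) | head -n 1
y
writer exit status: 141
141 is 128 + 13, i.e. the writer (yes) was killed by signal 13 (SIGPIPE) as soon as head closed the reading end of the pipe.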
What does stdbuf -oL do, and why might someone have suggested it? When a process is no longer reading/writing a console, which would usually be line buffered, but is talking to pipes instead, we usually get fully buffered I/O. stdbuf -oL changes that back to line buffered. Without it, the processes involved communicate in larger chunks, and it can take head longer to realize it is done and needs no further input, while sed keeps running to see if there are any further suitable matches. As mentioned, on my system (4K buffers) and with that (repeating pattern) example, this made no real difference. Also note that while line buffering decreases the risk of not realizing we could be done, it does increase the overhead involved in communication between the processes.
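To see what line buffering changes in practice, here is a sketch with GNU sed; the exact behavior may differ with other sed implementations or buffer sizes:
$ { echo one; sleep 5; echo two; } | sed 's/o/0/' | cat
# nothing shows up for ~5 seconds, then "0ne" and "tw0" appear together
$ { echo one; sleep 5; echo two; } | stdbuf -oL sed 's/o/0/' | cat
# "0ne" appears immediately, "tw0" follows after the sleep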
So why would these mechanics not yield the expected results for you? A couple of options come to mind:
Since you fork and run sed once per pattern, over the whole file each time, it could happen that you get a series of several runs without any hits. I'd guess this is actually the likely case.
Since you give sed a file to read from, you may have a different implementation of sed that tries to read a lot more in before acting on the file content (mine reads 4K at a time). This is not a likely cause, but in theory you could also feed sed line by line to force smaller chunks and get that SIGPIPE sooner.
Now, assuming that sequential pattern-by-pattern matching is actually not what you want, the summary of all of the above would be: process your patterns first into a single one, and then perform a single pass over the "big" file (optionally capping the output, of course). It might also be worth switching from shell to something a bit more comfortable to use, or at least not keeping the one-liner format, which is likely to get confusing.
Not staying true to my own recommendation, an awk script called like this prints the first 5 hits and quits:
awk -v patts="$(cat patterns.txt)" -v last=5 '
    BEGIN { gsub(/\n/, "|", patts); patts = "(" patts ")" }
    $0 ~ patts { sub(patts ".*", ""); print; if (++cnt >= last) exit }
' bigtext.txt
You can give grep a file of patterns to match with -f file. You can also specify the number of matches to find before quitting with -m count.
So this command will get you the first 5 lines that match:
grep -f patterns.txt -m 5 bigtext.txt
Now, trimming from the match to the end of the line is a bit more difficult.
Assuming you use bash, we can build a regex from the file like this:
while IFS='' read -r line || [[ -n "$line" ]]; do
subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
Then use this in a sed command. The resulting code becomes:
while IFS='' read -r line || [[ -n "$line" ]]; do
subRegex="s/$line.*//;"${subRegex}
done < patterns.txt
grep -f patterns.txt -m 5 bigtext.txt | sed "$subRegex"
The sed command is only running on the lines that have already matched from the grep, so it should be fairly performant.
Now, if you call this a lot, you could put it in a function:
function findMatches() {
    local matchCount=${1:-5} # default to 5 matches
    local subRegex
    while IFS='' read -r line || [[ -n "$line" ]]; do
        subRegex="s/$line.*//;"${subRegex}
    done < patterns.txt
    grep -f patterns.txt -m ${matchCount} bigtext.txt | sed "${subRegex}"
}
Then you can call it like this:
findMatches 5
findMatches 100
Update
Based on the sample files you gave, this solution does produce the expected result: 1aaa 2aaabbb 3aaaccc 4aaa 5aaa
However, your comment mentions that each pattern is 120 characters long, each line of the big file is 250 characters, and the file is 10 GB in size.
You didn't mention how many patterns you might have, so I tested, and it seems that the inline sed command falls apart somewhere before 50 patterns.
(Of course, if your samples are really how the data look, then you could trim each line based on non-AGCT characters rather than on the patterns file, which would be much quicker.)
But to stick with the original question: you can generate a sed script in a separate file based on patterns.txt, like this:
sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
Then use this temp file with the sed command:
grep -f patterns.txt -m 5 bigtext.txt | sed -f temp.sed
The grep stops after finding X matches, and the sed trims those lines... The new function runs on my machine in a couple of seconds.
For testing, I created a 2 GB file of 250-character AGCT combos, and another file with 50+ patterns of 120 characters each, with a few of those patterns taken from random lines of the bigtext file.
function findMatches() {
    sed -e "s/^/s\//g;s/$/.*\$\/\/g/g;" patterns.txt > temp.sed
    grep -f patterns.txt -m ${1:-5} bigtext.txt | sed -f temp.sed
}

Perl processes disappear after some days

I have a perl script (test.pl).
It works in a recurring manner.
The purpose of the script is to send emails from a DB.
The following is the code in test.pl:
sub send_mail{
    $db->connect();
    # Some DB operations #
    # Send mail #
    $db->disconnect();
    sleep(5);
    send_mail();
}
send_mail();
I am executing 5 instances of this script, like below:
perl test.pl >> /var/www/html/emailerrorlog/error1.log 2>&1 &
perl test.pl >> /var/www/html/emailerrorlog/error2.log 2>&1 &
perl test.pl >> /var/www/html/emailerrorlog/error3.log 2>&1 &
perl test.pl >> /var/www/html/emailerrorlog/error4.log 2>&1 &
perl test.pl >> /var/www/html/emailerrorlog/error5.log 2>&1 &
If I execute the command ps -ef | grep perl | grep -v grep,
I can see 5 instances of the above-mentioned perl script.
The script works perfectly for some days,
but after some days the perl processes start to disappear one by one.
Eventually all of the processes have disappeared.
Now, if I execute the command ps -ef | grep perl | grep -v grep, I can't see any process,
and I can't see any error log in the log files.
So, what might be causing the perl processes to disappear?
How can I debug it?
Where can I see the perl error log?
The same issue occurs on CentOS and Red Hat Linux.
Does anyone have an idea?
I'm not 100% sure that this is the problem, but it would probably help if you avoided recursion in a permanently running process... It slowly increases stack use and will eventually kill the process when the stack size limit is reached.
Try something like this instead:
sub send_mail{
    $db->connect();
    # Some DB operations #
    # Send mail #
    $db->disconnect();
}

while (1) {
    send_mail();
    sleep(5);
}
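As for how to debug it: one rough way to check whether memory use really is creeping up before the processes vanish is to watch their resident set size over time (a sketch, assuming a Linux host with the workers started as shown above):
# print PID, memory and elapsed time of every running perl process once a minute;
# a steadily growing RSS column would support the runaway-recursion theory
watch -n 60 'ps -o pid,rss,etime,args -C perl'
If the processes are being killed by the kernel's OOM killer rather than exiting on their own, dmesg will usually say so.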

How is this bash script resulting in an infinite loop?

From some Googling (I'm no bash expert by any means) I was able to put together a bash script that allows me to run a test suite and output a status bar at the bottom while it runs. It typically takes about 10 hours, and the status bar tells me how many tests passed and how many failed.
It works great sometimes, however occasionally I will run into an infinite loop, which is bad (mmm-kay?). Here's the code I'm using:
#!/bin/bash
WHITE="\033[0m"
GREEN="\033[32m"
RED="\033[31m"
(run_test_suite 2>&1) | tee out.txt |
while IFS=read -r line;
do
printf "%$(tput cols)s\r" " ";
printf "%s\n" "$line";
printf "${WHITE}Passing Tests: ${GREEN}$(grep -c passed out.txt)\t" 2>&1;
printf "${WHITE}Failed Tests: ${RED}$( grep -c FAILED out.txt)${WHITE}\r" 2>&1;
done
What happens when I encounter the bug is I'll have an error message repeat infinitely, causing the log file (out.txt) to become some multi-megabyte monstrosity (I think it got into the GB's once). Here's an example error that repeats (with four lines of whitespace between each set):
warning caused by MY::Custom::Perl::Module::TEST_FUNCTION
print() on closed filehandle GEN3663 at /some/CPAN/Perl/Module.pm line 123.
I've tried taking out the 2>&1 redirect, and I've tried changing while IFS=read -r line; to while read -r line;, but I keep getting the infinite loop. What's stranger is this seems to happen most of the time, but there have been times I finish the long test suite without any problems.
EDIT:
The reason I'm writing this is to upgrade from a black & white test suite to a color-coded test suite (hence the ANSI codes). Previously, I would run the test suite using
run_test_suite > out.txt 2>&1 &
watch 'grep -c FAILED out.txt; grep -c passed out.txt; tail -20 out.txt'
Running it this way gets the same warning from Perl, but prints it to the file and moves on, rather than getting stuck in an infinite loop. Using watch also prints stuff like [32m rather than actually rendering the text as green.
I was able to fix the perl errors and the bash script seems to work well now after a few modifications. However, it seems this would be a safer way to run the test suite in case something like that were to happen in the future:
#!/bin/bash
WHITE="\033[0m"
GREEN="\033[32m"
RED="\033[31m"
run_full_test > out.txt 2>&1 &
tail -f out.txt | while IFS= read line; do
printf "%$(tput cols)s\r" " ";
printf "%s\n" "$line";
printf "${WHITE}Passing Tests: ${GREEN}$(grep -c passed out.txt)\t" 2>&1;
printf "${WHITE}Failed Tests: ${RED}$( grep -c 'FAILED!!' out.txt)${WHITE}\r" 2>&1;
done
There are some downsides to this. Mainly, if I hit Ctrl-C to stop the test, it appears to have stopped, but really run_full_test is still running in the background and I need to remember to kill it manually. Also, when the test is finished, tail -f is still running. In other words, there are two processes running here and they are not in sync.
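For what it's worth, if I kept the tail -f approach, one way to soften both downsides would be to record the background PID, kill it on exit, and let GNU tail watch it (a sketch, assuming GNU coreutils):
run_full_test > out.txt 2>&1 &
test_pid=$!
# stop the suite if this script is interrupted or exits early
trap 'kill "$test_pid" 2>/dev/null' EXIT
# GNU tail's --pid option makes tail exit once the suite itself has finished
tail -f --pid="$test_pid" out.txt | while IFS= read -r line; do
    printf "%s\n" "$line"   # plus the same status-bar printfs as above
done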
Here is the original script, slightly modified, which addresses those problems, but isn't foolproof (i.e. can get stuck in an infinite loop if run_full_test has issues):
#!/bin/bash
WHITE="\033[0m"
GREEN="\033[32m"
RED="\033[31m"
(run_full_test 2>&1) | tee out.txt | while IFS= read line; do
printf "%$(tput cols)s\r" " ";
printf "%s\n" "$line";
printf "${WHITE}Passing Tests: ${GREEN}$(grep -c passed out.txt)\t" 2>&1;
printf "${WHITE}Failed Tests: ${RED}$( grep -c 'FAILED!!' out.txt)${WHITE}\r" 2>&1;
done
The bug is in your script. That's not an IO error; that's an illegal argument error. That error happens when the variable you provide as a handle isn't a handle at all, or is one that you've closed.
Writing to a broken pipe results in the process being killed by SIGPIPE or in print returning false with $! set to EPIPE.

Get current playing file in MPlayer slave mode

Problem: I can't find any way to reliably get the current playing file in an MPlayer playlist.
Here is how far I have gotten. This working ash script monitors a text file with the path to the current playlist. When I update the file, the script closes the old instance of MPlayer and opens a new one with the new playlist:
# POLL PLAYLIST FILE FOR CHANGES
CURRENTPLAYLISTPATH=/home/tc/currentplaylist
INFIFO=/tmp/mplayer-in
CURRENTPLAYLIST="NEVERMATCHAPLAYLIST"
FIRSTRUN=1
while [ 1 ];
do
    # CHECK FOR NEW PLAYLIST
    NEWPLAYLIST=$(head -n 1 $CURRENTPLAYLISTPATH)
    if [[ "$NEWPLAYLIST" != "$CURRENTPLAYLIST" ]]; then
        if [ "$FIRSTRUN" == 0 ]; then
            echo "quit" > "$INFIFO"
        fi
        # CREATE NAMED PIPE, IF NEEDED
        trap "rm -f $INFIFO" EXIT
        if [ ! -p $INFIFO ]; then
            mkfifo $INFIFO
        fi
        # START MPLAYER
        mplayer -fixed-vo -nolirc -vc ffmpeg12vdpau,ffh264vdpau, -playlist $NEWPLAYLIST -loop 0 -geometry 1696x954 -slave -idle -input file=$INFIFO -quiet -msglevel all=0 -identify | tee -a /home/tc/mplayer.log &
        CURRENTPLAYLIST=$NEWPLAYLIST
        FIRSTRUN=0
    fi
    sleep 5;
done
My original plan was just to use the "-identify" flag and parse the log file. This actually works really well up until I need to truncate the log file to keep it from getting too large. As soon as my truncating script is run, MPlayer stops writing to the log file:
FILENAME=/home/tc/mplayer.log
MAXCOUNT=100
if [ -f "$FILENAME" ]; then
    LINECOUNT=`wc -l "$FILENAME" | awk '{print $1}'`
    if [ "$LINECOUNT" -gt "$MAXCOUNT" ]; then
        REMOVECOUNT=`expr $LINECOUNT - $MAXCOUNT`
        sed -i 1,"$REMOVECOUNT"d "$FILENAME"
    fi
fi
I have searched and searched but have been unable to find any other way of getting the current playing file that works.
I have tried piping the output to another named pipe and then monitoring it, but it only works for a few seconds, and then MPlayer completely freezes.
I have also tried using bash (instead of ash) and piping the output to a function like the following, but I get the same freezing problem:
function parseOutput()
{
    while read LINE
    do
        echo "get_file_name" > /tmp/mplayer-in
        if [[ "$LINE" == *ANS_FILENAME* ]]
        then
            echo ${LINE##ANS_FILENAME=} > "$CURRENTFILEPATH"
        fi
        sleep 1
    done
}
# START MPLAYER
mplayer -fixed-vo -nolirc -vc ffmpeg12vdpau,ffh264vdpau, -playlist $NEWPLAYLIST -loop 0 -geometry 1696x954 -slave -idle -input file=/tmp/mplayer-in -quiet | parseOutput &
I suspect I am missing something very obvious here, so any help, ideas, points in the right direction would be greatly appreciated.
fodder
Alright then, so I'll post mine too.
Give this one a try (assuming there is only one instance running, like on fodder's machine):
basename "$(readlink /proc/$(pidof mplayer)/fd/* | grep -v '\(/dev/\|pipe:\|socket:\)')"
This is probably the safer way, since the file descriptors might not always be in the same order on all systems.
However, this can be shortened, with a little risk:
basename "$(readlink /proc/$(pidof mplayer)/fd/*)" | head -1
You might also like to install this:
http://mplayer-tools.sourceforge.net/
Well, I gave up on getting the track from MPlayer itself.
My 'solution' is probably too hackish, but works for my needs since I know my machine will only ever have one instance of MPlayer running:
lsof -p $(pidof mplayer) | grep -o "/path/to/my/assets/.*"
If anyone has a better option I'm certainly still interested in doing this the right way, I just couldn't make any of the methods work.
fodder
You can use the run command.
Put this in ~/.mplayer/input.conf:
DEL run "echo ${filename} ${stream_pos} >> /home/knarf/out"
Now if you press the delete key while playing a file, it will do what you expect, i.e. append the currently playing file and the position in the stream to the ~/out file. You can replace echo with your own program.
See the slave mode docs for more info (Ctrl-F somevar).
About getting properties from MPlayer
I have used an inelegant solution, but it is working for me.
stdbuf -oL mplayer --slave --input=file=$FIFO awesome_awesome.mp3 |
{
    while IFS= read -r line
    do
        if [[ "${line}" == ANS_* ]]; then
            echo "${line#*=}" > "${line%=*}" # echo property_value > property_name
        fi
    done
} &
mplayer_pid=$!
read filename < ./ANS_FILENAME
read timeLength < ./ANS_LENGTH
echo "($timeLength) $filename"
and so on..
It runs in another process; that's why I've used files to pass the properties back.
stdbuf is there so that nothing gets missed.
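For completeness, the ANS_* lines only appear after the matching query has been written to the FIFO, so something along these lines has to run alongside it (get_file_name and get_time_length are standard slave-mode commands; the 1-second loop is just an illustration):
# ask MPlayer for the properties once a second; the replies arrive on its stdout
# as ANS_FILENAME=... and ANS_LENGTH=..., which the loop above turns into files
while sleep 1; do
    echo "get_file_name"   > "$FIFO"
    echo "get_time_length" > "$FIFO"
done &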
I started putting together a bash library to handle tasks like this. Basically, you can accomplish this by dumping the mplayer output to a file. Then you grep that dump for "Playing " and take the last result with tail. This should give you the name of the file that's currently playing or that last finished playing.
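In other words, the core of it boils down to something like this (mplayer.dump is just an illustrative name for wherever you redirected the output):
# the last "Playing <file>." line in the dump is the file currently
# (or most recently) being played
grep 'Playing ' mplayer.dump | tail -n 1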
Take a look at my bash code. You'll want to modify the playMediaFile function to your needs, but the getMediaFileName function should do exactly what you're asking. You'll find the code on my github.