What can awk do that sed can't?

I used sed for a batch process where I could not do it with awk. Could awk have done it? Or is it more a matter of choice, with awk and sed being equivalent for this kind of usage? Both handle the common search-and-replace tasks similarly, with I/O. Is there a good example of something that can be done with one but not the other?

One main difference is that an awk program can maintain state and can operate using multiple passes over the same data. A sed invocation is necessarily stateless single-pass because sed (Stream EDitor) is inherently stream-oriented. The advantage, though, is that this makes sed simpler and more appropriate for using in pipe chains.
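As a rough illustration of the state and multi-pass point, here is a minimal sketch in which awk reads the same file twice: the first pass accumulates a total, the second prints each value as a share of it. The function name percent_of_total and the sample data are made up for this example.

```shell
# Hypothetical sample data: one number per line.
data=$(mktemp)
printf '%s\n' 10 30 60 > "$data"

# percent_of_total FILE (illustrative helper, not from the original answer).
# The file is passed to awk twice. While NR == FNR awk is still in the
# first pass and only accumulates the total; in the second pass it prints
# each value together with its share of that total.
percent_of_total() {
    awk 'NR == FNR { total += $1; next }
         { printf "%s %.0f%%\n", $1, 100 * $1 / total }' "$1" "$1"
}

percent_of_total "$data"
rm -f "$data"
```

The variable total survives across lines and across both passes, which is the kind of state the answer is referring to.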

In the original (and still the best) book, The AWK Programming Language, the following are implemented (among many other things):
a simple assembler
recursive descent compiler
a text indexing program
Try doing that with sed.

G'day,
Awk is more powerful. sed tends to be more limited in what it can do.
Sed is good for line-based changes to data. It has some simple looping constructs, the usual ed/ex/vi regexp stuff and substitution things, compound statements, decisions etc. Most people use it for modifying piped data.
Awk is good for filtering or rearranging data. It mostly gets used for reporting.
I'd suggest having a look at Dale Dougherty's excellent book "sed & awk" (sanitised Amazon link). BTW It's got one of the best explanations of regexps in there as well.
Many people would say use Perl anyway! (-:
Edit: Forgot to say that the awk language is quite C like which is no surprise given that the 'k' in awk stands for Brian Kernighan. Yes. That Brian Kernighan!
Also, sed only works on data streams whereas awk works on both data streams and files.
HTH
cheers,

The only thing I can think of is that sed can make small changes in fewer characters than awk. So for quick tweaks in a live shell, it's faster to type.

SED is a stream editor, and therefore does not have variables and a few other constructs like AWK has. AWK is a fully fledged language.

Related

Deleting lines of a file with sed - unexpected behaviour

I noticed something a bit odd while fooling around with sed. If you try to remove multiple line intervals (by number) from a file, but any interval specified later in the list is fully contained within an interval earlier in the list, then an additional single line is removed after the specified (larger) interval.
seq 10 > foo.txt
sed '2,7d;3,6d' foo.txt
1
9
10
This behaviour was behind an annoying bug for me, since in my script I generated the interval endpoints on the fly, and in some cases the intervals produced were redundant. I can clean this up, but I can't think of a good reason why sed would behave this way on purpose.
Since this question was highlighted as needing an answer in the Stack Overflow Weekly Newsletter email for 2015-02-24, I'm converting the comments above (which provide the answer) into a formal answer. Unattributed comments here were made by me in essentially equivalent form.
Thank you for a concise, complete question. The result is interesting. I can reproduce it with your script. Intriguingly, sed '3,6d;2,7d' foo.txt (with the delete operations in the reverse order) produces the expected answer with 8 included in the output. That makes it look like it might be a reportable bug in (GNU) sed, especially as BSD sed (on Mac OS X 10.10.2 Yosemite) works correctly with the operations in either order. I tested using 'sed (GNU sed) 4.2.2' from an Ubuntu 14.04 derivative.
More data points for you/them. Both of these include 8 in the output:
sed -e '/2/,/7/d' -e '/3/,/6/d' foo.txt
sed -e '2,7d' -e '/3/,/6/d' foo.txt
By contrast, this does not:
sed -e '/2/,/7/d' -e '3,6d' foo.txt
The latter surprised me (even accepting the basic bug).
Beats me. I thought given some of sed's arcane constructs that you might be missing the batman symbol or something from the middle of your command but sed -e '2,7d' -e '3,6d' foo.txt behaves the same way and swapping the order produces the expected results (GNU sed 4.2.2 on Cygwin). /bin/sed on Solaris always produces the expected result and interestingly so does GNU sed 3.02. (Ed Morton)
More data: it only seems to happen with sed 4.2.2 if the 2nd range is a subset of the first: sed '2,5d;2,5d' shows the bug, sed '2,5d;1,5d' and sed '2,5d;2,6d' do not. (glenn jackman)
The GNU sed home page says "Please send bug reports to bug-sed at gnu.org" (except it has an @ in place of ' at '). You've got a good reproduction; be explicit about the output you expect vs the output you get (they'll get the point, but it's best to make sure they can't misunderstand). Point out that the reverse ordering of the commands works as expected, and give the various other commands as examples of working or not working. (You could even give this Q&A URL as a cross-reference, but make sure that the bug report is self-contained so that it can be understood even if no-one follows the URL.)
You can also point to BSD sed (and the Solaris version, and the older GNU 3.02 sed) as behaving as expected. With the old version GNU sed working, it means this is arguably a regression. […After a little experimentation…] The breakage occurred in the 4.1 release; the 4.0.9 release is OK. (I also checked 4.1.5 and 4.2.1; both are broken.) That will help the maintainers if they want to find the trouble by looking at what changed.
The OP noted:
Thanks everyone for comments and additional tests. I'll submit a bug report to GNU sed and post their response. (santayana)
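For scripts that generate interval endpoints on the fly, one possible way to sidestep the nested-range problem is to do the deletion in awk, where each range is tested against the line number independently. This is only a sketch of a workaround, not something proposed in the thread; the helper name delete_ranges and its START,END argument format are invented for illustration.

```shell
# delete_ranges START,END [START,END ...]   (hypothetical helper)
# Copies stdin to stdout, dropping every line whose number falls inside
# any of the given inclusive ranges. Ranges may overlap or nest freely,
# because each line number is simply checked against every range.
delete_ranges() {
    awk -v ranges="$*" '
        BEGIN {
            n = split(ranges, r, " ")
            for (i = 1; i <= n; i++) {
                split(r[i], p, ",")
                lo[i] = p[1] + 0
                hi[i] = p[2] + 0
            }
        }
        {
            for (i = 1; i <= n; i++)
                if (NR >= lo[i] && NR <= hi[i]) next
            print
        }
    '
}

# Same ranges as in the question; line 8 is kept here.
seq 10 | delete_ranges 2,7 3,6
```

With the ranges from the question this prints 1, 8, 9 and 10, which is the result the asker expected from sed.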

Perl: console / command-line tool for interactive code evaluation and testing

Python offers an interactive interpreter allowing the evaluation of little code snippets by submitting a couple of lines of code to the console. I was wondering if a tool with similar functionality (e.g. including a history accessible with the arrow keys) also exists for Perl?
There seem to be all kinds of solutions out there, but I can't seem to find any good recommendations. I.e. lots of tools are mentioned, but I'm interested in which tools people actually use and why. So, do you have any good recommendations, excluding the standard perl debugging (perl -d -e 1)?
Here are some interesting pages I've had a look at:
a question in the official Perl FAQ
another Stackoverflow question, where the answer mostly is the perl debugger and several links are broken
Perl Console
Perl Shell
perl -d -e 1
is perfectly suitable; I've been using it for years and years. But if you just can't, then you can check out Devel::REPL.
If your problem with perl -d -e 1 is that it lacks command line history, then you should install Term::ReadLine::Perl which the debugger will use when installed.
Even though this question has plenty of answers, I'll add my two cents on the topic. My approach to the problem is easy if you are a ViM user, but I guess it can be done from other editors as well:
Open your ViM, and type your code. You don't need to save it on any file.
:w !perl for evaluation (:w !COMMAND pipes the buffer to the process obtained by running COMMAND. In this case the mighty perl interpreter!)
Take a look at the output
This approach is good for any interpreted language, not just for Perl.
In the case of Perl it is extremely convenient when you are writing your own modules, since in my experience the perl interpreter will refuse to reload a module (even when loading was attempted and failed). On the minus side, you will lose all your context every time, so if you are doing some heavy or slow operation, you need to save some intermediate results (whilst the perl console approach preserves the previously computed data).
If you just need the evaluation of an expression - which is the other use case for a perl console program - another good alternative is seeing the evaluation out of a perl -e command. It's fast to launch, but you have to deal with escaping (for this, the $'...' syntax of Bash does the job pretty well).
Just use rlwrap to get history and arrow keys:
rlwrap perl -de1

General help for deciphering/explaining sed one-liners?

I've just stumbled upon some cryptic sed expression in a legacy script. Could you give me some hints how to start decoding it?
Best thing would be some automatic tool translating sed incantations to English, but for a close runner up, I'd be very grateful for some nice index of (all) sed commands. Otherwise, I'm certainly highly interested in any help at all on how to quickly attack the problem (other than having to read the manual cover to cover...).
(Side note: as you may have guessed, I don't want to just paste the expression here, as I'd like to be able to do it easier and faster next time I stumble on some similar line noise...)
I'd be very grateful for help!
Edit: regexps themselves aren't problem, by the way, I'm good enough at them.
I don't think there is an automatic tool that can 'translate' sed commands to English. However, you may want to check http://aurelio.net/sedsed/. It will help you understand a sed script: what it does, and how.
Anyway, it would be good if you listed some examples.
This might work for you.
Unix in a Nutshell by Robbins has a very nice chapter on sed. Clear and concise descriptions of the commands.
Your best bet would be to learn the sed language in-depth. Unfortunately, the sed documentation is more like a reference. Here's a nice step by step guide that doesn't take too long to read.
I found "Sed One-Liners Explained" to be very informative as well as fun.
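Alongside those references, one manual technique that may help is to split a one-liner at its semicolons, put each command on its own line, and annotate it with sed's own # comments. The sketch below does this for the well-known line-reversing one-liner sed '1!G;h;$!d'; the wrapper name reverse_lines is only there for illustration.

```shell
# reverse_lines: the one-liner  sed '1!G;h;$!d'  spelled out one command
# per line, with a sed comment describing each step.
reverse_lines() {
    sed '
# 1!G : on every line except the first, append the hold space to the pattern space
1!G
# h   : copy the pattern space (this line plus all earlier ones) into the hold space
h
# $!d : on every line except the last, delete the pattern space without printing
$!d
'
}

seq 3 | reverse_lines
```

Running the annotated version on small inputs such as seq 3 and changing one command at a time is a quick way to confirm what each piece contributes.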

Is it possible to write a shell script which is faster than the equivalent script in Perl? [closed]

I wrote multiple scripts in Perl and shell and I have compared the real execution time. In all the cases, the Perl script was more than 10 times faster than the shell script.
So I wondered if it is possible to write a shell script which is faster than the same script in Perl? And why is Perl faster than shell although I use the system function in the Perl script?
There are a few ways to make your shell (e.g. Bash) scripts execute faster.
Use fewer external commands if Bash's internals can do the task for you; e.g., avoid excessive use of sed, grep, awk, etc. for string/text manipulation.
If you are manipulating relatively BIG files, don't use bash's while read loop. Use awk. If you are manipulating really BIG files, you can use grep to search for the patterns you want, and then pass them to awk to "edit". grep's searching algorithm is very good and fast. If you want to get only the front or end of the file, use head and tail.
The work of file manipulation tools such as sed, cut, grep, wc, etc. can all be done with one awk script, or with Bash internals if it is not complicated. Therefore, you can try to cut down the use of these tools that overlap in their functions.
Unix pipes/chaining is excellent, but using too many of them, e.g.
command|grep|grep|cut|sed
makes your code slow. Each pipe is an overhead. For this example, just one awk does them all:
command | awk '{do everything here}'
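To make the "one awk does them all" point concrete, here is a small sketch that filters, selects fields and reformats first with a grep, cut and sed chain and then with a single awk process. The sample records and the function names sample, with_pipeline and with_awk are invented for the illustration.

```shell
# Hypothetical sample input: name:placeholder:uid
sample() {
    printf '%s\n' 'alice:x:1000' 'bob:x:1001' 'carol:x:1002'
}

# Chained tools: three extra processes after the producer.
with_pipeline() {
    sample | grep -v '^bob:' | cut -d: -f1,3 | sed 's/:/ /'
}

# One awk process doing the filter, the field selection and the reformatting.
with_awk() {
    sample | awk -F: '!/^bob:/ { print $1, $3 }'
}

with_pipeline
with_awk
```

Both functions print the same two lines (alice 1000 and carol 1002); whether the awk version is measurably faster depends on the input size and the system, so it is worth timing on real data.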
The closest tool that can match Perl's speed for certain tasks, e.g. string manipulation or maths, is awk. Here's a fun benchmark for this solution. There are around 9 million numbers in the file.
Output
$ head -5 file
1
2
3
34
42
$ wc -l <file
8999987
$ time perl -nle '$sum += $_ } END { print $sum' file
290980117
real 0m13.532s
user 0m11.454s
sys 0m0.624s
$ time awk '{ sum += $1 } END { print sum }' file
290980117
real 0m9.271s
user 0m7.754s
sys 0m0.415s
$ time perl -nle '$sum += $_ } END { print $sum' file
290980117
real 0m13.158s
user 0m11.537s
sys 0m0.586s
$ time awk '{ sum += $1 } END { print sum }' file
290980117
real 0m9.028s
user 0m7.627s
sys 0m0.414s
For each try, awk is faster than Perl.
Lastly, try to learn awk beyond what it can do in one-liners.
This might fall dangerously close to arm-chair optimization, but here are some ideas that might rationalize your results:
Fork/exec: almost anything useful that is done by a shell script is done via a shell-out, that is, starting a new shell and running a command such as sed, awk, cat, etc. More often than not, more than one process is executed, and data is moved via pipes.
Data structures: Perl's data structures are more sophisticated than Bash's or Csh's. This typically forces the shell programmer to be creative with data storage. This can take the forms of:
using non-optimal data structures (arrays instead of hashes)
storing data in textual form (for example integers as strings) that needs to be reinterpreted every time
saving data in a file, and re-parsing it again and again
etc.
Non-optimized implementation: some shell constructs might not be designed with optimization in mind, but with user convenience. For example, I have reason to believe that the bash implementation of Parameter Expansion, in particular ${foo//search/replace}, is sub-optimal relative to the same operation in sed. This is typically not a problem for day-to-day tasks.
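The fork/exec point can be sketched with two ways of stripping a suffix from a list of names: one shells out to sed once per item, the other uses POSIX parameter expansion inside the shell itself. The function names and the sample file names are hypothetical, and no timing is claimed here; the sketch only shows where the extra processes come from.

```shell
# Both functions strip a trailing ".log" from each argument.

# Shell-out version: every item costs a pipeline and a new sed process.
strip_with_sed() {
    for name in "$@"; do
        printf '%s\n' "$name" | sed 's/\.log$//'
    done
}

# Builtin version: POSIX parameter expansion, no external command started.
strip_builtin() {
    for name in "$@"; do
        printf '%s\n' "${name%.log}"
    done
}

strip_with_sed app.log db.log
strip_builtin app.log db.log
```

Both calls print app and db; with thousands of items the per-item process start-up in the first version is the kind of overhead described above.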
Okay, I know I'm asking for it by opening up a can of worms closed two years ago, but I'm not 100% happy with any of the answers.
The right answer is YES. But most new coders will still go to Perl and Python and write code that struggles mightily to WRAP CALLS TO EXTERNAL EXECUTABLES because they lack the mentoring or experience required to know when to use which tools.
The Korn Shell (ksh) has fast builtin math, and a fully capable and speedy regex engine that, gasp, can handle Perl type regex. It also has associative arrays. It can even load external .so libraries. And it was a finished and mature product 10 years ago. It's even already installed on your Mac.
No, I think it is impossible:
Bash is a truly interpreted language, whereas Perl programs are compiled to bytecode before execution.
Certain shell commands can run faster than Perl, in some situations. I once benchmarked a simple sed script against the equivalent in perl, and sed won. But when the requirements became more complex, the perl version started beating the sed version. So the answer is, it depends. But for other reasons, (simplicity, maintainability, etc.) I'd lean toward doing things in Perl anyway unless the requirements are very simple, and I expect them to stay that way.
Yes. C code is going to be faster than Perl code for the same thing, so a script which uses a compiled executable for doing a lot of work is going to be faster than a perl program doing the same thing.
Of course, the Perl program could be rewritten to use the executable, in which case it would probably be faster again.

Why is Perl the best choice for most string manipulation tasks?

I've heard that Perl is the go-to language for string manipulation (and line noise ;). Can someone provide examples and comparisons with other language(s) to show me why?
It is very subjective, so I wouldn't say that Perl is the best choice, but it is certainly a valid choice for string manipulation. Other alternatives are Tcl, Python, AWK, etc.
I like Perl's capabilities because it has excellent support (better than POSIX, as pointed out in the comment) for fast regexes, and the implicit variables make it easy to do basic string crunching with very little code.
If you have a *nix background a lot of what you already know will apply to Perl as well, which makes it fairly easy to pick up for a lot of people.
Perl -> Practical Extraction and Report Language
Perl's strength (when it comes to string processing) lies in its very powerful regular expression engine.
Because of this there are many people in the field of bioinformatics using Perl as their main tool, hence the large number of posts about BioPerl on PerlMonks. In bioinformatics they work with strings a lot; they call them "sequences" (I don't know much about this).
Perlmonks.org is the heart of the Perl community; check out the immense number of hits when you search for site:perlmonks.org regex: 20,000 hits.
You cannot ignore the sheer number of modules on CPAN:
375 modules under the namespace String on CPAN (Perl's module repository)
241 in Regex namespace
156 in Regexp namespace.
This is very clear evidence that Perl is a very powerful language when it comes to string processing.
So if you want to do some string processing and you're using Perl, you've got it covered :)
To address the second part of your question: Perl's reputation for line noise comes from 4 kinds of people:
Overly clever (for their own good) hackers (or sometimes just hacks) who value cleverness and showing off over readability. "If it was hard to write it should be hard to read" is NOT just a mythical attitude.
People who wouldn't know good software development if it hit them over the head with a cluebat. Such as people who save a couple of characters in a program by using $_ instead of a named variable. In a nested scope. Or never heard of comments. Or self-documenting identifiers. Or whitespace.
People who think that software development == code golf. More seriously, that the fewer characters there are in the code, the more readable it is, because they misunderstand what "conciseness" means in code.
(NOTE: first 2 sets are not mutually exclusive)
People who code/hack in Perl (e.g. sysadmins) who have very little training, experience or incentive to do software development. E.g. the percentage of people using Perl who do quick and dirty hacks with bad style and worse code quality is probably higher than for, say, Python.
Just for reference, 80% of awful Perl "code" in my $work falls under this: it was written by financial analysts who are smart enough to pick up a Perl book and some earlier scripts and clone off a script that does what the business needs, but who don't have the CS/programming background to worry about how readable/maintainable their code is.
In other (and less snide) words, you can write beautiful, incredibly readable and easy to maintain software in Perl. It all depends on who does the writing, what their priorities and skills are. Also, just like with any other language, you can write a miserable write-only mess with it.
The difference from other languages is that very often, the write-onlyness of said mess, when done in Perl, does indeed consist of a very high density of non-letter characters (sigils and special characters in poorly written regexes). This high density can indeed asymptotically approximate line noise.
Because that is what Perl was made for. Because Perl is expressive, powerful and fast. Many times I have beaten specialized products with a small, dirty Perl script written in a few minutes. For example, outer joins and large joins vs. MySQL (just because it can't do a merge join), ETL processing vs. Java Hadoop (because I have years of experience writing it effectively, and Perl's IO layer is just great), and so on.
It's a very subjective question. Perhaps the true answer is that Perl has a nice syntax (incl. the regex syntax) that makes people want to sing its praises over other languages? IMHO, any language that supports a rich regex syntax would be considerably powerful at string manipulation.
Kids these days! Back in the day, all we had was SNOBOL -- and we liked it! Try it sometime...you never know, you might want something respectable to fall back on when this Perl fad runs its course!
Perl is widely used for string manipulation tasks as its string manipulation API is easy to learn, and its regex support is widely used. It has been in use for a very long time, and anyone with a Unix background would pick up Perl very easily. Historically, Perl was developed in the late 80s for report processing and text processing tasks. So to date the trend continues: anyone with a string manipulation or text processing task tends to opt for Perl as the first choice. It's not that other languages like Python aren't up to the task, but Perl is popular in this area.
I like Perl a lot, write books about it, publish a magazine about it, and so on. I don't think I would ever say it's the best language to do anything in. A lot of that has to do with the task you need to do. For many string processing tasks, ETL, data cleanup, and so on, Perl is a very strong and capable language. You wouldn't have that much trouble doing simple tasks.
Your comment sounds like it comes from the early 1990s though, when the rest of the world hadn't caught up. Many of the dynamic languages are now up to task, so you might not have to switch languages. If you decide to use Perl and run into problems, there are plenty of people here who are willing to help, and not all of us will fault you if you choose something else. :)
At the beginning, Perl was developed for easy report processing and dealing with text files, thus it has very strong regex support. Most of the info on regexes can be found in perldoc.
Perl was the go-to language for a long time. The problem is it can be pretty messy and difficult to maintain (some people can write Perl that avoids this, but it is very easy to write ugly code). I would not tell you to avoid Perl, but many have moved on to some modern alternatives.
I would recommend learning one of the newer scripting languages such as Python or Ruby. Both will work very well for your needs, and can easily handle more difficult tasks later on. They're both quite nice to work in, after having written C and Perl for so long.
In short, Perl would be a good hammer for this nail. Python and Ruby would be nail-guns.
I disagree that Perl is the best language for text processing. Simple things are easy; to replace foo with bar:
$data =~ s/foo/bar/g;
Harder things are not simple, though. Look at Data::SExpression, for example. It is a lot of code to do something very simple.
A similar implementation in Haskell with PArrow looks something like:
import Text.ParserCombinators.PArrow
data Atom = QuotedString String | Symbol String
    deriving (Show, Eq)
data Sexp = Sexp [Sexp] | Atom Atom
    deriving (Eq)
quotedString :: Char -> Char -> MD a Atom
quotedString quoteChar escapeChar = between q q inside >>^ QuotedString
    where q = char quoteChar
          inside = many $ (char escapeChar >>> anyChar) <+> notChar quoteChar
doubleQuotedString, symbol :: MD a Atom
doubleQuotedString = quotedString '"' '\\'
symbol = word >>^ Symbol
atom, sexp :: MD a Sexp
atom = (doubleQuotedString <+> symbol) >>^ Atom
sexp = atom <+> (between (char '(') (char ')') sexp' >>^ Sexp)
    where sexp' = sepBy1 sexp spaces
Just sayin'. Perl is not the end-all-and-be-all of text manipulation. There are many reasons to prefer Perl to other languages, but parsing is not one of them.