Joining specific lines in file - sed

I have a text file (snippet below) containing some public-domain corporate earnings report data, formatted as follows:
Current assets:
Cash and cash equivalents
$ 21,514 $ 21,120
Short-term marketable securities
33,769 20,481
Accounts receivable
12,229 16,849
Inventories
2,281 2,349
and what I'm trying to do (with sed) is the following: if the current line starts with a capital letter, and the next line starts with whitespace, copy the last N characters from the next line into the last N columns of the current line, then delete the next line. I'm doing it this way, because there are other lines in the files that begin with whitespace that I want to ignore. The results should look like the following:
Current assets:
Cash and cash equivalents $ 21,514 $ 21,120
Short-term marketable securities 33,769 20,481
Accounts receivable 12,229 16,849
Inventories 2,281 2,349
The closest I've come to getting what I want is:
sed -i -r ':a;N;$!ba;s/[^A-Z]*\n([[:space:]])/\1/g' file.txt
and I believe I've got the pattern matching ok, but the subsequent substitution really messes up the alignment of the columns of numbers. When I first started this, this seemed like a simple operation, but hours of searching and experimenting haven't helped. I'm open to any solutions that use something else other than sed, but would prefer to keep it strictly bash. Thank you much!

This might work for you (GNU sed):
sed -r '/^[[:upper:]]/{N;/\n\s/{h;x;s/\n.*//;s/./ /g;x;G;s/(\n *)(.*)\1$/\2/};P;D}' file
This solution only processes two consecutive lines that start with an upper-case letter and a white space respectively. All other lines are printed as is.
Having gathered the above two lines into the pattern space (PS), a copy is made and stored in the hold space (HS). Processing now swaps to the HS. The second line is removed and the contents of the first turned into spaces. Processing now swaps back to the PS. The HS is appended to the PS and using matching and back references the length of the first line in spaces is subtracted from the combined lines.
The line(s) are printed and then deleted. If the second line did not begin with a space, by use of the P and D commands, it is not deleted but re-appraised by virtue of the regexp at the start of the sed script.
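Since the question is open to tools other than sed, the same overlay idea can be sketched in awk. This is only a minimal sketch: the name join_wrapped is made up for illustration, and it assumes the indented line has at least as many leading blanks as the label line is long (otherwise the first characters of the numbers would be cut off).

```shell
# Hypothetical helper: overlay each label line onto the indented line after it.
join_wrapped() {
  awk '
    # Print a label line that turned out to have no indented partner.
    function emit_held() { if (have) print held; have = 0 }

    # Indented line directly after a label: keep the label, then append the
    # part of the indented line that lies to the right of the label.
    have && /^[ \t]/ { print held substr($0, length(held) + 1); have = 0; next }

    { emit_held() }                           # previous label stands alone
    /^[A-Z]/ { held = $0; have = 1; next }    # remember a possible label
    { print }                                 # everything else is unchanged
    END { emit_held() }
  ' "$@"
}

printf 'Current assets:\nCash\n      $ 1\n' | join_wrapped
```

Indented lines that do not directly follow a line starting with a capital letter pass through untouched, which matches the requirement to ignore them.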

Related

How to use sed to isolate only the first part of a file

I'm running Windows and have the GnuWin32 toolkit, which includes sed. Specifically:
C:\TEMP>sed --version
GNU sed version 4.2.1
I have a text file with two sections: A fixed part I want to preserve, and a part that's appended after running a job.
The file contains a unique string that marks the start of the added part, and I'd like to use GNU sed to isolate only the part of the file before that string, so I can append different data to the fixed part each time the job is run.
I know I could keep the fixed portion in a separate file, but that adds complexity and it would be more elegant if I could just reuse the data at the start of the same file.
A long time ago I knew how to set up sed scripts, and I'm sure this can be done with sed, but I've slept since then. :)
Can you please describe how to use sed to display the lines of text in a file up to and not including a specific string?
Example:
line 1 of fixed portion
line 2 of fixed portion
unique string
line 1 of appended portion
line 2 of appended portion
line 3 of appended portion
What I'd like is to see as output:
line 1 of fixed portion
line 2 of fixed portion
I've gotten as far as:
sed -r -n -e "0,/unique string/p"
but that prints the unique string as well.
Thanks in advance.
-Noel
This should work for you:
sed -n '/unique string/q;p' file
It quits processing at unique string. Other lines get printed.
An alternative might be to use a range address like this:
sed -n '1,/unique string/{/unique string/!p}' file
Note that sed includes the range border. We need to exclude unique string from printing.
Furthermore I'm using the -n option which makes sed suppress the output of input lines by default.
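To see that both commands give the same result, here is a small sketch that runs them on a cut-down copy of the sample from the question. The variable names and the temporary file are only for illustration.

```shell
# Build a sample file like the one in the question in a temporary location.
sample=$(mktemp)
printf '%s\n' \
  'line 1 of fixed portion' \
  'line 2 of fixed portion' \
  'unique string' \
  'line 1 of appended portion' > "$sample"

# Variant 1: quit at the marker; with -n, only the explicit p prints lines.
with_q=$(sed -n '/unique string/q;p' "$sample")

# Variant 2: take the range up to the marker and skip the marker line itself.
with_range=$(sed -n '1,/unique string/{/unique string/!p}' "$sample")

printf '%s\n' "$with_q"
rm -f "$sample"
```

The q variant also stops reading as soon as the marker is seen, which can matter for large files.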
One thing, if unique string can contain characters which are also syntax characters in the regex like ...
test*
... sed might not be the right tool for the job any more since it can only match regular expressions but not fixed strings.
In that case awk might be the tool of choice:
awk 'index($0, "unique string"){exit}1' file
index($0, "string") returns a non-zero value (the position) if the string is found in the current line. In that case we stop processing further input lines and don't print that line either.
The trailing 1 always evaluates to true and makes awk print all the lines until the previous condition applies.
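A sketch of how this could be wrapped up so the marker is passed in as literal text. The helper name print_before and the sample lines are made up for illustration. Note that the regex test* would already match the line "testing", while index() only matches the literal text test*.

```shell
# Hypothetical helper: print the lines of file $2 that precede literal text $1.
print_before() {
  # Note: awk interprets backslash escapes in -v values, so a marker that
  # contains backslashes would need different handling.
  awk -v marker="$1" 'index($0, marker) { exit } 1' "$2"
}

sample=$(mktemp)
printf '%s\n' 'keep 1' 'testing' 'test*' 'drop' > "$sample"
before=$(print_before 'test*' "$sample")
printf '%s\n' "$before"
rm -f "$sample"
```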

How do I replace lines between two patterns with a single line in sed?

This is my input file:
one
two
three
four
five
six
seven
eight
nine
ten
I want to turn the file into
one
two
three
NEW LINE
eight
nine
ten
with sed. That is, I want to replace the lines from /four/ (including) to /seven/ (including) with the single line NEW LINE.
I can do that with
sed '/four/aNEW LINE
/four/,/seven/d' file.txt
But I am wondering if there is a simpler way, notably one without having to repeat a pattern (as I needed to with /four/).
Edit: As per fedorqui's comment-question, this can also be in awk (although for "academic" purposes I'd be interested in sed solutions).
Edit 2: Unfortunately, the input file suggests that there is a logical order of words in the input file (one followed by two followed by three etc). In my "real world" problem, this is not the case. I have no idea how many lines the file has, nor what precedes or follows the lines four and seven. The only thing I know is that there is a line four which is (not necessarily immediately) followed by a line seven. I am sorry for not stating this clearly when I asked the question, especially because fedorqui has put so much effort into his answer.
Perl is pretty concise, and you don't need to repeat any keywords:
perl -0777 -pe 's/four.*seven/NEW LINE/s' file
Here is how you do in sed:
$ sed ':a;N;s/four.*seven/NEW LINE/;ba' file
one
two
three
NEW LINE
eight
nine
ten
The logic is much the same as Glenn's answer: slurp the entire file into one long string with embedded newlines, then replace everything from four through seven with NEW LINE.
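The command above relies on GNU sed printing the pattern space when N runs out of input; POSIX allows sed to quit without printing in that case. A sketch of a variant that only calls N while a next line exists, and substitutes once everything has been read (replace_block is a made-up name):

```shell
# Hypothetical helper: replace everything from "four" through "seven".
replace_block() {
  # :a / $!{ N; ba } collects the whole input in pattern space first;
  # the substitution then runs once, on the last line.
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
      -e 's/four.*seven/NEW LINE/' "$@"
}

printf '%s\n' one two three four five six seven eight nine ten | replace_block
```

Because .* is greedy, this replaces up to the last "seven" in the input, which is fine for the sample but worth keeping in mind for real data.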
With sed, you can delete from line four to seven and append after seven. Which is in fact what you posted in your question :)
$ sed -e '/seven/a \NEW LINE' -e '/four/,/seven/d' file
one
two
three
NEW LINE
eight
nine
ten
With awk you can do:
$ awk '/four/ {f=1} !f; /seven/ {print "NEW LINE"; f=0}' file
one
two
three
NEW LINE
eight
nine
ten
What it does is to keep updating the flag f that stops the printing.
When "four" is found, the flag is activated.
When "seven" is found, the flag is deactivated, printing also the NEW LINE.
This might work for you (GNU sed & bash):
sed $'/^four/{:a;N;/^seven/McNEW LINE\nba}' file
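One more option that avoids repeating a pattern is the c (change) command applied to the range. As far as I understand it, with a two-address range c deletes each line in the range and writes the text once, at the end of the range. A sketch (change_range is a made-up name):

```shell
# Hypothetical helper: replace the whole /four/,/seven/ range with one line.
change_range() {
  sed '/four/,/seven/c\
NEW LINE' "$@"
}

printf '%s\n' one two three four five six seven eight nine ten | change_range
```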

Alternatives to grep/sed that treat new lines as just another character

Both grep and sed handle input line-by-line and, as far as I know, getting either of them to handle multiple lines isn't very straightforward. What I'm looking for is an alternative or alternatives to these two programs that treat newlines as just another character. Is there any tool that fits these criteria?
The tool you want is awk. It is record-oriented, not line-oriented, and you can specify your record-separator by setting the builtin variable RS. In particular, GNU awk lets you set RS to any regular expression, not just a single character.
Here is an example where awk uses one blank line to separate every record. If you show us what data you have, we can help you with it.
cat file
first line
second line
third line

fourth line
fifth line
sixth line

seventh line
eight line

more data
Running awk on this file rebuilds the data, using the blank lines as record separators:
awk -v RS= '{$1=$1}1' file
first line second line third line
fourth line fifth line sixth line
seventh line eight line
more data
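PS: In the command above, file is the input file name, not the value of RS. RS= sets RS to the empty string, the same as RS="", which makes awk treat blank lines as record separators.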
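To show that newlines inside a record really are just another character, here is a small sketch that matches a pattern spanning two lines. The sample text and the message are made up for illustration.

```shell
# With RS set to the empty string, each blank-line-separated block is one
# record, and the newlines inside a record are ordinary characters in $0.
hits=$(printf 'first line\nsecond line\n\nthird line\nfourth line\n' |
  awk -v RS= '/first line\nsecond/ { print "record " NR " spans the match" }')
printf '%s\n' "$hits"
```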
1) Sed can handle a block of lines together; it does not always have to work line by line.
In sed, normally I use :loop; $!{N; b loop}; to get all the lines available in pattern space delimited by newline.
Sample:
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
Result (removing the content between the double quotes):
sed -e ':loop; $!{N; b loop}; s/\"[^\"]*\"//g' thegeekstuff.txt
Productivity
Google Search\
Tips
You should read this article (Unix Sed Tutorial: 6 Examples for Sed Branching Operation); it explains in detail how this works.
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-6-examples-for-sed-branching-operation/
2) For grep, check whether your grep supports the -z option, which stops it from splitting the input into lines at newlines.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.

At what stage is sed's pattern space printed?

I have heard that for the pattern space, the maximum number of addresses is two.
And that sed goes through each line of the text file, and for each of them, runs through all the commands in the script expression or script file.
When does sed print the pattern space? Is it at the end of the text file, after it has done the last line? Or is it as the ending part of processing each line of the text file, just after it has run through all commands, it dumps the pattern space?
Can anybody demonstrate
a) the maximum limit of two for the pattern space?
b) when exactly the pattern space is printed? And, if you can, please provide a textual source that says so too.
And why is it that here in my attempt to see the size of the pattern space, it looks like it can fit a lot..
When this tutorial, says
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-7-examples-for-sed-hold-and-pattern-buffer-operations/
Sed G function
The G function appends the contents of the holding area to the contents of the pattern space. The former and new contents are separated by a newline. The maximum number of addresses is two.
An example of what I found about the size of the pattern space, trying unsuccessfully to see its limit of two..
abc.txt is a text file with just the character z
sed h;G;G;G;G;G;G;G;G abc.txt
prints many zs so I guess it can hold more than 2.
So I've misunderstood something.
An address is a way of selecting lines. Lines can be selected using zero, one or two addresses. This has nothing to do with the capacity of pattern space.
Consider the following input file:
aaa
bbb
ccc
ddd
eee
This sed command has zero addresses, so it processes every line:
s/./X/
Result:
Xaa
Xbb
Xcc
Xdd
Xee
This command has one address, it selects only the third line:
3s/./X/
Result:
aaa
bbb
Xcc
ddd
eee
An address of $ as in $s/./X/ would function the same way, but for the last line (regardless of the number of lines).
Here is a two-address command. In this case, it selects the lines based on their content. A single address command can do this, too.
/b/,/d/s/./X/
Result:
aaa
Xbb
Xcc
Xdd
eee
Pattern space is printed when an explicit p or P command runs and, unless the -n (suppress automatic printing) option is in effect, when the script finishes for the current line of the input file. That includes ending the processing of the file with the q command.
Here's a demonstration of sed printing each line immediately upon receiving and processing it:
for i in {1..3}; do echo aaa$i; sleep 2; done | sed 's/./X/'
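The different ways of printing can also be compared side by side. This is just an illustrative sketch; the variable names are made up.

```shell
input='aaa
bbb
ccc'

# No -n: the pattern space is printed automatically at the end of each cycle.
auto=$(printf '%s\n' "$input" | sed 's/./X/')

# Explicit p plus the automatic print: every line appears twice.
twice=$(printf '%s\n' "$input" | sed 'p' | wc -l | tr -d ' ')

# -n suppresses the automatic print, so only the explicit p on line 2 shows.
second=$(printf '%s\n' "$input" | sed -n '2p')

# q still triggers the automatic print of the current line before quitting.
upto=$(printf '%s\n' "$input" | sed '2q')

printf '%s\n' "$auto" "$twice" "$second" "$upto"
```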
The capacity of pattern space (and hold space) has to do with the number of characters it can hold (and is implementation dependent) rather than the number of input lines. The newlines separating those lines are simply another character in that total. The G command simply appends a copy of hold space onto the end of what's in pattern space. Multiple applications of the G command append that many copies.
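A quick way to check this is to count the output lines. A sketch, assuming a single input line containing z as in the question:

```shell
# One input line "z": h copies it to hold space, then each G appends
# a newline plus one more copy of hold space to pattern space.
lines=$(printf 'z\n' | sed 'h;G;G;G' | wc -l | tr -d ' ')
printf '%s\n' "$lines"   # 1 original line + 3 appended copies
```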
In the tutorial that you linked to, the statement "The maximum number of addresses is two." is somewhat ambiguous. It means that you can use zero, one or two addresses to select the lines that command applies to. As in the above examples, you could apply G to all lines, one line or a range of lines. Depending on the command, it accepts no address, at most one address, or at most two addresses. See man sed under the Synopsis section for subheadings that group the commands by the number of addresses they accept.
From info sed:
3.1 How `sed' Works
'sed' maintains two data buffers: the active pattern space, and the
auxiliary hold space. Both are initially empty.
'sed' operates by performing the following cycle on each lines of
input: first, 'sed' reads one line from the input stream, removes any
trailing newline, and places it in the pattern space. Then commands
are executed; each command can have an address associated to it:
addresses are a kind of condition code, and a command is only executed
if the condition is verified before the command is to be executed.
When the end of the script is reached, unless the '-n' option is in
use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.(1) Then the
next cycle starts for the next input line.
Unless special commands (like 'D') are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands 'h', 'H', 'x', 'g', 'G' to move
data between both buffers).

Use sed to delete a matched regexp and the line (or two) underneath it

OK I found this question:
How do I delete a matching line, the line above and the one below it, using sed?
and just spent the last hour trying to write something that will match a string and delete the line containing the string and the line beneath it (or a variant - delete 2 lines beneath it).
I feel I'm now typing random strings. Please somebody help me.
If I've understood that correctly, to delete match line and one line after
/matchstr/{N;d;}
Match line and two lines after
/matchstr/{N;N;d;}
N appends the next line to the pattern space
d deletes the pattern space, which by then holds the matched line and the line(s) after it
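One edge case worth knowing: if matchstr is on the last line, there is no next line for N to fetch. GNU sed then prints the pattern space and exits, so the matching line survives. Guarding N with $! avoids that. A sketch (drop_with_next is a made-up name, and the pattern passed in must not contain a slash):

```shell
# Hypothetical helper: delete lines matching regex $1 plus the line after each.
drop_with_next() {
  # $!N only fetches a next line when one exists; d then deletes whatever
  # is in pattern space (one or two lines).
  sed -e "/$1/{" -e '$!N' -e 'd' -e '}'
}

printf '%s\n' one two three four | drop_with_next two
printf '%s\n' one two | drop_with_next two
```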
You can also use awk. For example, search for the word "two" and skip that line plus the 2 lines after it:
$ cat file
one
two
three
four
five
six
seven
eight
$ awk -vnum=2 '/two/{for(i=0;i<=num;i++)getline}1' file
one
five
six
seven
eight
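A variant without getline may be easier to reason about near the end of the file: when getline runs out of input it leaves $0 unchanged, so the command above can print a line that was meant to be skipped. This sketch uses a countdown instead (skip_after and the variable names are made up for illustration):

```shell
# Hypothetical helper: drop lines matching regex $1 and the $2 lines after each.
skip_after() {
  # On a match, arm the counter for the matching line plus num more lines;
  # while the counter is positive, consume it and skip printing.
  awk -v pat="$1" -v num="$2" \
    '$0 ~ pat { left = num + 1 } left > 0 { left--; next } 1'
}

printf '%s\n' one two three four five six seven eight | skip_after two 2
```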