Sed - merge two lines based on a pattern in second line - sed

I'm working on a data dump from a phone and need to reformat a chat file that currently looks like this:
Person1 my son will be there shortly
5/3/2018 6:02:31 PM(UTC+0)
Person2 OK. Tell him to call when he's here
5/3/2018 6:03:33 PM(UTC+0)
Person1 Ok
5/3/2018 6:03:41 PM(UTC+0)
Person2 You forgot your charger
5/3/2018 8:43:20 PM(UTC+0)
I need to change to this (the chat and the timestamp are separated by a tab):
Person1 my son will be there shortly 5/3/2018 6:02:31 PM(UTC+0)
Person2 OK. Tell him to call when he's here 5/3/2018 6:03:33 PM(UTC+0)
Person1 Ok 5/3/2018 6:03:41 PM(UTC+0)
Person2 You forgot your charger 5/3/2018 8:43:20 PM(UTC+0)
I've been trying to merge a line that contains "UTC" with the PREVIOUS line, but so far the best I've gotten is:
sed -e :a -e '$!N;s/\n.*UTC/\t/;ta' -e 'P;D' temp.txt > temp2.txt
And the results are these:
Person1 my son will be there shortly +0)
Person2 OK. Tell him to call when he's here +0)
Person1 Ok 5/3/2018 +0)
Person2 You forgot your charger +0)
The reason I want to use "UTC" as the pattern is that there are other extraneous lines in the file that are NOT timestamps (e.g. multi-line chat entries, information about attachments, etc.). "UTC" is the only pattern that's unique to timestamps.

I'd do it like this:
$ sed 'N;/\n.*UTC/s/\n/\t/;P;D' infile
Person1 my son will be there shortly 5/3/2018 6:02:31 PM(UTC+0)
Person2 OK. Tell him to call when he's here 5/3/2018 6:03:33 PM(UTC+0)
Person1 Ok 5/3/2018 6:03:41 PM(UTC+0)
Person2 You forgot your charger 5/3/2018 8:43:20 PM(UTC+0)
N;P;D creates a moving two-line window; the command /\n.*UTC/s/\n/\t/ says "if the pattern space has UTC on the second line, substitute the newline with a tab".
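To check that the two-line window joins only timestamp lines, here is a small sketch run against a made-up sample in which one chat entry spans two lines. It assumes GNU sed (for \t in the replacement); the sample text and the names sample and merged are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample: the second chat entry spans two lines before its timestamp.
sample=$(mktemp)
printf '%s\n' \
  'Person1 my son will be there shortly' \
  '5/3/2018 6:02:31 PM(UTC+0)' \
  'Person2 OK. Tell him to call' \
  'when he is here' \
  '5/3/2018 6:03:33 PM(UTC+0)' > "$sample"

# N appends the next line; the newline becomes a tab only when the
# second line of the window contains UTC; P;D slide the window by one line.
merged=$(sed 'N;/\n.*UTC/s/\n/\t/;P;D' "$sample")
printf '%s\n' "$merged"
rm -f "$sample"
```

The continuation line "when he is here" is left on its own line and only it receives the timestamp, which is the behaviour the UTC test is meant to give for multi-line entries.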

If your sample is representative of a regularly structured file and you simply need to merge every other line with the previous line, the script can be simplified significantly.
I prefer Awk for legibility and maintainability:
awk 'NR%2 { printf "%s\t", $0; next } 1' file >newfile
In some more detail, NR is the current line number (or more properly record number; by default Awk splits records on newlines) and % is the mathematical modulo operator. The expression evaluates to non-zero (true) on odd-numbered lines, and so we print those with a tab instead of a newline. The next statement terminates the script for this input line and fetches the next line and starts over, roughly like the n/N command in sed. Finally, the lone 1 is true for every line which falls through to here (the even-numbered timestamp lines), causing it to be printed verbatim.
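As a quick check of the Awk version, the sketch below feeds it four lines taken from the question. The variable names chat and paired are only for illustration, and it assumes the input strictly alternates message and timestamp.

```shell
#!/bin/sh
# Four lines from the question: message, timestamp, message, timestamp.
chat=$(printf '%s\n' \
  'Person1 Ok' \
  '5/3/2018 6:03:41 PM(UTC+0)' \
  'Person2 You forgot your charger' \
  '5/3/2018 8:43:20 PM(UTC+0)')

# NR%2 is 1 on lines 1, 3, ...: print those followed by a tab, no newline.
# Lines 2, 4, ... fall through to the bare 1 and are printed with a newline.
paired=$(printf '%s\n' "$chat" | awk 'NR%2 { printf "%s\t", $0; next } 1')
printf '%s\n' "$paired"
```

If the file contains extra lines that break the alternation, the pairing drifts, so this form only suits regularly structured input.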

Can I avoid duplicate strings with the sed "a\" command?

Can I avoid duplicate strings with the sed "a" command?
I added the word "apple" under "true" in my file.txt.
The problem is that "apple" is appended again every time I run the command.
$ sed -i '/true/a\apple' file.txt ...executed 3 times
$ cat file.txt
true
apple
apple
apple
If the word "apple" already exists, I don't want repeated runs of the sed command to add it again.
I have no idea how to do this. Please help.
...
I want this result,
...after executing the sed command any number of times
$ cat file.txt
true
apple
It seems you don't want to append the line apple if the line following the true already contains apple. Then this sed command should do the trick.
sed -i.backup '
/true/!b
$!{N;/\napple$/!s/\n/&apple&/;p;d;}
a\
apple
' file.txt
Explanation of sed commands:
If the line doesn't contain true then jump to the end of the script, which will print out the line read (/true/!b).
Otherwise the line contains true:
If it isn't the last line ($!) then
• read the next line (N).
• If the next line doesn't consist of apple (/\napple$/!) then insert the apple between two lines (s/\n/&apple&/).
• Print out the pattern space (p) and start a new cycle (d)
Otherwise it is the last line (and contains true)
Append apple (a\ apple)
Edit:
The above sed script won't work properly if two consecutive true lines occur in the file, as pointed out by @potong. The version below should fix this, if I haven't overlooked something.
sed -i.backup ':a
/true/!b
a\
apple
n
/^apple$/d
ba
' file.txt
Explanation:
/true/!b: If the line doesn't contain true, no further processing is required. Jump to the end of the script. This will print the current pattern space.
a\ apple: Otherwise, the line contains true. Append apple.
n: Print the current pattern space and appended line (apple) and replace the pattern space with the next line. This will end the script if no next line available.
/^apple$/d: If the line read consists of string apple then delete it and start a new cycle (because it is already appended before)
ba: Jump to the start of the script (label a) without reading an input line.
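One way to convince yourself that the script is safe to repeat is to run it several times on a scratch file. The sketch below assumes GNU sed; the directory, file contents and variable names are made up for the test.

```shell
#!/bin/sh
# Scratch copy so the real file.txt is not touched (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 'true' 'other line' > "$workdir/file.txt"

# Same commands as the script above, kept unindented so that the
# appended text is exactly "apple".
script=':a
/true/!b
a\
apple
n
/^apple$/d
ba'

# Run three times; an existing apple line after true is deleted and re-appended.
for run in 1 2 3; do
  sed -i.backup "$script" "$workdir/file.txt"
done

final=$(cat "$workdir/file.txt")
printf '%s\n' "$final"
rm -rf "$workdir"
```

After three runs the file still holds a single apple line directly under true.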
There is no general solution for sed unless the file is sorted. If sorted, the following deletes the duplicate lines:
sed '$!N; /^\(.*\)\n\1$/!P; D'
This was taken from this link: https://www.unix.com/shell-programming-and-scripting/146404-command-remove-duplicate-lines-perl-sed-awk.html
Great answer by M. Nejat Aydin but to make things simpler just add grep:
grep -q apple file.txt || sed -i '/true/a\apple' file.txt
This might work for you (GNU sed):
sed -e ':a;/true/!b;$a apple' -e 'n;/apple/b;i apple' -e 'ba' file
If a line does not contain true just print it.
Otherwise, if it is the last line, append the line apple.
Otherwise, print that line and fetch the next.
If that line contains apple just print it.
Otherwise, insert a line apple and jump to the first sed instruction since the fetched line might be one containing true.
N.B. This uses both the a command (for end of file condition) and the i command for when there is a following line.

Perl System Grep and Paste

I have a file that looks like this:
Dog
Bulldog
Terrier

Cat
Persian

Ape
Gorilla

Dog
Pitbull
Lab
Shepard
Husky
I want to be able to search for each line containing dog and select everything until the next empty line and put it into a new file.
I want an output file like:
Dog
Bulldog
Terrier

Dog
Pitbull
Lab
Shepard
Husky
I know I can use grep to find the word dog but how can I use it, or with what can I use it, so that it grabs everything after it UNTIL the next empty line and moves it into another file.
I am writing a script in Perl to do this because there are other things I wish to add on that are made easier with Perl. I was going to use system(grep....) to find the word but I wasn't sure what to do after that.
I will also note that I want to be able to do this recursively. I have many files that look like what I had shown and I would like to extract the Dog block from all of them. So it would be something recursive from the directory.
perl -ne 'print if /^Dog/../^$/' file
The .. and ... operators in Perl join two conditionals into a range: the joined conditional becomes true when the first one evaluates true and stays true until the second one evaluates true. So you want to print from the time $_ =~ m/^Dog/ is true until $_ =~ m/^$/ is true (that is, until an empty line). The above is shorthand for that.
The distinction between .. vs ... is not important here because in this case, the conditionals cannot both be true on the same line.
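The same "from one match until another" idea also exists in sed as an address range. The sketch below is a sed analogue, not the Perl answer itself, run on a shortened made-up copy of the sample; the name picked is hypothetical.

```shell
#!/bin/sh
# Shortened sample: blocks are separated by empty lines.
picked=$(printf '%s\n' \
  'Dog' 'Bulldog' 'Terrier' '' \
  'Cat' 'Persian' '' \
  'Dog' 'Pitbull' |
  # Print from a line starting with Dog through the next empty line
  # (or through end of input if no empty line follows).
  sed -n '/^Dog/,/^$/p')
printf '%s\n' "$picked"
```

Like the Perl range, the closing empty line is part of the printed range, so the blocks stay separated in the output.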
If you can use awk, this is straightforward. Setting the record separator (RS) to the empty string puts awk in paragraph mode, where each blank-line-separated block is one record. Test whether the block starts with Dog and, if so, let the default action print the block.
awk '/^Dog/' ORS="\n\n" RS="" file
Dog
Bulldog
Terrier
Dog
Pitbull
Lab
Shepard
Husky
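The question also asks about doing this recursively, which the answers above do not cover. A minimal sketch, assuming the files end in .txt and that file names contain no newlines, combines find with the paragraph-mode awk command. The directory layout and all names here are invented for the example.

```shell
#!/bin/sh
# Hypothetical tree with two data files.
root=$(mktemp -d)
mkdir -p "$root/sub"
printf '%s\n' 'Dog' 'Bulldog' 'Terrier' '' 'Cat' 'Persian' > "$root/a.txt"
printf '%s\n' 'Ape' 'Gorilla' '' 'Dog' 'Pitbull' 'Lab' > "$root/sub/b.txt"

# Visit files in sorted order; RS="" makes each blank-line-separated
# block one record, and /^Dog/ keeps the blocks whose first line starts with Dog.
dog_blocks=$(find "$root" -type f -name '*.txt' | sort |
  while IFS= read -r f; do
    awk '/^Dog/' RS="" ORS="\n\n" "$f"
  done)
printf '%s\n' "$dog_blocks"
rm -rf "$root"
```

Redirecting the loop output to a file instead of capturing it in a variable gives the combined output file the question asks for.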

Append to non-empty line that doesn't start with whitespace AND is followed, two lines down, by a non-empty line that doesn't start with whitespace

I am converting several unruly, early 90's DOS-generated text files to something more usable. I need to append a set of characters to all of the non-empty lines in said text files that don't start with whitespace AND that are followed, two lines down, by another non-empty line that doesn't start with whitespace (I will refer to all single lines of text that meet these characteristics as "target" lines). BTW, irrelevant to the problem are the characteristics of the line directly below each of the target lines.
Of interest is the fact that all of the target lines in the above-mentioned text files end with the same character. Also, the command I'm looking for needs to slot into a rather long pipeline.
Suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
Although I'm looking for a sed solution, awk and perl are welcome as well. All solutions must be able to be used in a pipeline. Also welcomed are solutions which handle a more general case (e.g. able to append the desired text to target lines that end in various ways, including whitespace).
Now, for the backstory:
I recently asked a question similar to the subject question a few days ago (see here). As you can see, I got some great answers. It turned out, however, that I did not fully understand my problem, so I did not ask the correct question that would actually solve said problem.
Now, I'm asking the right question!
Based on what I learned by scrutinizing the answers to the question I linked to above, I've cobbled together the following sed command
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Ugly, yes, but it works for my humble purposes. Indeed, as my original intent with this question was to post a question, then self-answer same, you can see this sed construct posted below as one of the answers (posted by me).
I'm sure there are better ways to solve this particular problem, however...any ideas, anyone?
From your posted expected output it looks like you meant to say "is followed, two lines down, by a line that DOES NOT start with whitespace" instead of "is followed, two lines down, by a line that DOES start with whitespace".
This produces the output you show:
$ cat tst.awk
NR>2 { print p2 ((p2 ~ /^[^[:blank:]]/) && /^[^[:blank:]]/ ? "bar" : "") }
{ p2=p1; p1=$0 }
END { print p2 ORS p1 }
$ awk -f tst.awk file
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
It simply keeps a 2 line buffer and adds "bar" to the end of the line being printed given whatever condition you need. It will work on all POSIX awks and any others that support POSIX character classes (for the rest, change [[:blank:]] to [ \t]).
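Since the command has to slot into a pipeline, here is a sketch of the same two-line buffer reading from standard input. The suffix variable passed with -v is an addition for illustration, and the five-line input is made up rather than taken from the question.

```shell
#!/bin/sh
# Hypothetical input: line 2 is empty, line 4 starts with a space.
tagged=$(printf '%s\n' 'alpha' '' 'beta' ' indented' 'gamma' |
  awk -v suffix='bar' '
    # p2 is the line from two records ago; append the suffix when both it
    # and the current line start with a non-blank character.
    NR > 2 { print p2 ((p2 ~ /^[^[:blank:]]/ && /^[^[:blank:]]/) ? suffix : "") }
    { p2 = p1; p1 = $0 }
    # The last two lines have no line two below them, so print them as they are.
    END { print p2 ORS p1 }')
printf '%s\n' "$tagged"
```

Note that the END block always prints two lines, so input shorter than two lines would gain empty lines.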
You have over-analysed the problem so that your question now reads as a computer program, and you have got that program wrong. Requirements are best explained using examples and real data, so that we have some hope of rationalising the problem in our heads.
This Perl program alters your algorithm so the output matches your required output:
use strict;
use warnings 'all';
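chomp(my @data = <>);
my $i = 0;
for ( @data ) {
    # Append 'bar' when this line and the line two below both start with non-whitespace
    $_ .= 'bar' if /^\S/ and $data[$i+2] =~ /^\S/;
    ++$i;
    # Stop once there is no line two below the current one
    last if $i+2 > $#data;
}
print "$_\n" for @data;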
output
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
This sed one-liner seems to do the trick for the specific case outlined in the OP:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to the excellent clarifying information given by Benjamin W. in his answer to one of my recent questions, I was able to cobble together this one-liner that solved my specific problem. Please refer to same if you wish to gain insight into said command.

sed or awk deleting lines between pattern matches, excluding the second token's line

I have a sed command which will successfully print lines matching two patterns:
sed -n '/PAGE 2/,/\x0c/p' filename.txt
What I haven't figured out is how to make it print all the lines from the first token up to, but not including, the second token. The \x0c token is a record separator on a big flat file, and I need to keep THAT line intact.
In between the two tokens, the data is completely variable, and I do not have a reliable anchor to work with.
[CLARIFICATION]
Right now it prints all the lines between /PAGE 2/ and /\x0c/ inclusive. I want it to print /PAGE 2/ up until the next /\x0c/ in the record.
[test data] The /x0c will be at the start of the first line, and the beginning of the last line of this record.
I need to delete the first line of the record, through the line just before the beginning of the next record.
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 2
TERM: 200610 Student Billing Statement SUMDATA
99999
Foo bar R0000000
999 Geese Rural Drive DUE: 15-OCT-2012
Columbus, NE 90210
--------------------------------------------------------------------------------
Balance equal to or greater than $5000.00 $200.00
Billing inquiries may be directed to 444/555-1212 or by
email to bursar@foobar.edu. Financial Aid inquiries should
be directed to 444/555-1212 or finaid@foobar.edu.
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 1
[expected result]
^L20-SEP-2006 01:54:08 PM Foobars College PAGE 1
There will be multiple such records in the file. I can rely only on the /PAGE 2/ token, and the /x0c/ token.
[solution]:
Following Choruba's lead, I edited his command to:
sed '/PAGE [2-9]/,/\x0c/{/\x0c$/!d}'
The rule in the curly brackets was applying itself to any line containing a ^L and was selectively ignoring them.
EDIT: New answer for the new question the OP asked (how to delete records):
Given a file with control-Ls delimiting records and a desire to print specific lines from specific records, just set your record separator to control-L and your field separator to "\n" and print whatever you like. For example, to get the output the OP says he wants from the input he posted would just be:
awk -v RS='^L' -F'\n' 'NR==3{print $1}' file
^L shown here represents a literal control-L, and it's the 3rd record because there's an empty record before the first control-L in the input file.
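To try this without typing a literal control-L, the sketch below writes the form feed as \f, which printf and awk's -v assignment both understand. The two short records stand in for the real data and the name first_line is hypothetical.

```shell
#!/bin/sh
# Two made-up records, each introduced by a form feed (control-L).
first_line=$(printf '\fhdr PAGE 2\nbody\n\fhdr PAGE 1\n' |
  # Record 1 is the empty text before the first form feed,
  # record 2 is the PAGE 2 block, record 3 is the PAGE 1 block.
  awk -v RS='\f' -F'\n' 'NR==3{print $1}')
printf '%s\n' "$first_line"
```

Because the form feed is consumed as the record separator, the printed line no longer carries it; add it back in the print statement if the output must keep the separator.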
#
This is the answer to the original question the OP asked:
You want this:
awk '/PAGE 2/ {f=1} /\x0c/{f=0} f' file
but also try these to see the difference (for the future):
awk '/PAGE 2/ {f=1} f; /\x0c/{f=0}' file
awk 'f; /PAGE 2/ {f=1} /\x0c/{f=0}' file
And finally, FYI, The following idioms describe how to select a range of records given a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where appropriate as that's more expressive of what the variable actually IS.
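A short run on six made-up lines shows what some of these idioms select. MARK stands in for the pattern, N is 2, and the variable names are only for this example.

```shell
#!/bin/sh
lines=$(printf '%s\n' one two MARK three four five)

# b) all records after the pattern
after=$(printf '%s\n' "$lines" | awk 'f;/MARK/{f=1}')
# c) the 2nd record after the pattern
nth=$(printf '%s\n' "$lines" | awk 'c&&!--c;/MARK/{c=2}')
# e) the 2 records after the pattern
next_two=$(printf '%s\n' "$lines" | awk 'c&&c--;/MARK/{c=2}')
# g) the 2 records starting from the pattern
from_two=$(printf '%s\n' "$lines" | awk '/MARK/{c=2}c&&c--')

printf 'b: %s\nc: %s\ne: %s\ng: %s\n' \
  "$(echo $after)" "$nth" "$(echo $next_two)" "$(echo $from_two)"
```

The only difference between "after" and "from" is whether the test that prints comes before or after the rule that sets the flag or counter.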
Tell sed not to print the line containing the character:
sed -n '/PAGE 2/,/\x0c/{/\x0c/!p}' filename.txt
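A small sketch of what this command keeps, assuming GNU sed (for the \x0c escape) and a made-up record in which the header lines start with a form feed, as in the question's test data. The name kept is hypothetical.

```shell
#!/bin/sh
# Made-up record: the PAGE 2 header and the next header both start with \f.
kept=$(printf '\fhdr PAGE 2\nbody a\nbody b\n\fhdr PAGE 1\n' |
  # Inside the range, print only the lines that do not contain a form feed.
  sed -n '/PAGE 2/,/\x0c/{/\x0c/!p}')
printf '%s\n' "$kept"
```

With data shaped like this, the PAGE 2 header is dropped as well as the closing line, because it also contains the form feed; only the body lines are printed.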
I think this would do it:
awk '/PAGE 2/{a=1}/\x0c/{a=0}{if(a)print}'
In the pipeline below, the second sed deletes (d) the last line ($) of what the first sed printed:
sed -n '/^START$/,/^STOP$/p' in.txt | sed '$d'

How to remove top comment lines with sed?

I know how to delete all comments or delete a specified row in a file, but I don't know how to delete only the top comment lines that hold the author, copyright, etc. I can't just delete the first few lines because there might be source files that do not contain a top comment block at all. Any suggestions?
UPDATE: The source files are objective-c files, and all top comment lines start with //. Probably the rule "delete all comment lines until first non-comment line will be found then stop" best describes what I would like to achieve.
Second question: I'm also looking for a way to tell sed to loop recursively through all project directories and files.
UPDATE2. Here's an example of the comment block of objective-c file I'm trying to modify:
//
// MyProjectAppDelegate.m
// MyProject
//
// Created by Name Surname on 2011-09-20.
// Copyright 2012 JSC MyCompany. All rights reserved.
//
if awk is accepted as well, take a look at following:
awk 'BEGIN{f=1}f&&/^\/\//{next;}{f=0}1' yourFile
test
kent$ cat test.txt
//
// MyProjectAppDelegate.m
// MyProject
//
// Created by Name Surname on 2011-09-20.
// Copyright 2012 JSC MyCompany. All rights reserved.
//
real codes come here...
// comments in codes should be kept
foo bar foo bar
kent$ awk 'BEGIN{f=1}f&&/^\/\//{next;}{f=0}1' test.txt
real codes come here...
// comments in codes should be kept
foo bar foo bar
You could combine find and the awk one-liner to process files recursively.
sed can do that with the same idea too: you need to write a marker into the hold space and check the hold space for every line. sed has the advantage of the -i option, which allows you to change your original file(s). However, I feel awk is more straightforward for this job, and saving the changed text back to the original file is not hard either.
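A minimal sketch of the find plus awk combination, assuming the sources end in .m and that overwriting them in place is acceptable (try it on a copy first). The project tree, file names and variable names below are invented for the example.

```shell
#!/bin/sh
# Hypothetical project: one file with a top comment block, one without.
project=$(mktemp -d)
mkdir -p "$project/Classes"
printf '%s\n' '//' '// AppDelegate.m' '//' 'int main(void) {' '// keep me' '}' \
  > "$project/Classes/AppDelegate.m"
printf '%s\n' 'int x;' '// not a header' > "$project/NoHeader.m"

# For every .m file: strip leading // lines into a temp file, then copy the
# result back over the original so its permissions are preserved.
find "$project" -type f -name '*.m' -exec sh -c '
  for src in "$@"; do
    tmp=$(mktemp) || exit 1
    awk "BEGIN{f=1} f && /^\/\// {next} {f=0} 1" "$src" > "$tmp" && cat "$tmp" > "$src"
    rm -f "$tmp"
  done
' sh {} +

stripped=$(cat "$project/Classes/AppDelegate.m")
untouched=$(cat "$project/NoHeader.m")
printf '%s\n' "$stripped"
rm -rf "$project"
```

Comments further down a file survive because the flag f is cleared at the first line that does not start with //.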
Here's a sed solution that works with Kent's test.txt:
sed -n '/^ *\/\//! { :a;p;N;s/.*\n//;ba}'
Output:
real codes come here...
// comments in codes should be kept
foo bar foo bar
Explanation
sed -n '/^ *\/\//!{ # if line does not start with //
:a # label a
p # print line
N # append next line to PS
s/.*\n// # remove previous line
ba # branch to a
}' < test.txt
try it:
sed '0,\#^\([^/]\|/[^/]\)\|^$# d'