copying every nth line to a new line - sed

I have a txt file that I need to copy the 1st line of every four lines and print it onto the 3rd line of every four. And print this into a new txt file.
e.g
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+
;969999999-4;BB
change it to this:
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+CR5SM:00004:00029
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+CR5SM:00005:00026
;969999999-4;BB
I have tried using Awk but cant seem to find the correct commands to do this.
Does anyone have any solutions? Thanks

Using awk:
$ awk '/^#/{a=substr($0,2)}/^\+/{$0=$0 a}1' file
#CR5SM:00004:00029
TTTTCTCTTTCTTTCTT
+CR5SM:00004:00029
>>>/>#99419BAAABB
#CR5SM:00005:00026
ATTATAGAGGGATAG
+CR5SM:00005:00026
;969999999-4;BB
You can redirect the output to another file by saying:
awk '/^#/{a=substr($0,2)}/^\+/{$0=$0 a}1' file > newfile
We use the substr function to capture the lines that start with # from second character onwards until the end of the line.
We look for lines that start with + (notice we escape it since it is a meta-character). Once we find that line, we append our captured line to the existing line.
1 at the end allows us to print the lines.

Try:
awk '
(NR-1) % 4 == 0 { l=substr($0,2); print; next } # save every 4th line (print & continue)
(NR-1) % 4 == 2 { print $0 l; next } # append saved line to every 3rd line (print & continue)
{ print }' \ # all other lines: print as is
infile > outfile # specify input file and redirect output to output file

This might work for you (GNU sed):
sed 'h;n;n;G;s/\n.//;n' file
Copy the first line, print the first and second lines and append the first to the third removing the first character of the first, print it and the fourth line and repeat.

Related

sed/awk/perl remove the first two lines of a 3 line pattern

I have a huge text file. I need to replace all occurrences of this three line
pattern:
|pattern|some data|
|giberish|,,
|pattern|some other data|
by the last line of the pattern:
|pattern|some other data|
remove the first two lines of the pattern, keep only the last one.
The second line of the pattern ends with two commas and does not start with |pattern|
The first line of the pattern line starts with |pattern| and does not end with two commas.
The third line of the pattern line starts with |pattern| and does not end with two commas.
I tried this:
sed 'N;N;/^|pattern|.*\n.*,,\n|pattern|.*/I,+1 d' trial.txt
with no much luck
Edit: Here is a more substantial example
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
EOL
and it should become:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
#zdim:
the first three lines of the file:
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
satisfy the pattern. So they are replaced by
|pattern|sdk;sd|
so the top of the file now becomes:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
...
the first three lines of which are:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
which satisfy the pattern, so they are replaced by:
|pattern|aslkaa|
so the top of the file now is:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
....
#JosephQuinsey:
consider this file:
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|blabla|
|||4|||-0.97|0|1429037262.8271||20160229||1025||1000.0|0.01|,,
|pattern|blable|
|||5|||-1.27|0|1429037262.854||20160229||1025||1000.0|0.01|,,
|pattern|blasbla|
|||493|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,,
|||11|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,|T|347||1429043438.1962|-0.22|5|0||-0.22|1429043438.1962|,|Q|346||1429043437.713|-0.24|26|-0.22|5|||1429043437.713|
|pattern|jksds|
|||232|||-5.66|0|1429037262.817||20150415||1025||1000.0|0.01|,,
|pattern|bdjkds|
|||123q|||-7.15|0|1429037262.8271||20150415||1025||1000.0|0.01|,,
|pattern|blabla|
|||239ps|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,,
|||-92opa|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|1||1428969600.5019|-0.99|1|11||||,
|||kj2w|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|2||1428969600.5019|-1|1|11||||,
|||0293|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|3||1428969600.5019|-1.01|1|11||||,
|||2;;w32|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|4||1428969600.5019|-1.11|1|11||||,
EOL
Here is a simple take on it, using a buffer to collect and manage the pattern-lines
use warnings;
use strict;
use feature 'say';
my $file = shift or die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my #buf;
while (<$fh>) {
chomp;
if (/^\|pattern\|/ and not /,,$/) {
#buf = $_; # start the buffer (first line) or overwrite (third)
}
elsif (/,,$/ and not /^\|pattern\|/) {
if (#buf) { push #buf, $_ } # add to buffer with first line in it
else { say } # not part of 3-line-pattern; print
}
else {
say for #buf; # time to print out buffer
#buf = (); # ... empty it ...
say # and print the current line
}
}
This prints the expected output.
Explanation.
Pattern-lines go in a buffer, and when we get the "third line" the first two need be removed. Then "assign" to the array whenever we see ^|pattern| -- either to start the buffer if it's the first line or to re-initialize the array (removing what's in it) if it's the third line
A line ending with ,, is added to the buffer, if there is a line there already. Nothing prohibits lines ending with ,, just so -- they may be outside of a pattern; in that case just print it
So each |pattern| line sets the buffer straight -- either starts it or resets it. Thus once we run into a line with neither ^|pattern| nor ,,$ we can print out our buffer, and that line
Please test more comprehensively, what i still didn't get to do.
In order to run this either in a pipeline or on a file use the "magical" <> filehandle. So it becomes
use warnings;
use strict;
use feature 'say';
my #buf;
while (<>) { # reads lines from files given on command line, or from STDIN
...
}
Now you can run it either as data | script.pl or as script.pl datafile. (Make the script executable for this, or use as perl script.pl.)
The script's output goes to STDOUT which can be piped into other programs or redirected to a file.
It may depend on how your file is huge but if it is smaller than the allowed memory size, how about:
perl -0777 -pe '
1 while s/^\|pattern\|.+?\|\n(?<!\|pattern\|).+?,,\n(\|pattern\|.+?\|)$/\1/m;
' trial.txt
Output:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
An awk solution:
awk -v pa=pattern '
$0 ~ pa {
do {
hold=$0;
getline;
hold=hold "\n" $0;
getline;
} while(match($0, pa));
print hold
}
1' trial.txt
The idea is to buffer the line that matched the pattern, then the line after. If the next line also matches the pattern, loop, this time buffer the most recent matching line and and the one following it. This has the effect of removing the lines that need to be replaced.
When the loop stops, the first line the buffer contains is either the line to replace the removed lines or simply a first pattern match that is not to be removed. Either way the contents of the buffer get printed.
The final 1 statement is needed to print the line that ended the while loop and all other lines that aren't the first or second after one matching the pattern.
Updated answer: The following sed solution should work:
sed '/\n/!N;/\n.*\n/!N;/^|pattern|.*\n.*,,\n|pattern|/!{P;D;};s/[^\n]*\n//;D;'
Explanation:
/\n/!N if the P-space has only one line, read the next
/\n.*\n/!N if the P-space has only two lines, read in a third
/^|pattern|.*\n.*,,\n|pattern|/ test if the first and third lines start with |pattern|, and the middle line ends with two commas
!{P;D;} if the match fails, then print the first line and start over
s/[^\n]*\n//;D; otherwise, when the match succeeds, delete the first two lines, and start over.
This might work for you (GNU sed):
sed ':a;N;s/[^\n]*/&/3;Ta;/^|pattern|.*\n.*,,\n|pattern|/{/,,\n.*\n\|,,$/!{s/.*\n//;ba}};P;D' file
Populate the pattern space with the next three lines of the file. If the first pattern matches the current three lines and neither the first or the third line ends with ,,, then delete the first two lines and repeat. Otherwise print and delete the first line of the three line window and repeat.

Join two specific lines with sed

I'm trying to manipulate a dataset with sed so I can do it in a batch because the datasets have the same structure.
I've a dataset with two rows (first line in this example is the 7th row) like this:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-
"",,8-9012
What I want:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-8-9012
So the numbers (8-9012) in the second line have been added at the end of the first line because those numbers belong to each other
What I've tried:
sed '8s/7s/' file.csv
But that one does not work and I think that one will just replace whole row 7. The 8-9012 part is on row 8 of the file and I want that part added to row 7. Any ideas and is this possible?
Note: In the question's current form, a sed solution is feasible - this was not the case originally, where the last ;-separated field of the joined lines needed transforming as a whole, which prompted the awk solution below.
Joining lines 7 and 8 as-is, merely by removing the line break between them, can be achieved with this simple sed command:
sed '7 { N; s/\n//; }' file.csv
awk solution:
awk '
BEGIN { FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv
Judging by the OP's comments, an additional problem is the presence of CRLF line endings in the input. With GNU Awk or Mawk, adding RS = "\r\n" to the BEGIN block is sufficient to deal with this (or RS = ORS = "\r\n", if the output should have CRLF line endings too), but with BSD Awk, which only supports single-character input record separators, more work is needed.
BEGIN { FS = OFS = ";" } tells Awk to split the input lines into fields by ; and to also use ; on output (when rebuilding the line).
Pattern NR==7 matches input line 7, and executes the associated action ({...}) with it.
r = $0; getline stores line 7 ($0 contains the input line at hand) in variable r, then reads the next line (getline), at which point $0 contains line 8.
sub(/^"",,/, "") then removes substring "",, from the start of line 8, leaving just 8-9012.
$0 = r $0 joins line 7 and modified line 8, and by assigning the concatenation back to $0, the string assigned is split into fields by ; anew, and the resulting fields are joined to form the new $0, separated by OFS, the output field separator.
Pattern 1 is a common shorthand that simply prints the (possibly modified) record at hand.
With sed:
sed '/^[^"]/{N;s/\n.*,//;}' file
/^[^"]/: search for lines not starting with ", and if found:
N: next line is appended to the pattern space
s/\n.*,//: all characters up to last , are removed from second line

Using sed to remove embedded newlines

What is a sed script that will remove the "\n" character but only if it is inside "" characters (delimited string), not the \n that is actually at the end of the (virtual) line?
For example, I want to turn this file
"lalala","lalalslalsa"
"lalalala","lkjasjdf
asdfasfd"
"lalala","dasdf"
(line 2 has an embedded \n ) into this one
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
(Line 2 and 3 are now joined, and the real line feed was replaced with the character string \\n (or any other easy to spot character string, I'm not picky))
I don't just want to remove every other newline as a previous question asked, nor do I want to remove ALL newlines, just those that are inside quotes. I'm not wedded to sed, if awk would work, that's fine too.
The file being operated on is too large to fit in memory all at once.
sed is an excellent tool for simple substitutions on a single line but for anything else you should use awk., e.g:
$ cat tst.awk
{
if (/"$/) {
print prev $0
prev = ""
}
else {
prev = prev $0 " \\\\n "
}
}
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
Below was my original answer but after seeing #NeronLeVelu's approach of just testing for a quote at the end of the line I realized I was doing this in a much too complicated way. You could just replace gsub(/"/,"&") % 2 below with /"$/ and it'd work the same but the above code is a simpler implementation of the same functionality and will now handle embedded escaped double quotes as long as they aren't at the end of a line.
$ cat tst.awk
{ $0 = saved $0; saved="" }
gsub(/"/,"&") % 2 { saved = $0 " \\\\n "; next }
{ print }
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
The above only stores 1 output line in memory at a time. It just keeps building up an output line from input lines while the number of double quotes in that output line is an odd number, then prints the output line when it eventually contains an even number of double quotes.
It will fail if you can have double quotes inside your quoted strings escaped as \", not "", but you don't show that in your posted sample input so hopefully you don't have that situation. If you have that situation you need to write/use a real CSV parser.
sed -n ':load
/"$/ !{N
b load
}
:cycle
s/^\(\([^"]*"[^"]*"\)*\)\([^"]*"[^"]*\)\n/\1\3 \\\\n /
t cycle
p' YourFile
load the lines in working buffer until a close line (ending with ") is found or end reach
replace any \n that is after any couple of open/close " followed by a single " with any other caracter that " between from the start of file by the escapped version of new line (in fact replace starting string + \n by starting string and escaped new line)
if any substitution occur, retry another one (:cycle and t cycle)
print the result
continue until end of file
thanks to #Ed Morton for remark about escaped new line

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like all newlines starting with "_" to be placed as a second column to the line before
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed the loop is done creating a label (:a), and while not match last line ($!) append next one (N) and return to label a:
:a
$! {
N
b a
}
After this we have the whole file into memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

comparing lines with awk vs while read line

I have two files one with 17k lines and another one with 4k lines. I wanted to compare position 115 to position 125 with each line in the second file and if there is a match, write the entire line from the first file into a new file. I had come up with a solution where i read the file using 'cat $filename | while read LINE'. but it's taking around 8 mins to complete. is there any other way like using 'awk' to reduce this process time.
my code
cat $filename | while read LINE
do
#read 115 to 125 and then remove trailing spaces and leading zeroes
vid=`echo "$LINE" | cut -c 115-125 | sed 's,^ *,,; s, *$,,' | sed 's/^[0]*//'`
exist=0
#match vid with entire line in id.txt
exist=`grep -x "$vid" $file_dir/id.txt | wc -l`
if [[ $exist -gt 0 ]]; then
echo "$LINE" >> $dest_dir/id.txt
fi
done
How is this:
FNR==NR { # FNR == NR is only true in the first file
s = substr($0,115,10) # Store the section of the line interested in
sub(/^\s*/,"",s) # Remove any leading whitespace
sub(/\s*$/,"",s) # Remove any trailing whitespace
lines[s]=$0 # Create array of lines
next # Get next line in first file
}
{ # Now in second file
for(i in lines) # For each line in the array
if (i~$0) { # If matches the current line in second file
print lines[i] # Print the matching line from file1
next # Get next line in second file
}
}
Save it to a script script.awk and run like:
$ awk -f script.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"
This will still be slow because for each line in second file you need to look at ~50% of the unique lines in first (assuming most line do in fact match). This can be significantly improved if you can confirmed that the lines in the second file are full line matches against the substrings.
For full line matches this should be faster:
FNR==NR { # FNR == NR is only true in the first file
s = substr($0,115,10) # Store the section of the line interested in
sub(/^\s*/,"",s) # Remove any leading whitespace
sub(/\s*$/,"",s) # Remove any trailing whitespace
lines[s]=$0 # Create array of lines
next # Get next line in first file
}
($0 in lines) { # Now in second file
print lines[$0] # Print the matching line from file1
}