Conditional substitution of patterns in bash strings depending on the beginning of a string - sed

I am new to bash, so excuse me if I do not use the right terms.
I need to substitute certain patterns of six characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.
This is an example of input:
chr1:123-123 5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3
chr1:456-456 5TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG3
chr1:789-789 5GGGCTAGGGTTAGGGTTAGGGTTA3
chr1:123-123 etc. is the name of each string; it is separated by a tab from the string I need to work with. That string is delimited by the characters 5 and 3, but I can change them.
I want every pattern containing T, A, G in any one of these orders to be substituted with X: TTAGGG, TAGGGT, AGGGTT, GGGTTA, GGTTAG, GTTAGG.
Similarly, patterns containing CTAGGG, as in row 3, in the corresponding rotated orders should be substituted with a different character.
The same procedure is repeated, with some specific differences, for each of the 6 characters composing the pattern.
I started writing something like this:
#!/bin/bash
NORMAL=`echo "\033[m"`
RED=`echo "\033[31m"` #red
#read filename for the input file and create a copy and a folder for the output
read -p "Insert name for INPUT file: " INPUT
echo "Creating OUTPUT file " "${RED}"$INPUT"_sub.txt${NORMAL}"
mkdir -p ./"$INPUT"_OUTPUT
cp $INPUT.txt ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
echo
#start the first set of instructions
perfrep
#starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism
Instructions are
perfrep() {
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGGT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGGTT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGTTAG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
# starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism(){
sed -i -e 's/[GCA]TAGGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/G[GCA]TAGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GG[GCA]TAG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGG[GCA]TA/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGG[GCA]T/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGG[GCA]/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
I will need to repeat also with T[GCA]AGGG, TT[TCG]GGG, TTA[ACT]GG, TTAG[ACT]G and TTAGG[ACT].
Using this procedure, I get these results for the inputs shown:
5GGGXXXXTTA3
5XXXXX3
5GGGLXXTTA3
From my point of view, for my purposes, the first and second strings both consist of X repeated five times; only the order of the characters is slightly different. The third one, on the other hand, could be masked like this:
5LXXX3
How do I tell the script that, if the string starts with 5GGGTTA instead of 5TTAGGG, it must start substituting with
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
instead of
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
?
I will need to repeat with all cases; for instance, if the string starts with GTTAGG I will need to start with
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
and so on, and add a couple of variations of my pattern.
I need to repeat the substitution with TTAGGG and the variations for all the rows of my input file.
Sorry for the very long question. Thank you all.
Adding information asked by Varun.
Patterns of 6 characters would be TTAGGG , [GCA]TAGGG , T[GCA]AGGG , TT[TCG]GGG , TTA[ACT]GG , TTAG[ACT]G , TTAGG[ACT].
Each one must be checked in every frame; for instance, for TTAGGG we have 6 frames: TTAGGG , GTTAGG , GGTTAG, GGGTTA , AGGGTT , TAGGGT.
The same frames must be applied to the pattern containing a variable position.
I will have a total of 42 patterns to check, divided in 7 groups: one containing TTAGGG and derivative frames, 6 with the patterns with a variable position and their derivatives.
TTAGGG and derivatives are the most important and need to be checked first.

#! /usr/bin/awk -f
# generate a "frame" by moving the first char to the end
function rotate(base){ return substr(base,2) substr(base,1,1) }
# Unfortunately awk arrays do not store regexps
# so I am generating the list of derivative strings to match
function generate_derivative(frame,arr, i,j,k,head,read,tail) {
# the caller has already stored the frame itself; only its derivatives are added here
for(j=1; j<=length(frame); j++) {
head=substr(frame,1,j-1);
read=substr(frame,j,1);
tail=substr(frame,j+1);
for( k=1; k<=3; k++) {
# use a global index to simplify
arr[++Z]= head substr(snp[read],k,1) tail
}
}
}
BEGIN{
FS=OFS="\t"; # awk variables are case-sensitive; setting OFS keeps the output tab-separated
# alternatives to a base
snp["A"]="TCG"; snp["T"]="ACG"; snp["G"]="ATC"; snp["C"]="ATG";
# the primary target
frame="TTAGGG";
Z=1; # warning GLOBAL
X[Z] = frame;
# primary derivatives
generate_derivative(frame, X);
xn = Z;
# secondary shifted targets and their derivatives
for(i=1; i<length(frame); i++){
frame = rotate(frame);
L[++Z] = frame;
generate_derivative(frame, L);
}
}
/^chr[0-9:-]*\t5[ACTG]*3$/ {
# because we care about the order of the primary matches
for (i=1; i<=xn; i++) {gsub(X[i],"X",$2)}
# since we don't care about the order of the secondary matches
for (hit in L) {gsub(L[hit],"L",$2)}
print
}
END{
# print the matches in the order they are generated
#for (i=1; i<=xn; i++) {print X[i]};
#print ""
#for (i=1+xn; i<=Z; i++) {print L[i]};
}
If (and only if) you can settle on a static matching order you can live with, then something like the above Awk script could work. But you say the primary patterns should take precedence and also that a secondary rule would be better applied first in some cases; a fixed order cannot do both.
If you need more flexible matching, I would suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars".
But then you are not in a bash shell anymore.
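If the sequences can be assumed to stay in phase (only substitutions, no insertions or deletions, so every repeat unit occupies exactly six characters), a different technique from the static pattern list above may be enough: pick the frame from the first six characters of each sequence, then walk the sequence in steps of six. The following is only a sketch under that assumption. The names mask_repeats and dist, and the use of a single letter L for every one-mismatch unit, are illustrative choices and not part of the question.

```shell
# Hypothetical helper: mask in-phase TTAGGG repeats, frame chosen per line.
# Assumes tab-separated input and sequences delimited by 5 and 3.
mask_repeats() {
  awk '
  # move the first character to the end (next frame)
  function rotate(s) { return substr(s, 2) substr(s, 1, 1) }
  # number of differing positions; 99 if the lengths differ
  function dist(a, b,    i, d) {
    if (length(a) != length(b)) return 99
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    return d
  }
  BEGIN { FS = OFS = "\t"; motif = "TTAGGG" }
  {
    seq = substr($2, 2, length($2) - 2)      # strip the 5 and 3 delimiters
    # choose the rotation closest to the first six characters (0 or 1 mismatch)
    frame = ""; best = 2; f = motif
    for (r = 0; r < 6; r++) {
      d = dist(substr(seq, 1, 6), f)
      if (d < best) { best = d; frame = f }
      f = rotate(f)
    }
    if (frame == "") { print; next }         # no usable frame: leave the line alone
    out = ""
    for (p = 1; p <= length(seq); p += 6) {
      chunk = substr(seq, p, 6)
      d = dist(chunk, frame)
      out = out (d == 0 ? "X" : (d == 1 ? "L" : chunk))
    }
    $2 = "5" out "3"
    print
  }' "$@"
}
```

On the three sample rows this gives 5XXXXX3, 5XXXXX3 and 5LXXX3. Chunks that differ from the chosen frame in more than one position, and a trailing partial chunk, are left unmasked.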

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, using gensub, please try the following. It is based entirely on your shown samples, where the goal is to remove every : from the 3rd occurrence onwards. gensub splits the line into the part up to the 2nd colon and the rest, and all colons are then removed from the rest.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
See the online demo and the regex demo.
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (see the trailing if /^:account_id:/ condition).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
See this online demo. Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.
I would use GNU AWK in the following way if n is fixed and equal to 2. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator. This by itself removes all :, so I add back the ones that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested it solely on this data, so if you want to use it, you should first test it with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from third occurrence of the copy to the third occurrence of the amended line.
N.B. A newline is the best delimiter to use in the case of sed, as the lines presented to sed's commands are initially devoid of newlines. However, the important property of the delimiter is that it is unique, so it can be any character as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
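The loop above hard-codes the number of colons to keep. A minimal sketch of a parameterized wrapper follows. The function name strip_after_nth is made up for illustration, and it assumes a sed that accepts -E and semicolon-separated labels, as GNU sed does.

```shell
# Hypothetical wrapper: keep the first N colons on each line, delete the rest.
# Usage: strip_after_nth N [file...]
strip_after_nth() {
  n=$1
  shift
  # Each pass removes the first colon found after the Nth one;
  # "ta" repeats the substitution until it no longer matches.
  sed -E ":a;s/^(([^:]*:){$n}[^:]*):/\\1/;ta" "$@"
}
```

For the sample data, strip_after_nth 2 file keeps the two colons around account_id and removes the others.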
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

sed editing multiple lines

Sed editing is always a new challenge to me when it comes to multiple line editing. In this case I have the following pattern:
RECORD 4,4 ,5,48 ,7,310 ,10,214608 ,12,199.2 ,13,-19.2 ,15,-83 ,17,35 \
,18,0.8 ,21,35 ,22,31.7 ,23,150 ,24,0.8 ,25,150 ,26,0.8 ,28,25 ,29,6 \
,30,1200 ,31,1 ,32,0.2 ,33,15 ,36,0.4 ,37,1 ,39,1.1 ,41,4 ,80,2 \
,82,1000 ,84,1 ,85,1
which I want to convert into:
#RECORD 4,4 ,5,48 ,7,310 ,10,214608 ,12,199.2 ,13,-19.2 ,15,-83 ,17,35 \
# ,18,0.8 ,21,35 ,22,31.7 ,23,150 ,24,0.8 ,25,150 ,26,0.8 ,28,25 ,29,6\
# ,30,1200 ,31,1 ,32,0.2 ,33,15 ,36,0.4 ,37,1 ,39,1.1 ,41,4 ,80,2 \
# ,82,1000 ,84,1 ,85,1
Besides this, I would like to preserve the entirety of these 4 lines (there may be more or fewer than 4; the number is unpredictable, as they appear in the input) as one long line without the backslashes or line wraps.
Two tasks in one so to say.
sed is mandatory.
It's not terribly clear how you recognize the blocks you want to comment out, so I'll use blocks from a line that starts with RECORD and process as long as there are backslashes at the end (if your requirements differ, the patterns used will need to be amended accordingly).
For that, you could use
sed '/^RECORD/ { :a /\\$/ { N; ba }; s/[[:space:]]*\\\n[[:space:]]*/ /g; s/^/#/ }' filename
This works as follows:
/^RECORD/ { # if you find a line that starts with
# RECORD:
:a # jump label for looping
/\\$/ { # while there's a backslash at the end
# of the pattern space
N # fetch the next line
ba # loop.
}
# After you got the whole block:
s/[[:space:]]*\\\n[[:space:]]*/ /g # remove backslashes, newlines, spaces
# at the end, beginning of lines
s/^/#/ # and put a comment sign at the
# beginning.
}
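To try the join-and-comment script on a small input, it can be wrapped in a function with one command per line, which avoids depending on how a particular sed parses labels and braces on a single line. This is only a sketch: the name comment_records is hypothetical, and the block is still assumed to start at a line beginning with RECORD.

```shell
# Hypothetical wrapper around the script explained above.
comment_records() {
  sed '/^RECORD/ {
    :a
    /\\$/ {
      N
      ba
    }
    s/[[:space:]]*\\\n[[:space:]]*/ /g
    s/^/#/
  }' "$@"
}
```

Lines outside a RECORD block pass through unchanged; each block comes out as a single line starting with #.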
Addendum: To keep the line structure intact, instead use
sed '/^RECORD/ { :a /\\$/ { N; ba }; s/\(^\|\n\)/&#/g }' filename
This works pretty much the same way, except the newline-removal is removed, and the comment signs are inserted after every line break (and once at the beginning).
Addendum 2: To just put RECORD blocks onto a single line:
sed '/^RECORD/ { :a /\\$/ { N; ba }; s/[[:space:]]*\\\n[[:space:]]*/ /g }' filename
This is just the first script with the s/^/#/ bit removed.
Addendum 3: To isolate RECORD blocks while putting them onto a single line at the same time,
sed -n '/^RECORD/ { :a /\\$/ { N; ba }; s/[[:space:]]*\\\n[[:space:]]*/ /g; p }' filename
The -n flag suppresses the normal default printing action, and the p command replaces it for those lines that we want printed.
To write those records out to a file while commenting them out in the normal output at the same time,
sed -e '/^RECORD/ { :a /\\$/ { N; ba }; h; s/[[:space:]]*\\\n[[:space:]]*/ /g; w saved_records.txt' -e 'x; s/\(^\|\n\)/&#/g }' foo.txt
There's actually new stuff in this. Briefly annotated:
#!/bin/sed -f
/^RECORD/ {
:a
/\\$/ {
N
ba
}
# after assembling the lines, copy them to the hold buffer
h
# put everything on a line
s/[[:space:]]*\\\n[[:space:]]*/ /g
# write that to saved_records.txt (the file name runs to the end of the line)
w saved_records.txt
# swap the original lines back
x
# and insert comment signs
s/\(^\|\n\)/&#/g
}
When specifying this code directly on the command line, it is necessary to split it into several -e options because the w command is not terminated by ;.
This problem does not arise when putting the code into a file of its own (say foo.sed) and running sed -f foo.sed filename instead. Or, for the advanced, putting a #!/bin/sed -f shebang on top of the file, chmod +xing it and just calling ./foo.sed filename.
Lastly, to edit the input file in-place and print the records to stdout, this could be amended as follows:
sed -i -e '/^RECORD/ { :a /\\$/ { N; ba }; h; s/[[:space:]]*\\\n[[:space:]]*/ /g; w /dev/stdout' -e 'x; s/\(^\|\n\)/&#/g }' filename
The new things here are the -i flag for inplace editing of the file, and to have /dev/stdout as target for the w command.
sed '/^RECORD.*\\$/,/[^\\]$/ s/^/#/
s/^RECORD.*/#&/' YourFile
After several remarks from @Wintermute and more information from the OP.
Assuming:
a line with RECORD at the start is the trigger to modify the following lines
the structure is always the same (no line ending in \ directly followed by a RECORD line or by empty lines)
Explanation:
take each block starting at a RECORD line that ends with \ and running to the first line that does not end with \
add # in front of each line
then take each line (that is, after any modification by the earlier block, which leaves only RECORD lines without a trailing \ and lines without RECORD) and add a # at the start if it starts with RECORD

find the line number where a specific word appears with “sed” on tcl shell

I need to search for a specific word in a file starting from specific line and return the line numbers only for the matched lines.
Let's say I want to search a file called myfile for the word my_word and then store the returned line numbers.
In a shell script, the command
sed -n '10,$ { /$my_word /= }' $myfile
works fine, but how do I write that command in the Tcl shell?
% exec sed -n '10,$ { /$my_word/= }' $file
extra characters after close-brace.
I want to add that the following command works fine on tcl shell but it starts from the beginning of the file
% exec sed -n "/$my_word/=" $file
447431
447445
448434
448696
448711
448759
450979
451006
451119
451209
451245
452936
454408
I have solved the problem as follows
set lineno 10
if { ! [catch {exec sed -n "/$new_token/=" $file} lineFound] && [string length $lineFound] > 0 } {
set lineNumbers [split $lineFound "\n"]
foreach num $lineNumbers {
if {[expr {$num >= $lineno}] } {
lappend col $num
}
}
}
I still can't find a single-line command that solves the problem.
Any suggestions?
There is one thing I don't understand: is the text you are looking for stored in the variable called my_word, or is it the literal value my_word?
In your line
% exec sed -n '10,$ { /$my_word/= }' $file
I'd say it's the first case. So you have before it something like
% set my_word wordtosearch
% set file filetosearchin
Your mistake is to use the single quote character ' to enclose the sed expression. That character is an enclosing operator in sh, but has no meaning in Tcl.
You use it in sh to group many words in a single argument that is passed to sed, so you have to do the same, but using Tcl syntax:
% set my_word wordtosearch
% set file filetosearchin
% exec sed -n "10,$ { /$my_word/= }" $file
Here, you use the "..." to group.
You don't escape the $ in $my_word because you want $my_word to be substituted with the string wordtosearch.
I hope this helps.
After a few trial-and-error I came up with:
set output [exec sed -n "10,\$ \{ /$myword/= \}" $myfile]
# Do something with the output
puts $output
The key is to escape characters that are special to Tcl, such as the dollar sign and curly braces.
Update
Per Donal Fellows, we do not need to escape the dollar sign:
set output [exec sed -n "10,$ \{ /$myword/= \}" $myfile]
I have tried the new revision and found it works. Thank you, Donal.
Update 2
I finally gained access to a Windows 7 machine, installed Cygwin (which includes sed and tclsh). I tried out the above script and it works just fine. I don't know what your problem is. Interestingly, the same script failed on my Mac OS X system with the following error:
sed: 1: "10,$ { /ipsum/= }": extra characters at the end of = command
while executing
"exec sed -n "10,$ \{ /$myword/= \}" $myfile"
invoked from within
"set output [exec sed -n "10,$ \{ /$myword/= \}" $myfile]"
(file "sed.tcl" line 6)
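I guess there is a difference between GNU sed (Linux, Cygwin) and BSD sed (Mac OS X): the error message suggests that BSD sed wants a semicolon or a newline after the = command, before the closing brace.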
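One way to sidestep the difference is to put the closing brace on its own line, which both sed families are generally expected to accept. The sketch below shows this from a plain shell rather than from Tcl. The function name lines_matching_from is hypothetical, and the pattern is assumed to be a basic regular expression that contains no slash.

```shell
# Hypothetical helper: print the numbers of the lines at or after line $1
# that match the basic regex $2 in file $3.
lines_matching_from() {
  # \$ keeps the shell from expanding the sed address "$" (last line);
  # the = command and the closing brace are each on a line of their own.
  sed -n "$1,\$ {
/$2/=
}" "$3"
}
```

The same three-line sed script could be passed as one argument to exec from Tcl, with the braces escaped as in the commands above.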
Update 3
I have tried the same script under Linux/Tcl 8.4 and it works. That might mean Tcl 8.4 has nothing to do with it. Here is something else that might help: Tcl comes with a package called fileutil, which is part of the tcllib. The fileutil package contains a useful tool for this case: fileutil::grep. Here is a sample on how to use it in your case:
package require fileutil
proc grep_demo {myword myfile} {
foreach line [fileutil::grep $myword $myfile] {
# Each line is in the format:
# filename:linenumber:text
set lineNumber [lindex [split $line :] 1]
if {$lineNumber >= 10} { puts $lineNumber}
}
}
grep_demo $myword $myfile
Here is how to do it with awk
awk 'NR>=10 && $0~f {print NR}' f="$my_word" "$myfile"
This searches every line from line 10 onward for the regular expression stored in the variable $my_word, in the file whose name is stored in the variable myfile, and prints the matching line numbers.

Search for a particular multiline pattern using awk and sed

I want to read from the file /etc/lvm/lvm.conf and check for the below pattern that could span across multiple lines.
tags {
hosttags = 1
}
There could be as many white spaces between tags and {, { and hosttags and so forth. Also { could follow tags on the next line instead of being on the same line with it.
I'm planning to use awk and sed to do this.
While reading the file lvm.conf, it should skip empty lines and comments.
I'm doing that with:
data=$(awk < cat `cat /etc/lvm/lvm.conf`
/^#/ { next }
/^[[:space:]]*#/ { next }
/^[[:space:]]*$/ { next }
.
.
How can I use sed to find the pattern I described above?
Are you looking for something like this
sed -n '/{/,/}/p' input
i.e. print lines between tokens (inclusive)?
To delete lines containing # and empty lines or lines containing only whitespace, use
sed -n '/{/,/}/p' input | sed '/#/d' | sed '/^[ ]*$/d'
(the brackets in the last expression contain a space and a tab)
update
If empty lines are just empty lines (no ws), the above can be shortened to
sed -e '/#/d' -e '/^$/d' input
update2
To check if the pattern tags {... is present in file, use
$ tr -d '\n' < input | grep -o 'tags\s*{[^}]*}'
tags { hosttags = 1# this is a comment}
The tr part above removes all newlines, i.e. turns everything into one single line (this will work fine as long as the file isn't too large); grep then searches for the tags pattern and outputs all matches.
The return code from grep will be 0 if the pattern was found, 1 if not.
The return code is stored in the variable $?. Alternatively, pipe the above to wc -l to get the number of matches found.
update3
A regex that searches for tags { hosttags=1 } with any amount of whitespace anywhere:
'tags\s*{\s*hosttags\s*=\s*1[^}]*}'
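The pieces above (drop comments, flatten the file, test with grep) can be combined into one check whose exit status says whether the block is present. This is a rough sketch, not a real lvm.conf parser. The function name has_hosttags is made up, and it assumes that # never appears inside a quoted value and that no other word ends in "tags" directly before a brace.

```shell
# Hypothetical check: succeed if the file contains  tags { hosttags = 1 }
# with arbitrary whitespace, blank lines or comments in between.
has_hosttags() {
  sed 's/#.*//' "$1" |            # drop comments
    tr -s '[:space:]' ' ' |       # newlines, tabs and runs of blanks become one space
    grep -Eq 'tags ?\{ ?hosttags ?= ?1 ?\}'
}
```

It can be used directly in a condition, for example: if has_hosttags /etc/lvm/lvm.conf; then echo found; fi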
try this line:
awk '/^\s*#|^\s*$/{next}1' /etc/lvm/lvm.conf
One could try preprocessing the file first, removing comments and empty lines and introducing an empty line after each closing curly brace, for easy processing with the second awk.
awk 'NF && $1!~/^#/{print; if(/}/) print x}' file | awk '/pattern/' RS=

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or AWK one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed (note that it splits the line into chunks of four characters rather than into four parts):
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} matches four characters at a time
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a posix compliant awk you can omit the --posix, but --posix is necessary for gnu awk and since that seems to be the most commonly used implementation I've given the solution in terms of gawk.
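The same idea can also be written without interval expressions, using only substr, which should run in any POSIX awk. This is a sketch under the assumption that "equal parts" means a chunk width of the line length divided by the number of parts, rounded up. The function name split_parts is hypothetical.

```shell
# Hypothetical helper: split each input line into at most N space-separated
# chunks of equal width (the last chunk may be shorter). N defaults to 4.
split_parts() {
  awk -v n="${1:-4}" '{
    len = length($0)
    w = int((len + n - 1) / n)          # chunk width: len / n rounded up
    out = ""
    for (p = 1; p <= len; p += w)
      out = out (p > 1 ? " " : "") substr($0, p, w)
    print out
  }'
}
```

For the sample line, echo ATGCATHLMNPHLNTPLML | split_parts 4 prints ATGCA THLMN PHLNT PLML. Rounding the width up means a short line can yield fewer than N chunks (9 characters with N=4 gives three chunks of 3).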
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop label
/^\n/bb if we reach a newline we are done and branch to the b label
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the line length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile