Remove newline depending on the format of the next line - sed

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like every line starting with "_" to be appended as a second column to the line before it.
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it does not do what I want (it basically does nothing).
Can anyone point me to the right way of doing it?
Thanks

Try the following solution.
In sed, the loop is built by creating a label (:a); while the current line is not the last one ($!), append the next line (N) and branch back to label a:
:a
$! {
N
b a
}
After this the whole file is in the pattern space, so do a global substitution for each _ preceded by a newline, then print:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere

If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file

This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
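As a minimal sketch, the same N/P/D loop can be wrapped in a shell function and tried on the sample data. The function name join_continuations is hypothetical. Two choices here are assumptions rather than part of the answer above: $!N is used instead of N so that the last line is handled the same way by non-GNU sed, and the underscore is kept in the output.

```shell
# Hypothetical helper: append every line starting with "_" to the previous line.
join_continuations() {
  # :a   loop label
  # $!N  append the next line unless we are already on the last one
  # s    turn "newline + underscore" into "space + underscore"
  # ta   if the substitution succeeded, loop to pick up further "_" lines
  # P;D  print and delete the first line of the pattern space, then restart
  sed -e ':a' -e '$!N' -e 's/\n_/ _/' -e 'ta' -e 'P' -e 'D' "$@"
}

printf 'title1\n_1 texthere\ntitle2\n_2 texthere\n' | join_continuations
```

Only two lines are held in the pattern space at a time (plus any continuation lines already joined), so memory use does not grow with file size.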

A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, -00 makes perl read the file in paragraph mode, where records are separated by one or more blank lines. Your example has no blank lines, so the entire file is read into memory as a single record, and a simple global substitution of \n_ with a space works.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline is removed from each line (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). On each line after the first, if the substitution succeeds, i.e. a _ was removed from the beginning of the line (s/^_//), the previous line is printed followed by a space instead of a newline (print "$l "). If the substitution fails, the previous line is printed followed by a newline. The END{} block just prints the final line of the file.
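The same "remember the previous line" logic can be sketched in awk, for readers who prefer it to Perl. This is an illustration of the idea described above, not the answerer's code; the function name join_streaming and the variable prev are hypothetical. Like the Perl version, it drops the leading underscore.

```shell
# Hypothetical helper: stream the input, holding only the previous line in memory.
join_streaming() {
  awk '
    # From the second line on: if the current line starts with "_", strip it
    # and print the previous line followed by a space; otherwise end the
    # previous line with a newline.
    NR > 1 { printf "%s", (sub(/^_/, "") ? prev " " : prev "\n") }
    # Remember the (possibly modified) current line.
           { prev = $0 }
    # Flush the last line.
    END    { if (NR > 0) print prev }
  ' "$@"
}

printf 'title1\n_1 texthere\ntitle2\n_2 texthere\n' | join_streaming
```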

Related

Merging Lines using sed

I have a text file that consists of 45999 lines. Each line has one word (a unigram). I want to create sequences of two consecutive words (bigrams). For example:
apple
pie
red
vine
I want 'apple pie', 'pie red', 'red vine'. I tried sed 'N;s/\n/ /', but it creates just 'apple pie' and 'red vine'. How can I solve this problem? Thank you.
You could try the following, if you are OK with awk.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
print s1 $(i-1) s1, s1 $i s1
}
}' Input_file
Output will be as follows.
'apple','pie'
'pie','red'
'red','vine'
Second solution, added because the OP's desired output is not clear:
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
val=(val?val OFS:"")s1 $(i-1) s1 OFS s1 $i s1
}
}
END{
print val
}' Input_file
Output will be as follows.
'apple','pie','pie','red','red','vine'
This might work for you (GNU sed):
sed -nE 'N;s/\n(.*)/ \1&/;P;D' file
Append the next line to the current line, then replace the newline by a space and append the second line again. Print/delete the first line and repeat.
N.B. This does not print the last line as it is not a pair, if the last line is needed use:
sed -E 'N;s/\n(.*)/ \1&/;P;D' file
If the output is to be printed as a single line with each pair surrounded by single quotes and separated by a comma, use:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/ (\S+)$/ '\''\1'\''/' file
Or:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/, \S+$/' file
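For comparison, the overlapping-pairs idea (print each line together with the one before it) can be sketched in awk with a single saved variable. The function name bigrams and the variable prev are hypothetical, and the output format (pairs separated by a space, one per line) is an assumption about what the asker wants.

```shell
# Hypothetical helper: print every pair of consecutive lines on one line.
bigrams() {
  # From the second line on, print the previous word and the current word;
  # then remember the current word for the next pair.
  awk 'NR > 1 { print prev, $0 } { prev = $0 }' "$@"
}

printf 'apple\npie\nred\nvine\n' | bigrams
```

Because each word is used once as the second element and once as the first, a file of N lines yields N-1 pairs.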

delete string for each line with sed

My file contains x lines. I would like to remove the strings that come before and after the reference string on each line.
The reference string and the strings to remove are separated by spaces.
The file contains :
test.user.passs
test.user.location
global.user
test.user.tel
global.pass
test.user.email string_err
#ttt...> test.user.car ->
test.user.address
è_ 788 test.user.housse
test.user.child
{kl78>&é} global.email
global.foo
test.user.foo
How can I use sed to remove the string at the start and the string at the end (separated by a space or tab) of each line that contains the "test" string?
The desired result is :
test.user.passs
test.user.location
global.user
test.user.tel
global.pass
test.user.email
test.user.car
test.user.address
test.user.housse
test.user.child
{kl78>&é} global.email
global.foo
test.user.foo
I interpret your question as: find the first word made of word characters and at least one dot.
Tcl:
echo '
set fh [open [lindex $argv 1] r]
while {[gets $fh line] != -1} {puts [regexp -inline {\w+(?:\.\w+)+} $line]}
' | tclsh - file
sed
sed -r 's/(^|.*[[:space:]])([[:alpha:]]+(\.[[:alpha:]]+)+).*/\2/' file
perl
perl -nE '/(\w+(\.\w+)+)/ and say $1' file
using sed like
sed -r 's/^[^ ]+[ ]+([^ ]+)[ ]+[^ ]*/\1/' file
This might work for you (GNU sed):
sed -r 's/.*(test\S+).*/\1/' file
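A field-based alternative can be sketched in awk. It assumes, as in the sample, that the reference string is a whitespace-separated word beginning with "test." and that lines without such a word should be printed unchanged. The function name keep_test_word is hypothetical.

```shell
# Hypothetical helper: keep only the first field that starts with "test.".
keep_test_word() {
  awk '
    {
      # Scan the whitespace-separated fields for the reference word.
      for (i = 1; i <= NF; i++)
        if ($i ~ /^test\./) { print $i; next }
      # No reference word on this line: print it untouched.
      print
    }
  ' "$@"
}

printf '%s\n' 'test.user.email string_err' '#ttt...> test.user.car ->' 'global.foo' | keep_test_word
```

Splitting on fields avoids the greedy-match subtleties of the regex approaches, at the cost of assuming the reference word never contains whitespace.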

Use sed to replace word in 2-line pattern

I try to use sed to replace a word in a 2-line pattern with another word. When in one line the pattern 'MACRO "something"' is found then in the next line replace 'BLOCK' with 'CORE'. The "something" is to be put into a reference and printed out as well.
My input data:
MACRO ABCD
CLASS BLOCK ;
SYMMETRY X Y ;
Desired outcome:
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
My attempt in sed so far:
sed 's/MACRO \([A-Za-z0-9]*\)/,/ CLASS BLOCK ;/MACRO \1\n CLASS CORE ;/g' input.txt
The above did not work giving message:
sed: -e expression #1, char 30: unknown option to `s'
What am I missing?
I'm open to one-liner solutions in perl as well.
Thanks,
Gert
Using a perl one-liner in slurp mode:
perl -0777 -pe 's/MACRO \w+\n CLASS \KBLOCK ;/CORE ;/g' input.txt
Or using a streaming example:
perl -pe '
s/^\s*\bCLASS \KBLOCK ;/CORE ;/ if $prev;
$prev = $_ =~ /^MACRO \w+$/
' input.txt
Explanation:
Switches:
-0777: Slurp files whole
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
When in one line the pattern 'MACRO "something"' is found then in the
next line replace 'BLOCK' with 'CORE'.
sed works on lines of input. If you want to perform substitution on the next line of a specified pattern, then you need to add that to the pattern space before being able to do so.
The following might work for you:
sed '/MACRO/{N;s/\(CLASS \)BLOCK/\1CORE/;}' filename
Quoting from the documentation:
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then sed
exits without processing any more commands.
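To see what N does to the pattern space, a small sketch using sed's l command (which prints the pattern space in an unambiguous form) may help. The function name show_pattern_space is hypothetical, and the two-line input is just the start of the sample data.

```shell
# Hypothetical helper: show the pattern space after N has appended the next line.
show_pattern_space() {
  # -n suppresses normal output; N appends the next input line;
  # l prints the pattern space with the embedded newline shown as \n
  # and the end of the pattern space marked with $.
  sed -n 'N;l' "$@"
}

printf 'MACRO ABCD\nCLASS BLOCK ;\n' | show_pattern_space
```

The output shows both lines joined by a literal \n, which is why a single s command can then reach the CLASS line.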
If you want to make use of address range as in your attempt, then you need:
sed '/MACRO/,/CLASS BLOCK/{s/\(CLASS\) BLOCK/\1 CORE/}' filename
I'm not sure why you need a backreference for substituting the macro name.
You could try this awk command also,
awk '{print}/MACRO/ {getline; sub (/BLOCK/,"CORE");{print}}' file
It prints every line as is, and performs the replacement on the line that follows one containing the word MACRO.
Since getline has so many pitfalls, I try not to use it, so:
awk 'prev {sub(/BLOCK/,"CORE")} {prev = ($0 ~ /MACRO/)} 1' file
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
This could do it
#!/usr/bin/awk -f
BEGIN {
RS = ";"
}
/MACRO/ {
sub("BLOCK", "CORE")
}
{
printf "%s", (s++ ? ";" : "") $0
}
"line" ends with ;
substitute CORE for BLOCK in "lines" containing MACRO
print ; followed by "line" unless first line

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or awk one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a POSIX-compliant awk you can omit --posix. Older versions of GNU awk need --posix (or --re-interval) to enable interval expressions such as .{5}, and since gawk seems to be the most commonly used implementation I've given the solution in terms of gawk.
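The same width calculation can be sketched without interval expressions by cutting the line with substr. This is an illustration under the assumption that "4 parts" means chunks of width ceil(length/4), as in the asker's sample; the function name split_parts and its argument are hypothetical.

```shell
# Hypothetical helper: split each input line into chunks of width ceil(length/parts).
split_parts() {
  awk -v parts="${1:-4}" '
    {
      n = length($0)
      w = int((n + parts - 1) / parts)   # ceiling of n / parts
      out = ""
      for (i = 1; i <= n; i += w) {
        sep = (i > 1 ? " " : "")         # space before every chunk but the first
        out = out sep substr($0, i, w)
      }
      print out
    }
  '
}

echo "ATGCATHLMNPHLNTPLML" | split_parts 4
```

For very short lines the ceiling width can produce fewer chunks than requested (a 5-character line gives chunks of 2, 2 and 1), so this matches the sample but is not a general "exactly N parts" splitter.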
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop label
/^\n/bb if we reach a newline we are done and branch to the b label
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative; the field width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
For the 19-character sample line, the value of cut_arg is:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

join 2 lines only if field-1 are equals with sed or awk

input file:
$ cat t.txt
id1;value1_1
id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1
id4;value4_2
id5;value5_1
result would be:
id1;value1_1;id1;value1_2
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
using sed or awk. Please give your opinion.
Here's one way to do it:
awk -F';' 'BEGIN { getline; id=$1; line=$0 } { if ($1 != id) { print line; line = $0; } else { line = line ";" $0; } id=$1; } END { print line; }' t.txt
Explanation:
Set field separator to ;:
-F';'
Start by reading the first line of input (getline), save the first field ($1) as id, and the first line ($0) as line:
BEGIN { getline; id=$1; line=$0 }
For each line of input, check if the first field differs from the stored id:
if ($1 != id)
If it does, then print the saved line and store the new one ($0):
print line; line = $0;
Otherwise, append the new line to the stored line(s):
line = line ";" $0;
And save the new id:
id=$1
At the end, print whatever is left in line:
END { print line; }
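The steps above can also be sketched without getline by testing NR instead of priming the variables in BEGIN. This is a variation for illustration, not the answerer's code; it assumes, like the original, that lines with the same id are adjacent in the input. The function name join_by_key is hypothetical.

```shell
# Hypothetical helper: join adjacent lines that share the same first ;-separated field.
join_by_key() {
  awk -F';' '
    # A new id starts: flush the accumulated line.
    NR > 1 && $1 != id { print line; line = "" }
    # Start a new accumulated line or append to the current one.
    { line = (line == "" ? $0 : line ";" $0); id = $1 }
    # Flush whatever is left at end of input.
    END { if (NR > 0) print line }
  ' "$@"
}

printf 'id1;value1_1\nid1;value1_2\nid2;value2_1\n' | join_by_key
```

Unlike the array-based approach, this keeps the input order and holds only one group in memory, but it will not merge duplicate ids that are separated by other lines.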
I guess the id2; line is missing from your result example by mistake, right?
Anyway, you could try the awk line below:
awk -F';' '{a[$1]=($1 in a)?a[$1]";"$0:$0}END{for(x in a)print a[x]}' yourFile|sort
output would be:
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
This might work for you:
sed -e '1{h;d};H;${x;:a;s/\(\([^;]*;\)\([^\n]*\)\)\n\2/\1;\2/;ta;p};d' t.txt
Explanation:
Slurp the file into the hold space (HS); then, at end-of-file, swap to the HS, use substitution to concatenate lines with duplicate keys, and print. N.B. the lines that would normally be printed are all deleted.
EDIT:
The above solution works (as far as I know) but is not very fast for large volumes (read: incredibly slow). This solution is better:
# cat -A /tmp/t.txt
id1;value1_1$
id1;value1_2$
id2;value2_1$
id3;value3_1$
id4;value4_1$
id4;value4_2$
id5;value5_1$
# for x in {1..1000};do cat /tmp/t.txt;done |
> sed ':a;$!N;/^\([^;]*;\).*\n\1/s/\n/;/;ta;P;D'| sort | uniq
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1