Merging Lines using sed

I have a text file that consists of 45999 lines. Each line has one word (unigram). I want to create pairs of consecutive words (bigrams). For example:
apple
pie
red
vine
I want 'apple pie', 'pie red', 'red vine'. I tried sed 'N;s/\n/ /', but it produces only 'apple pie' and 'red vine'. How can I solve this problem? Thank you.

Could you please try the following, if you are OK with awk?
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
print s1 $(i-1) s1, s1 $i s1
}
}' Input_file
Output will be as follows.
'apple','pie'
'pie','red'
'red','vine'
2nd solution: since the OP's expected output is not clear, adding this one too.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
val=(val?val OFS:"")s1 $(i-1) s1 OFS s1 $i s1
}
}
END{
print val
}' Input_file
Output will be as follows.
'apple','pie','pie','red','red','vine'

This might work for you (GNU sed):
sed -nE 'N;s/\n(.*)/ \1&/;P;D' file
Append the next line to the current line, then replace the newline by a space and append the second line again. Print/delete the first line and repeat.
N.B. This does not print the last line on its own, as it does not start a new pair. If the last line is needed, use:
sed -E 'N;s/\n(.*)/ \1&/;P;D' file
If the output is to be printed as a single line with each pair surrounded by single quotes and separated by a comma, use:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/ (\S+)$/ '\''\1'\''/' file
Or:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/, \S+$//' file
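If awk is acceptable, the sliding pair can also be sketched by remembering only the previous line. This is a minimal sketch, not one of the answers above; the function name bigrams is a hypothetical helper, and the output format (two words separated by a space, one pair per line) is an assumption taken from the question.

```shell
# bigrams: hypothetical helper name; prints every pair of adjacent input lines.
bigrams() {
    # prev holds the previous line; from the second line on, print the pair.
    awk 'NR > 1 { print prev, $0 } { prev = $0 }' "$@"
}

printf 'apple\npie\nred\nvine\n' | bigrams
# apple pie
# pie red
# red vine
```

Only one line is kept in memory at a time, so the size of the file should not matter.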

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between ### .
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2
Here is a single awk command that makes ### the field separator and prints each even-numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex, ###(.*?)###, matches ###, then captures into Group 1 any zero or more chars other than line break chars, as few as possible, and then matches the closing ###.
The -o1 option outputs the Group 1 value only.
See the regex demo online.
sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
Replacing "###" with newline, D, then conditionally branching to P if a second replacement of "###" is successful.
This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ### with newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.
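For an awk without multi-character RS support, the same extraction can be sketched with match() in a loop. This is only a sketch under the same assumption as the grep + sed answer (no # characters between the ### markers); the function name extract_marked is hypothetical.

```shell
# extract_marked: hypothetical helper name; prints each string found between
# a pair of ### markers, assuming no '#' occurs between the markers.
extract_marked() {
    awk '{
        s = $0
        # match() sets RSTART and RLENGTH for the leftmost ###...### span.
        while (match(s, /###[^#]*###/)) {
            print substr(s, RSTART + 3, RLENGTH - 6)   # drop both markers
            s = substr(s, RSTART + RLENGTH)            # continue after the span
        }
    }' "$@"
}

printf '%s\n' '###interest1 moreinterest1### sometext ###interest2###' \
              'not-interesting-line' | extract_marked
# interest1 moreinterest1
# interest2
```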

Normalize column fill with space on right

Working with the example log file below:
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
I need to normalize the last column by padding it with spaces on the right until its length is 25.
I tried this Perl code, without success:
perl -F';' -lane '$F[5] = $F[5], sprintf "% 25d"; $" = ";"; print "@F"'
I need the output below:
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
$ awk 'BEGIN{FS=OFS=";"} {$NF=sprintf("%-25s",$NF)}1' file
1;000117;20190529;055529;9521;0988388019
1;000015;20190529;071944;2222;2231
1;000012;20190529;072734;4258;4252
1;000006;20190529;073336;2226;1000
3;000005;20190529;073715;1000;037760967
3;000004;20190529;073751;1000;037760967
So you can see the blanks:
$ awk 'BEGIN{FS=OFS=";"} {$NF=sprintf("%-25s",$NF)}1' file | tr ' ' '#'
1;000117;20190529;055529;9521;0988388019###############
1;000015;20190529;071944;2222;2231#####################
1;000012;20190529;072734;4258;4252#####################
1;000006;20190529;073336;2226;1000#####################
3;000005;20190529;073715;1000;037760967################
3;000004;20190529;073751;1000;037760967################
You were on the right track. A working Perl version:
perl -F';' -lane '$F[5]=sprintf("%-25s",$F[5]);print join ";",@F'
perl -F';' -lpane '$F[5]=sprintf("%-25s",$F[5]);$_=join ";",@F'
This might work for you (GNU sed):
sed -i ':a;/;[^;]\{25\}$/!s/$/ /;ta' file
If the last field is not 25 characters long, add a space until it is.
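If the width may change, the awk answer above can be wrapped so that the width is a parameter. This is a sketch only; the function name pad_last_field and its argument order are assumptions, not part of any answer.

```shell
# pad_last_field WIDTH [FILE...]: hypothetical helper; left-justifies the last
# ';'-separated field and pads it with spaces on the right up to WIDTH.
pad_last_field() {
    width=$1
    shift
    # Build the format string (for example "%-25s") from the width variable.
    awk -v w="$width" 'BEGIN { FS = OFS = ";" }
        { $NF = sprintf("%-" w "s", $NF) } 1' "$@"
}

# Show the padding by turning the blanks into '#'.
printf '1;000015;20190529;071944;2222;2231\n' | pad_last_field 25 | tr ' ' '#'
```

Fields that are already longer than the width are left as they are, since %-Ns never truncates.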

delete string for each line with sed

My file contains x lines. On each line, I would like to remove any string that comes before or after the reference string.
The reference string and string to remove are separated by space.
The file contains :
test.user.passs
test.user.location
global.user
test.user.tel
global.pass
test.user.email string_err
#ttt...> test.user.car ->
test.user.address
è_ 788 test.user.housse
test.user.child
{kl78>&é} global.email
global.foo
test.user.foo
How can I use sed to remove the string at the start of each line that contains the "test" string, and also the string at the end of the line, given that they are separated from it by a space or tab?
The desired result is :
test.user.passs
test.user.location
global.user
test.user.tel
global.pass
test.user.email
test.user.car
test.user.address
test.user.housse
test.user.child
{kl78>&é} global.email
global.foo
test.user.foo
I interpret your question as: find the first word that is "word characters and at least one dot"
Tcl:
echo '
set fh [open [lindex $argv 1] r]
while {[gets $fh line] != -1} {puts [regexp -inline {\w+(?:\.\w+)+} $line]}
' | tclsh - file
sed
sed -r 's/.*\<([[:alpha:]]+(\.[[:alpha:]]+)).*/\1/' file
perl
perl -nE '/(\w+(\.\w+)+)/ and say $1' file
using sed like
sed -r 's/^[^ ]+[ ]+([^ ]+)[ ]+[^ ]*/\1/' file
This might work for you (GNU sed):
sed -r 's/.*(test\S+).*/\1/' file
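Another reading of the desired output is: on lines that have a whitespace-separated field starting with "test.", keep only that field, and print every other line unchanged. A minimal awk sketch of that reading follows; the function name keep_test_field is hypothetical, and the "test." prefix is an assumption taken from the sample data.

```shell
# keep_test_field: hypothetical helper; for lines with a field that starts
# with "test.", print only that field; print all other lines unchanged.
keep_test_field() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^test\./) { print $i; next }
        print
    }' "$@"
}

printf '%s\n' 'test.user.email string_err' '#ttt...> test.user.car ->' 'global.foo' |
    keep_test_field
# test.user.email
# test.user.car
# global.foo
```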

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like every line starting with "_" to be moved up as a second column on the line before it.
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed, the loop is built by creating a label (:a); while the current line is not the last one ($!), append the next line (N) and branch back to label a:
:a
$! {
N
b a
}
After this, the whole file is in memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

Brocade alishow merge two consecutive lines awk sed

How can I join two lines using awk or sed?
For example, I have data like below:
abcd
12:12:12:12:12:12:12:12
efgh001_01
45:45:45:45:45:45:45:45
ijkl7464746
78:78:78:78:78:78:78:78
and I need output like below:
abcd 12:12:12:12:12:12:12:12
efgh001_01 45:45:45:45:45:45:45:45
ijkl7464746 78:78:78:78:78:78:78:78
Running this almost works, but I need the space or tab:
awk '!(NR%2){print$0p}{p=$0}'
You're almost there:
awk '(NR % 2 == 0) {print p, $0} {p = $0}'
With sed you can do that as follows:
sed -n 'N;s/\n/ /p' file
where:
N reads next line
s replaces the new line character with a space to join both lines properly
p prints the result
This might work for you:
sed '$!N;s/\n/ /' file
or this:
paste -sd' \n' file
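If the file might have an odd number of lines, pairing with getline is one way to make sure a trailing unpaired line is still printed. This is a sketch only; the function name join_pairs is hypothetical.

```shell
# join_pairs: hypothetical helper; joins each line with the one after it,
# and prints a final unpaired line on its own.
join_pairs() {
    awk '{
        # getline returns 1 when a following line was read into nxt,
        # and 0 at end of input.
        if ((getline nxt) > 0) { print $0, nxt } else { print $0 }
    }' "$@"
}

printf 'abcd\n12:12:12:12:12:12:12:12\nefgh001_01\n45:45:45:45:45:45:45:45\n' | join_pairs
# abcd 12:12:12:12:12:12:12:12
# efgh001_01 45:45:45:45:45:45:45:45
```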