Simplest sed to remove all duplicate lines anywhere in a document - sed

What is the simplest sed code to remove all duplicate lines anywhere in a document?
This reference script, I guess, only works when the duplicates are consecutive, doesn't it?
sed -E '$q; N; /^(.*)\n\1$/!{ P; D }; :L $d; s/.*\n//; N; /^(.*)\n\1$/bL; D'
Please help; I would be very grateful.

This might work for you (GNU sed):
sed -E 'H;x;s/((\n[^\n]*)(\n.*)?)\2$/\1/;x;$!d;x;s/.//' file
Append each line to the hold space and remove it if it has occurred before.
At the end of file, remove the first introduced newline.

Using sed, assuming your input file looks like this:
$ cat input_file
one
one
two
one
three
four
three
three
five
two
five
four
three
four
four
five
$ sed -n 'G;/^\(.*\n\).*\n\1$/d;H;P' input_file
one
two
three
four
five

If I'm understanding correctly, you want to remove all but the first instance of any duplicated lines in a file, right?
So that
alpha
beta
beta
gamma
beta
delta
alpha
delta
becomes
alpha
beta
gamma
delta
So don't use sed. Use Perl to walk through the file line by line and only print the lines it has not seen before:
perl -ne'print unless $seen{$_}++' input.txt > output.txt
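For comparison, the same "print a line only the first time it is seen" idea can be sketched in awk. This is an illustrative alternative, not part of the answer above; the function name dedupe_lines is made up for the example.

```shell
# Hypothetical helper: keep only the first occurrence of each line.
# seen[$0]++ evaluates to 0 (false) the first time a line appears, so
# !seen[$0]++ is true exactly once per distinct line and awk prints it.
dedupe_lines() {
    awk '!seen[$0]++' "$@"
}
```

Usage would look like dedupe_lines input.txt > output.txt. Like the Perl version, it keeps one entry per distinct line in memory.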

Related

Using Sed to Delete multiple lines using a file with patterns

I am currently using sed to delete lines matching various patterns, together with the line that follows each match, using the following code:
sed -i -e"/String1/,+1d" -e"/String2/,+1d" filename.txt
It works very well; however, I have a lot of patterns, which vary from time to time.
Is it possible to put all the patterns in another text file and make sed delete all entries matching the patterns found in that file?
Thanks
Here is an awk version
awk 'NR==FNR {a[$0]++;next} {for (i in a) if ($0~i) f=2} --f<0' list yourfile
NR==FNR {a[$0]++;next} stores the patterns read from the file list in array a
for (i in a) loops through all stored patterns for every line of yourfile
if ($0~i) f=2 sets flag f to 2 when the line matches a pattern
--f<0 decrements flag f by one and tests whether it is less than 0; if so, the line is printed.
example
cat yourfile
one
two
three
four
five
six
seven
eight
nine
ten
eleven
cat list
three
eight
awk 'NR==FNR {a[$0]++;next} {for (i in a) if ($0~i) f=2} --f<0' list yourfile
one
two
five
six
seven
ten
eleven
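The countdown flag is compact but easy to misread, so here is a more explicit sketch of the same logic. The function name delete_with_next and the variable names are made up for illustration; as in the one-liner above, each line of the list file is treated as a regular expression.

```shell
# Hypothetical helper: delete every line of the data file ($2) that matches
# a pattern from the list file ($1), plus the one line that follows it.
delete_with_next() {
    awk '
        NR == FNR { pat[$0]; next }      # first file: remember each pattern
        {
            for (p in pat)
                if ($0 ~ p) skip = 2     # matching line + 1 following line
        }
        skip > 0 { skip--; next }        # still inside a block to delete
        { print }
    ' "$1" "$2"
}
```

Called as delete_with_next list yourfile, it should give the same output as the example above.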
Trying to stick with sed - at all cost, and being creative :-)
Consider using sed itself to generate the sed script that will perform the deletions, based on the patterns file.
Note that this solution processes each input file in a single pass, which makes it usable on large files and with many patterns.
Proposed Solution:
sed -i -e "$(sed -e '/\//d;s/^/\//;s/$/\/,+1d/' < patterns.txt)" filename.txt
The embedded sed program (sed -e '/\//d;s/^/\//;s/$/\/,+1d/ ...) converts patterns.txt into a small sed script:
patterns.txt:
three
eight
foo/bar
Output: (note that foo/bar is ignored because it contains '/')
/three/,+1d
/eight/,+1d
Notes, limitations, etc.:
One limitation of the above implementation is the delimiter: the code drops any pattern containing '/' to simplify generation of the sed script and to avoid potential injection. It is possible to work around this limitation and allow an alternate delimiter (by escaping special characters in the pattern, or by using '\%' addresses). This may need additional testing.
The code assumes that the patterns are valid REs.
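One way the escaping workaround could look is sketched below. It is a minimal sketch, not a hardened implementation: the helper name build_delete_script is made up, only '/' is escaped, blank lines in the patterns file are skipped, and the ",+1" address is a GNU sed extension.

```shell
# Hypothetical helper: turn a patterns file ($1, one regex per line) into a
# sed script with one "/pattern/,+1d" command per line.
build_delete_script() {
    sed -e '/^$/d' \
        -e 's/\//\\\//g' \
        -e 's/.*/\/&\/,+1d/' "$1"
    # 1st expression: drop empty lines (an empty regex means "reuse the last regex")
    # 2nd expression: escape every "/" in the pattern as "\/"
    # 3rd expression: wrap the pattern as /pattern/,+1d
}
```

It would be used as sed -i -e "$(build_delete_script patterns.txt)" filename.txt. With this version a pattern such as foo/bar becomes /foo\/bar/,+1d instead of being dropped.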

How to insert something with sed just once

I'm trying to substitute the first empty line in my input file with a multiline block, i.e., out of
one
two
three
four
five
six
I want to create
one
two
foo
three
four
five
six
For this I tried this sed script:
sed '/^$/i\
\
foo'
But it inserts at /each/ empty line.
How can I tweak this call to sed so that it inserts just at the first occurrence of an empty line? Is there a way to tell sed that from now on the rest of the input should just be copied to the output?
I do not want to switch to awk or other shell tools like read in a loop or similar. I'm just interested in the use of sed for this task.
You can loop and print lines until the end of the file:
sed '/^$/{i\
\
foo
:a;n;ba}' file
I found a way by replacing the i with a s command:
sed '0,/^$/s//\
foo\
/'
But I would prefer a solution using the i command because not everything I could want to do after the search might be easily replaceable with an s.
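The two ideas can probably be combined: keep the GNU-only 0,/^$/ range to stop at the first empty line, but run an i command inside it. The sketch below assumes GNU sed; the function name insert_once is made up, and it inserts foo before the first empty line rather than replacing that line.

```shell
# Hypothetical helper (GNU sed): insert "foo" before the first empty line only.
# The range 0,/^$/ ends at the first empty line, so the inner /^$/ address
# can match at most once; all later empty lines are outside the range.
insert_once() {
    sed '0,/^$/{
/^$/i\
foo
}' "$@"
}
```

Any other command could be placed inside the braces in place of i, which is the flexibility asked for above.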

Using sed to swap columns X and X+1 inline in delimited file

I have a file with multiple lines and for line 2 to the end of the file I want to swap fields 8 and 9. The file is comma separated and I'd like to do the swap inline so I can run it on a batch of files using * wildcard. If this can be accomplished similarly with awk then that works for me too.
example:
header1,header2,header3,...,header8,header9,...,headerN
field1.1,...,field1.9,field1.8,...,field1.N
field2.1,...,field2.9,field2.8,...,field2.N
field3.1,...,field3.9,field3.8,...,field3.N
...
I think the command would look similar to sed -r -i '2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/' temp*.log,
but \2 is not what I expect; it is the 7th field. I know that \2 will not be the 8th field because I have double parentheses there, but I'm not sure how to fix it. Could somebody please explain what this expression is doing, specifically what [^,] does and how the {8} is applied?
Thanks in advance.
In awk, you might use:
awk -F',' 'BEGIN {OFS=","} {t = $8; $8 = $9; $9 = t; print}'
In sed, the command is more convoluted, but it could be done.
sed -e 's/^\(\([^,]*,\)\{7\}\)\([^,]*,\)\([^,]*,\)/\1\4\3/'
Add the -i.bak option (suffix attached to -i, a form that both GNU and BSD sed accept) if your version of sed supports in-place editing.
This uses the universally available sed regexes (it would work on even archaic versions of sed). You could lose most of the backslashes if you used 'extended regular expressions' instead:
sed -r -i 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/\1\4\3\5/'
Note the nested remembered (captured) patterns. The outer set is \1, the inner set would be \2 but that gets repeated 7 times, so you'd have the seventh field as \2. Anyway, that's why the eighth and ninth columns are switched with \4 and \3. \5 are the remaining columns.
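To see the group numbering in action, a small sketch like the following can help. The helper names show_groups and swap_8_9 and the sample data are made up for illustration; swap_8_9 also adds the 2,$ address from the question so the header line is left alone.

```shell
# Hypothetical helper: print what groups 2 to 5 captured instead of swapping.
show_groups() {
    sed -E 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/group2=[\2] group3=[\3] group4=[\4] group5=[\5]/'
}

# Hypothetical helper: swap fields 8 and 9 on every line except the first.
swap_8_9() {
    sed -E '2,$s/^(([^,]*,){7})([^,]*,)([^,]*,)/\1\4\3/' "$@"
}

echo 'a,b,c,d,e,f,g,h,i,j,k' | show_groups
# prints: group2=[g,] group3=[h,] group4=[i,] group5=[j,k]
```

Group 2 ends up holding only the seventh field because a repeated group keeps its last iteration. Note that swap_8_9 expects field 9 to be followed by a comma, so it does nothing on lines with exactly nine fields.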
(I note in passing that it would have been helpful to have some sample data in sufficiently the correct format to test with. It was a nuisance having to edit what is shown in the question to be able to test the code.)
If you need to do much CSV work, then either use Perl and its CSV modules (Text::CSV and Text::CSV_XS) or Python and its CSV module, or get CSVfix.
$2 (\2 in sed) is the second group in the RE.
Groups are numbered by the position of their opening parenthesis (.
So in
'2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/'
you can see (in order of the opening parentheses):
$1 = (([^,]*,){8})
$2 = ([^,]*,) nested inside $1
$3 = ([^,]*,)
$4 = ([^,]*,)
and finally $5 = (.*)
In this specific case, $2 holds the last of the eight matches ({8}).
it seems that awk is the right tool:
awk -F',' -v OFS=',' '{t=$8;$8=$9;$9=t}7' file
This might work for you (GNU sed):
sed -ri '1!s/(,[^,]*)(,[^,]*)/\2\1/4' file
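This swaps the 9th field with the 8th: each match consumes two comma-prefixed fields, starting at field 2, so the 4th match (8 / 2 = 4) covers fields 8 and 9. If you wanted to swap the 7th with the 8th, prepend a comma first so that the pairs start at field 1: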
sed -ri '1!{s/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//}' file
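A quick way to check the pairing arithmetic is to run both commands on numbered fields. The sketch below assumes GNU sed (for the 1! address combined with the numeric flag as written above); the function names and sample data are made up.

```shell
# Hypothetical helpers wrapping the two commands above, without -i so they
# can be tried on standard input.
swap_8_and_9() {
    sed -E '1!s/(,[^,]*)(,[^,]*)/\2\1/4' "$@"
}
swap_7_and_8() {
    sed -E '1!{s/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//}' "$@"
}

printf 'h1,h2,h3,h4,h5,h6,h7,h8,h9,h10\n1,2,3,4,5,6,7,8,9,10\n' | swap_8_and_9
# second line becomes: 1,2,3,4,5,6,7,9,8,10
printf 'h1,h2,h3,h4,h5,h6,h7,h8,h9,h10\n1,2,3,4,5,6,7,8,9,10\n' | swap_7_and_8
# second line becomes: 1,2,3,4,5,6,8,7,9,10
```

In both cases the header line is printed unchanged because of the 1! address.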

Detect two consecutive lines matching a pattern with sed

I am looking for two consecutive lines matching a certain pattern, say containing word 'pat' using sed and have noticed that I am able to detect it sometimes with this command:
sed -n 'N; /.*pat.*\n.*pat.*/p'
but this command fails if the line numbers of the matching pair do not have the right parity, and I assume that is because we are searching lines 1+2, 3+4, 5+6, etc. If this is the case, what would be the correct way to do this?
Why does it need to be sed? May I suggest awk?
awk '{/pat/?f++:f=0} f==2' file
If pat is found, increment f with 1
If pat is not found, reset f to 0
If f==2 print the line.
This might work for you (GNU sed):
sed '$!N;/pattern.*\n.*pattern/p;D' file
This keeps 2 lines in the pattern space and prints both of them out if the regexp matches.
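The awk one-liner above prints only the second line of a pair, and the sed version prints a line twice when three matching lines occur in a row. If each line of a run should be printed exactly once, something like the following sketch might do; the function name consecutive_matches and its regex argument are assumptions made for this example.

```shell
# Hypothetical helper: print every line that belongs to a run of two or more
# consecutive lines matching the regex given as $1 (input on stdin).
consecutive_matches() {
    awk -v re="$1" '
        $0 ~ re {
            if (run == 1) print prev   # second line of a run: emit the first too
            if (run >= 1) print        # every line after the first in a run
            prev = $0
            run++
            next
        }
        { run = 0 }                    # a non-matching line ends the run
    '
}

printf 'a pat\nb pat\nc\npat d\ne\npat f\npat g\npat h\n' | consecutive_matches pat
# prints: a pat, b pat, pat f, pat g, pat h (one per line)
```

The isolated match "pat d" is not printed because its run never reaches a second line.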

Append text to a line on multiple conditions

I am very new to sed so please bear with me... I have a file with contents like
a=1
b=2,3,4
c=3
d=8
.
.
I want to append 'x' to a line which starts with 'c=' and does not contain an 'x'. What I am using right now is
sed -i '/^c=/ s/$/x/'
but this does not cover the second part of my explanation, the 'x' should only be appended if the line did not have it already and hence if I run the command twice it makes the line "c=3xx" which I do not want.
Any help here would be highly appreciated and I know there are a lot of sharp heads around here :) I understand that this can be handled pretty easily through bash but using sed here is a hard requirement.
You can do something like this:
sed -i '/^c=/ {/x/b; s/$/x/}'
Curly brackets are used for grouping. The b command branches to the end of the script (stops the processing of the current line).
b label
Branch to label; if label is omitted, branch to end of script.
Edit: as William Pursell suggests in the comment, a shorter version would be
sed -i '/^c=/ { /x/ !s/$/x/ }'
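A quick check that the command is safe to run repeatedly is to pipe the data through it twice. This is a small sketch without -i so that it works on standard input; the function name append_x is made up, and a ; is placed before the closing brace so that non-GNU seds are more likely to accept it.

```shell
# Hypothetical helper: append "x" to lines that start with "c=" and do not
# already contain an "x".
append_x() {
    sed '/^c=/{/x/!s/$/x/;}' "$@"
}

printf 'a=1\nb=2,3,4\nc=3\nd=8\n' | append_x | append_x
# the c line comes out as c=3x, not c=3xx
```

The second pass changes nothing because the c= line now contains an x, so the inner /x/! condition no longer holds.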
awk is probably a better choice here as you can easily combine regular expression matches with logical operators. Given the input:
$ cat file
a=1
b=2,3,4
c=3
c=x
c=3
d=8
The command would be:
$ awk '/^c=/ && !/x/ {$0=$0"x"} 1' file
a=1
b=2,3,4
c=3x
c=x
c=3x
d=8
Where $0 is the awk variable that contains the current line being read.
This might work for you (GNU sed):
sed -i '/^c=[^x]*$/s/$/x/' file
or:
sed -i 's/^c=[^x]*$/&x/' file