sed: more than one number or 'g' in substitute flags

When I run my command I get:
sed 's/<?xml version="1.0" encoding="UTF-8"?>//2g' all.xml
sed: 1: "s/<?xml version="1.0" e ...": more than one number or 'g' in substitute flags
How can I fix this? It doesn't make much sense to me. I'm trying to remove all but the first instance of <?xml version="1.0" encoding="UTF-8"?> from a new xml file.

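Your sed command does not give an error on my Ubuntu 18.04 (GNU sed accepts a number combined with g as an extension, while BSD sed rejects it), but it does not do what you want either. sed applies the substitution to one line at a time, so 2g only counts occurrences within each line, not across the whole file. sed is very good at working on one line at a time, but harder to get working across multiple lines.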
awk may be a better solution.
awk '/<\?xml version="1\.0" encoding="UTF-8"\?>/ && f++>0 {next} 1' file
This tests whether the line contains the pattern and, if it does, whether the counter f is greater than 0, then increments f by 1. On the first hit f is 0, so the condition is false and the line is printed. On every later hit f is greater than 0, so next skips the line and moves on to the next one. The trailing 1 is always true, so whenever it is reached the default action runs and the line is printed.
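A minimal sketch of the awk approach on a throwaway file. The sample content and the variable names tmp and result are made up for illustration:

```shell
# Build a small sample file with the XML declaration repeated (made-up content).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<a/>
<?xml version="1.0" encoding="UTF-8"?>
<b/>
<?xml version="1.0" encoding="UTF-8"?>
<c/>
EOF

# Keep the first line containing the declaration, skip every later one.
result=$(awk '/<\?xml version="1\.0" encoding="UTF-8"\?>/ && f++>0 {next} 1' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Note that this drops the whole line that contains a repeated declaration, so it assumes each declaration sits on a line of its own.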

Related

Remove '^#' line from a file

I have a file in which I have a particular line of this type:
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^ ...
Actually, all the other lines are a list (a matrix) of numbers or *******. The problem is that I cannot open this file with normal editors, so I cannot remove this line.
I can open the file via shell using nano.
To eliminate this line (that is the second line from the top) I used the simple command:
sed '2d' fort.21.dat
But this does not eliminate it.
Can someone help me eliminate this line and make this .dat file normally readable?
Thanks a lot
Try:
tr -d '\0' < fort.21.dat > fixed.21.dat
This uses the tr utility to delete the ^# (zero) bytes from the file.
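A small sketch of what the tr command does, using a made-up three-line file whose second line consists only of NUL bytes. The octal escape '\000' is used here as the spelled-out form of '\0':

```shell
# Made-up sample: a line of NUL bytes sandwiched between two data lines.
tmp=$(mktemp)
printf '1 2 3\n\000\000\000\n4 5 6\n' > "$tmp"

# Delete every NUL byte; the second line is left behind as an empty line.
fixed=$(tr -d '\000' < "$tmp")
printf '%s\n' "$fixed"
rm -f "$tmp"
```

Only the NUL bytes are removed; the newline that ended that line is still there, so an empty line remains in the output.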

Sed to replace last character on condition

I have a file which has following lines
172XI207 X123955 1
412XE401 XE05689 1
412XI402 XI9515 1
412XI403 XI06702 1
412XE404 XE75348 1
I want to replace last column to 2 if the first two characters in the second column matches to XE.
The result should be like below
172XI207 X123955 1
412XE401 XE05689 2
412XI402 XI9515 1
412XI403 XI06702 1
412XE404 XE75348 2
I want to use sed (not awk). Can someone please let me know how this can be achieved using sed?
Many sed commands take an address or address range (see the man page for the gory details). The most common command is probably s, and it is among those that take an address, meaning it doesn't need to apply to every line. An address can be a regular expression. The s command is:
{address}s/pattern/replacement/
For you the address (the matching RE) is / XE/ (assuming your columns are space-separated; change that to a tab if necessary), the pattern is 1$ and the replacement is 2. Therefore:
/ XE/s/1$/2/
or as a command line
sed -e '/ XE/s/1$/2/' < oldfile > newfile
EDIT: oops, second column, not start of line.
This command should do the trick (provided you are looking at myfile.txt):
sed -e '/ XE/ s/1$/2/' myfile.txt
You can apply the replacement to the file itself by adding the -i option, which modifies the file in place. Make sure the output is exactly what you expect before you do.
Edit: based on a question in the comments, here is a command that matches on the third column and replaces in the fifth.
sed -e 's/^\(\(\w\+\W\+\)\{2\}XE\(\w\+\W\+\)\{2\}\)1/\12/'
Or, as an alternative, you can first select the line and then substitute:
sed -e '/^\(\w\+\W\+\)\{2\}XE/ s/^\(\(\w\+\W\+\)\{4\}\)1/\12/'
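The address-plus-substitution form from the answers above can be checked against the sample data from the question. This is only a sketch; the temporary file and the variable names are assumptions:

```shell
# Sample data copied from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
172XI207 X123955 1
412XE401 XE05689 1
412XI402 XI9515 1
412XI403 XI06702 1
412XE404 XE75348 1
EOF

# Only on lines where a space is followed by XE, replace a trailing 1 with 2.
result=$(sed -e '/ XE/s/1$/2/' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The leading space in / XE/ is what keeps the first column (for example 412XE401) from selecting a line by itself.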

SED Code Explanation

I have a line of SED, below, that is in a batch command that I run every month. It was written by someone before me, and I am looking to understand the parts of this code. From the two outputs I can tell that it takes one line and deletes another when sequential lines are duplicates, I just don't understand how it is being done with this line.
sed "$!N; /^\(.*\)\n\1$/!P; D" finalish.txt > final.txt
Example of - Finalish.txt
201408
201409
201409
201409
201409
Example of - Final.txt
201408
201409
Without going into the basics of sed, here is your sed command broken down:
$!N: If it is not the end of file, append the next line to the pattern space. The two lines will be separated by a newline (\n). At this time your pattern space is 201408\n201409.
/^\(.*\)\n\1$/!P: If the pattern space does not consist of two identical lines separated by a newline (\n), then print up to the first newline (\n). So this will print 201408 to STDOUT. During the second iteration, though, the pattern space will be 201409\n201409, and since it matches the regex, the negated P is skipped, nothing gets printed, and we proceed to the next command.
D: Deletes up to the first newline (\n) and restarts the sed script without reading a new line of input. Remember that during the repeated cycle your pattern space still holds the 201409.
So during the first iteration 201408 gets printed, but 201409 doesn't get printed until the end of file is reached. At that point N no longer runs, the pattern space holds a single 201409 with no newline, the regex fails to match again, and the content gets printed.
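As a quick check of the walkthrough, the one-liner can be run on the sample input and compared with uniq, which also collapses adjacent duplicate lines. This is a sketch; the temporary file and variable names are made up:

```shell
# Sample input from the question.
tmp=$(mktemp)
printf '%s\n' 201408 201409 201409 201409 201409 > "$tmp"

# Single quotes keep the shell from touching $ and ! in the sed script.
deduped=$(sed '$!N; /^\(.*\)\n\1$/!P; D' "$tmp")

# uniq removes adjacent duplicates too, so both results should agree here.
with_uniq=$(uniq "$tmp")

printf '%s\n' "$deduped"
rm -f "$tmp"
```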
If you are inheriting a lot of sed code, I would strongly recommend the sedsed utility, which is written in Python and will help you understand convoluted and cryptic sed that can often become a maintenance nightmare.
Here is a sample run from the sedsed utility (I haven't shown all iterations as it is pretty verbose, but you get the picture). I have added a few comments on what the output really means. Also notice I am using single quotes since I am on Mac (BSD Unix) and not Windows:
$ sedsed.py -d '$!N; /^\(.*\)\n\1$/!P; D' file
PATT:201408$ # This shows your current pattern space
HOLD:$ # This shows your current hold buffer
COMM:$ !N # This shows the command that is going to run
PATT:201408$ # This shows the pattern space after the command has run
201409$
HOLD:$ # This shows the hold buffer after the command has run
COMM:/^\(.*\)\n\1$/ !P # This shows the command being run
201408 # Anything without a <TAG:> is what gets printed to STDOUT
PATT:201408$
201409$
HOLD:$
COMM:D
PATT:201409$
HOLD:$
...
...
...
COMM:$ !N
PATT:201409$
HOLD:$
COMM:/^\(.*\)\n\1$/ !P
201409
PATT:201409$
HOLD:$
COMM:D
I would also suggest that once you understand what your sed commands were written for, you port them to a friendlier scripting language like awk, perl or python.
This will not help you understand the sed, but here is an awk command that prints just the unique lines.
awk '!seen[$0]++' finalish.txt
201408
201409
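One difference is worth seeing before swapping one for the other: the sed one-liner only collapses adjacent duplicates, while the awk version drops every repeat wherever it appears. A sketch with made-up input (a, b, a, a):

```shell
# Made-up input where a duplicate is NOT adjacent to its first occurrence.
sample=$(printf '%s\n' a b a a)

# sed compares neighbouring lines only, so the later "a" survives once.
adjacent_only=$(printf '%s\n' "$sample" | sed '$!N; /^\(.*\)\n\1$/!P; D')

# awk remembers every line it has seen, so only the first "a" survives.
all_repeats=$(printf '%s\n' "$sample" | awk '!seen[$0]++')

printf 'sed:\n%s\nawk:\n%s\n' "$adjacent_only" "$all_repeats"
```

For the sorted sample in the question the two give the same result, so the awk command is a drop-in replacement only when the input is grouped like that.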

remove duplicate lines in a txt document and keep one?

I am not a programmer, but I would like some help to remove duplicate lines in a document and keep only the original lines.
I was trying to do this with some text processors (EditPad Pro), but since my file is more than 1 gigabyte, it always freezes and can't complete the operation.
I know perl is very good at this, but I don't know how to use it, keeping in mind that the file can be over 1 or 2 GB.
example of input lines:
line 1
line 2
line 3
line 1
line 2
line 4
line 1
example of output lines:
line 1
line 2
line 3
line 4
I am sorry if this is very basic, but I really don't know how to proceed; most of the time I use built-in functions. I hope not to annoy anyone with this question.
If you don't mind the lines not being in the original order, you can use this command:
$ sort -u old_file.txt > new_file.txt
The sort will sort your file, and the -u option stands for unique which means that it will only output the first matching line.
Even with very large files, sort may be your best hope.
Preserving the existing order (first time each line is found):
perl -i -wlne'our %uniq; $uniq{$_}++ or print' file.txt
This can also be done effectively using awk: http://awk.freeshell.org/AwkTips
awk '!a[$0]++'
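To make the trade-off between the two approaches concrete, here is a sketch on made-up input whose first-seen order differs from its sorted order. LC_ALL=C is set only so that the sort order is predictable:

```shell
# Made-up input; "pear" comes first but sorts last.
sample=$(printf '%s\n' pear apple pear banana apple)

# sort -u: duplicates removed, original order lost.
sorted_unique=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)

# awk: duplicates removed, first occurrence of each line kept in place.
first_seen=$(printf '%s\n' "$sample" | awk '!a[$0]++')

printf 'sort -u:\n%s\nawk:\n%s\n' "$sorted_unique" "$first_seen"
```

The awk and perl versions keep every distinct line in memory, which may matter for a multi-gigabyte file with few duplicates; sort can typically spill to temporary files instead.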

Alternatives to grep/sed that treat new lines as just another character

Both grep and sed handle input line-by-line and, as far as I know, getting either of them to handle multiple lines isn't very straightforward. What I'm looking for is an alternative or alternatives to these two programs that treat newlines as just another character. Is there any tool that fits these criteria?
The tool you want is awk. It is record-oriented, not line-oriented, and you can specify your record-separator by setting the builtin variable RS. In particular, GNU awk lets you set RS to any regular expression, not just a single character.
Here is an example where awk uses blank lines to separate records. If you show us what data you have, we can help you with it.
cat file
first line
second line
third line

fourth line
fifth line
sixth line

seventh line
eight line

more data
Running awk on this reconstructs the data, using blank lines as the record separator:
awk -v RS= '{$1=$1}1' file
first line second line third line
fourth line fifth line sixth line
seventh line eight line
more data
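PS: RS is not being set to file. RS= sets RS to the empty string, the same as RS="", which puts awk in paragraph mode: records are separated by one or more blank lines.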
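A short sketch of paragraph mode on a made-up three-record file. The variable names are assumptions; the second awk call is there only to show how many records awk sees:

```shell
# Made-up input: three blocks separated by blank lines.
tmp=$(mktemp)
printf 'first line\nsecond line\nthird line\n\nfourth line\nfifth line\n\nmore data\n' > "$tmp"

# With an empty RS each block is one record; $1=$1 rebuilds it with single spaces.
joined=$(awk -v RS= '{$1=$1}1' "$tmp")

# NR in the END block is the number of records read.
record_count=$(awk -v RS= 'END {print NR}' "$tmp")

printf '%s\n(%s records)\n' "$joined" "$record_count"
rm -f "$tmp"
```

In this mode a newline also acts as a field separator, which is why reassigning $1 is enough to put each block on one line.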
1) sed can handle a block of lines together, not only line by line.
In sed, I normally use :loop; $!{N; b loop}; to collect all the lines in the pattern space, delimited by newlines.
Sample:
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
Result (remove the content between the double quotes):
sed -e ':loop; $!{N; b loop}; s/\"[^\"]*\"//g' thegeekstuff.txt
Productivity
Google Search\
Tips
You should read this page (Unix Sed Tutorial: 6 Examples for Sed Branching Operation); it explains in detail how this works.
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-6-examples-for-sed-branching-operation/
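A sketch of the same slurp-then-substitute idea on a smaller made-up input. The script is split across several -e options, which should be equivalent to the one-liner above in GNU sed and tends to be accepted by other sed implementations as well:

```shell
# Made-up input with a quoted string that spans two lines.
tmp=$(mktemp)
printf 'Tips\n"Web Based Time Tracking,\nWeb Based Todo list"\nDone\n' > "$tmp"

# Loop: append lines until the last one, then delete every "..." span,
# including spans that cross line boundaries.
result=$(sed -e ':loop' -e '$!{' -e 'N' -e 'b loop' -e '}' -e 's/"[^"]*"//g' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the whole file ends up in the pattern space, this is best kept to inputs that fit comfortably in memory.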
2) For grep, check whether your grep supports the -z option, with which it no longer has to handle input line by line.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
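A small sketch of the effect, assuming a grep that supports -z (GNU grep does). The input is made up: two NUL-terminated records, the first of which contains a newline:

```shell
# Record 1 is "one<newline>two", record 2 is "three<newline>".
# With -z a record ends at a NUL byte, so the newline inside record 1
# is just another character; -c counts the matching records.
count=$(printf 'one\ntwo\000three\n\000' | grep -zc 'two')

printf 'records containing "two": %s\n' "$count"
```

For a plain text file with no NUL bytes, the whole file becomes a single record, which is what lets a pattern see across line boundaries.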