Substituting everything except an ID with sed

I want to keep the first id and remove everything afterwards with sed.
My line looks like
CAM_READ_0623233309 /library_id=CAM_LIB_002149 /sample_id=CAM_SMPL_003380 raw_id=G9ALM7U02F5HAW length=383 /IP_notice=?This genetic information downloaded from CAMERA may be considered to be part of the genetic patrimony of Denmark, the country from which the sample was obtained. Users of this information agree to: 1) acknowledge Denmark as the country of origin in any country where the genetic information is presented and 2) contact the CBD focal point identified on the CBD website (http://www.cbd.int/countries/) if they intend to use the genetic information for commercial purposes.?
and I just want:
CAM_READ_0623233309

Capturing specific sequence:
sed -r 's/.*(CAM_READ_[0-9]+).*/\1/' input.txt
or
sed -e 's/.*\(CAM_READ_[0-9]\+\).*/\1/' input.txt
Capturing everything at the front, except whitespace characters:
sed -r 's/^(\S+).*/\1/' input.txt
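Both -r and \S are GNU extensions. As a rough sketch, the same capture can be written in POSIX basic regex syntax, assuming the ID is always the first whitespace-delimited field on the line (the sample line is shortened from the question and the variable names are only illustrative):

```shell
# Sample line shortened from the question; variable names are illustrative.
line='CAM_READ_0623233309 /library_id=CAM_LIB_002149 /sample_id=CAM_SMPL_003380'

# POSIX basic regex: capture the leading run of non-whitespace characters
# and drop the rest of the line.
first_id=$(printf '%s\n' "$line" | sed 's/^\([^[:space:]]*\).*/\1/')
printf '%s\n' "$first_id"
```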

Nice and easy sed statement:
sed 's/ .*$//'
s          substitute
/ .*$/     match the first space and everything after it on the line
//         replace it with nothing
Command example:
echo "CAM_READ_0623233309 /library_id=CAM_LIB_002149 blah blah" | sed 's/ .*$//'
Command example output:
CAM_READ_0623233309
Of course, if the file mixes several different kinds of lines, this will not work as-is. Your question does not indicate that, though.
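If the file does contain other kinds of lines, one possible variation is to print only the lines where the substitution succeeds. This is a sketch under the assumption that every wanted line starts with CAM_READ_ followed by digits; the extra input lines are made up for the demonstration:

```shell
# Hypothetical input: only some lines start with a CAM_READ_ identifier.
input=$(printf '%s\n' \
    'CAM_READ_0623233309 /library_id=CAM_LIB_002149 length=383' \
    '# a comment line that should be skipped' \
    'CAM_READ_0000000001 /library_id=CAM_LIB_000001 length=100')

# -n suppresses the default output; the p flag prints a line only when the
# substitution matched, i.e. when the line begins with an ID.
ids=$(printf '%s\n' "$input" | sed -n 's/^\(CAM_READ_[0-9][0-9]*\).*/\1/p')
printf '%s\n' "$ids"
```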


Why is my sed multiline find-and-replace not working as expected?

I have a simple sed command that I am using to replace everything between (and including) //thistest.com-- and --thistest.com with nothing (remove the block all together):
sudo sed -i "s#//thistest\.com--.*--thistest\.com##g" my.file
The contents of my.file are:
//thistest.com--
zone "awebsite.com" {
type master;
file "some.stuff.com.hosts";
};
//--thistest.com
As I am using # as my delimiter for the regex, I don't need to escape the / characters. I am also properly (I think) escaping the . in .com. So I don't see exactly what is failing.
Why isn't the entire block being replaced?
You have two problems:
First, sed doesn't do multiline pattern matches, at least not the way you're expecting it to. However, you can use a multiline address range as an alternative.
Second, depending on your version of sed, you may need to escape alternate delimiters, especially if you aren't using them solely as part of a substitution expression.
So, the following will work with your posted corpus in both GNU and BSD flavors:
sed '\#^//thistest\.com--#, \#^//--thistest\.com# d' /tmp/corpus
Note that in this version, we tell sed to match all lines between (and including) the two patterns. The opening delimiter of each address pattern is properly escaped. The command has also been changed to d for delete instead of s for substitute, and some whitespace was added for readability.
I've also chosen to anchor the address patterns to the start of each line. You may or may not find that helpful with this specific corpus, but it's generally wise to do so when you can, and doesn't seem to hurt your use case.
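To see the address range at work, here is a small self-contained run. The block between the markers is the posted corpus; the first and last lines are invented here so that something is left over after the delete:

```shell
# Temporary copy of the posted corpus, padded with two made-up lines
# (the options line and the kept.example zone are not from the question).
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
options { directory "/var/named"; };
//thistest.com--
zone "awebsite.com" {
type master;
file "some.stuff.com.hosts";
};
//--thistest.com
zone "kept.example" { type master; };
EOF

# Delete every line from the opening marker through the closing marker.
remaining=$(sed '\#^//thistest\.com--#,\#^//--thistest\.com#d' "$corpus")
printf '%s\n' "$remaining"
rm -f "$corpus"
```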
# markers on their own lines, removed with a single s// on the whole file
# (needs at least one line before the opening marker)
sed -n -e 'H;${x;s#^\(.\)\(.*\)\1//thistest.com--.*\1//--thistest.com#\2#;p}' YourFile
# markers on their own lines, removed with an address range
sed -e '\#//thistest.com--#,\#//--thistest.com# d' YourFile
# markers anywhere in the text (separated by CR, CR/LF, ";" or all on one line), removed with s//
sed -n -e '1h;1!H;${x;s#//thistest.com--.*//--thistest.com##;p}' YourFile
Note:
The s// forms assume there is only one thistest section per file; if there are several, they remove everything from the first opening marker to the last closing marker.
The s// forms are not suited to huge files, because they load the entire file into memory.
The address-range form cannot select a section that starts and ends on the same line: it looks for the first pattern to start and for the second pattern on a following line to stop. It is, however, very efficient on big files and on files with several sections.
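The difference between the two approaches shows up as soon as a file holds more than one section. A minimal sketch with a made-up file (the keep-N lines and single-letter section bodies are placeholders):

```shell
# Hypothetical file with two marked sections and lines to keep around them.
sample=$(mktemp)
printf '%s\n' keep-1 '//thistest.com--' a '//--thistest.com' \
              keep-2 '//thistest.com--' b '//--thistest.com' keep-3 > "$sample"

# Address range: each section is deleted on its own, so keep-2 survives.
by_range=$(sed '\#^//thistest\.com--#,\#^//--thistest\.com#d' "$sample")

# Whole file gathered in the pattern space, then one greedy s//: everything
# from the first opening marker to the last closing marker is removed,
# including keep-2. An empty line is left where the block used to be.
by_slurp=$(sed -n '1h;1!H;${x;s#//thistest\.com--.*//--thistest\.com##;p;}' "$sample")

printf 'range:\n%s\nslurp:\n%s\n' "$by_range" "$by_slurp"
rm -f "$sample"
```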

Insert specific lines from file before first occurrence of pattern using Sed

I want to insert a range of lines from a file, say something like 210,221r before the first occurrence of a pattern in a bunch of other files.
As I am clearly not a GNU sed expert, I cannot figure how to do this.
I tried
sed '0,/pattern/{210,221r file
}' bunch_of_files
But apparently file is read from line 210 to EOF.
Try this:
sed -r 's/(FIND_ME)/PUT_BEFORE\1/' test.text
-r enables extended regular expressions
the string you are looking for ("FIND_ME") is inside parentheses, which creates a capture group
\1 puts the captured text into the replacement.
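About your second question: You can read the replacement from a file like this*:
sed -r "s/(FIND_ME)/$(cat REPLACEMENT.TXT)\1/" test.text
This works as long as REPLACEMENT.TXT holds a single line and you escape the characters that are special to sed (/, & and \) inside it beforehand, for example with another sed call.
*= the command substitution is expanded by the shell, not by sed, so the script must be in double quotes rather than single quotes. It works in bash.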
In https://stackoverflow.com/a/11246712/4328188 CodeGnome gave some "sed black magic" :
In order to insert text before a pattern, you need to swap the pattern space into the hold space before reading in the file. For example:
sed '/pattern/ {
h
r file
g
N
}' in
However, to read specific lines from file, one may have to use a two-calls solution similar to dummy's answer. I'd enjoy knowing of a one-call solution if it is possible though.
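For what it's worth, a multi-call approach can be sketched with plain POSIX sed: print the part of the target before the first match, then the wanted line range, then the rest of the target. The helper name insert_before_first is made up for this sketch, and the demo uses lines 2 to 3 of a tiny file instead of 210 to 221. It assumes the pattern contains no "/" and occurs at least once in the target; if it never matches, the range simply ends up after the last line.

```shell
# insert_before_first PATTERN SOURCE FIRST LAST TARGET
# Hypothetical helper: prints TARGET with lines FIRST..LAST of SOURCE
# inserted before the first line matching PATTERN.
insert_before_first() {
    pattern=$1; source_file=$2; first=$3; last=$4; target=$5
    sed "/$pattern/,\$d" "$target"               # lines before the first match
    sed -n "${first},${last}p" "$source_file"    # the requested line range
    sed -n "/$pattern/,\$p" "$target"            # first match through end of file
}

# Small demo files standing in for "file" and one of the "bunch_of_files".
src=$(mktemp); tgt=$(mktemp)
printf '%s\n' s1 s2 s3 s4 > "$src"
printf '%s\n' alpha 'pattern one' beta 'pattern two' > "$tgt"

result=$(insert_before_first pattern "$src" 2 3 "$tgt")
printf '%s\n' "$result"
rm -f "$src" "$tgt"
```

To apply it to many files, one could loop over them, redirect the output to a temporary file and move that over the original.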

Manipulate characters with sed

I have a list of usernames and I would like to add possible combinations to it.
Example: let's say this is the list I have
johna
maryb
charlesc
Is there a way to use sed to edit it so that it looks like
ajohn
bmary
ccharles
And also
john_a
mary_b
charles_c
etc...
Can anyone help me get the commands to do so? Any explanation would be awesome as well. I would like to understand how it works if possible. I usually get confused when I see things like 's/\.(.*.... without knowing what some of those mean... anyway thanks in advance.
EDIT ... I changed the usernames
sed 's/\(user\)\(.\)/\2\1/'
Breakdown:
sed s/string/replacement/ will replace the first instance of string on each line with replacement (append the g flag, s/string/replacement/g, to replace every instance).
Then, string in that sed expression is \(user\)\(.\). This can be broken down into two
parts: \(user\) and \(.\). Each of these is a capture group - bracketed by \( \). That means that once we've matched something with them, we can reuse it in the replacement string.
\(user\) matches, surprisingly enough, the user part of the string. \(.\) matches any single character - that's what the . means. Then, you have two captured groups - user and a (or b or c).
The replacement part just uses these to recreate the pattern a little differently. \2\1 says "print the second capture group, then the first capture group". Which in this case, will print out auser - since we matched user and a with each group.
ex:
$ echo "usera
> userb
> userc" | sed "s/\(user\)\(.\)/\2\1/"
auser
buser
cuser
You can change the \2\1 to use any string you want - ie. \2_\1 will give a_user, b_user, c_user.
Also, in order to match any preceding string (not just "user"), just replace the \(user\) with \(.*\). Ex:
$ echo "marya
> johnb
> alfredc" | sed "s/\(.*\)\(.\)/\2\1/"
amary
bjohn
calfred
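Applied to the names from the question, the same two capture groups cover both requested forms. A short sketch (anchoring with ^ and $ is an addition here to make the intent explicit; the variable names are illustrative):

```shell
# The three usernames from the question.
names=$(printf '%s\n' johna maryb charlesc)

# \1 = everything but the last character, \2 = the last character.
prefixed=$(printf '%s\n' "$names" | sed 's/^\(.*\)\(.\)$/\2\1/')
underscored=$(printf '%s\n' "$names" | sed 's/^\(.*\)\(.\)$/\1_\2/')

printf '%s\n' "$prefixed" "$underscored"
```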
Here's a partial answer to what is probably the easy part. To use sed to change usera to user_a you could use:
sed 's/user/user_/' temp
where temp is the name of the file that contains your initial list of usernames. How this works: it finds the first instance of "user" on each line and replaces it with "user_".
Similarly for your dot example:
sed 's/user/user./' temp
will replace the first instance of "user" on each line with "user."
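The "first instance on each line" behaviour is easy to check with a line that contains the string twice. A small sketch (the sample line is made up):

```shell
# A made-up line with two occurrences of "user".
line='usera userb'

# Without a flag only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/user/user_/')
# With the g flag every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/user/user_/g')

printf '%s\n' "$first_only" "$every_match"
```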
Sed does not offer non-greedy regex, so I suggest perl:
perl -pe 's/(.*?)(.)$/$2$1/g' file
ajohn
bmary
ccharles
perl -pe 's/(.*?)(.)$/$1_$2/g' file
john_a
mary_b
charles_c
That way you don't need to know the username beforehand.
Simple solution using awk
awk '{a=$NF;$NF="";$0=a$0}1' FS="" OFS="" file
ajohn
bmary
ccharles
and
awk '{a=$NF;$NF="";$0=$0"_" a}1' FS="" OFS="" file
john_a
mary_b
charles_c
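Setting FS to the empty string makes every character its own field (gawk does this; an empty FS is not guaranteed to behave this way in every awk). You can then manipulate the fields easily.
There is no need for capturing groups etc., just plain field swapping.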
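Where an empty FS is not available, the same swap can be sketched with substr and length, which every awk provides. This assumes single-byte usernames such as the ones in the question; the variable names are illustrative:

```shell
# The three usernames from the question.
names=$(printf '%s\n' johna maryb charlesc)

# substr($0, n) is the last character, substr($0, 1, n - 1) is the rest.
prefixed=$(printf '%s\n' "$names" |
    awk '{ n = length($0); print substr($0, n) substr($0, 1, n - 1) }')
underscored=$(printf '%s\n' "$names" |
    awk '{ n = length($0); print substr($0, 1, n - 1) "_" substr($0, n) }')

printf '%s\n' "$prefixed" "$underscored"
```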
This might work for you (GNU sed):
sed -r 's/^([^_]*)_?(.)$/\2\1/' file
This matches any characters other than underscores (in the first back reference (\1)), a possible underscore and the last character (in the second back reference (\2)) and swaps them around.
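Because the underscore is optional, the expression accepts both the plain and the underscored spelling. A quick sketch with a mixed list (the list is made up; -E is used here in place of -r, since both GNU and BSD sed accept it):

```shell
# Made-up list mixing plain and underscored usernames.
mixed=$(printf '%s\n' johna mary_b charles_c)

# ([^_]*) takes the name, _? swallows an optional underscore,
# (.) takes the final character; the replacement swaps the two groups.
swapped=$(printf '%s\n' "$mixed" | sed -E 's/^([^_]*)_?(.)$/\2\1/')
printf '%s\n' "$swapped"
```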

Separating every line in a file at specific points

I have a dictionary file formatted like this:
A B [C] D
Where A is a word (with no spaces), B is another word (with no spaces inside it), C is the pronunciation (there are spaces here), and D is the definition expressed in words (there are spaces, and a variety of symbols).
I wish to separate it into 4 parts, like this:
A####B####C####D
In this way, the first space is converted to ####, the first [ is converted to ####, and the first ] is converted to ####. This will allow easy import into a spreadsheet as a CSV (####'s serve as the commas).
Can this be achieved with awk or another tool in BASH?
Update:
Here are some samples:
一千零一夜 一千零一夜 [Yi1 qian1 ling2 yi1 ye4] /The Book of One Thousand and One Nights/
灰姑娘 灰姑娘 [Hui1 gu1 niang5] /Cinderella/a sudden rags-to-riches celebrity/
雪白 雪白 [xue3 bai2] /snow white/
Would be converted to:
一千零一夜####一千零一夜 ####Yi1 qian1 ling2 yi1 ye4#### /The Book of One Thousand and One Nights/
灰姑娘####灰姑娘 ####Hui1 gu1 niang5#### /Cinderella/a sudden rags-to-riches celebrity/
雪白####雪白 ####xue3 bai2#### /snow white/
Consider that anything might appear after the third set of ####'s, including more spaces, [, etc., however, before the third ####, everything is consistent in format.
I think sed will be easier:
sed -e 's/ /####/' -e 's/ \[/####/' -e 's/] /####/' infile > outfile
By default (i.e. if you don't specify the g modifier at the end) substitutions only work once per line.
Or, if you want to do it in-place:
sed -i -e 's/ /####/' -e 's/ \[/####/' -e 's/] /####/' infile
(but not all versions of sed support -i, and it overwrites your input file; BSD/macOS sed also expects a backup suffix after -i, e.g. -i '' for none)
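Here is the three-step substitution run on one of the sample entries, as a sketch. Note that the opening bracket is written as \[ in the second expression, because a bare [ would start a bracket expression; the variable names are illustrative:

```shell
# One sample entry from the question.
entry='雪白 雪白 [xue3 bai2] /snow white/'

# Each expression replaces only its first match on the line:
# the first space, then " [", then "] ".
converted=$(printf '%s\n' "$entry" |
    sed -e 's/ /####/' -e 's/ \[/####/' -e 's/] /####/')
printf '%s\n' "$converted"
```

This drops the spaces around the brackets, which gives the A####B####C####D layout described in the question.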

sed - inserting a comma after every 4th line in plain text file

I am trying to insert a comma after the values on line 1, 5, 9, etc using sed,
sed '0-4 s/$/,/' in.txt > in2.txt
For some reason this isn't working so I was wondering if anyone has any solutions doing this using awk, sed, or any other methods.
The error I am getting is
sed: 1: "0-4 s/$/,/": invalid command code -
Currently my data looks like this:
City
Address
Zip Code
County
and I was trying to format it like this
City,
Address
Zip Code
County
Much Appreciated.
0-4 indeed is not well-formed sed syntax. I would use awk for this, but it is easy to do it with either.
sed 's/$/,/;n;n;n' file
which substitutes one line and prints it, then prints the next three lines without substitution, then starts over from the beginning of the script; or
awk 'NR % 4 == 1 {sub(/$/,",")} {print}' file
which does the substitution if the line number modulo 4 is 1, then prints unconditionally.
Sed's addressing modes are sometimes a tad disappointing; there is no standard way to calculate line offsets, relative or in reference to e.g. the end of the file. Of course, awk is more complex, but if you can only learn one or the other, definitely go for awk. (Or in this day and age, Python or Perl -- a much better investment.)
This might work for you (GNU sed):
sed '1~4s/$/,/' file
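A quick way to compare the portable variants is to run them on two records. This is only a sketch: the second record is made up, and the awk line appends the comma by concatenation rather than with sub, which should be equivalent here:

```shell
# Two four-line records; the second one is invented for the demo.
records=$(printf '%s\n' City Address 'Zip Code' County \
                        City2 Address2 'Zip Code2' County2)

# sed: edit one line, then pass the next three through untouched.
with_sed=$(printf '%s\n' "$records" | sed 's/$/,/;n;n;n')
# awk: edit lines whose number is 1 modulo 4, print everything.
with_awk=$(printf '%s\n' "$records" | awk 'NR % 4 == 1 { $0 = $0 "," } { print }')

printf '%s\n' "$with_sed"
```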