How to make sed pattern "intelligent"

I have a file like:
None44 DET20_22526;size=4; DET20_39906;size=2; DEX29.h_40767;size=4; DEX27.h_779;size=6757;
Goal:
None44 DET20_22526 DET20_39906 DEX29.h_40767 DEX27.h_779
Simply remove each ";size=**;" part.
The digits after size= range from 1-6757 (at the most).
I have been trying:
sed 's/;size=*;//g'
My limited knowledge of sed and regex got me only this far.
Can someone point out how to either remove all between ;'s including the ;'s
or
How to make my sed realize what I can state in English... but can't code yet :(

You could try :
sed 's/;size=[0-9]*;//g'
What does this regex mean?
s/.../.../g stands for: replace every match of the first expression with the second expression
the first expression, in our case, is ;size=[0-9]*; which should be decomposed as:
the exact string ;size=, followed by
zero or more occurrences of any digit in the range 0-9, followed by
;
the second expression is empty, so the matched part is deleted
the final g is a flag that tells sed to replace every match on the line, and not stop at the first one
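To see the difference in practice, here is a small sketch that runs the original attempt and the corrected pattern against the sample line from the question. The variable names are only for illustration, and the third command is an assumed alternative for the "remove everything between the ;'s" variant that the question also asks about.

```shell
line='None44 DET20_22526;size=4; DET20_39906;size=2; DEX29.h_40767;size=4; DEX27.h_779;size=6757;'

# Original attempt: "=*" means "zero or more = signs", so ";size=4;" never matches.
broken=$(printf '%s\n' "$line" | sed 's/;size=*;//g')

# Corrected: [0-9]* matches the run of digits between "=" and ";".
fixed=$(printf '%s\n' "$line" | sed 's/;size=[0-9]*;//g')

# Alternative: drop anything between two semicolons, whatever it contains.
between=$(printf '%s\n' "$line" | sed 's/;[^;]*;//g')

printf '%s\n' "$broken" "$fixed" "$between"
```

The first result is the unchanged input line; the other two should both be the line from the Goal above.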

Related

getting the first letter of an filtered part in sed

I have a filename e.g. 15736--1_brand-new-image.jpg
My goal is to get the first letter after the _ in this case the b.
With s/\(.*\)\_\(.*\)$/\2/ I am able to extract brand-new-image.jpg
which is partly based on the info found on https://www.oncrashreboot.com/use-sed-to-split-path-into-filename-extension-and-directory
I've already found get first letter of words using sed but fail to combine the two.
To validate my sed statement I've used https://sed.js.org/
How can I combine this with a new sed statement, applied to the part I've filtered, to get the first letter?

With your shown samples could you please try following.
echo "15736--1_brand-new-image.jpg" | sed 's/[^_]*_\(.\).*/\1/'
Explanation: This uses sed's substitution command. The regex matches everything up to the first _, captures the single character that follows in a back reference, and lets .* consume the rest of the line. The whole line is then replaced with the first back reference, i.e. the character right after the first _, which is b in this case.
The annotated version below is for explanation purposes only.
sed ' ##Starting sed program from here.
s/ ##using s to tell sed to perform substitution operation.
[^_]*_\(.\).* ##using regex to match till 1st occurrence of _ then using back reference \(.\) to catch value in temp buffer memory here.
/\1/ ##Substituting whole line with 1st back reference value here which is b in this case.
'
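If this is needed in more than one place, the same substitution can be wrapped in a small shell function. This is only a sketch: the function name first_after_underscore is made up for illustration, and it assumes the file name is passed as the first argument.

```shell
# Hypothetical helper: print the character that follows the first "_" in $1.
first_after_underscore() {
    printf '%s\n' "$1" | sed 's/[^_]*_\(.\).*/\1/'
}

letter=$(first_after_underscore '15736--1_brand-new-image.jpg')
printf '%s\n' "$letter"
```

Note that a name without any _ does not match, so sed prints it back unchanged.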

Using a . or \w could also match _ in case there are 2 consecutive __
If you want to match the first word character without matching the _ you could also use
echo "15736--1_brand-new-image.jpg" | sed 's/[^_]*_\([[:alnum:]]\).*/\1/'
Output
b

This might work for you (GNU sed):
sed -nE 's/^[^_]*_[^[:alpha:]]*([[:alpha:]]).*/\1/p' file
Since this is a filtering-type operation, use the -n option to print only when there is a positive match.
Match the first _ from the start of the line, then discard any non-alpha characters until an alpha character, and finally discard any other characters.
Print the result if there is a match.
N.B. Anchoring the match to the start of the line prevents the result from containing more than one character, e.g. the string 123_456_abc might otherwise result in 4 or 123_a.
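The effect of the anchor can be checked with the string from the note. The sketch below compares an unanchored pattern that insists on a letter right after the _ with the anchored command from this answer; the variable names and the third input are made up for illustration.

```shell
# Unanchored: the match may start mid-line, so the prefix "123_" survives.
unanchored=$(printf '%s\n' '123_456_abc' | sed 's/[^_]*_\([[:alpha:]]\).*/\1/')

# Anchored with ^ and allowed to skip non-letters: only the letter remains.
anchored=$(printf '%s\n' '123_456_abc' |
    sed -nE 's/^[^_]*_[^[:alpha:]]*([[:alpha:]]).*/\1/p')

# With -n and the p flag, a line without a match prints nothing at all.
nomatch=$(printf '%s\n' 'no-underscore.jpg' |
    sed -nE 's/^[^_]*_[^[:alpha:]]*([[:alpha:]]).*/\1/p')

printf '%s\n' "$unanchored" "$anchored" "[$nomatch]"
```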

Using sed to replace a number located between two other numbers

I need to replace a numeric value, that occurs in a specific line of a series of config files in a pattern like this:
string number_1 number_to_replace number_2
I want to obtain something like this:
string number_1 number_replaced number_2
The difficulties I encountered are:
number_1 or number_2 can be equal to number_to_replace, so a simple replacement is not possible.
number_1 and number_2 vary between config files so I don't know them in advance.
The closest attempt I got until now is:
echo "field 4 4 4" | sed 's/\s4\s/3/'
Which outputs:
field34 4
This is close. Since I want to replace the middle number, I added another "\s" to try to use the known fact that the line starts with a character.
echo "field 4 4 4" | sed 's/\s\s4\s/3/'
Which gives:
field 4 4 4
So, nothing is replaced this time. How can I proceed? A somewhat detailed explanation would be ideal, because my knowledge of replacement expressions that involve patterns is nearly zero.
Thanks.

You can do something like below, which matches your exact sequence of digits as in the example. You could replace 3 with any digit of your choice.
sed 's/\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)/\1 3 \3/'
Notice that I've used the POSIX bracket expression to match the whitespace character which should be supported in any variant of sed you are using. Note that \s is supported in only the GNU variants.
The regex matches one or more digits, followed by a whitespace character, then another run of digits, another whitespace character, and a final run of digits. Each run of digits is captured in a group, numbered from \1. Since your intention is to replace the 2nd number, you keep \1 and \3 and put the value of your choice between them.
If the extra escapes make it unreadable, use the -E flag for extended regex support. I've used the default BRE version.
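For comparison, here is a sketch of the BRE command next to an -E variant. The second input line is made up for illustration, and the -E version assumes the leading word does not itself end in digits, since those would be picked up as the first number.

```shell
# BRE form from the answer: three runs of digits, keep the 1st and 3rd.
bre=$(printf '%s\n' 'field 4 4 4' |
    sed 's/\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)/\1 3 \3/')

# ERE form (-E): same idea with fewer backslashes; the middle number
# needs no group because it is never reused in the replacement.
ere=$(printf '%s\n' 'port 8080 25 443' |
    sed -E 's/([0-9]+)[[:space:]]+[0-9]+[[:space:]]+([0-9]+)/\1 3 \2/')

printf '%s\n' "$bre" "$ere"
```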

Unable to use '*' to search/replace -- sed

I want to make all a.b.c.top*.gz mentions to new-word/new-table.
Something like -->
es.fr.en.top20.gz becomes binarised-model/phrase-table
I did this :
sed -i 's/es\.fr\.en\.top*\.gz/binarised-model\/phrase-table/g' top*/mert-work/moses.ini
I had initially not used backslash before periods, but, once it did not work, I thought maybe period is tricky.
But, it does not seem to replace anything. What's going wrong ?
Thanks !

Using * as a wildcard is correct for bash globbing, but not if you work with regex, which is the case when using sed. Instead of *, try .*.
In regex, * means match the preceding character any number of times. The wildcard character is ., so .* matches any number of any characters.
If you know that the character you want to match is always a number, it's safer to use [0-9]*. If you even know how many characters this number will have, then you can even use e.g. [0-9]\{2\} to match exactly two numerals.

Sed uses regular expressions, not shell globbing. That means that (1) . matches any single character except a newline, so you are right to escape them to match a literal dot, and (2) * matches zero or more of the token preceding it, here that's p. You need
sed -i 's/es\.fr\.en\.top.*\.gz/binarised-model\/phrase-table/g' top*/mert-work/moses.ini
#                        ^
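The three variants discussed in these answers can be compared side by side. This is a sketch on a made-up input line that mentions two file names; it uses | as the s delimiter, which is only a convenience so that the / in the replacement needs no escaping.

```shell
line='es.fr.en.top20.gz and es.fr.en.top30.gz'

# Glob-style attempt: "p*" is "zero or more p", so "top20.gz" never matches.
glob=$(printf '%s\n' "$line" |
    sed 's|es\.fr\.en\.top*\.gz|binarised-model/phrase-table|g')

# ".*" works but is greedy: it runs to the LAST ".gz" on the line.
greedy=$(printf '%s\n' "$line" |
    sed 's|es\.fr\.en\.top.*\.gz|binarised-model/phrase-table|g')

# "[0-9]*" stays inside one file name, so both names are replaced.
digits=$(printf '%s\n' "$line" |
    sed 's|es\.fr\.en\.top[0-9]*\.gz|binarised-model/phrase-table|g')

printf '%s\n' "$glob" "$greedy" "$digits"
```

When a line holds a single file name, .* and [0-9]* give the same result; the difference only shows up when several names share a line.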

SED search and replace substring in a database file

To all,
I have spent a lot of time searching for a solution to this but cannot find it.
Just for a background, I have a text database with thousands of records. Each record is delineated by :
"0 #nnnnnn# Xnnn" // no quotes
Each record has many fields, each on a line of its own, but the field in which I want to search and replace a substring looks like this (notice the spaces):
" 1 X94 User1.faculty.ventura.ca" // no quotes
I want to use sed to change the substring ".faculty.ventura.ca" to ".students.moorpark.ut", changing nothing else on the line, globally for ALL records.
I have tested many things with negative results.
How can this be done ?
Thank You for the assistance.
Bob Perez (robertperez1957#gmail.com)

If I understand you correctly, you want this:
sed 's/1 X94 \(.*\).faculty.ventura.ca/1 X94 \1.students.moorpark.ut/' mydatabase.file
This will replace all records of the form 1 X94 XXXXXX.faculty.ventura.ca with 1 X94 XXXXXX.students.moorpark.ut.
Here are the details of what it all does:
The '' let you have spaces and other messes in your script.
s/ means substitute
1 X94 \(.*\).faculty.ventura.ca is what you'll be substituting. The \(.*\) stores anything in that regular expression for use in the replacement
1 X94 \1.students.moorpark.ut is what to replace the thing you found with. \1 is filled in with the first thing that matched \(.*\). (You can have multiple of those in one line, and the next one would then be \2.)
The final / just tells sed that you're done. If your database doesn't have linefeeds to separate its records, you'll want to end with /g, to make this change multiple times per line.
mydatabase.file should be the filename of your database.
Note that this will output to standard out. You'll probably want to add
> mynewdatabasefile.name
to the end of your line, to save all the output in a file. (It won't do you much good on your terminal.)
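A slightly different way to get the same effect is to let an address pick the lines and keep the substitution itself short. The sketch below is an alternative to the capture-group command above, not a correction of it; the two sample records are made up in the shape the question describes, and the periods are written as \. so they match literal dots only.

```shell
# Two sample lines in the shape described in the question (contents are made up).
records=' 1 X94 User1.faculty.ventura.ca
 1 X95 User2.faculty.ventura.ca'

# /^ *1 X94 / selects the lines to touch; \. matches a literal period only.
changed=$(printf '%s\n' "$records" |
    sed '/^ *1 X94 /s/\.faculty\.ventura\.ca/.students.moorpark.ut/')

printf '%s\n' "$changed"
```

Only the 1 X94 line is rewritten; the other line passes through untouched.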
Edit, per your comments
If you want to replace 1 F94 bperez.students.Napvil.NCC to 1 F94 bperez.JohnSmith.customer, you can use another set of \(.*\), as:
sed 's/1 X94 \(.*\).\(.*\).Napvil.NCC/1 X94 \1.JohnSmith.customer/' 251-2.txt
This is similar to the above, except that it matches two stored parameters. In this example, \1 evaluates to bperez and \2 evaluates to students. We match \2, but don't use it in the replace part of the expression.
You can do this with any number of stored parameters. (Sed probably has some limit, but I've never hit a sufficiently complicated string to hit it.) For example, we could make the sed script be 's/\(.\) \(...\) \(.*\).\(.*\).\(.*\).\(.*\)/\1 \2 \3.JohnSmith.customer/', and this would make \1 = 1, \2 = X94, \3 = bperez, \4 = students, \5 = Napvil and \6 = NCC, and we'd ignore \4 through \6. This is actually not the best answer though; it just shows it can be done. It's not the best because it's uglier, and also because it's more accepting. It would then do a find and replace on a line like 2 Z12 bperez.a.b.c, which is presumably not what you want. The find query I put in the edit is as specific as possible while still being general enough to suit your tasks.
Another edit!
You know how I said "be as specific as possible"? Due to the . character being special, I wasn't. In fact, I was very generic. The . means "match any character at all," instead of "match a period". Regular expressions are "greedy", matching as much as they can, so in \(.*\).\(.*\) the first \(.*\) (which says, "take 0 to many of any character and save it as a match for later") will always grab as much as it can.
Try using:
sed 's/1 X94 \(.*\)\.\(.*\).Napvil.NCC/1 X94 \1.JohnSmith.customer/' 251-2.txt
That extra \ acts as an escape sequence, and changes the . from "any character" to "just the period". FYI, since I don't (but should) escape the other periods, technically sed would consider 1 X94 XXXX.StdntZNapvilQNCC as a valid match. Since . means any character, a Z or a Q there would be considered a fit.
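Both effects, greediness and the unescaped period, are easy to observe on small inputs. The following sketch uses made-up bracket markers to show what each group captured, and then applies a fully escaped version of the pattern (an assumption about what the stricter command would look like) to the look-alike line mentioned above and to a real one.

```shell
# Unescaped ".": the first group is greedy and the dot eats the final "s".
loose=$(printf '%s\n' 'bperez.students' | sed 's/\(.*\).\(.*\)/[\1][\2]/')

# Escaped "\.": the groups now split at the literal period.
strict=$(printf '%s\n' 'bperez.students' | sed 's/\(.*\)\.\(.*\)/[\1][\2]/')

# Fully escaped pattern: rejects the look-alike line, rewrites the real one.
pattern='s/1 X94 \(.*\)\.\(.*\)\.Napvil\.NCC/1 X94 \1.JohnSmith.customer/'
fake=$(printf '%s\n' '1 X94 bperez.StdntZNapvilQNCC' | sed "$pattern")
real=$(printf '%s\n' '1 X94 bperez.students.Napvil.NCC' | sed "$pattern")

printf '%s\n' "$loose" "$strict" "$fake" "$real"
```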

The following tutorial helped me
sed - replace substring in file
Try the same with the -i option to edit the file in place:
sed -i 's/unix/linux/' file.txt

how to use sed/awk to remove words with multiple pattern count

I have a file of string records where one of the fields - delimited by "," - can contain one or more "-" inside it.
The goal is to delete the field value if it contains more than two "-".
I am trying to recover my past knowledge of sed/awk but can't make much headway.
==========
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
==========
expected outcome:
info,whitepaper,Data-Centers
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers
thanks

Try
sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g'
or if you're into slashes
sed 's/\(^\|,\)\([^,-]\+-\)\{3,\}[^,]\+\(,\|$\)/\3/g'
Explanation:
I'm using the most basic sed command: substitution. The syntax is: s/pattern/replacement/flags.
Here pattern is (^|,)([^,-]+-){3,}[^,]+(,|$), replacement is \3, flags is g.
The g flag means global replacement (all matching parts are replaced, not only the first in line).
In pattern:
brackets () create a group. Somewhat like in math. They also allow to refer to a group with a number later.
^ and $ mean beginning and end of the string.
| means "or", so (^|,) means "comma or beginning of the string".
square brackets [] mean a character class, ^ inside means negation. So [^,-] means "anything but comma or hyphen". Note that usually the hyphen has a special meaning in character classes: [a-z] means all lowercase letters. But here it's just a hyphen because it comes last in the class and so cannot form a range.
+ after an expression means "match it 1 or more times" (like * means match it 0 or more times).
{N} means "match it exactly N times". {N,M} is "from N to M times". {3,} means "three times or more". + is equivalent to {1,}.
So this is it. The replacement is just \3. This refers to the third group in (), in this case (,|$). This will be the only thing left after the substitution.
P.S. the -r option just changes what characters need to be escaped: without it, (, ), {, }, | and + are treated as regular characters unless you escape them with \. Conversely, to match a literal ( with the -r option you'll need to escape it.
P.P.S. Here's a reference for sed. man sed is your friend as well.
Let me know if you have further questions.
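As a quick check, the command can be run against the three sample records from the question. This sketch uses -E, which both GNU and BSD sed accept in place of -r; the variable names are only for illustration.

```shell
input='info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner'

# A field is dropped when it holds three or more "word-" chunks, that is,
# more than two hyphens; \3 puts back the trailing comma, if there was one.
result=$(printf '%s\n' "$input" |
    sed -E 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g')

printf '%s\n' "$result"
```

The output should match the expected outcome listed in the question.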

You could try perl instead of sed or awk:
perl -F, -lane 'print join ",", grep { !/-.*-.*-/ } @F' < file.txt
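Since the question also mentions awk, the same field-filtering idea can be sketched there: split on commas, count the hyphens in each field, and keep only fields with at most two. This is an assumed equivalent rather than something from the answers above; it relies on gsub() returning the number of replacements it made, and the variable names are made up.

```shell
input='info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner'

# gsub() on a copy of the field returns how many "-" it contains.
result=$(printf '%s\n' "$input" | awk -F, '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
        field = $i
        if (gsub(/-/, "-", field) <= 2) {   # keep fields with at most two "-"
            out = out sep $i
            sep = ","
        }
    }
    print out
}')

printf '%s\n' "$result"
```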

This might work for you:
sed 's/,\{,1\}[^,-]*\(-[^,]*\)\{3,\}//g' file

sed 's/\(^\|,\)\([^,]*-\)\{3\}[^,]*\(,\|$\)//g'
This should work in more cases:
sed 's/,$/\n/g;s/\(^\|,\|\n\)\([^,\n]*-\)\{3\}[^,\n]*\(,\|\n\|$\)/\3/g;s/,$//;s/\n/,/g'