Sed - Copy function arguments and comments

A C file has hundreds of lines that call strlcpy. I have to copy the first parameter and add it as a third argument to strlcpy, wrapped in sizeof. E.g. p->account becomes sizeof(p->account).
Input
strlcpy(p->account,gettoken(NULL,&plast)); //Set Account Information
strlcpy(p->balance,gettoken(NULL,&plast));
strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),'0')); /* YYYYMMDD */
strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),'0')); /* YYYYMMDD */
strlcpy(p->status,gettoken(NULL,&plast));
Expected output (copy the first parameter and add it as a third argument, passed as the parameter of sizeof()):
strlcpy(p->account,gettoken(NULL,&plast),sizeof(p->account)); //Set Account Information
strlcpy(p->balance,gettoken(NULL,&plast),sizeof(p->balance));
strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),'0'),sizeof(p->startDate)); /* YYYYMMDD */
strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),'0'),sizeof(p->endDate)); /* YYYYMMDD */
strlcpy(p->status,gettoken(NULL,&plast),sizeof(p->status));
Current output (incorrect result)
sed 's/^\([^\s]*strlcpy[^(]*\)\(([^,]*\),\([^)]*[^)][^;]\).*/\1\2,\3,sizeof\2));/' kkk1.txt
strlcpy(p->account,gettoken(NULL,&plast),sizeof(p->account));
strlcpy(p->balance,gettoken(NULL,&plast),sizeof(p->balance));
strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),sizeof(p->startDate));
strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),sizeof(p->endDate));
strlcpy(p->status,gettoken(NULL,&plast),sizeof(p->status));
Lines 1, 3, and 4 lose the comments at the end of the line.
Lines 3 and 4: in skipchr(gettoken(NULL,&plast),'0'), the '0' parameter and the closing parenthesis of skipchr() are not copied into the result.
Please guide me to the correct sed command. Thanks in advance.

In general, such a task is beyond the power of sed. That's because C has an (approximately) context-free grammar, which cannot be parsed using only regular expressions. Regular grammars cannot describe arbitrarily nested parentheses, for example.
However, if you have sufficient constraints on your inputs, you can use sed for your specific, constrained, source code. In this case, it appears that we can make these assumptions:
Each relevant statement is on a single line
There is only one statement on each relevant line.
The first argument contains no commas.
There are no string literals or comments that might accidentally match.
Given these constraints, we want to match:
The strlcpy name and opening parenthesis (with optional space between): \bstrlcpy\s*\(
The first argument, up to the first comma: [^,]+
The rest of the arguments: ,.+
The final closing parenthesis and semicolon: \)\s*;
We then want to substitute this with the same text, except that sizeof applied to group 2 is inserted between groups 3 and 4 in the replacement: \1\2\3, sizeof \2\4.
Putting this together into a one-line GNU sed script:
#!/bin/sed -rf
s/\b(strlcpy\s*\()([^,]+)(,.+)(\)\s*;)/\1\2\3, sizeof \2\4/
Feeding your sample code through this gives the desired output:
strlcpy(p->account,gettoken(NULL,&plast), sizeof p->account); //Set Account Information
strlcpy(p->balance,gettoken(NULL,&plast), sizeof p->balance);
strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),'0'), sizeof p->startDate); /* YYYYMMDD */
strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),'0'), sizeof p->endDate); /* YYYYMMDD */
strlcpy(p->status,gettoken(NULL,&plast), sizeof p->status);
(Note: I didn't include the unnecessary parentheses around the argument to sizeof, as that makes it look like applying sizeof to a type rather than to an expression; if you feel very strongly that you want them, it's not difficult to do. But I don't encourage it.)
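For readers who do want the parenthesized form shown in the question, here is a minimal sketch. It assumes GNU sed (for -r, \b and \s) and the single-line constraints listed above; the function name add_sizeof is made up for illustration.

```shell
# Hypothetical helper: append sizeof(<first argument>) to each strlcpy call.
# Everything after the final ");" (such as a trailing comment) is untouched.
add_sizeof() {
    sed -r 's/\b(strlcpy\s*\()([^,]+)(,.+)(\)\s*;)/\1\2\3,sizeof(\2)\4/'
}

printf '%s\n' \
    "strlcpy(p->account,gettoken(NULL,&plast)); //Set Account Information" \
    "strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),'0')); /* YYYYMMDD */" |
    add_sizeof
# strlcpy(p->account,gettoken(NULL,&plast),sizeof(p->account)); //Set Account Information
# strlcpy(p->startDate,skipchr(gettoken(NULL,&plast),'0'),sizeof(p->startDate)); /* YYYYMMDD */
```

The greedy .+ in group 3 is what carries the nested skipchr(...) call and its '0' argument along: it backs off only as far as the last ");" on the line.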

$ str="strlcpy(p->account,gettoken(NULL,&plast));"
$ sed -re '/strlcpy/{s/(\([^,]*)([^\)]*\))/\1\2,sizeof\1\)/}' <<< "$str"
strlcpy(p->account,gettoken(NULL,&plast),sizeof(p->account));
Here's a brief explanation:
/strlcpy/: select the lines that match "strlcpy" and apply the script in braces to them.
(...): \1 and \2 refer to the text matched by the first and second parenthesized groups. Because of the -r option, grouping parentheses do not need to be escaped.
Parentheses that should match literally in the pattern must be escaped instead, as \( and \).
Finally, to edit the files in place, add the -i option.
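Before adding -i, it may be worth checking the expression against the other sample lines. This is a sketch assuming GNU sed; the function name append_sizeof is only illustrative. Because [^\)]* stops at the first closing parenthesis, the expression seems to suit calls whose second argument is a single un-nested call, while the skipchr(...) lines come out differently:

```shell
# Hypothetical wrapper around the command shown above.
append_sizeof() {
    sed -re '/strlcpy/{s/(\([^,]*)([^\)]*\))/\1\2,sizeof\1\)/}'
}

# Simple call: sizeof lands at the end, as intended.
echo 'strlcpy(p->status,gettoken(NULL,&plast));' | append_sizeof
# strlcpy(p->status,gettoken(NULL,&plast),sizeof(p->status));

# Nested call: sizeof is inserted before the '0' argument rather than after it.
echo "strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),'0'));" | append_sizeof
# strlcpy(p->endDate,skipchr(gettoken(NULL,&plast),sizeof(p->endDate),'0'));
```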

Try this:
sed 's/^\(\s*strlcpy(\)\([^,]\+\)\(,.*\)\();.*\)$/\1\2\3,sizeof(\2)\4/'

Related

Add words at beginning and end of a FASTA header line with sed

I have the following line:
>XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGTCAAATCAATGATATGATTTTATCTCTTCTTAGTAAAGGTAGACTTATAATTAG
AGAAAACAAC
I would like to convert the first line as follows:
>INITWORD/XXX-220_5004_COVID-A6/FINALWORD
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT...
So far I have managed to add the first word as follows:
sed 's/>/>INITWORD\//I'
That returns:
>INITWORD/XXX-220_5004_COVID-A6
TTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAA
AGAAGGT
How can I add the FINALWORD at the end of the first line?
Just substitute more. sed conveniently allows you to recall the text you matched with a back reference, so just embed that between the things you want to add.
sed 's%^>\(.*\)%>INITWORD/\1/FINALWORD%I' file.fasta
I also added a ^ beginning-of-line anchor, and switched to % delimiters so the slashes don't need to be escaped.
In some more detail, the s command's syntax is s/regex/replacement/flags where regex is a regular expression to match the text you want to replace, and replacement is the text to replace it with. In the regex, you can use grouping parentheses \(...\) to extract some of the matched text into the replacement; so \1 refers to whatever matched the first set of grouping parentheses, \2 to the second, etc. The /flags are optional single-character specifiers which modify the behavior of the command; so for example, a /g flag says to replace every match on a line, instead of just the first one (but we only expect one match per line so it's not necessary or useful here).
The I flag is non-standard but since you are using that, I assume it does something useful for you.
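As a quick check of the back-reference approach, here is a small sketch run on a shortened copy of the sample record (the sequence lines are abbreviated and the function name tag_headers is hypothetical). The I flag is left out here since the pattern contains no letters for it to act on.

```shell
# Hypothetical helper: wrap every FASTA header in INITWORD/.../FINALWORD.
# Only lines starting with ">" match, so sequence lines pass through unchanged.
tag_headers() {
    sed 's%^>\(.*\)%>INITWORD/\1/FINALWORD%'
}

printf '%s\n' '>XXX-220_5004_COVID-A6' 'TTTATTTGACATGAG' 'AGAAAACAAC' | tag_headers
# >INITWORD/XXX-220_5004_COVID-A6/FINALWORD
# TTTATTTGACATGAG
# AGAAAACAAC
```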

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue. With \S you asked to match only non-whitespace characters inside (..), but the failing case contains a space, so it no longer fits the pattern. Keep it simple by explicitly listing the characters to match inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use character classes [:alnum:] if you want numbers too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
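Since the question notes that the field order can change, here is a small sketch of the same [^)]* expression on reordered input. The -n option and p flag are an addition, used so that lines without a STATUS field print nothing; the function name status_of is made up for illustration.

```shell
# Hypothetical helper: print whatever sits between "STATUS(" and the next ")".
# -n plus the p flag prints only lines where the substitution succeeded.
status_of() {
    sed -n 's/.*STATUS(\([^)]*\)).*/\1/p'
}

printf '%s\n' \
    'QMNAME(QMTKGW01) STATUS(Ended normally)' \
    'STATUS(Running) QMNAME(QMTKGW01)' \
    'QMNAME(QMTKGW01)' | status_of
# Ended normally
# Running
```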
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).

What does the following sed statement mean

sed 's/<img src=\"\([^"]*\).*/\1/g'
input:
<img src="geo.yahoo.com/b?s=792600534"; height="1" width="1" style="position: absolute;" />
output:
https://geo.yahoo.com/b?s=792600534
This part is the regular expression to match, with a capturing group that is later referred to as \1 (the first capturing group). It extracts the value of the src attribute.
First part of the regex -> <img src=\"
capturing group -> \([^"]*\)
rest of the regex -> .*
The expression inside the square brackets could be read as: "anything not a double quote".
sed is a scripting language. Its s command performs substitutions using regular expressions. The syntax is s/regex/replacement/flags. In your example, you have the regex
<img src=\"\([^"]*\).*
and the replacement
\1
and the flags
g
The regex is apparently attempting to parse HTML, which deserves you a place in a warm location where a friendly gentleman with a pitchfork helps you with motivational issues. Far, far away, God reluctantly ends the life of a fluffy kitten.
The regular expression contains a capturing group, which is simply the text which matched between the parentheses. The replacement \1 refers back to this captured text. So in brief, you are taking away the parts which matched around this captured string.
s/foo\(bar\)baz/\1/
replaces foobarbaz with just bar, retrieving the "bar" part from whatever matched, rather than hard-coding a replacement string.
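A two-line sketch of that behaviour, using made-up input strings:

```shell
# The captured text \1 is all that survives of the match.
echo 'foobarbaz' | sed 's/foo\(bar\)baz/\1/'        # prints: bar
# Text outside the match is left alone.
echo 'x foobarbaz y' | sed 's/foo\(bar\)baz/\1/'    # prints: x bar y
```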
The regular expression .* matches any character any number of times; the regular expression engine will prefer the longest, leftmost possible match.
The regular expression [^"] matches a single character which is not (newline or) ", and the * again says to match as many times as possible. So "\([^"]*\)" finds a double-quoted string, and captures its contents; the negated " prevents the regular expression from matching past the closing quote when matching as many characters as possible. (As noted in comments, the backslash before the first " is unnecessary, but basically harmless. It just tells us that whoever wrote this isn't a regex wizard.)
However, your example just implicitly includes the closing quote in the .* match which will simply match everything from the closing quote through to the end of the line.
The g flag says to repeat the substitution command as many times as possible; so if an input line contains multiple matches, all of them will be replaced. (Without the g flag, sed will just replace the first match it finds on a line.) But since you just removed the rest of the line, the flag isn't actually useful here; there can only ever be a single match.
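The difference between the negated class and a plain .*, and the effect of the g flag, can be seen in a short sketch. The sample line and variable name are invented for illustration.

```shell
line='<img src="a.png" alt="x">'

# [^"]* stops at the first closing quote ...
echo "$line" | sed 's/.*src="\([^"]*\)".*/\1/'    # prints: a.png
# ... while a greedy .* runs on to the last quote on the line.
echo "$line" | sed 's/.*src="\(.*\)".*/\1/'       # prints: a.png" alt="x

# Without g only the first match is replaced; with g, every match is.
echo 'a-b-c' | sed 's/-/+/'     # prints: a+b-c
echo 'a-b-c' | sed 's/-/+/g'    # prints: a+b+c
```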
The gentleman with the pitchfork doesn't want me to tell you this, but this code is not suitable for a general-purpose script. There is no guarantee that the src attribute of the img element will be immediately adjacent to the img opening tag with just a single space in between; HTML allows arbitrary spacing (including a line wrap) and you can have other attributes like id or alt or title which could go before or after the src attribute. The proper solution is to use a HTML parser to extract the src attributes of img tags with proper understanding of the surrounding syntax.
xmlstarlet sel -T -t -m "/img" -m "@src" -v '.' -n
... though the stray semicolon after the src attribute is a HTML syntax violation; is it really there in your input?
(xmlstarlet command line shamelessly adapted from https://stackoverflow.com/a/3174307/874188)

Converting long single-line comments into multiple short lines

I have some lines with very long single-line comments:
# this is a comment describing the function, let's pretend it's long.
function whatever()
{
    # this is an explanation of something that happens in here.
    do_something();
}
For this example (adapting it to other numbers should be trivial) I want
each line to contain at most 33 characters (each indentation level is 4 spaces) and
to be broken at the last possible space
each additional line to be indented exactly like the original line.
So it would end up looking like this:
# this is a comment describing
# the function, let's pretend
# it's long.
function whatever()
{
    # this is an explanation of
    # something that happens in
    # here.
    do_something();
}
I'm trying to write a sed script for that, my attempt looking like this (leaving out the attempts to make it break at a particular character count for clarity and because it didn't work):
s/\(^[^#]*# \)\(.*\) \(.*\)/\1\2\n\1\3/g;
This breaks the line only once and not repeatedly like I falsely assumed g to do (and which it actually would do if it were only s/ /\n/g or something).
Perl to the rescue!
Its Text::Wrap module does what you need:
perl -MText::Wrap='wrap,$columns' -pe '
s/^(\s*#)(.*)/$columns = 33 - length $1; wrap("$1", "$1 ", "$2")/e
' < input > output
-M uses the given module with the given parameters. Here, we'll use the wrap function and the $columns variable.
-p reads the input line by line and prints the possibly modified line (like sed)
s///e is a substitution that uses code in the replacement part, the matching part is replaced by the value returned from the code
to calculate the width, we subtract the initial whitespace from 33. If you use tabs in your sources, you'll have to handle them specially.
wrap takes three parameters: prefix for the first line, prefix for the rest of the lines (in this case, they're almost the same: the comment prefix, we just need to add the space to the second one), and the text to wrap.
Comparing the output to yours, it seems you want 33 characters regardless of the leading whitespace. If that's true, just remove the - length $1 part.

SED search and replace substring in a database file

To all,
I have spent a lot of time searching for a solution to this but cannot find it.
Just for a background, I have a text database with thousands of records. Each record is delineated by :
"0 #nnnnnn# Xnnn" // no quotes
Each record has many fields, each on a line of its own, but the field in which I want to search and replace a substring looks like this (notice the spaces):
" 1 X94 User1.faculty.ventura.ca" // no quotes
I want to use sed to change the substring ".faculty.ventura.ca" to ".students.moorpark.ut", changing nothing else on the line, globally for ALL records.
I have tested many things with negative results.
How can this be done ?
Thank You for the assistance.
Bob Perez (robertperez1957#gmail.com)
If I understand you correctly, you want this:
sed 's/1 X94 \(.*\).faculty.ventura.ca/1 X94 \1.students.moorpark.ut/' mydatabase.file
This will replace all records of the form 1 X94 XXXXX.faculty.ventura.ca with 1 X94 XXXXX.students.moorpark.ut.
Here's details on what it all does:
The '' let you have spaces and other messes in your script.
s/ means substitute
1 X94 \(.*\).faculty.ventura.ca is what you'll be substituting. The \(.*\) stores anything in that regular expression for use in the replacement
1 X94 \1.students.moorpark.ut is what to replace the thing you found with. \1 is filled in with the first thing that matched \(.*\). (You can have multiple of those in one line, and the next one would then be \2.)
The final / just tells sed that you're done. If your database doesn't have linefeeds to separate its records, you'll want to end with /g, to make this change multiple times per line.
mydatabase.file should be the filename of your database.
Note that this will output to standard out. You'll probably want to add
> mynewdatabasefile.name
to the end of your line, to save all the output in a file. (It won't do you much good on your terminal.)
Edit, per your comments
If you want to replace 1 F94 bperez.students.Napvil.NCC to 1 F94 bperez.JohnSmith.customer, you can use another set of \(.*\), as:
sed 's/1 X94 \(.*\).\(.*\).Napvil.NCC/1 X94 \1.JohnSmith.customer/' 251-2.txt
This is similar to the above, except that it matches two stored parameters. In this example, \1 evaluates to bperez and \2 evaluates to students. We match \2, but don't use it in the replace part of the expression.
You can do this with any number of stored parameters. (Sed only offers back-references \1 through \9, but I've never needed more than that.) For example, we could make the sed script be 's/\(.\) \(...\) \(.*\).\(.*\).\(.*\).\(.*\)/\1 \2 \3.JohnSmith.customer/', and this would make \1 = 1, \2 = X94, \3 = bperez, \4 = students, \5 = Napvil and \6 = NCC, and we'd ignore \4 through \6. This is actually not the best answer though - just showing it can be done. It's not the best because it's uglier, and also because it's more accepting. It would then do a find and replace on a line like 2 Z12 bperez.a.b.c, which is presumably not what you want. The find query I put in the edit is as specific as possible while still being general enough to suit your tasks.
Another edit!
You know how I said "be as specific as possible"? Due to the . character being special, I wasn't. In fact, I was very generic. The . means "match any character at all," instead of "match a period". Regular expressions are "greedy", matching the most they could, so \(.*\).\(.*\) will always fill the first \(.*\) (which says, "take 0 to many of any character and save it as a match for later") as far as it can.
Try using:
sed 's/1 X94 \(.*\)\.\(.*\).Napvil.NCC/1 X94 \1.JohnSmith.customer/' 251-2.txt
That extra \ acts as an escape sequence, and changes the . from "any character" to "just the period". FYI, since I don't (but should) escape the other periods, technically sed would consider 1 X94 XXXX.StdntZNapvilQNCC as a valid match. Since . means any character, a Z or a Q there would be considered a fit.
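To make the difference concrete, here is a small sketch comparing the expression above with a variant in which every literal period is escaped. The variable names loose and strict are made up, and the strict pattern is an illustration rather than part of the original answer.

```shell
# Only the first period is escaped (the command shown above).
loose='s/1 X94 \(.*\)\.\(.*\).Napvil.NCC/1 X94 \1.JohnSmith.customer/'
# Every literal period is escaped.
strict='s/1 X94 \(.*\)\.\(.*\)\.Napvil\.NCC/1 X94 \1.JohnSmith.customer/'

# Unescaped dots accept the Z and the Q, so the line is rewritten.
echo '1 X94 XXXX.StdntZNapvilQNCC' | sed "$loose"
# 1 X94 XXXX.JohnSmith.customer

# With all dots escaped the same line no longer matches and is left alone.
echo '1 X94 XXXX.StdntZNapvilQNCC' | sed "$strict"
# 1 X94 XXXX.StdntZNapvilQNCC

# The intended record still matches.
echo '1 X94 bperez.students.Napvil.NCC' | sed "$strict"
# 1 X94 bperez.JohnSmith.customer
```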
The following tutorial helped me
sed - replace substring in file
Try the same with the -i option to replace in the file directly:
sed -i 's/unix/linux/' file.txt
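Because -i overwrites the file, it can be safer to keep a backup while testing. This is a minimal sketch that works on a throwaway temporary file; the .bak suffix and the sample text are arbitrary choices.

```shell
# Work on a throwaway file so nothing real is touched.
tmp=$(mktemp)
printf '%s\n' 'unix is great' > "$tmp"

# -i.bak edits the file in place and keeps the original as "$tmp.bak".
sed -i.bak 's/unix/linux/' "$tmp"

cat "$tmp"        # prints: linux is great
cat "$tmp.bak"    # prints: unix is great
rm -f "$tmp" "$tmp.bak"
```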