Is it possible to use binary values in sed pattern matching?
I have one-line strings that contain plain-text fields separated by the binary value 1.
Is it possible to use sed to match everything up to the binary separator 1?
Or should I use awk?
Example string where \x1 represents binary value 1:
key1=value1\x1key2=value2\x1key3=value3
Example expected output, values for key1 and key2:
value1 value2
edit: Here are a couple of options for printing the values based on a list of keys, couldn't figure out a more concise way with awk but one probably exists:
$ echo -e 'key1=value1\001key2=value2\001key3=value3' > test
$ sed 's/\x01/\n/g' test | awk -F= '{if ($1 == "key1" || $1 == "key2") print $2}'
value1
value2
$ sed 's/\x01/\n/g' test | perl -ne 'print "$2\n" if /^(key1|key2)=(.*)/'
value1
value2
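As a sketch of a single-awk alternative to the pipelines above: awk can split on the \001 byte directly if FS is set to it. The function name print_values and the comma-separated key list are assumptions made for this example, not something from the question.

```shell
# Hypothetical helper: print the values of the requested keys, space-separated,
# for each \001-delimited record read from stdin.
print_values() {
    # $1 is a comma-separated list of wanted keys, e.g. "key1,key2"
    awk -v keys="$1" '
        BEGIN {
            FS = "\001"                      # split fields on the binary separator
            n = split(keys, k, ",")
            for (i = 1; i <= n; i++) want[k[i]] = 1
        }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                eq = index($i, "=")          # position of the first "=" in the field
                if (eq > 0 && (substr($i, 1, eq - 1) in want))
                    out = out (out == "" ? "" : " ") substr($i, eq + 1)
            }
            print out
        }'
}

printf 'key1=value1\001key2=value2\001key3=value3\n' | print_values key1,key2
```

This prints the wanted values on one line, in the order they occur in the record.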
sed does not support non-greedy matching, so a pattern like .*\x01 matches up to the last \x1 rather than the first. Your options are to use a different language, or something like the following:
$ sed 's/\x01/\n/g' test | head -n 1
key1=value1
The answer to the following question has a good example of using a Perl regex for non-greedy matches:
Non greedy regex matching in sed?
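Because the separator is a single byte, one way to stay within sed is to delete everything from the first separator onward instead of trying to match non-greedily. The following is a minimal sketch; the names SEP and first_field are made up for the example, and printf is used to produce the raw byte so the command does not depend on sed understanding \x01.

```shell
# Hold the literal \001 byte in a shell variable.
SEP=$(printf '\001')

# Hypothetical helper: keep only the text before the first separator.
first_field() {
    # Delete from the first separator to the end of the line.
    sed "s/${SEP}.*//"
}

printf 'key1=value1\001key2=value2\001key3=value3\n' | first_field
```

Lines without a separator pass through unchanged.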
If your sed does not interpret the \x01 escape itself (GNU sed does, but POSIX does not require it), you have to get the raw byte into the command some other way. For example, to convert them all to newlines:
sed -e "s/$(echo -e \\001)/\n/g" filename
Type Control-A at the point where you want the character \001 to appear.
I would find this a lot easier than handling all the necessary escaping to get echo to produce the correct string if there are any backslashes in the regex - and I find there often are such backslashes.
I have a file in which some lines contain a json object on a single line, and I want to extract the value of the window_indicator property.
A normal regular expression is: "window_indicator":\s*([\-\d\.]+) in which I want the value of the first match group.
Here it is working perfectly well: https://regex101.com/r/w9Iuch/1
I've settled on sed because it seems that grep has to print the whole line and can't limit to the match group value, and perl is overkill.
Unfortunately, sed isn't actually capable of doing this, is it?
# sed 's/("window_indicator:)/\1/' in.txt
sed: -e expression #1, char 26: invalid reference \1 on `s' command's RHS
# sed -E 's/("window_indicator":)/\1/p' in.txt
prints out every line of the file
# sed -rn 's/("window_indicator":)/\1/p' in.txt
prints the whole line
# sed -rn 's/("window_indicator":)/\1/' in.txt
nothing
With sed, you need to match the whole line, capture what you need, replace the whole match with Group 1 placeholder, and make sure you suppress the default line output and only print the new text after successful substitution:
sed -nE 's/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p' in.txt
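If only the first match is to be retrieved, quit after the first successful substitution. The q has to sit inside a block guarded by the pattern, otherwise sed quits after the first input line whether or not it matched (the empty regex in s// reuses the address pattern):
sed -nE '/.*"window_indicator":[[:space:]]*([-0-9.]+).*/{s//\1/p;q;}' in.txt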
Note that \d is not supported in POSIX regex, it is replaced with 0-9 range in the bracket expression here.
Details
n - suppress default line output
E - enables POSIX ERE flavor
.*"window_indicator":[[:space:]]*([-0-9.]+).* - finds
.* - any text
"window_indicator": - a fixed string
[[:space:]]* - zero or more whitespaces (GNU sed supports \s, too)
([-0-9.]+) - Group 1: one or more digits, - or .
.* - any text
\1 - replaces with Group 1 value
p - prints the result upon successful replacement
q - quits processing the stream.
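The pieces listed above can be tried out on a small sample. This is only a sketch: the function name first_window_indicator and the sample lines are invented for illustration. It quits only on a line where the pattern matched, so leading lines without the property are skipped.

```shell
# Hypothetical helper: print the first window_indicator value found in the
# input (files given as arguments, or stdin), then stop reading.
first_window_indicator() {
    sed -nE '/.*"window_indicator":[[:space:]]*([-0-9.]+).*/{
        s//\1/p
        q
    }' "$@"
}

printf '%s\n' 'no json here' \
              '{"id": 7, "window_indicator": -0.25}' \
              '{"window_indicator": 3}' |
    first_window_indicator
```

On this sample the first line is ignored and the value from the second line is printed.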
With GNU grep, it is even easier:
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt
To get the first match,
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt | head -1
Here,
o - outputs matched texts only
P - enables the PCRE regex engine
"window_indicator":\s*\K[-\d.]+ - matches
"window_indicator": - a fixed string
\s* - zero or more whitespaces
\K - removes the text matched so far from the match value
[-\d.]+ - matches one or more -, . or digits.
1st solution: With your shown samples, please try the following awk code, though it's always advisable to use a JSON parser such as jq. The match function applies the regex "window_indicator":[0-9]+} to each line. If it matches, the matched text is copied into the variable val with substr, using RSTART and RLENGTH. gsub then removes "window_indicator": and } from val, and printing val gives the needed value.
awk '
match($0,/"window_indicator":[0-9]+}/){
val=substr($0,RSTART,RLENGTH)
gsub(/"window_indicator":|}/,"",val)
print val
}
' Input_file
2nd solution: With GNU grep, use a positive lookbehind and a positive lookahead so that only the digits are printed.
grep -oP '(?<="window_indicator":)\d+(?=})' Input_file
Using sed
$ sed -nE 's/.*window_indicator":([0-9]+).*/\1/p' input_file
0
Using grep
$ grep -Po '.*window_indicator":\K\d+' input_file
0
Using awk
$ awk '{match($0,/.*window_indicator":([0-9]+)/,arr);print arr[1]}' input_file
0
I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file in which one field contains a decimal comma. I would like to replace that comma with a dot (.).
I've tried these, to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.
CSV files should be parsed in awk with a proper FPAT variable (a gawk extension) that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.
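FPAT is specific to gawk. Where only a POSIX awk is available, one possible sketch is to split each line on the double quote instead, so that the even-numbered fields are the quoted ones. The function name fix_decimal_commas is made up for illustration, and the sketch assumes quotes are balanced and never escaped.

```shell
# Hypothetical helper: turn commas into dots inside double-quoted fields only.
fix_decimal_commas() {
    awk 'BEGIN { FS = OFS = "\"" }      # split and rejoin on the double quote
         {
             # Fields 2, 4, 6, ... are the text between a pair of quotes.
             for (i = 2; i <= NF; i += 2) gsub(/,/, ".", $i)
             print
         }' "$@"
}

printf '%s\n' 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC' |
    fix_decimal_commas
```

Commas outside the quotes are untouched because they live in the odd-numbered fields.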
The following sed will convert all decimal separators in quoted numeric fields:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a , within a pair of "'s and replaces it with a .. The regexp is anchored to the start of the line and thus needs to be repeated until no further matches are found, hence the :a and ta commands, which loop for as long as a substitution succeeds.
N.B. The solution expects that all double quotes are matched and that no double quotes are quoted i.e. \" does not appear in a line.
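To see the loop at work, here is a small sketch that feeds the same substitution a line with two quoted fields. The function name dot_quoted_commas and the sample line are invented for the example, and the label and branch are passed as separate -e expressions so the command does not rely on GNU's handling of ; after a label.

```shell
# Hypothetical helper: replace every comma that sits inside a quoted field.
dot_quoted_commas() {
    sed -E -e ':a' \
           -e 's/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./' \
           -e 'ta' "$@"
}

printf '%s\n' 'a,"1,5",b,"2,75",c' | dot_quoted_commas
```

Each pass fixes one comma; the branch back to :a repeats until the substitution no longer matches.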
If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.
Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.
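The -r option of GNU sed (equivalent to -E) enables extended regular expressions, so groups and + can be written without backslashes; it does not enable Perl features such as \d.
So if you want to replace the comma in every such quoted number and drop the " signs, you can use this: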
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace first occurrence only you can use that:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt
I have a string in the following format and I want to convert it to csv format (note the separator is the underscore character "_"):
Title_YYYYMMDD_emailname convert to Title,YYYYMMDD,emailname
This is simple enough using sed ...
echo "Report_20131107_jlsmith" | sed 's/_/,/g'
Output:
Report,20131107,jlsmith
But there are complications trying to parse a string that contains underscores in the title field ..
I want to retain the underscores in the title (if any) but change the underscores to commas for the
date and emailname ...
For instance:
Report_Title_20131107_jlsmith convert to: Report_Title,20131107,jlsmith
And a related question: is there a way to compress multiple repeating instances of the underscore character for the entire string?
Report_Title____20131107_jlsmith convert to: Report_Title,20131107,jlsmith
Last request first:
echo "Report_Title____20131107_jlsmith" | awk '{gsub(/_+/,"_")}1'
Report_Title_20131107_jlsmith
First request (using gnu awk)
echo "Report_Title_more_20131107_jlsmith" | awk '{print gensub(/_([0-9]+)_/,",\\1,","g")}'
Report_Title_more,20131107,jlsmith
All in one command
echo "Report_Title___more_20131107_jlsmith" | awk '{gsub(/_+/,"_");print gensub(/_([0-9]+)_/,",\\1,","g")}'
Report_Title_more,20131107,jlsmith
With the format you have shown, you could replace ____YYYYMMDD_ with ,YYYYMMDD, using sed as follows
echo 'Report_Title____20131107_jlsmith' | sed 's/__*\([0-9]\{8\}\)__*/,\1,/g'
Report_Title,20131107,jlsmith
Using sed
sed -r -e 's/_+/_/g' -e 's/_([^_]+)_([^_]+)$/,\1,\2/'
Or more robust with stringent regex
sed -r -e 's/_+/_/g' -e 's/^(.+)_([0-9]{8})_(\w+)$/\1,\2,\3/'
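A sketch of the stricter variant wrapped in a function, so that it can be tried on the sample strings from the question. The name to_csv is made up, -E is used as the spelling of -r, and [^_]+ stands in for \w+ so the pattern stays within POSIX ERE; it assumes the date is always eight digits.

```shell
# Hypothetical helper: squeeze repeated underscores, then turn the underscores
# around an 8-digit date into commas, leaving the title untouched.
to_csv() {
    sed -E -e 's/_+/_/g' \
           -e 's/^(.+)_([0-9]{8})_([^_]+)$/\1,\2,\3/'
}

printf '%s\n' 'Report_20131107_jlsmith' \
              'Report_Title_20131107_jlsmith' \
              'Report_Title____20131107_jlsmith' | to_csv
```

Lines that do not end in _YYYYMMDD_name are only squeezed, not converted.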
I have a delimited file whose first few fields look like this:
2774013300|184500|2012-01-04 23:00:00|
and I want to alter certain rows whose first field equals or exceeds 8 characters.
I want to truncate the value in the first column.
In the case of 2774013300 I want its value to become 27740133.
I would like to do this in sed, preferably, or awk.
Using sed, I can find any number that exceeds 8 digits at the beginning of the line, but am not quite sure how to truncate it, using, I would assume, substitute.
sed -n -e /'^[0-9]\{10,\}/p' infile
I am thinking I could use grouping for the first 8 characters and return those in a substitute command, but I'm not quite sure how to do that.
In awk, I can detect the first field, but am not quite sure how to use substr to alter the first field and then return the remaining fields, so a full line is preserved.
awk -F'|' '{ if (length($1) > 9) { print $1; print length($1);} }' infile
Depending on the subtleties of your situation, you can use
sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' infile
or
sed 's/^\([0-9]\{8\}\)[0-9]\{1,\}/\1/' infile
which with GNU sed can be simplified to
sed -r 's/^([0-9]{8})[0-9]+/\1/' infile
or, if you need to, add -n and p.
Example:
$ sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' <<<'2774013300|184500|2012-01-04 23:00:00|'
27740133|184500|2012-01-04 23:00:00|
Using awk:
awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
example:
$ echo "2774013300|184500|2012-01-04 23:00:00|" | awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
27740133|184500|2012-01-04 23:00:00|
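A sketch of the same awk idea with the width passed in as a parameter. The name truncate_first_field and its width argument are illustrative, not part of any answer above. It counts from position 1, which is how awk numbers characters, and rows whose first field is already short enough are printed unchanged.

```shell
# Hypothetical helper: cut the first |-separated field down to at most
# "width" characters (default 8) and keep the rest of the line intact.
truncate_first_field() {
    awk -F'|' -v width="${1:-8}" '
        BEGIN { OFS = FS }                                   # rejoin with the same separator
        length($1) > width { $1 = substr($1, 1, width) }     # only touch long fields
        { print }'
}

printf '%s\n' '2774013300|184500|2012-01-04 23:00:00|' \
              '1234|184500|2012-01-04 23:00:00|' | truncate_first_field 8
```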
I work with strings like
abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
and I need to get a new one where I remove in the original string everything from the beginning till the last appearance of "_" and the next characters (can be 3, 4, or whatever number)
so in this case I would get
_adf
How could I do it with "sed" or another bash tool?
Regular expression pattern matching is greedy. Hence ^.*_ will match all characters up to and including the last _. Then just put the underscore back in:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*_/_/'
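To make the effect of greediness visible, here is a small sketch that contrasts .* with a negated bracket expression. The two function names are invented for the comparison.

```shell
# Greedy .* runs as far as it can, so the match ends at the LAST underscore.
after_last_underscore() {
    sed 's/^.*_/_/'
}

# [^_]* cannot cross an underscore, so the match ends at the FIRST one.
after_first_underscore() {
    sed 's/^[^_]*_/_/'
}

s=abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
printf '%s\n' "$s" | after_last_underscore
printf '%s\n' "$s" | after_first_underscore
```

The first call leaves only the final segment; the second removes just the leading abc.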
sed -E 's/^(.*)_([^_]*)$/_\2/' < input.txt
Do you need to modify the string, or just find everything after the last underscore? The regex to find the last _{anything} would be /(_[^_]+)$/ ($ matches the end of the string), or if you also want to match a trailing underscore with nothing after it, /(_[^_]*)$/.
Unless you really need to modify the string in place instead of just finding this piece, or you really want to do this from the command line instead of a script, this regex is a bit simpler (you tagged this with perl, so I wasn't sure quite how committed to using just the command line as opposed to a simple script you were).
If you do need to modify the string in place, use sed -E -i 's/.*(_[^_]+)$/\1/' myfile (GNU sed). The -i flag overwrites the old file with the new one. If you want to create a new file and not clobber the old one, use sed -E 's/.../.../' oldfile > newfile. A g after the s/// replaces every match on each line; leaving it out replaces only the first match per line. With a pattern anchored by $ there is at most one match per line, so g is not needed here.
If the string is not by itself at the end of the line but is embedded in other text and separated only by whitespace, replace the $ with \s, which will match a whitespace character (the end of a word).
If you have strings like these in bash variables (I don't see that specified in the question), you can use parameter expansion:
s="abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf"
t="_${s##*_}"
echo "$t" # ==> _adf
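One thing to watch with the expansion above: if the string contains no underscore, ${s##*_} returns the whole string and the result gains a leading underscore it never had. A sketch that guards against this, using a made-up function name last_segment:

```shell
# Hypothetical helper: print the last underscore and what follows it,
# or the string unchanged when it contains no underscore at all.
last_segment() {
    case $1 in
        *_*) printf '_%s\n' "${1##*_}" ;;   # strip the longest prefix ending in _
        *)   printf '%s\n' "$1" ;;          # nothing to strip
    esac
}

last_segment abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
last_segment plain
```

Only POSIX parameter expansion is used, so this is not tied to bash.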
In Perl, you could do this:
my $string = "abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf";
if ( $string =~ m/(_[^_]+)$/ ) {
print $1;
}
[Edit]
A Perl one liner approach (ie, can be run from bash directly):
perl -lne 'm/(_[^_]+)$/ && print $1;' infile > outfile
Or using substitution:
perl -pe 's/.*(_[^_]+)$/$1/' infile > outfile
Just group the last non-underscore characters preceded by the last underscore with \(_[^_]*\), then reference this group with \1:
sed 's/^.*\(_[^_]*\)$/\1/'
Result:
$ echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*\(_[^_]*\)$/\1/'
_adf
A Perl way:
echo 'abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf' | \
perl -e 'print ((split/(_)/,<>)[-2..-1])'
output:
_adf
Just for fun:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | tr _ '\n' | tail -n 1 | rev | tr '\n' _ | rev