Replacing repeating pattern with regex - amazon-redshift

I'm trying to replace a certain pattern that is repeated n times on a string with another pattern.
The pattern is: ivc_advertisername:Honda Colorado Dealers,
And is should be replaced with ivc_advertisername=Honda Colorado Dealers ;
This is a sample string
ivc_exdata:761653010, ivc_domain:content.overwolf.com, ivc_placementid:4210386, ivc_dsp:tremor, ivc_advertisername:Honda Colorado Dealers, ivc_advertiserid:483896, ivc_creativeid:8521296, ivc_bundle:, ivc_campaignid:4206156
and this is what I've tried, but it only works if there's only one pattern
select REGEXP_REPLACE('ivc_exdata:761653010, ivc_domain:content.overwolf.com, ivc_placementid:4210386, ivc_dsp:tremor, ivc_advertisername:Honda Colorado Dealers, ivc_advertiserid:483896, ivc_creativeid:8521296, ivc_bundle:, ivc_campaignid:4206156','((.*):(.*), )+','\\1=\\2;');
I'm using redshift
Thanks!

Related

Regex expression in q to match specific integer range following string

Using q’s like function, how can we achieve the following match using a single regex string regstr?
q) ("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13") like regstr
>>> 0111110b
That is, like regstr matches the foo-strings which end in the numbers 8,9,10,11,12.
Using regstr:"foo[8-12]" confuses the square brackets (how does it interpret this?) since 12 is not a single digit, while regstr:"foo[1[0-2]|[1-9]]" returns a type error, even without the foo-string complication.
As the other comments and answers mentioned, this can't be done using a single regex. Another alternative method is to construct the list of strings that you want to compare against:
q)str:("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")
q)match:{x in y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
0111110b
If your eventual goal is to filter on the matching entries, you can replace in with inter:
q)match:{x inter y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
A variation on Cillian’s method: test the prefix and numbers separately.
q)range:{x+til 1+y-x}.
q)s:"foo",/:string 82,range 7 13 / include "foo82" in tests
q)match:{min(x~/:;in[;string range y]')#'flip count[x]cut'z}
q)match["foo";8 12;] s
00111110b
Note how unary derived functions x~/: and in[;string range y]' are paired by #' to the split strings, then min used to AND the result:
q)flip 3 cut's
"foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo"
"82" ,"7" ,"8" ,"9" "10" "11" "12" "13"
q)("foo"~/:;in[;string range 8 12]')#'flip 3 cut's
11111111b
00111110b
Compositions rock.
As the comments state, regex in kdb+ is extremely limited. If the number of trailing digits is known like in the example above then the following can be used to check multiple patterns
q)str:("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13"; "foo3x"; "foo123")
q)any str like/:("foo[0-9]";"foo[0-9][0-9]")
111111100b
Checking for a range like 8-12 is not currently possible within kdb+ regex. One possible workaround is to write a function to implement this logic. The function range checks a list of strings start with a passed string and end with a number within the range specified.
range:{
/ checking for strings starting with string y
s:((c:count y)#'x)like y;
/ convert remainder of string to long, check if within range
d:("J"$c _'x)within z;
/ find strings satisfying both conditions
s&d
}
Example use:
q)range[str;"foo";8 12]
011111000b
q)str where range[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
This could be made more efficient by checking the trailing digits only on the subset of strings starting with "foo".
For your example you can pad, fill with a char, and then simple regex works fine:
("."^5$("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")) like "foo[1|8-9][.|0-2]"

KDB split by fixed delimiter

I have a column with xmls
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/04, 16:45:00.000" ST="2017/09/04, 09:00:00.000" TT="2017/09/04, 16:45:00.000"/>
That I am trying to split in columns
TE, ST, TT
The type of the data is C
Not very familiar with kdb/q I tried to go the very manual way. First removed the start and end tags
x:update `$ssr[;"<Options";""] each tags from x
x:update `$ssr[;"/>";""] each string tags from x
leaving me with rows like
TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"
Then, splitting the string
select `$"\"" vs' string tags from x
gives me a list where the odd entries are my times. I just can't figure out how to take that list and split it into separate columns. Any ideas?
I've taken a slightly different approach but the following should do what you want:
//Clean the tags up for separation
//(get rid of open/close tags, change ", " to "," for ease of parsing and remove quote marks)
x:update tags:{ssr/[x;("<Options ";"/>";", ";"\"");("";"";",";"")]} each tags from x
//Parse the various tags using 0:, put the result into a dictionary,
//exec out to table form and add to x
x:x,'exec (!) ./: ("S= " 0:/: tags) from x
For reference here's the table I used:
x:([] tags:("<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/04, 16:45:00.000\" ST=\"2017/09/04, 09:00:00.000\" TT=\"2017/09/04, 16:45:00.000\"/>"))
Crazy thought: Is your XML data that regular looking, so that one can select "columns" via indexing. If so, suppose the data (above) was in 3-element list of strings, is it not possible that you apply some function foo to:
foo xmllist[;ind]
where ind selects the data required. The function foo would do the necessary conversion to the timestamp datatype, either by using (types;delimiter) 0: ... ?
see if you can export XML file into JSON file.
kdb+/q has a json parser which does all the dirty work for you.
.j.k and .j.j.
Reference: http://code.kx.com/q/cookbook/websockets/#json

Scala : How to split words using multiple delimeters

Suppose I have the text file like this:
Apple#mango&banana#grapes
The data needs to be split on multiple delimiters before performing the word count.
How to do that?
Use split method:
scala> "Apple#mango&banana#grapes".split("[#&#]")
res0: Array[String] = Array(Apple, mango, banana, grapes)
If you just want to count words, you don't need to split. Something like this will do:
val numWords = """\b\w""".r.findAllIn(string).length
This is a regex that matches start of a word (\b is a (zero-length) word boundary, \w is any "word" character (letter, number or underscore), so you get all the matches in your string, and then just check how many there are.
If you are looking to count each word separately, and do it across multiple lines, then, split is, probably, a better option:
source
.getLines
.flatMap(_.split("\\W+"))
.filterNot(_.isEmpty)
.groupBy(identity)
.mapValues(_.size)

Using fscanf in MATLAB to read an unknown number of columns

I want to use fscanf for reading a text file containing 4 rows with an unknown number of columns. The newline is represented by two consecutive spaces.
It was suggested that I pass : as the sizeA parameter but it doesn't work.
How can I read in my data?
update: The file format is
String1 String2 String3
10 20 30
a b c
1 2 3
I have to fill 4 arrays, one for each row.
See if this will work for your application.
fid1=fopen('test.txt');
i=1;
check=0;
while check~=1
str=fscanf(fid1,'%s',1);
if strcmp(str,'')~=1;
string(i)={str};
end
i=i+1;
check=strcmp(str,'');
end
fclose(fid1);
X=reshape(string,[],4);
ar1=X(:,1)
ar2=X(:,2)
ar3=X(:,3)
ar4=X(:,4)
Once you have 'ar1','ar2','ar3','ar4' you can parse them however you want.
I have found a solution, i don't know if it is the only one but it works fine:
A=fscanf(fid,'%[^\n] *\n')
B=sscanf(A,'%c ')
Z=fscanf(fid,'%[^\n] *\n')
C=sscanf(Z,'%d')
....
You could use
rawText = getl(fid);
lines = regexp(thisLine,' ','split);
tokens = {};
for ix = 1:numel(lines)
tokens{end+1} = regexp(lines{ix},' ','split'};
end
This will give you a cell array of strings having the row and column shape or your original data.
To read an arbitrary line of text then break it up according the the formating information you have available. My example uses a single space character.
This uses regular expressions to define the separator. Regular expressions powerful but too complex to describe here. See the MATLAB help for regexp and regular expressions.

Function to split string in matlab and return second number

I have a string and I need two characters to be returned.
I tried with strsplit but the delimiter must be a string and I don't have any delimiters in my string. Instead, I always want to get the second number in my string. The number is always 2 digits.
Example: 001a02.jpg I use the fileparts function to delete the extension of the image (jpg), so I get this string: 001a02
The expected return value is 02
Another example: 001A43a . Return values: 43
Another one: 002A12. Return values: 12
All the filenames are in a matrix 1002x1. Maybe I can use textscan but in the second example, it gives "43a" as a result.
(Just so this question doesn't remain unanswered, here's a possible approach: )
One way to go about this uses splitting with regular expressions (MATLAB's strsplit which you mentioned):
str = '001a02.jpg';
C = strsplit(str,'[a-zA-Z.]','DelimiterType','RegularExpression');
Results in:
C =
'001' '02' ''
In older versions of MATLAB, before strsplit was introduced, similar functionality was achieved using regexp(...,'split').
If you want to learn more about regular expressions (abbreviated as "regex" or "regexp"), there are many online resources (JGI..)
In your case, if you only need to take the 5th and 6th characters from the string you could use:
D = str(5:6);
... and if you want to convert those into numbers you could use:
E = str2double(str(5:6));
If your number is always at a certain position in the string, you can simply index this position.
In the examples you gave, the number is always the 5th and 6th characters in the string.
filename = '002A12';
num = str2num(filename(5:6));
Otherwise, if the formating is more complex, you may want to use a regular expression. There is a similar question matlab - extracting numbers from (odd) string. Modifying the code found there you can do the following
all_num = regexp(filename, '\d+', 'match'); %Find all numbers in the filename
num = str2num(all_num{2}) %Convert second number from str