Datastage, Remove only last two characters of string - datastage

This function: Trim(In.Col, Right(In.Col, 2), 'T') works unless the last >2 characters are the same.
What I want:
abczzzz -> abczz
What I get:
abczzzz -> abc
How do I solve this?

The "T" option removes all trailing occurrences. Since you are limiting your input to only two characters with the Right() function, the second occurrence will never be a trailing char.
It sounds, though, like you just want a substring. If so, use the substring operator [ ] instead:
expression [ [ start, ] length ]
In.Col[1, Len(In.Col) - 2]

I prefer the Left() function, although it's equivalent here, as it's self-documenting.
Left(InLink.MyString, Len(InLink.MyString) - 2)
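The difference between trimming and taking a substring is not specific to DataStage. As a rough analogy in Python (this is not DataStage syntax, and drop_last_two is a hypothetical helper name), stripping a trailing character removes the whole run, while slicing by length removes exactly two characters:

```python
def drop_last_two(s: str) -> str:
    """Return s without its last two characters (empty if s is shorter)."""
    return s[:-2]

# Stripping a trailing character removes every trailing occurrence,
# much like the "T" option of Trim() described above.
trimmed = "abczzzz".rstrip("z")

# Slicing by length removes exactly two characters,
# like Left(..., Len(...) - 2).
sliced = drop_last_two("abczzzz")

print(trimmed)  # abc
print(sliced)   # abczz
```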

Related

Regex expression in q to match specific integer range following string

Using q’s like function, how can we achieve the following match using a single regex string regstr?
q) ("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13") like regstr
>>> 0111110b
That is, like regstr matches the foo-strings which end in the numbers 8,9,10,11,12.
Using regstr:"foo[8-12]" confuses the square brackets (how does it interpret this?) since 12 is not a single digit, while regstr:"foo[1[0-2]|[1-9]]" returns a type error, even without the foo-string complication.
As the other comments and answers mentioned, this can't be done using a single regex. Another alternative method is to construct the list of strings that you want to compare against:
q)str:("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")
q)match:{x in y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
0111110b
If your eventual goal is to filter on the matching entries, you can replace in with inter:
q)match:{x inter y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
A variation on Cillian’s method: test the prefix and numbers separately.
q)range:{x+til 1+y-x}.
q)s:"foo",/:string 82,range 7 13 / include "foo82" in tests
q)match:{min(x~/:;in[;string range y]')@'flip count[x]cut'z}
q)match["foo";8 12;] s
00111110b
Note how the unary derived functions x~/: and in[;string range y]' are paired by @' with the split strings, and min is then used to AND the results:
q)flip 3 cut's
"foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo"
"82" ,"7" ,"8" ,"9" "10" "11" "12" "13"
q)("foo"~/:;in[;string range 8 12]')@'flip 3 cut's
11111111b
00111110b
Compositions rock.
As the comments state, regex in kdb+ is extremely limited. If the number of trailing digits is known, as in the example above, the following can be used to check multiple patterns:
q)str:("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13"; "foo3x"; "foo123")
q)any str like/:("foo[0-9]";"foo[0-9][0-9]")
111111100b
Checking for a range like 8-12 is not currently possible within kdb+ regex. One possible workaround is to write a function to implement this logic. The function range checks that each string in a list starts with a given string and ends with a number within the specified range.
range:{
/ checking for strings starting with string y
s:((c:count y)#'x)like y;
/ convert remainder of string to long, check if within range
d:("J"$c _'x)within z;
/ find strings satisfying both conditions
s&d
}
Example use:
q)range[str;"foo";8 12]
011111000b
q)str where range[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
This could be made more efficient by checking the trailing digits only on the subset of strings starting with "foo".
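The same two-step check (prefix first, then numeric range) can be sketched outside q. The Python version below is only an illustration of the logic, not q code; in_range is a hypothetical name, and it parses the remainder only for strings that carry the prefix:

```python
def in_range(strings, prefix, lo, hi):
    """For each string, report whether it is prefix + an integer in [lo, hi]."""
    flags = []
    for s in strings:
        rest = s[len(prefix):]
        # Short-circuiting means int() is only reached for strings that
        # start with the prefix and whose remainder is purely ASCII digits.
        ok = (
            s.startswith(prefix)
            and rest.isascii()
            and rest.isdigit()
            and lo <= int(rest) <= hi
        )
        flags.append(ok)
    return flags

strs = ["foo7", "foo8", "foo9", "foo10", "foo11", "foo12", "foo13", "foo3x", "foo123"]
flags = in_range(strs, "foo", 8, 12)
matches = [s for s, f in zip(strs, flags) if f]
print(flags)
print(matches)
```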
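For your example you can pad to a fixed width, fill the padding with a character, and then a simple pattern works. Note that | is a literal inside the brackets, so this pattern also accepts some strings outside the range (for example "foo1" or "foo80"); it is only adequate for the inputs shown: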
("."^5$("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")) like "foo[1|8-9][.|0-2]"

Could I specify pattern match priority in lex code?

I have a related thread on this site (My lex pattern doesn't work to match my input file, how to correct it?).
My problem is with how greedily lex matches patterns. For example, here is my lex file:
$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%
What I want is: when "12" is met, print "head"; when "34" is met, print "tail"; otherwise print "content" for the longest match that contains neither "12" nor "34".
But in fact ".*" is a greedy match, so whatever I input, it prints "content".
My requirement is that when I use
12sdf2dfsd3sd34
as input, the output should be
head
content
tail
So there seem to be two possible ways:
1. Specify a match priority so that ".*" applies only when neither "12" nor "34" matches. Does lex support priority?
2. Change the third expression to match any contiguous string that does not contain the substring "12" or "34". But how do I write this regular expression?
Does (f)lex support priority?
(F)lex always produces the longest possible match. If more than one rule matches the same longest match, the first one is chosen, so in that case it supports priority. But it does not support priority for shorter matches, nor does it implement non-greedy matching.
How to match a string which does not contain one or more sequences?
You can, with some work, create a regular expression which matches a string not containing specified substrings, but it is not particularly easy and (f)lex does not provide a simple syntax for such regular expressions.
A simpler (but slightly less efficient) solution is to match the string in pieces. As a rough outline, you could do the following:
"12" { return HEAD; }
"34" { if (yyleng > 2) {
yyless(yyleng - 2);
return CONTENT;
}
else
return TAIL;
}
.|\n { yymore(); }
This could be made more efficient by matching multiple characters when there is no chance of skipping a delimiter; change the last rule to:
.|[^13]+ { yymore(); }
yymore() causes the current token to be retained, so that the next match appends to the current token rather than starting a new token. yyless(x) returns all but the first x characters to the input stream; in this case, that is used to cause the end delimiter 34 to be rescanned after the CONTENT token is identified.
(That assumes you actually want to tokenize the input stream, rather than just print a debugging message, which is why I called it an outline solution.)
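To make the intended tokenization concrete, here is a small Python sketch. It is not flex code: it relies on negative lookahead, a feature of Python's re module that (f)lex does not have, which is why the rules above use yymore() instead. The names TOKEN and tokenize are hypothetical.

```python
import re

# "12" -> head, "34" -> tail, anything else -> the longest run of characters
# in which no "12" or "34" delimiter starts.
TOKEN = re.compile(
    r"(?P<head>12)|(?P<tail>34)|(?P<content>(?:(?!12|34).)+)",
    re.DOTALL,
)

def tokenize(text):
    """Return the token kinds found in text, in order."""
    return [m.lastgroup for m in TOKEN.finditer(text)]

print(tokenize("12sdf2dfsd3sd34"))  # ['head', 'content', 'tail']
```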

Is it possible to match any character that is not ']' in PATINDEX?

I need to find the index of the first character that is not ]. Normally to match any character except X, you use the pattern [^X]. The problem is that [^]] simply closes the first bracket too early. The first part, [^], will match any character.
In the documentation for the LIKE operator, if you scroll down to the section "Using Wildcard Characters As Literals", it shows a table of methods to indicate literal characters like [ and ] inside a pattern. It makes no mention of using [ or ] inside double brackets. If the pattern is being used with the LIKE operator, you would use the ESCAPE clause. LIKE doesn't return an index and PATINDEX doesn't seem to have a parameter for an escape clause.
Is there no way to do this?
(This may seem arbitrary. To put some context around it, I need to match ] immediately followed by a character that is not ] in order to locate the end of a quoted identifier. ]] is the only character escape inside a quoted identifier.)
This isn't possible. The Connect item PATINDEX Missing ESCAPE Clause is closed as won't fix.
I'd probably use CLR and regular expressions.
A simple implementation might be
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 PatIndexCLR(SqlString pattern, SqlString expression)
    {
        if (pattern.IsNull || expression.IsNull)
            return new SqlInt32();
        Match match = Regex.Match(expression.ToString(), pattern.ToString());
        if (match.Success)
        {
            return new SqlInt32(match.Index + 1);
        }
        else
        {
            return new SqlInt32(0);
        }
    }
}
With example usage
SELECT [dbo].[PatIndexCLR] ( N'[^]]', N']]]]]]]]ABC[DEF');
If that is not an option, a possible (flaky) workaround is to replace ] with a character that is unlikely to appear in the data and has no special significance in the pattern grammar:
WITH T(Value) AS
(
SELECT ']]]]]]]]ABC[DEF'
)
SELECT PATINDEX('%[^' + char(7) + ']%', REPLACE(Value,']', char(7)))
FROM T
(Returns 9)
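For comparison, the index both approaches aim for can be checked with a short Python sketch. This only illustrates the expected result (a 1-based position, 0 when nothing matches, mirroring PATINDEX); it is not T-SQL, and first_not_closing_bracket is a hypothetical name.

```python
import re

def first_not_closing_bracket(text: str) -> int:
    """1-based index of the first character that is not ']', or 0 if none."""
    m = re.search(r"[^\]]", text)
    return m.start() + 1 if m else 0

print(first_not_closing_bracket("]]]]]]]]ABC[DEF"))  # 9
```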

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search a string from output, but also to ignore spaces and upper/lower cases. Here is an example for what's in output:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How could I ignore spaces and cases from output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
Fast.
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there are two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
    C = found;
end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd advise using a regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you only care whether there is a match at all, and not about the position where it occurs, a brute-force and somewhat nasty possibility is to remove all spaces and lowercase the text before searching:
a = strfind(lower(strrep(output, ' ', '')), lower(found));
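The "normalize both sides, then search" idea from these answers can be sketched in Python as follows. This is an illustration rather than MATLAB code; find_ignoring_space_and_case is a hypothetical name, and the returned index refers to the whitespace-free text, not to the original output.

```python
import re

def squeeze(s: str) -> str:
    """Remove all whitespace (not just spaces) and lowercase the result."""
    return re.sub(r"\s+", "", s).lower()

def find_ignoring_space_and_case(haystack: str, needle: str) -> int:
    """0-based index of needle in the squeezed haystack, or -1 if absent."""
    return squeeze(haystack).find(squeeze(needle))

output = "the test : passed\nnumber : 4\n"
print(find_ignoring_space_and_case(output, "thetest:passed"))  # 0
```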

why can the character's order in regex expression affect sed?

The tv.txt file is as following:
mms://live21.gztv.com/gztv_gz 广州台[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
mms://live21.gztv.com/gztv_news 广州新闻台·直播广州(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_kids 广州少儿台(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_econ 广州经济台
I want to group it into three groups.
sed -r 's/([^ ]*)\s([^][()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
got the result:
[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
When I write it as
sed -r 's/([^ ]*)\s([^[]()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
it does not work.
The only difference is [^][()] versus [^[]()]; escaping the brackets, as in [^\[\]()], does not make it work either.
I want to know the reason.
The POSIX rules for getting ] into a character class are a little arcane, but they make sense when you think about it hard.
For a positive (non-negated) character class, the ] must be the first character:
[]and]
This recognizes any character a, n, d or ] as part of the character class.
For a negated character class, the ] must be the first character after the ^:
[^]and]
This recognizes any character except a, n, d or ] as part of the character class.
Otherwise, the first ] after the [ marks the end of the character class. Inside a character class, most of the normal regex special characters lose their special meaning, and others (notably - minus) acquire special meanings. (If you want a - in a character class, it has to be 'first' or last, where 'first' means 'after the optional ^ and only if ] is not present'.)
In your examples:
[^][()] is a negated character class that recognizes any character except [, ], ( or ), but
[^[]()] is the negated character class [^[], which recognizes any character except [, followed by whatever () symbolizes in the regex family you're using, and then a ] which represents itself.
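These placement rules can be tried out quickly. The sketch below uses Python's re module, which is not a POSIX implementation but follows the same convention that a ] placed first in a class (or right after ^) is literal; the variable names are hypothetical. In Python, () is an empty group, so the third pattern needs two characters to match.

```python
import re

# ']' placed first is literal: the class is {']', 'a', 'n', 'd'}.
positive = re.compile(r"[]and]")
# ']' right after '^' is literal: any character except ']', '[', '(' or ')'.
negated = re.compile(r"[^][()]")
# Here the class ends at the first ']': "[^[]" (anything but '['),
# then an empty group "()", then a literal ']'.
misplaced = re.compile(r"[^[]()]")

print(bool(positive.fullmatch("]")))    # True
print(bool(negated.fullmatch("(")))     # False
print(bool(negated.fullmatch("x")))     # True
print(bool(misplaced.fullmatch("x")))   # False
print(bool(misplaced.fullmatch("x]")))  # True
```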