How to check for valid file name format in kdb/q?

I'd like to check that the file names in my directory are all formatted properly. First I create a variable dir and then use the keyword key to see what files are listed...
q)dir:`:/myDirectory/data/files
q)dirkey:key dir
q)dirkey
`FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
`FILEB_ABC_20190430_b556nyc1_OrderSale_000456.meta
I select and parse the .json file name...
q)dirjsn:dirkey where dirkey like "*.json"
q)sepname:raze{"_" vs string x}'[dirjsn]
"FILEA"
"XYZ"
"20190501"
"b233nyc9"
"OrderPurchase"
"000123.json"
Next I'd like to confirm that each character in sepname[0] and sepname[1] are letters, that characters in sepname[2] are numerical/temporal, and that sepname[3] contains alphanumeric values.
What is the best way to optimize the following sequential if statements for performance and how can I check for alphanumeric values, like in the case of sepname[3], not just one or the other?
q)if[not sepname[0] like "*[A-Z]";:show "Incorrect Submitter"];
if[not sepname[1] like "*[A-Z]";:show "Incorrect Reporter"];
if[not sepname[2] like "*[0-9]";:show "Incorrect Date"];
if[not sepname[3] like " ??? ";:show "Incorrect Kind"];
show "Correct File Format"

If your valid filenames always have the same structure (specifically 5 chars, 3 chars, 8 chars, 8 chars), then you can use a single like pattern, built like so:
dirjsn:("FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json");
q)dirjsn
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
q)AZ:"[A-Z]";n:"[0-9]";Azn:"[A-Za-z0-9]";
q)dirjsn where dirjsn like raze(AZ;"_";AZ;"_";n;"_";Azn;"*")where 5 1 3 1 8 1 8 1
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"

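For readers more familiar with conventional regular expressions, the same idea (repeat a character class a fixed number of times per field) can be sketched in Python. This is only an illustration under the same fixed-width assumption; the names FIELDS, PATTERN, and is_valid are hypothetical and not part of the answer above.

```python
import re

# (character class, repeat count) per field, mirroring the q pattern:
# 5 upper-case letters, 3 upper-case letters, 8 digits, 8 alphanumerics.
FIELDS = [("[A-Z]", 5), ("[A-Z]", 3), ("[0-9]", 8), ("[A-Za-z0-9]", 8)]

# Join the fields with "_" and allow any trailing text, like the final "*" in q.
PATTERN = re.compile("_".join(f"{cls}{{{n}}}" for cls, n in FIELDS) + ".*")

def is_valid(name: str) -> bool:
    """Return True if the whole file name matches the fixed-width layout."""
    return PATTERN.fullmatch(name) is not None

names = [
    "FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json",
    "F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json",
    "FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json",
    "FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json",
]
valid = [n for n in names if is_valid(n)]
print(valid)
```

As in the q version, only names whose first four fields have exactly the expected widths and character types are kept.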
like will not work in this case as we need to check each character. One way to do that is to use in and inter:
q) a: ("FILEA"; "XYZ"; "20190501"; "b233nyc9")
Create a character set
q) c: .Q.a, .Q.A
For the first three cases, check that each character belongs to a specific set:
q) r1: all@'(3#a) in' (c;c;.Q.n) / output 111b
For the alphanumeric case, check that it contains both letters and digits and no other symbols:
q)r2: (sum[b]=count a[3]) & all b:sum@'a[3] in/: (c;.Q.n) / output 1b
Print output/errors:
q) errors: ("Incorrect Submitter";"Incorrect Reporter";"Incorrect Date";"Incorrect Kind")
q) show $[0=count r:where not r1,r2;"All good";errors r]
q) "All good"
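The per-character set-membership approach can also be sketched outside q. The Python below is a rough equivalent, assuming ASCII letters and digits; check_fields, LETTERS, and DIGITS are illustrative names, not anything defined in the answer.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def check_fields(fields):
    """Return the error messages for the first four fields of a file name."""
    submitter, reporter, date, kind = fields[:4]
    kind_chars = set(kind)
    checks = [
        (set(submitter) <= LETTERS, "Incorrect Submitter"),
        (set(reporter) <= LETTERS, "Incorrect Reporter"),
        (set(date) <= DIGITS, "Incorrect Date"),
        # Alphanumeric: only letters and digits, and at least one of each.
        (
            kind_chars <= (LETTERS | DIGITS)
            and bool(kind_chars & LETTERS)
            and bool(kind_chars & DIGITS),
            "Incorrect Kind",
        ),
    ]
    return [message for ok, message in checks if not ok]

errors = check_fields(["FILEA", "XYZ", "20190501", "b233nyc9"])
print(errors or "All good")
```

Collecting all failures in one pass, rather than returning at the first failed if, mirrors the errors r lookup in the q answer.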

Regex expression in q to match specific integer range following string

Using q’s like function, how can we achieve the following match using a single regex string regstr?
q) ("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13") like regstr
>>> 0111110b
That is, like regstr matches the foo-strings which end in the numbers 8,9,10,11,12.
Using regstr:"foo[8-12]" confuses the square brackets (how does it interpret this?) since 12 is not a single digit, while regstr:"foo[1[0-2]|[1-9]]" returns a type error, even without the foo-string complication.
As the other comments and answers mentioned, this can't be done using a single regex. Another alternative method is to construct the list of strings that you want to compare against:
q)str:("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")
q)match:{x in y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
0111110b
If your eventual goal is to filter on the matching entries, you can replace in with inter:
q)match:{x inter y,/:string z[0]+til 1+neg(-/)z}
q)match[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
A variation on Cillian’s method: test the prefix and numbers separately.
q)range:{x+til 1+y-x}.
q)s:"foo",/:string 82,range 7 13 / include "foo82" in tests
q)match:{min(x~/:;in[;string range y]')@'flip count[x]cut'z}
q)match["foo";8 12;] s
00111110b
Note how the unary derived functions x~/: and in[;string range y]' are paired by @' with the split strings, and min is then used to AND the results:
q)flip 3 cut's
"foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo"
"82" ,"7" ,"8" ,"9" "10" "11" "12" "13"
q)("foo"~/:;in[;string range 8 12]')@'flip 3 cut's
11111111b
00111110b
Compositions rock.
As the comments state, regex in kdb+ is extremely limited. If the number of trailing digits is known, as in the example above, then the following can be used to check multiple patterns:
q)str:("foo7"; "foo8"; "foo9"; "foo10"; "foo11"; "foo12"; "foo13"; "foo3x"; "foo123")
q)any str like/:("foo[0-9]";"foo[0-9][0-9]")
111111100b
Checking for a range like 8-12 is not currently possible within kdb+ regex. One possible workaround is to write a function that implements this logic. The function range checks that each string in a list starts with a given string and ends with a number within the specified range.
range:{
/ checking for strings starting with string y
s:((c:count y)#'x)like y;
/ convert remainder of string to long, check if within range
d:("J"$c _'x)within z;
/ find strings satisfying both conditions
s&d
}
Example use:
q)range[str;"foo";8 12]
011111000b
q)str where range[str;"foo";8 12]
"foo8"
"foo9"
"foo10"
"foo11"
"foo12"
This could be made more efficient by checking the trailing digits only on the subset of strings starting with "foo".
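The efficiency point can be illustrated with a short Python sketch that only parses the trailing digits when the prefix already matches. It is a loose analogue of the q function, not a translation; in_range and its parameters are hypothetical names.

```python
def in_range(strings, prefix, lo, hi):
    """Flag strings that consist of `prefix` followed by an integer in [lo, hi]."""
    flags = []
    for s in strings:
        tail = s[len(prefix):]
        # Short-circuit: the tail is only converted to an integer when the
        # prefix matches and the tail consists solely of ASCII digits.
        ok = (
            s.startswith(prefix)
            and tail.isascii()
            and tail.isdigit()
            and lo <= int(tail) <= hi
        )
        flags.append(ok)
    return flags

strs = ["foo7", "foo8", "foo9", "foo10", "foo11", "foo12", "foo13", "foo3x", "foo123"]
flags = in_range(strs, "foo", 8, 12)
print([s for s, ok in zip(strs, flags) if ok])
```

Strings such as "foo3x" are rejected before any numeric conversion is attempted, which plays the role of the null that "J"$ produces in q.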
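For your example you can pad each string to a fixed width, fill the padding with a character, and then a simple pattern works fine. Note that | is a literal character inside square brackets, and that this pattern would also accept strings such as "foo1" or "foo80", which do not occur in the example: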
("."^5$("foo7";"foo8";"foo9";"foo10";"foo11";"foo12";"foo13")) like "foo[1|8-9][.|0-2]"

How to match exact string in perl

I am trying to parse all the files and verify if any of the file content has strings TESTDIR or TEST_DIR
Files contents might look something like:-
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually I am only interested in TESTDIR, $(TESTDIR), and TEST_DIR; the last two lines should be ignored. I am new to Perl. Can anyone help me out with the regex?
/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the place between a word character and a non-word character. "Word" here has the Perl meaning: word characters are letters, digits, and underscores.
_? means "nothing or an underscore"
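To see the word-boundary behaviour on the sample lines, here is a small check written in Python, whose re module treats \b and word characters the same way for this ASCII input. The variable names are illustrative only.

```python
import re

# Same pattern as the Perl regex above.
PATTERN = re.compile(r"\bTEST_?DIR\b")

lines = [
    "TESTDIR = foo",
    "include $(TESTDIR)/chop.mk",
    "TEST_DIR := goldimage",
    "MAKE_TESTDIR = var_make",
    "NEW_TEST_DIR = tesing_var",
]

# The last two lines are skipped: "_" is a word character, so there is no
# word boundary between "MAKE_" or "NEW_" and the name that follows.
matches = [line for line in lines if PATTERN.search(line)]
for line in matches:
    print(line)
```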
Look at "characterset".
Only (space) surrounding allowed:
/^(.* )?TEST_?DIR /
^ beginning of the line
(.* )? There may be some content (.*), but if there is, it must be followed by a space.
The space at the end says that a space must follow the name. Otherwise use ( .*)?$ at the end.
One of a given characterset is allowed:
If characters other than a space should be possible, you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? In front of TEST_?DIR there may be a space, a \t (tab), a (, or nothing if the line starts with the name itself.
Afterwards there must be one of space, :, =, or ), followed by anything (the "=" of ":=" counts as part of that "anything").
One of a given group is allowed:
So you need groups within (), with each possible alternative separated by a |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case, the beginning is no different from [ \t], because each alternative holds only one character (a space or \t).
At the end, there must be a single space, or := surrounded by spaces, or = surrounded by spaces, followed by anything.
You can use any combination...
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')

Extract words in Lua split by Unicode spaces and control characters

I'm interested in a pure-Lua (i.e., no external Unicode library) solution to extracting the units of a string between certain Unicode control characters and spaces. The code points I would like to use as delimiters are:
0000-0020
007f-00a0
00ad
1680
2000-200a
2028-2029
202f
205f
3000
I know how to access the code points in a string, for example:
> for i,c in utf8.codes("é$ \tπ😃") do print(c) end
233
36
32
9
960
128515
but I am not sure how to "skip" the spaces and tabs and reconstitute the other codepoints into strings themselves. What I would like to do in the example above, is drop the 32 and 9, then perhaps use utf8.char(233, 36) and utf8.char(960, 128515) to somehow get ["é$", "π😃"].
It seems that putting everything into a table of numbers and painstakingly walking through the table with for-loops and if-statements would work, but is there a better way? I looked into string:gmatch but that seems to require making utf8 sequences out of each of the ranges I want, and it's not clear what that pattern would even look like.
Is there an idiomatic way to extract the strings between the spaces? Or must I manually hack tables of code points? gmatch does not look up to the task. Or is it?
would require painstakingly generating the utf8 encodings for all code points at each end of the range.
Yes. But of course not manually.
local function range(from, to)
   -- Both code points must differ only in their last UTF-8 byte
   assert(utf8.codepoint(from) // 64 == utf8.codepoint(to) // 64)
   return from:sub(1,-2).."["..from:sub(-1).."-"..to:sub(-1).."]"
end
local function split_unicode(s)
   for w in s
      :gsub("[\0-\x1F\x7F]", " ")
      :gsub(range("\u{0080}", "\u{00a0}"), " ")  -- 0080-009f from the question's 007f-00a0 range, plus 00a0
      :gsub("\u{00ad}", " ")
      :gsub("\u{1680}", " ")
      :gsub(range("\u{2000}", "\u{200a}"), " ")
      :gsub(range("\u{2028}", "\u{2029}"), " ")
      :gsub("\u{202f}", " ")
      :gsub("\u{205f}", " ")
      :gsub("\u{3000}", " ")
      :gmatch"%S+"
   do
      print(w)
   end
end
Test:
split_unicode("#\0#\t#\x1F#\x7F#\u{00a0}#\u{00ad}#\u{1680}#\u{2000}#\u{2005}#\u{200a}#\u{2028}#\u{2029}#\u{202f}#\u{205f}#\u{3000}#")
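For comparison only, in a language whose regex engine works on code points rather than bytes, the whole delimiter list from the question fits in one character class. The Python sketch below assumes exactly that list; DELIMS and split_unicode are illustrative names, and this is not a replacement for the pure-Lua approach above.

```python
import re

# One character class covering every delimiter code point from the question.
DELIMS = re.compile(
    r"[\u0000-\u0020\u007f-\u00a0\u00ad\u1680"
    r"\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)

def split_unicode(s):
    """Split s on runs of delimiter code points, dropping empty pieces."""
    return [word for word in DELIMS.split(s) if word]

print(split_unicode("é$ \tπ😃"))
```

The Lua version needs the byte-level range helper precisely because Lua patterns match bytes, not code points.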

String interpolation of variable value

I want the variable value to be processed by string interpolation.
val temp = "1 to 10 by 2"
println(s"$temp")
output expected:
inexact Range 1 to 10 by 2
but getting
1 to 10 by 2
is there any way to get this way done?
EDIT
The normal case for using StringContext is:
$> s"${1 to 10 by 2}"
inexact Range 1 to 10 by 2
This returns the Range from 1 to 10 with a step value of 2.
String interpolation won't evaluate the contents of a variable, so is there a way I can do something like
$> val temp = "1 to 10 by 2"
$> s"${$temp}" //hypothetical
such that the interpreter will evaluate this as
s"${$temp}" => s"${1 to 10 by 2}" => Range from 1 to 10 by step of 2 = {1,3,5,7,9}
By setting a string value to temp you are doing just that - creating a flat String. If you want this to be actual code, then you need to drop the quotes:
val temp = 1 to 10 by 2
Then you can print the results:
println(s"$temp")
This will print the following output string:
inexact Range 1 to 10 by 2
This is the toString(...) output of a variable representing a Range. If you want to print the actual results of the 1 to 10 by 2 computation, you need to do something like this:
val resultsAsString = temp.mkString(",")
println(resultsAsString)
> 1,3,5,7,9
or even this (watch out: here the curly brackets { } are used not for string interpolation but simply as normal string characters):
println(s"{$resultsAsString}")
> {1,3,5,7,9}
Edit
If what you want is to actually interpret/compile Scala code on the fly (not recommended though - for security reasons, among others), then you may be interested in this:
https://ammonite.io/ - Ammonite, Scala scripting
In any case, to interpret your code from a String, you may try using this:
https://docs.scala-lang.org/overviews/repl/embedding.html
See these lines:
import javax.script.ScriptEngineManager
val scripter = new ScriptEngineManager().getEngineByName("scala")
scripter.eval("""println("hello, world")""")

AutoHotKey Script - String Splitting

I have a string that looks like this:
17/07/2013 TEXTT TEXR 1 Text 1234567 456.78 987654
I need to separate this so I only end up with 2 values (in this example it's 1234567 and 456.78). The rest is unneeded.
I tried using string split with %A_Space% but as the whole middle area between values is filled with spaces, it doesn't really work.
Anyone got an idea?
src:="17/07/2013 TEXTT TEXR 1 Text "
. " 1234567 456.78 987654", pattern:="([\d\.]+)\s+([\d\.]+)"
RegexMatch(src, pattern, match)
MsgBox, 262144, % "result", % match1 "`n"match2
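To make it clearer why this pattern skips the date and the "1" before "Text", here is the same regex run in Python as a rough cross-check. The spacing in src is an assumption, since the original spacing of the sample string is not preserved here.

```python
import re

# Sample line; the runs of spaces are assumed for illustration.
src = "17/07/2013 TEXTT TEXR 1 Text      1234567  456.78  987654"

# Two runs of digits/dots separated only by whitespace. Earlier candidates fail:
# "17" is followed by "/", and "2013" and "1" are followed by whitespace and
# then letters, so the first successful match is "1234567  456.78".
match = re.search(r"([\d.]+)\s+([\d.]+)", src)
first, second = match.group(1), match.group(2)
print(first, second)
```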
You should look at RegExMatch() and RegexReplace().
So, you will need to build a regex needle (I'm not an expert regexer, but this will work)
First, remove all of the string up to the end of "1 Text" since "1 Text" as you say, is constant. That will leave you with the three number values.
Something like this should find just the numbers you want:
needle:= "iO)1\s+Text"
partialstring := RegexMatch(completestring, needle, results)
lenOfFrontToRemove := results.pos() + results.len()
lastthreenumbers := substr(completestring, lenOfFrontToRemove, strlen(completestring) )
lastthreenumbers := trim(lastthreenumbers)
msgbox % lastthreenumbers
To explain the regex needle:
- the i means case insensitive
- the O stores the match as an object - it lets us use results.pos() and results.len()
- the \s means to look for whitespace; the + means to look for one or more in a row.
Now you have just the last three numbers.
1234567 456.78 987654
But you get the idea, right? You should be able to parse it from here.
Some hints: in a regex needle, use \d to find any digit, and the + to make it look for more than one in a row. If you want to find the period, use \.