I'm trying to run a SYNCSORT job that will remove duplicate entries and when I run it, I'm still getting duplicates. The following is the SYNCSORT code I'm using:
INCLUDE COND=(((61,1,CH,EQ,C'P'),OR,
(61,1,CH,EQ,C'V')),AND,
(8,2,CH,EQ,C'FL'))
OUTREC FIELDS=(1:12,20,
30:36,20,
55:61,1)
SORT FIELDS=(30,20,CH,A,
01,20,CH,A)
SUM FIELDS=NONE
The input is as follows:
----+----1----+----2----+----3----+----4----+----5----+----6
FL AMELIA CITY
32034 FL NASSAU FERNANDINA BEACH P
32034 FL NASSAU AMELIA CITY V
32034 FL NASSAU AMELIA ISLAND S
32034 FL NASSAU FERNANDINA S
I'm getting most of the expected output, except that I'm still getting duplicates. The output that I have is as follows:
----+----1----+----2----+----3----+----4----+----5----+
MANATEE                      BRADENTON                P
MANATEE                      BRADENTON                P
MANATEE                      BRADENTON                P
MANATEE                      BRADENTON                P
MANATEE                      BRADENTON                P
MANATEE                      BRADINGTON               V
POLK                         BRADLEY                  P
HILLSBOROUGH                 BRANDON                  P
SUWANNEE                     BRANFORD                 P
MIAMI-DADE                   BRICKELL                 V
Any help would be appreciated as I'm not able to find my error.
This is what you are sort summing on:
< ------------ Sort Field ----------------------->
----+----1----+----2----+----3----+----4----+----5----+----6
FL AMELIA CITY
32034 FL NASSAU FERNANDINA BEACH P
32034 FL NASSAU AMELIA CITY V
32034 FL NASSAU AMELIA ISLAND S
32034 FL NASSAU FERNANDINA S
The duplicate records differ in the first 11 bytes, which you cannot see in the output. Try removing the OUTREC to check.
Possible changes:
Change the OUTREC to an INREC, or
re-code the SORT so that its fields are the ones that appear in the output.
The following sorts on the fields that make up the output records:
INCLUDE COND=(((61,1,CH,EQ,C'P'),OR,
(61,1,CH,EQ,C'V')),AND,
(8,2,CH,EQ,C'FL'))
OUTREC FIELDS=(1:12,20,
30:36,20,
55:61,1)
SORT FIELDS=(42,20,CH,A,
12,20,CH,A)
SUM FIELDS=NONE
It does not matter what order you code the different stages of a "sort", they will be executed in the order that SORT wants.
In your case this will be INCLUDE, then SORT, then SUM, then OUTREC. You can check that this is the case by entirely inverting the control cards, you will get identical output.
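To see why the duplicates survive, here is a rough Python simulation of that order (INCLUDE, then SORT, then SUM, then OUTREC). It is only a sketch and says nothing about how SYNCSORT works internally. The helper names (make_record, field, run) and the zip codes are made up for illustration; the field positions are the ones from the control cards in the question.

```python
def make_record(zip_code, state, county, city, kind):
    # Layout from the question: zip at 1, state at 8, county at 12,
    # city at 36, type at 61 (1-based positions, as in the control cards).
    rec = [" "] * 61
    for start, text in ((1, zip_code), (8, state), (12, county), (36, city), (61, kind)):
        rec[start - 1:start - 1 + len(text)] = text
    return "".join(rec)


def field(rec, pos, length):
    # Same (position, length) convention as the sort control cards.
    return rec[pos - 1:pos - 1 + length]


def run(records, sort_keys):
    # INCLUDE: type is P or V, and state is FL.
    kept = [r for r in records
            if field(r, 61, 1) in ("P", "V") and field(r, 8, 2) == "FL"]

    # SORT: the key positions refer to the INPUT record layout.
    def key(r):
        return tuple(field(r, p, n) for p, n in sort_keys)

    kept.sort(key=key)

    # SUM FIELDS=NONE: keep one record per distinct sort key.
    deduped = []
    for r in kept:
        if not deduped or key(deduped[-1]) != key(r):
            deduped.append(r)

    # OUTREC: county at 1, city at 30, type at 55.
    return [field(r, 12, 20) + " " * 9 + field(r, 36, 20) + " " * 5 + field(r, 61, 1)
            for r in deduped]


# Hypothetical input: same county and city, different zip codes.
records = [
    make_record("34201", "FL", "MANATEE", "BRADENTON", "P"),
    make_record("34202", "FL", "MANATEE", "BRADENTON", "P"),
    make_record("34203", "FL", "MANATEE", "BRADENTON", "S"),
]

original = run(records, [(30, 20), (1, 20)])   # keys from the question
fixed = run(records, [(36, 20), (12, 20)])     # keys on city and county only

print(len(original), len(fixed))
```

With the original keys the zip code is part of the sort key, so both P records are kept and then look identical once OUTREC has dropped the zip. Keying on city and county alone leaves a single record.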
If you want to do something before SORT you use INREC, not just try to locate OUTREC before the SORT statement. Here, since you are SORTing, you only want to include the data you need. You do not want to include the spacing for formatting. Why would you want to load up your file to SORT with extra identical data on each record?
On INREC and OUTREC please don't use FIELDS. On OUTFIL please don't use OUTREC. It should be obvious that FIELDS is "overloaded" (see how many times you used FIELDS, and how many of those uses mean the same thing) and that OUTREC is "overloaded" too. More than 10 years ago BUILD was introduced to make things much clearer: it describes what it is doing, and every time you see BUILD it only means BUILD.
INCLUDE COND=(((61,1,CH,EQ,C'P'),
OR,
(61,1,CH,EQ,C'V')),
AND,
(8,2,CH,EQ,C'FL'))
INREC BUILD=(36,20,
12,20,
61,1)
SORT FIELDS=(1,40,CH,A)
OUTREC BUILD=(21,10,
10X,
1,20,
5X,
41,1)
The INREC selects only the data you want, and in an order where you need specify only one SORT key.
The OUTREC then formats the data how you want it. For each record in the SORT 15 bytes were saved (the blanks). 10X is 10 blanks, 5X is five blanks.
Note that it is much easier to code and understand, and therefore more maintainable, if you include "explicit" blanks rather than implicit ones using column numbers. Imagine 10 columns of a report, where the spacing between columns one and two is incorrect. Do you want to change all the column references, just to add one extra space, or would you prefer to change 7X to 8X and the rest works itself out? Even if you enjoy tedious changes, remember your colleagues :-)
If your data is already in order don't use SUM FIELDS=NONE. Use OUTFIL reporting features, REMOVECC, NODETAIL and SECTIONS with TRAILER3. NEVER SORT data just to allow you to remove duplicates with SUM FIELDS=NONE.
I would like to delete all records for all tables in memory but still keep the schemas.
for example:
a:([]a:1 2;b:2 4);
b:([]c:2 3;d:3 5);
I wrote a function:
{[t] t::select from t where i = -1} each tables[]
This didn't work, so I tried
{[t] ![`t;enlist(=;`i;-1);0b;()]} each tables[]
That didn't work either.
Any ideas or better ways?
If you pass a global table name as a symbol, it removes all the rows, leaving an empty table
q)delete from `a
`a
q)a
a b
---
q)meta a
c| t f a
-| -----
a| j
b| j
To do it for all global tables in root name space
{delete from x} each tables[]
Your second attempt using function was close. You can achieve it via the following (functional form of the above):
![;();0b;`symbol$()] each tables[]
The first argument should be the symbol of the table for the same reason I mentioned before
The second argument should be an empty list as we want to delete all records (we do not want to delete where i=-1, as that would delete nothing)
The final argument (list of columns to delete) should be an empty symbol list instead of an empty general list.
Mark's solution is best for doing what you want rather than functional form. Just adding to your question on t failing as putting kdb code in comments is awkward.
Your functional form fails not because of the t but because your last argument is not a symbol list `$(). Also you would want to delete where i is > -1, not =
q){[t] ![t;enlist(>;`i;-1);0b;`$()]} each `d`t`q
`d`t`q
q)d
date sym time bid1 bsize1 bid2 bsize2 bid3 bsize3 ask1 asize1 ask2 asize2 ask..
-----------------------------------------------------------------------------..
q)t
date sym time src price size
----------------------------
q)q
date sym time src bid ask bsize asize
-------------------------------------
I have this sample data to test regexp_extract function.
message_txt="test 9341Come Products Preferred*TEST*TEST, the mfg SYSTEM, paid18.26 toward the"
message_txt="mfg of TR tt 100 test, paid $861.82 toward your "
message_txt="TEST 0.015% , paid $1119.00toward your "
I need to extract the numeric value between "paid" and "toward", i.e. 18.26, 861.82 and 1119.00. I execute the below statement
regexp_extract(col("message_txt"),"(?i)paid\\s+(.*?)\\s+(?i)toward",1)
... but I get only spaces.
I don't know regexp_extract() but it looks to me like...
You don't want $ in your results, so you need to move that outside of the capture group.
There aren't always spaces before/after the target, so \\s needs to be optional.
There's no point in having a 2nd (?i).
It's usually better to describe exactly what's permitted in the capture group.
Try something like: "(?i)paid\\s*\\$?([\\d.]+)\\s*toward"
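As a quick check of that pattern against the three sample messages, here is a small sketch using Python's re module as a stand-in for regexp_extract. This is an assumption: Spark uses Java regular expressions, although this particular pattern uses only constructs that the two share. The function name extract_paid is made up, and returning an empty string on no match is meant to mimic what I believe regexp_extract does.

```python
import re

# Same pattern as suggested above, written as a Python raw string.
PATTERN = re.compile(r"(?i)paid\s*\$?([\d.]+)\s*toward")


def extract_paid(message):
    # Return the captured amount, or "" when nothing matches.
    match = PATTERN.search(message)
    return match.group(1) if match else ""


messages = [
    "test 9341Come Products Preferred*TEST*TEST, the mfg SYSTEM, paid18.26 toward the",
    "mfg of TR tt 100 test, paid $861.82 toward your ",
    "TEST 0.015% , paid $1119.00toward your ",
]

amounts = [extract_paid(m) for m in messages]
print(amounts)  # ['18.26', '861.82', '1119.00']
```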
I'm looking for a way to write a functional select in kdb such that a where phrase is only applied if the column exists (in order to avoid an error). If the column doesn't exist, the condition should default to true.
I tried this but it didn't work
enlist(|;enlist(in;`colname;key flip table);enlist(in;`colname;filteredValues[`colname]));
I tried to write a simple boolean expression and use parse to get my functional form
(table[`colname] in values)|(not `colname in key flip table)
But kdb doesn't short-circuit, so the left-hand expression is still evaluated even though the right-hand expression evaluates to true. This produced a weird output, `boolean$(), which is a list of booleans all evaluating to false (0b).
Any help is appreciated. Thanks!
EDIT 1: I have to join a series of condition with parameter specified in the dictionary filters
cond,:(,/) {[l;k] enlist(in;k;enlist l[k])}[filters]'[a:(key filters)]
Then I pass this cond on and it gets executed in a few different selects on different tables. How can I make sure that whatever conditional expression I put in place of enlist(in;k;enlist l[k]) will only get evaluated when the select statement gets executed?
You can use the if-else conditional $ here to do what you want
For example:
q)$[`bid in cols`quotes;enlist (>;`bid;35);()]
> `bid 35
q)$[`bad in cols`quotes;enlist (>;`bad;35);()]
Note that in the second example, the return is an empty list, as this column isn't in quotes table
So you can put this into the functional select like so:
?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
and the where clause will be applied if the column is present, otherwise no where clause will be applied:
q)count ?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
541 //where clause applied, table filtered
q)count ?[`quotes;$[`bad in cols`quotes;enlist (>;`bad;35);()];0b;()]
1000 //where clause not applied, full table returned
Hope this helps
Jonathon
AquaQ Analytics
EDIT: If I'm understanding your updated question correctly, you might be able to do something like the following. Firstly, let's define an example "filters" dictionary:
q)filters:`a`b`c!(1 2 3;"abc";`d`e`f)
q)filters
a| 1 2 3
b| a b c
c| d e f
So here we are assuming a few different columns of different types, for illustration purposes. You can build up your list of where clauses like so:
q)(in),'flip (key filters;value filters)
in `a 1 2 3
in `b "abc"
in `c `d`e`f
(this is equivalent to the code you had to generate cond, but it's a little neater & more efficient - you also have the values enlisted, which isn't necessary)
You could then use a vector conditional to generate your list of where clauses to apply to a given table e.g.
q)t:([] a:1 2 3 4 5 6;b:"adcghf")
q)?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
(in;`a;,1 2 3)
(in;`b;,"abc")
()
As you can see, in this example the table "t" has columns a and b, but not c. So using the vector conditional, you get the where clauses for a and b but not c.
Finally to actually apply this list of output where clauses to the table, you can make use of an over to apply each in turn:
q)l:?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
q){?[x;$[y~();y;enlist y];0b;()]}/[t;l]
a b
---
1 a
3 c
One thing to note here is that in the where clause of the functional select we need to check if y is an empty list - this is so we can enlist it if it is not an empty list
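For readers less familiar with q, the same idea (apply each filter only when its column exists in the table) can be sketched in Python. This is just an illustration of the logic, not kdb code: the table is modelled as a list of dicts, and filter_existing is a made-up name.

```python
def filter_existing(rows, filters):
    # Keep only the filters whose column is actually present in the table,
    # then require every remaining filter to match.
    if not rows:
        return []
    columns = set(rows[0])
    active = {col: set(allowed) for col, allowed in filters.items() if col in columns}
    return [row for row in rows
            if all(row[col] in allowed for col, allowed in active.items())]


# Same example data as above: t has columns a and b, but not c.
t = [{"a": a, "b": b} for a, b in zip(range(1, 7), "adcghf")]
filters = {"a": [1, 2, 3], "b": "abc", "c": ["d", "e", "f"]}

result = filter_existing(t, filters)
print(result)  # [{'a': 1, 'b': 'a'}, {'a': 3, 'b': 'c'}]
```

The filter on c is skipped because t has no such column, so the result matches the two rows returned by the q version.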
Hope this helps
This is the given expression of GREL language on OpenRefine.
diff(date d1, date d2, optional string timeUnit)
For dates, returns the difference in given time units.
So the question is how to access the values of both columns, which is not clearly presented in the documentation.
Thanks
The formula for accessing another column is:
cells.YourColumnName.value
If your column name contains spaces or non-ascii characters :
cells['Your Column Name'].value
So, assuming your two columns are named "date1" and "date2", and you want the difference in days, the GREL formula is as follows :
diff(cells.date1.value, cells.date2.value, "days")
or
diff(cells['date1'].value, cells['date2'].value, "days")
I found a way myself. Here is an example of the working command; the GREL documentation is not very explicit about this procedure.
Here is the command I used. I multiplied the result by -1 to make it positive.
diff(cells["DATA_COMPRA"].value, cells["DATA_VENCIMENTO"].value, "days") * -1
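The sign flip can be illustrated with plain Python dates. This sketch assumes diff behaves like "first date minus second date", which is consistent with the negative result I got but is not something I have confirmed in the documentation. The dates and the name diff_days are made up.

```python
from datetime import date


def diff_days(d1, d2):
    # Assumed convention: d1 minus d2, so the result is negative
    # when d1 is the earlier of the two dates.
    return (d1 - d2).days


purchase = date(2020, 1, 10)  # hypothetical DATA_COMPRA
due = date(2020, 2, 9)        # hypothetical DATA_VENCIMENTO

raw = diff_days(purchase, due)
days_until_due = raw * -1

print(raw, days_until_due)  # -30 30
```

Under that assumption, swapping the two arguments would give the positive number directly, without multiplying by -1.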
Hope that helps. I may have to come back here sometimes to get this script again and again.
I've a csv file. I'd like to filter it and keep only columns with headers beginning 'hit'. How can I do that?
Small example input:
hit1,miss1,hit2,miss2
a,0,d,0
b,0,e,0
c,0,f,0
Desired output:
hit1,hit2
a,d
b,e
c,f
I think I want the exclude command but I can't figure out the syntax
The order command will let you specify an inclusive list of column names to be included in the output:
csvfix order -fn hit1,hit2 data.csv
(I realize I'm late to the party, but maybe this will be helpful to the next person.)
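If listing every column name by hand is not practical, the selection by header prefix can also be done outside csvfix. Below is a minimal sketch using Python's standard csv module; the function name keep_columns_with_prefix is made up, and it assumes the first row is the header.

```python
import csv
import io


def keep_columns_with_prefix(csv_text, prefix):
    # Read the header, remember which column indexes start with the prefix,
    # then write only those columns for every row.
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name.startswith(prefix)]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


sample = "hit1,miss1,hit2,miss2\na,0,d,0\nb,0,e,0\nc,0,f,0\n"
result = keep_columns_with_prefix(sample, "hit")
print(result, end="")
```

On the small example input from the question this prints the desired output (hit1,hit2 followed by a,d / b,e / c,f).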