Using DIFFERENCE in a WHERE clause - tsql

I want to use the DIFFERENCE keyword in a T-SQL query to find data that is equal or very similar between two data sets. DIFFERENCE returns a score from 1-4, where 4 is very similar and 1 is not similar at all.
For example, if I have two data sets, A and B, that contain the following:
A B
---- ----
adam adam
bob billy
charlie brittany
doug charles
frances diana
heather
kim
I would want to select ones that are equal or similar (say, DIFFERENCE value of 3 or 4), so I would want the result set (which stems from data set A) of:
Result
----
adam
charlie
My thought is to put the DIFFERENCE keyword in the WHERE clause, something like this:
SELECT *
FROM A
/* somehow join B here, despite that A and B might not be exact matches such as in charlie and charles */
WHERE DIFFERENCE(A, B) >= 3
How can I do this?

select *
from a
join b
on difference(a.name, b.name) = 4;

Related

How to select a column containing dot in column name in kdb

I have a table which consists of column named "a.b"
q)t:([]a.b:3?10.0; c:3?10; d:3?`3)
How can we select column a.b and c from table t?
How can we rename column a.b to b?
Is it possible to achieve above two cases without functional select?
Failed attempts:
q)select a.b, c from t
'type
q)?[`t;();0b;enlist (`b`c!`a.b`c)]
'type
q)select b:a.b from t
'type
As others have mentioned, .Q.id t will sanitise table column names if they aren't suitable for qSQL statements or performance in general.
`a.b`c#t
will only work for multiple column selects and
`a.b#t
will return a type error. However, you can get around this by enlisting the single item into the take operator, like so:
q)enlist[`a.b]#t
a.b
---------
4.931835
5.785203
0.8388858
q)(enlist`a.b)#t
a.b
---------
4.931835
5.785203
0.8388858
If you only need the values from a single column another option would be to use indexing, in this case, it would be t[a.b] ` which would return all values from the a.b column.
You could also mix these selection styles like so, but ultimately lose the column name from a.b:
q)select c,t[`a.b] from t
c x
----------
8 4.707883
5 6.346716
4 9.672398
In the query operation the . itself is used for foreign key navigation and it is throwing a type error as it cannot find any table relating to the foreign key it believes you have passed it.
As much as I hate answering any online forum question by refuting the premise, I really must here, do not use periods in column names, it will cause trouble. .Q.id exists to santise column names for a reason.
The primary reason that errors are encountered is that the use of dot notation in qSQL is reserved for the resolution of linked columns. We can see how this is actually working by parsing the query itself
q)parse "select a.b from tab"
?
`tab
()
0b
(,`b)!,`a.b // Here the referencing of a linked column b via a is occuring
// Compared to a normal select
q)parse "select b from tab"
?
`tab
()
0b
(,`b)!,`b
Other issues could crop up depending on future processing, such as q attempting to treat the column names as namespaces or operating on each part of the name with the dot operator.
Using dot notation in your column names will hamstring any further development, and force all other kdb users to use roundabout methods. The development will be slow and encounter many bugs.
I would advise that if periods must be included in the column, you create an API for external users to use to translate queries into the sanitised forms.
You can easily sanitise the whole table with .Q.id
q)tab:enlist `a.b`c`d!(1 2 3)
q)tab:.Q.id tab
q)sel:{[tab;cl] ?[tab;();0b;((),.Q.id each cl)!((),.Q.id each cl)]}
q)sel[tab;`a.b]
ab
--
1
How about the following, using take # :
q) `a.b`c#t
a.b c
-----------
4.931835 1
5.785203 9
0.8388858 5
To rename:
q) `b xcol t
b c d
---------------
4.931835 1 mil
5.785203 9 igf
0.8388858 5 kao
You can use .Q.id to rename any unselectable columns:
q).Q.id t
ab c d
---------------
4.931835 1 mil
5.785203 9 igf
0.8388858 5 kao
Best to avoid dots in columns names and symbols in general, use underscore if you must.

pig merge lists without key

In Apache Pig 0.15, I have two simple lists (WITHOUT id/primary key, etc.) that I want to merge together to create one list of tuples with two columns. Example:
Names
-----
Peter
John
Anne
Ages
-----
45
23
44
I want to end up with:
Names Age
---------------
Peter 45
John 23
Anne 44
I know I can use RANK on both lists and then JOIN, but that looks way too costly as I have millions of entries in these lists. I kind of want to do a JOIN with "merge" without having a join parameter...
Any idea about how to do this efficiently in Apache Pig?
If you do not care about the mapping between Age and Name then you can try cross-join between two relations. Post Cross join group by names and retain anyone out of it. However IMO, this may be more costlier ( rather resource intensive) than the RANK approach you mentioned above.

postgres ANY() with BETWEEN Condition

in case someone is wondering, i am recycling a different question i answered myself, because is realized that my problem has a different root-cause than i thought:
My question actually seems pretty simple, but i cannot find a way.
How do is query postgres if any element of an array is between two values?
The Documentation states that a BETWEEN b and c is equivalent to a > b and a < c
This however does not work on arrays, as
ANY({1, 101}) BETWEEN 10 and 20 has to be false
while
ANY({1,101}) > 10 AND ANY({1,101}) < 20 has to be true.
{1,101} meaning an array containing the two elements 1 and 101.
how can i solve this problem, without resorting to workarounds?
regards,
BillDoor
EDIT: for clarity:
The scenario i have is, i am querying an xml document via xpath(), but for this problem a column containing an array of type int[] does the job.
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Bob
I want all Names, where there is a number that is between 20 and 30 - so i want Bob
The query
SELECT name from table where ANY(numbers) > 20 AND ANY(numbers) < 30
will return Alice and Bob, showing that alice has numbers > 20 as well as other numbers < 30.
A BETWEEN syntax is not allowed in this case, however between only gets mapped to > 20 AND < 30 internally anyways
Quoting the docs on the Between Operators' mapping to > and < documentation:
There is no difference between the two respective forms apart from the
CPU cycles required to rewrite the first one into the second one
internally.
PS.:
Just to avoid adding a new question for this: how can i solve
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Alicia
SELECT id FROM table WHERE ANY(name) LIKE 'Alic%'
result: 1, 2
i can only find examples of matching one value to multiple regex, but not matching one regex against a set of values :/. Besides the shown syntax is invalid, ANY has to be the second operand, but the second operand of LIKE has to be the regex.
exists (select * from (select unnest(array[1,101]) x ) q1 where x between 10 and 20 )
you can create a function based on on this query
second approach:
select int4range(10,20,'[]') #> any(array[1, 101])
for timestamps and dates its like:
select tsrange( '2015-01-01'::timestamp,'2015-05-01'::timestamp,'[]') #> any(array['2015-05-01', '2015-05-02']::timestamp[])
for more info read: range operators

Counting multiple values from one column in Tableau

I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the contains() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = contains(Persons, 'Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you write and union several queries with each one having the project and nth element.
Ideally, you'd reshape your data to look like this either in the database or with the above mentioned union technique. Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob

kdb ticker plant: where to find documentation on .u.upd?

I am aware of this resource. But it does not spell out what parameters .u.upd takes and how to check if it worked.
This statement executes without error, although it does not seem to do anything:
.u.upd[`t;(`$"abc";1;2;3)]
If I define the table beforehand, e.g.
t:([] name:"aaa";a:1;b:2;c:3)
then the above .u.upd still runs without error, and does not change t.
.u.upd has the same function signature as insert (see http://code.kx.com/q/ref/qsql/#insert) in prefix form. In the most simplest case, .u.upd may get defined as insert.
so:
.u.upd[`table;<records>]
For example:
q).u.upd:insert
q)show tbl:([] a:`x`y;b:10 20)
a b
----
x 10
y 20
q).u.upd[`tbl;(`z;30)]
,2
q)show tbl
a b
----
x 10
y 20
z 30
q).u.upd[`tbl;(`a`b`c;1 2 3)]
3 4 5
q)show tbl
a b
----
x 10
y 20
z 30
a 1
b 2
c 3
Documentation including the event sequence, connection diagram etc. for tickerplants can be found here:
http://www.timestored.com/kdb-guides/kdb-tick-data-store
.u.upd[tableName; tableData] accepts two arguments, for inserting data
to a named table. This function will normally be called from a
feedhandler. It takes the tableData, adds a time column if one is
present, inserts it into the in-memory table, appends to the log file
and finally increases the log file counter.