When merging two queries in Power BI, can I exact match on one key and fuzzy match on a second key? - merge

I am merging two tables in Power BI where I want to have an exact match on one field, then a fuzzy match on a second field.
In the example below, I want for there to be an exact match for the "Key" columns in Table 1 and Table 2. In table 2, the "Key" column is not a unique identifier and can have multiple names associated with a key. So, I want to then fuzzy match on the name column. Is there a way to do this in Power BI?
Table 1
Key
Name1
info_a
1
Michael
a
2
Robert
b
Table 2
Key
Name2
info_b
1
Mike
aa
1
Andrea
cc
2
Robbie
bb
2
Michelle
dd
Result
Key
Name1
Name2
info_a
info_b
1
Michael
Mile
a
aa
2
Robert
Robbie
b
bb

I ended up using a Python script to solve this problem.
I merged Table 1 and Table 2 on the field ("Key") where an exact match was required.
Then I added this Python script:
from fuzzywuzzy import fuzz
def get_fuzz_score(
df: pd.DataFrame, col1: str, col2: str, scorer=fuzz.token_sort_ratio
) -> pd.Series:
"""
Parameters
----------
df: pd.DataFrame
col1: str, name of column from df
col2: str, name of column from df
scorer: fuzzywuzzy scorer (e.g. fuzz.ratio, fuzz.Wratio, fuzz.partial_ratio, fuzz.token_sort_ratio)
Returns
-------
scores: pd.Series
"""
scores = []
for _, row in df.iterrows():
if row[col1]in [np.nan, None] or row[col2] in [np.nan, None]:
scores.append(None)
else:
scores.append(scorer(row[col1], row[col2]))
return scores
dataset['fuzzy_score'] = get_fuzz_score(dataset, 'Name1', 'Name2', fuzz.WRatio)
dataset['MatchRank'] = dataset.groupby(['Key'])['fuzzy_score'].rank('first', ascending=False)
Then I could just consider the matches where MatchRank = 1

Related

Type error of getting average by id in KDB

I am trying make a function for the aggregate consumption by mid in a kdb+ table (aggregate value by mid). Also this table is being imported from a csv file like this:
table: ("JJP";enlist",")0:`:data.csv
Where the meta data is for the table columns is:
mid is type long(j), value(j) is type long and ts is type timestamp (p).
Here is my function:
agg: {select avg value by mid from table}
but I get the
'type
[0] get select avg value by mid from table
But the type of value is type long (j). So I am not sure why I can't get the avg I also tried this with type int.
Value can't be used as a column name because it is keyword used in kdb+. Renaming the column should correct the issue.
value is a keyword and should not be used as a column name.
https://code.kx.com/q/ref/value/
You can remove it as a column name using .Q.id
https://code.kx.com/q/ref/dotq/#qid-sanitize
q)t:flip`value`price!(1 2;1 2)
q)t
value price
-----------
1 1
2 2
q)t:.Q.id t
q)t
value1 price
------------
1 1
2 2
Or xcol
https://code.kx.com/q/ref/cols/#xcol
q)(enlist[`value]!enlist[`val]) xcol t
val price
---------
1 1
2 2
You can rename the value column as you read it:
flip`mid`val`ts!("JJP";",")0:`:data.csv

select only those columns from table have not null values in q kdb

I have a table:
q)t:([] a:1 2 3; b:```; c:`a`b`c)
a b c
-----
1 a
2 b
3 c
From this table I want to select only the columns who have not null values, in this case column b should be omitted from output.(something similar to dropna method in pandas).
expected output
a c
---
1 a
2 b
3 c
I tried many things like
select from t where not null cols
but of no use.
Here is a simple solution that does just what you want:
q)where[all null t]_t
a c
---
1 a
2 b
3 c
[all null t] gives a dictionary that checks if the column values are all null or not.
q)all null t
a| 0
b| 1
c| 0
Where returns the keys of the dictionary where it is true
q)where[all null t]
,`b
Finally you use _ to drop the columns from table t
Hopefully this helps
A modification of Sander's solution which handles string columns (or any nested columns):
q)t:([] a:1 2 3; b:```; c:`a`b`c;d:" ";e:("";"";"");f:(();();());g:(1 1;2 2;3 3))
q)t
a b c d e f g
----------------
1 a "" 1 1
2 b "" 2 2
3 c "" 3 3
q)where[{$[type x;all null x;all 0=count each x]}each flip t]_t
a c g
-------
1 a 1 1
2 b 2 2
3 c 3 3
The nature of kdb is column based, meaning that where clauses function on the rows of a given column.
To make a QSQL query produce your desired behaviour, you would need to first examine all your columns and determine which are all null, and then feed that into a functional statement. Which would be horribly inefficient.
Given that you need to fully examine all the columns data regardless (to check if all the values are null) the following will achieve that
q)#[flip;;enlist] k!d k:key[d] where not all each null each value d:flip t
a c
---
1 a
2 b
3 c
Here I'm transforming the table into a dictionary, and extracting its values to determine if any columns consist only of nulls (all each null each). I'm then applying that boolean list to the keys of the dictionary (i.e., the column names) through a where statement. We can then reindex into the original dictionary with those keys and create a subset dictionary of non-null columns and convert that back into a table.
I've generalized the final transformation back into a table by habit with an error catch to ensure that the dictionary will be converted into a table even if only a single row is valid (preventing a 'rank error)

KDB: How to assign string datatype to all columns

When I created the table Tab, I specified the columns as string,
Tab: ([Key1:string()] Col1:string();Col2:string();Col3:string())
But the column datatype (t) is empty. I suppose specifying the column as string has no effect.
meta Tab
c t f a
--------------------
Key1
Col1
Col2
Col3
After I do a bulk upsert in Java...
c.Dict dict = new c.Dict((Object[]) columns.toArray(new String[columns.size()]), data);
c.Flip flip = new c.Flip(dict);
conn.c.ks("upsert", table, flip);
The datatypes are all symbols:
meta Tab
c t f a
--------------------
Key1 s
Col1 s
Col2 s
Col3 s
How can I specify the datatype of the columns as string and have it remain as string?
You cant define a column of the empty table with as strings as they are merely lists of lists of characters
You can just set them as empty lists which is what your code is doing.
But the column will then take on the type of whatever data is inserted into it.
Real question is what is your java process sending symbols when it should be sending strings. You need to make the change there before publishing to KDB
Note if you define as chars you still wont be able to upsert strings
q)Tab: ([Key1:`char$()] Col1:`char$();Col2:`char$();Col3:`char$())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
'rank
[0] Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
^
q)Tab: ([Key1:()] Col1:();Col2:();Col3:())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
Key1 | Col1 Col2 Col3
------| --------------------
"test"| "test" "test" "test"
KDB does not allow to define column types as list during creation of table. So that means you can not define your column type as String because that is also a list.
To do that only way is to define column as empty list like:
q) t:([]id:`int$();val:())
Then when you insert data to this table the column will automatically take type of that data.
q)`t insert (4;"row1")
q) meta t
c | t f a
---| -----
id | i
val| C
In your case, one option is to send string data from your Java process as mentioned by user 'emc211' or other option is to convert your data to string in KDB process before insertion.

Scenario based questions in Datastage

I have two scenario based questions here.
Question 1
Input Dataset
Col1
A
A
B
C
C
B
D
A
C
Output Dataset
Col1 Col2
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
D 1
Question2
Input data string
AA-BB-CC-DD-EE-FF (can be of any delimiter and string can have any length)
Output data string
string 1 -> AA
string 2 -> BB
string 3 -> CC
string 4 -> DD
Thanks & Regards,
Subhasree
Question 1: Can be solved with a transformer. Sort the data and use the lastrowingroup functionality.
For Col2 just create a counter as a stage variable and add 1 for each row - if reset it with a second stage variable if lastrowingroup is reached.
Aternatively you could use a rownumber column in SQL.
Question2: You have not provided enough information. Is string1 a column or row? If you do not know anything upfront about the structure (any delimiter) this will get hard...

Perl + PostgreSQL-- Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process a PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to choose certain columns (Realtime, Value) in a file to concatenate certains columns to create a row while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time 1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
, string_agg(value, ',') AS values
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values to a delimiter-separated string - while array_agg() (v8.4+) creates am array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.