Scenario-based questions in DataStage

I have two scenario based questions here.
Question 1
Input Dataset
Col1
A
A
B
C
C
B
D
A
C
Output Dataset
Col1 Col2
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
D 1
Question 2
Input data string
AA-BB-CC-DD-EE-FF (can be of any delimiter and string can have any length)
Output data string
string 1 -> AA
string 2 -> BB
string 3 -> CC
string 4 -> DD
Thanks & Regards,
Subhasree

Question 1: This can be solved with a Transformer stage. Sort the data and use the LastRowInGroup() functionality.
For Col2, create a counter as a stage variable and add 1 for each row; reset it with a second stage variable when LastRowInGroup() is reached.
Alternatively, you could use a ROW_NUMBER() column in SQL.
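For illustration only, here is a minimal Python sketch of the sorted-counter logic described above (plain Python, not DataStage syntax; the stage variables become ordinary variables):
# Input values from the question; sorting first makes equal keys adjacent.
rows = ["A", "A", "B", "C", "C", "B", "D", "A", "C"]
counter, prev = 0, None
for col1 in sorted(rows):
    # Increment within a group; reset to 1 when the key changes
    # (the job done by the stage variables and LastRowInGroup()).
    counter = counter + 1 if col1 == prev else 1
    prev = col1
    print(col1, counter)  # A 1, A 2, A 3, B 1, B 2, C 1, C 2, C 3, D 1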
Question 2: You have not provided enough information. Is string 1 a column or a row? If you do not know anything upfront about the structure (any delimiter), this will get hard...
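That said, if the delimiter can at least be detected at run time, the split itself is simple. A minimal Python sketch, under the assumption that the delimiter is the first non-alphanumeric character in the string (the question leaves this open):
s = "AA-BB-CC-DD-EE-FF"
# Assumption: the delimiter is whatever non-alphanumeric character appears first.
delim = next(ch for ch in s if not ch.isalnum())
for i, token in enumerate(s.split(delim), start=1):
    print(f"string {i} -> {token}")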

When merging two queries in Power BI, can I exact match on one key and fuzzy match on a second key?

I am merging two tables in Power BI where I want to have an exact match on one field, then a fuzzy match on a second field.
In the example below, I want there to be an exact match on the "Key" columns in Table 1 and Table 2. In Table 2, the "Key" column is not a unique identifier and can have multiple names associated with a key, so I want to then fuzzy match on the name column. Is there a way to do this in Power BI?
Table 1
Key  Name1    info_a
1    Michael  a
2    Robert   b
Table 2
Key  Name2     info_b
1    Mike      aa
1    Andrea    cc
2    Robbie    bb
2    Michelle  dd
Result
Key  Name1    Name2   info_a  info_b
1    Michael  Mike    a       aa
2    Robert   Robbie  b       bb
I ended up using a Python script to solve this problem.
I merged Table 1 and Table 2 on the field ("Key") where an exact match was required.
Then I added this Python script:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

def get_fuzz_score(
    df: pd.DataFrame, col1: str, col2: str, scorer=fuzz.token_sort_ratio
) -> pd.Series:
    """
    Parameters
    ----------
    df: pd.DataFrame
    col1: str, name of column from df
    col2: str, name of column from df
    scorer: fuzzywuzzy scorer (e.g. fuzz.ratio, fuzz.WRatio, fuzz.partial_ratio, fuzz.token_sort_ratio)

    Returns
    -------
    scores: pd.Series
    """
    scores = []
    for _, row in df.iterrows():
        if row[col1] in [np.nan, None] or row[col2] in [np.nan, None]:
            scores.append(None)
        else:
            scores.append(scorer(row[col1], row[col2]))
    return pd.Series(scores, index=df.index)

dataset['fuzzy_score'] = get_fuzz_score(dataset, 'Name1', 'Name2', fuzz.WRatio)
dataset['MatchRank'] = dataset.groupby(['Key'])['fuzzy_score'].rank('first', ascending=False)
Then I could just consider the matches where MatchRank = 1.
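Assuming the column names from the script above, keeping only the best match per key is then a single filter:
best_matches = dataset[dataset['MatchRank'] == 1]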

KDB: How to update column values

I have a table which has a column of symbol type, like below.
Name    Value
First   TP_RTD_FRV
Second  RF_QWE_FRV
Third   KF_FRV_POL
I need to update it as below: wherever the value contains FRV, I need to replace it with AB_FRV. How can I achieve this?
Name    Value
First   TP_RTD_AB_FRV
Second  RF_QWE_AB_FRV
Third   KF_AB_FRV_POL
q)t
name v
---------------
0 TP_RTD_FRV
1 RF_QWE_FRV
2 KF_FRV_POL
3 THIS
4 THAT
q)update `$ssr[;"FRV";"AB_FRV"]each string v from t
name v
------------------
0 TP_RTD_AB_FRV
1 RF_QWE_AB_FRV
2 KF_AB_FRV_POL
3 THIS
4 THAT
or without using qSQL
q)@[t;`v;]{`$ssr[;"FRV";"AB_FRV"]each string x}
name v
------------------
0 TP_RTD_AB_FRV
1 RF_QWE_AB_FRV
2 KF_AB_FRV_POL
3 THIS
4 THAT
Depending on the uniqueness of the data, you might benefit from .Q.fu:
q)t:1000000#t
q)\t @[t;`v;]{`$ssr[;"FRV";"AB_FRV"]each string x}
2343
q)\t @[t;`v;].Q.fu {`$ssr[;"FRV";"AB_FRV"]each string x}
10
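For readers coming from pandas, the idea behind .Q.fu (evaluate the function once per unique value, then map the results back to every row) can be sketched as follows; the data is made up to mirror the q example:
import pandas as pd
# One million rows but only five distinct values, as in the timing above.
s = pd.Series(["TP_RTD_FRV", "RF_QWE_FRV", "KF_FRV_POL", "THIS", "THAT"] * 200000)
# Compute the replacement once per unique value, then map back to every row.
mapping = {v: v.replace("FRV", "AB_FRV") for v in s.unique()}
s = s.map(mapping)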

Select only those columns from a table that have non-null values in q/kdb

I have a table:
q)t:([] a:1 2 3; b:```; c:`a`b`c)
a b c
-----
1 a
2 b
3 c
From this table I want to select only the columns that have non-null values; in this case, column b should be omitted from the output (something similar to the dropna method in pandas).
expected output
a c
---
1 a
2 b
3 c
I tried many things, like
select from t where not null cols
but to no avail.
Here is a simple solution that does just what you want:
q)where[all null t]_t
a c
---
1 a
2 b
3 c
all null t gives a dictionary that indicates, per column, whether all of its values are null:
q)all null t
a| 0
b| 1
c| 0
where returns the keys of the dictionary whose values are true:
q)where[all null t]
,`b
Finally, you use _ to drop those columns from table t.
Hopefully this helps.
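For comparison, the pandas behaviour the question alludes to (dropping columns whose values are all null) would be:
import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan] * 3, "c": ["a", "b", "c"]})
df = df.dropna(axis=1, how="all")  # drops column b, keeps a and c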
A modification of Sander's solution which handles string columns (or any nested columns):
q)t:([] a:1 2 3; b:```; c:`a`b`c;d:" ";e:("";"";"");f:(();();());g:(1 1;2 2;3 3))
q)t
a b c d e f g
----------------
1 a "" 1 1
2 b "" 2 2
3 c "" 3 3
q)where[{$[type x;all null x;all 0=count each x]}each flip t]_t
a c g
-------
1 a 1 1
2 b 2 2
3 c 3 3
The nature of kdb is column-based, meaning that where clauses operate on the rows of a given column.
To make a qSQL query produce your desired behaviour, you would need to first examine all your columns to determine which are all null, and then feed that into a functional statement, which would be horribly inefficient.
Given that you need to fully examine all the column data regardless (to check whether all the values are null), the following will achieve that:
q)@[flip;;enlist] k!d k:key[d] where not all each null each value d:flip t
a c
---
1 a
2 b
3 c
Here I'm transforming the table into a dictionary, and extracting its values to determine if any columns consist only of nulls (all each null each). I'm then applying that boolean list to the keys of the dictionary (i.e., the column names) through a where statement. We can then reindex into the original dictionary with those keys and create a subset dictionary of non-null columns and convert that back into a table.
I've generalized the final transformation back into a table by habit with an error catch to ensure that the dictionary will be converted into a table even if only a single row is valid (preventing a 'rank error)

PySpark: How to concatenate two dataframes without duplicate rows?

I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:
A B
0 1 2
1 3 1
Dataframe B:
A B
0 5 6
1 3 1
I wish to merge them such that the final DataFrame is of the following shape:
Final Dataframe:
A B
0 1 2
1 3 1
2 5 6
How can I do this?
pyspark.sql.DataFrame.union and pyspark.sql.DataFrame.unionAll seem to yield the same result with duplicates.
Instead, you can get the desired output by using direct SQL:
dfA.createTempView('dataframea')
dfB.createTempView('dataframeb')
aunionb = spark.sql('select * from dataframea union select * from dataframeb')
Using SQL produces the expected/correct result.
In order to remove any duplicate rows, just use union() followed by a distinct().
As mentioned in the documentation:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
"union(other)
Return a new DataFrame containing union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct."
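A minimal sketch of that approach, reusing the dataframe names from the question:
aunionb = dfA.union(dfB).distinct()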
You just have to drop duplicates after the union.
df = dfA.union(dfB).dropDuplicates()

Perl + PostgreSQL: Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to take certain columns (Realtime, Value) and concatenate their values into a single row per group while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID  CAT  Realtime  Value
A   1    time1     55
A   1    time2     57
B   1    time3     75
C   2    time4     60
C   3    time5     66
C   3    time6     67
Output:
ID  CAT  Time         Values
A   1    time1,time2  55,57
B   1    time3        75
C   2    time4        60
C   3    time5,time6  66,67
You could do this most simply in Postgres like so (using array columns):
CREATE TEMP TABLE output AS
SELECT id, cat, array_agg(realtime) AS "time", array_agg(value) AS "values"
FROM input
GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
     , cat
     , string_agg(realtime, ',') AS realtimes
     , string_agg(value::text, ',') AS "values"
FROM   input
GROUP  BY 1, 2
ORDER  BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.