How do I update or delete values in a nested dictionary in kdb?

I would like to delete or update values in a nested dictionary, e.g.:
d:`date`tab`col!((2022.12.01;2022.12.03);`TRADE`SYM;`ID`CODE`PIN`NAME)
I would like to update `PIN to `Yen, or maybe delete `PIN and `CODE from the dictionary.

I think this may be slightly fiddly due to the nested nature, but replacing values can be done with a dictionary and fill (^). This would replace all instances of `PIN if there were multiple:
q)#[d;`col;{x^(enlist[`PIN]!enlist`YEN) x}]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`CODE`YEN`NAME
Deletions could be done with except.
q)#[d;`col;except[;`PIN`CODE]]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`NAME
I wouldn't be surprised to find better ways to do both these actions.

You could do something like:
q)#[d;`col;{x where not x in`CODE`PIN}]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`NAME

Very minor mods to @Thomas Smyth-Treliant's answer:
q)#[d;`col;] {x^(.[!]1#'`PIN`Yen) x} / (1) update `PIN to `Yen
q)#[d;`col;] except[;`PIN`CODE]      / (2) delete `PIN and `CODE

You could also use amend in an implicit function to update nested values in a dictionary:
q){.[d;(`col;x);:;`your`update]} where d[`col] in `ID`PIN
Output:
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `your`CODE`update`NAME

Related

Excel: Select the newest date from a list that contains multiple rows with the same ID

In Excel, I have a list with multiple rows of the same ID (column A), each with various dates recorded (Column B). I need to extract one row for each ID that contains the newest date. See below for example:
| Column A (ID) | Column B (Date) |
|---------------|-----------------|
| 00001 | 01/01/2022 |
| 00001 | 02/01/2022 |
| 00001 | 03/01/2022 | <-- I need this one
| 00002 | 01/02/2022 |
| 00002 | 02/02/2022 |
| 00002 | 03/02/2022 | <-- I need this one
| 00003 | 01/03/2022 |
| 00003 | 02/03/2022 |
| 00003 | 03/03/2022 | <-- I need this one
| 00004 | 01/04/2022 |
| 00004 | 02/04/2022 |
| 00004 | 03/04/2022 | <-- I need this one
| 00005 | 01/05/2022 |
| 00005 | 02/05/2022 |
| 00005 | 03/05/2022 | <-- I need this one
I need to extract one row for each unique ID, keeping the row with the newest date. It needs to look like this:
| Column A (ID) | Column B (Date) |
|---------------|-----------------|
| 00001 | 03/01/2022 |
| 00002 | 03/02/2022 |
| 00003 | 03/03/2022 |
| 00004 | 03/04/2022 |
| 00005 | 03/05/2022 |
I'm totally stumped and I can't seem to find the right answer (probably because of how I'm wording the question!)
Thank you!
Google searches for the answer turned up nothing. I don't know where to start in Excel with this; I thought perhaps DISTINCT or similar...
Assuming you have an Office 365 compatible version of Excel, you could do something like this (screenshot refers):
=INDEX(SORTBY(A2:B11,B2#,-1),SEQUENCE(1,1,1,1),SEQUENCE(1,2,1,1))
The first SEQUENCE is superfluous albeit convenient - you don't really require it when only one row is being returned. However, the self-same formula with a leading 2 in the first argument of that SEQUENCE returns the top two dates (in descending order), and so forth.
For those with Office 365, you could also do something like this:
=LARGE(B2#+(ROW(B2#)-ROW(B2))/1000,1)
i.e. adding a "little bit" to each date that we can subtract later and use as a unique reference (the row number in the original, unsorted list).
Then reverse-engineer it and throw it into an INDEX (H2 here being the cell holding the LARGE formula above), and voila:
=INDEX(A2:A11,ROUND((H2-ROUND(H2,0))*1000,6))
Caveats:
the ROUND(<>,6) is purely to eliminate Excel's irritating lack-of-precision issues.
this can also work if you're looking up text strings (i.e. attempting to sort alphabetically), except that LARGE doesn't work with strings - no problem, just convert via UNICODE, but good luck expanding out the string character by character with MID(<>,ROW(A1:OFFSET(A1,LEN(<>)-1)..,1).. ☺
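Outside Excel, for what it's worth, the same "newest date per ID" extraction is a one-liner in pandas; a minimal sketch, with illustrative column names not taken from the question:
import pandas as pd

# Sample data mirroring the question: several dates per ID
df = pd.DataFrame({
    'ID': ['00001', '00001', '00001', '00002', '00002'],
    'Date': pd.to_datetime(['01/01/2022', '02/01/2022', '03/01/2022',
                            '01/02/2022', '03/02/2022'], dayfirst=True),
})

# Keep only the newest date for each unique ID
latest = df.groupby('ID', as_index=False)['Date'].max()
print(latest)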

How to count occurrences of a string in a list column?

I have a data frame with 2 columns: Role and Skills (the skills are split into a list).
I wish to find the top 10 most common skills across all roles.
How can I make a data frame that would display the count of each of these skills (where the first row might be 4G: 123, etc.)?
The second thing I wish to accomplish is to check for overlapping skills between different roles.
So what I really want is a table where the first column is the full range of skills, a second column counts them, and a third column displays the distinct roles that have that skill in their list.
I have been trying to make this work for several hours, to no avail.
You can explode the skills array and regroup. Try this:
import pyspark.sql.functions as F

# sample data: one row per role, skills held in an array column
test = spark.createDataFrame([('TL',['python','java']),('PM',['PMP','python']),('TM',['python','java','c'])],schema=['role','skill'])
# one row per (role, skill) pair
test_exp = test.select('role',F.explode('skill').alias('skill'))
# count each skill and collect the distinct roles that have it
test_res = test_exp.groupby('skill').agg(F.count('role').alias('skill_count'),F.collect_set('role').alias('roles_associated'))
test_res.show()
+------+-----------+----------------+
| skill|skill_count|roles_associated|
+------+-----------+----------------+
|python| 3| [PM, TL, TM]|
| c| 1| [TM]|
| java| 2| [TL, TM]|
| PMP| 1| [PM]|
+------+-----------+----------------+
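The question also asks for the top 10 most common skills; from here that is just a matter of sorting by the count and limiting, e.g. continuing with test_res from above:
test_res.orderBy(F.desc('skill_count')).limit(10).show()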

PySpark selectExpr() not working for first() and last()

I have 2 statements which are, to my knowledge, exactly alike, but select() works fine while selectExpr() generates the following results.
+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
| 85123A| 22138|
+-----------------------+----------------------+
+-----------+----------+
|first_value|last_value|
+-----------+----------+
| StockCode| StockCode|
+-----------+----------+
The following is the implementation:
df.select(first(col("StockCode")), last(col("StockCode"))).show()
df.selectExpr("""first('StockCode') as first_value""", """last('StockCode') as last_value""").show()
Can anyone explain this behaviour?
selectExpr takes everything as a SQL select clause.
Hence anything you write in single quotes (') acts as a string literal in SQL. If you want to pass a column to selectExpr, use backticks (`) as below:
df.selectExpr("""first(`StockCode`) as first_value""", """last(`StockCode`) as last_value""").show()
Backticks also help you escape spaces in column names.
You can also go without backticks if your column name doesn't start with a number (like 12col) and doesn't have spaces in it (like column name):
df.selectExpr("""first(StockCode) as first_value""", """last(StockCode) as last_value""").show()
You should pass it like below:
df_b = df_b.selectExpr('first(count) as first', 'last(count) as last')
df_b.show(truncate = False)
+-----+----+
|first|last|
+-----+----+
|2527 |13 |
+-----+----+

Spark: explode multiple columns of one row into multiple rows

I have a problem converting one row with 3 columns into 3 rows.
For example:
ID | String | colA | colB | colC
1 | sometext | 1 | 2 | 3
I need to convert it into:
ID | String | resultColumn
1 | sometext | 1
1 | sometext | 2
1 | sometext | 3
I just have a DataFrame that matches the first schema (table):
val df: DataFrame
Note: I can do it using RDD, but is there another way? Thanks
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you further want to keep the column names, you can use a trick that consists of creating a column of arrays containing both the value and the column name. First create your expression:
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you cannot use a generator on nested expressions, you need 2 selects here):
df.select($"ID", $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose though; there might be a more elegant way to do this.
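For reference, a rough PySpark equivalent of the first one-liner above, as a sketch assuming the same column names:
import pyspark.sql.functions as F

df.select('ID', 'String', F.explode(F.array('colA', 'colB', 'colC')).alias('resultColumn')).show()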

Query to remove all redundant entries from a table

I have a Postgres table that describes relationships between entities; this table is populated by a process which I cannot modify. This is an example of that table:
+-----+-----+
| e1 | e2 |
|-----+-----|
| A | B |
| C | D |
| D | C |
| ... | ... |
+-----+-----+
I want to write a SQL query that will remove all unnecessary relationships from the table; for example, the relationship [D, C] is redundant as it's already defined by [C, D].
I have a query that deletes using a self join, but this removes everything to do with the relationship, e.g.:
DELETE FROM foo USING foo b WHERE foo.e2 = b.e1 AND foo.e1 = b.e2;
Results in:
+-----+-----+
| e1 | e2 |
|-----+-----|
| A | B |
| ... | ... |
+-----+-----+
However, I need a query that will leave me with one of the relationships, it doesn't matter which relationship remains, either [C, D] or [D, C] but not both.
I feel like there is a simple solution here but it's escaping me.
A general solution is to use the always unique pseudo-column ctid:
DELETE FROM foo USING foo b WHERE foo.e2 = b.e1 AND foo.e1 = b.e2
AND foo.ctid > b.ctid;
Incidentally it keeps the tuple whose physical location is nearest to the first data page of the table.
Assuming that an exact duplicate row is constrained against, there will always be at most two rows for a given relationship: (C,D) and (D,C) in your example. The same constraint also means that a pair with identical values, such as (C,C), might be legal but cannot be duplicated.
Assuming that the datatype involved has a sane definition of >, you can add a condition that the row to be deleted is the one where the first column > the second column, and leave the other untouched.
In your sample query, this would mean adding AND foo.e1 > foo.e2.
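Putting that together, the full statement would look something like this (a sketch; as noted, it assumes > is sanely defined for the column type):
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1 AND foo.e1 = b.e2
AND foo.e1 > foo.e2;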