How to query with "IN" in Q (kdb)? - kdb

Let's assume that I have a table in kdb named "Automotive" with the following data:
Manufacturer  Country        Sales Id
Mercedes      United States  002
Mercedes      Canada         002
Mercedes      Germany        003
Mercedes      Switzerland    003
Mercedes      Japan          004
BMW           United States  002
BMW           Canada         002
BMW           Germany        003
BMW           Switzerland    003
BMW           Japan          004
How would I structure a query in Q such that I can fetch the records matching United States and Canada without using an OR clause?
In SQL, it would look something like:
SELECT Manufacturer, Country from Automotive WHERE Country IN ('United States', 'Canada')
Thanks in advance for helping this Q beginner!

It's basically the same in kdb. The way you write your query depends on the data type. Below is an example where manufacturer is a symbol and country is a string.
q)tbl:([]manufacturer:`Merc`Merc`BMW`BMW`BMW;country:("United States";"Canada";"United States";"Germany";"Japan");ID:til 5)
q)
q)tbl
manufacturer country         ID
-------------------------------
Merc         "United States" 0
Merc         "Canada"        1
BMW          "United States" 2
BMW          "Germany"       3
BMW          "Japan"         4
q)meta tbl
c           | t f a
------------| -----
manufacturer| s
country     | C
ID          | j
q)select from tbl where manufacturer in `Merc`Ford
manufacturer country         ID
-------------------------------
Merc         "United States" 0
Merc         "Canada"        1
q)
q)select from tbl where country in ("United States";"Canada")
manufacturer country         ID
-------------------------------
Merc         "United States" 0
Merc         "Canada"        1
BMW          "United States" 2
Check out how to use Q-sql here: https://code.kx.com/q4m3/9_Queries_q-sql/

Related

PySpark: Generating Unique IDs for Data with Non-uniform and Repeated IDs

I have a dataset as indicated below. The first column is a date column, the second column is an id that can apply to any user (within the same department), and a given user can have multiple of these ids. The third column is specific to a particular user, but every so often this id can be changed in case a user forgets it. The last column contains names, and individuals can have similar names. I want to harmonize the id columns (Id2 and id3) prior to analysis as indicated below.
Input Dataset
Date       | Id2 | id3 | Name
-----------|-----|-----|------
2018-01-01 | 001 | AA  | Name1
2018-01-02 | 001 | AB  | Name1
2018-01-01 | 001 | AC  | Name1
2018-01-04 | 003 | AA  | Name1
2018-01-01 | 004 | AA  | Name1
2018-01-01 | 001 | AD  | Name2
2018-01-04 | 002 | AE  | Name3
2018-01-04 | 005 | AG  | Name1
Desired output:
Date       | Id2 | id3 | Name
-----------|-----|-----|------
2018-01-01 | 001 | AA  | Name1
2018-01-02 | 001 | AA  | Name1
2018-01-01 | 001 | AA  | Name1
2018-01-04 | 001 | AA  | Name1
2018-01-01 | 001 | AA  | Name1
2018-01-01 | 001 | AD  | Name2
2018-01-04 | 002 | AE  | Name3
2018-01-04 | 005 | AG  | Name1
So we have the table below, and you are trying to map each id2 to just one id3, because (as you said) an id2 can be mapped to multiple id3 values. This would be my approach.
+---+---+
|id2|id3|
+---+---+
|001| AA|
|001| AC|
|003| BB|
|003| DE|
+---+---+
First, I would create a flag column that ranks the id3 values within each id2 that has multiple id3s, with this code.
import pyspark.sql.functions as F
from pyspark.sql import Window

flag_df = dataframe.withColumn(
    "dedup_flag",
    F.dense_rank().over(
        # ascending order keeps the lowest id3 per id2; use .desc() to keep the highest instead
        Window.partitionBy('id2').orderBy(F.col('id3').asc())
    )
)
which results in
+---+---+----------+
|id2|id3|dedup_flag|
+---+---+----------+
|001| AA|         1|
|001| AC|         2|
|003| BB|         1|
|003| DE|         2|
+---+---+----------+
Then all you need to do is filter where dedup_flag = 1 and do a left join between your original dataframe and the flag dataframe on id2, pulling in the single id3 chosen for each id2, with the code below.
final_df = dataframe.drop('id3').join(
    flag_df.filter(F.col('dedup_flag') == 1).select('id2', 'id3'),
    on='id2',
    how='left'
)
which results in
+---+---+
|id2|id3|
+---+---+
|001| AA|
|001| AA|
|003| BB|
|003| BB|
+---+---+
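Putting both steps together, here is a self-contained sketch of the same approach on a toy DataFrame (the data and the SparkSession setup are illustrative, not from the original question):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# Toy data: id2 001 and 003 each map to two different id3 values.
dataframe = spark.createDataFrame(
    [("001", "AA"), ("001", "AC"), ("003", "BB"), ("003", "DE")],
    ["id2", "id3"],
)

# Step 1: rank the id3 values within each id2 (rank 1 = the id3 we keep).
flag_df = dataframe.withColumn(
    "dedup_flag",
    F.dense_rank().over(Window.partitionBy("id2").orderBy(F.col("id3").asc())),
)

# Step 2: drop the old id3 and join the chosen rank-1 id3 back on, one value per id2.
final_df = dataframe.drop("id3").join(
    flag_df.filter(F.col("dedup_flag") == 1).select("id2", "id3"),
    on="id2",
    how="left",
)
final_df.show()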
I hope this helped, acknowledge if this worked for you :) -Paula
Before I can help I need to understand the question: so for each unique ID in column 2, generate a unique ID column?

Pyspark: How to fill null values based on values in another column

I want to fill null values on a Spark df based on the values of the id column.
Pyspark df:
index | id  | animal | name
------|-----|--------|------
1     | 001 | cat    | doug
2     | 002 | dog    | null
3     | 001 | cat    | null
4     | 003 | null   | null
5     | 001 | null   | doug
6     | 002 | null   | bob
7     | 003 | bird   | larry
Expected result:
index | id  | animal | name
------|-----|--------|------
1     | 001 | cat    | doug
2     | 002 | dog    | bob
3     | 001 | cat    | doug
4     | 003 | bird   | larry
5     | 001 | cat    | doug
6     | 002 | dog    | bob
7     | 003 | bird   | larry
You can use last (or first) with a window function.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('id')
df = (df.withColumn('animal', F.last('animal', ignorenulls=True).over(w))
        .withColumn('name', F.last('name', ignorenulls=True).over(w)))
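For illustration, a minimal self-contained version of the same approach on the question's sample data (the SparkSession setup is assumed, not part of the original answer):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "001", "cat", "doug"),
        (2, "002", "dog", None),
        (3, "001", "cat", None),
        (4, "003", None, None),
        (5, "001", None, "doug"),
        (6, "002", None, "bob"),
        (7, "003", "bird", "larry"),
    ],
    ["index", "id", "animal", "name"],
)

# With partitionBy('id') and no orderBy, the window frame is the entire partition,
# so F.last(..., ignorenulls=True) returns the last non-null value seen for that id.
# If an id had conflicting non-null values, the result would depend on row order.
w = Window.partitionBy("id")
filled = (
    df.withColumn("animal", F.last("animal", ignorenulls=True).over(w))
      .withColumn("name", F.last("name", ignorenulls=True).over(w))
)
filled.orderBy("index").show()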

Extract jsonb fields as rows

I have a table
users
name: varchar(20)
data:jsonb
Records look something like this
adam, {"car": "chevvy", "fruit": "apple"}
john, {"car": "toyota", "fruit": "orange"}
I want to extract all the fields like this
name | type  | value
-----+-------+--------
adam | car   | chevvy
adam | fruit | apple
john | car   | toyota
john | fruit | orange
For your example you can do:
SELECT name, d.key AS type, d.value
FROM users u,
JSONB_EACH_TEXT(u.data) AS d
;
output:
 name | type  | value
------+-------+--------
 adam | car   | chevvy
 adam | fruit | apple
 john | car   | toyota
 john | fruit | orange
(4 rows)
There are good explanations here: PostgreSQL - jsonb_each

How can I compare 2 tables in PostgreSQL?

I have a table named hotel with 2 columns: hotel_name, hotel_price
hotel_name | hotel_price
hotel1 | 5
hotel2 | 20
hotel3 | 100
hotel4 | 50
and another table named city that contains the columns: city_name, average_prices
city_name | average_prices
paris | 20
london | 30
rome | 75
madrid | 100
I want to find which hotels are more expensive than the average prices in the cities. For example, I want to end up with something like this:
hotel_name | city_name
hotel3 | paris --hotel3 is more expensive than the average price in paris
hotel3 | london --hotel3 is more expensive than the average price in london etc.
hotel3 | rome
hotel4 | paris
hotel4 | london
(I found the hotels that are more expensive than the average prices of the cities)
Any help would be valuable, thank you.
A simple join is all that is needed. Typically tables are joined on a defined relationship (PK/FK) but there is nothing requiring that. See fiddle.
select h.hotel_name, c.city_name
from hotels h
join cities c
on h.hotel_price > c.average_prices;
However, while you can get the desired results, it's pretty meaningless. You cannot tell whether a particular hotel is even in a given city.

Remove duplicates in spark with 90 percent column match

Compare two rows in a Spark dataframe and remove a row if 90 percent of the columns match (if there are 10 columns and 9 of them match). How to do this?
Name Country City Married Salary
Tony India Delhi Yes 30000
Carol USA Chicago Yes 35000
Shuaib France Paris No 25000
Dimitris Spain Madrid No 28000
Richard Italy Milan Yes 32000
Adam Portugal Lisbon Yes 36000
Tony India Delhi Yes 22000 <--
Carol USA Chicago Yes 21000 <--
Shuaib France Paris No 20000 <--
The marked rows have to be removed, since 90 percent of the column values (here 4 out of 5) match rows that already exist. How to do this with a PySpark DataFrame? TIA
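No answer is shown here, but one possible sketch, assuming that "90 percent match" means two rows agree on all but at most one of the columns and that it does not matter which of the matching rows survives, is to drop duplicates once per column, each time ignoring that single column (the data and SparkSession setup below are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative subset of the question's data; the two Tony rows differ only in Salary.
df = spark.createDataFrame(
    [
        ("Tony",  "India", "Delhi",   "Yes", 30000),
        ("Carol", "USA",   "Chicago", "Yes", 35000),
        ("Tony",  "India", "Delhi",   "Yes", 22000),
    ],
    ["Name", "Country", "City", "Married", "Salary"],
)

deduped = df
for col in df.columns:
    # Rows identical in every column except `col` match on n-1 of n columns
    # (9 of 10 in the question), so collapse them to a single row.
    # Note: dropDuplicates keeps an arbitrary row per group unless you order first.
    subset = [c for c in df.columns if c != col]
    deduped = deduped.dropDuplicates(subset)

deduped.show()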