Pyspark : Filter dataframe based on null values in two columns - pyspark

I have a dataframe like this
id  customer_name  city     order
1   John           dallas   5
2   steve          null     4
3   null           austin   3
4   Ryan           houston  2
5   null           null     6
6   nyle           austin   4
I want to filter out the rows where customer_name and city are both null. If either of them has a value, the row should not be filtered out. The result should be:
id  customer_name  city     order
1   John           dallas   5
2   steve          null     4
3   null           austin   3
4   Ryan           houston  2
6   nyle           austin   4
I can only find how to write a filter condition on one column. How do I filter based on two columns?

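Use coalesce. It returns the first non-null value among its arguments, so the result is null only when both columns are null.
from pyspark.sql.functions import coalesce
df.filter(coalesce('customer_name', 'city').isNotNull())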
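
To see why this keeps the right rows, here is a minimal plain-Python sketch of the same logic that needs no Spark session. The `coalesce` helper and the `rows` list are illustrative stand-ins, not PySpark objects, and `None` plays the role of null.

```python
# Plain-Python sketch of the coalesce filter; None stands in for null.
rows = [
    (1, "John", "dallas", 5),
    (2, "steve", None, 4),
    (3, None, "austin", 3),
    (4, "Ryan", "houston", 2),
    (5, None, None, 6),
    (6, "nyle", "austin", 4),
]


def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Keep a row when coalesce(customer_name, city) is not null,
# i.e. when at least one of the two columns has a value.
kept = [row for row in rows if coalesce(row[1], row[2]) is not None]
kept_ids = [row[0] for row in kept]
print(kept_ids)
```

Only the row with id 5 has null in both columns, so it is the only one dropped.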

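I believe this will work, using f as the alias for pyspark.sql.functions. A row should be kept when at least one of the two columns has a value, so the conditions are combined with | (OR); combining them with & (AND) would also drop rows where only one of the columns is null.
import pyspark.sql.functions as f
df.filter(f.col("customer_name").isNotNull() | f.col("city").isNotNull())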

Related

Full Outer Joins In PostgreSql [duplicate]

I've created a table of students with columns student_id as primary key,
student_name and gender.
I have another table, gender, which consists of gender_id and gender.
The gender column in students refers to gender_id in the gender table.
The table data looks like this:
Student table
STUDENT_ID STUDENT_NAME GENDER
1 Ajith 1
2 Alan 1
3 Ann 2
4 Alexa 2
5 Amith 1
6 Nisha 2
7 Rathan 1
8 Rebecca 2
9 asdf null
10 asd null
11 dbss null
Gender Table
GENDER_ID GENDER
1 Male
2 Female
3 Others
My query and its result
SELECT S.STUDENT_NAME,
G.GENDER
FROM STUDENTS S
FULL OUTER JOIN GENDER G ON G.GENDER_ID = S.GENDER
The result has 12 rows, including the Others value from the gender table:
STUDENT_ID  STUDENT_NAME  GENDER
1           Ajith         Male
2           Alan          Male
3           Ann           Female
4           Alexa         Female
5           Amith         Male
6           Nisha         Female
7           Rathan        Male
8           Rebecca       Female
                          Others
9           asdf
10          asd
11          dbss
I'm trying to restrict a particular student_id:
SELECT S.STUDENT_ID,
S.STUDENT_NAME,
G.GENDER
FROM STUDENTS S
FULL OUTER JOIN GENDER G ON G.GENDER_ID = S.GENDER
WHERE S.STUDENT_ID <> 11;
Now the total number of rows is reduced to 10:
STUDENT_ID  STUDENT_NAME  GENDER
1           Ajith         Male
2           Alan          Male
3           Ann           Female
4           Alexa         Female
5           Amith         Male
6           Nisha         Female
7           Rathan        Male
8           Rebecca       Female
9           asdf
10          asd
Why has the row with the Others value disappeared from the second select query?
I'm trying to find the cause of this issue.
That's because the Others row has no matching student, so its student_id is NULL. NULL <> 11 is not TRUE, but NULL, and only rows where the condition is TRUE are included in the result.
You'd have to write something like
WHERE s.student_id IS DISTINCT FROM 11
Your second select query returns only the rows where student_id is known to be different (<>) from 11, which excludes the row whose student_id is NULL.
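
The effect can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so two things are assumptions rather than part of the original setup: the table `joined` is a hypothetical, trimmed-down version of the join result, and SQLite's null-safe `IS NOT` plays the role of PostgreSQL's `IS DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for three rows of the full outer join result.
conn.execute("CREATE TABLE joined (student_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO joined VALUES (?, ?)",
    [(10, None), (11, None), (None, "Others")],
)

# NULL <> 11 evaluates to NULL (None in Python), not to TRUE.
null_comparison = conn.execute("SELECT NULL <> 11").fetchone()[0]

# A plain <> therefore drops the row whose student_id is NULL.
plain = conn.execute(
    "SELECT student_id, gender FROM joined WHERE student_id <> 11"
).fetchall()

# A null-safe inequality keeps it (spelled IS NOT in SQLite).
null_safe = conn.execute(
    "SELECT student_id, gender FROM joined WHERE student_id IS NOT 11"
).fetchall()

print(null_comparison, plain, null_safe)
```

With the plain `<>` only student 10 survives; with the null-safe comparison the Others row is kept as well.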

kdb items being list and convert into row

I have the following kdb table
name value price
----------------
Paul 1 2   3 4
where value and price are lists. How can I convert them into
name value price
------------------------------
Paul 1 3
Paul 2 4
? Thanks!!
ungroup is what you're looking for here.
As an aside, "value" is a reserved word in q and you should get an 'assign error if you try to use it as a column name.
q)t:([]name:`Paul;value:enlist 1 2;price:enlist 3 4)
'assign
q)t:([]name:`Paul;val:enlist 1 2;price:enlist 3 4)
q)ungroup t
name val price
--------------
Paul 1 3
Paul 2 4
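
For readers who do not use q, the idea behind ungroup can be sketched in plain Python. This is only an illustration of the behaviour shown above, not how kdb+ implements it; the `ungroup` function here is a hypothetical helper, and it assumes that all list-valued columns within a row have the same length.

```python
def ungroup(table):
    """Expand list-valued columns into one row per element.

    Scalar columns are repeated on every generated row. Assumes the
    list columns of a given row all have the same length.
    """
    result = []
    for row in table:
        list_cols = {k: v for k, v in row.items() if isinstance(v, list)}
        if not list_cols:
            result.append(dict(row))
            continue
        # zip walks the list columns in parallel, one element at a time.
        for values in zip(*list_cols.values()):
            new_row = dict(row)
            new_row.update(zip(list_cols.keys(), values))
            result.append(new_row)
    return result


t = [{"name": "Paul", "val": [1, 2], "price": [3, 4]}]
flat = ungroup(t)
for r in flat:
    print(r)
```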

query multiple attribute in a table with single attribute in another table

I can't explain my problem well in English, so I'll describe it with an example.
user_id name surname
1 john great
2 mary white
3 joseph alann
event_id official_id assistant_id date
1 1 2 2017-12-19
2 1 3 2017-12-20
3 2 3 2017-12-21
I want to get the names of both the official and the assistant when I query an event. I tried:
SELECT * FROM event a, user b WHERE a.official_id=b.user_id AND a.assistant_id=b.user_id
When I use "OR" instead of "AND", it gives me a cartesian result. I want a result like:
event_id off_id off_name asst_id asst_name date
1 1 john 2 mary 2017-12-19
2 1 john 3 joseph 2017-12-20
3 2 mary 3 joseph 2017-12-21
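
One common way to get this shape is to join the user table twice, once for the official and once for the assistant, giving each copy its own alias. The sketch below tries that on the sample data using Python's built-in sqlite3 module; the choice of SQLite, the aliases `o` and `a`, and the quoting of `"user"` are assumptions made for the illustration, since the question does not say which database is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (user_id INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE event (event_id INTEGER PRIMARY KEY, official_id INTEGER,
                    assistant_id INTEGER, date TEXT);
INSERT INTO "user" VALUES (1, 'john', 'great'), (2, 'mary', 'white'),
                          (3, 'joseph', 'alann');
INSERT INTO event VALUES (1, 1, 2, '2017-12-19'), (2, 1, 3, '2017-12-20'),
                         (3, 2, 3, '2017-12-21');
""")

# Each alias of "user" resolves one of the two foreign keys.
rows = conn.execute("""
    SELECT e.event_id,
           e.official_id  AS off_id,  o.name AS off_name,
           e.assistant_id AS asst_id, a.name AS asst_name,
           e.date
    FROM event e
    JOIN "user" o ON o.user_id = e.official_id
    JOIN "user" a ON a.user_id = e.assistant_id
    ORDER BY e.event_id
""").fetchall()

for row in rows:
    print(row)
```

The original attempt cannot match any row because a single copy of user would need the same user_id to equal both official_id and assistant_id.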

How do I return rows with a specific value last and order by name?

I want my query to move the names that have a row where a column (code) contains the value 3 or 4
to the bottom, and otherwise order the rows by name.
If I have a table something like this example:
Name Code
-------------
Arun 1
Arun 2
Arun 3
Arun 4
Babu 1
Babu 3
Raj 1
Raj 2
Ashok 1
Ashok 2
Using that table, I want my query to put the names that have a row where the column (code) contains the value 3 or 4 at the bottom, and then order by name. Is this possible using only one query?
Expected output:
Name Code
------------
Ashok 1
Ashok 2
Raj 1
Raj 2
Arun 1
Arun 2
Arun 3
Arun 4
Babu 1
Babu 3
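You could add a subquery to your ORDER BY clause so that names that have a code of 3 or 4 sort after the others. The subquery returns NULL for names without such a code, so this relies on NULLs sorting first under DESC, which is the PostgreSQL default; MySQL, for example, sorts NULLs last under DESC.
SELECT atable.name, atable.code
FROM atable
ORDER BY (
    SELECT a1.code
    FROM atable a1
    WHERE a1.name = atable.name
      AND a1.code IN ( 3, 4 )
    LIMIT 1 ) DESC, atable.name, atable.code;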
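
As a variation that does not depend on how a database orders NULLs, the names that have a 3 or 4 can be flagged explicitly and the flag sorted first. The sketch below is a different technique from the subquery above: it left-joins a derived table of flagged names (aliased `f`, a name chosen for this illustration) and runs on SQLite through Python's built-in sqlite3 module, which is an assumption since the question does not name a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (name TEXT, code INTEGER)")
conn.executemany(
    "INSERT INTO atable VALUES (?, ?)",
    [
        ("Arun", 1), ("Arun", 2), ("Arun", 3), ("Arun", 4),
        ("Babu", 1), ("Babu", 3),
        ("Raj", 1), ("Raj", 2),
        ("Ashok", 1), ("Ashok", 2),
    ],
)

# f holds the names that have at least one code of 3 or 4.
# "f.name IS NOT NULL" is 0 for unflagged names and 1 for flagged ones,
# so unflagged names sort first, then everything is ordered by name, code.
rows = conn.execute("""
    SELECT a.name, a.code
    FROM atable a
    LEFT JOIN (SELECT DISTINCT name FROM atable WHERE code IN (3, 4)) f
           ON f.name = a.name
    ORDER BY f.name IS NOT NULL, a.name, a.code
""").fetchall()

for row in rows:
    print(row)
```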

Counting the size of a group in T-sql

I'd like to count the size of a group.
My table looks like this:
Name Number
Renee Scott 1
Bruno Cote 1
Andree Scott 2
Renee Scott 2
Pierre Dion 2
Pierre Dion 3
Louise Tremblay 3
Renee Scott 3
Andree Scott 3
Jean Barre 3
Bruno Cote 3
There are 2 names associated with Number 1, 3 names with Number 2, and 6 names with Number 3. I'd like to select the rows of this table where the Number is associated with 3 names or more.
Thank you.
SELECT * FROM TABLENAME WHERE NUMBER IN
(
    SELECT NUMBER FROM TABLENAME GROUP BY NUMBER HAVING COUNT(*) >= 3
)
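
A quick way to check the boundary case is to run the query on the sample data. The sketch below does so with Python's built-in sqlite3 module rather than SQL Server, which is an assumption made to keep the example self-contained; it uses `>= 3` because the question asks for "3 names or more", so Number 2 with exactly 3 names must be included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (name TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [
        ("Renee Scott", 1), ("Bruno Cote", 1),
        ("Andree Scott", 2), ("Renee Scott", 2), ("Pierre Dion", 2),
        ("Pierre Dion", 3), ("Louise Tremblay", 3), ("Renee Scott", 3),
        ("Andree Scott", 3), ("Jean Barre", 3), ("Bruno Cote", 3),
    ],
)

# The inner query finds the numbers whose group has at least 3 rows;
# the outer query returns every row belonging to those groups.
rows = conn.execute("""
    SELECT name, number FROM tablename
    WHERE number IN (
        SELECT number FROM tablename GROUP BY number HAVING COUNT(*) >= 3
    )
    ORDER BY number, name
""").fetchall()

numbers_kept = sorted({number for _, number in rows})
print(numbers_kept, len(rows))
```

Numbers 2 and 3 are kept (9 rows in total), while Number 1 with only 2 names is dropped.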