How to remove null columns from a Spark DataFrame with PySpark

I have a DataFrame with null values in every row, like this:
+------+----+----+----+
|col1  |col2|col3|col4|
+------+----+----+----+
|null  |null|foo |null|
|null  |bar |null|null|
|null  |null|null|kid |
|orange|null|null|null|
+------+----+----+----+
I need to remove all the nulls so that the output DataFrame is a single row:
+------+----+----+----+
|col1  |col2|col3|col4|
+------+----+----+----+
|orange|bar |foo |kid |
+------+----+----+----+
What should I do to achieve the desired result?
Thanks

Another alternative (Scala):
import org.apache.spark.sql.functions.first
// take the first non-null value of every column
df1.select(df1.columns.map(c => first(c, ignoreNulls = true).as(c)): _*)
  .show(false)
/**
 * +------+----+----+----+
 * |col1  |col2|col3|col4|
 * +------+----+----+----+
 * |orange|bar |foo |kid |
 * +------+----+----+----+
 */
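To make clear what the first(..., ignoreNulls = true) aggregation computes, here is a minimal plain-Python model that runs without Spark. The helper name first_non_null_per_column and the list-of-dicts row layout are assumptions made for illustration only, with None standing in for null.

```python
def first_non_null_per_column(rows, columns):
    """Collapse rows into one row holding the first non-None value of each column."""
    result = {}
    for c in columns:
        # next() yields the first value that is not None,
        # or None when the whole column is null
        result[c] = next((r[c] for r in rows if r[c] is not None), None)
    return result


rows = [
    {"col1": None, "col2": None, "col3": "foo", "col4": None},
    {"col1": None, "col2": "bar", "col3": None, "col4": None},
    {"col1": None, "col2": None, "col3": None, "col4": "kid"},
    {"col1": "orange", "col2": None, "col3": None, "col4": None},
]
collapsed = first_non_null_per_column(rows, ["col1", "col2", "col3", "col4"])
print(collapsed)
```

In PySpark the same aggregate is spelled pyspark.sql.functions.first(c, ignorenulls=True), so the Scala one-liner should translate almost word for word.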

Here is an example with my test dataframe:
+----+----+----+----+
|a |b |c |d |
+----+----+----+----+
|null|null|cc |null|
|null|null|null|dc |
|null|bb |null|null|
|aa |null|null|null|
+----+----+----+----+
and the test code:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in max
df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
# max() skips nulls, so each column collapses to its non-null value
cols = [F.max(F.col(c)).alias(c) for c in df.columns]
df.groupBy().agg(*cols).show(10, False)
gives the results:
+---+---+---+---+
|a |b |c |d |
+---+---+---+---+
|aa |bb |cc |dc |
+---+---+---+---+
This uses groupBy() with no grouping columns together with the max aggregate, which skips nulls.
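One caveat when choosing between max and first: they agree only while each column holds a single non-null value. The plain-Python sketch below models both aggregates (the function names are hypothetical, and None stands in for null) to show where they diverge.

```python
def max_ignoring_nulls(values):
    """Model of a null-skipping max aggregate."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


def first_ignoring_nulls(values):
    """Model of a first aggregate that ignores nulls."""
    return next((v for v in values if v is not None), None)


one_value = [None, "bb", None, None]
two_values = [None, "bb", "zz", None]

# with a single non-null value both aggregates return it
print(max_ignoring_nulls(one_value), first_ignoring_nulls(one_value))    # bb bb
# with several non-null values max picks the largest, first the earliest
print(max_ignoring_nulls(two_values), first_ignoring_nulls(two_values))  # zz bb
```

For the data in the question every column has exactly one non-null value, so either aggregate gives the same single row.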

Related

SparkSQL- Add new column to DataFrame based on the aggregation

Having the following DataFrame:
+--------+----------+------------+
|user_id |level |new_columns |
+--------+----------+------------+
|4 |B |null |
|6 |B |null |
|5 |A |col1 |
|3 |B |col2 |
|5 |A |col2 |
|2 |A |null |
|1 |A |col3 |
+--------+----------+------------+
I need to convert each non-null value of the new_columns column into a new column, aggregating on the user_id column. The desired output would be
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+
As you can see, the value of the new columns comes from the level column in the base DF. I know how to use the withColumn method to add new columns on a DF but here the critical part is how to add new columns on the aggregated DF (for the case of the user_id = 5).
Any hint based on the DataFrame API would be appreciated.

You can do a pivot:
import org.apache.spark.sql.functions.first
val df2 = df.groupBy("user_id") // group by user_id, the key shown in the expected output
  .pivot("new_columns")
  .agg(first("level"))
  .drop("null") // drop the column produced by the null values of new_columns
df2.show
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+

You can also collect the non-null values of new_columns first and pass them to pivot, so no "null" column is created:
// .as[String] requires import spark.implicits._
val nonNull = df.select("new_columns").filter("new_columns is not null").distinct().as[String].collect
val df1 = df.groupBy("user_id")
  .pivot("new_columns", nonNull)
  .agg(first("level"))
df1.show
//+-------+----+----+----+
//|user_id|col3|col1|col2|
//+-------+----+----+----+
//| 1| A|null|null|
//| 6|null|null|null|
//| 3|null|null| B|
//| 5|null| A| A|
//| 4|null|null|null|
//| 2|null|null|null|
//+-------+----+----+----+
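The pivot can be pictured as building one cell per (user_id, pivot value) pair. The plain-Python sketch below models that on the question's data; the tuple layout and variable names are assumptions for illustration, and None stands in for null.

```python
# (user_id, level, new_columns) rows from the question
rows = [
    (4, "B", None), (6, "B", None), (5, "A", "col1"), (3, "B", "col2"),
    (5, "A", "col2"), (2, "A", None), (1, "A", "col3"),
]

# distinct non-null pivot values, like the explicit list passed to pivot
pivot_values = sorted({nc for _, _, nc in rows if nc is not None})

pivoted = {}
for user_id, level, new_col in rows:
    # every user gets one output row, initially all null
    cells = pivoted.setdefault(user_id, {c: None for c in pivot_values})
    # keep the first level seen per cell; level is never null in this data,
    # so this matches first("level")
    if new_col is not None and cells[new_col] is None:
        cells[new_col] = level

for user_id, cells in pivoted.items():
    print(user_id, cells)
```

Users whose new_columns is always null still get a row, with every pivoted cell left null, which matches the expected output for user_id 4, 6 and 2.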

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to keep only the rows of the first DataFrame whose country appears in the second one. An example:
DF1:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |ARG         |5      |
|2015-12-14    |GER         |1      |
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
|2015-12-14    |USA         |1      |
+--------------+------------+-------+
DF2:
+--------------+------------+
|USE           | country_id |
+--------------+------------+
| F            |RUS         |
| F            |CHN         |
+--------------+------------+
Expected:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect. Or would another method be more efficient?
Thanks in advance!

You can use a left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
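The two joins above return the same rows only while country_id is unique in DF2. The plain-Python sketch below models both on the question's data, with one hypothetical extra DF2 row ("T", "CHN") added to show the difference; the tuple layout is an assumption for illustration.

```python
# (Date, country_id, value) rows of DF1
df1 = [
    ("2015-12-14", "ARG", 5), ("2015-12-14", "GER", 1), ("2015-12-14", "RUS", 1),
    ("2015-12-14", "CHN", 3), ("2015-12-14", "USA", 1),
]
# (USE, country_id) rows of DF2; the second CHN row is hypothetical
df2 = [("F", "RUS"), ("F", "CHN"), ("T", "CHN")]

# left semi join: keep a DF1 row if its key exists in DF2, at most once
keys = {country_id for _, country_id in df2}
semi = [row for row in df1 if row[1] in keys]

# inner join projected back to DF1's columns: one output row per matching pair
inner = [row for row in df1 for _, country_id in df2 if row[1] == country_id]

print(len(semi), len(inner))  # 2 3
```

So when DF2 may contain repeated keys, the semi join is the safer filter; the inner join would need a distinct on DF2's key first.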

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However, the country codes are either ISO3 or ISO2.
dfJSON
  .select("value.country")
  .filter(size($"value.country") > 0)
  .groupBy($"country")
  .agg(count("*").as("cnt"))
The country field can contain either USA or US as the country code. I need to map both USA and US to "United States" and then do the groupBy. How do I do this in Scala?

Create another DataFrame with country_name, iso_2 and iso_3 columns.
Join your actual DataFrame with it and apply your logic to the joined data.
See the sample code below.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
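The join condition (iso_2 match OR iso_3 match) amounts to a lookup from either code to one country name, followed by a count per name. A minimal plain-Python model of that idea follows; the four-row lookup table and the variable names are assumptions for illustration.

```python
from collections import Counter

# (country_name, iso_2, iso_3) rows, a small excerpt of the lookup table
country_table = [
    ("United States of America", "US", "USA"),
    ("India", "IN", "IND"),
    ("Indonesia", "ID", "IDN"),
    ("Iraq", "IQ", "IRQ"),
]

# both the 2-letter and the 3-letter code point at the same name
code_to_name = {}
for name, iso2, iso3 in country_table:
    code_to_name[iso2] = name
    code_to_name[iso3] = name

codes = ["USA", "US", "IN", "IND", "ID", "IDN", "IQ", "IRQ"]

# .get() returns None for an unknown code, like the null a left join produces
counts = Counter(code_to_name.get(code) for code in codes)
print(counts)
```

Codes that are missing from the lookup table end up grouped under None, which is the same thing the left join does by leaving country_name null.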

GroupBy based on conditions in Spark dataframe

I have two DataFrames.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is actual dataframe where I need to apply groupBy operation
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the Key and Value columns.
For each key in Dataframe1, I need to take its value and use it for the groupBy operation on Dataframe2:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: generate three DataFrames, one per key in Dataframe1.
d1 = df.groupBy($"key1").sum("amount").show()
which has to become: df.groupBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2 = df.groupBy($"key2").sum("amount").show()
which has to become: df.groupBy($"column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code :
d3 = df.groupBy($"key3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
If I add more keys in the future, a DataFrame should be produced for each of them as well. Can someone help me?

Given the key/value DataFrame (which I would suggest not building as a DataFrame from the source data; the reason is given below):
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
The first DataFrame has to be collected to the driver as an array of key/value pairs anyway, which is why building it as a DataFrame is unnecessary:
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
Then loop over the pairs:
import org.apache.spark.sql.functions._
for (kv <- maps) {
  // split the comma-separated value into the grouping columns
  df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
  // you can store the results in separate DataFrames or write them to files or a database
}
You should get the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
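The loop above boils down to "split each value on commas, group by those columns, sum Amount". A plain-Python sketch of that logic on the same data follows; the helper group_sum and the dict-based row layout are hypothetical and only meant to make the three results easy to check.

```python
from collections import defaultdict

key_values = [("key1", "Column1"), ("key2", "Column2"), ("key3", "Column1,Column3")]

rows = [
    {"Column1": "A", "Column2": "A1", "Column3": "XYZ", "Amount": 100},
    {"Column1": "A", "Column2": "A1", "Column3": "XYZ", "Amount": 100},
    {"Column1": "A", "Column2": "A2", "Column3": "XYZ", "Amount": 10},
    {"Column1": "A", "Column2": "A3", "Column3": "PQR", "Amount": 100},
    {"Column1": "B", "Column2": "B1", "Column3": "XYZ", "Amount": 200},
    {"Column1": "B", "Column2": "B2", "Column3": "PQR", "Amount": 280},
    {"Column1": "B", "Column2": "B3", "Column3": "XYZ", "Amount": 20},
]


def group_sum(rows, group_cols, value_col="Amount"):
    """Sum value_col for every distinct combination of group_cols."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[c] for c in group_cols)] += r[value_col]
    return dict(totals)


# one result per key; adding a key/value pair adds a result with no other change
results = {key: group_sum(rows, value.split(",")) for key, value in key_values}
for key, totals in results.items():
    print(key, totals)
```

Because the grouping columns come from data rather than from code, new keys only require new entries in key_values.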

spark add a col to dataframe with conditions on another df

I have the following problem: I want to add a column RealCity to DataFrame A. When the City value is 'noClue', I want to select the city from DataFrame B, using the Key.
Table A:
+---------+--------+
| Key | City|
+---------+--------+
|a | PDX |
+---------+--------+
|b | noClue |
Table B:
+---------+--------+
| Key | Name |
+---------+--------+
|c | SYD |
+---------+--------+
|b | AKL |
I want to use .withColumn and when, but I can't select a value from another table (table B) that way. What's a good way of doing this? Many thanks!

Given that you have two dataframes
A:
+---+------+
|key|City |
+---+------+
|a |PDX |
|b |noClue|
+---+------+
B:
+---+----+
|key|Name|
+---+----+
|a |SYD |
|b |AKL |
+---+----+
You can simply join them on the common key column and use withColumn with the when function:
val finalDF = A.join(B, Seq("Key"), "left")
  .withColumn("RealCity", when($"City" === "noClue", $"Name").otherwise($"City"))
  .drop("Name")
You should get the following final output:
+---+------+--------+
|key|City |RealCity|
+---+------+--------+
|a |PDX |PDX |
|b |noClue|AKL |
+---+------+--------+
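The left join followed by when/otherwise is a keyed lookup with a fallback. Here is a minimal plain-Python model of it on the answer's data; the variable names are assumptions for illustration, and None stands in for null.

```python
# (key, City) rows of A
a = [("a", "PDX"), ("b", "noClue")]
# key -> Name lookup built from B
b = {"a": "SYD", "b": "AKL"}

final = []
for key, city in a:
    # when City == "noClue" take B's Name, otherwise keep City;
    # .get() gives None when the key is absent from B, like a left join's null
    real_city = b.get(key) if city == "noClue" else city
    final.append((key, city, real_city))

print(final)
```

A row whose City is 'noClue' and whose key is missing from B would end up with a null RealCity, so a further fallback may be needed if that case can occur.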
