I have a DataFrame with null values in every row, like this:
col1 col2 col3 col4
|--------|---------|---------|-------------|
|null | null | foo | null |
|--------|---------|---------|-------------|
| null | bar | null | null |
|--------|---------|---------|-------------|
| null | null | null | kid |
|--------|---------|---------|-------------|
| orange | null | null | null |
|--------|---------|---------|-------------|
I need to remove all the nulls so that the output DataFrame is a single row, like this:
col1 col2 col3 col4
|--------|---------|---------|-------------|
|orange | bar | foo | kid |
|--------|---------|---------|-------------|
What should I do to achieve the desired result?
thanks
Another alternative:
import org.apache.spark.sql.functions.first
// take the first non-null value of every column
df1.select(df1.columns.map(c => first(c, ignoreNulls = true).as(c)): _*)
.show(false)
/**
* +------+----+----+----+
* |col1 |col2|col3|col4|
* +------+----+----+----+
* |orange|bar |foo |kid |
* +------+----+----+----+
*/
Here is an example with my test dataframe:
+----+----+----+----+
|a |b |c |d |
+----+----+----+----+
|null|null|cc |null|
|null|null|null|dc |
|null|bb |null|null|
|aa |null|null|null|
+----+----+----+----+
and the test code:
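from pyspark.sql import functions as F  # avoids shadowing Python's built-in max
df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
# max skips nulls, so each column collapses to its only non-null value
cols = [F.max(F.col(c)).alias(c) for c in df.columns]
df.groupBy().agg(*cols).show(10, False)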
gives the results:
+---+---+---+---+
|a |b |c |d |
+---+---+---+---+
|aa |bb |cc |dc |
+---+---+---+---+
Here I used groupBy with no grouping columns together with the max function, which skips nulls.
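The two answers above agree on this data only because every column holds exactly one non-null value. As a rough illustration, here is a plain-Python model (not Spark code; `first_non_null`, `max_non_null` and `collapse` are hypothetical helper names, and `None` stands in for null) of what each aggregation computes, including a case where they diverge.

```python
# Plain-Python model of the two aggregations; None stands in for null.
rows = [
    {"a": None, "b": None, "c": "cc", "d": None},
    {"a": None, "b": None, "c": None, "d": "dc"},
    {"a": None, "b": "bb", "c": None, "d": None},
    {"a": "aa", "b": None, "c": None, "d": None},
]

def first_non_null(values):
    """Model of first(col, ignoreNulls=True): first non-null value in row order."""
    return next((v for v in values if v is not None), None)

def max_non_null(values):
    """Model of max(col): nulls are skipped; None if every value is null."""
    present = [v for v in values if v is not None]
    return max(present) if present else None

def collapse(rows, agg):
    """Aggregate every column of the table down to a single row."""
    columns = rows[0].keys()
    return {c: agg([r[c] for r in rows]) for c in columns}

by_first = collapse(rows, first_non_null)
by_max = collapse(rows, max_non_null)
print(by_first)
print(by_max)

# With a second non-null value in column "a" the two results differ.
extra = rows + [{"a": "zz", "b": None, "c": None, "d": None}]
print(collapse(extra, first_non_null)["a"])  # aa
print(collapse(extra, max_non_null)["a"])    # zz
```

In Spark itself, the row order seen by first is not guaranteed after a shuffle, so max is the more predictable choice when a column can hold several non-null values.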
Having the following DataFrame:
+--------+----------+------------+
|user_id |level |new_columns |
+--------+----------+------------+
|4 |B |null |
|6 |B |null |
|5 |A |col1 |
|3 |B |col2 |
|5 |A |col2 |
|2 |A |null |
|1 |A |col3 |
+--------+----------+------------+
I need to turn each non-null value of the new_columns column into a new column, aggregating by the user_id column. The desired output is:
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4 | null | null | null |
|6 | null | null | null |
|5 | A | A | null |
|3 | null | B | null |
|2 | null | null | null |
|1 | null | null | A |
+--------+------+------+------+
As you can see, the values of the new columns come from the level column of the base DataFrame. I know how to add new columns with the withColumn method, but the critical part here is how to add new columns to the aggregated DataFrame (see the case of user_id = 5).
Any hint based on the DataFrame API would be appreciated.
You can do a pivot:
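import org.apache.spark.sql.functions.first
val df2 = df.groupBy("user_id")
.pivot("new_columns")
.agg(first("level"))
.drop("null") // drop the column that pivot creates for null new_columns values
df2.show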
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4 | null | null | null |
|6 | null | null | null |
|5 | A | A | null |
|3 | null | B | null |
|2 | null | null | null |
|1 | null | null | A |
+--------+------+------+------+
Alternatively, you can collect the non-null values of new_columns first and pass them to pivot:
val nonNull = df.select("new_columns").filter("new_columns is not null").distinct().as[String].collect
val df1 = df.groupBy("user_id")
.pivot("new_columns", nonNull)
.agg(first("level"))
df1.show
//+-------+----+----+----+
//|user_id|col3|col1|col2|
//+-------+----+----+----+
//| 1| A|null|null|
//| 6|null|null|null|
//| 3|null|null| B|
//| 5|null| A| A|
//| 4|null|null|null|
//| 2|null|null|null|
//+-------+----+----+----+
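To make the groupBy / pivot / first combination above more concrete, here is a plain-Python model of it (not Spark code; the variable names `pivot_values` and `pivoted` are made up for this sketch). It assumes the same seven input rows as the question and uses `None` for null.

```python
# (user_id, level, new_columns) rows from the question; None stands in for null.
rows = [
    (4, "B", None), (6, "B", None), (5, "A", "col1"), (3, "B", "col2"),
    (5, "A", "col2"), (2, "A", None), (1, "A", "col3"),
]

# Distinct non-null pivot values, playing the role of the collected array.
pivot_values = sorted({c for _, _, c in rows if c is not None})

pivoted = {}
for user_id, level, new_col in rows:
    # Every user_id gets a row, even when all of its pivot cells stay null.
    cells = pivoted.setdefault(user_id, {c: None for c in pivot_values})
    # Keep the first level seen for each (user_id, pivot value) pair.
    if new_col is not None and cells[new_col] is None:
        cells[new_col] = level

for user_id in sorted(pivoted):
    print(user_id, pivoted[user_id])
```

Passing the pivot values explicitly means no column is produced for null in the first place, which is why the second answer does not need the drop("null") step.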
I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to keep only the rows of the first DataFrame whose country appears in the second one. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+-------+

+--------------+------------+
|USE | country_id |
+--------------+------------+
| F |RUS |
| F |CHN |
Expected:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
How can I do this? I am new to Spark, so I thought of maybe using intersect. Or would another method be more efficient?
Thanks in advance!
You can use a left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
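The two joins above return the same rows for this data, but they are not equivalent in general. The plain-Python model below (not Spark code; `left_semi` and `inner_left_columns` are hypothetical helper names) sketches the difference: a left semi join keeps each matching left row once, while an inner join repeats a left row for every matching right row.

```python
# (Date, country_id, value) rows of DF1 and (USE, country_id) rows of DF2.
df1 = [
    ("2015-12-14", "ARG", 5), ("2015-12-14", "GER", 1), ("2015-12-14", "RUS", 1),
    ("2015-12-14", "CHN", 3), ("2015-12-14", "USA", 1),
]
df2 = [("F", "RUS"), ("F", "CHN")]

def left_semi(left, right):
    """Keep each left row at most once if its country_id exists on the right."""
    keys = {country for _, country in right}
    return [row for row in left if row[1] in keys]

def inner_left_columns(left, right):
    """Inner join on country_id, then keep only the left-hand columns."""
    return [row for row in left for _, country in right if row[1] == country]

print(left_semi(df1, df2))
print(inner_left_columns(df1, df2))

# If DF2 lists a country twice, only the inner join duplicates the DF1 row.
df2_dup = df2 + [("T", "RUS")]
print(len(left_semi(df1, df2_dup)))           # 2
print(len(inner_left_columns(df1, df2_dup)))  # 3
```

If DF2 may contain repeated country_id values, the semi join (or an inner join against the distinct country_id values of DF2) avoids the duplicates.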
I have a DataFrame that has a list of countries and the corresponding data. However, the country codes are either ISO3 or ISO2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can contain either USA or US as the country code. I need to map both USA and US to "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 and iso_3 columns.
Join your actual DataFrame with it and apply your logic to the joined data.
See the sample code below.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
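The join condition above matches a code against either the iso_2 or the iso_3 column. As a rough plain-Python model of that lookup and the following count (not Spark code; `name_by_code` and `counts` are names invented for this sketch, and only the four countries from the sample output are included):

```python
from collections import Counter

# (country_name, iso_2, iso_3) rows, limited to the countries in the sample output.
country_rows = [
    ("United States of America", "US", "USA"),
    ("India", "IN", "IND"),
    ("Indonesia", "ID", "IDN"),
    ("Iraq", "IQ", "IRQ"),
]
codes = ["USA", "US", "IN", "IND", "ID", "IDN", "IQ", "IRQ"]

# One lookup table that accepts either the two-letter or the three-letter code.
name_by_code = {}
for name, iso2, iso3 in country_rows:
    name_by_code[iso2] = name
    name_by_code[iso3] = name

# Like the left join, an unknown code maps to None instead of being dropped.
counts = Counter(name_by_code.get(code) for code in codes + ["XX"])
for name, count in counts.items():
    print(name, count)
```

Because the join is a left join, codes with no match are kept with a null country_name and end up in their own group, which the `None` key models here.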
I have two DataFrames.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is actual dataframe where I need to apply groupBy operation
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the Key and Value columns.
For each key in Dataframe1, I need to take its value and use it as the grouping column(s) for a groupBy on Dataframe2:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: generate three DataFrames, one per key in Dataframe1.
d1 = df.groupBy($"key1").sum("amount").show()
which has to become: df.groupBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2 = df.groupBy($"key2").sum("amount").show()
result: df.groupBy($"column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code :
d3 = df.groupBy($"key3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In future, if I add more keys, it should produce a DataFrame for each of them as well. Can someone help me?
Given the key/value DataFrame (which I suggest you not build as a DataFrame from the source data; the reason is given below):
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
The key/value DataFrame has to be collected to the driver anyway, so convert it to an array of key/value pairs:
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
Then loop over the pairs and group by the columns named in each value:
import org.apache.spark.sql.functions._
for(kv <- maps){
df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
//you can store the results in separate dataframes or write them to files or database
}
You should get the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
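The loop above splits each value on commas and groups by the resulting columns. A plain-Python model of the same idea (not Spark code; `group_sum` and `results` are hypothetical names, and the data is copied from the two tables above) can be used to check the expected sums by hand:

```python
from collections import defaultdict

# Key -> comma-separated grouping columns, as in the key/value table.
key_to_columns = {"key1": "Column1", "key2": "Column2", "key3": "Column1,Column3"}

rows = [
    {"Column1": "A", "Column2": "A1", "Column3": "XYZ", "Amount": 100},
    {"Column1": "A", "Column2": "A1", "Column3": "XYZ", "Amount": 100},
    {"Column1": "A", "Column2": "A2", "Column3": "XYZ", "Amount": 10},
    {"Column1": "A", "Column2": "A3", "Column3": "PQR", "Amount": 100},
    {"Column1": "B", "Column2": "B1", "Column3": "XYZ", "Amount": 200},
    {"Column1": "B", "Column2": "B2", "Column3": "PQR", "Amount": 280},
    {"Column1": "B", "Column2": "B3", "Column3": "XYZ", "Amount": 20},
]

def group_sum(rows, columns):
    """Sum Amount per distinct combination of the given grouping columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[c] for c in columns)] += row["Amount"]
    return dict(totals)

# One result per key; adding a key to key_to_columns adds a result automatically.
results = {key: group_sum(rows, value.split(",")) for key, value in key_to_columns.items()}
for key, totals in results.items():
    print(key, totals)
```

Since the grouping columns are read from the mapping at run time, supporting a new key only requires a new entry in the mapping, which is the behaviour the question asks for.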
I have the following problem: I want to add a column RealCity to DataFrame A. When the City value is 'noClue', I want to look up the city in DataFrame B by Key.
Table A:
+---------+--------+
| Key | City|
+---------+--------+
|a | PDX |
+---------+--------+
|b | noClue |
Table B:
+---------+--------+
| Key | Name |
+---------+--------+
|c | SYD |
+---------+--------+
|b | AKL |
I want to use .withColumn and when, but that way I can't select a value from another table (table B). What's a good way of doing this? Many thanks!
Given that you have two dataframes
A:
+---+------+
|key|City |
+---+------+
|a |PDX |
|b |noClue|
+---+------+
B:
+---+----+
|key|Name|
+---+----+
|a |SYD |
|b |AKL |
+---+----+
You can simply join them on the common Key column and then use withColumn with the when function:
val finalDF = A.join(B, Seq("Key"), "left").withColumn("RealCity", when($"City" === "noClue", $"Name").otherwise($"City")).drop("Name")
You should get the following final output:
+---+------+--------+
|key|City |RealCity|
+---+------+--------+
|a |PDX |PDX |
|b |noClue|AKL |
+---+------+--------+
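The left join followed by when/otherwise can be read as "look up Name by key, and use it only when City is noClue". A plain-Python model of that logic (not Spark code; `name_by_key` and `final_rows` are names made up for this sketch, with `None` standing in for null):

```python
# (key, City) rows of A and a key -> Name lookup built from B.
a_rows = [("a", "PDX"), ("b", "noClue")]
name_by_key = {"a": "SYD", "b": "AKL"}

final_rows = []
for key, city in a_rows:
    # Left join: a key missing from B yields None rather than dropping the row.
    name = name_by_key.get(key)
    # when(City == "noClue", Name).otherwise(City)
    real_city = name if city == "noClue" else city
    final_rows.append((key, city, real_city))

for row in final_rows:
    print(row)
```

Note that when City is noClue and the key is missing from B, RealCity comes out null, because the left join leaves Name null for that row.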