PySpark: check whether each name has 3 action rows

In PySpark, I have a DataFrame as follows. I want to check whether each name has all 3 action values (0, 1, 2). If any are missing, add a new row with the score column set to 0 and the other columns unchanged (e.g. str1, str2, str3).
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
+-----+--------+--------+--------+-------+-------+
For example, name B has no action 2, so a new row is added as follows:
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|<---- new row data
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
+-----+--------+--------+--------+-------+-------+
It is also possible that a name has only one row, in which case two new rows need to be added.
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
| D | str_D1 | str_D2 | str_D3 | 0| 45|
+-----+--------+--------+--------+-------+-------+
The expected output is:
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
| D | str_D1 | str_D2 | str_D3 | 0| 45|
| D | str_D1 | str_D2 | str_D3 | 1| 0|<---- new row data
| D | str_D1 | str_D2 | str_D3 | 2| 0|<---- new row data
+-----+--------+--------+--------+-------+-------+
I am new to pyspark and don't know how to do this operation.
Thank you for your help.
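For readers who want to reproduce the examples, the sample DataFrame from the question can be built like this (a sketch; the action and score columns are assumed to be integers):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("A", "str_A1", "str_A2", "str_A3", 0, 2),
    ("A", "str_A1", "str_A2", "str_A3", 1, 6),
    ("A", "str_A1", "str_A2", "str_A3", 2, 74),
    ("B", "str_B1", "str_B2", "str_B3", 0, 59),
    ("B", "str_B1", "str_B2", "str_B3", 1, 18),
    ("C", "str_C1", "str_C2", "str_C3", 0, 3),
    ("C", "str_C1", "str_C2", "str_C3", 1, 33),
    ("C", "str_C1", "str_C2", "str_C3", 2, 3),
]
df = spark.createDataFrame(data, ["name", "str1", "str2", "str3", "action", "score"])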

Solution with a UDF
from pyspark.sql import functions as F, types as T

@F.udf(T.MapType(T.IntegerType(), T.IntegerType()))
def add_missing_values(values):
    # make sure every action 0, 1, 2 has an entry; missing ones default to score 0
    return {i: values.get(i, 0) for i in range(3)}

df = (
    df.groupBy("name", "str1", "str2", "str3")
    .agg(
        F.map_from_entries(F.collect_list(F.struct("action", "score"))).alias("values")
    )
    .withColumn("values", add_missing_values(F.col("values")))
    .select(
        "name", "str1", "str2", "str3", F.explode("values").alias("action", "score")
    )
)
df.show()
+----+------+------+------+------+-----+
|name| str1| str2| str3|action|score|
+----+------+------+------+------+-----+
| A|str_A1|str_A2|str_A3| 0| 2|
| A|str_A1|str_A2|str_A3| 1| 6|
| A|str_A1|str_A2|str_A3| 2| 74|
| B|str_B1|str_B2|str_B3| 0| 59|
| B|str_B1|str_B2|str_B3| 1| 18|
| B|str_B1|str_B2|str_B3| 2| 0|<---- new row data
| C|str_C1|str_C2|str_C3| 0| 3|
| C|str_C1|str_C2|str_C3| 1| 33|
| C|str_C1|str_C2|str_C3| 2| 3|
| D|str_D1|str_D2|str_D3| 0| 45|
| D|str_D1|str_D2|str_D3| 1| 0|<---- new row data
| D|str_D1|str_D2|str_D3| 2| 0|<---- new row data
+----+------+------+------+------+-----+
Full Spark solution (no Python UDF):
df = (
    df.groupBy("name", "str1", "str2", "str3")
    .agg(
        F.map_from_entries(F.collect_list(F.struct("action", "score"))).alias("values")
    )
    .withColumn(
        "values",
        F.map_from_arrays(
            F.array([F.lit(i) for i in range(3)]),
            F.array(
                [F.coalesce(F.col("values").getItem(i), F.lit(0)) for i in range(3)]
            ),
        ),
    )
    .select(
        "name", "str1", "str2", "str3", F.explode("values").alias("action", "score")
    )
)
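If you prefer joins over maps, an equivalent approach is to build every (name, action) combination with a cross join against a small DataFrame holding the three actions, then left-join the original data back and fill the missing scores with 0. This is only a sketch, assuming spark is the active SparkSession and df is the original DataFrame:
from pyspark.sql import functions as F

# all three expected actions as a one-column DataFrame
actions = spark.createDataFrame([(i,) for i in range(3)], ["action"])

result = (
    df.select("name", "str1", "str2", "str3").distinct()  # one row per name
    .crossJoin(actions)                                    # every (name, action) pair
    .join(df, ["name", "str1", "str2", "str3", "action"], "left")
    .fillna(0, subset=["score"])                           # missing actions get score 0
)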

Related

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add one to all rows of the other one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct().crossJoin(df2).join(
  df1,
  Seq("name", "id"),
  "left"
).orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
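For readers coming from the main PySpark question, roughly the same cross join + left join can be written in PySpark; this is only a sketch, assuming two DataFrames named df1 (id, name, rate) and df2 (name):
result = (
    df1.select("id").distinct()
    .crossJoin(df2)                     # every (id, name) combination
    .join(df1, ["name", "id"], "left")  # keep rate where it exists, null otherwise
    .orderBy("id", "name")
)
result.show()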

Spark Dataframe ORDER BY giving mixed combination (asc + desc)

I have a Dataframe whose column I want to sort in descending order when the count value is greater than 10.
But I'm getting a mixed combination: ascending for a couple of records, then descending, then ascending again, and so on.
I'm using the orderBy() function, which sorts records in ascending order by default.
Since I'm new to Scala and Spark, I can't see the reason why I'm getting this.
df.groupBy("Value").count().filter("count>5.0").orderBy("Value").show(1000);
Reading the CSV:
val df = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/test.csv")
  .toDF("Country_Code", "Country", "Data_Source", "Data_File", "Category", "Metric", "Time", "Data_Cut1", "Option1_Dummy", "Option1_Visible", "Value")
These are the records I'm getting by executing the above code:
+-------+-----+
| Value|count|
+-------+-----+
| 0| 225|
| 0.01| 12|
| 0.02| 13|
| 0.03| 12|
| 0.04| 15|
| 0.05| 9|
| 0.06| 11|
| 0.07| 9|
| 0.08| 6|
| 0.09| 10|
| 0.1| 66|
| 0.11| 12|
| 0.12| 9|
| 0.13| 12|
| 0.14| 8|
| 0.15| 10|
| 0.16| 14|
| 0.17| 11|
| 0.18| 14|
| 0.19| 21|
| 0.2| 78|
| 0.21| 16|
| 0.22| 15|
| 0.23| 13|
| 0.24| 7|
| 0.3| 85|
| 0.31| 7|
| 0.34| 8|
| 0.4| 71|
| 0.5| 103|
| 0.6| 102|
| 0.61| 6|
| 0.62| 9|
| 0.69| 7|
| 0.7| 98|
| 0.72| 6|
| 0.74| 8|
| 0.78| 7|
| 0.8| 71|
| 0.81| 10|
| 0.82| 9|
| 0.83| 8|
| 0.84| 6|
| 0.86| 8|
| 0.87| 10|
| 0.88| 12|
| 0.9| 95|
| 0.91| 9|
| 0.93| 6|
| 0.94| 6|
| 0.95| 8|
| 0.98| 8|
| 0.99| 6|
| 1| 254|
| 1.08| 8|
| 1.1| 80|
| 1.11| 6|
| 1.15| 9|
| 1.17| 7|
| 1.18| 6|
| 1.19| 9|
| 1.2| 94|
| 1.25| 7|
| 1.3| 91|
| 1.32| 8|
| 1.4| 215|
| 1.45| 7|
| 1.5| 320|
| 1.56| 6|
| 1.6| 280|
| 1.64| 6|
| 1.66| 10|
| 1.7| 310|
| 1.72| 7|
| 1.74| 6|
| 1.8| 253|
| 1.9| 117|
| 10| 78|
| 10.1| 45|
| 10.2| 49|
| 10.3| 30|
| 10.4| 40|
| 10.5| 38|
| 10.6| 52|
| 10.7| 35|
| 10.8| 39|
| 10.9| 42|
| 10.96| 7|------------mark
| 100| 200|
| 101.3| 7|
| 101.8| 8|
| 102| 6|
| 102.2| 6|
| 102.7| 8|
| 103.2| 6|--------------here
| 11| 93|
| 11.1| 32|
| 11.2| 38|
| 11.21| 6|
| 11.3| 42|
| 11.4| 32|
| 11.5| 34|
| 11.6| 38|
| 11.69| 6|
| 11.7| 42|
| 11.8| 25|
| 11.86| 6|
| 11.9| 39|
| 11.96| 9|
| 12| 108|
| 12.07| 7|
| 12.1| 31|
| 12.11| 6|
| 12.2| 34|
| 12.3| 28|
| 12.39| 6|
| 12.4| 32|
| 12.5| 31|
| 12.54| 7|
| 12.57| 6|
| 12.6| 18|
| 12.7| 33|
| 12.8| 20|
| 12.9| 21|
| 13| 85|
| 13.1| 25|
| 13.2| 19|
| 13.3| 30|
| 13.34| 6|
| 13.4| 32|
| 13.5| 16|
| 13.6| 15|
| 13.7| 31|
| 13.8| 8|
| 13.83| 7|
| 13.89| 7|
| 14| 46|
| 14.1| 10|
| 14.3| 10|
| 14.4| 7|
| 14.5| 15|
| 14.7| 6|
| 14.9| 11|
| 15| 52|
| 15.2| 6|
| 15.3| 9|
| 15.4| 12|
| 15.5| 21|
| 15.6| 11|
| 15.7| 14|
| 15.8| 18|
| 15.9| 18|
| 16| 44|
| 16.1| 30|
| 16.2| 26|
| 16.3| 29|
| 16.4| 26|
| 16.5| 32|
| 16.6| 42|
| 16.7| 44|
| 16.72| 6|
| 16.8| 40|
| 16.9| 54|
| 17| 58|
| 17.1| 48|
| 17.2| 51|
| 17.3| 47|
| 17.4| 57|
| 17.5| 51|
| 17.6| 51|
| 17.7| 46|
| 17.8| 33|
| 17.9| 38|---------again
|1732.04| 6|
| 18| 49|
| 18.1| 21|
| 18.2| 23|
| 18.3| 29|
| 18.4| 22|
| 18.5| 22|
| 18.6| 17|
| 18.7| 13|
| 18.8| 13|
| 18.9| 19|
| 19| 36|
| 19.1| 15|
| 19.2| 13|
| 19.3| 12|
| 19.4| 15|
| 19.5| 15|
| 19.6| 15|
| 19.7| 15|
| 19.8| 14|
| 19.9| 9|
| 2| 198|------------see after 19 again 2 came
| 2.04| 7|
| 2.09| 8|
| 2.1| 47|
| 2.16| 6|
| 2.17| 8|
| 2.2| 55|
| 2.24| 6|
| 2.26| 7|
| 2.27| 6|
| 2.29| 8|
| 2.3| 53|
| 2.4| 33|
| 2.5| 36|
| 2.54| 6|
| 2.59| 6|
Can you tell me what I'm doing wrong?
My dataframe has the columns
"Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value"
As we talked about in the comments, it seems your Value column is of type String. You can cast it to Double (for instance) to order it numerically.
These lines will cast the Value column to DoubleType:
import org.apache.spark.sql.types._
df.withColumn("Value", $"Value".cast(DoubleType))
EXAMPLE INPUT
df.show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
With Value as Strings
df.orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
Casting Value as Double
df.withColumn("Value", $"Value".cast(DoubleType)).orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 2.0| a|
| 10.0| b|
+-----+-------+
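The same cast-before-sort idea in PySpark, as a rough sketch using the column names from the question:
from pyspark.sql import functions as F

df.withColumn("Value", F.col("Value").cast("double")) \
    .groupBy("Value").count() \
    .filter("count > 5") \
    .orderBy("Value") \
    .show(1000)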

Create a Vertical Table in Spark 2 [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 4 years ago.
How do I create a vertical table in Spark 2 SQL?
I am building an ETL using Spark 2 / SQL / Scala. I have data in a normal table structure like:
Input Table:
| ID | A | B | C | D |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
Output Table:
| ID | Key | Val |
| 1 | A | A1 |
| 1 | B | B1 |
| 1 | C | C1 |
| 1 | D | D1 |
| 2 | A | A2 |
| 2 | B | B2 |
| 2 | C | C2 |
| 2 | D | D2 |
This could do the trick as well:
Input Data:
+---+---+---+---+---+
|ID |A |B |C |D |
+---+---+---+---+---+
|1 |A1 |B1 |C1 |D1 |
|2 |A2 |B2 |C2 |D2 |
|3 |A3 |B3 |C3 |D3 |
+---+---+---+---+---+
Zip the column headers with their column positions (the number of columns to be included):
val cols = Seq("A", "B", "C", "D") zip Range(0, 4, 1)
df.flatMap(r => cols.map(i => (r.getString(0), i._1, r.getString(i._2 + 1))))
  .toDF("ID", "KEY", "VALUE")
  .show()
Result should look like this:
+---+---+-----+
| ID|KEY|VALUE|
+---+---+-----+
| 1| A| A1|
| 1| B| B1|
| 1| C| C1|
| 1| D| D1|
| 2| A| A2|
| 2| B| B2|
| 2| C| C2|
| 2| D| D2|
| 3| A| A3|
| 3| B| B3|
| 3| C| C3|
| 3| D| D3|
+---+---+-----+
Good Luck!!
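On the PySpark side, the built-in stack generator gives the same unpivot without a flatMap; a minimal sketch assuming the same ID, A, B, C, D columns:
from pyspark.sql import functions as F

melted = df.select(
    "ID",
    F.expr("stack(4, 'A', A, 'B', B, 'C', C, 'D', D) as (Key, Val)"),
)
melted.show()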

Pyspark Join Tables

I'm new to PySpark. I have 'Table A' and 'Table B' and I need to join them to get 'Table C'. Can anyone help me please?
I'm using DataFrames...
I don't know how to join those tables together in the right way...
Table A:
+--+----------+-----+
|id|year_month| qt |
+--+----------+-----+
| 1| 2015-05| 190 |
| 2| 2015-06| 390 |
+--+----------+-----+
Table B:
+---------+-----+
|year_month| sem |
+---------+-----+
| 2016-01| 1 |
| 2015-02| 1 |
| 2015-03| 1 |
| 2016-04| 1 |
| 2015-05| 1 |
| 2015-06| 1 |
| 2016-07| 2 |
| 2015-08| 2 |
| 2015-09| 2 |
| 2016-10| 2 |
| 2015-11| 2 |
| 2015-12| 2 |
+---------+-----+
Table C:
The join adds columns and also adds rows...
+--+----------+-----+-----+
|id|year_month| qt | sem |
+--+----------+-----+-----+
| 1| 2015-05 | 0 | 1 |
| 1| 2016-01 | 0 | 1 |
| 1| 2015-02 | 0 | 1 |
| 1| 2015-03 | 0 | 1 |
| 1| 2016-04 | 0 | 1 |
| 1| 2015-05 | 190 | 1 |
| 1| 2015-06 | 0 | 1 |
| 1| 2016-07 | 0 | 2 |
| 1| 2015-08 | 0 | 2 |
| 1| 2015-09 | 0 | 2 |
| 1| 2016-10 | 0 | 2 |
| 1| 2015-11 | 0 | 2 |
| 1| 2015-12 | 0 | 2 |
| 2| 2015-05 | 0 | 1 |
| 2| 2016-01 | 0 | 1 |
| 2| 2015-02 | 0 | 1 |
| 2| 2015-03 | 0 | 1 |
| 2| 2016-04 | 0 | 1 |
| 2| 2015-05 | 0 | 1 |
| 2| 2015-06 | 390 | 1 |
| 2| 2016-07 | 0 | 2 |
| 2| 2015-08 | 0 | 2 |
| 2| 2015-09 | 0 | 2 |
| 2| 2016-10 | 0 | 2 |
| 2| 2015-11 | 0 | 2 |
| 2| 2015-12 | 0 | 2 |
+--+----------+-----+-----+
Code:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

lA = [(1, "2015-05", 190), (2, "2015-06", 390)]
tableA = sqlContext.createDataFrame(lA, ["id", "year_month", "qt"])
tableA.show()

lB = [("2016-01", 1), ("2015-02", 1), ("2015-03", 1), ("2016-04", 1),
      ("2015-05", 1), ("2015-06", 1), ("2016-07", 2), ("2015-08", 2),
      ("2015-09", 2), ("2016-10", 2), ("2015-11", 2), ("2015-12", 2)]
tableB = sqlContext.createDataFrame(lB, ["year_month", "sem"])
tableB.show()
It's not really a join, more of a Cartesian product (cross join).
Spark 2
import pyspark.sql.functions as psf
tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
Spark 1.6
tableA.join(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
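An alternative sketch that avoids the when/otherwise logic: cross join the distinct ids with tableB, then left-join tableA back and fill the missing quantities with 0 (assuming the tableA and tableB DataFrames defined above):
result = (
    tableA.select("id").distinct()
    .crossJoin(tableB)                           # every (id, year_month, sem) combination
    .join(tableA, ["id", "year_month"], "left")  # qt where it exists, null otherwise
    .fillna(0, subset=["qt"])
)
result.orderBy("id", "year_month").show()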

Filter on data which are numeric

Hi, I have a dataframe with a column CODEARTICLE. Here is the dataframe:
|CODEARTICLE| STRUCTURE| DES|TYPEMARK|TYP|IMPLOC|MARQUE|GAMME|TAR|
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
| GENCFFRIST|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFMARC|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFESCO|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFTNA|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFEMBA|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| 789600010|9999999999998|xxxxxxxxxxxxxxxxx...| 7| 1| Local| | | |
| 799700040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 799701000|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 899980490|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 9| Local| | | |
| 429600010|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 559970040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| 679500010|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 679500040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 679500060|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
I would like to keep only the rows having a numeric CODEARTICLE.
//connect to table TMP_STRUCTURE oracle
val spark = sparkSession.sqlContext
val articles_Gold = spark.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:System/maher#//localhost:1521/XE",
      "dbtable" -> "IPTECH.TMP_ARTICLE"))
  .select("CODEARTICLE", "STRUCTURE", "DES", "TYPEMARK", "TYP", "IMPLOC", "MARQUE", "GAMME", "TAR")
val filteredData = articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).filter($"test" !== null)
Thank you a lot.
Use na.drop:
articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).na.drop(Seq("test"))
You can use the .isNotNull function on the column in your filter. You don't even need to create another column for your logic. You can simply do the following:
val filteredData = articles_Gold.withColumn("CODEARTICLE",'CODEARTICLE.cast(IntegerType)).filter('CODEARTICLE.isNotNull)
I hope the answer is helpful
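For reference, a PySpark version of the same idea, as a sketch assuming the articles_Gold DataFrame from the question:
from pyspark.sql import functions as F

# rows whose CODEARTICLE cannot be cast to an integer become null and are filtered out
filtered = articles_Gold.filter(F.col("CODEARTICLE").cast("int").isNotNull())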