I'm trying to accomplish the following in PySpark. A sample source is provided below; the actual source will contain many more records.
Source:
Expected output:
You could use the stack function:
Setup of your example data:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(COLA='H', COLB='I', COLC='J',
COL_GRP_A_1=0.1, COL_GRP_A_2=1., COL_GRP_A_3=3.,
COL_GRP_B_1=4., COL_GRP_B_2=2.5, COL_GRP_B_3=6.,
COL_GRP_C_1=2., COL_GRP_C_2=5., COL_GRP_C_3=4.,
),
])
df.show()
# Output
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|COLA|COLB|COLC|COL_GRP_A_1|COL_GRP_A_2|COL_GRP_A_3|COL_GRP_B_1|COL_GRP_B_2|COL_GRP_B_3|COL_GRP_C_1|COL_GRP_C_2|COL_GRP_C_3|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| H| I| J| 0.1| 1.0| 3.0| 4.0| 2.5| 6.0| 2.0| 5.0| 4.0|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Now the processing:
(
    df
    .selectExpr(
        'COLA', 'COLB', 'COLC',
        'stack(3, '
        '"COL_GRP_A", COL_GRP_A_1, COL_GRP_A_2, COL_GRP_A_3, '
        '"COL_GRP_B", COL_GRP_B_1, COL_GRP_B_2, COL_GRP_B_3, '
        '"COL_GRP_C", COL_GRP_C_1, COL_GRP_C_2, COL_GRP_C_3'
        ') AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'
    )
    .show()
)
# Output:
+----+----+----+---------+--------+--------+--------+
|COLA|COLB|COLC| GRP|COL_VAL1|COL_VAL2|COL_VAL3|
+----+----+----+---------+--------+--------+--------+
| H| I| J|COL_GRP_A| 0.1| 1.0| 3.0|
| H| I| J|COL_GRP_B| 4.0| 2.5| 6.0|
| H| I| J|COL_GRP_C| 2.0| 5.0| 4.0|
+----+----+----+---------+--------+--------+--------+
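If more groups are added later, you could also build the stack expression dynamically from the column names. A minimal sketch (not part of the original answer), assuming the COL_GRP_<group>_<n> naming convention from the example and three values per group:
import re

# Minimal sketch: derive the group prefixes from the column names
# (assumes the COL_GRP_<group>_<n> convention with 3 values per group)
grp_cols = [c for c in df.columns if c.startswith('COL_GRP_')]
groups = sorted({re.match(r'(COL_GRP_[A-Z]+)_\d+$', c).group(1) for c in grp_cols})

stack_args = ', '.join(f'"{g}", {g}_1, {g}_2, {g}_3' for g in groups)
expr = f'stack({len(groups)}, {stack_args}) AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'

df.selectExpr('COLA', 'COLB', 'COLC', expr).show()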
I have a data frame like the one below:
data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)
]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()
I would like the 'min' columns to have their null values replaced by the values from their equivalent 'max' columns.
For example, null values in the A_min column should be replaced by the corresponding A_max values.
The result should look like the data frame below.
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
I have tried the code below by defining the columns, but clearly this does not work. I would really appreciate any help.
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
for i in min_cols
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)
Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:
from pyspark.sql.functions import coalesce
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
df.select(
    *[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)],
    *max_cols
).show()
Output:
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
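For comparison, here is a sketch of the when/otherwise approach the question attempted, fixed to test for nulls (rather than empty strings) and to reference the matching max column; it produces the same result as the coalesce version:
from pyspark.sql import functions as f

min_cols = ["A_min", "B_min", "C_min"]
max_cols = ["A_max", "B_max", "C_max"]

df2 = df
for min_c, max_c in zip(min_cols, max_cols):
    # Replace nulls in each min column with the value from its matching max column
    df2 = df2.withColumn(
        min_c,
        f.when(f.col(min_c).isNull(), f.col(max_c)).otherwise(f.col(min_c))
    )
df2.show()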
I have a matrix of size 1000×10000. I want to convert this matrix into a PySpark dataframe.
Can someone please tell me how to do it? This post has an example, but my number of columns is large, so assigning column names manually would be difficult.
Thanks!
In order to create a PySpark DataFrame, you can use the function createDataFrame():
matrix=([11,12,13,14,15],[21,22,23,24,25],[31,32,33,34,35],[41,42,43,44,45])
df=spark.createDataFrame(matrix)
df.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+---+---+---+---+---+
As you can see above, the columns will be named automatically with numbers.
You can also pass your own column names to the createDataFrame() function:
columns=[ 'mycol_'+str(col) for col in range(5) ]
df=spark.createDataFrame(matrix,schema=columns)
df.show()
+-------+-------+-------+-------+-------+
|mycol_0|mycol_1|mycol_2|mycol_3|mycol_4|
+-------+-------+-------+-------+-------+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+-------+-------+-------+-------+-------+
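If your 1000×10000 matrix lives in a NumPy array (an assumption; the question does not say what holds the matrix), you can generate the column names programmatically and convert the rows to plain Python lists, for example:
import numpy as np

# Assumption: the matrix is a NumPy array; adjust to your actual data source
matrix_large = np.random.rand(1000, 10000)
columns = ['mycol_' + str(i) for i in range(matrix_large.shape[1])]

# tolist() turns NumPy scalars into plain Python floats, which createDataFrame accepts
df_large = spark.createDataFrame(matrix_large.tolist(), schema=columns)
Note that Spark is not optimised for tens of thousands of columns, so expect the conversion and subsequent operations on such a wide dataframe to be slow.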
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
prsn = (
    sc.read.format("csv")
    .option("delimiter", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("C:/Users/amit.suar/IdeaProjects/LearningPyspark/prsn.csv")
)
prsn.show()
+-------------------------+------------------------+---+-----------+-----------------------------+
|PERSON_MEDIA_CONSUMER_KEY|PERSON_MEDIA_CONSUMER_ID|AGE|GENDER_CODE|EDUCATION_LEVEL_CATEGORY_CODE|
+-------------------------+------------------------+---+-----------+-----------------------------+
| 101| 3285854| 15| 1| 1|
| 102| 2313090| 25| 1| 3|
| 103| 2295854| 33| 2| 6|
| 104| 2295854| 33| 2| 6|
| 105| 2471554| 26| 2| 4|
| 106| 2471554| 26| 2| 4|
+-------------------------+------------------------+---+-----------+-----------------------------+
I want to capture this output as a string in a variable. How can I achieve it?
There is an internal/private function that returns the same string that .show() prints:
# Returns the first n rows of the dataframe as a formatted table string
# (n defaults to 20 in .show(); the second argument is the truncate length)
dataframe._jdf.showString(n, 20)
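For example, to capture what prsn.show() would print (a sketch; note that the exact signature of showString depends on your Spark version):
# Recent Spark versions: showString(numRows, truncate, vertical)
# Older 2.x versions: showString(numRows, truncate)
output_str = prsn._jdf.showString(20, 20, False)
print(output_str)
Since this relies on a private attribute (_jdf), it may change between releases without notice.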
Hello, I have a dataframe that needs to be updated based on another dataframe. There are fields that we are going to sum, and others where we just take the new value provided by the second dataframe. Here is what I did:
import org.apache.spark.sql.types._
import spark.implicits._   // for the 'colName symbol syntax used in the casts

val hist1 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("qte", 'qte.cast(LongType))
.withColumn("ca", 'ca.cast(DoubleType))
hist1.show
val hist2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/his2.csv")
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(LongType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show
val df3 = hist1.unionAll(hist2)
//
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))
df4.show
+------+----------+----------+---+----+----------+
|pos_id|article_id| date|qte| ca|sale_price|
+------+----------+----------+---+----+----------+
| 1| 1|2000-01-07| 3| 3.5| 14.3|
| 2| 2|2000-01-07| 15|12.0| 13.2|
| 3| 2|2000-01-07| 4| 1.2| 14.3|
| 4| 2|2000-01-07| 4| 1.2| 12.3|
+------+----------+----------+---+----+----------+
+------+----------+----------+---+----+----------+
|pos_id|article_id| date|qte| ca|sale_price|
+------+----------+----------+---+----+----------+
| 1| 1|2000-01-08| 3| 3.5| 14.5|
| 2| 2|2000-01-08| 15|12.0| 20.2|
| 3| 2|2000-01-08| 4| 1.2| 17.5|
| 4| 2|2000-01-08| 4| 1.2| 18.2|
| 5| 3|2000-01-08| 15| 1.2| 11.2|
| 6| 1|2000-01-08| 2|1.25| 13.5|
| 6| 2|2000-01-08| 2|1.25| 14.3|
+------+----------+----------+---+----+----------+
+------+----------+----------+--------+-------+
|pos_id|article_id| max(date)|sum(qte)|sum(ca)|
+------+----------+----------+--------+-------+
| 2| 2|2000-01-08| 30| 24.0|
| 3| 2|2000-01-08| 8| 2.4|
| 1| 1|2000-01-08| 6| 7.0|
| 5| 3|2000-01-08| 15| 1.2|
| 6| 1|2000-01-08| 2| 1.25|
| 6| 2|2000-01-08| 2| 1.25|
| 4| 2|2000-01-08| 8| 2.4|
+------+----------+----------+--------+-------+
How should the query be written if I also want to keep the field sale_price, taking the new sale_price provided by the second dataframe? That is, how should this line change:
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))
Many thanks in advance
You can use a join on the last line, as follows:
val df4 = df3
  .groupBy("pos_id", "article_id")
  .agg(max("date"), sum("qte"), sum("ca"))
  .join(hist2.select("pos_id", "article_id", "sale_price"), Seq("pos_id", "article_id"))
This should give you your desired output.
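For readers working in PySpark rather than Scala, a rough equivalent sketch (dataframe and column names taken from the example above; it assumes hist1 and hist2 have been loaded the same way in Python):
from pyspark.sql import functions as F

# Rough PySpark equivalent of the Scala answer above (a sketch, not the original code)
df3 = hist1.union(hist2)
df4 = (
    df3.groupBy("pos_id", "article_id")
       .agg(F.max("date").alias("date"), F.sum("qte").alias("qte"), F.sum("ca").alias("ca"))
       .join(hist2.select("pos_id", "article_id", "sale_price"), ["pos_id", "article_id"])
)
df4.show()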
I have data in an RDD which has 4 columns: geog, product, time and price. I want to calculate the running sum based on geog and time.
Given Data
I need a result like this:
I need this in Spark Scala RDD. I am new to the Scala world; I can achieve this easily in SQL, but I want to do it in Spark Scala RDD, e.g. using map/flatMap.
Thanks in advance for your help.
This is possible by defining a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._   // needed for .toDF outside the spark-shell

val data = List(
  ("India", "A1", "Q1", 40),
  ("India", "A2", "Q1", 30),
  ("India", "A3", "Q1", 21),
  ("German", "A1", "Q1", 50),
  ("German", "A3", "Q1", 60),
  ("US", "A1", "Q1", 60),
  ("US", "A2", "Q2", 25),
  ("US", "A4", "Q1", 20),
  ("US", "A5", "Q5", 15),
  ("US", "A3", "Q3", 10)
)

val df = sc.parallelize(data).toDF("country", "part", "quarter", "result")
df.show()
+-------+----+-------+------+
|country|part|quarter|result|
+-------+----+-------+------+
| India| A1| Q1| 40|
| India| A2| Q1| 30|
| India| A3| Q1| 21|
| German| A1| Q1| 50|
| German| A3| Q1| 60|
| US| A1| Q1| 60|
| US| A2| Q2| 25|
| US| A4| Q1| 20|
| US| A5| Q5| 15|
| US| A3| Q3| 10|
+-------+----+-------+------+
val window = Window.partitionBy("country").orderBy("part", "quarter")
val resultDF = df.withColumn("agg", sum(df("result")).over(window))
resultDF.show()
+-------+----+-------+------+---+
|country|part|quarter|result|agg|
+-------+----+-------+------+---+
| India| A1| Q1| 40| 40|
| India| A2| Q1| 30| 70|
| India| A3| Q1| 21| 91|
| US| A1| Q1| 60| 60|
| US| A2| Q2| 25| 85|
| US| A3| Q3| 10| 95|
| US| A4| Q1| 20|115|
| US| A5| Q5| 15|130|
| German| A1| Q1| 50| 50|
| German| A3| Q1| 60|110|
+-------+----+-------+------+---+
You can do this using Window functions; please take a look at the Databricks blog post introducing window functions in Spark SQL:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
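For PySpark users, the same running sum can be written like this (a minimal sketch, assuming the country/part/quarter/result columns from the example above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running sum of `result` per country, ordered by part and quarter
w = Window.partitionBy("country").orderBy("part", "quarter")
result_df = df.withColumn("agg", F.sum("result").over(w))
result_df.show()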
Hope this helps.
Happy Sparking! Cheers, Fokko
I think this will help others too. I tried it with a Scala RDD:
val fileName_test_1 = "C:\\venkat_workshop\\Qintel\\Data_Files\\test_1.txt"

val rdd1 = sc.textFile(fileName_test_1)
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(1), cols(2), cols(3).toDouble)
  }
  .groupBy(x => (x._1, x._3))                   // group by the 1st and 3rd fields (geog, time)
  .mapValues {
    _.toList
      .sortWith((a, b) => a._4 > b._4)          // sort by price, descending
      .scanLeft(("", "", "", 0.0, 0.0)) {       // the 5th element carries the running sum
        (acc, cur) => (cur._1, cur._2, cur._3, cur._4, cur._4 + acc._5)
      }
      .tail                                     // drop the initial accumulator
  }
  .flatMapValues(f => f)
  .values