PySpark: How can I join one more column to a dataFrame? - pyspark

I'm work on a dataframe with two inicial columns, id and colA.
+---+-----+
|id |colA |
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |
+---+-----+
I need to merge that dataFrame to another column more, colB. I know that colB fits perfectly at the end of the dataFrame, I just need some way to join it all together.
+-----+
|colB |
+-----+
| 8 |
| 7 |
| 0 |
| 6 |
+-----+
In result of these, I need to obtain a new dataFrame like that below:
+---+-----+-----+
|id |colA |colB |
+---+-----+-----+
| 1 | 5 | 8 |
| 2 | 9 | 7 |
| 3 | 3 | 0 |
| 4 | 1 | 6 |
+---+-----+-----+
This is the pyspark code to obtain the first DataFrame:
l=[(1,5),(2,9), (3,3), (4,1)]
names=["id","colA"]
db=sqlContext.createDataFrame(l,names)
db.show()
How can I do it? Could anyone help me, please? Thanks

I've done! I've solved it by adding a temporary column with the indices of the rows and then I delete it.
code:
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
l=[(1,5),(2,9), (3,3), (4,1)]
names=["id","colA"]
db=sqlContext.createDataFrame(l,names)
db.show()
l=[5,9,3,1]
rdd = sc.parallelize(l).map(lambda x: Row(x))
test_df = rdd.toDF()
test_df2 = test_df.selectExpr("_1 as colB")
dbB = test_df2.select("colB")
db= db.withColum("columnindex", rowNumber().over(w))
dbB = dbB.withColum("columnindex", rowNumber().over(w))
testdf_out = db.join(dbB, db.columnindex == dbB.columnindex. 'inner').drop(db.columnindex).drop(dbB.columnindex)
testdf_out.show()

Related

Mean with differents columns ignoring Null values, Spark Scala

I have a dataframe with different columns, what I am trying to do is the mean of this diff columns ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output have to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get the null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+

How to combine pyspark dataframes with different shapes and different columns

I have two dataframes in Pyspark. One has more than 1000 rows and the other only 4 rows. The columns also are not matching.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.

Scala - Spark - How can I get a new dataframe with distinct values of a dataframe column and the first date of this distinct values?

I have a Spark Dataframe with the following schema:
________________________
|id | no | date |
|1 | 123 |2018/10/01 |
|2 | 124 |2018/10/01 |
|3 | 123 |2018/09/28 |
|4 | 123 |2018/09/27 |
...
What I want is to have a new DataFrame with the following data:
___________________
| no | date |
| 123 |2018/09/27 |
| 124 |2018/10/01 |
Can someone help me on this?:) Thank you!!
You can resolve it by using the rank (https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html) on dataframe with spark sql:
use registerTempTable on sparkContext such as df_temp_table
Make this query:
select dftt.*,
dense_rank() OVER ( PARTITION BY dftt.no ORDER BY dftt.date DESC) AS Rank from
df_temp_table as dftt
you will get this dataframe:
|id | no | date | rank
|1 | 123 |2018/10/01 | 1
|2 | 124 |2018/10/01 | 1
|3 | 123 |2018/09/28 | 2
|4 | 123 |2018/09/27 | 3
on this df you can now filter the rank column by 1
Welcome,
you can try below Code :
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"no").orderBy($"date".asc)
val Resultdf = df.withColumn("rownum", row_number.over(w))
.where($"rownum" === 1).drop("rownum","id")
Resultdf.show()
Output:
+---+----------+
| no| date|
+---+----------+
|124|2018/10/01|
|123|2018/09/27|
+---+----------+

How filter one big dataframe many times(equal to small df‘s row count) by another small dataframe(row by row) ?

I have two spark dataframe,dfA and dfB.
I want to filter dfA by dfB's each row, which means if dfB have 10000 rows, i need to filter dfA 10000 times with 10000 different filter conditions generated by dfB. Then, after each filter i need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows,nearly one billons rows
so my expected result is
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solutions is
def tempFunction(id:Int,dfA:DataFrame,dfB:DataFrame): DataFrame ={
val dfa = dfA.filter("id ="+ id)
val dfb = dfB.filter("id ="+ id)
val arr = dfb.groupBy("id")
.agg(collect_list(struct("min_value1","max_value1"))
.collect()
val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get range array of id
// initial a resultDF to store each query's results
val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val s = "value1 between "+min_value1+" and "+ max_value1
var resultDF = dfa.filter(s).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
for( i <-1 to timePairArr.length-1){
val temp_min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val temp_max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val query = "value1 between "+temp_min_value1+" and "+ temp_max_value1
val tempResultDF = dfa.filter(query).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
resultDF = resultDF.union(tempResultDF)
}
return resultDF
}
def myFunction():DataFrame = {
val dfA = spark.read.parquet(routeA)
val dfB = spark.read.parquet(routeB)
val idArrays = dfB.select("id").distinct().collect()
// initial result
var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int],dfA,dfB)
//tranverse all id
for(i<-1 to idArrays.length-1){
val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int],dfA,dfB)
resultDF = resultDF.union(tempDF)
}
return resultDF
}
Maybe you don't want to see my brute force code.it's idea is
finalResult = null;
for each id in dfB:
for query condition of this id:
tempResult = query dfA
union tempResult to finalResult
I've tried my algorithms, it cost almost 50 hours.
Does anybody has a more efficient way ? Very thanks.
Assuming that your DFB is small dataset, I am trying to give the below solution.
Try using a Broadcast Join like below
import org.apache.spark.sql.functions.broadcast
dfA.join(broadcast(dfB), col("dfA.id") === col("dfB.id") && col("dfA.value1") >= col("dfB.min_value1") && col("dfA.value1") <= col("dfB.max_value1")).groupBy(col("dfA.id")).agg(collect_list(struct("value2").as("results"));
BroadcastJoin is like a Map Side Join. This will materialize the smaller data to all the mappers. This will improve the performance by omitting the required sort-and-shuffle phase during a reduce step.
Some points i would like you to avoid:
Never use collect(). When a collect operation is issued on a RDD, the dataset is copied to the driver.
If your data is too big you might get memory out of bounds exception.
Try using take() or takeSample() instead.
It is obvious that when two dataframes/datasets are involved in calculation then join should be performed. So join is a must step for you. But when should you join is the important question.
I would suggest to aggregate and reduce rows in dataframes as much as possible before joining as it would reduce shuffling.
In your case you can reduce only dfA as you need exact dfB with a column added from dfA meeting the condition
So you can groupBy id and aggregate dfA so that you get one row of each id, then you can perform the join. And then you can use a udf function for your logic of calculation
comments are provided for clarity and explanation
import org.apache.spark.sql.functions._
//udf function to filter only the collected value2 which has value1 within range of min_value1 and max_value1
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row])=> list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue).map(_.getAs[Int]("value2")))
dfA.groupBy("id") //grouping by id
.agg(collect_list(struct("value1", "value2")).as("collection")) //collecting all the value1 and value2 as structs
.join(dfB, Seq("id"), "right") //joining both dataframes with id
.select(col("id"), col("min_value1"), col("max_value1"), selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) //calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful

Convert row values into columns with its value from another column in spark scala [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I'm trying to convert values from row into different columns with its value from another column. For example -
Input dataframe is like -
+-----------+
| X | Y | Z |
+-----------+
| 1 | A | a |
| 2 | A | b |
| 3 | A | c |
| 1 | B | d |
| 3 | B | e |
| 2 | C | f |
+-----------+
And the output dataframe should be like this -
+------------------------+
| Y | 1 | 2 | 3 |
+------------------------+
| A | a | b | c |
| B | d | null | e |
| C | null | f | null |
+------------------------+
I've tried to groupBy the values based on Y and collect_list on X and Z and then zipped X & Z together to get some sort of key-value pairs. But some Xs may be missing for some values of Y so in order to fill them with null values, I cross joined all possible values of X and all possible values of Y and then joined it my original dataframe. This is approach is highly inefficient.
Is there any efficient method to approach this problem ? Thanks in advance.
You can simply use groupBy with pivot and first as aggregate function as
import org.apache.spark.sql.functions._
df.groupBy("Y").pivot("X").agg(first("z"))
Output:
+---+----+----+----+
|Y |1 |2 |3 |
+---+----+----+----+
|B |d |null|e |
|C |null|f |null|
|A |a |b |c |
+---+----+----+----+