I have a dataframe with three columns: id, index and value.
+---+-----+-------------------+
| id|index| value|
+---+-----+-------------------+
| A| 1023|0.09938822262205915|
| A| 1046| 0.3110047630613805|
| A| 1069| 0.8486710971453512|
+---+-----+-------------------+
root
|-- id: string (nullable = true)
|-- index: integer (nullable = false)
|-- value: double (nullable = false)
Then, I have another dataframe which shows desirable periods for each id:
+---+-----------+---------+
| id|start_index|end_index|
+---+-----------+---------+
| A| 1069| 1276|
| B| 2066| 2291|
| B| 1616| 1841|
| C| 3716| 3932|
+---+-----------+---------+
root
|-- id: string (nullable = true)
|-- start_index: integer (nullable = false)
|-- end_index: integer (nullable = false)
I have three templates as below
val template1 = Array(0.0, 0.1, 0.15, 0.2, 0.3, 0.33, 0.42, 0.51, 0.61, 0.7)
val template2 = Array(0.96, 0.89, 0.82, 0.76, 0.71, 0.65, 0.57, 0.51, 0.41, 0.35)
val template3 = Array(0.0, 0.07, 0.21, 0.41, 0.53, 0.42, 0.34, 0.25, 0.19, 0.06)
The goal is, for each row in dfIntervals, to apply a function (let's assume it's correlation) that receives the value column from dfRaw and the three template arrays, and adds three columns to dfIntervals, one per template.
Assumptions:
1 - The template arrays each have exactly 10 elements.
2 - There are no duplicates in the index column of dfRaw.
3 - The start_index and end_index values in dfIntervals exist in the index column of dfRaw, and there are exactly 10 rows between them. For instance, dfRaw.filter($"id" === "A").filter($"index" >= 1069 && $"index" <= 1276).count (the first row of dfIntervals) is exactly 10.
Here's the code that generates these dataframes:
import org.apache.spark.sql.functions._
import spark.implicits._ // for .toDF and the $-column syntax
val mySeed = 1000
/* Defining templates for correlation analysis*/
val template1 = Array(0.0, 0.1, 0.15, 0.2, 0.3, 0.33, 0.42, 0.51, 0.61, 0.7)
val template2 = Array(0.96, 0.89, 0.82, 0.76, 0.71, 0.65, 0.57, 0.51, 0.41, 0.35)
val template3 = Array(0.0, 0.07, 0.21, 0.41, 0.53, 0.42, 0.34, 0.25, 0.19, 0.06)
/* Defining raw data*/
var dfRaw = Seq(
("A", (1023 to 1603 by 23).toArray),
("B", (341 to 2300 by 25).toArray),
("C", (2756 to 3954 by 24).toArray)
).toDF("id", "index")
dfRaw = dfRaw.select($"id", explode($"index") as "index").withColumn("value", rand(seed=mySeed))
/* Defining intervals*/
var dfIntervals = Seq(
("A", 1069, 1276),
("B", 2066, 2291),
("B", 1616, 1841),
("C", 3716, 3932)
).toDF("id", "start_index", "end_index")
The result is three columns added to the dfIntervals dataframe, named corr_w_template1, corr_w_template2 and corr_w_template3.
PS: I could not find a correlation function in Scala. Let's assume such a function exists (as below) and we are about to make a udf out of it if needed.
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
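For reference, a minimal Pearson-style implementation matching this signature could be sketched as follows (alternatively, Apache Commons Math's PearsonsCorrelation.correlation offers the same signature):
// A sketch of Pearson correlation for two equal-length arrays
def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
  require(arr1.length == arr2.length && arr1.nonEmpty, "arrays must be non-empty and of equal length")
  val n = arr1.length
  val mean1 = arr1.sum / n
  val mean2 = arr2.sum / n
  val cov  = arr1.zip(arr2).map { case (a, b) => (a - mean1) * (b - mean2) }.sum
  val var1 = arr1.map(a => (a - mean1) * (a - mean1)).sum
  val var2 = arr2.map(b => (b - mean2) * (b - mean2)).sum
  cov / math.sqrt(var1 * var2)
}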
Ok.
Let's define a UDF.
For testing purposes, let's say it always returns 1.
import scala.collection.mutable
import org.apache.spark.sql.{functions, Row}
import org.apache.spark.sql.functions._

// Placeholder correlation UDF: always returns 1 while testing the plumbing
val correlation = functions.udf( (values: mutable.WrappedArray[Double], template: mutable.WrappedArray[Double]) => {
  1f
})
// Sort the collected (index, value) structs by index and keep only the values
val orderUdf = udf((values: mutable.WrappedArray[Row]) => {
  values.sortBy(r => r.getAs[Int](0)).map(r => r.getAs[Double](1))
})
Then let's join the two dataframes with the rules defined above and collect value into one column called values, applying orderUdf so the values stay ordered by index.
val df = dfIntervals.join(dfRaw,dfIntervals("id") === dfRaw("id") && dfIntervals("start_index") <= dfRaw("index") && dfRaw("index") <= dfIntervals("end_index") )
.groupBy(dfIntervals("id"), dfIntervals("start_index"), dfIntervals("end_index"))
.agg(orderUdf(collect_list(struct(dfRaw("index"), dfRaw("value")))).as("values"))
Finally, apply our udf and show the result.
df.withColumn("corr_w_template1",correlation(df("values"), lit(template1)))
.withColumn("corr_w_template2",correlation(df("values"), lit(template2)))
.withColumn("corr_w_template3",correlation(df("values"), lit(template3)))
.show(10)
Here is the full example code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{functions, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes
import scala.collection.JavaConverters._
import scala.collection.mutable

val conf = new SparkConf().setAppName("learning").setMaster("local[2]")
val session = SparkSession.builder().config(conf).getOrCreate()
val mySeed = 1000
/* Defining templates for correlation analysis*/
val template1 = Array(0.0, 0.1, 0.15, 0.2, 0.3, 0.33, 0.42, 0.51, 0.61, 0.7)
val template2 = Array(0.96, 0.89, 0.82, 0.76, 0.71, 0.65, 0.57, 0.51, 0.41, 0.35)
val template3 = Array(0.0, 0.07, 0.21, 0.41, 0.53, 0.42, 0.34, 0.25, 0.19, 0.06)
val schema1 = DataTypes.createStructType(Array(
DataTypes.createStructField("id",DataTypes.StringType,false),
DataTypes.createStructField("index",DataTypes.createArrayType(DataTypes.IntegerType),false)
))
val schema2 = DataTypes.createStructType(Array(
DataTypes.createStructField("id",DataTypes.StringType,false),
DataTypes.createStructField("start_index",DataTypes.IntegerType,false),
DataTypes.createStructField("end_index",DataTypes.IntegerType,false)
))
/* Defining raw data*/
var dfRaw = session.createDataFrame(Seq(
("A", (1023 to 1603 by 23).toArray),
("B", (341 to 2300 by 25).toArray),
("C", (2756 to 3954 by 24).toArray)
).map(r => Row(r._1 , r._2)).asJava, schema1)
dfRaw = dfRaw.select(dfRaw("id"), explode(dfRaw("index")) as "index")
.withColumn("value", rand(seed=mySeed))
/* Defining intervals*/
var dfIntervals = session.createDataFrame(Seq(
("A", 1069, 1276),
("B", 2066, 2291),
("B", 1616, 1841),
("C", 3716, 3932)
).map(r => Row(r._1 , r._2,r._3)).asJava, schema2)
//Define udf
val correlation = functions.udf( (values: mutable.WrappedArray[Double], template: mutable.WrappedArray[Double]) => {
1f
})
val orderUdf = udf((values: mutable.WrappedArray[Row]) => {
values.sortBy(r => r.getAs[Int](0)).map(r => r.getAs[Double](1))
})
val df = dfIntervals.join(dfRaw,dfIntervals("id") === dfRaw("id") && dfIntervals("start_index") <= dfRaw("index") && dfRaw("index") <= dfIntervals("end_index") )
.groupBy(dfIntervals("id"), dfIntervals("start_index"), dfIntervals("end_index"))
.agg(orderUdf(collect_list(struct(dfRaw("index"), dfRaw("value")))).as("values"))
df.withColumn("corr_w_template1",correlation(df("values"), lit(template1)))
.withColumn("corr_w_template2",correlation(df("values"), lit(template2)))
.withColumn("corr_w_template3",correlation(df("values"), lit(template3)))
.show(10,false)
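To move beyond the test stub, the assumed correlation helper from the question can be wrapped directly; a sketch (the helper is called pearson here only to avoid clashing with the correlation UDF val above):
// Sketch: assuming a real helper, e.g. def pearson(arr1: Array[Double], arr2: Array[Double]): Double,
// wrap it instead of the constant-returning test UDF
val correlationUdf = functions.udf((values: mutable.WrappedArray[Double], template: mutable.WrappedArray[Double]) =>
  pearson(values.toArray, template.toArray)
)
// then use correlationUdf in the three withColumn calls above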
Related
I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. This vector is actually a tf-idf vector, and what I want is to reorder it so that in my new dataset the features have a different order. Is there any way to do this using Scala?
This is my code for generating the tf-idf vectors:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()
Perhaps this is useful-
Load the test data
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import scala.collection.mutable

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()
/**
* +---------------------+
* |features |
* +---------------------+
* |(5,[1,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|
* |[4.0,0.0,0.0,6.0,7.0]|
* +---------------------+
*
* root
* |-- features: vector (nullable = true)
*/
shuffle the vector
// Shuffle the vector's elements and rebuild a dense vector
val shuffleVector = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray)
)
val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()
/**
* +---------------------+---------------------+
* |features |shuffled_vector |
* +---------------------+---------------------+
* |(5,[1,3],[1.0,7.0]) |[1.0,0.0,0.0,0.0,7.0]|
* |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
* |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
* +---------------------+---------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
You can also use the above udf to create a Transformer and put it in a pipeline; please make sure to import org.apache.spark.ml.linalg._.
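A minimal sketch of such a Transformer is shown below; the class name, the hard-coded features / shuffled_vector column names and the simplified schema handling are illustrative assumptions, not part of the original answer:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StructType

// Hypothetical Transformer wrapping the shuffle UDF; column names are hard-coded for brevity
class VectorShuffler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("vectorShuffler"))

  private val shuffleVector = udf((vector: Vector) =>
    Vectors.dense(scala.util.Random.shuffle(vector.toArray.toSeq).toArray)
  )

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("shuffled_vector", shuffleVector(col("features")))

  override def transformSchema(schema: StructType): StructType =
    schema.add("shuffled_vector", SQLDataTypes.VectorType, nullable = true)

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}

// e.g. append `new VectorShuffler()` to the stages of an existing Pipeline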
Update-1: convert the shuffled vector to sparse
val shuffleVectorToSparse = udf((vector: Vector) =>
Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray).toSparse
)
val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()
/**
* +---------------------+-------------------------------+
* |features |shuffled_vector |
* +---------------------+-------------------------------+
* |(5,[1,3],[1.0,7.0]) |(5,[0,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
* |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0]) |
* +---------------------+-------------------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
Let me explain this with an example. Starting with the following dataframe
val df = Seq((1, "CS", 0, Array(0.1, 0.2, 0.4, 0.5)),
(4, "Ed", 0, Array(0.4, 0.8, 0.3, 0.6)),
(7, "CS", 0, Array(0.2, 0.5, 0.4, 0.7)),
(101, "CS", 1, Array(0.5, 0.7, 0.3, 0.8)),
(5, "CS", 1, Array(0.4, 0.2, 0.6, 0.9))).toDF("id", "dept", "test", "array")
df.show()
+---+----+----+--------------------+
| id|dept|test| array|
+---+----+----+--------------------+
| 1| CS| 0|[0.1, 0.2, 0.4, 0.5]|
| 4| Ed| 0|[0.4, 0.8, 0.3, 0.6]|
| 7| CS| 0|[0.2, 0.5, 0.4, 0.7]|
|101| CS| 1|[0.5, 0.7, 0.3, 0.8]|
| 5| CS| 1|[0.4, 0.2, 0.6, 0.9]|
+---+----+----+--------------------+
Consider the following two common operations as examples (but they do not have to be limited to these):
import org.apache.spark.sql.functions._ // for `when`
val dfFilter1 = df.where($"dept" === "CS")
val dfFilter3 = df.withColumn("category", when($"dept" === "CS" && $"id" === 101, 10).otherwise(0))
Now I have a string variable, colName = "dept", and $"dept" in the previous operations has to be replaced by colName in some form to achieve the same functionality. I managed to achieve the first one as follows:
val dfFilter2 = df.where(s"${colName} = 'CS'")
But a similar operation fails in the second case:
val dfFilter4 = df.withColumn("category", when(s"${colName} = 'CS'" && $"id" === 101, 10).otherwise(0))
Specifically it gives the following error:
Name: Unknown Error
Message: <console>:35: error: value && is not a member of String
val dfFilter4 = df.withColumn("category", when(s"${colName} = 'CS'" && $"id" === 101, 10).otherwise(0))
My understanding so far is that once I use s"${variable}" to deal with a variable, everything becomes a pure string and it is difficult to involve logical operations.
So, my questions are:
1. What is the best way to use such a string variable as colName for operations similar to the two I listed above (I also do not like the solution I have for .where())?
2. Are there any general guidelines for using such a string variable in more general operations beyond the two examples here (I always felt it is very case-specific when dealing with string-related operations)?
You can use the expr function as
val dfFilter4 = df.withColumn("category", when(expr(s"${colName} = 'CS' and id = 101"), 10).otherwise(0))
Reason for the error
The where function works when given a string query, as in
val dfFilter2 = df.where(s"${colName} = 'CS'")
because there are APIs supporting both String and Column:
/**
 * Filters rows using the given condition. This is an alias for `filter`.
 * {{{
 *   // The following are equivalent:
 *   peopleDs.filter($"age" > 15)
 *   peopleDs.where($"age" > 15)
 * }}}
 *
 * @group typedrel
 * @since 1.6.0
 */
def where(condition: Column): Dataset[T] = filter(condition)
and
/**
 * Filters rows using the given SQL expression.
 * {{{
 *   peopleDs.where("age > 15")
 * }}}
 *
 * @group typedrel
 * @since 1.6.0
 */
def where(conditionExpr: String): Dataset[T] = {
  filter(Column(sparkSession.sessionState.sqlParser.parseExpression(conditionExpr)))
}
But there is only one API for the when function, and it supports only the Column type:
/**
 * Evaluates a list of conditions and returns one of multiple possible result expressions.
 * If otherwise is not defined at the end, null is returned for unmatched conditions.
 *
 * {{{
 *   // Example: encoding gender string column into integer.
 *
 *   // Scala:
 *   people.select(when(people("gender") === "male", 0)
 *     .when(people("gender") === "female", 1)
 *     .otherwise(2))
 *
 *   // Java:
 *   people.select(when(col("gender").equalTo("male"), 0)
 *     .when(col("gender").equalTo("female"), 1)
 *     .otherwise(2))
 * }}}
 *
 * @group normal_funcs
 * @since 1.4.0
 */
def when(condition: Column, value: Any): Column = withExpr {
  CaseWhen(Seq((condition.expr, lit(value).expr)))
}
So you cannot use a SQL string with the when function.
The correct way of doing it is therefore as follows:
val dfFilter4 = df.withColumn("category", when(col(s"${colName}") === "CS" && $"id" === 101, 10).otherwise(0))
or in short as
val dfFilter4 = df.withColumn("category", when(col(colName) === "CS" && col("id") === 101, 10).otherwise(0))
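Since expr returns a Column, you can also mix a SQL fragment with Column operators, which is what the original && attempt was reaching for; an illustrative variant:
val dfFilter5 = df.withColumn("category",
  when(expr(s"${colName} = 'CS'") && $"id" === 101, 10).otherwise(0))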
What is the best way to use such a string variable as colName for operations similar to the two I listed above?
You can use the col function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
val colName = "dept"
For dfFilter2
val dfFilter2 = df.where(col(colName) === "CS")
For dfFilter4
val dfFilter4 = df.withColumn("category", when(col(colName) === "CS" && $"id" === 101, 10).otherwise(0))
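For the more general case in the second question, the same col-based approach composes with ordinary collection operations; a hypothetical sketch combining several (column name, value) conditions:
// Hypothetical: build one filter Column from a map of column names to expected values
val conditions = Map[String, Any]("dept" -> "CS", "test" -> 1)
val combined = conditions.map { case (c, v) => col(c) === v }.reduce(_ && _)
val dfFilter6 = df.where(combined)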
Let me explain what I want to achieve with an example.
Starting with a DataFrame as following:
val df = Seq((1, "CS", 0, Array(0.1, 0.2, 0.4, 0.5)),
  (4, "Ed", 0, Array(0.4, 0.8, 0.3, 0.6)),
  (7, "CS", 0, Array(0.2, 0.5, 0.4, 0.7)),
  (101, "CS", 1, Array(0.5, 0.7, 0.3, 0.8)),
  (5, "CS", 1, Array(0.4, 0.2, 0.6, 0.9)))
  .toDF("id", "dept", "test", "array")
+---+----+----+--------------------+
| id|dept|test| array|
+---+----+----+--------------------+
| 1| CS| 0|[0.1, 0.2, 0.4, 0.5]|
| 4| Ed| 0|[0.4, 0.8, 0.3, 0.6]|
| 7| CS| 0|[0.2, 0.5, 0.4, 0.7]|
|101| CS| 1|[0.5, 0.7, 0.3, 0.8]|
| 5| CS| 1|[0.4, 0.2, 0.6, 0.9]|
+---+----+----+--------------------+
I want to change some elements of the array column according to the information in the id, dept and test columns. I first add an Index to each row within each dept as follows:
import org.apache.spark.sql.expressions.Window

@transient val w = Window.partitionBy("dept").orderBy("id")
val tempdf = df.withColumn("Index", row_number().over(w))
tempdf.show
+---+----+----+--------------------+-----+
| id|dept|test| array|Index|
+---+----+----+--------------------+-----+
| 1| CS| 0|[0.1, 0.2, 0.4, 0.5]| 1|
| 5| CS| 1|[0.4, 0.2, 0.6, 0.9]| 2|
| 7| CS| 0|[0.2, 0.5, 0.4, 0.7]| 3|
|101| CS| 1|[0.5, 0.7, 0.3, 0.8]| 4|
| 4| Ed| 0|[0.4, 0.8, 0.3, 0.6]| 1|
+---+----+----+--------------------+-----+
What I want to achieve is to subtract a constant (0.1) from the element of the array column whose position corresponds to the Index of the row within each dept. For example, in the "dept == CS" case, the final result should be:
+---+----+----+--------------------+-----+
| id|dept|test| array|Index|
+---+----+----+--------------------+-----+
| 1| CS| 0|[0.0, 0.2, 0.4, 0.5]| 1|
| 5| CS| 1|[0.4, 0.1, 0.6, 0.9]| 2|
| 7| CS| 0|[0.2, 0.5, 0.3, 0.7]| 3|
|101| CS| 1|[0.5, 0.7, 0.3, 0.7]| 4|
| 4| Ed| 0|[0.4, 0.8, 0.3, 0.6]| 1|
+---+----+----+--------------------+-----+
Currently, I am thinking of achieving this with a udf as follows:
def subUdf = udf((array: Seq[Double], dampFactor: Double, additionalIndex: Int) => additionalIndex match{
case 0 => array
case _ => { val temp = array.zipWithIndex
var mask = Array.fill(array.length)(0.0)
mask(additionalIndex-1) = dampFactor
val tempAdj = temp.map(x => if (additionalIndex == (x._2+1)) (x._1-mask, x._2) else x)
tempAdj.map(_._1)
}
}
)
val dampFactor = 0.1
val finaldf = tempdf.withColumn("array", subUdf(tempdf("array"), dampFactor, when(tempdf("dept") === "CS" && tempdf("test") === 0, tempdf("Index")).otherwise(lit(0)))).drop("Index")
The udf has a compile error due to an overloaded method:
Name: Compile Error
Message: <console>:34: error: overloaded method value - with alternatives:
(x: Double)Double <and>
(x: Float)Double <and>
(x: Long)Double <and>
(x: Int)Double <and>
(x: Char)Double <and>
(x: Short)Double <and>
(x: Byte)Double
cannot be applied to (Array[Double])
val tempAdj = temp.map(x => if (additionalIndex == (x._2+1)) (x._1-mask, x._2) else x)
^
Two related questions:
How do I resolve the compile error?
I am also open to suggestions for achieving this with a method other than a udf.
If I understand your requirement correctly, you can create a UDF that takes the dampFactor, the array column and the window index column to transform the dataframe as follows:
val df = Seq(
(1, "CS", 0, Seq(0.1, 0.2, 0.4, 0.5)),
(4, "Ed", 0, Seq(0.4, 0.8, 0.3, 0.6)),
(7, "CS", 0, Seq(0.2, 0.5, 0.4, 0.7)),
(101, "CS", 1, Seq(0.5, 0.7, 0.3, 0.8)),
(5, "CS", 1, Seq(0.4, 0.2, 0.6, 0.9))
).toDF("id", "dept", "test", "array")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("dept").orderBy("id")
val tempdf = df.withColumn("index", row_number().over(w))
def adjustSeq(dampFactor: Double) = udf(
(seq: Seq[Double], index: Int) =>
seq.indices.map(i =>
if (i == index - 1) seq(i) - dampFactor else seq(i)
)
)
val finaldf = tempdf.
withColumn("array", adjustSeq(0.1)($"array", $"index")).
drop("index")
finaldf.show(false)
// +---+----+----+------------------------------------+
// |id |dept|test|array |
// +---+----+----+------------------------------------+
// |1 |CS |0 |[0.0, 0.2, 0.4, 0.5] |
// |5 |CS |1 |[0.4, 0.1, 0.6, 0.9] |
// |7 |CS |0 |[0.2, 0.5, 0.30000000000000004, 0.7]|
// |101|CS |1 |[0.5, 0.7, 0.3, 0.7000000000000001] |
// |4 |Ed |0 |[0.30000000000000004, 0.8, 0.3, 0.6]|
// +---+----+----+------------------------------------+
Your sample code appears to include some additional logic not described in the requirement:
val finaldf = tempdf.withColumn("array", subUdf(tempdf("array"),
dampFactor, when(tempdf("dept") === "CS" && tempdf("test") === 0,
tempdf("Index")).otherwise(lit(0)))).drop("Index")
To factor in the additional logic:
def adjustSeq(dampFactor: Double) = udf(
(seq: Seq[Double], index: Int, dept: String, test: Int) =>
(dept, test) match {
case ("CS", 0) =>
seq.indices.map(i =>
if (i == index - 1) seq(i) - dampFactor else seq(i)
)
case _ => seq
}
)
val finaldf = tempdf.
withColumn("array", adjustSeq(0.1)($"array", $"index", $"dept", $"test")).
drop("index")
finaldf.show(false)
// +---+----+----+------------------------------------+
// |id |dept|test|array |
// +---+----+----+------------------------------------+
// |1 |CS |0 |[0.0, 0.2, 0.4, 0.5] |
// |5 |CS |1 |[0.4, 0.2, 0.6, 0.9] |
// |7 |CS |0 |[0.2, 0.5, 0.30000000000000004, 0.7]|
// |101|CS |1 |[0.5, 0.7, 0.3, 0.8] |
// |4 |Ed |0 |[0.4, 0.8, 0.3, 0.6] |
// +---+----+----+------------------------------------+
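As for the literal compile error in the original udf: mask is an Array[Double], so x._1 - mask does not type-check; subtracting the element, x._1 - mask(x._2), would compile. For the non-UDF route the question also asks about, on Spark 2.4+ one possible sketch uses the transform higher-order SQL function (matching the desired output in the question, where only dept == "CS" rows change):
// Sketch (Spark 2.4+): adjust the i-th element via a SQL higher-order function; i is 0-based
val finaldfNoUdf = tempdf.withColumn(
  "array",
  expr("""
    CASE WHEN dept = 'CS'
         THEN transform(array, (x, i) -> IF(i = Index - 1, x - 0.1, x))
         ELSE array
    END""")
).drop("Index")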
Using Python's Pandas, one can do bulk operations on multiple columns in one pass like this:
# assuming we have a DataFrame with, among others, the following columns
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
df[cols] = df[cols] / df['another_column']
Is there a similar functionality using Spark in Scala?
Currently I end up doing:
val df2 = df.withColumn("col1", $"col1" / $"another_column")
.withColumn("col2", $"col2" / $"another_column")
.withColumn("col3", $"col3" / $"another_column")
.withColumn("col4", $"col4" / $"another_column")
.withColumn("col5", $"col5" / $"another_column")
.withColumn("col6", $"col6" / $"another_column")
.withColumn("col7", $"col7" / $"another_column")
.withColumn("col8", $"col8" / $"another_column")
You can use foldLeft to process the column list as below:
val df = Seq(
(1, 20, 30, 4),
(2, 30, 40, 5),
(3, 10, 30, 2)
).toDF("id", "col1", "col2", "another_column")
val cols = Array("col1", "col2")
val df2 = cols.foldLeft( df )( (acc, c) =>
acc.withColumn( c, df(c) / df("another_column") )
)
df2.show
+---+----+----+--------------+
| id|col1|col2|another_column|
+---+----+----+--------------+
| 1| 5.0| 7.5| 4|
| 2| 6.0| 8.0| 5|
| 3| 5.0|15.0| 2|
+---+----+----+--------------+
For completeness: a slightly different version from @Leo C's, not using foldLeft but a single select expression instead:
import org.apache.spark.sql.functions._
import spark.implicits._
val toDivide = List("col1", "col2")
val newColumns = toDivide.map(name => col(name) / col("another_column") as name)
val df2 = df.select(($"id" :: newColumns) :+ $"another_column": _*)
Produces the same output.
You can use a plain select on the operated columns. The solution is very similar to the Pandas solution.
//Define the dataframe df1
case class ARow(col1: Int, col2: Int, anotherCol: Int)
val df1 = spark.createDataset(Seq(
ARow(1, 2, 3),
ARow(4, 5, 6),
ARow(7, 8, 9))).toDF
// Perform the operation using a map
val cols = Array("col1", "col2")
val opCols = cols.map(c => df1(c)/df1("anotherCol"))
// Select the columns operated
val df2 = df1.select(opCols: _*)
The .show on df2
df2.show()
+-------------------+-------------------+
|(col1 / anotherCol)|(col2 / anotherCol)|
+-------------------+-------------------+
| 0.3333333333333333| 0.6666666666666666|
| 0.6666666666666666| 0.8333333333333334|
| 0.7777777777777778| 0.8888888888888888|
+-------------------+-------------------+
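If you also want to keep the divisor column and the original column names (closer to the Pandas snippet), a small variation on the same idea:
// Alias each ratio back to its original name and keep the divisor column
val opColsNamed = cols.map(c => (df1(c) / df1("anotherCol")).as(c))
val df3 = df1.select((opColsNamed :+ df1("anotherCol")): _*)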
I have 2 DataFrames
case class UserTransactions(id: Long, transactionDate: java.sql.Date, currencyUsed: String, value: Long)
ID, TransactionDate, CurrencyUsed, value
1, 2016-01-05, USD, 100
1, 2016-01-09, GBP, 150
1, 2016-02-01, USD, 50
1, 2016-02-10, JPN, 10
2, 2016-01-10, EURO, 50
2, 2016-01-10, GBP, 100
case class ReportingTime(userId: Long, reportDate: java.sql.Date)
userId, reportDate
1, 2016-01-05
1, 2016-01-31
1, 2016-02-15
2, 2016-01-10
2, 2016-02-01
Now I want to get a summary by combining all previously used currencies per userId and reportDate, with their summed values. The results should look like:
userId, reportDate, transactionSummary
1, 2016-01-05, None
1, 2016-01-31, (USD -> 100)(GBP-> 150) // combined above 2 transactions less than 2016-01-31
1, 2016-02-15, (USD -> 150)(GBP-> 150)(JPN->10) // combined transactions less than 2016-02-15
2, 2016-01-10, None
2, 2016-02-01, (EURO-> 50) (GBP-> 100)
What is the best way to do this? We have over 300 million transactions, and each user can have up to 10,000 transactions.
The snippet below achieves your requirement. The initial joining and aggregation is done via the PySpark DataFrame API. The grouping of the data (using reduceByKey) and the final dataset preparation are then done via the RDD API, since it is more suitable for such operations.
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(1,'2016-01-05','USD',100),
(1,'2016-01-09','GBP',150),
(1,'2016-02-01','USD',50),
(1,'2016-02-10','JPN',10),
(2,'2016-01-10','EURO',50),
(2,'2016-01-10','GBP',100)],['id', 'tdate', 'currency', 'value'])
df2 = spark.createDataFrame([(1,'2016-01-05'),
(1,'2016-01-31'),
(1,'2016-02-15'),
(2,'2016-01-10'),
(2,'2016-02-01')],['user_id', 'report_date'])
func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())  # convert the string column to a date column
df2 = df2.withColumn('tdate', func(df2.report_date))
df1 = df1.withColumn('tdate', func(df1.tdate))
result = df2.join(df1, (df1.id == df2.user_id) & (df1.tdate < df2.report_date), 'left_outer').select('user_id', 'report_date', 'currency', 'value').groupBy('user_id', 'report_date', 'currency').agg(F.sum('value').alias('value'))
data = (result.rdd
        .map(lambda x: (x.user_id, x.report_date, x.currency, x.value))
        .keyBy(lambda x: (x[0], x[1]))
        .mapValues(lambda x: filter(lambda y: bool(y), [(x[2], x[3]) if x[2] else None]))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[0][0], x[0][1], x[1])))
The final result generated is as shown below.
>>> spark.createDataFrame([ (x[0],x[1],str(x[2])) for x in data.collect()], ['id', 'date', 'values']).orderBy('id', 'date').show(20, False)
+---+----------+--------------------------------------------+
|id |date |values |
+---+----------+--------------------------------------------+
|1 |2016-01-05|[] |
|1 |2016-01-31|[(u'USD', 100), (u'GBP', 150)] |
|1 |2016-02-15|[(u'USD', 150), (u'GBP', 150), (u'JPN', 10)]|
|2 |2016-01-10|[] |
|2 |2016-02-01|[(u'EURO', 50), (u'GBP', 100)] |
+---+----------+--------------------------------------------+
In case someone needs it in Scala:
import java.text.SimpleDateFormat
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import spark.implicits._

case class Transaction(id: String, date: java.sql.Date, currency: Option[String], value: Option[Long])
case class Report(id: String, date: java.sql.Date)

def toDate(date: String): java.sql.Date = {
  val sf = new SimpleDateFormat("yyyy-MM-dd")
  new java.sql.Date(sf.parse(date).getTime)
}
val allTransactions = Seq(
Transaction("1", toDate("2016-01-05"),Some("USD"),Some(100L)),
Transaction("1", toDate("2016-01-09"),Some("GBP"),Some(150L)),
Transaction("1",toDate("2016-02-01"),Some("USD"),Some(50L)),
Transaction("1",toDate("2016-02-10"),Some("JPN"),Some(10L)),
Transaction("2",toDate("2016-01-10"),Some("EURO"),Some(50L)),
Transaction("2",toDate("2016-01-10"),Some("GBP"),Some(100L))
)
val allReports = Seq(
Report("1",toDate("2016-01-05")),
Report("1",toDate("2016-01-31")),
Report("1",toDate("2016-02-15")),
Report("2",toDate("2016-01-10")),
Report("2",toDate("2016-02-01"))
)
val transections:Dataset[Transaction] = spark.createDataFrame(allTransactions).as[Transaction]
val reports: Dataset[Report] = spark.createDataFrame(allReports).as[Report]
val result = reports.alias("rp").join(transections.alias("tx"), (col("tx.id") === col("rp.id")) && (col("tx.date") < col("rp.date")), "left_outer")
.select("rp.id", "rp.date", "currency", "value")
.groupBy("rp.id", "rp.date", "currency").agg(sum("value"))
.toDF("id", "date", "currency", "value")
.as[Transaction]
val data = result.rdd.keyBy(x => (x.id , x.date))
.mapValues(x => if (x.currency.isDefined) collection.Map[String, Long](x.currency.get -> x.value.get) else collection.Map[String, Long]())
.reduceByKey((x,y) => x ++ y).map(x => (x._1._1, x._1._2, x._2))
.toDF("id", "date", "map")
.orderBy("id", "date")
Console output
+---+----------+--------------------------------------+
|id |date |map |
+---+----------+--------------------------------------+
|1 |2016-01-05|Map() |
|1 |2016-01-31|Map(GBP -> 150, USD -> 100) |
|1 |2016-02-15|Map(USD -> 150, GBP -> 150, JPN -> 10)|
|2 |2016-01-10|Map() |
|2 |2016-02-01|Map(GBP -> 100, EURO -> 50) |
+---+----------+--------------------------------------+
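On Spark 2.4+, the drop to the RDD API can usually be avoided by building the map directly with map_from_entries; a rough sketch along the lines of the Scala version above (it yields an empty Map() rather than None for report dates with no earlier transactions, as in the output shown):
// Sketch (Spark 2.4+): stay in the DataFrame API and assemble the currency map per report
val summary = reports.alias("rp")
  .join(transections.alias("tx"),
    col("tx.id") === col("rp.id") && col("tx.date") < col("rp.date"), "left_outer")
  .groupBy(col("rp.id"), col("rp.date"), col("tx.currency"))
  .agg(sum("value").as("value"))
  .toDF("id", "date", "currency", "value")
  .groupBy("id", "date")
  .agg(map_from_entries(
    collect_list(when(col("currency").isNotNull, struct(col("currency"), col("value"))))
  ).as("map"))
  .orderBy("id", "date")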