I have a column of lists (an array column) in a Spark dataframe.
How do I convert the arrays into a Spark dataframe where each element of the list becomes its own column?
I am new to Scala, and I want to use Scala to solve it.
For example, the input has an array column of integers (referred to as features in the answers below), and each element should become its own column.
You can do it by creating an RDD of Rows, defining a schema, and using both to create a dataframe.
// Imports needed for Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// A seq of seqs
val s = Seq(1 to 6, 1 to 6, 1 to 6)
// Let's create an RDD of Rows
val rdd = sc.parallelize(s).map(Row.fromSeq)
// Let's define a schema based on the first seq of s
val schema = StructType(
(1 to s(0).size).map(i => StructField("c"+i, IntegerType, true))
)
// And let's finally create the dataframe
val df = spark.createDataFrame(rdd, schema)
df.show
// +---+---+---+---+---+---+
// | c1| c2| c3| c4| c5| c6|
// +---+---+---+---+---+---+
// | 1| 2| 3| 4| 5| 6|
// | 1| 2| 3| 4| 5| 6|
// | 1| 2| 3| 4| 5| 6|
// +---+---+---+---+---+---+
If you have a dataframe as mentioned in the question, with an array column as
root
|-- features: array (nullable = true)
| |-- element: integer (containsNull = false)
then you can use the following logic:
val finalCols = Array("c1", "c2", "c3", "c4", "c5", "c6", "c7")
import org.apache.spark.sql.functions._
finalCols.zipWithIndex
  .foldLeft(df) { (tempdf, c) => tempdf.withColumn(c._1, col("features")(c._2)) }
  .select(finalCols.map(col): _*)
  .show(false)
which should give you
+---+---+---+---+---+---+---+
|c1 |c2 |c3 |c4 |c5 |c6 |c7 |
+---+---+---+---+---+---+---+
|0 |45 |63 |0 |0 |0 |0 |
|0 |0 |0 |85 |0 |69 |0 |
|0 |89 |56 |0 |0 |0 |0 |
+---+---+---+---+---+---+---+
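An equivalent single select (a sketch under the same assumptions as above, not from the original answer) indexes the features array directly instead of folding over withColumn:
import org.apache.spark.sql.functions.col
// Pull out elements 0..6 of the array and name them c1..c7 in one projection
df.select((0 until finalCols.length).map(i => col("features")(i).as(finalCols(i))): _*).show(false)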
Or you can use a udf function as
import org.apache.spark.sql.functions._
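// Assumed helper, not shown in the original answer: a case class with seven Int fields,
// one per array element, so the udf below can return them as named struct fields.
case class testCaseClass(c1: Int, c2: Int, c3: Int, c4: Int, c5: Int, c6: Int, c7: Int)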
def splitArrayUdf = udf((features: Seq[Int]) => testCaseClass(features(0), features(1), features(2), features(3), features(4), features(5), features(6)))
df.select(splitArrayUdf(col("features")).as("features")).select(col("features.*")).show(false)
which should give you the same result.
I hope the answer is helpful
Related
For the dataframe given below, I want a new column in the dataframe which should hold the constant value of the sum of the freq column.
+------+----+
|number|freq|
+------+----+
| 8| 1|
| 6| 2|
| 2| 4|
+------+----+
The result should look like
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
| 8| 1| 7|
| 6| 2| 7|
| 2| 4| 7|
+------+----+-------+
and I want this without groupBy or agg.
I tried :
var x = sum(df("freq"))
df.withColumn("new_col",lit(x))
or
df.withColumn("new_col",x)
or
df.withColumn("new_col",sum($"freq"))
But none worked.
You can try this, but be careful: it uses a single partition:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(8,1),
(6,2),
(2,4)
).toDF("number","freq")
df.withColumn("new_col", sum($"freq").over())
.show(false)
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
|8 |1 |7 |
|6 |2 |7 |
|2 |4 |7 |
+------+----+-------+
You could use a window over the entire dataframe to do that, but I highly recommend against it: all the data would have to go to a single partition, which would be terrible in terms of performance.
A simple way to do it, very similar to your 1st approach, is to evaluate the aggregate first (sum(df("freq")) on its own is just a Column expression, not a value, which is why it cannot be passed to withColumn directly) and then hand the collected value to lit:
import org.apache.spark.sql.Row
val Row(x) = df.select(sum('freq)).head
val new_df = df.withColumn("new_col", lit(x))
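Another option, not in the original answers but worth sketching, is to aggregate the sum into a one-row dataframe and cross join it back; this avoids both the driver round trip and the single-partition window (it assumes the same df and imports as above):
// One-row dataframe holding the total, attached to every row via a cross join
val total = df.agg(sum($"freq").as("new_col"))
df.crossJoin(total).show(false)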
I would like to replicate rows according to their value for a given column. For example, I got this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use the withColumn method, according to this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried with explode(Array.fill($"count"){1}) but same problem here.)
What do I need to change? Is there a cleaner way?
array_repeat is available from Spark 2.4 onwards. If you need a solution for lower versions, you can use udf() or an RDD. For the RDD approach, check this out:
import org.apache.spark.sql.Row
import spark.implicits._
val df = Seq(3,1,4).toDF("count")
val rdd1 = df.rdd.flatMap( x=> { val y = x.getAs[Int]("count"); for ( p <- 0 until y ) yield Row(y) } )
spark.createDataFrame(rdd1,df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With the DataFrame alone (no RDD conversion):
df.flatMap(r => (0 until r.getInt(0)).map(_ => r.getInt(0))).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
For udf(), the following would work:
val df = Seq(3,1,4).toDF("count")
def array_repeat(x: Int): Array[Int] = {
  val y = for (p <- 0 until x) yield x
  y.toArray
}
val udf_array_repeat = udf((x: Int) => array_repeat(x))
df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT :
Check #user10465355's answer below for more information about array_repeat.
You can use array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
import spark.implicits._
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+
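Applied to the count column from the question (a minimal sketch, assuming the same Seq(3, 1, 4) data as the earlier answers), the call looks like:
import org.apache.spark.sql.functions.{array_repeat, explode}
import spark.implicits._
val countDf = Seq(3, 1, 4).toDF("count")
// Repeat each count value count times, then explode the resulting array into rows
countDf.select(explode(array_repeat($"count", $"count")).as("count")).show()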
I am trying to flatten an RDD[(String, Map[String, Int])] to an RDD[(String, String, Int)] and ultimately save it as a dataframe.
val rdd=hashedContent.map(f=>(f._1,f._2.flatMap(x=> (x._1, x._2))))
val rdd=hashedContent.map(f=>(f._1,f._2.flatMap(x=>x)))
Both give type mismatch errors.
Any help on how to flatten structures like this one?
EDIT:
hashedContent -- ("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1))
You were close:
rdd.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
.toDF()
.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| A|acs| 2|
| A|sdv| 2|
| A|sfd| 1|
| B|ass| 2|
| B|fvv| 2|
| B|ffd| 1|
| c| dg| 2|
| c| vd| 2|
| c|dgr| 1|
+---+---+---+
Data
val data = Seq(("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1)))
val rdd = sc.parallelize(data)
For completeness: an alternative solution (which might be considered more readable) would be to first convert the RDD into a DataFrame, and then to transform its structure using explode:
import org.apache.spark.sql.functions._
import spark.implicits._
rdd.toDF("c1", "map")
.select($"c1", explode($"map"))
.show(false)
// same result:
// +---+---+-----+
// |c1 |key|value|
// +---+---+-----+
// |A |acs|2 |
// |A |sdv|2 |
// |A |sfd|1 |
// |B |ass|2 |
// |B |fvv|2 |
// |B |ffd|1 |
// |c |dg |2 |
// |c |vd |2 |
// |c |dgr|1 |
// +---+---+-----+
I have an input spark-dataframe named df as
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here, Total_Count is the sum of P1, P2 and P3, and P1, P2, P3 are the product names. I need to find the frequency of each product by dividing each product's value by its Total_Count. I need to create a new spark-dataframe named frequencyTable as follows,
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this using Scala as,
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
But I would prefer to avoid this for loop because my df is large. What is the optimized solution?
If you have a dataframe such as
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
You can simply use foldLeft on the columns except Main_CustomerID and Total_Count, i.e. on P1, P2 and P3:
val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList
df_columns
  .foldLeft(df) { (tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count")) }
  .show(false)
which should give you
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
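If there are many product columns, an equivalent single select (a sketch under the same assumptions, not part of the original answer) builds all the ratio expressions at once instead of chaining withColumn:
import org.apache.spark.sql.functions.col
// One ratio expression per product column, assembled into a single projection
val ratioCols = df_columns.map(c => (col(c) / col("Total_Count")).as(c))
df.select((col("Main_CustomerID") +: ratioCols :+ col("Total_Count")): _*).show(false)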
I hope the answer is helpful
I'm searching for a Scala analogue of the Python (pandas) .transform().
Namely, I need to create a new feature: the group mean of the corresponding class.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the Python implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
df.groupBy("class").agg(mean("val").as("class_mean"))
which gives one aggregated row per class. If you want to keep all the columns, you can use a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
.show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+
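If you would rather avoid a window altogether, an equivalent result (a sketch, not part of the original answer) can be obtained by aggregating per class and joining the means back:
// Compute one mean per class, then join it back onto the original rows
val classMeans = df.groupBy("class").agg(mean("val").as("class_mean"))
df.join(classMeans, Seq("class")).show(false)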