Flatten RDD[(String, Map[String, Int])] to RDD[(String, String, Int)] - Scala

I am trying to flatten an RDD[(String, Map[String, Int])] to RDD[(String, String, Int)] and ultimately save it as a DataFrame. These are my attempts so far:
val rdd = hashedContent.map(f => (f._1, f._2.flatMap(x => (x._1, x._2))))
val rdd = hashedContent.map(f => (f._1, f._2.flatMap(x => x)))
Both fail with type mismatch errors.
Any help on how to flatten structures like this one?
EDIT:
hashedContent -- ("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1))

You were close:
rdd.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
.toDF()
.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| A|acs| 2|
| A|sdv| 2|
| A|sfd| 1|
| B|ass| 2|
| B|fvv| 2|
| B|ffd| 1|
| c| dg| 2|
| c| vd| 2|
| c|dgr| 1|
+---+---+---+
Data
val data = Seq(("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1)))
val rdd = sc.parallelize(data)
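Since the question also mentions saving the result as a DataFrame, you can pass column names straight to toDF; "id", "token" and "count" below are just illustrative names, not anything from the original post:
rdd.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
  .toDF("id", "token", "count")
  .show()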

For completeness: an alternative solution (which might be considered more readable) would be to first convert the RDD into a DataFrame, and then to transform its structure using explode:
import org.apache.spark.sql.functions._
import spark.implicits._
rdd.toDF("c1", "map")
.select($"c1", explode($"map"))
.show(false)
// same result:
// +---+---+-----+
// |c1 |key|value|
// +---+---+-----+
// |A |acs|2 |
// |A |sdv|2 |
// |A |sfd|1 |
// |B |ass|2 |
// |B |fvv|2 |
// |B |ffd|1 |
// |c |dg |2 |
// |c |vd |2 |
// |c |dgr|1 |
// +---+---+-----+
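Note that explode on a map column yields columns named key and value; if you prefer different names, the multi-alias form of as works here. A minimal sketch ("c2"/"c3" are just placeholder names):
rdd.toDF("c1", "map")
  .select($"c1", explode($"map").as(Seq("c2", "c3")))
  .show(false)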

filter or label rows based on a Scala array

Is there a way to filter or label rows based on a Scala array?
Please keep in mind that in reality the number of rows is much larger.
Sample data:
val clients = List(List("1", "67"), List("2", "77"), List("3", "56"), List("4", "90")).map(x => (x(0), x(1)))
val df = clients.toDF("soc", "ages")
+---+----+
|soc|ages|
+---+----+
| 1| 67|
| 2| 77|
| 3| 56|
| 4| 90|
| ..| ..|
+---+----+
I would like to filter all the ages that are in a Scala array, let's say
var z = Array(90, 56, 67)
using something like
df.where($"ages" IN z)
or
df.withColumn("flag", when($"ages" >= 30, 1)
  .otherwise(when($"ages" <= 5, 2)
  .otherwise(3)))
You can pass each element of the Array as an argument using the _* operator, then write a when/otherwise expression using isin.
Example:
val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")
val z = Array(90, 56,67)
df1.withColumn("flag",
when('ages.isin(z: _*), "in Z array")
.otherwise("not in Z array"))
.show(false)
+---+----+--------------+
|soc|ages|flag |
+---+----+--------------+
|1 |67 |in Z array |
|2 |77 |not in Z array|
|3 |56 |in Z array |
|4 |90 |in Z array |
+---+----+--------------+
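Since the question also asks about plain filtering, the same isin expression works directly in where/filter; a minimal sketch:
df1.where($"ages".isin(z: _*)).show(false)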
Another option is a UDF:
scala> val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")
df1: org.apache.spark.sql.DataFrame = [soc: int, ages: int]
scala> df1.show
+---+----+
|soc|ages|
+---+----+
| 1| 67|
| 2| 77|
| 3| 56|
| 4| 90|
+---+----+
scala> val scalaAgesArray = Array(90, 56,67)
scalaAgesArray: Array[Int] = Array(90, 56, 67)
scala> val containsAgeUdf = udf((x: Int) => scalaAgesArray.contains(x))
containsAgeUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))
scala> val outputDF = df1.withColumn("flag", containsAgeUdf($"ages"))
outputDF: org.apache.spark.sql.DataFrame = [soc: int, ages: int ... 1 more field]
scala> outputDF.show(false)
+---+----+-----+
|soc|ages|flag |
+---+----+-----+
|1 |67 |true |
|2 |77 |false|
|3 |56 |true |
|4 |90 |true |
+---+----+-----+
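The boolean flag column can then be used to filter or relabel rows; for example (a small sketch reusing the names above):
outputDF.filter($"flag").show(false)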

Hash the contents of a column of type Array[Int] individually

I have a DataFrame of (Int, Array[Int]) with the values
+---+------+
| _1| _2|
+---+------+
| 1| [1]|
| 1| [2]|
| 2|[3, 4]|
+---+------+
I want to return a DataFrame of
+---+------+------------------+
| _1| _2| _3|
+---+------+------------------+
| 1| [1]| [hash(1)]|
| 1| [2]| [hash(2)]|
| 2|[3, 4]|[hash(3), hash(4)]|
+---+------+------------------+
I originally attempted to convert the DataFrame into a Dataset and map over it. However, I am unable to reproduce the hash with MurmurHash3.
In short, I am unable to reproduce https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165-L2168.
Any ideas on how to proceed?
I am open to any method to get my desired result.
Use the transform higher-order function (available in Spark SQL from 2.4 onwards):
val df = Seq((1, Seq(1)), (1, Seq(2)), (2, Seq(3, 4))).toDF
df.selectExpr("*", "transform(_2, x -> hash(x)) AS _3").show
+---+------+--------------------+
| _1| _2| _3|
+---+------+--------------------+
| 1| [1]| [-559580957]|
| 1| [2]| [1765031574]|
| 2|[3, 4]|[-1823081949, -39...|
+---+------+--------------------+
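For reference, Spark 3.0+ also exposes transform in the Scala Column API, so the same thing can be written without selectExpr; a minimal sketch:
import org.apache.spark.sql.functions.{hash, transform}

df.withColumn("_3", transform($"_2", x => hash(x))).show()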

Replicating rows in a Spark DataFrame according to values in a column

I would like to replicate rows according to their value in a given column. For example, I have this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use the withColumn method, following this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried with explode(Array.fill($"count"){1}) but same problem here.)
What do I need to change? Is there a cleaner way?
array_repeat is available from Spark 2.4 onwards. If you need a solution for lower versions, you can use a udf() or the RDD API. For the RDD approach, check this out:
import org.apache.spark.sql.Row

val df = Seq(3, 1, 4).toDF("count")
val rdd1 = df.rdd.flatMap { x =>
  val y = x.getAs[Int]("count")
  for (p <- 0 until y) yield Row(y)
}
spark.createDataFrame(rdd1, df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With the DataFrame alone (no RDD conversion):
scala> df.flatMap( r=> { (0 until r.getInt(0)).map( i => r.getInt(0)) } ).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
For udf(), the below would work:
val df = Seq(3, 1, 4).toDF("count")
def array_repeat(x: Int): Array[Int] = {
  val y = for (p <- 0 until x) yield x
  y.toArray
}
val udf_array_repeat = udf(array_repeat(_: Int): Array[Int])
df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT: Check user10465355's answer below for more information about array_repeat.
You can use the array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+
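Applied to the original count column from the question, the same one-liner reproduces the desired output (Spark 2.4+); countsDf is just an illustrative name:
val countsDf = Seq(3, 1, 4).toDF("count")
countsDf.select(explode(array_repeat($"count", $"count")).as("count")).show()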

Convert List of List to Dataframe

I have a column of lists in a spark dataframe.
How do I convert the arrays to a spark dataframe where each element in the list is a column in the dataframe?
I am new to Scala, and I want to use Scala to solve it.
For example:
You can do it by creating an RDD of Rows, creating a schema, and using it to convert the RDD to a DataFrame.
// Imports needed for Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A seq of seqs
val s = Seq(1 to 6, 1 to 6, 1 to 6)
// Let's create an RDD of Rows
val rdd = sc.parallelize(s).map(Row.fromSeq)
// Let's define a schema based on the first seq of s
val schema = StructType(
  (1 to s(0).size).map(i => StructField("c" + i, IntegerType, true))
)
// And let's finally create the dataframe
val df = spark.createDataFrame(rdd, schema)
df.show
// +---+---+---+---+---+---+
// | c1| c2| c3| c4| c5| c6|
// +---+---+---+---+---+---+
// | 1| 2| 3| 4| 5| 6|
// | 1| 2| 3| 4| 5| 6|
// | 1| 2| 3| 4| 5| 6|
// +---+---+---+---+---+---+
If you have a DataFrame, as mentioned in the question, with an array column such as
root
 |-- features: array (nullable = true)
 |    |-- element: integer (containsNull = false)
then you can use the following logic:
val finalCols = Array("c1", "c2", "c3", "c4", "c5", "c6", "c7")
import org.apache.spark.sql.functions._
finalCols.zipWithIndex
  .foldLeft(df) { (tempdf, c) => tempdf.withColumn(c._1, col("features")(c._2)) }
  .select(finalCols.map(col): _*)
  .show(false)
which should give you
+---+---+---+---+---+---+---+
|c1 |c2 |c3 |c4 |c5 |c6 |c7 |
+---+---+---+---+---+---+---+
|0 |45 |63 |0 |0 |0 |0 |
|0 |0 |0 |85 |0 |69 |0 |
|0 |89 |56 |0 |0 |0 |0 |
+---+---+---+---+---+---+---+
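For what it's worth, a more compact sketch of the same positional-indexing idea (still assuming seven elements per array):
df.select((0 until 7).map(i => col("features")(i).as(s"c${i + 1}")): _*).show(false)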
Or you can use a udf function as
import org.apache.spark.sql.functions._
// Case class implied by the answer: one field per output column
case class testCaseClass(c1: Int, c2: Int, c3: Int, c4: Int, c5: Int, c6: Int, c7: Int)
def splitArrayUdf = udf((features: Seq[Int]) => testCaseClass(features(0), features(1), features(2), features(3), features(4), features(5), features(6)))
df.select(splitArrayUdf(col("features")).as("features")).select(col("features.*")).show(false)
which should give you the same result.
I hope the answer is helpful.

Transform a feature of a Spark groupedBy DataFrame

I'm searching for a Scala analogue of Python's .transform().
Namely, I need to create a new feature: the group mean of the corresponding class.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the Python implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
import org.apache.spark.sql.functions.mean

df.groupBy("class").agg(mean("val").as("class_mean"))
If you want to keep all the columns, then you can use a window function:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
  .show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+
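Another common pattern, if you'd rather avoid a window function, is to aggregate per class and join the result back on the grouping key; a minimal sketch of that approach:
df.join(df.groupBy("class").agg(mean("val").as("class_mean")), "class")
  .show(false)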