My dataframe, df, has columns comprising 2-dimensional (x, y) data. Combining these columns with VectorAssembler into a 'features' column flattens all the pairs. Is there a way to keep these columns in their original form, i.e. as [[x1,y1],[x2,y2],[x3,y3]], instead of what I am getting: [x1,y1,x2,y2,x3,y3]?
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, udf}
val df = Seq((Seq(1.0,2.0), Seq(3.0,4.0), Seq(5.0,6.0)),
(Seq(7.0,8.0), Seq(9.0,10.0), Seq(11.0,12.0))).toDF("f1", "f2", "f3")
//Ref https://stackoverflow.com/a/41091839/4106464
val seqAsVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val df_final = df.select(seqAsVector(col("f1")).as("f1"), seqAsVector(col("f2")).as("f2"), seqAsVector(col("f3")).as("f3"))
val assembler = new VectorAssembler().setInputCols(Array("f1","f2","f3")).setOutputCol("features")
val df_out = assembler.transform(df_final)
df.show
df_out.show(false)
// df
//+----------+-----------+------------+
//| f1| f2| f3|
//+----------+-----------+------------+
//|[1.0, 2.0]| [3.0, 4.0]| [5.0, 6.0]|
//|[7.0, 8.0]|[9.0, 10.0]|[11.0, 12.0]|
//+----------+-----------+------------+
// df_out with VectorAssembler
//+---------+----------+-----------+----------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[1.0,2.0,3.0,4.0,5.0,6.0] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[7.0,8.0,9.0,10.0,11.0,12.0]|
//+---------+----------+-----------+----------------------------+
//Desired features column:
//+---------+----------+-----------+----------------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[[1.0,2.0],[3.0,4.0],[5.0,6.0]] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[[7.0,8.0],[9.0,10.0],[11.0,12.0]]|
//+---------+----------+-----------+----------------------------------+
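For reference, a minimal sketch (not from the original post) of one way to keep the pairs grouped: VectorAssembler always emits a flat ML Vector, so the nested form has to be a plain array-of-arrays column built from the original Seq columns; note that such a column cannot be fed directly to ML estimators that expect a Vector.
import org.apache.spark.sql.functions.array
// build an array<array<double>> column instead of assembling a flat Vector
val df_nested = df.withColumn("features", array(col("f1"), col("f2"), col("f3")))
df_nested.printSchema // features: array<array<double>>
df_nested.show(false)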
I want to convert Spark dense vectors into index/value pairs in Scala. I'd appreciate some help, please.
I have a dataframe after MinMaxScaler:
+---+--------+--------------------+---------------------+
| id|category|minMaxScalerFeatures|scaledFeatures_output|
+---+--------+--------------------+---------------------+
| 0| 66| [0.0,66.0]| [0.0,0.0]|
| 1| 98| [1.0,98.0]| [0.5,1.0]|
| 2| 90| [2.0,90.0]| [1.0,0.75]|
+---+--------+--------------------+---------------------+
I want to get each scaled value together with its index, in the String pattern "index:value":
+---+--------+--------------------+-----------------------+
| id|category|minMaxScalerFeatures|scaledFeatures_output |
+---+--------+--------------------+-----------------------+
| 0| 66| [0.0,66.0]| 0:0.0,1:0.0|
| 1| 98| [1.0,98.0]| 0:0.5,1:1.0|
| 2| 90| [2.0,90.0]| 0:1.0,1:0.75|
+---+--------+--------------------+-----------------------+
Code to generate the data:
import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val df_1 = Seq((0, 66),(1, 98),(2, 90)).toDF("id", "category")
val minMax_columns = Array("id", "category")
val assembler = new VectorAssembler()
.setInputCols(minMax_columns)
.setOutputCol("minMaxScalerFeatures")
val scaler = new MinMaxScaler()
.setInputCol("minMaxScalerFeatures")
.setOutputCol("scaledFeatures_output")
val dataset = assembler.transform(df_1)
val scalerModel = scaler.fit(dataset)
val scaledData = scalerModel.transform(dataset)
Thank you very much! :)
The question is really how you would do this for a plain array. I would personally do the following (which may not be optimal, but it works):
var s = ""
for(i <- 0 until array.length) s = s + s"$i:${a(i)}"
s = s.dropRight(1)
Now you can wrap that in a user-defined function and it's done:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
val myudf = udf((vec: DenseVector) => {
  val a = vec.toArray
  var s = ""
  // build "index:value" pairs separated by commas, then drop the trailing comma
  for (i <- 0 until a.length) s = s + s"$i:${a(i)},"
  s.dropRight(1)
})
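Applied to the scaled DataFrame from the question, a minimal usage sketch could look like this (the zipWithIndex version below is just a more idiomatic equivalent of the loop above):
import org.apache.spark.sql.functions.col
// replace the vector column with its "index:value" string representation
val result = scaledData.withColumn("scaledFeatures_output", myudf(col("scaledFeatures_output")))
result.show(false)
// a more idiomatic equivalent of the same udf
val myudf2 = udf((vec: DenseVector) =>
  vec.toArray.zipWithIndex.map { case (v, i) => s"$i:$v" }.mkString(","))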
convert myMap = Map([Col_1->1],[Col_2->2],[Col_3->3])
to Spark scala Data-frame key as column and value as column value,i am not
getting expected result, please check my code and provide solution.
import scala.collection.mutable.ListBuffer
var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap: Map[String, String] = Map.empty[String, String]
for ((k, v) <- myMap) {
  println(k + "->" + v)
  finalBufferList += v
  //finalDfColumnList += "\"" + k + "\""
  finalDfColumnList += k
}
val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result :
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
If you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you can create an RDD[Row] from the values:
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then create a schema from the keys:
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
and finally use the createDataFrame function to create the dataframe:
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
which should give you
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful. But remember, all of this is unnecessary if you are only working with a small dataset.
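If you are using a SparkSession (Spark 2.x+) rather than an SQLContext, the same approach would look roughly like this (a sketch assuming a session named spark):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// one row built from the map values, schema built from the map keys
val rdd = spark.sparkContext.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
val df = spark.createDataFrame(rdd, schema)
df.show(false)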
I'd like to make a summary of a dataframe. I got the outputs below, and I would like to combine the three dataframes into a single dataframe laid out exactly like the first one.
Here is what I did.
// Compute column summary statistics.
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val colNames=dataframe.columns
val data = dataframe.describe()
data.show()
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|summary| Col0| Col1| Col2| Col3| Col4|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
| count| 9999| 9999| 9999| 9999| 9999|
| mean| 0.4976937166129511| 0.5032998128645433| 0.5002933978916888| 0.5008783202471074|0.49977372871783293|
| stddev| 0.2893201326892155|0.28767789122296994|0.29041197844235034|0.28989958496291496| 0.2881033430504947|
| min|4.92436811557243E-6|3.20277176946531E-5|1.41602940923349E-5|6.53252937203857E-5| 5.4864212896146E-5|
| max| 0.999442967120299| 0.9999608020298| 0.999968873336897| 0.999836584087385| 0.999822016805327|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
println("Skewness")
val Skewness = dataframe.columns.map(c => skewness(c).as(c))
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*)
Skewness_.show()
Skewness
+--------------------+--------------------+--------------------+--------------------+--------------------+
| Col0| Col1| Col2| Col3| Col4|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0.015599787007160271|-0.00740111491496...|0.006096695102089171|0.003614431405637598|0.007869663345343194|
+--------------------+--------------------+--------------------+--------------------+--------------------+
println("Kurtosis")
val Kurtosis = dataframe.columns.map(c => kurtosis(c).as(c))
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*)
Kurtosis_.show() // kurtosis
Kurtosis
+-------------------+-------------------+-------------------+-------------------+------------------+
| Col0| Col1| Col2| Col3| Col4|
+-------------------+-------------------+-------------------+-------------------+------------------+
|-1.2187774053075133|-1.1861812968784207|-1.2107252263053805|-1.2108988817869097|-1.199054929668751|
+-------------------+-------------------+-------------------+-------------------+------------------+
I would like to append the skewness and kurtosis dataframes to the first one, with their names in the first column.
Thanks in advance
You need to add a summary column to both the skewness and kurtosis tables using withColumn:
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).withColumn("summary", lit("Skewness"))
Do the same for kurtosis
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).withColumn("summary", lit("Kurtosis"))
Use select on the skewness and kurtosis dataframes to put the columns in the same order as the summary:
val orderColumn = Vector("summary", "col0", "col1", "col2", "col3", "col4")
val Skewness_ordered = Skewness_.select(orderColumn.map(col):_*)
val Kurtosis_ordered = Kurtosis_.select(orderColumn.map(col):_*)
and union them with the describe() summary.
val combined = dataframe.describe().union(Skewness_ordered).union(Kurtosis_ordered)
More elegantly, you can combine the skewness and kurtosis aggregates with the describe() summary into a new dataframe in one expression:
import org.apache.spark.sql.functions._
val skewnessDf = dataframe.agg(Skewness.head, Skewness.tail: _*)
val kurtosisDf = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*)
val result = dataframe.describe()
  .union(skewnessDf.select(lit("Skewness"), skewnessDf.col("*")))
  .union(kurtosisDf.select(lit("Kurtosis"), kurtosisDf.col("*")))
result.show()
I have a dataframe in the format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and am trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
  if (b.sameElements(a.intersect(b))) { return 1 }
  else { return 2 }
}
def intersect_check_udf =
  udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))
data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws an error:
org.apache.spark.SparkException: Failed to execute user defined function.
P.S.: The above function (intersect_check) works for Arrays.
We can define a udf that calculates the length of the intersection between the two Array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be of class WrappedArray[String], not Array[String]:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.col
val same_elements = udf { (a: WrappedArray[String], b: WrappedArray[String]) =>
  if (a.intersect(b).length == b.length) 1 else 0
}
df.withColumn("test",same_elements(col("genreList1"),col("genreList2")))
.show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is a solution using subsetOf:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(
Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
)).toDF("movieId1", "genreList1", "genreList2")
val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})
data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution may be to exploit Spark's built-in array functions (array_intersect requires Spark 2.4+): genreList2 is a subset of genreList1 if the intersection between the two is equal to genreList2. In the code below, a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but the same elements.
val spark = {
SparkSession
.builder()
.master("local")
.appName("test")
.getOrCreate()
}
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = Seq(
(1, Array("Adventure","Comedy"), Array("Adventure")),
(2, Array("Animation","Drama","War"), Array("War","Drama")),
(3, Array("Adventure","Drama"), Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
df
.withColumn("flag",
sort_array(array_intersect($"genreList1",$"genreList2"))
.equalTo(
sort_array($"genreList2")
)
.cast("integer")
)
.show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This also works and does not use a udf (array_except requires Spark 2.4+):
import spark.implicits._
val data = Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
data
.withColumn("size",size(array_except($"genreList2",$"genreList1")))
.withColumn("flag",when($"size" === lit(0), 1) otherwise(0))
.show(false)
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+