How to encode string values into numeric values in a Spark DataFrame (Scala)

I have a DataFrame with two columns:
df =
Col1 Col2
aaa bbb
ccc aaa
I want to encode the String values into numeric values. I managed to do it this way:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer1 = new StringIndexer()
  .setInputCol("Col1")
  .setOutputCol("Col1Index")
  .fit(df)

val indexer2 = new StringIndexer()
  .setInputCol("Col2")
  .setOutputCol("Col2Index")
  .fit(df)

val indexed1 = indexer1.transform(df)
val indexed2 = indexer2.transform(df)

val encoder1 = new OneHotEncoder()
  .setInputCol("Col1Index")
  .setOutputCol("Col1Vec")

val encoder2 = new OneHotEncoder()
  .setInputCol("Col2Index")
  .setOutputCol("Col2Vec")

val encoded1 = encoder1.transform(indexed1)
encoded1.show()

val encoded2 = encoder2.transform(indexed2)
encoded2.show()
The problem is that aaa is encoded differently in the two columns.
How can I encode my DataFrame so that the same string gets the same code in both columns, e.g.:
df_encoded =
Col1 Col2
1 2
3 1

Train a single StringIndexer on both columns:
val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("col1", "col2")

val indexer = new StringIndexer().setInputCol("col").fit(
  df.select("col1").toDF("col").union(df.select("col2").toDF("col"))
)
and apply a copy of it to each column:
import org.apache.spark.ml.param.ParamMap

val result = Seq("col1", "col2").foldLeft(df) {
  (df, col) => indexer
    .copy(new ParamMap()
      .put(indexer.inputCol, col)
      .put(indexer.outputCol, s"${col}_idx"))
    .transform(df)
}
result.show
// +----+----+--------+--------+
// |col1|col2|col1_idx|col2_idx|
// +----+----+--------+--------+
// | aaa| bbb| 0.0| 1.0|
// | ccc| aaa| 2.0| 0.0|
// +----+----+--------+--------+
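If you also need the one-hot vectors from the question, the shared index columns can be fed into a single OneHotEncoder. A minimal sketch, assuming Spark 3.x, where OneHotEncoder is an Estimator that supports multiple columns (on Spark 2.3/2.4 the equivalent class is OneHotEncoderEstimator):

import org.apache.spark.ml.feature.OneHotEncoder

// Sketch only: one-hot encode both shared-index columns at once (Spark 3.x API).
// Because both columns share one index space, identical strings map to identical vectors.
val encoder = new OneHotEncoder()
  .setInputCols(Array("col1_idx", "col2_idx"))
  .setOutputCols(Array("col1_vec", "col2_vec"))

val encoded = encoder.fit(result).transform(result)
encoded.show()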

You can write your own Transformer; the example below is my PySpark code.
Train a StringIndexer model to use as clf:
sindex_pro = StringIndexer(inputCol='StringCol', outputCol='StringCol_c',
                           stringOrderType='frequencyDesc', handleInvalid='keep').fit(province_df)
Define the custom Transformer that wraps clf:
from pyspark.sql.functions import col
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame

class SelfSI(Transformer):
    def __init__(self, clf, col_name):
        super(SelfSI, self).__init__()
        self.clf = clf            # the fitted StringIndexerModel
        self.col_name = col_name  # the column this instance should index

    def rename_col(self, df, invers=False):
        # Rename the target column to the name the model was trained on;
        # with invers=True, rename it and the output column back.
        or_name = 'StringCol'
        col_name = self.col_name
        if invers:
            df = df.withColumnRenamed(or_name, col_name)
            or_name = col_name + '_c'
            col_name = 'StringCol_c'
        df = df.withColumnRenamed(col_name, or_name)
        return df

    def _transform(self, df: DataFrame) -> DataFrame:
        df = self.rename_col(df)
        df = self.clf.transform(df)
        df = self.rename_col(df, invers=True)
        return df
Instantiate one transformer per column you need to index:
pro_si = SelfSI(sindex_pro, 'province_name')
pro_si.transform(df_or)

# or as a Pipeline (pro_si2 would be a second SelfSI for the city column)
model = Pipeline(stages=[pro_si, pro_si2]).fit(df_or)
model.transform(df_or)

# the result looks like:
+-------------+---------+---------------+-----------+
|province_name|city_name|province_name_c|city_name_c|
+-------------+---------+---------------+-----------+
|         河北|     保定|           23.0|       18.0|
|         河北|     张家|           23.0|      213.0|
|         河北|     承德|           23.0|      126.0|
|         河北|     沧州|           23.0|        6.0|
|         河北|     廊坊|           23.0|       26.0|
|         北京|     北京|           13.0|      107.0|
|         天津|     天津|           10.0|       85.0|
|         河北|     石家|           23.0|      185.0|
+-------------+---------+---------------+-----------+
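The same rename-transform-rename trick can also be sketched in Scala, assuming the fitted model's input and output columns are "StringCol" and "StringCol_c" as above (applySharedIndexer is a hypothetical helper, not a Spark API):

import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.sql.DataFrame

// Sketch: reuse one fitted StringIndexerModel for an arbitrary column by
// renaming that column to the name the model expects, then renaming back.
def applySharedIndexer(df: DataFrame, colName: String, model: StringIndexerModel): DataFrame =
  model
    .transform(df.withColumnRenamed(colName, "StringCol"))
    .withColumnRenamed("StringCol", colName)
    .withColumnRenamed("StringCol_c", s"${colName}_c")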

Related

Constructing Spark ML features column with nested arrays

My dataframe, df, has columns comprising 2-dimensional (x, y) data. Combining these columns with VectorAssembler into a 'features' column flattens all of the pairs. Is there a way to keep these columns in their original form, i.e. as [[x1,y1],[x2,y2],[x3,y3]] instead of what I am getting: [x1,y1,x2,y2,x3,y3]?
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, udf}

val df = Seq((Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0)),
             (Seq(7.0, 8.0), Seq(9.0, 10.0), Seq(11.0, 12.0))).toDF("f1", "f2", "f3")

// Ref https://stackoverflow.com/a/41091839/4106464
val seqAsVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

val df_final = df.select(seqAsVector(col("f1")).as("f1"), seqAsVector(col("f2")).as("f2"), seqAsVector(col("f3")).as("f3"))
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2", "f3")).setOutputCol("features")
val df_out = assembler.transform(df_final)

df.show
df_out.show(false)
// df
//+----------+-----------+------------+
//| f1| f2| f3|
//+----------+-----------+------------+
//|[1.0, 2.0]| [3.0, 4.0]| [5.0, 6.0]|
//|[7.0, 8.0]|[9.0, 10.0]|[11.0, 12.0]|
//+----------+-----------+------------+
// df_out with VectorAssembler
//+---------+----------+-----------+----------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[1.0,2.0,3.0,4.0,5.0,6.0] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[7.0,8.0,9.0,10.0,11.0,12.0]|
//+---------+----------+-----------+----------------------------+
//Desired features column:
//+---------+----------+-----------+----------------------------------+
//|f1 |f2 |f3 |features |
//+---------+----------+-----------+----------------------------------+
//|[1.0,2.0]|[3.0,4.0] |[5.0,6.0] |[[1.0,2.0],[3.0,4.0],[5.0,6.0]] |
//|[7.0,8.0]|[9.0,10.0]|[11.0,12.0]|[[7.0,8.0],[9.0,10.0],[11.0,12.0]]|
//+---------+----------+-----------+----------------------------------+
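Spark ML Vector columns are flat by design, so VectorAssembler will always flatten the pairs. If a plain array-of-arrays column (rather than an ML Vector) is acceptable downstream, a minimal sketch is to build it with the built-in array function on the original df, before converting anything to vectors:

import org.apache.spark.sql.functions.{array, col}

// Sketch: keep the (x, y) pairs nested as array<array<double>> instead of an ML Vector.
val df_nested = df.withColumn("features", array(col("f1"), col("f2"), col("f3")))
df_nested.show(false)
// features for the first row would be [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]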

How to convert a Spark dense vector to separate columns with their index in Scala?

I want to pair each value of a Spark dense vector with its index, in Scala. Any help would be appreciated.
I have a dataframe after MinMaxScaler:
+---+--------+--------------------+---------------------+
| id|category|minMaxScalerFeatures|scaledFeatures_output|
+---+--------+--------------------+---------------------+
| 0| 66| [0.0,66.0]| [0.0,0.0]|
| 1| 98| [1.0,98.0]| [0.5,1.0]|
| 2| 90| [2.0,90.0]| [1.0,0.75]|
+---+--------+--------------------+---------------------+
I want to get the scaled values together with their indices, as a String in the pattern "index:value":
+---+--------+--------------------+-----------------------+
| id|category|minMaxScalerFeatures|scaledFeatures_output |
+---+--------+--------------------+-----------------------+
| 0| 66| [0.0,66.0]| 0:0.0,1:0.0|
| 1| 98| [1.0,98.0]| 0:0.5,1:1.0|
| 2| 90| [2.0,90.0]| 0:1.0,1:0.75|
+---+--------+--------------------+-----------------------+
Code to generate the data:
import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame

val df_1 = Seq((0, 66), (1, 98), (2, 90)).toDF("id", "category")
val minMax_columns = Array("id", "category")

val assembler = new VectorAssembler()
  .setInputCols(minMax_columns)
  .setOutputCol("minMaxScalerFeatures")

val scaler = new MinMaxScaler()
  .setInputCol("minMaxScalerFeatures")
  .setOutputCol("scaledFeatures_output")

val dataset = assembler.transform(df_1)
val scalerModel = scaler.fit(dataset)
val scaledData = scalerModel.transform(dataset)
Thank U very much~ :)
The question is really how you would do this for a plain array. I would personally do it like this (which may not be optimal, but it works):
var s = ""
for (i <- 0 until a.length) s = s + s"$i:${a(i)},"
s = s.dropRight(1)
Now you can wrap that in a user-defined function and it's done:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf

val myudf = udf((arr: DenseVector) => {
  val a = arr.toArray
  var s = ""
  for (i <- 0 until a.length) s = s + s"$i:${a(i)},"
  s.dropRight(1)
})
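Applied to the scaledData frame from the question, the udf can simply replace the vector column; a usage sketch:

import org.apache.spark.sql.functions.col

// Overwrite the vector column with its "index:value" string form.
val withStrings = scaledData.withColumn("scaledFeatures_output", myudf(col("scaledFeatures_output")))
withStrings.show(false)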

Convert Map(key-value) into spark scala Data-frame

I want to convert myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
into a Spark Scala DataFrame with each key as a column name and each value as the column value. I am not
getting the expected result; please check my code and suggest a solution.
import scala.collection.mutable.ListBuffer

var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap: Map[String, String] = Map.empty[String, String]

for ((k, v) <- myMap) {
  println(k + "->" + v)
  finalBufferList += v
  //finalDfColumnList += "\"" + k + "\""
  finalDfColumnList += k
}

val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result :
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.

If you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you can create an RDD[Row] from the values:
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then create a schema from the keys:
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
and finally use the createDataFrame function to create the DataFrame:
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
which should give you
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful.
Bear in mind, though, that distributing the data this way buys you little when the dataset is this small.
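For Spark 2.x and later the same approach goes through a SparkSession instead of sqlContext; a minimal sketch, assuming a session named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same idea via SparkSession: one Row from the values, schema from the keys.
val myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
val row = Row.fromSeq(myMap.values.toSeq)
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)
df.show(false)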

add dataframe to another one

I'd like to build a summary of a dataframe. I get three outputs (describe, skewness, kurtosis), and I would like to combine them into a single dataframe laid out exactly like the first one.
Here is what I did.
// Compute column summary statistics.
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val colNames=dataframe.columns
val data = dataframe.describe()
data.show()
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|summary| Col0| Col1| Col2| Col3| Col4|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
| count| 9999| 9999| 9999| 9999| 9999|
| mean| 0.4976937166129511| 0.5032998128645433| 0.5002933978916888| 0.5008783202471074|0.49977372871783293|
| stddev| 0.2893201326892155|0.28767789122296994|0.29041197844235034|0.28989958496291496| 0.2881033430504947|
| min|4.92436811557243E-6|3.20277176946531E-5|1.41602940923349E-5|6.53252937203857E-5| 5.4864212896146E-5|
| max| 0.999442967120299| 0.9999608020298| 0.999968873336897| 0.999836584087385| 0.999822016805327|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
println("Skewness")
val Skewness = dataframe.columns.map(c => skewness(c).as(c))
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).show()
Skewness
+--------------------+--------------------+--------------------+--------------------+--------------------+
| Col0| Col1| Col2| Col3| Col4|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0.015599787007160271|-0.00740111491496...|0.006096695102089171|0.003614431405637598|0.007869663345343194|
+--------------------+--------------------+--------------------+--------------------+--------------------+
println("Kurtosis")
val Kurtosis = dataframe.columns.map(c => kurtosis(c).as(c))
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).show//kurtosis
Kurtosis
+-------------------+-------------------+-------------------+-------------------+------------------+
| Col0| Col1| Col2| Col3| Col4|
+-------------------+-------------------+-------------------+-------------------+------------------+
|-1.2187774053075133|-1.1861812968784207|-1.2107252263053805|-1.2108988817869097|-1.199054929668751|
+-------------------+-------------------+-------------------+-------------------+------------------+
I would like to append the skewness and kurtosis dataframes to the first one, with their names in the first (summary) column.
Thanks in advance.
You need to add a summary column to both the skewness and kurtosis tables using withColumn:
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).withColumn("summary", lit("Skewness"))
Do the same for kurtosis
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).withColumn("summary", lit("Kurtosis"))
Use select on all dataframes so the columns are in the same order:
val orderColumn = Vector("summary", "col0", "col1", "col2", "col3", "col4")
val Skewness_ordered = Skewness_.select(orderColumn.map(col):_*)
val Kurtosis_ordered = Kurtosis_.select(orderColumn.map(col):_*)
and union them with the summary dataframe:
val combined = data.union(Skewness_ordered).union(Kurtosis_ordered)
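One possible caveat: describe() returns all of its statistic columns as strings, while the skewness/kurtosis aggregates are doubles, so the union may reject the mismatched column types. Casting before the union is one way around it, for example:

import org.apache.spark.sql.functions.col

// Hypothetical fix-up: cast the numeric aggregates to string so their types
// match describe()'s string-typed statistic columns before the union.
val skewStr = Skewness_ordered.select(orderColumn.map(c => col(c).cast("string")): _*)
val kurtStr = Kurtosis_ordered.select(orderColumn.map(c => col(c).cast("string")): _*)
val combinedTyped = data.union(skewStr).union(kurtStr)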
More elegantly, you can combine the Skewness_ and Kurtosis_ dataframes with the summary into a new dataframe:
import org.apache.spark.sql.functions._

val result = data.union(Skewness_.select(lit("Skewness"), Skewness_.col("*")))
  .union(Kurtosis_.select(lit("Kurtosis"), Kurtosis_.col("*")))
result.show()

Comparing two array columns in Scala Spark

I have a dataframe in the format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and I am trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
  if (b.sameElements(a.intersect(b))) { return 1 }
  else { return 2 }
}

def intersect_check_udf =
  udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))

data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws an error:
org.apache.spark.SparkException: Failed to execute user defined function.
P.S. The above function (intersect_check) works for plain Scala Arrays.
We can define a udf that calculates the length of the intersection between the two array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be of type WrappedArray[String], not Array[String]:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val same_elements = udf { (a: WrappedArray[String], b: WrappedArray[String]) =>
  if (a.intersect(b).length == b.length) { 1 } else { 0 }
}

df.withColumn("test", same_elements(col("genreList1"), col("genreList2")))
  .show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
Here is a solution using subsetOf:
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
)).toDF("movieId1", "genreList1", "genreList2")

val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
  if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})

data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
One solution is to exploit Spark's built-in array functions (array_intersect requires Spark 2.4+): genreList2 is a subset of genreList1 if the intersection between the two equals genreList2. In the code below, a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but the same elements.
val spark = SparkSession
  .builder()
  .master("local")
  .appName("test")
  .getOrCreate()

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val df = Seq(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")

df
  .withColumn("flag",
    sort_array(array_intersect($"genreList1", $"genreList2"))
      .equalTo(sort_array($"genreList2"))
      .cast("integer")
  )
  .show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This also works, and it does not use a udf (array_except, Spark 2.4+):
import spark.implicits._

val data = Seq(
  (1, Array("Adventure", "Comedy"), Array("Adventure")),
  (2, Array("Animation", "Drama", "War"), Array("War", "Drama")),
  (3, Array("Adventure", "Drama"), Array("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")

data
  .withColumn("size", size(array_except($"genreList2", $"genreList1")))
  .withColumn("flag", when($"size" === lit(0), 1) otherwise (0))
  .show(false)
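The intermediate size column is not strictly needed; a slightly more compact sketch of the same idea (relying on the same functions._ import as above):

// Empty difference means genreList2 is a subset of genreList1.
data
  .withColumn("flag", (size(array_except($"genreList2", $"genreList1")) === 0).cast("int"))
  .show(false)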
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
  (1, Seq("Adventure", "Comedy"), Seq("Adventure")),
  (2, Seq("Animation", "Drama", "War"), Seq("War", "Drama")),
  (3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")

val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+