Retrieve Spark MLlib StringIndexer column mapping - Scala

How do I get the mapping out of a trained Spark MLlib StringIndexerModel?
import org.apache.spark.ml.feature.StringIndexer

val stringIndexer = new StringIndexer()
  .setInputCol("myCol")
  .setOutputCol("myColIdx")
val stringIndexerModel = stringIndexer.fit(data)
val res = stringIndexerModel.transform(data)
The code above adds a myColIdx column to my DataFrame, mapping the values in myCol to an index based on each value's frequency, i.e. most frequent value -> 0, second most frequent -> 1, etc.
How do I retrieve that mapping from the model? If I serialize/deserialize the model, will the mapping be stable (i.e. am I guaranteed to get the same result after the transform)?

StringIndexerModel exposes the mapping through its labels attribute:
stringIndexerModel.labels: Array[String]
where the positions in the array correspond to the consecutive indices. For example, for:
val data = Seq("foo", "bar", "foo", "bar", "foobar", "bar").toDF("myCol")
you'll get the following labels:
Array(bar, foo, foobar)
with bar indexed as 0.0, foo as 1.0 and foobar as 2.0. This is a property of the model and is preserved when the model is saved.
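If you need the mapping as an explicit lookup table, here is a minimal sketch based on the model above (each label's index is simply its position in labels):

val mapping: Map[String, Double] =
  stringIndexerModel.labels.zipWithIndex
    .map { case (label, idx) => label -> idx.toDouble }
    .toMap
// mapping: Map(bar -> 0.0, foo -> 1.0, foobar -> 2.0)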
When used in a Pipeline you can also use IndexToString, which uses the column metadata to map the indices back to labels.
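The call below uses an indexToString transformer that isn't constructed explicitly in the answer; a minimal sketch, with the input/output column names taken from the output shown underneath:

import org.apache.spark.ml.feature.IndexToString

val indexToString = new IndexToString()
  .setInputCol("myColIdx")
  .setOutputCol("myColReversed")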
indexToString.transform(stringIndexerModel.transform(data)).show
+------+--------+-------------+
| myCol|myColIdx|myColReversed|
+------+--------+-------------+
|   foo|     1.0|          foo|
|   bar|     0.0|          bar|
|   foo|     1.0|          foo|
|   bar|     0.0|          bar|
|foobar|     2.0|       foobar|
|   bar|     0.0|          bar|
+------+--------+-------------+
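As for the serialization part of the question, a minimal sketch of a save/load round trip (the path is only an example): the labels array is part of the saved model, so the mapping is stable after reloading.

stringIndexerModel.write.overwrite().save("/tmp/my-string-indexer-model")

val reloaded = org.apache.spark.ml.feature.StringIndexerModel.load("/tmp/my-string-indexer-model")
reloaded.labels // Array(bar, foo, foobar), same order and values as before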

Related

PySpark UDF: a fit_transform example

I am really new to PySpark and am trying to translate some Python code into PySpark.
I start with a pandas DataFrame, convert it to a document-term matrix, and then apply PCA.
The UDF:
import itertools

class MultiLabelCounter():
    def __init__(self, classes=None):
        self.classes_ = classes

    def fit(self, y):
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        self.mapping = dict(zip(self.classes_, range(len(self.classes_))))
        return self

    def transform(self, y):
        yt = []
        for labels in y:
            data = [0] * len(self.classes_)
            for label in labels:
                data[self.mapping[label]] += 1
            yt.append(data)
        return yt

    def fit_transform(self, y):
        return self.fit(y).transform(y)
mlb = MultiLabelCounter()
df_grouped = df_grouped.withColumnRenamed("collect_list(full)", "full")
udf_mlb = udf(lambda x: mlb.fit_transform(x), IntegerType())
mlb_fitted = df_grouped.withColumn('full', udf_mlb(col("full")))
I am of course getting NULL results.
I am using spark 2.4.4 version.
EDIT
Adding sample input and output as per request
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based upon the input data shared, I tried replicating your output and it works. Please see below:
Input Data
df = spark.createDataFrame(data=[(1, ['hello', 'world']), (2, ['goodbye', 'world']), (3, ['hello', 'hello'])], schema=['id', 'vals'])
df.show()
+---+----------------+
| id| vals|
+---+----------------+
| 1| [hello, world]|
| 2|[goodbye, world]|
| 3| [hello, hello]|
+---+----------------+
Now, use explode to create a separate row for each item in the vals list. Then pivot and count compute the frequency per id. Finally, fillna(0) replaces the null values with 0. See below:
from pyspark.sql.functions import *
df1 = df.select(['id', explode(col('vals'))]).groupBy("id").pivot("col").agg(count(col("col")))
df1.fillna(0).orderBy("id").show()
Output
+---+-------+-----+-----+
| id|goodbye|hello|world|
+---+-------+-----+-----+
| 1| 0| 1| 1|
| 2| 1| 0| 1|
| 3| 0| 2| 0|
+---+-------+-----+-----+

Extracting feature columns results in (numberOfFeatures, Array[nonZeroFeatIndexes], Array[nonZeroFeatures]) instead of an array of those columns

I'm using Spark MLlib with Scala to load a CSV file and transform the features into a feature vector, which I will use to train some models; for that, I'm using the following code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Loading the data
val rawData = spark.read.option("header", "true").csv(data)     // id, feat0, feat1, feat2, ...
val rawLabels = spark.read.option("header", "true").csv(labels) // id, label
val rawDataSet = rawData.join(rawLabels, "id")
// Set the feature columns
val featureCols = rawDataSet.columns.drop(1) // drop the id column
// The csv columns are typed as String, so cast them to Double
val exprs = featureCols.map(c => col(c).cast("Double"))
// Assembler taking a sample of just 6 columns; it should use "featureCols" as the value for "setInputCols" in the real case
val assembler = new VectorAssembler()
  .setInputCols(Array("feat0", "feat1", "feat2", "feat3", "feat4", "feat5"))
  .setOutputCol("features")
// Select all the column values to create the "features" column with them
val result = assembler.transform(rawDataSet.select(exprs: _*)).select($"features")
result.show(5, false)
This works, but I'm not getting the expected results for the features column as shown in the documentation (https://spark.apache.org/docs/2.4.4/ml-features.html#vectorassembler); instead I'm getting this:
feat0|feat1|feat2|feat3|feat4|feat5|features
 39.0|  0.0|  1.0|  0.0|  0.0|  1.0|[39.0,0.0,1.0,1.0,0.0,0.0]
 29.0|  0.0|  1.0|  0.0|  0.0|  0.0|(6,[0,2],[29.0,1.0])
 53.0|  1.0|  0.0|  0.0|  0.0|  0.0|(6,[0,1],[53.0,1.0])
 31.0|  0.0|  1.0|  0.0|  0.0|  1.0|(6,[0,2,5],[31.0,1.0,1.0])
 37.0|  0.0|  1.0|  0.0|  0.0|  0.0|(6,[0,2],[37.0,1.0])
As you can see, for the features column I am getting (number_of_features, [indexes_of_non_zero_features], [values_of_non_zero_features]), except for the first row, which has the expected value and what I would like to have for all the DataFrame rows: an array with all the column values, no matter whether they are zero. Could you give me any hints on what I am doing wrong?
Thank you!!
What you see is just the sparse representation of the same vector: VectorAssembler outputs a sparse or dense vector depending on which one uses less memory. If you want the dense form, convert the sparse vectors to dense as below:
val sparseToDense =
  udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)

result.withColumn("features_dense", sparseToDense(col("features")))
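For context, a minimal standalone sketch showing that the two notations encode the same vector (values taken from the second row above):

import org.apache.spark.ml.linalg.Vectors

// (6,[0,2],[29.0,1.0]) is the sparse form of [29.0,0.0,1.0,0.0,0.0,0.0]
val sparse = Vectors.sparse(6, Array(0, 2), Array(29.0, 1.0))
val dense = sparse.toDense
println(dense)           // [29.0,0.0,1.0,0.0,0.0,0.0]
println(sparse == dense) // true: vector equality compares size and values, not representation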

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like the one shown as INPUT at the end of the second answer below.
I am trying to create another dataframe from it which has 2 columns: the column name and the sum of the values in that column (the OUTPUT table shown there).
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace:
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives for accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
             (0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
  .toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need; we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations contained in aggs. One would expect something simpler, like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*)
probably to ensure that at least one column is used, and this is why you need to split aggs into aggs.head and aggs.tail.
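To illustrate the point, a minimal sketch of the two call shapes (using the df and aggs defined above):

// df.agg(aggs: _*)               // does not compile: there is no agg(Column*) overload
df.agg(aggs.head, aggs.tail: _*)  // matches agg(expr: Column, exprs: Column*)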
What I do is define a method that creates a struct from the desired values:
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives the list of columns to transpose (the last 3 columns in your case) and transforms each of them into a struct with the column name as the key and the column value as the value.
Then use that method to create the structs and process them as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function to turn the desired columns into the described structs, then deconstruct each struct to get a key column and a value column in each row.
Then you can group by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

I need to compare two dataframes for type validation and send a nonzero value as output

I am comparing two dataframes (basically these are the schemas of two different data sources, one from Hive and the other from SAS 9.2).
I need to validate the structure of both data sources, so I converted the schemas into two dataframes; here they are:
The SAS schema is in the format below:
scala> metadata.show
+----+----------------+----+---+-----------+-----------+
|S_No| Variable|Type|Len| Format| Informat|
+----+----------------+----+---+-----------+-----------+
| 1| DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 2| LOAD_DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 3| SOURCE_BANK|Char| 1| null| null|
| 4| EMP_NAME|Char| 50| null| null|
| 5|HEADER_ROW_COUNT| Num| 8| null| null|
| 6| EMP_HOURS| Num| 8| 15.2| 15.1|
+----+----------------+----+---+-----------+-----------+
Similarly, the Hive metadata is in the format below:
df2.show
+----------------+-------------+
| Variable| type|
+----------------+-------------+
| datetime|TimestampType|
| load_datetime|TimestampType|
| source_bank| StringType|
| emp_name| StringType|
|header_row_count| IntegerType|
| emp_hours| DoubleType|
+----------------+-------------+
Now, I need to compare both of these on column type and validate the structure. For example, the equivalent of the "Num" type is "IntegerType".
Finally, I need to store a nonzero value as output if the schema validation is successful.
How can I achieve this?
You can join the two dataframes and then compare the two columns corresponding to the column types via a Map and a UDF.
This is a code sample that does that.
You need to complete the map with the right type equivalences.
import org.apache.spark.sql.{DataFrame, functions}
import org.apache.spark.sql.functions.col

val sqlCtx = sqlContext
import sqlCtx.implicits._

// SAS schema (the metadata dataframe from the question)
val metadata: DataFrame = Seq(
  (Some("1"), "DATETIME", "Num", "8", "DATETIME20", "DATETIME20"),
  (Some("3"), "SOURCEBANK", "Num", "1", "null", "null")
).toDF("SNo", "Variable", "Type", "Len", "Format", "Informat")

val metadataAdapted: DataFrame = metadata
  .withColumn("Name", functions.upper(col("Variable")))
  .withColumnRenamed("Type", "TypeSaS")

// Hive schema (df2 from the question)
val hiveDF = Seq(
  ("datetime", "TimestampType"),
  ("load_datetime", "TimestampType")
).toDF("variable", "type")

val hiveDFAdapted = hiveDF
  .withColumn("Name", functions.upper(col("variable")))
  .withColumnRenamed("type", "TypeHive")

val res = hiveDFAdapted.join(metadataAdapted, Seq("Name"), "inner")

// Complete this map with all the Hive type -> SAS type equivalences
val map = Map("TimestampType" -> "Num")

def udfType(dict: Map[String, String]) =
  functions.udf((typeVar: String) => dict(typeVar))

val result = res.withColumn("correctMapping",
  udfType(map)(col("TypeHive")) === col("TypeSaS"))
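The question also asks for a single nonzero value when the validation succeeds; a minimal sketch of one way to derive it from the result dataframe above (you may also want to verify that the inner join did not drop any columns):

// 1 if every joined row maps correctly, 0 otherwise
val mismatches = result.filter(!col("correctMapping")).count()
val validationFlag = if (mismatches == 0) 1 else 0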

How to aggregate a Spark data frame to get a sparse vector using Scala?

I have a data frame like the one below in Spark, and I want to group it by the id column. Then, for each line in the grouped data, I need to create a sparse vector with elements from the weight column at the indices specified by the index column. The length of the sparse vector is known, say 1000 for this example.
Dataframe df:
+-----+------+-----+
| id|weight|index|
+-----+------+-----+
|11830| 1| 8|
|11113| 1| 3|
| 1081| 1| 3|
| 2654| 1| 3|
|10633| 1| 3|
|11830| 1| 28|
|11351| 1| 12|
| 2737| 1| 26|
|11113| 3| 2|
| 6590| 1| 2|
+-----+------+-----+
I have read this, which is sort of similar to what I want to do, but for an RDD. Does anyone know of a good way to do this for a data frame in Spark using Scala?
My attempt so far is to first collect the weights and indices as lists like this:
import org.apache.spark.sql.functions.collect_list

val dfWithLists = df
  .groupBy("id")
  .agg(collect_list("weight") as "weights", collect_list("index") as "indices")
which looks like:
+-----+---------+----------+
| id| weights| indices|
+-----+---------+----------+
|11830| [1, 1]| [8, 28]|
|11113| [1, 3]| [3, 2]|
| 1081| [1]| [3]|
| 2654| [1]| [3]|
|10633| [1]| [3]|
|11351| [1]| [12]|
| 2737| [1]| [26]|
| 6590| [1]| [2]|
+-----+---------+----------+
Then I define a udf and do something like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
def toSparseVector: ((Array[Int], Array[BigInt]) => Vector) = {(a1, a2) => Vectors.sparse(1000, a1, a2.map(x => x.toDouble))}
val udfToSparseVector = udf(toSparseVector)
val dfWithSparseVector = dfWithLists.withColumn("SparseVector", udfToSparseVector($"indices", $"weights"))
but this doesn't seem to work, and it feels like there should be an easier way to do it without needing to collect the weights and indices into lists first.
I'm pretty new to Spark, Dataframes and Scala, so any help is highly appreciated.
You have to collect them, as vectors must be local to a single machine: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
For creating the sparse vectors you have 2 options: using unordered (index, value) pairs, or specifying the indices and values arrays:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$
If you can get the data into a different format (pivoted), you could also make use of the VectorAssembler:
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
With some small tweaks you can get your approach working:
:paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val df = Seq((11830,1,8), (11113, 1, 3), (1081, 1,3), (2654, 1, 3), (10633, 1, 3), (11830, 1, 28), (11351, 1, 12), (2737, 1, 26), (11113, 3, 2), (6590, 1, 2)).toDF("id", "weight", "index")
val dfWithFeat = df
  .rdd
  .map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))
  .groupByKey()
  .map(r => LabeledPoint(r._1, Vectors.sparse(1000, r._2.toSeq)))
  .toDS
dfWithFeat.printSchema
dfWithFeat.show(10, false)
// Exiting paste mode, now interpreting.
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
+-------+-----------------------+
|label |features |
+-------+-----------------------+
|11113.0|(1000,[2,3],[3.0,1.0]) |
|2737.0 |(1000,[26],[1.0]) |
|10633.0|(1000,[3],[1.0]) |
|1081.0 |(1000,[3],[1.0]) |
|6590.0 |(1000,[2],[1.0]) |
|11830.0|(1000,[8,28],[1.0,1.0])|
|2654.0 |(1000,[3],[1.0]) |
|11351.0|(1000,[12],[1.0]) |
+-------+-----------------------+
dfWithFeat: org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint] = [label: double, features: vector]
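Alternatively, if you would rather stay at the DataFrame level and fix the UDF attempt from the question, here is a minimal sketch (assuming the dfWithLists dataframe built above; note that collect_list hands the UDF Seq values rather than arrays, and that pairing the two lists relies on both collect_lists being evaluated in the same aggregation):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Build a 1000-element sparse vector from the collected (index, weight) pairs
val toSparseVector = udf { (indices: Seq[Int], weights: Seq[Int]) =>
  Vectors.sparse(1000, indices.zip(weights.map(_.toDouble)))
}

val dfWithSparseVector =
  dfWithLists.withColumn("SparseVector", toSparseVector($"indices", $"weights"))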