How to get min and max of clusters - scala

I created a scala program to apply k-means on a specific column of a dataframe. Dataframe name is df_items and column name is price.
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
val df_items = spark.read.format("csv").option("header","true").load(path.csv)
// need to cast because df_items("price") is String
df_items.createGlobalTempView("items")
val price = spark.sql("SELECT cast(price as double) price FROM global_temp.items")
case class Rows(price:Double)
val rows = price.as[Rows]
val assembler = new VectorAssembler().setInputCols(Array("price")).setOutputCol("features")
val data = assembler.transform(rows)
val kmeans = new KMeans().setK(6)
val model = kmeans.fit(data)
val predictions = model.summary.predictions
Predictions result :
+------+--------+----------+
| price|features|prediction|
+------+--------+----------+
| 58.9| [58.9]| 0|
| 239.9| [239.9]| 3|
| 199.0| [199.0]| 5|
| 12.99| [12.99]| 0|
| 199.9| [199.9]| 5|
| 21.9| [21.9]| 0|
| 19.9| [19.9]| 0|
| 810.0| [810.0]| 1|
|145.95|[145.95]| 5|
| ... | ... | ... |
My goal is to get the min and the max value of a cluster (or all clusters). It is possible?
Thank's a lot

If I understand your question correctly, you could use groupBy to group by prediction column.
predictions.groupBy("prediction")
.agg(min(col("price")).as("min_price"),
max(col("price")).as("max_price"))
Is this what you need?

Related

PySpark UDF: a fir transform example

I am really new to PySpark and am trying to translate some python code into pyspark.
I start with a panda, convert to a document - term matrix and then apply PCA.
The UDF:
class MultiLabelCounter():
def __init__(self, classes=None):
self.classes_ = classes
def fit(self,y):
self.classes_ =
sorted(set(itertools.chain.from_iterable(y)))
self.mapping = dict(zip(self.classes_,
range(len(self.classes_))))
return self
def transform(self,y):
yt = []
for labels in y:
data = [0]*len(self.classes_)
for label in labels:
data[self.mapping[label]] +=1
yt.append(data)
return yt
def fit_transform(self,y):
return self.fit(y).transform(y)
mlb = MultiLabelCounter()
df_grouped =
df_grouped.withColumnRenamed("collect_list(full)","full")
udf_mlb = udf(lambda x: mlb.fit_transform(x),IntegerType())
mlb_fitted = df_grouped.withColumn('full',udf_mlb(col("full")))
I am of course getting NULL results.
I am using spark 2.4.4 version.
EDIT
Adding sample input and output as per request
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based upon input data shared, I tried replicating your output and it works. Please see below -
Input Data
df = spark.createDataFrame(data=[(1, ['hello', 'world']), (2, ['goodbye', 'world']), (3, ['hello', 'hello'])], schema=['id', 'vals'])
df.show()
+---+----------------+
| id| vals|
+---+----------------+
| 1| [hello, world]|
| 2|[goodbye, world]|
| 3| [hello, hello]|
+---+----------------+
Now, using explode to create separate rows out of vals list items. Thereafter, using pivot and count will calculate the frequency. Finally, replacing null values with 0 using fillna(0). See below -
from pyspark.sql.functions import *
df1 = df.select(['id', explode(col('vals'))]).groupBy("id").pivot("col").agg(count(col("col")))
df1.fillna(0).orderBy("id").show()
Output
+---+-------+-----+-----+
| id|goodbye|hello|world|
+---+-------+-----+-----+
| 1| 0| 1| 1|
| 2| 1| 0| 1|
| 3| 0| 2| 0|
+---+-------+-----+-----+

base64 decoding of a dataframe

I have an encoded dataframe and I managed to get it decoded using following code in PySpark. Is there any simple way where I can have an additional column in the dataframe itself through Scala/PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
encodedColumn = base64.decodestring(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn ,dtype='<f4')
I looked up multiple similar questions, but couldnt get them to work.
Edit:
Got it working with help from a colleague.
def binaryToFloatArray(stringValue: String): Array[Float] = {
val t:Array[Byte] = Base64.getDecoder().decode(stringValue)
val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
val copy = new Array[Float](2048)
b.get(copy)
return copy
}
val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
You have base64 and unbase64 functions for this.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64
You could
from pyspark.sql.functions import unbase64,base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
+---+------+
| id| name|
+---+------+
| 1| Jon|
| 2| Danny|
| 3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64_name', base64(got.name))
+---+------+-------------------+
| id| name|encoded_base64_name|
+---+------+-------------------+
| 1| Jon| Sm9u|
| 2| Danny| RGFubnk=|
| 3|Tyrion| VHlyaW9u|
+---+------+-------------------+
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64).cast("string"))
# Need to use cast("string") to convert from binary to string
+---+------+--------------+--------------+
| id| name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
| 1| Jon| Sm9u| Jon|
| 2| Danny| RGFubnk=| Danny|
| 3|Tyrion| VHlyaW9u| Tyrion|
+---+------+--------------+--------------+

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a spark SQL dataset whose schema defined as follows,
User_id <String> | Item_id <String> | Bought_Status <Boolean>
I would like to convert this to a Sparse matrix to apply recommender systems algorithms. This is very huge RDD datasets so I read that CoordinateMatrix is the right way to create a sparse matrix out of this.
However I got stuck at a point where the API doc says that RDD[MatrixEntry] is mandatory to create a CoordinateMatrix. Also MatrixEntry needs a format of int,int, long.
I am not able to convert my data scheme to this format. Can you please help me on how to convert this data to a sparse matrix in spark? I am currently programming using scala
Please note that matrix entity is of type long, long, double
Reference: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry
Also, as user/ item columns are string, those need to be indexed before processing. Here is how you can create coordinatematrix with scala:
//Imports needed
scala> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
scala> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer
//Let's create a dummy dataframe
scala> val df = spark.sparkContext.parallelize(List(
| ("u1","i1" ,true),
| ("u1","i2" ,true),
| ("u2","i3" ,false),
| ("u2","i4" ,false),
| ("u3","i1" ,true),
| ("u3","i3" ,true),
| ("u4","i3" ,false),
| ("u4","i4" ,false))).toDF("user","item","bought")
df: org.apache.spark.sql.DataFrame = [user: string, item: string ... 1 more field]
scala> df.show
+----+----+------+
|user|item|bought|
+----+----+------+
| u1| i1| true|
| u1| i2| true|
| u2| i3| false|
| u2| i4| false|
| u3| i1| true|
| u3| i3| true|
| u4| i3| false|
| u4| i4| false|
+----+----+------+
//Index user/ item columns
scala> val indexer1 = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_2de8d35b8301
scala> val indexed1 = indexer1.fit(df).transform(df)
indexed1: org.apache.spark.sql.DataFrame = [user: string, item: string ... 2 more fields]
scala> val indexer2 = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_493ce45dbec3
scala> val indexed2 = indexer2.fit(indexed1).transform(indexed1)
indexed2: org.apache.spark.sql.DataFrame = [user: string, item: string ... 3 more fields]
scala> val tempDF = indexed2.withColumn("userIndex",indexed2("userIndex").cast("long")).withColumn("itemIndex",indexed2("itemIndex").cast("long")).withColumn("bought",indexed2("bought").cast("double")).select("userIndex","itemIndex","bought")
tempDF: org.apache.spark.sql.DataFrame = [userIndex: bigint, itemIndex: bigint ... 1 more field]
scala> tempDF.show
+---------+---------+------+
|userIndex|itemIndex|bought|
+---------+---------+------+
| 0| 1| 1.0|
| 0| 3| 1.0|
| 1| 0| 0.0|
| 1| 2| 0.0|
| 2| 1| 1.0|
| 2| 0| 1.0|
| 3| 0| 0.0|
| 3| 2| 0.0|
+---------+---------+------+
//Create coordinate matrix of size 4*4
scala> val corMat = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0),m.getLong(1),m.getDouble(2))), 4, 4)
corMat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix#16be6b36
//Check the content of coordinate matrix
scala> corMat.entries.collect
res2: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0,1,1.0), MatrixEntry(0,3,1.0), MatrixEntry(1,0,0.0), MatrixEntry(1,2,0.0), MatrixEntry(2,1,1.0), MatrixEntry(2,0,1.0), MatrixEntry(3,0,0.0), MatrixEntry(3,2,0.0))
Hope, this helps!

I need to compare hive table schema with a data frame which contains schema of a csv file

here I am trying to validate structure of hive table and CSV file which is stored in s3.
This is schema of CSV file in dataframe.
+----+----------------+----+---+-----------+-----------+
|S_No| Variable|Type|Len| Format| Informat|
+----+----------------+----+---+-----------+-----------+
| 1| DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 2| LOAD_DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 3| SOURCE_BANK|Char| 1 | null| null|
| 4| EMP_NAME|Char| 50| null| null|
| 5|HEADER_ROW_COUNT| Num| 8| null| null|
| 6| EMP _HOURS| Num| 8| 15.2| 15.1|
+----+----------------+----+---+-----------+-----------+
I need to compare it with O/p of
import org.apache.spark.sql.hive.HiveContext
val targetTableName = "TableA"
val hc = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val targetRawData = hc.sql("Select datetime,load_datetime,trim(source_bank) as source_bank,trim(emp_name) as emp_name,header_row_count, emp_hours from " + targetTableName)
val schema= targetRawData.schema
which is :schema: org.apache.spark.sql.types.StructType = StructType(StructField(datetime,TimestampType,true), StructField(load_datetime,TimestampType,true), StructField(source_bank,StringType,true), StructField(emp_name,StringType,true), StructField(header_row_count,IntegerType,true), StructField(emp_hours,DoubleType,true))
you can use MegaSparDiff an open source too that compares multiple types of datasources including S3, HIVE, CSV, JDBC etc.
https://github.com/FINRAOS/MegaSparkDiff
the below pair will return inLeftButNotInRight and inRightButNotInLeft as DataFrames.
Your success condition is if both DataFrames have zero records, this means the data is identical.
Now you might wanna consider not loading all the data since you are interested in comparing the schema only.
SparkFactory.initializeSparkContext();
AppleTable leftAppleTable = SparkFactory.parallelizeTextSource("S3://file1","table1");
AppleTable rightAppleTable = SparkFactory.parallelizeHiveSource("select * from hivetable","hivetable");
Pair<Dataset<Row>, Dataset<Row>> resultPair = SparkCompare.compareAppleTables(leftAppleTable, rightAppleTable);
if (resultPair.getLeft().count() != 0 && resultPair.getRight().count() != 0)
{
//success condition
}
SparkFactory.stopSparkContext();
Naveen
you can follow the below steps:
define a class:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql._
class columnMatch(spark:SparkSession,inputdf: DataFrame, requiredColNames: Array[String]) {
val missingColumns = requiredColNames.diff(inputdf.columns)
def missingColumnsMessage(): String = {
val missingColNames = missingColumns.mkString(", ")
val allColNames = inputdf.columns.mkString(", ")
s"The [${missingColNames}] columns are not included in the DataFrame with the following columns [${allColNames}]"
def validatePresenceOfColumns(): String = {
if (missingColumns.nonEmpty) {
missingColumnsMessage()
}
else{
s"No Mismatch in the column Name Found"
}
}
}

I need to compare two dataframes for type validation and send a nonzero value as output

I am comparing two dataframes (basically these are schema of two different data sources one from hive and other from SAS9.2)
I need to validate structure for both data sources so I converted schema into two dataframes and here they are:
SAS schema will be in below format:
scala> metadata.show
+----+----------------+----+---+-----------+-----------+
|S_No| Variable|Type|Len| Format| Informat|
+----+----------------+----+---+-----------+-----------+
| 1| DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 2| LOAD_DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 3| SOURCE_BANK|Char| 1| null| null|
| 4| EMP_NAME|Char| 50| null| null|
| 5|HEADER_ROW_COUNT| Num| 8| null| null|
| 6| EMP_HOURS| Num| 8| 15.2| 15.1|
+----+----------------+----+---+-----------+-----------+
Similarly hive metadata will be in below format:
df2.show
+----------------+-------------+
| Variable| type|
+----------------+-------------+
| datetime|TimestampType|
| load_datetime|TimestampType|
| source_bank| StringType|
| emp_name| StringType|
|header_row_count| IntegerType|
| emp_hours| DoubleType|
+----------------+-------------+
Now, I need to compare both these on column type and validate structure.Like for "Num" type equivalent is "Integertype".
Finally I need to store anon zero value as output if schema validation is successful
How can I achieve this ?
you can join the two dataframes and then compare the two columns corressponding to the columns type via a Map and UDF.
This is a code sample that does that.
You need to complete the map with the right values
val sqlCtx = sqlContext
import sqlCtx.implicits._
val metadata: DataFrame= Seq(
(Some("1"), "DATETIME", "Num", "8", "DATETIME20", "DATETIME20"),
(Some("3"), "SOURCEBANK", "Num", "1", "null", "null")
).toDF("SNo", "Variable", "Type", "Len", "Format", "Informat")
val metadataAdapted: DataFrame = metadata
.withColumn("Name", functions.upper(col("Variable")))
.withColumnRenamed("Type", "TypeHive")
val sasDF = Seq(("datetime", "TimestampType"),
("datetime", "TimestampType")
).toDF("variable", "type")
val sasDFAdapted = sasDF
.withColumn("Name", functions.upper(col("variable")))
.withColumnRenamed("Type", "TypeSaS")
val res = sasDFAdapted.join(metadataAdapted, Seq("Name"), "inner")
val map = Map("TimestampType" -> "Num")
def udfType(dict: Map[String, String]) = functions.udf( (typeVar: String) => dict(typeVar))
val result = res.withColumn("correctMapping", udfType(map)(col("TypeSaS")) === col("TypeHive"))