How to map MongoDB data in Spark for kmeans?

I want to run k-means within Spark on data provided from a MongoDB.
I have a working example that runs against a flat file:
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansExample")
data = sc.textFile("/home/mhoeller/kmeans_data.txt")
parsedData = data.map(lambda line: array([int(x) for x in line.split(' ')]))
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
This is the format of the flat file:
0 0 1
1 1 1
2 2 2
9 9 6
Now I want to replace the flatfile with a MongoDB:
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://127.0.0.1/ycsb.usertable").load()
# <<<< Here I am missing the parsing >>>>>
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
I'd like to understand how to map the data from the df so that it can be used as input for KMeans.
The schema of the collection is:
root
|-- _id: string (nullable = true)
|-- field0: binary (nullable = true)
|-- field1: binary (nullable = true)
|-- field2: binary (nullable = true)
|-- field3: binary (nullable = true)
|-- field4: binary (nullable = true)
|-- field5: binary (nullable = true)
|-- field6: binary (nullable = true)
|-- field7: binary (nullable = true)
|-- field8: binary (nullable = true)
|-- field9: binary (nullable = true)

Based on your snippet, I assume that you're using PySpark.
If you look at the clustering.KMeans Python API docs, you can see that the first parameter needs to be an RDD of Vector, or of a convertible sequence type.
After you run the code below, which loads data from MongoDB using the MongoDB Spark Connector:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .load()
What you have in df is a DataFrame, so we need to convert it into something convertible to a Vector type.
Since you are using numpy.array in your text file example, we can keep using the same array type to ease the transition.
Based on the provided schema, we first need to drop the _id column, as it is not needed for training the clustering model. See also the Vector data type for more information.
With the above information, let's get into it:
# Drop _id column and get RDD representation of the DataFrame
rowRDD = df.drop("_id").rdd
# Convert RDD of Row into RDD of numpy.array
parsedRdd = rowRDD.map(lambda row: array([int(x) for x in row]))
# Feed into KMeans
clusters = KMeans.train(parsedRdd, 2, maxIterations=10, initializationMode="random")
If you would like to keep the boolean values (True/False) instead of integers (1/0), you can remove the int cast, as below:
parsedRdd = rowRDD.map(lambda row: array([x for x in row]))
Putting it all together:
from numpy import array
from pyspark.mllib.clustering import KMeans
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
rowRDD = df.drop("_id").rdd
parsedRdd = rowRDD.map(lambda row: array([int(x) for x in row]))
clusters = KMeans.train(parsedRdd, 2, maxIterations=10, initializationMode="random")
clusters.clusterCenters

Related

Apply Vectors.dense() to an array<float> column in pyspark 3.2.1

In order to apply PCA from pyspark.ml.feature, I need to convert an org.apache.spark.sql.types.ArrayType:array<float> to an org.apache.spark.ml.linalg.VectorUDT.
Say I have the following dataframe:
df = spark.createDataFrame([
    ('string1', [5.0, 4.0, 0.5]),
    ('string2', [2.0, 0.76, 7.54]),
], schema='a string, b array<float>')
While a = Vectors.dense(df.select('b').head(1)[0][0]) seems to work for one row, I was wondering how I can apply this function to all the rows.
You'd have to map it back to an RDD and manually create a Vector using a lambda function:
from pyspark.ml.linalg import Vectors
# df = ... # your df
df2 = df.rdd.map(lambda x: (x['a'], Vectors.dense(x['b']))).toDF(['a', 'b'])
df2.show()
df2.printSchema()
+-------+------------------------------------------+
|a      |b                                         |
+-------+------------------------------------------+
|string1|[5.0,4.0,0.5]                             |
|string2|[2.0,0.7599999904632568,7.539999961853027]|
+-------+------------------------------------------+
root
|-- a: string (nullable = true)
|-- b: vector (nullable = true)

how to transform each column of a dataframe from binary to byte array

I have a dataset ds1 with the following schema:
root
|-- binary_col1: binary (nullable = true)
which I transform as needed using:
val ds2 = ds1.map(row => row.getAs[Array[Byte]]("binary_col1"))
But how do I transform the dataset when it has two columns of binary type?
root
|-- binary_col1: binary (nullable = true)
|-- binary_col2: binary (nullable = false)
I want to create a new dataset with two columns:
(binary_col1.toByteArray, binary_col2.toByteArray)
You can use as on the DataFrame/Dataset and provide a Tuple2 type:
val ds2 = ds1.as[(Array[Byte], Array[Byte])]
This is better than using map because it preserves column names.
Of course, you can also use map, e.g.
val ds2 = ds1.map(row => (row.getAs[Array[Byte]]("binary_col1"), row.getAs[Array[Byte]]("binary_col2")))
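If you go the map route but still want the original column names back, one option (a sketch, not from the original answer; it assumes a SparkSession named spark and the ds1 from the question) is to rename the resulting tuple columns afterwards:
// Sketch only: assumes an existing SparkSession `spark` and a DataFrame/Dataset `ds1`
// with columns binary_col1 and binary_col2, as in the question.
import spark.implicits._

val ds3 = ds1
  .map(row => (row.getAs[Array[Byte]]("binary_col1"), row.getAs[Array[Byte]]("binary_col2")))
  .toDF("binary_col1", "binary_col2")  // restore the original column names
  .as[(Array[Byte], Array[Byte])]      // keep the typed tuple view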

How is the VectorAssembler used with Spark's Correlation util?

I'm trying to correlate a couple of columns of a dataframe in Spark Scala by piping the columns of the original dataframe into the VectorAssembler, followed by the Correlation util. For some reason the VectorAssembler seems to be producing empty vectors, as seen below. Here's what I have so far.
val numericalCols = Array(
  "price", "bedrooms", "bathrooms",
  "sqft_living", "sqft_lot"
)
val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
|-- price: decimal(38,18) (nullable = true)
|-- bedrooms: decimal(38,18) (nullable = true)
|-- bathrooms: decimal(38,18) (nullable = true)
|-- sqft_living: decimal(38,18) (nullable = true)
|-- sqft_lot: decimal(38,18) (nullable = true)
...
*/
println("total record:"+data.count()) //total record:21613
val assembler = new VectorAssembler().setInputCols(numericalCols)
.setOutputCol("features").setHandleInvalid("skip")
val df = assembler.transform(data).select("features","price")
df.printSchema()
/*
|-- features: vector (nullable = true)
|-- price: decimal(38,18) (nullable = true)
*/
df.show
/* THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/
println("df row count:" + df.count())
// df row count:21613
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head //ERROR HERE
println("Pearson correlation matrix:\n" + coeff1.toString)
This ends up with the following exception:
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
at
It looks like at least one of your feature columns always contains a null value. setHandleInvalid("skip") will skip any row that contains a null in one of the features, which is why the assembled DataFrame is empty. Try filling the null values (na.fill in Scala) and check the result; that should solve your issue. A minimal sketch follows.
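A minimal sketch of that suggestion, assuming the data DataFrame and numericalCols array from the question; na.fill(0.0) replaces nulls in the numeric columns before assembling:
// Sketch only: `data` and `numericalCols` come from the question above.
import org.apache.spark.ml.feature.VectorAssembler

// Replace nulls in the numeric columns with 0.0 so VectorAssembler no longer skips those rows
val filled = data.na.fill(0.0)

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")
  .setHandleInvalid("skip")

val df = assembler.transform(filled).select("features", "price")
df.show(5, truncate = false)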

Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated by ",". The number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print the schema it shows all the columns as strings.
I convert all columns to integers as below, but afterwards, when I print the schema of df again, the columns are still strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I appreciate any help to solve the issue of casting.
Thanks.
DataFrames are immutable. Your code creates a new DataFrame for each column and discards it.
It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could also use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
Or, since you mentioned that the number of columns is not the same each time, you could take the highest possible number of columns and build a schema out of it, with IntegerType as the column type. When loading the file, apply this schema to automatically convert your dataframe columns from string to integer. No explicit conversion is required in this case.
import org.apache.spark.sql.types._
val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)

Euclidean distance in spark 2.1

I'm trying to calculate the Euclidean distance between two vectors. I have the following dataframe:
root
|-- h: string (nullable = true)
|-- id: string (nullable = true)
|-- sid: string (nullable = true)
|-- features: vector (nullable = true)
|-- episodeFeatures: vector (nullable = true)
import org.apache.spark.mllib.util.{MLUtils}
val jP2 = jP.withColumn("dist", MLUtils.fastSquaredDistance("features", 5, "episodeFeatures", 5))
I get an error like so:
error: method fastSquaredDistance in object MLUtils cannot be accessed in object org.apache.spark.mllib.util.MLUtils
Is there a way to access that private method?
MLUtils is an internal package, and even if it wasn't, it couldn't be used on Columns or (guessing from the version) on ml vectors. You have to design your own udf:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vector
val euclidean = udf((v1: Vector, v2: Vector) => ???) // Fill with preferred logic
val jP2 = jP.withColumn("dist", euclidean($"features", $"episodeFeatures"))
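One way to fill in the udf body (a sketch, not part of the original answer) is to take the square root of Vectors.sqdist from org.apache.spark.ml.linalg:
// Sketch only: assumes the same `jP` DataFrame and spark.implicits._ as the original snippet.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Euclidean distance = square root of the squared distance between the two vectors
val euclidean = udf((v1: Vector, v2: Vector) => math.sqrt(Vectors.sqdist(v1, v2)))

val jP2 = jP.withColumn("dist", euclidean($"features", $"episodeFeatures"))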