How to do normalization with MinMaxScaler within each group after using groupBy on a Spark DataFrame? [duplicate] - scala

I want to scale data with StandardScaler (from pyspark.mllib.feature import StandardScaler). Right now I can do it by passing the values of the RDD to the transform function, but the problem is that I want to preserve the key. Is there any way to scale my data while preserving its key?
Sample dataset
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,smurf.
Imports
import sys
import os
from collections import OrderedDict
from numpy import array
from math import sqrt
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import StandardScaler
    from pyspark.statcounter import StatCounter
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
Portion of code
sc = SparkContext(conf=conf)
raw_data = sc.textFile(data_file)
parsed_data = raw_data.map(Parseline)
Parseline function:
def Parseline(line):
    line_split = line.split(",")
    clean_line_split = [line_split[0]] + line_split[4:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))

Not exactly a pretty solution, but you can adjust my answer to the similar Scala question. Let's start with some example data:
import numpy as np
np.random.seed(323)
keys = ["foo"] * 50 + ["bar"] * 50
values = (
    np.vstack([np.repeat(-10, 500), np.repeat(10, 500)]).reshape(100, -1) +
    np.random.rand(100, 10)
)
rdd = sc.parallelize(zip(keys, values))
Unfortunately, MultivariateStatisticalSummary is just a wrapper around a JVM model and is not really Python friendly. Luckily, with NumPy arrays we can use the standard StatCounter to compute statistics by key:
from pyspark.statcounter import StatCounter
def compute_stats(rdd):
    return rdd.aggregateByKey(
        StatCounter(), StatCounter.merge, StatCounter.mergeStats
    ).collectAsMap()
Finally we can map to normalize:
def scale(rdd, stats):
    def scale_(kv):
        k, v = kv
        return (v - stats[k].mean()) / stats[k].stdev()
    return rdd.map(scale_)
scaled = scale(rdd, compute_stats(rdd))
scaled.first()
## array([ 1.59879188, -1.66816084, 1.38546532, 1.76122047, 1.48132643,
## 0.01512487, 1.49336769, 0.47765982, -1.04271866, 1.55288814])
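If you want to keep the key in the output, here is a minimal sketch of a variant (not part of the answer above) that returns (key, scaled vector) pairs instead of bare arrays, reusing the same stats dictionary:
def scale_keep_key(rdd, stats):
    def scale_(kv):
        k, v = kv
        # Keep the key alongside the standardized vector.
        return (k, (v - stats[k].mean()) / stats[k].stdev())
    return rdd.map(scale_)

scaled_kv = scale_keep_key(rdd, compute_stats(rdd))
scaled_kv.first()
## e.g. ('foo', array([ 1.59879188, -1.66816084, ...]))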

Related

How to calculate a rolling covariance matrix from a Spark DataFrame

I have a Spark 2.2.0 DataFrame of currency prices, to which I add the returns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.getOrCreate()
val prices = spark.read.json("prices.json")
// make a window function and convert prices to returns
val window = Window.partitionBy("currency").orderBy("time")
val lagPrice = lag(col("close"), 1).over(window)
val percentReturn = col("close") / col("lastClose") - 1d
val logReturn = log(col("close") / col("lastClose"))
val returns = prices.withColumn("lastClose", lagPrice)
.withColumn("return", percentReturn)
.withColumn("logReturn", logReturn)
Now I want to calculate a rolling covariance matrix (like a moving average) of all currencies using a window function, but I cannot find any documentation or examples.

Type conversion error from LabeledPoint in pyspark.mllib when using a linear regression model in pyspark.ml

I have the following code for linear regression using the pyspark.ml package. However, I get this error message for the last line, when the model is being fit:
IllegalArgumentException: u'requirement failed: Column features must
be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was
actually org.apache.spark.mllib.linalg.VectorUDT#f71b0bce.
Does anyone have an idea what is missing?
Is there any replacement in pyspark.ml for LabeledPoint in pyspark.mllib?
from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pandas import *
data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[1], [values[0]])
points_df = data.map(parsePoint).toDF()
lr = LinearRegression()
model = lr.fit(points_df, {lr.regParam:0.0})
The problem is that newer versions of Spark have a Vector class in the ml.linalg module, so you do not need to get it from mllib.linalg. Also, the newer versions do not accept spark.mllib.linalg.VectorUDT in ml. Here is code that should work for you:
from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
import numpy as np
data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return (values[1], Vectors.dense([values[0]]))
points_df = data.map(parsePoint).toDF(['label','features'])
lr = LinearRegression()
model = lr.fit(points_df)
Newer Spark versions don't accept spark.mllib.linalg.VectorUDT in ml (you do not need to get it from mllib.linalg).
Try replacing
from pyspark.mllib.regression import LabeledPoint
with:
from pyspark.ml.linalg import Vectors
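As another option (a sketch, not from either answer above, assuming Spark 2.x): if you already have a DataFrame built from pyspark.mllib LabeledPoint rows, MLUtils.convertVectorColumnsToML can convert its mllib vector columns into ml vectors so the DataFrame can be fed to pyspark.ml estimators:
from pyspark.mllib.util import MLUtils

# points_df is the DataFrame built from LabeledPoint rows in the question;
# with no column names given, every mllib vector column (here "features") is converted.
points_ml_df = MLUtils.convertVectorColumnsToML(points_df)
model = LinearRegression().fit(points_ml_df)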

Spark Mllib .toBlockMatrix results in matrix of 0.0

I am trying to create a block matrix from an input data file. I have managed to read the data from the file and store it correctly in IndexedRowMatrix and CoordinateMatrix format.
When I use .toBlockMatrix on the CoordinateMatrix, the result is a block matrix containing only 0.0, with the same dimensions as the CoordinateMatrix.
I am using version 1.5.0-cdh5.5.0
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
val conf = new SparkConf().setMaster("local").setAppName("Transpose");
val sc = new SparkContext(conf)
val dataRDD = sc.textFile("/user/cloudera/data/data.txt").map(line => Vectors.dense(line.split(" ").map(_.toDouble))).zipWithIndex.map(_.swap)
//Format of dataRDD is RDD[(Long, Vector)]
val rows = dataRDD.map{case(k,v) => IndexedRow(k,v)}
//Format of rows is RDD[IndexedRow]
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
The data file is simply two columns by sixty rows of integers.
140 123
141 310
310 381
480 321
... ...
Update:
I've done some investigating and have discovered that the groupByKey function is not working correctly, which is what prevents the BlockMatrix from being formed correctly. However, I still do not know why groupByKey, join, and groupBy are not working and always return an empty result.
I solved the problem by removing these lines of code:
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
I found the answer in a comment by Farzad Nozarian on the question linked below:
Unable to count words using reduceByKey((v1,v2) => v1 + v2) scala function in spark
The root cause appears to be that spark-shell already provides a SparkContext as sc, so creating a second one breaks shuffle-based operations, which then return empty results. As a side note, this might help people who are getting empty results from .groupByKey, .reduceByKey, .join, etc.

Cosine Similarity via DIMSUM in Spark

I have a very simple piece of code to try out cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I run this code on Amazon AWS, which has Spark 1.5; however, I got the following message for the last two lines:
"Error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(int,int)]"
Could you please help to resolve this issue?
I found the answer: I needed to convert the matrix to an RDD of vectors. Here is the right code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}
val dm: Matrix = Matrices.dense(5, 5,Array(1,2,3,4,5,1,2,3,4,5,1,2,4,5,8,3,4,1,2,7,7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers

Convert RDD[Vector] to RDD[Double]

How do I convert a CSV to RDD[Double]? I get the error cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
  val conf = new SparkConf().setAppName("sample").setMaster("local")
  val sc = new SparkContext(conf)
  val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"

  val rows = sc.textFile(DATAFILE).map { line =>
    val values = line.split(',').map(_.toDouble)
    Vectors.dense(values)
  }.cache()

  // Construct the density estimator with the sample data and a standard deviation for the Gaussian
  // kernels
  val rdd: RDD[Double] = sc.parallelize(rows)
  val kd = new KernelDensity().setSample(rdd)
    .setBandwidth(3.0)

  // Find density estimates for the given values
  val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at this moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
I have prepared this code; please evaluate whether it can help you out:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)