I used Movielens 20 million dataset which contain file called rating .csv(UserId,MovieId,Rating) .I applied Alternating Least Square(ALS) which give output userId,FeatureVector in 10 parquet files . Dimensinality Reduction
I want to make normalize for featureVector using z-score method.
I want to subtract vector(featureVector) from constant scalar 2.484 ,divide value into 1.8305 and save value in parquet files.features column
val df = sqlContext.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
sqlContext.sql("select features from df")
df.withColumn("output", "features" -2.484).show(20)
How to subtract vector from each value of scalar?
if you want to subtract 2.484 from each vector value you can try it
import spark.implicits._
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.rdd.RDD
val df = Seq(Vector(1.0,2.0,2.5,3.0,3.5)).toDF("features")
val rdd: RDD[DenseVector] = df.select('features)
.rdd
.map(s => s.getAs[Seq[Double]]("features").toArray)
.map(s => Vectors.dense(s.map(s => s - 2.3333)).toDense)
rdd.take(10).foreach(println(_))
output:
[-1.3333,-0.33329999999999993,0.16670000000000007,0.6667000000000001,1.1667]
I've created an ML pipeline with Apache Spark using mllib.
The evaluator result is a DataFrame with a column "probability", which is a mllib vector of probabilities (similar to predict_proba in scikit-learn).
val rfPredictions = rfModels.bestModel.transform(testing)
val precision = evaluator.evaluate(rfPredictions)
I tried something like this with no success:
rfPredictions.select("probability").map{c => c.getAs[Vector](1).max}
<console>:166: error: value max is not a member of
org.apache.spark.mllib.linalg.Vector
I want a new column with the max of this probabilities. Any ideas?
Vector doesn't have a max method. Try toArray.max:
rfPredictions.select("probability").map{ c => c.getAs[Vector](1).toArray.max }
or argmax:
rfPredictions.select("probability").map{ c => {
val v = c.getAs[Vector](1)
v(v.argmax)
}}
To add the max as a new column, define a udf and use it with withColumn function:
val max_proba_udf = udf((v: Vector) => v.toArray.max)
rfPredictions.withColumn("max_prob", max_proba_udf($"probability"))
Spark > 2.0
With ml, not mllib this will work next way:
import org.apache.spark.ml.linalg.DenseVector
just_another_df.select("probability").map{ c => c.getAs[DenseVector](0).toArray.max }
Using udf
import org.apache.spark.ml.linalg.DenseVector
val max_proba_udf = udf((v: DenseVector) => v.toArray.max)
val rfPredictions = just_another_df.withColumn("MAX_PROB", max_proba_udf($"probability"))
I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict(for numpy.dtype) error. Same error if I do first_elem_udf = first_elem_udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
try:
return float(v[i])
except ValueError:
return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors) you should use item method:
v.values.item(0)
which return standard Python scalars. Similarly if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the follow custom function 'to_array' to convert the vector to array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlice from the pyspark.ml.feature library. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability columns generated after training a PySpark ML model into usable columns. This does not use UDF or numpy. And this will only work for binary classification. Here lr_pred is the dataframe which has the predictions from the Logistic Regression Model.
prob_df1=lr_pred.withColumn("probability",lr_pred["probability"].cast("String"))
prob_df =prob_df1.withColumn('probabilityre',split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
Why is Vector[Double] is used in the results? That's not a very nice data type.
I want to scale data with StandardScaler (from pyspark.mllib.feature import StandardScaler), by now I can do it by passing the values of RDD to transform function, but the problem is that I want to preserve the key. is there anyway that I scale my data by preserving its key?
Sample dataset
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,smurf.
Imports
import sys
import os
from collections import OrderedDict
from numpy import array
from math import sqrt
try:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
from pyspark.statcounter import StatCounter
print ("Successfully imported Spark Modules")
except ImportError as e:
print ("Can not import Spark Modules", e)
sys.exit(1)
Portion of code
sc = SparkContext(conf=conf)
raw_data = sc.textFile(data_file)
parsed_data = raw_data.map(Parseline)
Parseline function:
def Parseline(line):
line_split = line.split(",")
clean_line_split = [line_split[0]]+line_split[4:-1]
return (line_split[-1], array([float(x) for x in clean_line_split]))
Not exactly a pretty solution but you can adjust my answer to the similar Scala question. Lets start with an example data:
import numpy as np
np.random.seed(323)
keys = ["foo"] * 50 + ["bar"] * 50
values = (
np.vstack([np.repeat(-10, 500), np.repeat(10, 500)]).reshape(100, -1) +
np.random.rand(100, 10)
)
rdd = sc.parallelize(zip(keys, values))
Unfortunately MultivariateStatisticalSummary is just a wrapper around a JVM model and it is not really Python friendly. Luckily with NumPy array we can use standard StatCounter to compute statistics by key:
from pyspark.statcounter import StatCounter
def compute_stats(rdd):
return rdd.aggregateByKey(
StatCounter(), StatCounter.merge, StatCounter.mergeStats
).collectAsMap()
Finally we can map to normalize:
def scale(rdd, stats):
def scale_(kv):
k, v = kv
return (v - stats[k].mean()) / stats[k].stdev()
return rdd.map(scale_)
scaled = scale(rdd, compute_stats(rdd))
scaled.first()
## array([ 1.59879188, -1.66816084, 1.38546532, 1.76122047, 1.48132643,
## 0.01512487, 1.49336769, 0.47765982, -1.04271866, 1.55288814])
Suppose I have an RDD of doubles and I want to “standardize” it as follows:
Compute the mean and sd for each col
For each col, subtract the column mean from each entry and divide the result by the column sd
Can this be done efficiently and easily (without converting the RDD into a double array at any stage)?
Thanks and regards,
You can use StandardScaler from Spark itself
/**
* Standardizes features by removing the mean and scaling to unit variance
* using column summary
*/
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val data: RDD[Vector] = ???
val scaler = new StandardScaler(true, true).fit(data)
data.foreach { vector =>
val scaled = scaler.transform(vector)
}