Random Forest Regression Implementation in PySpark

I want to implement random forest regression in PySpark after completing all the data preparation. Could someone share sample code for the implementation?

From the docs (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor):
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.regression import RandomForestRegressor
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
>>> model = rf.fit(df)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> model.numFeatures
1
>>> model.trees
[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...]
>>> model.getNumTrees
2
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
0.5
>>> rfr_path = temp_path + "/rfr"
>>> rf.save(rfr_path)
>>> rf2 = RandomForestRegressor.load(rfr_path)
>>> rf2.getNumTrees()
2
>>> model_path = temp_path + "/rfr_model"
>>> model.save(model_path)
>>> model2 = RandomForestRegressionModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
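Beyond the doctest, here is a minimal end-to-end sketch, assuming your prepared DataFrame (called data below) already has a "features" vector column and a "label" column; the split ratio and hyperparameters are only illustrative:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# `data` is assumed to be your prepared DataFrame with "features" and "label" columns
train, test = data.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=100, maxDepth=5, seed=42)
model = rf.fit(train)

predictions = model.transform(test)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
print("Test RMSE = {}".format(evaluator.evaluate(predictions)))
print("Feature importances: {}".format(model.featureImportances))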

Related

How to get the Spark Scala correlation output as a dataframe?

I am trying to calculate the correlation of all columns in a Spark dataframe using the code below.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession
  .builder
  .appName("SparkCorrelation")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")

val transformed = assembler.transform(df)
val corr = Correlation.corr(transformed, "vectors", "pearson")
corr.show(100, false)
My output comes out as a dataframe with one column.
pearson(vectors)
1.0                 1.0000000000000002  0.9999999999999998
1.0000000000000002  1.0                 1.0000000000000002
0.9999999999999998  1.0000000000000002  1.0
but I want my output in the following format. Can somebody please help?
Column  c1    c2    c3
c1      1     0.97  0.92
c2      0.97  1     0.94
c3      0.92  0.94  1
The best you can do is this, but without the column labels:
val corr = Correlation.corr(transformed, "vectors", "pearson").head
println(s"Pearson correlation matrix:\n $corr")

AttributeError: 'HashingTF' object has no attribute '_java_obj'

When I use pyspark.ml.Pipeline to create a pipeline, the following problem occurs:
File "/opt/module/spark-2.4.3-bin-hadoop2.7/Pipeline.py", line 18, in
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/init.py", line 110
, in wrapper
TypeError: init() got an unexpected keyword argument 'ipnutCol'
Exception ignored in:
Traceback (most recent call last):
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 4
0, in del
AttributeError: 'HashingTF' object has no attribute '_java_obj'
I guess the API has changed, but I am not certain.
# Build a machine learning pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml import Pipeline

# Create a SparkSession object
spark = SparkSession.builder.master("local").appName("WorldCount").getOrCreate()

# 1. Prepare training documents from a list of (id, text, label) tuples
training = spark.createDataFrame([
    (0, 'a b c d e spark', 1.0),
    (1, 'b d', 0.0),
    (2, 'spark f g h', 1.0),
    (3, 'hadoop mapreduce', 0.0)
], ['id', 'text', 'label'])

# 2. Define the individual PipelineStages of the pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# 3. Organize the PipelineStages in processing order and create a Pipeline.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# 4. Train the model
model = pipeline.fit(training)

# 5. Build the test data
test = spark.createDataFrame([
    (4, 'spark i j k'),
    (5, 'i m n'),
    (6, 'spark hadoop spark'),
    (7, 'apache hadoop')
], ['id', 'text'])

# 6. Call transform() on the fitted PipelineModel so the test data flows
#    through the fitted pipeline in order, producing predictions
prediction = model.transform(test)
selected = prediction.select('id', 'text', 'probability', 'prediction')
for row in selected.collect():
    rid, text, prob, prediction = row
    print('({},{}) -> prob = {}, prediction={}'.format(rid, text, str(prob), prediction))
(4, spark i j k) --> prob=[0.155543713844,0.844456286156], prediction=1.000000
(5, l m n) --> prob=[0.830707735211,0.169292264789], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0696218406195,0.93037815938], prediction=1.000000
(7, apache hadoop) --> prob=[0.981518350351,0.018481649649], prediction=0.000000
Your spelling of inputCol is wrong here (ipnutCol):
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
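For completeness, the corrected line (reusing the tokenizer defined in the question) would look like this:
from pyspark.ml.feature import HashingTF

# with the keyword spelled correctly (inputCol, not ipnutCol), the stage constructs fine
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")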

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
|  features|
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create the sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 11]),
                        IndexedRow(2, [3, 6, 9, 12])])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the rows and columns are laid out as in the following mat format:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[ 1  4  7]
[10  2  5]
[ 8 11  3]
[ 6  9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)

transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
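With the transposed RowMatrix in hand, the columnSimilarities() calls from the goal above should apply directly (a sketch continuing from the code above):
# cosine similarities between columns; the 0.05 threshold enables the approximate (sampled) version
exact = transposed_rowmatrix_features.columnSimilarities()
approx = transposed_rowmatrix_features.columnSimilarities(0.05)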

Spark DataFrame filter not working as expected with Random

This is my DataFrame
df.groupBy($"label").count.show
+-----+---------+
|label|    count|
+-----+---------+
|  0.0|400000000|
|  1.0| 10000000|
+-----+---------+
I am trying to subsample the records with label == 0.0 with the following:
val r = scala.util.Random
val df2 = df.filter($"label" === 1.0 || r.nextDouble > 0.5) // keep 50% of 0.0
My output looks like this:
df2.groupBy($"label").count.show
+-----+--------+
|label|   count|
+-----+--------+
|  1.0|10000000|
+-----+--------+
r.nextDouble is evaluated once when the expression is constructed, so it is a constant in the filter expression and the actual evaluation is quite different from what you mean. Depending on the sampled value it is either
scala> r.setSeed(0)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res0: org.apache.spark.sql.Column = ((label = 1.0) OR true)
or
scala> r.setSeed(4096)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res3: org.apache.spark.sql.Column = ((label = 1.0) OR false)
so after simplification the predicate is just:
true
(keeping all the records) or
label = 1.0
(keeping only the label 1.0 records, the case you observed), respectively.
To generate a random number per row you should use the corresponding SQL function:
scala> import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions.rand
scala> $"label" === 1.0 || rand > 0.5
res1: org.apache.spark.sql.Column = ((label = 1.0) OR (rand(3801516599083917286) > 0.5))
though Spark already provides stratified sampling tools:
df.stat.sampleBy(
  "label",                      // column
  Map(0.0 -> 0.5, 1.0 -> 1.0),  // fractions
  42                            // seed
)
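For reference (not part of the original answer), the same two fixes would look roughly like this in PySpark, assuming the same df with a label column:
from pyspark.sql import functions as F

# rand() is evaluated per row at execution time, unlike scala.util.Random above
df2 = df.filter((F.col("label") == 1.0) | (F.rand(seed=42) > 0.5))

# or stratified sampling, mirroring the Scala sampleBy call above
df2 = df.sampleBy("label", fractions={0.0: 0.5, 1.0: 1.0}, seed=42)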

simple matrix multiplication in Spark

I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how to apply the right transformation in order to pass the values to the distributed matrix.
EDIT:
Maybe I have been able to solve:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.to
toArray toDenseMatrix toDenseVector toScalaVector toString
toVector
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.Dense
DenseMatrix DenseVector
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0) ;
// make two rows of the column values, transpose it,
// make Vectors of the result
val t = List(col,col).transpose.map(r=>Vectors.dense(r.toArray))
// make an RDD from the resultant sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t));
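For comparison, here is a rough PySpark sketch of the same construction; the final multiply call (against an identity matrix here) goes beyond the answer above and is only meant to show where an actual matrix multiplication would plug in:
from pyspark.mllib.linalg import Vectors, DenseMatrix
from pyspark.mllib.linalg.distributed import RowMatrix

# same 5 x 2 matrix: each row repeats the column value twice
col = [-3.0, -1.5, 0.0, 1.5, 3.0]
rm = RowMatrix(sc.parallelize([Vectors.dense(v, v) for v in col]))

# multiply on the right by a local 2 x 2 matrix (values are column-major)
product = rm.multiply(DenseMatrix(2, 2, [1.0, 0.0, 0.0, 1.0]))
product.rows.collect()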