I'm new to Python and Spark. Is there a way to do method chaining like the Java 8 Streams API? I have this in PySpark. Do I need another API, like the Streams API?
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("MaxTemperatures")
sc = SparkContext(conf = conf)
def parseLine(line):
    fields = line.split(",")
    stationID = fields[0]
    entryType = fields[2]
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return (stationID, entryType, temperature)
lines = sc.textFile("file:///SparkCourse/1800.csv")
parsedLines = lines.map(parseLine)
maxTemps = parsedLines.filter(lambda x: "TMAX" in x[1])
stationTemps = maxTemps.map(lambda x: (x[0], x[2]))
maxTemps = stationTemps.reduceByKey(lambda x,y : max(x,y))
results = maxTemps.collect();
for result in results:
    print(result[0] + "\t{:.2f}F".format(result[1]))
I get a syntax error when I try to chain the methods:
maxTemps = parsedLines.filter(lambda x: "TMAX" in x[1])
.map(lambda x: (x[0], x[2]))
.reduceByKey(lambda x,y : max(x,y))
.collect();
PySpark is organized by chaining transformers and estimators together into pipelines. You can read more about that here (an old version of PySpark, but the ideas still apply). You can find more info on split and example code here. You might also get some use out of user-defined functions (UDFs) for custom operations, which you can read more about here.
As Alex Ott mentioned, you want to be using the modern ML API (DataFrame-based) and not the MLlib API (RDD-based), which is deprecated.
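As for the syntax error itself: in Python, an expression cannot simply continue on the next line, so the interpreter treats each .map(...) line as a new, invalid statement. Wrap the whole chain in parentheses (or end each line with a backslash) and the chaining works just like the Java Streams style; no extra API is needed. A minimal sketch using the variables from your code:
results = (parsedLines
           .filter(lambda x: "TMAX" in x[1])
           .map(lambda x: (x[0], x[2]))
           .reduceByKey(lambda x, y: max(x, y))
           .collect())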
Related
I am a beginner in Scala and have been working on the following problem:
Example dataset, named given_dataset, with player number and points scored:
|player_no|points|
|        1|  25.0|
|        1|  20.0|
|        1|  21.0|
|        2|  15.0|
|        2|  18.0|
|        3|  24.0|
|        3|  25.0|
|        3|  29.0|
Problem 1:
I have a dataset and need to calculate the total points scored, the average points per game, and the number of games played. I am unable to explicitly set the data type to "double", "int", or "float" when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help with this and correct my error?
No data type specified (but code is able to run)
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the spark.sql.functions library, e.g. through functions such as map, groupByKey, etc. If possible, could anyone provide an example of this and point me in the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType}, then you have to cast either at read time or by using as[(Int, Double)] on the dataset. Below is an example that parses your dataset while reading it from a CSV file:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the player_no and points fields, and convert to integer & float
  val player_no = fields(0).toInt
  val points = fields(1).toFloat
  // Create a tuple that is our result.
  (player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 =
  dataset
    .mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(x => x._1 / x._2)
    .sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I tried running it both ways locally and both work fine; we can check the output below as well:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")
I have a breeze.linalg.DenseMatrix[Double] as follows, and I want to convert it to a DataFrame:
breeze.linalg.DenseMatrix[Double] =
0.009748169568491553 3.04248345416453E-4 -0.0018493112842201912 8.200326863261204E-4
3.0424834541645305E-4 0.00873118653317929 6.352723194418622E-4 1.84118791655692E-5
-0.001849311284220191 6.35272319441862E-4 0.008553284420541575 -6.407982513791382E-4
8.200326863261203E-4 1.8411879165568983E-5 -6.407982513791378E-4 0.008413484758510377
Is there any way I can do that?
After a couple of tries, I was able to create a DataFrame that contains the flattened contents of the matrix and register a temp view, so that it can be accessed from Python as a DataFrame.
In Scala:
// covarianceMatrix (in Scala)
import spark.implicits._  // needed for .toDF on a Seq
val c = covarianceMatrix.toArray.toSeq
val covarianceMatrix_df = c.toDF("number")
covarianceMatrix_df.createOrReplaceTempView("covarianceMatrix_df")
In Python:
import numpy as np

covarianceMatrix_df = spark.sql('''SELECT * FROM covarianceMatrix_df''')
covarianceMatrix_pd = covarianceMatrix_df.toPandas()
nrows = np.sqrt(len(covarianceMatrix_pd))
covarianceMatrix_pd = covarianceMatrix_pd.to_numpy().reshape((int(nrows),int(nrows)))
covarianceMatrix_pd
I have an RDD that I am trying to filter so that it contains only float values. Do Spark RDDs provide any way of doing this?
I have a CSV where I need only the float values greater than 40 in a new RDD. To achieve this, I am checking whether each value is an instance of type Float and filtering on that. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.
Since you have a plain string and you are trying to get float values out of it, you are not actually filtering by type; you are filtering by whether the value can be parsed as a float.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part, you can either perform another filter afterwards or filter the inner Option.
(Both should perform more or less the same due to Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
Let me know if you have any question.
I have successfully trained an XGBoost model, where traindDF is a DataFrame having two columns, features and label, with 11k 1's and 57M 0's (an unbalanced dataset). Everything works fine.
val undersample = 0.1
// Undersampling of 0's -- choosing 10%
val training1 = output1.filter($"datestr" < end_period1 && $"label" === 1)
val training0 = output1.filter($"datestr" < end_period1 && $"label" === 0).sample(false, undersample)
val training = training0.unionAll(training1)
val traindDF = training.select("label", "features").toDF("label", "features")
val paramMap = List("eta" -> 0.05,
"max_depth" -> 6,
"objective" -> "binary:logistic").toMap
val num_trees = 400
val num_cores = 200
val XGBModel = XGBoost.trainWithDataFrame(traindDF,
  paramMap,
  num_trees,
  num_cores,
  useExternalMemory = true)
Then I want to change the y label with some windowing, so that in each group I can predict the y label earlier.
val sum_label = "sum_label"
val label_window_length = 19
val sliding_window_label = Window.partitionBy("id").orderBy(
asc("timestamp")).rowsBetween(0, label_window_length)
val training_source = output1.filter($"datestr" <
end_period1).withColumn(
sum_label, sum($"label").over(sliding_window_label)).drop(
"label").withColumnRenamed(sum_label, "label")
val training1 = training_source.filter(col("label") === 1)
val training0 = training_source.filter(col("label") === 0).sample(false, 0.099685)
val training = training0.unionAll(training1)
val traindDF = training.select("label", "features").toDF("label", "features")
The result has 57M 0's and 214k 1's (roughly the same number of rows, though). There are no NAs in the "label" column of traindDF, and the type is still double (nullable = true). Then XGBoost fails:
Name: ml.dmlc.xgboost4j.java.XGBoostError
Message: XGBoostModel training failed
StackTrace: at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:316)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:293)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:138)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:35)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:169)
I can include the logs as needed. My confusion is that using the windowing function, while literally not changing any other setting, causes XGBoost to fail. I would appreciate any thoughts on this.
It turns out that saving the table traindDF to Hive and reloading it into Spark solves the problem:
traindDF.write.mode("overwrite").saveAsTable("database.tablename")
Then, you can easily load the table:
val traindDF = spark.sql("""select * from database.tablename""")
This trick solved the problem. It seems like the Spark windowing function is a bit unstable here, and saving the result into a Hive table makes it work.
A better way to do this is to use windowing functions in Hive instead of Spark.
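For illustration, a hedged sketch of what that Hive-side query could look like, computing the windowed label directly in SQL. The id, timestamp, label, and features column names are taken from the question; source_table is a placeholder for a Hive table holding the same columns as output1 (the datestr filter from the question would be added as a WHERE clause):
SELECT features,
       SUM(label) OVER (
         PARTITION BY id
         ORDER BY timestamp ASC
         ROWS BETWEEN CURRENT ROW AND 19 FOLLOWING
       ) AS label
FROM source_table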
I want to take a sample from the RDDs in a DStream. Since DStream doesn't have a sample() transformation and it is a sequence of RDDs, I did the following to take a sample from a DStream and apply a word count to it:
from pyspark import SparkContext
from pyspark import SparkConf
# Optionally configure Spark Settings
conf=SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("SRS")
sc = SparkContext('local[3]', conf=conf)
from pyspark.streaming import StreamingContext
streamContext = StreamingContext(sc,3)
lines = streamContext.socketTextStream("localhost", 9000)
def sampleWord(rdd):
    return rdd.sample(false,0.5,10)
lineSample = lines.foreachRDD(sampleWord)
words = lineSample.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word , 1))
wordCount = pairs.reduceByKey(lambda x, y: x + y)
wordCount.pprint(60)
streamContext.start()
streamContext.stop()
With this code, Spark starts but nothing really happens. I don't know why rdd.sample() doesn't work this way. Using foreachRDD, we have access to each RDD in the stream, so I thought we could then use the transformations that are specific to RDDs.
Use transform instead of foreachRDD. Also, there is a typo in your code:
def sampleWord(rdd):
    return rdd.sample(False, 0.5, 10)  # False, not false
lineSample = lines.transform(sampleWord)
words = lineSample.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word , 1))
wordCount = pairs.reduceByKey(lambda x, y: x + y)
wordCount.pprint(60)
Use transform:
lineSample = lines.transform(sampleWord)
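One more hedged note on the original snippet: streamContext.stop() is called immediately after streamContext.start(). Since start() returns right away, the context is shut down before any batch is processed, which also helps explain why nothing is printed. A minimal sketch of the start section, assuming you want the stream to keep running until the job is interrupted:
streamContext.start()
# Block here and keep processing incoming batches until the job is terminated
streamContext.awaitTermination()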