How to Cache an Array of Dataframes/Values in Spark - scala

I am trying to build a large number of random forest models by group using Spark. My approach is to cache a large input data file, split it into pieces based on school_id, cache the individual school input files in memory, run a model on each of them, and then extract the labels and predictions.
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID).cache)
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
val rf = new RandomForestClassifier()
//omit some parameters
val pipeline = new Pipeline().setStages(Array(rf))
pipeline.fit(df)
}
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
(0 until schools.length).foreach { i =>
  val preds = bySchoolArrayModels(i).transform(bySchoolArray(i)).select("prediction", "label")
  preds.write.format("com.databricks.spark.csv").
    option("header", "true").
    save("predictions/pred" + schools(i))
}
The code works fine on a small subset but it takes longer than I expected. It seems to me that every time I run an individual model, Spark reads the entire file, and it takes forever to complete all the model runs. I was wondering whether I did not cache the files correctly or whether something went wrong with the way I coded it.
Any suggestions would be useful. Thanks!

RDDs are immutable, so rdd.cache() returns a new cached RDD. You need to assign the cached RDD to another variable and then reuse that; otherwise you are not using the cached RDD.
val cachedModelInput = model_input.cache()
val schools = cachedModelInput.select("School_ID").distinct.collect.flatMap(_.toSeq)
....
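A quick sketch of how the rest of the question's code would pick up that change, reusing the cached reference throughout (names as in the question):
val bySchoolArray = schools.map(id => cachedModelInput.where($"School_ID" <=> id).cache())
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))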

Related

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(); is it possible?
This is my code using Spark SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way, in general, to do this would be to use the join() method of Dataset. I would urge you to question the need to use only map/filter for this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case it is pretty simple to avoid using join. One possibility would be to issue two separate jobs to Spark:
fetch the zip code(s) that interests you
filter on the census data on that (those) zip code(s)
Step 1: collect the zip codes of interest (I am not sure of the exact syntax as I do not have a Spark shell at hand, but it should be trivial to find the right one).
// note: .map on a DataFrame needs an Encoder, so import spark.implicits._ first
val codes: Seq[String] = zip_codes
  // filter on the city
  .filter(row => row.getAs[String]("City").equals("Inglewood"))
  // filter on the county
  .filter(row => row.getAs[String]("County").equals("Los Angeles"))
  // map each remaining row to its zip code as a String
  .map(row => row.getAs[String]("Zip_Code"))
  // collect on the driver side
  .collect()
Then again, writing it this way instead of using select/where is pretty strange to anyone used to Spark.
Yet, the reason this will work is that we can be sure the zip codes matching a given town and county will be very few, so it is safe to perform a driver-side collection of the result.
Now on to step 2 :
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
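For illustration only, here is one hedged way that map step could be filled in for this particular query (column names taken from the SQL above; needs import spark.implicits._ for the tuple encoder):
census
  .filter(row => codes.contains(row.getAs[String]("Zip_Code")))
  .map(row => (row.getAs[String]("Total_Males"), row.getAs[String]("Total_Females")))
  .toDF("male", "female")
  .distinct()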
What you need is a join; your query roughly translates to:
census.as("census")
  .join(
    broadcast(zip_codes
      .where($"City" === "Inglewood")
      .where($"County" === "Los Angeles")
      .as("zip")),
    Seq("Zip_Code"),
    "inner" // "leftsemi" would also be sufficient
  )
  .select(
    $"census.Total_Males".as("male"),
    $"census.Total_Females".as("female")
  ).distinct()
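For reference, the snippet above assumes the usual imports for broadcast and the $ column syntax:
import org.apache.spark.sql.functions.broadcast
import spark.implicits._ // spark being the SparkSession created in the question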

Spark Multiple Output Paths result in Multiple Input Reads

First, apologies for the title, I wasn't sure how to eloquently describe this succinctly.
I have a spark job that parses logs into JSON, and then using spark-sql converts specific columns into ORC and writes to various paths. For example:
val logs = sc.textFile("s3://raw/logs")
val jsonRows = logs.mapPartitions(partition => {
  partition.map(log => {
    logToJson.parse(log)
  })
})
jsonRows.foreach(r => {
  val contentPath = "s3://content/events/"
  val userPath = "s3://users/events/"
  val contentDf = sqlSession.read.schema(contentSchema).json(r)
  val userDf = sqlSession.read.schema(userSchema).json(r)
  val userDfFiltered = userDf.select("*").where(userDf("type").isin("users"))
  // Save Data
  val contentWriter = contentDf.write.mode("append").format("orc")
  contentWriter.save(contentPath)
  val userWriter = userDfFiltered.write.mode("append").format("orc")
  userWriter.save(userPath)
})
When I wrote this I expected that the parsing would occur one time, and then it would write to the respective locations afterward. However, it seems that it is executing all of the code in the file twice - once for content and once for users. Is this expected? I would prefer that I don't end up transferring the data from S3 and parsing twice, as that is the largest bottleneck. I am attaching an image from the Spark UI to show the duplication of tasks for a single Streaming Window. Thanks for any help you can provide!
Okay, this kind of nested DataFrame usage is a no-go. DataFrames are meant to be a data structure for big datasets that won't fit into normal data structures (like Seq or List) and that need to be processed in a distributed way. A DataFrame is not just another kind of array. What you are attempting to do here is to create a DataFrame per log line, which makes little sense.
As far as I can tell from the (incomplete) code you have posted here, you want to create two new DataFrames from your original input (the logs) which you then want to store in two different locations. Something like this:
val logs = sc.textFile("s3://raw/logs")
val contentPath = "s3://content/events/"
val userPath = "s3://users/events/"
val jsonRows = logs
  .mapPartitions(partition => {
    partition.map(log => logToJson.parse(log))
  })
  .toDF()
  .cache() // Or use persist() if the dataset is larger than will fit in memory
jsonRows
.write
.format("orc")
.save(contentPath)
jsonRows
.filter(col("type").isin("users"))
.write
.format("orc")
.save(userPath)
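If the parsed rows really are too large to cache in memory, a storage level that spills to disk avoids re-reading and re-parsing from S3 (a small sketch on the same jsonRows as above):
import org.apache.spark.storage.StorageLevel
// keep what fits in memory and spill the rest to local disk rather than recomputing
jsonRows.persist(StorageLevel.MEMORY_AND_DISK)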
Hope this helps.

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive
import org.apache.spark.sql.functions.{max, min}
// RedisConfig, RedisEndpoint and writePartitionToRedis are assumed to come from
// the Redis connector / helper code not shown here
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition, the Spark scheduler yields two jobs, one for each data sink (Redis, HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the behaviour to be expected?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; all it is is a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that will need to be evaluated every time you call an action.
If you want to reuse data without needing to re-evaluate an RDD, use .cache() or, preferably, persist(). Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be re-evaluated by later actions.
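Applied to the code above, that means persisting the DataFrame both sinks share and releasing it once both writes are done; roughly (a sketch, not a drop-in replacement):
val aggregatedData = rawData.groupBy("group_key")
  .agg(max("col1").as("max"), min("col2").as("min"))
  .persist() // for DataFrames this defaults to MEMORY_AND_DISK
aggregatedData.foreachPartition { rows => writePartitionToRedis(rows, redisConfig) }
aggregatedData.write.parquet("/data/output.parquet")
aggregatedData.unpersist()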

How do I pass SparkContext to a function from foreach

I need to pass a SparkContext to my function; please suggest how to do that for the scenario below.
I have a Sequence where each element refers to a specific data source, from which we get an RDD and process it. I have defined a function which takes the Spark context and the data source and does the necessary things. I am currently using a while loop, but I would like to do it with foreach or map so that it implies parallel processing. I need the Spark context inside the function, but how can I pass it from the foreach?
Just some SAMPLE code, as I cannot present the actual code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object RoughWork {
def main(args: Array[String]) {
val str = "Hello,hw:How,sr:are,ws:You,re";
val conf = new SparkConf
conf.setMaster("local");
conf.setAppName("app1");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val rdd = sc.parallelize(str.split(":"))
rdd.map(x => {println("==>"+x);passTest(sc, x)}).collect();
}
def passTest(context: SparkContext, input: String) {
val rdd1 = context.parallelize(input.split(","));
rdd1.foreach(println)
}
}
You cannot pass the SparkContext around like that. passTest will be run on the executor(s), while the SparkContext only exists on the driver.
If I had to do a double split like that, one approach would be to use flatMap:
rdd
  .zipWithIndex
  .flatMap(l => {
    val parts = l._1.split(",")
    List.fill(parts.length)(l._2) zip parts
  })
  .countByKey
There may be prettier ways, but basically the idea is that you can use zipWithIndex to keep track of which line an item came from, and then use key-value pair RDD methods to work on your data.
If you have more than one key, or just more structured data in general, you can look into using Spark SQL with DataFrames (or Datasets in later versions) and explode instead of flatMap, as sketched below.
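A hedged sketch of that DataFrame variant, assuming a SparkSession named spark and the same sample string as above (the column names are made up for illustration):
import spark.implicits._
import org.apache.spark.sql.functions.{explode, split}
// one row per line, then one row per comma-separated item, then counts per line
val perItem = rdd.zipWithIndex.toDF("line", "lineId")
  .withColumn("item", explode(split($"line", ",")))
perItem.groupBy("lineId").count().show()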

Using Spark ML's OneHotEncoder on multiple columns

I've been able to create a pipeline that allows me to index multiple string columns at once, but I am getting stuck encoding them because, unlike the indexer, the encoder is not an estimator, so I never call fit, according to the OneHotEncoder example in the docs.
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler,
OneHotEncoder}
import org.apache.spark.ml.Pipeline
val data = sqlContext.read.parquet("s3n://map2-test/forecaster/intermediate_data")
val df = data.select("win","bid_price","domain","size", "form_factor").na.drop()
//indexing columns
val stringColumns = Array("domain","size", "form_factor")
val index_transformers: Array[org.apache.spark.ml.PipelineStage] = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(df)
val df_indexed = index_model.transform(df)
//encoding columns
val indexColumns = df_indexed.columns.filter(x => x contains "index")
val one_hot_encoders: Array[org.apache.spark.ml.PipelineStage] = indexColumns.map(
cname => new OneHotEncoder()
.setInputCol(cname)
.setOutputCol(s"${cname}_vec")
)
val one_hot_pipeline = new Pipeline().setStages(one_hot_encoders)
val df_encoded = one_hot_pipeline.transform(df_indexed)
The OneHotEncoder object doesn't have a fit method, so putting it in the same pipeline as the indexers will not work: it throws an error when I call fit on the pipeline. I also cannot call transform on the pipeline I made from the array of pipeline stages, one_hot_encoders.
I have not found a good solution for using the OneHotEncoder short of creating each one individually and calling transform on that transformer itself for every column I want to encode.
Spark >= 3.0:
In Spark 3.0 OneHotEncoderEstimator has been renamed to OneHotEncoder:
import org.apache.spark.ml.feature.{OneHotEncoder, OneHotEncoderModel}
val encoder = new OneHotEncoder()
.setInputCols(indexColumns)
.setOutputCols(indexColumns map (name => s"${name}_vec"))
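As with the 2.3 estimator below, the 3.0 encoder is fitted first and the resulting model performs the transform:
encoder.fit(df_indexed).transform(df_indexed)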
Spark >= 2.3
Spark 2.3 introduced the new classes OneHotEncoderEstimator and OneHotEncoderModel, which require fitting even when used outside a Pipeline and operate on multiple columns at the same time.
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, OneHotEncoderModel}
val encoder = new OneHotEncoderEstimator()
.setInputCols(indexColumns)
.setOutputCols(indexColumns map (name => s"${name}_vec"))
encoder.fit(df_indexed).transform(df_indexed)
Spark < 2.3
Even if the transformers you use don't require fitting, you have to use the fit method to create a PipelineModel, which can then be used to transform data.
one_hot_pipeline.fit(df_indexed).transform(df_indexed)
On a side note you can combine indexing and encoding into a single Pipeline:
val pipeline = new Pipeline()
.setStages(index_transformers ++ one_hot_encoders)
val model = pipeline.fit(df)
model.transform(df)
Edit:
The error you see means that one of your columns contains an empty String. It is accepted by the indexer but cannot be used for encoding. Depending on your requirements you can drop these rows or use a dummy label (see the sketch below). Unfortunately you cannot use NULLs until SPARK-11569 is resolved.
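A small sketch of those two options, using "domain" from the question as the offending column (the dummy label value is arbitrary):
import org.apache.spark.sql.functions.col
// option 1: drop rows whose value is an empty String
val dfNoEmpty = df.filter(col("domain").notEqual(""))
// option 2: replace the empty String with a dummy label before indexing
val dfDummy = df.na.replace("domain", Map("" -> "__EMPTY__"))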