Spark query difference in performance on same kind of data - scala

I am new to Spark, so I was experimenting with something like this:
val values1= sparkSession.range(1,1000000)
val values2= sparkSession.range(1,1000000)
val values3= sparkSession.range(0,100000,2)
val values4= sparkSession.range(0,100000,2)
private val frame1: DataFrame = values1.join(values3,"id")
frame1.count()
private val frame3: DataFrame = values2.join(values4,"id")
frame3.count()
My question is: why does the later task take so much less time, even though I am using different data (which might be identical in content)?

Related

How can I introspect and pre-load all collections from MongoDB into the Spark SQL catalog?

When learning Spark SQL, I've been using the following approach to register a collection into the Spark SQL catalog and query it.
val persons: Seq[MongoPerson] = Seq(MongoPerson("John", "Doe"))
sqlContext.createDataset(persons)
.write
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.mode("append")
.save()
sqlContext.read
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.load()
.as[Peeps]
.show()
However, when querying it, it seems that I need to register it as a temporary view in order to access it using SparkSQL.
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:37017/test", "collection" -> "morepeeps"), Some(ReadConfig(spark)))
val people: DataFrame = MongoSpark.load[Peeps](spark, readConfig)
people.show()
people.createOrReplaceTempView("peeps")
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
sqlContext.sql("SELECT * FROM peeps")
.as[Peeps]
.show()
For a database with quite a few collections, is there a way to hydrate the Spark SQL schema catalog so that this op isn't so verbose?
So there's a couple things going on. First of all, simply loading the Dataset using sqlContext.read will not register it with SparkSQL catalog. The end of the function chain you have in your first code sample returns a Dataset at .as[Peeps]. You need to tell Spark that you want to use it as a view.
Depending on what you're doing with it, I might recommend leaning on the Scala Dataset API rather than SparkSQL. However, if SparkSQL is absolutely essential, you can likely speed things up programmatically.
In my experience, you'll need to run that boilerplate on each table you want to import. Fortunately, Scala is a proper programming language, so we can cut down on code duplication substantially by using a function, and calling it as such:
val MongoDbUri: String = "mongodb://localhost:37017/test" // store this as a constant somewhere
// T must be passed in as some case class
// Note, you can also add a second parameter to change the view name if so desired
def loadTableAsView[T <: Product : TypeTag](table: String)(implicit spark: SparkSession): Dataset[T] = {
import spark.implicits._ // brings the Encoder[T] required by df.as[T] into scope
val configMap = Map(
"uri" -> MongoDbUri,
"collection" -> table
)
val readConfig = ReadConfig(configMap, Some(ReadConfig(spark)))
val df: DataFrame = MongoSpark.load[T](spark, readConfig)
df.createOrReplaceTempView(table)
df.as[T]
}
And to call it:
// Note: if spark is defined implicitly, e.g. implicit val spark: SparkSession = spark, you won't need to pass it explicitly
val peepsDS: Dataset[Peeps] = loadTableAsView[Peeps]("peeps")(spark)
val chocolatesDS: Dataset[Chocolates] = loadTableAsView[Chocolates]("chocolates")(spark)
val candiesDS: Dataset[Candies] = loadTableAsView[Candies]("candies")(spark)
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
peepsDS.show()
chocolatesDS.show()
candiesDS.show()
This will substantially cut down your boilerplate, and also allow you to more easily write some tests for that repeated bit of code. There's also probably a way to create a map of table names to case classes that you can then iterate over, but I don't have an IDE handy to test it out.
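As a rough sketch of the "iterate over the table names" idea: if typed Datasets are not required, a plain list of collection names is enough to register the views in a loop. The names `ViewRegistration`, `registerViews` and `load` are hypothetical and not part of Spark or the MongoDB connector; `load` stands in for whatever function turns a collection name into a DataFrame (for example, a wrapper around the MongoSpark.load call above).

```scala
import org.apache.spark.sql.DataFrame

object ViewRegistration {
  // Hypothetical helper: registers one temporary view per collection name.
  // `load` is an assumption: any function that maps a collection name to a DataFrame.
  def registerViews(collections: Seq[String])(load: String => DataFrame): Unit =
    collections.foreach { name =>
      // The view is named after the collection, as in the answer above.
      load(name).createOrReplaceTempView(name)
    }
}
```

After a call such as ViewRegistration.registerViews(Seq("peeps", "chocolates", "candies"))(load), each name should be queryable through spark.sql. The trade-off is that the views are untyped, so .as[...] still has to be applied per case class where a typed Dataset is needed.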

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(). Is that possible?
This is my code using Spark SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way to do this in general would be to use the join() method of `Dataset`. I would urge you to question the need to use only map/filter to do this, as it is not intuitive and will probably confuse any experienced Spark developer (or simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case, it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to Spark:
fetch the zip code(s) that interests you
filter on the census data on that (those) zip code(s)
Step 1: collect the zip codes of interest (I am not sure of the exact syntax as I do not have a Spark shell at hand, but it should be trivial to find the right one).
var codes: Seq[String] = zip_codes
// filter on the city
.filter(row => row.getAs[String]("City").equals("Inglewood"))
// filter on the county
.filter(row => row.getAs[String]("County").equals("Los Angeles"))
// map to zip code as a String
.map(row => row.getAs[String]("Zip_Code"))
.as[String]
// Collect on the driver side
.collect()
Then again, writing it this way instead of using select/where looks pretty strange to anyone used to Spark.
Yet the reason this will work is that we can be sure the set of zip codes matching a given town and county will be really small, so it is safe to collect the result on the driver side.
Now on to step 2:
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
What you need is a join. Your query roughly translates to:
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

Partitioning the RDF datasets by subject in Spark Scala

I am a newbie to functional programming languages and I am trying to learn Spark with Scala.
The goal is to partition the RDF dataset by subject.
The code is below:
object SimpleApp {
def main(args: Array[String]): Unit = {
val sparkConf =
new SparkConf().
setAppName("SimpleApp").
setMaster("local[2]").
set("spark.executor.memory", "1g")
val sc = new SparkContext(sparkConf)
val data = sc.textFile("/home/hduser/Bureau/11.txt")
val subject = data.map(_.split("\\s+")(0)).distinct.collect
}
}
So I manage to recover the subjects, but the result is an array of strings. Also, for mapPartitions(func) and mapPartitionsWithIndex(func), func needs to operate on an iterator.
So how do I proceed?
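Partitioning your RDD by subject would probably best be done using a HashPartitioner. The HashPartitioner works on an RDD of key-value pairs and assigns each record to a partition based on the hash of its key, so all records with the same key end up in the same partition, e.g.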
myPairRDD:
("sub1", "desc1")
("sub2", "desc2")
("sub1", "desc3")
("sub2", "desc4")
myPairRDD.partitionBy(new HashPartitioner(2))
becomes:
partition 1:
("sub1", "desc1")
("sub1", "desc3")
partition 2:
("sub2", "desc2")
("sub2", "desc4")
Therefore, your subjects RDD should probably be created more like this (note the extra brackets, which create a tuple and therefore a pair RDD):
val subjectTuples = data.map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
See the diagrams here for more info: https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/
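To make the assignment rule concrete, here is a minimal plain-Scala sketch (no Spark required) of how a hash partitioner picks a partition: the key's hashCode modulo the number of partitions, adjusted so it is never negative. `HashPartitionSketch` and `partitionFor` are hypothetical names used only for illustration; in a real job you would keep using partitionBy(new HashPartitioner(n)).

```scala
object HashPartitionSketch {
  // Illustrative re-implementation of the hash partitioning rule:
  // null keys go to partition 0, everything else to hashCode mod numPartitions,
  // shifted into the range [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("sub1", "desc1"), ("sub2", "desc2"), ("sub1", "desc3"), ("sub2", "desc4"))
    // Group the records by the partition their subject hashes to.
    val byPartition = pairs.groupBy { case (subject, _) => partitionFor(subject, 2) }
    byPartition.toSeq.sortBy(_._1).foreach { case (partition, records) =>
      println(s"partition $partition: ${records.mkString(", ")}")
    }
  }
}
```

Note that the guarantee is only that equal keys land in the same partition. Two different subjects can still share a partition when their hashes coincide modulo the partition count, so the number of partitions does not have to match the number of subjects.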

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition the spark scheduler yields two jobs for each data sink (Redis, HDFS/Parquet). The problem is the second job is also performing the hive query and doubling the work. I assumed both write operations would share the data from aggregatedData stage. Is something wrong or is it behaviour to be expected?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; it is just a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that need to be evaluated every time you call an action.
If you want to reuse data without needing to re-evaluate an RDD, use .cache() or preferably persist. Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be re-evaluated in future iterations.
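A minimal sketch of the persist suggestion, under some assumptions: the Hive query is replaced by a small in-memory DataFrame, and the two sinks are replaced by two simple actions (count and collect). `PersistSketch` and `run` are hypothetical names; only the persist/unpersist calls are the point here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, min}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs two actions against one persisted aggregation and returns both row counts.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical stand-in for the result of the Hive query.
    val rawData = Seq(("a", 1, 10), ("a", 3, 5), ("b", 2, 7))
      .toDF("group_key", "col1", "col2")

    val aggregatedData: DataFrame = rawData
      .groupBy("group_key")
      .agg(max("col1").as("max"), min("col2").as("min"))
      .persist(StorageLevel.MEMORY_AND_DISK) // only marks the DataFrame; nothing runs yet

    // First action: evaluates the aggregation and stores its partitions.
    val first = aggregatedData.count()
    // Second action: can read the stored partitions instead of re-running the plan.
    val second = aggregatedData.collect().length.toLong

    // Release the stored partitions once both actions are done.
    aggregatedData.unpersist()
    (first, second)
  }
}
```

In the job from the question, the equivalent change would be to persist aggregatedData before the Redis and Parquet writes. Whether the second job then skips the upstream work also depends on the stored partitions still being available, for example not having been evicted for lack of memory.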

How to Cache an Array of Dataframes/Values in Spark

I am trying to build a large number of random forest models by group using Spark. My approach is to cache a large input data file, split it into pieces based on the School_ID, cache each individual school's input data in memory, run a model on each of them, and then extract the label and predictions.
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID).cache)
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
val rf = new RandomForestClassifier()
//omit some parameters
val pipeline = new Pipeline().setStages(Array(rf))
pipeline.fit(df)
}
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
schools.indices.foreach { i =>
val preds = bySchoolArrayModels(i).transform(bySchoolArray(i)).select("prediction", "label")
preds.write.format("com.databricks.spark.csv").
option("header","true").
save("predictions/pred"+schools(i))
}
The code works fine on a small subset but it takes longer than I expected. It seems to me every time I run an individual model, Spark reads the entire file and it takes forever to complete all the model runs. I was wondering whether I did not cache the files correctly or anything went wrong with the way I code it.
Any suggestions would be useful. Thanks!
RDDs are immutable, and rdd.cache() returns the RDD that has been marked for caching. Assign that return value to another variable and reuse it; otherwise it is easy to end up not using the cached RDD.
val cachedModelInput = model_input.cache()
val schools = cachedModelInput.select("School_ID").distinct.collect.flatMap(_.toSeq)
....