Spark DataFrame repartition: number of partitions not preserved - Scala

According to the docs of Spark 1.6.3, repartition(partitionExprs: Column*) should preserve the number of partitions in the resulting dataframe:
Returns a new DataFrame partitioned by the given partitioning
expressions preserving the existing number of partitions
(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)
But the following example seems to show something else (note that spark-master is local[4] in my case):
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x")
myDF.rdd.getNumPartitions // 4
myDF.repartition($"x").rdd.getNumPartitions // 200 !
How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA)
Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but why the docs say something different from what I observe in my example.

This is related to the Tungsten project, which is enabled in Spark. It uses hardware optimization and hash partitioning, which triggers a shuffle operation. By default, spark.sql.shuffle.partitions is set to 200. You can verify this by calling explain on your DataFrame before and after repartitioning:
myDF.explain
val repartitionedDF = myDF.repartition($"x")
repartitionedDF.explain
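If you want repartition($"x") to produce a specific number of partitions, you can lower that setting before repartitioning. A minimal sketch, assuming the same local[4] setup as in the question:
// Lower the shuffle partition count so the column-based repartition
// produces 4 partitions instead of the default 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions // 4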

Related

How to extract the field values from a dataset in spark using scala?

I have a dataframe which reads streams from Kafka as a source, and it is then converted to a dataset after applying a schema. Now, how do I get a particular field value from the dataset to work with it?
case class Fruitdata(id:Int, name:String, color:String, price:Int)
//say this function reads streams from kafka and gives me the dataframe
val df = readFromKafka(sparkSession,inputTopic)
//say this converts dataframe to a dataset with schema defined accordingly
val ds: Dataset[Fruitdata] = getDataSet[Fruitdata](df,schema)
//and say the incoming stream data is -
//"{"id":1,"name":"Grapes","color":"Green","price":15}"
//Now how to get a particular field like name, price and so on
//this doesn't work, it says "Queries with streaming sources must be executed with writeStream.start()"
ds.first()
//same here
ds.show
//also, can I get the complete string as input? this gives me Dataset[String]
val ds2 = ds.flatMap((f: Fruitdata)=>List(s"${f.id},${f.name}"))
I think it's because you're trying to read from Kafka.
When you run with Spark Streaming, some commands cannot be used because they relate to streaming sources. For example, if you are reading from Kafka there is nothing like first, because the input arrives as micro-batches and first would only refer to a single micro-batch. Instead, try something like the "console" sink to output your records to the console. Also make sure to read a few sample records rather than the real Kafka feed.
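A minimal sketch, assuming ds is the streaming Dataset[Fruitdata] from the question: start a streaming query with a sink instead of calling first() or show().
val query = ds
  .select("name", "price")   // pick the fields you want to work with
  .writeStream
  .format("console")         // console sink prints each micro-batch
  .outputMode("append")
  .start()

query.awaitTermination()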

How are RDDs created in Structured Streaming in Spark?

How are RDDs created in Structured Streaming in Spark? In a DStream we have one per batch, but is the RDD created as soon as data is available, or when the trigger happens? And how are the RDDs physically distributed across executors?
Internally, a DStream is represented as a sequence of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval
In the word count example:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
val ssc = new StreamingContext(sc, Seconds(2)) // 2-second batch interval
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")) // any input DStream works here
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
So, an RDD is created on the driver for the blocks produced during the batchInterval. The blocks generated during the batchInterval are the partitions of the RDD, and each partition is a task in Spark. blockInterval == batchInterval would mean that a single partition is created, and it is probably processed locally.
DStreams are executed lazily by the output operations, just like RDDs are executed lazily by RDD actions.
A DStream executes as soon as the trigger happens: if your batch interval is 2 seconds, a job is triggered every 2 seconds. The triggering point is not data availability but the batch duration; if data is present at that time, the DStream contains it, otherwise the RDD will be empty.
A DStream is in fact a sequence of RDDs, as the DStream source code shows:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
The number of executors depends on the partitions as well as on the configuration provided.
There are normally two types of allocation in the configuration: static allocation and dynamic allocation.
You can read about them here:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
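For illustration, a hedged sketch of the two allocation styles as SparkConf settings (the values here are placeholders, not recommendations):
import org.apache.spark.SparkConf

// Static allocation: a fixed number of executors for the whole application
val conf = new SparkConf()
  .setAppName("StreamingApp")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")

// Dynamic allocation: Spark scales executors up and down with the workload
// conf.set("spark.dynamicAllocation.enabled", "true")
//     .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation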

Saving a streaming DataFrame to MongoDB using MongoSpark

Some backstory: For a homework project for university we are tasked to implement an algorithm of choice in a scalable way. We chose to use Scala, Spark, MongoDB and Kafka as these were recommended during the course. To read data from our MongoDB, we opted to use MongoSpark as it allows for easy and scalable operations on data. We also use Kafka to simulate streaming from an outside source. We need to perform multiple operations on every entry that Kafka produces. The issue comes from saving the result of this data back to MongoDB.
We have the following code:
val streamDF = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "aTopic")
  .load
  .selectExpr("CAST(value AS STRING)")
From here on, we're at a loss. We cannot use a .map as MongoSpark only operates on DataFrames, Datasets and RDDs and is not serializable, and using MongoSpark.save does not work on streaming DataFrames like the one specified. We also cannot use the default MongoDB Scala driver as this conflicts with MongoSpark upon adding the dependency. Note that the rest of the algorithm heavily relies on joins and groupbys.
How can we get the data from here to our MongoDB?
Edit:
For an easy to reproduce example, one could try the following:
val streamDF = sparkSession
  .readStream
  .format("rate")
  .load
Adding a .write to that, which is required for MongoSpark.save, will cause an exception because write cannot be called on a streaming DataFrame.
The save() method of the MongoDB Connector for Spark accepts an RDD (as of the current version, 2.2). When utilising a DStream with MongoSpark, you need to fetch the 'batches' of RDDs in the stream and write them:
wordCounts.foreachRDD({ rdd =>
  import spark.implicits._ // spark is the active SparkSession
  // .mongo() on DataFrameWriter comes from the connector's implicits:
  // import com.mongodb.spark.sql._
  val wordCountsDF = rdd.map({ case (word: String, count: Int) =>
    WordCount(word, count) }).toDF()
  wordCountsDF.write.mode("append").mongo()
})
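Equivalently, a sketch (assuming an output URI is configured in the SparkConf via spark.mongodb.output.uri) that converts each batch to BSON documents and passes the RDD straight to the connector's save():
import com.mongodb.spark.MongoSpark
import org.bson.Document

wordCounts.foreachRDD { rdd =>
  // turn each (word, count) pair into a BSON document and save the batch
  val docs = rdd.map { case (word, count) =>
    new Document("word", word).append("count", Integer.valueOf(count))
  }
  MongoSpark.save(docs)
}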
See also:
Design Patterns for using foreachRDD
MongoDB: Spark Streaming

Stream the most recent data in cassandra with spark streaming

I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)
dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}
ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have spark streaming read only the latest data, ie, the data added after its previous read.
How can I achieve this? I tried to Google this but got very little documentation regarding this.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting spark streaming and cassandra and this question is about streaming only the latest data. BTW, streaming from cassandra IS possible by using the code I provided. However, it takes the entire table every time and not just the latest data.
There is some low-level work planned in Cassandra that will allow notifying external systems (an indexer, a Spark stream, etc.) of new mutations as they arrive; read this: https://issues.apache.org/jira/browse/CASSANDRA-8844

Apache spark + RDD + persist() doubts

I am new to Apache Spark and am using the Scala API. I have 2 questions regarding RDDs.
How can I persist only some partitions of an RDD, instead of the entire RDD? (The core RDD implementation provides rdd.persist() and rdd.cache(), but I do not want to persist the entire RDD; I am interested in persisting only some partitions.)
How can I create one empty partition when creating each RDD? (I am using the repartition and textFile transformations. In these cases I get the expected number of partitions, but I also want one empty partition in each RDD.)
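For reference, a minimal sketch of the setup described above (the path and partition count are placeholders): persist()/cache() always mark the whole RDD, and repartition/textFile only control the total partition count.
// illustrative only: placeholder path and partition count
val rdd = sc.textFile("hdfs:///data/input.txt", 4) // minPartitions hint of 4
val repartitioned = rdd.repartition(4)             // exactly 4 partitions
repartitioned.persist()                            // marks every partition for caching
repartitioned.count()                              // materialises (and caches) the whole RDD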
Any help is appreciated.
Thanks in advance