Same Spark DataFrame created in 2 different ways gets different execution times for the same query - mongodb

I created the same Spark DataFrame in two ways in order to run Spark SQL on it.
1. I read the data from a .csv file straight into a DataFrame in the Spark shell using the following command:
val df=spark.read.option("header",true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
2. I created a collection in MongoDB from the same .csv file and then, using the Spark-MongoDB Connector, I imported it into Spark as an RDD, which I then turned into a DataFrame using the following commands (in cmd / the Spark shell):
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)
val df = rdd.toDF()
After that, in either case, I created a view of the DataFrame using the following command:
df.createOrReplaceTempView("sales")
Then I ran the same queries on both DataFrames and the execution times were very different. In the following example, the query on the DataFrame created the 1st way ran 4-5 times faster than on the one created the 2nd way.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
The collection has 1 million entries with the following structure:
id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13
In my case I have to get the DataFrame from MongoDB through the connector, so why is this happening?

Spark is optimized to perform better on DataFrames. In your second approach you first read an RDD and then convert it to a DataFrame, which definitely has a cost.
Instead, try to read the data from MongoDB directly as a DataFrame. You can refer to the following syntax:
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/mydb.mycoll").load()
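For the exact spark-shell session shown in the question (connector 3.0.1 with spark.mongodb.input.uri already set via --conf), a minimal Scala sketch of the direct DataFrame load could look like this:
import com.mongodb.spark._
// Loads straight into a DataFrame; the connector picks up spark.mongodb.input.uri from the conf.
val df = MongoSpark.load(spark)   // spark is the SparkSession that spark-shell provides
df.createOrReplaceTempView("sales")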

The answer is that in the second case the extra time is spent transferring the data from MongoDB to Spark before the query can be executed.
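If you want to confirm that the transfer is where the time goes, one quick check (a sketch based on the session above, not part of the original answer) is to cache and materialize the Mongo-backed DataFrame once and then re-run the timed query; the second run no longer has to pull the data from MongoDB:
// Materialize the Mongo-backed DataFrame in Spark's cache once...
df.cache()
df.count()
// ...then the same timed query should run much closer to the CSV-backed version.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())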

Related

Skew Data with countDistinct

I have a PySpark DataFrame with 3 columns: 'client', 'product', 'date'. I want to run a groupBy operation:
df.groupBy("product", "date").agg(F.countDistinct("client"))
So I want to count the number of clients that bought each product on each day. This causes heavily skewed data (in fact, it fails with a memory error). I have been reading about salting techniques. As I understand it, salting can be used with 'sum' or 'count' by adding a new column to the groupBy and performing a second aggregation, but I do not see how to apply it here because of the countDistinct aggregation method.
How can I apply it in this case?
I would recommend not using countDistinct at all here and instead achieving what you want with two aggregations in a row, especially since your data is skewed. It can look like the following:
import pyspark.sql.functions as F

new_df = (df
    .groupBy("product", "date", "client")
    .agg({})  # getting unique ("product", "date", "client") tuples
    .groupBy("product", "date")
    .agg(F.count("*").alias("clients"))
)
The first aggregation here ensures that you have a DataFrame with one row per distinct ("product", "date", "client") tuple; the second counts the number of clients for each ("product", "date") pair. This way you no longer need to worry about skew, since Spark knows how to do partial (map-side) aggregations for you (as opposed to countDistinct, which is forced to send all the individual "client" values for each ("product", "date") pair to one node).

Not able to Show/Write from spark DF read using mongo spark connector.

I am trying to read huge, complex documents from MongoDB into a Spark DataFrame. When I convert the db to JSON first, it works. But if I read directly from MongoDB I get the following error: Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='NaN'})
I am able to read it into a DF and do all the processing; the error only appears when I try to show it or write it to JSON/CSV.
at mongo$.main(mongo.scala:270) – df.show()
Using sbt for dependencies.
mongo-spark-connector: 2.2.1
Scala version: 2.11.6, Spark version: 2.3.0/2.2.0
As the error describes, the problem is a String value of "NaN" in a field that the Spark schema has inferred as Double type.
In other words, amongst all the documents there is a value for that field that is not a Double, for example:
{_id:1, foo: 100.00}
{_id:2, foo: 101.00}
{_id:3, foo: 102.00}
{_id:4, foo: 103.00}
...
{_id:99, foo: "NaN"}
{_id:100, foo: 200.00}
As you may know, "NaN" means "Not a Number". Most likely, whichever process created the document failed to insert a Double and defaulted to "NaN" instead.
There are a few ways to solve this, depending on your use case:
Utilise MongoDB Schema Validation to ensure that the values within the collection have the expected type on insert.
Perform a transformation to clean the data: query the collection to find the offending documents, i.e. {foo: "NaN"}, and update the field with a desired value, e.g. 0 (see the sketch after this list).
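For the second option, a minimal Scala sketch using the synchronous MongoDB Java driver (3.7+) could look like the following; the field name foo and the database/collection names are hypothetical placeholders taken from the example above:
import com.mongodb.client.MongoClients
import com.mongodb.client.model.Updates
import org.bson.Document

// One-off cleanup: replace every string "NaN" in field foo with a real numeric value.
val client = MongoClients.create("mongodb://127.0.0.1")
val coll = client.getDatabase("mydb").getCollection("mycoll")
val result = coll.updateMany(new Document("foo", "NaN"), Updates.set("foo", 0.0))
println(s"Cleaned ${result.getModifiedCount} documents")
client.close()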
I had a similar conversion problem. The Spark connector samples 1,000 documents by default to infer the schema. In my case, 1,000 documents were not enough to cover all cases, so I increased the sample size and this solved the problem (see the Mongo documentation).
Code:
import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "database" -> "myDatabase",
  "collection" -> "myCollection",
  "sampleSize" -> "100000"), Some(ReadConfig(sc)))
val df = sc.loadFromMongoDB(readConfig).toDF()

Updating Data in MongoDB from Apache Spark Streaming

I am using the Scala API of Apache Spark Streaming to read from a Kafka server with a window size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be reduced by key and window and saved to Mongo.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double, y: Double) => (x + y),
  Seconds(60), Seconds(60))

reduced.foreachRDD({ rdd =>
  import spark.implicits._
  val aggregatedPower = rdd.map({ x => MyJsonObj(x._2, x._1) }).toDF()
  aggregatedPower.write.mode("append").mongo()
})
This works so far; however, it is possible that some messages arrive with a delay of a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp of the current window is already in the database and, if so, update that value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria have no match and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field the data will be upserted, which means any existing documents with the same _id value will be updated and new documents whose _id value is not yet in the collection will be inserted.
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query where timestamp matches the current window:
import org.bson.Document
import com.mongodb.spark._

val mongoRdd = MongoSpark.load(sc)   // the existing collection, loaded as a MongoRDD
val aggregatedRdd = mongoRdd.withPipeline(Seq(
  Document.parse("{ $match: { timestamp : '2018-03-14 15:00' } }")
))
Modify the value of the power field, and then write with mode("append").
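Putting the pieces together, a rough Scala sketch of this read-modify-write inside foreachRDD might look like the following; it assumes the collection stores the timestamp and power fields shown above and that, from now on, _id is set to the timestamp so the write upserts instead of duplicating:
import com.mongodb.spark._
import com.mongodb.spark.sql._
import org.apache.spark.sql.functions.{coalesce, col, lit}

reduced.foreachRDD { rdd =>
  import spark.implicits._
  val batch = rdd.toDF("timestamp", "power")
  // Existing totals for the same timestamps, read back through the connector.
  val existing = MongoSpark.load(spark).select(col("timestamp"), col("power").as("oldPower"))
  val merged = batch
    .join(existing, Seq("timestamp"), "left_outer")
    .withColumn("power", col("power") + coalesce(col("oldPower"), lit(0.0)))
    .withColumn("_id", col("timestamp"))   // same _id => the connector updates instead of inserting a duplicate
    .select("_id", "timestamp", "power")
  merged.write.mode("append").mongo()
}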
You may also find the blog post Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.

Use more than one collect_list in one query in Spark SQL

I have the following dataframe data:
root
|-- userId: string
|-- product: string
|-- rating: double
and the following query:
val result = sqlContext.sql("select userId, collect_list(product), collect_list(rating) from data group by userId")
My question is: do product and rating in the aggregated arrays match each other? That is, do the product and the rating from the same row end up at the same index in their respective aggregated arrays?
Update:
Starting from Spark 2.0.0, one can use collect_list on a struct type, so we can do a single collect_list on a combined column. But in pre-2.0.0 versions, collect_list only works on primitive types.
I believe there is no explicit guarantee that all the arrays will have the same order. Spark SQL uses multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) results in a nondeterministic order, it is theoretically possible that the orders will differ.
So while it should work in practice, it could be risky and introduce some hard-to-detect bugs.
If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:
SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
If you use an earlier version, you can try explicit partitioning and ordering:
WITH tmp AS (
SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId

Spark, mapPartitions, network connection is closed before map operation is finished

I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType",
      record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that client.close() is called before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would create a huge number of connections. Is it possible to make sure that client.close() is called after the flatMap operation has finished?
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch-for-Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency (an sbt example follows the code below).
import org.elasticsearch.spark._

// This returns a pair RDD of (_id, Map of all key/value pairs for all fields)
val esRdd = sc.esRDD("index/indexType")
// Convert your RDD of records to a pair RDD of (id, record), since we want to join on the id
val inputPairRdd = input.map(record => (record.id, record))
// Join the two RDDs on the id; the result pairs each id with (inputRddRecord, esRddRecord)
val joined = inputPairRdd.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2))
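For the dependency step mentioned above, the sbt line might look roughly like this; the version is a placeholder and should be matched to your Spark, Scala and Elasticsearch versions:
// build.sbt: elasticsearch-spark-20 targets Spark 2.x; pick the release matching your cluster
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "7.10.2"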
Hope this helps.
PS: This still does not alleviate the problem of the lack of co-location (i.e. each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time the ES index was created.