Timing parallel API calls with spark.time - Scala

I am running the code below on my 8 GB laptop in IntelliJ. I am calling 3 APIs in parallel with a map function and the scalaj library, and measuring the time taken by each API call as follows:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
//for each API call,execute them in different executor and collate data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
When spark.time is executed, I expected 3 timings but it gives me 6:
Time taken: 14945 ms
Time taken: 21773 ms
Time taken: 22446 ms
Time taken: 6438 ms
Time taken: 6877 ms
Time taken: 7107 ms
What am I missing here, and are the API calls really made in parallel?

Actually that piece of code alone won't execute any spark.time at all: the map function is lazy, so it won't be executed until you perform an action on the RDD. You should also consider that if you don't persist your transformed RDD, it will re-compute all transformations for every action. What this means is that if you are doing something like this:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
//for each API call,execute them in different executor and collate data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
val c = actual_data.count()
actual_data.collect()
There will be 6 executions of what is defined inside the map (two for each element in the RDD: one for the count and one for the collect). To avoid this re-computation you can cache or persist the RDD as follows:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
//for each API call,execute them in different executor and collate data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x))).cache()
val c = actual_data.count()
actual_data.collect()
In this second example you will only see 3 timing logs instead of 6.
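If the fetched responses are too large to keep purely in memory, persist with an explicit storage level is the alternative to cache mentioned above; a minimal sketch, reusing the asker's HTTPRequestParallel.ds helper (assumed to exist as in the question):
import org.apache.spark.storage.StorageLevel
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// Same transformation as above, but cached partitions may spill to disk.
// HTTPRequestParallel.ds is the asker's helper and is assumed to exist.
val actual_data = urls
  .map(x => spark.time(HTTPRequestParallel.ds(x)))
  .persist(StorageLevel.MEMORY_AND_DISK)
val c = actual_data.count() // triggers the 3 API calls and fills the cache
actual_data.collect()       // reuses the persisted results, no re-computation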

Related

Spark Dataframe changing through execution

I'm fairly new to Spark, so most likely I have a huge gap in my understanding. Apologies in advance if what you see here is very silly. So, what I'm trying to achieve is:
Get a set of rows from a table in Hive (let's call it T_A) and save them in a DataFrame (let's call it DF_A). This is done.
Get extra information from another Hive table (T_B) and join it with DF_A to get a new Dataframe (DF_B). And then cache it. This is also done.
val DF_A = sparkSession.sql("select * from T_A where whatever=something").toDF()
val extraData = sparkSession.sql("select * from T_B where whatever=something").toDF()
val DF_B = DF_A.join(extraData,
  col("something_else") === col("other_thing"), "left"
).toDF().cache()
Now, this is me assuming Spark + Hive works similarly to a regular Java app + SQL, which is where I might need a hard course correction.
Here, I attempt to store, in one of the Hive tables I used before (T_B), partitioned by column X, the N rows I transformed from DF_B (Tr1(DF_B)). I use:
val DF_C = DF_B.map(row => {
  Tr1(row)
}).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
After saving it to this table, I want to reuse the information from DF_B (not the transformed data reinserted in T_B, but the joined data in DF_B based on the previous state of T_B) to make a second transformation over it (Tr2(DF_B)).
I want to update the same N rows written in T_B with the data transformed by previous step, using an "INSERT OVERWRITE" operation and the same partition column X.
val DF_D = DF_B.map(row => {
  Tr2(row)
}).toDF()
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")
What I expect:
T_B having N rows.
Having DF_B unchanged, with N rows.
What is happening:
DF_B having 3*N rows.
T_B having 3*N rows.
Now, after some debugging, I found that DF_B has 3*N rows after the DF_C write finishes. So DF_D will have 3*N rows too, and that will cause T_B to have 3*N rows as well.
So, my question is... Is there a way to retain the original DF_B data and use it for the second map transformation, since it relies on the original DF_B state for the transformation process? Is there a reference somewhere I can read to know why this happens?
EDIT: I don't know if this is useful information, but I log the count of records before and after doing the first write, and I get the following.
val DF_C = DF_B.map(row => {
  Tr1(row)
}).toDF()
Logger.info("DF_C.count {} - DF_B.count {}"...
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
Logger.info("DF_C.count {} - DF_B.count {}"...
With persist(MEMORY_AND_DISK) or no persist at all (instead of cache), and 3 test rows, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 3 - DF_B.count 9
With cache, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 9 - DF_B.count 9
Any idea?
Thank you so much.
In Spark, execution happens in a lazy way, only when an action gets called.
So when you call an action two times on the same dataframe (in your case DF_B), that dataframe (DF_B) will be created and transformed two times from the start at execution time.
So try to persist your dataframe DF_B before calling the first action; then you can use the same DF for both Tr1 and Tr2.
After persisting, the dataframe will be stored in memory/disk and can be reused multiple times.
You can learn more about persistence here
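A minimal sketch of that suggestion, reusing the question's hypothetical Tr1/Tr2 helpers; the explicit count() before the first write is an assumption to make sure the cache is actually filled before T_B is overwritten:
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel
// Persist DF_B and materialize it before anything is written back to T_B,
// so the later maps reuse the stored rows instead of re-reading the overwritten table.
val dfBPersisted = DF_B.persist(StorageLevel.MEMORY_AND_DISK)
dfBPersisted.count() // action that actually fills the cache
val DF_C = dfBPersisted.map(row => Tr1(row)).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
val DF_D = dfBPersisted.map(row => Tr2(row)).toDF()
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")
dfBPersisted.unpersist()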

Why clustering doesn't seem to work in Spark's cogroup function

I have two Hive clustered tables, t1 and t2:
CREATE EXTERNAL TABLE `t1`(
  `t1_req_id` string,
  ...)
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
-- t2 looks similar, with the same number of buckets
The code looks like the following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2 = spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))
val outRdd = t1.cogroup(t2)
  .flatMap { coGroupRes =>
    val key = coGroupRes._1
    val value: (Iterable[T1], Iterable[T2]) = coGroupRes._2
    val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
    t3List
  }
outRdd.write....
I make sure that both the t1 and t2 tables have the same number of partitions, and on spark-submit the
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags are set.
But the Spark DAG doesn't show any impact of the clustering. It seems there is still a full data shuffle.
What am I missing? Are there any other configurations or tunings? How can I make sure there is no full data shuffle?
My spark version is 2.3.1
And it shouldn't show.
Any logical optimizations are limited to the DataFrame API. Once you push data to the black-box functional Dataset API (see Spark 2.0 Dataset vs DataFrame) and later to the RDD API, no more information is pushed back to the optimizer.
You could partially utilize bucketing by performing the join first, getting something along these lines:
spark.table("t1")
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate full Cartesian products within groups, and might not be applicable in your case.
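If you go the DataFrame route, one way to check whether bucketing actually removed the shuffle is to look at the physical plan; a minimal sketch (table and column names follow the question, and whether the Exchange disappears depends on how the tables were bucketed):
import spark.implicits._
val joined = spark.table("t1")
  .join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
// If bucketing is picked up, the sort-merge join inputs should not be
// preceded by an Exchange (shuffle) node on the bucketed column.
joined.explain(true)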

Apache Spark SQL: how to optimize chained join for dataframe

I have to make a left join between a principle data frame and several reference data frames, i.e. a chained join computation, and I wonder how to make this efficient and scalable.
Method 1 is easy to understand and is also the current method, but I'm not satisfied with it because all the transformations are chained and wait for the final action to trigger the computation. If I keep adding transformations and the volume of data grows, Spark will fail at the end, so this method is not scalable.
Method 1:
def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame, refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {
  val transformations: List[DataFrame => DataFrame] = List(
    castColumnsFromStringToLong(ColumnsToCastToLong),
    castColumnsFromStringToFloat(ColumnsToCastToFloat),
    renameColumns(RenameMapping),
    filterAndDropColumns,
    joinRefDF1(refDF1),
    joinRefDF2(refDF2),
    joinRefDF3(refDF3),
    joinRefDF4(refDF4),
    joinRefDF5(refDF5),
    calculate()
  )
  transformations.reduce(_ andThen _)
}
pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
Method 2: I haven't found a real way to achieve my idea, but I would like to trigger the computation of each join immediately.
According to my test, count() is too heavy for Spark and useless for my application, but I don't know how to trigger the join computation with a more efficient action. That kind of action is, in fact, the answer I am looking for in this question.
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache() // joinedDF is not always used multiple times, but for some data frame, it is, so I add cache() to indicate the usage
joinedDF_1.count()
val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()
val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
joinedDF_3.cache()
joinedDF_3.count()
val joinedDF_4 = filterAndDropColumns(joinedDF_3)
joinedDF_4.cache()
joinedDF_4.count()
...
When you want to force the computation of a given join (or any transformation that is not final) in Spark, you can use a simple show or count on your DataFrame. These kinds of terminal operations force the computation of the result because otherwise it is simply not possible to execute the action.
Only after this will your DataFrame be effectively stored in your cache.
Once you're finished with a given DataFrame, don't hesitate to unpersist it. This frees your cached data when your cluster needs more room for further computation.
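A minimal sketch of that pattern, assuming the curried helpers from Method 1 (joinRefDF1(refDF1): DataFrame => DataFrame, and so on); count is used here only to materialize the cache, any cheap action would do:
// Force each intermediate join to materialize into the cache, then release
// the cached data once the downstream steps no longer need it.
val joinedDF_1 = joinRefDF1(refDF1)(principleDF).cache()
joinedDF_1.count() // triggers the join and fills the cache
val joinedDF_2 = joinRefDF2(refDF2)(joinedDF_1).cache()
joinedDF_2.count()
joinedDF_1.unpersist() // joinedDF_1 is no longer needed downstream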
You need to repartition your datasets by the join columns before calling the join transformation.
Example:
df1 = df1.repartition(col("col1"), col("col2"))
df2 = df2.repartition(col("col1"), col("col2"))
joinDF = df1.join(df2, df1.col("col1") === df2.col("col1") && ...)
Try creating a new dataframe based on it.
Ex:
val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
dfTest.storageLevel.useMemory // the result should be true

Pyspark actions on dataframe [duplicate]

I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in HDFS
// some operations in here to create df from df_in with two more columns, "terms1" and "terms2"
val intersectUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => seq1 intersect seq2 ) // intersects two sequences
val symmDiffUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => (seq1 diff seq2) ++ (seq2 diff seq1) ) // computes the symmetric difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
  .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
  .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
  .cache()
) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app works properly and fast until the show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full creation of df1 and df1.count() should then be very fast. What am I missing here? Why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and trying out count again.
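A minimal sketch of that suggestion; the value 200 below is only an illustrative placeholder, pick it based on cluster and data size:
// Collapse the ~40k partitions into a more sensible number before counting.
val df1Repartitioned = df1.repartition(200).cache() // 200 is a hypothetical value
df1Repartitioned.count() // should now run with far fewer tasks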
show() by default displays only 20 rows. If the first partition returns more than 20 rows, the remaining partitions will not be executed.
Note that show has several overloads. show(false) only disables truncation of the displayed values; if you ask for more rows, e.g. show(n) with a large n, more partitions will be executed and it may take more time. So show() equals show(20), which is a partial action.

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code, converted from a SQL script. It has been running for two hours and it's still not finished, even slower than SQL Server. Is anything not done correctly?
The following is the plan:
Push table2 to all executors.
Partition table1 and distribute the partitions to the executors.
Each row in table2/t2 joins (cross join) with each partition of table1.
So the calculation on the result of the cross join can be run distributed/in parallel. (I want to, for example, supposing I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.)
case class Cols(Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
                F6: String, F7: BigDecimal, F8: String, F9: String, F10: String)
case class Result(Id1: Int, ID2: Int, Point: Int)
def getDataFromDB(source: String) = {
  import sqlContext.sparkSession.implicits._
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$source"
  )).load()
  .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
  .as[Cols]
}
val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
val result = j.map { x =>
  val (l, r) = x
  Result(l.Id, r.Id,
    (if (l.F1 != null && r.F1 != null && l.F1 == r.F1) 3 else 0)
    + (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 2 else 0)
    + ..... // all kinds of similar expressions
    + (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)
  )
}.filter(x => x.Point >= 10)
println(s"Total count ${result.count()}") // This takes forever, the count will be about 100
How can I rewrite this in an idiomatic Spark way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database over JDBC, and despite the cache it does not benefit from it.
Start by checking out the physical plan and the Executors tab in the web UI to see the single executor and the single task doing the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use partitionColumn, lowerBound, upperBound options for the JDBC data source
Use predicates option
See JDBC To Other Databases in Spark's official documentation; a sketch of the first option follows below.
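A minimal sketch of the first option, following the shape of the question's getDataFromDB; the partition column, bounds and partition count below are hypothetical placeholders, not values taken from the question:
import sqlContext.sparkSession.implicits._
// Split the JDBC read into 32 parallel tasks over a numeric column.
// "Id", the bounds and numPartitions are assumptions for illustration only.
val table1 = sqlContext.read.format("jdbc").options(Map(
    "driver"          -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url"             -> jdbcSqlConn,
    "dbtable"         -> "table1",
    "partitionColumn" -> "Id",
    "lowerBound"      -> "1",
    "upperBound"      -> "300000",
    "numPartitions"   -> "32"
  )).load()
  .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
  .as[Cols]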
After you're fine with the loading, you should work on improving the last count action and add...another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The reason why the entire query is slow is that you only mark table1 to be cached, but the caching happens only when an action gets executed, which is exactly at the end (!). In other words, cache does nothing useful here and, more importantly, makes the query performance even worse.
Performance will also increase after you do table2.cache.count.
If you want to do cross join, use crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross-join can be run distributed/parallelly.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I want to, for example, supposing I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.
Use the broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
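Putting the two hints together, a minimal sketch of how the cross join could look (table1 and table2 are the Datasets loaded in the question; the scoring and filtering logic would then be expressed on the result):
import org.apache.spark.sql.functions.broadcast
// Explicit Cartesian product, with table2 hinted as small enough to broadcast
// to every executor, which matches the intended "copy of t2 on all executors" plan.
// Both sides share column names, so alias them to keep the columns addressable.
val crossed = table1.as("l").crossJoin(broadcast(table2.as("r")))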