val spark = SparkSession.builder().master("local[4]").appName("Test")
.config("spark.sql.adaptive.enabled", "true")
.config("spark.sql.adaptive.coalescePartitions.enabled", "true")
.config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
.config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
.config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
.getOrCreate()
val df = spark.read.csv("<Input File Path>")
val df1 = df.distinct()
df1.persist() // When this line is removed, the code works as expected
df1.write.csv("<Output File Path>")
I have an input file of size 2 GB, which is read as 16 partitions of 128 MB each. I have enabled adaptive query execution to coalesce partitions after the shuffle.
Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, which is expected.
(Spark UI screenshot: without persist)
If I include df1.persist, Spark writes 200 partitions, i.e. adaptive coalescing is not working.
(Spark UI screenshot: with persist)
.config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
Adding this config worked
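Putting it together, a sketch of the builder from the question with that one flag added (all other settings unchanged):
// Sketch: the original builder plus the flag that lets AQE change the output partitioning of cached plans
val spark = SparkSession.builder().master("local[4]").appName("Test")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
  .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
  .getOrCreate()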
Related: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-38172?filter=reportedbyme
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate()
import spark.implicits._
case class Something(id: Int, batchId: Option[String], div: String)
val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")
val ds = Seq(sth1, sth2, sth3, sth4).toDS()
ds.write.mode("overwrite").option("path", "loacl_path").bucketBy(3, "id").saveAsTable("Tmp")
I go to the local_path where it stores the data, but I only find two parquet files. I wonder why it doesn't create 3 parquet files, which is the number of buckets.
I have also tried setting the bucket number to 1 or 2, and it does impact the number of parquet files stored in the local path. When the bucket number is 1, there is only 1 parquet file, and similarly for the case when it equals 2.
You should use the Dataset.repartition operator to control the number of output files.
You can still combine bucketBy with repartition, but bucketBy serves a different purpose: avoiding shuffles in joins whose join keys match the bucketing keys (see the join sketch after the snippet below).
ds.repartition(3)
.write
.mode("overwrite")
.option("path", "loacl_path")
.bucketBy(3, "id")
.saveAsTable("Tmp")
bucketBy is probably not what you're looking for if you expect your data to be written into 3 parquet files. When you use bucketBy, you define the column names, and a hash function is responsible for dividing your data into the number of buckets you specified; that doesn't necessarily mean the data will be saved in n files. Bucketing is used to boost query performance (somewhat similar to indexing, but not the same). I haven't tried this yet, but what you're probably looking for is the repartition method.
df.repartition(3)
.write.mode(SaveMode.Overwrite)
.option("path", "local_path")
.saveAsTable("Tmp")
I am trying to create an application to process 10 million JSON files, where the size of a JSON file can vary from 1 MB to 50 MB.
To avoid burdening the driver, I am using the Structured Streaming API to process 100,000 JSON files at a time rather than loading all the source files at once.
mySchema
import org.apache.spark.sql.types._

val mySchema: StructType = StructType(Array(
  StructField("ID", StringType, true),
  StructField("StartTime", DoubleType, true),
  StructField("Data", ArrayType(
    StructType(Array(
      StructField("field1", DoubleType, true),
      StructField("field2", LongType, true),
      StructField("field3", LongType, true),
      StructField("field4", DoubleType, true),
      StructField("field5", DoubleType, true),
      StructField("field6", DoubleType, true),
      StructField("field7", LongType, true),
      StructField("field8", LongType, true)
    )), true), true)
))
Create a streaming DataFrame picking 100,000 files at a time:
val readDF = spark.readStream
  .format("json")
  .option("maxFilesPerTrigger", 100000)
  .option("pathGlobFilter", "*.json")
  .option("recursiveFileLookup", "true")
  .schema(mySchema)
  .load("/mnt/source/2020/*")
writeStream to start the streaming computation:
val sensorFileWriter = readDF
  .writeStream
  .queryName("myStream")
  .format("delta")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .outputMode("append")
  .option("checkpointLocation", "/mnt/dir/checkpoint")
  .foreachBatch(
    (batchDF: DataFrame, batchId: Long) => {
      batchDF.persist()
      val parseDF = batchDF
        .withColumn("New_Set", expr("transform(Data, x -> (x.field1 as f1, x.field2 as f2, x.field3 as field3))"))
        .withColumn("Data_New", addCol(uuid(), to_json($"New_Set")))
        .withColumn("data_size", size(col("Data")))
        .withColumn("requestid", uuid())
        .withColumn("start_epoch_double", bround($"StartTime").cast("long"))
        .withColumn("Start_date", from_unixtime($"start_epoch_double", "YYYYMMdd"))
        .withColumn("request", concat(lit("start"), col("Data_New"), lit("end")))
        .persist()
      val requestDF = parseDF
        .select($"Start_date", $"request")
      requestDF.write
        .partitionBy("Start_date")
        .mode("append")
        .parquet("/mnt/Target/request")
    }
  )
In the above "addCol" is a user defined function that adds new StructField to Array of StructFields
val addCol = udf((id: String, json: String) => {
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  import org.json4s.JsonDSL._
  implicit val formats = DefaultFormats
  // add a "requestid" entry to every element of the parsed JSON array
  compact(parse(json).extract[List[Map[String, String]]].map(m => Map("requestid" -> id) ++ m))
})
"uuid" is another udf that generates a unique id
val uuid = udf(() => java.util.UUID.randomUUID().toString)
Databricks cluster config:
Apache Spark 2.4.5
70 Workers: 3920.0 GB Memory, 1120 Cores (i.e. 56.0 GB Memory and 16 Cores per Worker)
1 Driver: 128.0 GB Memory, 32 Cores
Writing each batch of 100,000 files takes more than an hour (per the total task counts shown in the Spark UI). The entire process takes days to complete for all 10 million JSON files.
How can I make this streaming process run faster?
Should I be setting the "spark.sql.shuffle.partitions" property? If so, what is a good value for it?
In most cases, you want a 1-to-1 mapping of partitions to cores for streaming applications (see the Azure Databricks Structured Streaming guide).
To optimize mapping of your partitions to cores, try this:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
This will give you a 1-to-1 mapping of your partitions to cores.
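For the cluster described in the question, a rough back-of-the-envelope sketch (assuming defaultParallelism equals the total executor cores):
// Rough sketch for the cluster above: 70 workers * 16 cores = 1120 cores,
// so sc.defaultParallelism is typically 1120 and the shuffle partitions match it
val totalCores = 70 * 16 // 1120
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
println(s"defaultParallelism = ${sc.defaultParallelism}, total cores = $totalCores")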
Let's say I do something like this:
def readDataset: Dataset[Row] = ???
val ds1 = readDataset.cache();
val ds2 = ds1.withColumn("new", lit(1)).cache();
Will ds2 and ds1 share all the data in their columns except the "new" column added to ds2? If I cache both datasets, will the whole of ds1 and ds2 be stored in memory, or will the shared data be stored only once?
If data is shared, when is this sharing broken (so that the same data ends up stored in two memory locations)?
I know that Datasets and RDDs are immutable, but I couldn't find a clear answer on whether they share data or not.
In short: the cached data will not be shared.
Here is some experimental proof to convince you, with code snippets and the corresponding memory usage as shown in the Spark UI:
val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3)
df2.count()
uses about 10 MB of memory,
while
val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3).cache()
df2.count()
uses about 30 MB:
- for df: 10 MB
- for df2: 10 MB for the copied column and another 10 MB for the new one
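If you want to check this without the Spark UI, a rough sketch using the developer API getRDDStorageInfo (the reported names and sizes will vary):
// Rough sketch: list the cached RDDs backing cached DataFrames and their in-memory sizes,
// similar to what the Storage tab shows
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory")
}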
I am trying to read a SQL table (15 million rows) into a DataFrame using Spark. I want to leverage multiple cores to read it quickly and to partition the data. Which column(s) should I pick for partitioning: ID, UUID, sequence, date-time? How should I calculate the number of partitions?
There are multiple complex questions in your question:
- What are the column(s) I can select to partition on?
It depends on your needs, your computing goals, and the transformations you will do next with Spark on your data. (For example, if you groupBy(key) and your key is date-time, then you should partition by date-time.)
- The number of partitions depends on the size of your data, your hardware resources, and your needs. It is a complex question; you also have to take into account the shuffle partitions used by transformations (the default is 200; the value advised by Spark is 3 * the number of CPUs).
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val sparkSession = org.apache.spark.sql.SparkSession.builder
  .master("local[*]")
  .appName("JdbcMicroService")
  .getOrCreate()

// 3 * the number of available CPUs (adjust to your cluster)
val numCpus = Runtime.getRuntime.availableProcessors()
sparkSession.conf.set("spark.sql.shuffle.partitions", (3 * numCpus).toString)

// `Database` is assumed to be a small case class holding the connection details
def requestPostgreSql(sparkSession: SparkSession, database: Database, dateOfRequest: String): DataFrame = {
  val url = "jdbc:postgresql://" + database.url + "/" + database.databaseName
  val requestDF = sparkSession.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", url)
    .option("dbtable", database.tableName)
    .option("user", database.user)
    .option("password", database.passwd)
    .load()
    .repartition(col("colName"))
  requestDF
}
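If the goal is to parallelize the read itself rather than repartition afterwards, Spark's JDBC source can split the query across tasks using a numeric partition column. A sketch of what the read inside requestPostgreSql could look like; the partition column "id", its bounds, and numPartitions = 48 are placeholders to adjust to your table and cluster:
// Sketch: partitioned JDBC read - Spark issues numPartitions parallel queries,
// each covering a slice of [lowerBound, upperBound] on partitionColumn
val parallelDF = sparkSession.read.format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", database.tableName)
  .option("user", database.user)
  .option("password", database.passwd)
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "15000000")
  .option("numPartitions", "48")
  .load()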
I am writing a Scala script that reads from a table, transforms the data, and shows the result using Spark. I am using Spark 2.1.1.2 and Scala 2.11.8. There is a dataframe instance I use twice in the script (df2 in the code below). Since dataframes are calculated when an action is called on them, not when they are declared, I expect this dataframe to be calculated twice. I thought that persisting it would improve performance, since it would be calculated once (when persisted) instead of twice.
However, the script runs ~10 seconds longer when I persist compared to when I don't. I cannot figure out the reason for this. If someone has an idea, it would be much appreciated.
My submission command line is below:
spark-submit --class TestQuery --master yarn --driver-memory 10G --executor-memory 10G --executor-cores 2 --num-executors 4 /home/bcp_data/test/target/TestQuery-1.0-SNAPSHOT.jar
Scala script is below:
val spark = SparkSession
.builder()
.appName("TestQuery")
.config("spark.sql.warehouse.dir", "file:/tmp/hsperfdata_hdfs/spark-warehouse/")
.enableHiveSupport()
.getOrCreate()
val m = spark.sql("select id, startdate, enddate, status from members")
val l = spark.sql("select mid, no, status, potential from log")
val r = spark.sql("select mid, code from records")
val df1 = m.filter($"status".isin(1,2).and($"startdate" <= one_year_ago).and($"enddate" >= one_year_ago))
val df2 = df1.select($"id", $"code").join(l, "mid").filter(($"status".equalTo(1)).and($"potential".notEqual(9))).select($"no", $"id", $"code")
df2.persist
val df3 = df2.join(r, df2("id").equalTo(r("mid"))).filter($"code".isin("0001","0010","0015","0003","0012","0014","0032","0033")).groupBy($"code").agg(countDistinct($"no"))
val fa = spark.sql("select mid, acode from actions")
val fc = spark.sql("select dcode, fcode from params.codes")
val df5 = fa.join(fc, fa("acode").startsWith(fc("dcode")), "left_outer").select($"mid", $"fcode")
val df6 = df2.join(df5, df2("id").equalTo(df5("mid"))).groupBy($"code", $"fcode")
println("count1: " + df3.count + " count2: " + df6.count)
Using caching is the right choice here, but your statement
df2.persist
has no effect because you do not use the returned DataFrame. Just do:
val df2 = df1.select($"id", $"code")
.join(l, "mid")
.filter(($"status".equalTo(1)).and($"potential".notEqual(9)))
.select($"no", $"id", $"code")
.persist
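As an optional follow-up sketch: once the actions that reuse the cached df2 have run, the cached blocks can be released explicitly so the memory is freed.
// After the actions that reuse the cached df2 have completed:
println("count1: " + df3.count + " count2: " + df6.count)
df2.unpersist() // free the cached blocks once they are no longer needed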