I am using PySpark v1.6.2 and my code goes like this:
df = sqlContext.sql('SELECT * FROM <lib>.<table>')
df.rdd.getNumPartitions() ## 2496
df = df.withColumn('count', lit(1)) ## up to this point it still has 2496 partitions
df = df.repartition(2496,'trip_id').sortWithinPartitions('trip_id','time')
# This is where the trouble starts
sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id','time') ## Defining a window
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
# Done with window function
df.rdd.getNumPartitions() ## 200
My question is:
Is there a way to tell PySpark how many partitions it should make when using the function Window.partitionBy(*cols)?
Alternatively, is there a way to influence PySpark to keep the same number of partitions the DataFrame had before the window function is applied?
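For reference, 200 is the default value of spark.sql.shuffle.partitions, which sets the partition count produced by shuffles such as window operations. A minimal sketch of raising it through the 1.6-era SQLContext API (my assumption, not verified on this exact setup):
sqlContext.setConf('spark.sql.shuffle.partitions', '2496')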
Related
I need to process Spark DataFrame partitions in batches, N partitions at a time. For example, if I have 1000 partitions in a Hive table, I need to process 100 partitions at a time.
I tried the following approach:
Get the partition list from the Hive table and find the total count.
Get the loop count using total_count / 100.
Then:
for x in range(loop_count):
    files_list = partition_path_list[start_index:end_index]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
But this is not working as expected. Can anyone suggest a better method? A solution in Spark Scala is preferred.
The for loop you have only increments x on each iteration; that's why the start and end indices never change.
Not sure why you mention Scala since your code is in Python.
Here's an example with loop count being 1000.
partitions_per_iteration = 100
loop_count = 1000
for start_index in range(0, loop_count, partitions_per_iteration):
    files_list = partition_path_list[start_index:start_index + partitions_per_iteration]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
In Scala, you can do a similar loop:
val total = 1000
val partitionsPerIteration = 100
for {
  startIndex <- 0 until total by partitionsPerIteration
} {
  val filesList = partitionsPathList.slice(startIndex, startIndex + partitionsPerIteration)
  val df = ...
}
I think total or totalPartitions is a clearer variable name than "loop count".
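As an aside, one way the partition_path_list used above might be built (a sketch: it assumes a Hive-partitioned table, called mydb.mytable here as a placeholder, whose directory layout matches its partition spec, e.g. dt=2020-01-01):
specs = [row[0] for row in spark.sql("SHOW PARTITIONS mydb.mytable").collect()]
partition_path_list = [target_table_location + "/" + spec for spec in specs]
total_count = len(partition_path_list)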
I have 27 million records in an XML file that I want to push into an Elasticsearch index.
Below is the code snippet, written in Spark Scala; I'll be building a Spark job jar and running it on AWS EMR.
How can I use Spark efficiently to complete this exercise? Please guide me.
I have a 12.5 GB gzipped XML file which I am loading into a Spark DataFrame. I am new to Spark. (Should I split this gzip file, or will the Spark executors take care of it?)
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions
class ReadFromXML {
def createXMLDF(): DataFrame = {
val spark: SparkSession = SparkUtils.getSparkInstance("Spark Extractor")
import spark.implicits._
val m_df: DataFrame = SparkUtils.getDataFrame(spark, "temp.xml.gz").coalesce(5)
val new_df: DataFrame = m_df.select($"CountryCode"(0).as("countryCode"),
$"PostalCode"(0).as("postalCode"),
$"state"(0).as("state"),
$"county"(0).as("county"),
$"city"(0).as("city"),
$"district"(0).as("district"),
$"Identity.PlaceId".as("placeid"), $"Identity._isDeleted".as("deleted"),
$"FullStreetName"(0).as("street"),
functions.explode($"Text").as("name"), $"name".getField("BaseText").getField("_VALUE")(0).as("nameVal"))
.where($"LocationList.Location._primary" === "true")
.where("(array_contains(_languageCode, 'en'))")
.where(functions.array_contains($"name".getField("BaseText").getField("_languageCode"), "en"))
new_df.drop("name")
}
}
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // provides the saveToEs method on DataFrames
object PushToES extends App {
val spark = SparkSession
.builder()
.appName("PushToES")
.master("local[*]")
.config("spark.es.nodes", "awsurl")
.config("spark.es.port", "port")
.config("spark.es.nodes.wan.only", "true")
.config("spark.es.net.ssl", "true")
.getOrCreate()
val extractor = new ReadFromXML()
val df = extractor.createXMLDF()
df.saveToEs("myindex/_doc")
}
Update 1:
I have split the file into chunks of 68 MB each, and reading a single chunk takes 3.7 minutes.
I was trying to use the Snappy compression codec instead of gzip,
so I converted the gz file into a Snappy file and added the following to the config:
.config("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
But it returns an empty DataFrame;
df.printSchema() returns just "root".
Update 2:
I have managed to run with the LZO format; it takes much less time to decompress and load into a DataFrame.
Is it a good idea to iterate over each 140 MB LZO-compressed file and create a DataFrame per file?
or
should I load a set of 10 files into one DataFrame?
or
should I load all 200 LZO-compressed files of 140 MB each into a single DataFrame? If yes, how much memory should be allocated to the master, since I think the data will be loaded on the master?
When reading files from an S3 bucket, can the "s3a" URI improve performance, or is the "s3" URI fine for EMR?
Update 3:
To test a small set of 10 LZO files, I used the configuration below.
The EMR cluster took 56 minutes overall, of which the step (the Spark application) took 48 minutes to process the 10 files.
1 Master - m5.xlarge (4 vCore, 16 GiB memory, EBS-only storage, 32 GiB EBS)
2 Core - m5.xlarge (4 vCore, 16 GiB memory, EBS-only storage, 32 GiB EBS)
With the Spark parameters below, tuned following https://idk.dev/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/:
[
{
"Classification": "yarn-site",
"Properties": {
"yarn.nodemanager.vmem-check-enabled": "false",
"yarn.nodemanager.pmem-check-enabled": "false"
}
},
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "false"
}
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.network.timeout": "800s",
"spark.executor.heartbeatInterval": "60s",
"spark.dynamicAllocation.enabled": "false",
"spark.driver.memory": "10800M",
"spark.executor.memory": "10800M",
"spark.executor.cores": "2",
"spark.executor.memoryOverhead": "1200M",
"spark.driver.memoryOverhead": "1200M",
"spark.memory.fraction": "0.80",
"spark.memory.storageFraction": "0.30",
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.yarn.scheduler.reporterThread.maxFailures": "5",
"spark.storage.level": "MEMORY_AND_DISK_SER",
"spark.rdd.compress": "true",
"spark.shuffle.compress": "true",
"spark.shuffle.spill.compress": "true",
"spark.default.parallelism": "4"
}
},
{
"Classification": "mapred-site",
"Properties": {
"mapreduce.map.output.compress": "true"
}
}
]
Here are some tips from my side.
Read the data in Parquet or any other format and repartition it as per your need. Data conversion may consume time, so read it into Spark first and then process it. Try to create the index mapping and format the data before starting the load; this makes debugging much easier when the mapping is complex.
val spark = SparkSession
.builder()
.appName("PushToES")
.enableHiveSupport()
.getOrCreate()
val batchSizeInMB = 4 // change it as you need
val batchRetryCount = 3
val batchWriteRetryWait = 10
val batchEntries = 10
val enableSSL = true
val wanOnly = true
val enableIdempotentInserts = true
val esNodes = Seq("yourNode1", "yourNode2", "yourNode3")
val port = 9200 // your Elasticsearch port
var esConfig = Map[String, String]()
esConfig = esConfig + ("es.nodes" -> esNodes.mkString(","))
esConfig = esConfig + ("es.port"->port.toString())
esConfig = esConfig + ("es.batch.size.bytes"->(batchSizeInMB*1024*1024).toString())
esConfig = esConfig + ("es.batch.size.entries"->batchEntries.toString())
esConfig = esConfig + ("es.batch.write.retry.count"->batchRetryCount.toString())
esConfig = esConfig + ("es.batch.write.retry.wait"->batchWriteRetryWait.toString())
esConfig = esConfig + ("es.batch.write.refresh"->"false")
if(enableSSL){
esConfig = esConfig + ("es.net.ssl"->"true")
esConfig = esConfig + ("es.net.ssl.keystore.location"->"identity.jks")
esConfig = esConfig + ("es.net.ssl.cert.allow.self.signed"->"true")
}
if (wanOnly){
esConfig = esConfig + ("es.nodes.wan.only"->"true")
}
// This helps if some task fails, so data won't be duplicated
if(enableIdempotentInserts){
esConfig = esConfig + ("es.mapping.id" ->"your_primary_key_column")
}
val df: DataFrame = ??? // suppose you created it from Parquet or any other format
Data is actually inserted at the executor level, not at the driver level,
so try giving only 2-4 cores to each executor so that not too many connections are open at the same time.
You can vary the document size or batch entries as you see fit; please read up on those settings.
Write data in chunks; this will help you when loading large datasets in the future,
and try creating the index mapping before loading the data. Prefer only lightly nested data, even though ES supports that functionality.
Also try to keep some primary key in your data.
val dfToInsert = df.withColumn("salt", ceil(rand() * 10).cast("int")).persist()
for (i <- 1 to 10) {
  val start = System.currentTimeMillis
  val finalDF = dfToInsert.filter($"salt" === i)
  val counts = finalDF.count()
  println(s"count of records in chunk $i -> $counts")
  finalDF.drop("salt").saveToEs("indexName", esConfig)
  val totalTime = System.currentTimeMillis - start
  println(s"ended loading data for chunk $i. Total time taken in seconds: ${totalTime / 1000}")
}
Try to give your final index an alias and update it on each run, as you would not want to disturb your production server at load time; one way to do the switch is sketched below.
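A sketch using Elasticsearch's standard _aliases endpoint (the index and alias names below are hypothetical, and the host/port are the placeholders from the question):
import requests

# readers always query the alias "myindex"; each load writes to a fresh
# versioned index (e.g. myindex_v2) and the alias is then swapped atomically
resp = requests.post(
    "https://awsurl:port/_aliases",  # placeholder host/port
    json={"actions": [
        {"remove": {"index": "myindex_v1", "alias": "myindex"}},
        {"add": {"index": "myindex_v2", "alias": "myindex"}},
    ]},
)
resp.raise_for_status()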
Memory
This cannot be generic, but just to give you a kick start:
keep 10-40 executors depending on your data size and budget, give each
executor 8-16 GB of memory and about 5 GB of overhead (this can vary, since your
documents can be large or small), and if needed set maxResultSize to 8 GB.
The driver can have 5 cores and 30 GB of RAM.
Important things:
You need to keep the config in a variable, since you may change it per index.
Insertion happens on the executors, not on the driver, so try to keep fewer
connections open while writing; each core opens one connection.
Document insertion can be batched by entry count or by size in bytes.
Change these as you learn from doing multiple runs.
Try to make your solution robust; it should be able to handle data of any size.
Both reading and writing can be tuned, but try to format your data according to the
document mapping before starting the load. This makes debugging much easier
if the documents are somewhat complex and nested.
The memory for spark-submit can also be tuned as you learn from running
jobs; just look at the insertion time while varying memory and batch
size.
The most important thing is design: if you are using ES, create
your mapping while keeping the end queries and requirements in mind.
Not a complete answer but still a bit long for a comment. There are a few tips I would like to suggest.
It's not clear, but I assume your worry here is the execution time. As suggested in the comments, you can improve performance by adding more nodes/executors to the cluster. If the gzip file is loaded without partitioning in Spark, then you should split it into pieces of a reasonable size (not too small, which makes processing slow; not too big, or the executors will run OOM).
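A rough PySpark illustration of that splitting point, using the spark-xml package (the rowTag and the partition count below are hypothetical):
# a single .gz file is not splittable, so it arrives as one partition;
# spreading it out right after the read lets the rest of the job parallelise
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "Record")  # hypothetical row tag
      .load("temp.xml.gz"))
df = df.repartition(200)  # hypothetical count; tune to the cluster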
Parquet is a good file format when working with Spark, so convert your XML to Parquet if you can; it's highly compressed and lightweight.
Following up on your comments: coalesce does not do a full shuffle. The coalesce algorithm changes the number of partitions by moving data from some partitions into existing ones, so it obviously cannot increase the number of partitions. Use repartition instead; the operation is costly, but it can increase the number of partitions. Check this for more detail: https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
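A quick PySpark illustration of that difference (the partition counts are arbitrary):
df = spark.range(1000).repartition(8)
print(df.rdd.getNumPartitions())                 # 8
print(df.coalesce(20).rdd.getNumPartitions())    # still 8: coalesce cannot increase
print(df.repartition(20).rdd.getNumPartitions()) # 20: full shuffle, but more partitions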
I am wondering if it's possible to obtain the result of percent_rank using the QuantileDiscretizer transformer in PySpark.
The purpose is that I am trying to avoid computing percent_rank over the entire column, as it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count()+1,\
inputCol="x",\
outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),\
F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values.__getitem__("maxval"), agg_values.__getitem__("minval")
normalize = F.udf(lambda x: (x-xmin)/(xmax-xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer")- F.col("perc_rank")),6) )
print(df.select(F.max("error")).show())
df.show(5)
However, it seems that the error grows as the number of data points increases, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percent_rank?
Alternatively, is there a way to compute percent_rank over an entire column in an efficient way?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
But nonetheless, the data would still be moved to a single worker. I don't think there is any better alternative to avoid the shuffle here, since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.
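A rough sketch of that pre-partitioning idea, building on the dummyCol snippet above (before the column is dropped):
df = df.repartition("dummyCol")  # one shuffle up front
win = Window.partitionBy("dummyCol").orderBy("x")
df = (df.withColumn("perc_rank", F.percent_rank().over(win))
        .withColumn("dense_rank", F.dense_rank().over(win)))  # reuses the same partitioning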
I am reading data from an S3 bucket, running a for loop, applying a few filters, and finding a max value. I am running this on an EMR cluster, but it is taking hours to run.
df has 1.5 million rows and df_new has 50,000 rows. The reason I converted to a NumPy array was to see whether it would improve the performance of the loop.
Since I am new to PySpark, I am not sure whether this is an efficient way to do it or whether there is a better way.
Thanks in advance
df = spark.read.format('parquet').load(os.path.join('s3://', bucket_name, bucket_path_exec + date_val, report_name))
df_new = df.filter(f.col("a") == 1)
df_new = np.array(df_new.select("a", "b", "c", "d", "e").collect())
rows = len(df_new)
for i in range(0, rows):
    aaa = df_new[i][0]
    eee = df_new[i][4]
    time = df_new[i][2]
    sub = df_new.filter(f.col("a") == aaa)
    sub = sub.filter(f.col("b") < time)
    max_time = sub.groupby().agg(f.max("eee").alias("MaxTime"))
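For comparison, a per-row loop like this can usually be collapsed into a single window aggregation, which avoids launching a Spark job for every row. A sketch under my own assumptions about the intent (for each row, the maximum of e over earlier rows, ordered by b, within the same a):
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("a").orderBy("b").rowsBetween(Window.unboundedPreceding, -1)
df_max = df.withColumn("MaxTime", f.max("e").over(w))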
I am trying to find similar users by vectorizing user features and sorting by distance between user vectors in PySpark. I'm running this in Databricks on Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3)
Following the code in the docs, I am using the approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and I'm still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure if changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfA)
# returns
# threshold of 100000 is clearly overkill
# A dataframe with dfA and dfB feature vectors and a EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results for the second half of the example above by increasing the bucketLength parameter to 15. The threshold could also have been lowered, since the Euclidean distance was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
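Putting that together for the dfC/dfD pair above (mirroring the bucketLength of 15 that worked; the threshold of 100 is arbitrary but comfortably above the ~34 distance):
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)
model.approxSimilarityJoin(dfC, dfD, 100, distCol="EuclideanDistance").show()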