Delta MERGE operation does not always insert/update all of the records

This happens from time to time, and that is the strange part.
My current solution: re-run the job! :disappointed: But this is very reactive, and I'm not happy with it.
This is what my merge statement looks like:
MERGE INTO target_tbl AS Target
USING df_source AS Source
ON Source.key = Target.key
WHEN MATCHED
  AND Target.ctl_utc_dts = '9999-12-31'
  AND Target.ctl_hash = Source.ctl_hash
  AND Source.ctl_start_utc_dts < Target.ctl_start_utc_dts THEN
  UPDATE SET
    Target.ctl_start_utc_dts = Source.ctl_start_utc_dts,
    Target.ctl_updated_run_id = Source.ctl_updated_run_id,
    Target.ctl_modified_utc_dts = Source.ctl_modified_utc_dts,
    Target.ctl_updated_batch_id = Source.ctl_updated_batch_id
WHEN MATCHED
  AND Target.ctl_utc_dts = '9999-12-31'
  AND Target.ctl_hash != Source.ctl_hash THEN
  UPDATE SET
    Target.ctl_utc_dts = Source.ctl_start_utc_dts,
    Target.ctl_updated_run_id = Source.ctl_updated_run_id,
    Target.ctl_modified_utc_dts = Source.ctl_modified_utc_dts,
    Target.ctl_updated_batch_id = Source.ctl_updated_batch_id
WHEN NOT MATCHED THEN
  INSERT (columns......)
  VALUES (columns......)
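One way to see whether the MERGE itself is skipping records or whether they never reach it: Delta logs per-operation metrics in the table history. A minimal sketch, assuming the table is registered as target_tbl and the Python Delta API is available:

from delta.tables import DeltaTable

# Pull the most recent operation (the MERGE) and its metrics.
hist = DeltaTable.forName(spark, "target_tbl").history(1)
hist.select("version", "operation", "operationMetrics").show(truncate=False)

# operationMetrics for a MERGE includes numSourceRows,
# numTargetRowsInserted and numTargetRowsUpdated -- if numSourceRows is
# already lower than expected, the problem is upstream of the MERGE.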
Spark App configuration:
--conf spark.yarn.stagingDir=hdfs://$(hostname -f):8020/user/hadoop
--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark
--conf spark.yarn.submit.waitAppCompletion=true
--conf spark.yarn.maxAppAttempts=5
--conf yarn.resourcemanager.am.max-attempts=5
--conf spark.shuffle.service.enabled=true
--executor-memory 24G
--driver-memory 60G
--driver-cores 6
--executor-cores 4
--conf spark.executor.asyncEagerFileSystemInit.paths=s3://s3_bkt
--conf spark.dynamicAllocation.maxExecutors=24
--packages io.delta:delta-core_2.12:1.0.0
Running on AWS EMR
Release label : emr-6.4.0
Hadoop distribution : Amazon 3.2.1
Applications: Spark 3.1.2
Any pointers on things I need to do differently? Any help or ideas would be great! Thanks, all.
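One cheap thing worth ruling out (this check is my suggestion, not from the original post, and it assumes df_source is available as a DataFrame): whether df_source ever carries more than one row per merge key. Delta normally fails a MERGE loudly when several source rows match the same target row, but an upstream dedup that silently keeps the "wrong" row would look exactly like lost updates.

from pyspark.sql import functions as F

# Count source rows per merge key; anything above 1 deserves a closer look.
dupes = df_source.groupBy("key").count().filter(F.col("count") > 1)
dupes.show(20, truncate=False)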

Related

How to utilize all driver cores in cluster mode in Spark?

I have an RDD final_rdd which I am collecting on the driver using an accumulator and converting to a List.
val acumFileKeys = sc.collectionAccumulator[String]("File Keys")
var input_map_keys = ListBuffer(input_map.keys.toSeq: _*)
final_rdd.keys.foreach(m => acumFileKeys.add(m.trim))

import collection.JavaConverters._
acumFileKeys.value.asScala.toList.foreach(fileKey => { /* code goes here */ })
The foreach loop runs on the driver and uses only 1 core out of 5, which in turn results in slow performance. Is there any way I can utilize all cores of the driver?
Below is the spark-submit command. We have 5 workers in total, each with 5 cores and 16G of memory.
spark-submit --class com.test.MyMainClass \
--deploy-mode cluster \
--master spark://master_ip:7077 \
--executor-cores 5 \
--conf spark.driver.maxResultSize=5G \
--conf spark.network.timeout=800s \
--executor-memory 8g \
--driver-memory 8g \
/opt/jars/my_app.jar
Use Scala Parallel Collections - https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
val list = acumFileKeys.value.asScala.toList

import scala.collection.parallel._
// On Scala 2.12+ this is a deprecated alias for java.util.concurrent.ForkJoinPool.
val forkJoinPool = new scala.concurrent.forkjoin.ForkJoinPool(5)

val parallelList = list.par
// Cap the parallel collection at 5 worker threads (one per driver core).
parallelList.tasksupport = new ForkJoinTaskSupport(forkJoinPool)
parallelList.foreach { fileKey =>
  println(Thread.currentThread.getName)
  // ...
}
First, final_rdd.keys.foreach is not a for loop. foreach in this context is an operation on the RDD that is performed remotely on the executors; it is already parallelized.
Usually there is not much sense in utilizing the computing resources of the driver. In a typical workflow the driver is mostly underloaded and just coordinates the computation that happens on the workers.
In your particular case the collection step could be rewritten as final_rdd.keys.collect().

Results vary when running Spark through local and through YARN

joined_data_filtered = spark.sql("""
    SELECT T.TransactionId, T.CustomerId, T.StartClusterId, T.EndCentermostClusterId,
           T.EndClusterId, T.StartCellId, T.EndCellId, T.EndCentermostCellId,
           T.EndCentermostLatitude, T.EndCentermostLongitude
    FROM joined_trip_data AS T
    LEFT JOIN (
        SELECT CustomerId, StartClusterId, EndClusterId
        FROM joined_trip_data
        WHERE EndCellId = StartCellId
        GROUP BY CustomerId, StartClusterId, EndClusterId
    ) AS D
    ON T.CustomerId = D.CustomerId
       AND T.StartClusterId = D.StartClusterId
       AND T.EndClusterId = D.EndClusterId
    WHERE D.CustomerId IS NULL
""")
In my pyspark script I initially cluster the start and end locations and then remove the trips that have the same start and end location: I select the rows with the same start and end cell id, take their CustomerId, StartClusterId and EndClusterId, left join that back onto the data set, and keep only the rows with no match (i.e., trips that do not start and end at the same location).
I ran this query several times through the following yarn command and got different results each time.
yarn command - spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 6g --executor-cores 3 --num-executors 4
So I ran the two data sets without the join condition and got the same count each time. When I join the two datasets on startclusterid and endclusterid, the results vary. But the result does not change when I run the script with spark-submit --master local[4].
I'm using the DBSCAN algorithm to cluster the start and end latitude/longitude pairs, returning a clusterid and clustercellid. I get different clusterids for a given cluster when I run multiple times through YARN, although we assumed the clusterid would not change within a given session.
start cluster for location 'A' - 1st run through yarn
startclusterid  startcellid
1               11126
1               11127
start cluster for location 'A' - 2nd run through yarn
startclusterid  startcellid
5               11126
5               11127
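Since the numeric labels DBSCAN assigns are arbitrary, one way to make the join robust (a sketch of an alternative approach, not from the original post; it assumes joined_trip_data is available as a DataFrame) is to key each cluster by something stable about its contents, e.g. the smallest cell id it contains, instead of the run-dependent clusterid:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Replace the arbitrary DBSCAN label with the minimum cell id in the cluster,
# which stays the same no matter what order the clusters were numbered in.
w = Window.partitionBy("CustomerId", "StartClusterId")
stable = joined_trip_data.withColumn("StableStartClusterId",
                                     F.min("StartCellId").over(w))

Joining on StableStartClusterId (and an analogous end-cluster key) should then give the same result regardless of how YARN schedules the runs.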
Initial Data set before clustering,
TransactionId CustomerId StartLat StartLon EndLat EndLon
17471146 590 41.890334 12.854832 41.91075183 12.86703281
17540917 590 41.890347 12.854828 41.91041441 12.86689
18972483 590 41.890389 12.854123 41.91134124 12.86684897
19037116 590 41.890358 12.854846 41.9107199 12.8671107
20315292 590 41.8903541 12.85485 41.9107082 12.8672354
20422794 590 41.890337 12.854812 41.91074152 12.867081
20458932 590 41.8904 12.854815 41.9107416 12.86717336
25902100 590 41.890329 12.854836 41.91074148 12.86704109
29829078 590 41.89034 12.8548 41.91074 12.867
30024741 590 41.89035 12.8548 41.91078 12.867
Can anyone please let me know what's the issue?

Spark job unable to execute in yarn-cluster mode

I am using Spark version 1.6.0 and Python version 2.6.6.
I have a pyspark script:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("Log Analysis")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

loadFiles = sc.wholeTextFiles("hdfs:///locations")  # one record per whole file
fileWiseData = loadFiles.flatMap(lambda inpFile: inpFile[1].split("\n\n"))
replaceNewLine = fileWiseData.map(lambda lines: lines.replace("\n", ""))
filterLines = replaceNewLine.map(lambda lines: lines.replace("/", " "))
errorEntries = filterLines.filter(lambda errorLines: "Error" in errorLines)
errEntry = errorEntries.map(lambda line: gettingData(line))  # formatting the data
ErrorFiltered = Row('ExecutionTimeStamp', 'ExecutionDate', 'ExecutionTime',
                    'ExecutionEpoch', 'ErrorNum', 'Message')
errorData = errEntry.map(lambda r: ErrorFiltered(*r))
errorDataDf = sqlContext.createDataFrame(errorData)
When I split my 1 GB log file into 20 MB pieces (or, in general, 30 or 40 MB splits) before executing the script, it works fine.
spark-submit --jars /home/hpuser/LogAnaysisPOC/packages/spark-csv_2.10-1.5.0.jar,/home/hpuser/LogAnaysisPOC/packages/commons-csv-1.1.jar \
--master yarn-cluster --driver-memory 6g --executor-memory 6g --conf spark.yarn.driver.memoryOverhead=4096 \
--conf spark.yarn.executor.memoryOverhead=4096 \
/home/user/LogAnaysisPOC/scripts/essbase/Essbaselog.py
1) If I try to execute with the whole 1 GB file as input at once, it fails at errorDataDf = sqlContext.createDataFrame(errorData).
2) I need to join the parsed data with one metadata data frame, which shuffles around 43 MB, and then write it out: dfinal.repartition(1).write.format("com.databricks.spark.csv").save("/user/user/loganalysis")
Again, it works fine for the split data and fails for the data all at once.
The job execution fails with the error:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
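One likely contributor (my reading of the failure, not stated in the post): wholeTextFiles materializes each file as a single record, so the 1 GB input becomes one enormous string in a single task, which is exactly the kind of thing that trips "Requested array size exceeds VM limit". A sketch of an alternative read path, assuming the log records really are separated by blank lines, that lets Hadoop split on the record delimiter so no task ever holds a whole file:

# textinputformat.record.delimiter is a standard Hadoop setting; each
# resulting record is one "\n\n"-separated block, not a whole file.
delim_conf = {"textinputformat.record.delimiter": "\n\n"}
records = sc.newAPIHadoopFile(
    "hdfs:///locations",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=delim_conf).map(lambda kv: kv[1])  # keep only the text value
# records would replace the wholeTextFiles + flatMap steps above.

Separately, repartition(1) before the write forces the entire result through a single task; writing without it and merging the part files afterwards would spread that load.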
The YARN scheduler settings are as follows:
yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.default.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.default.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.acl_administer_queue=*
yarn.scheduler.capacity.root.hive1.acl_submit_applications=*
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive1.maximum-capacity=100
yarn.scheduler.capacity.root.hive1.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive1.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.state=RUNNING
yarn.scheduler.capacity.root.hive1.user-limit-factor=1
yarn.scheduler.capacity.root.hive2.acl_administer_queue=*
yarn.scheduler.capacity.root.hive2.acl_submit_applications=*
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.hive2.maximum-capacity=100
yarn.scheduler.capacity.root.hive2.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive2.ordering-policy=fifo
yarn.scheduler.capacity.root.hive2.state=RUNNING
yarn.scheduler.capacity.root.hive2.user-limit-factor=1
yarn.scheduler.capacity.root.user-limit-factor=1
I have asked the same question in the forum as well.
Any form of suggestion is greatly appreciated.

Pymongo-spark: BsonSerializationException about decoding a BSON string

I'm running some PySpark code through an IPython notebook that loads and processes three Mongo collections as RDDs, then merges them (with unionAll and dropDuplicates), converts the merged result to a DataFrame, and writes it to CSV.
The Spark job is failing, apparently because pymongo-spark is unable to load just a few documents. How would I ignore any bad documents or add a try/except block to ignore this exception, and what does the exception mean? Loading RDDs with sc.mongoRDD(database_uri) doesn't let me insert any error-handling logic.
I ran into this exception on a few tasks:
org.bson.BsonSerializationException: While decoding a BSON string found a size that is not a positive number: 0
at org.bson.io.ByteBufferBsonInput.readString(ByteBufferBsonInput.java:107)
at org.bson.BsonBinaryReader.doReadString(BsonBinaryReader.java:223)
at org.bson.AbstractBsonReader.readString(AbstractBsonReader.java:430)
at org.bson.codecs.StringCodec.decode(StringCodec.java:39)
at org.bson.codecs.StringCodec.decode(StringCodec.java:28)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:306)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:286)
at com.mongodb.DBObjectCodec.readArray(DBObjectCodec.java:333)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:289)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:286)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.decode(DBObjectCodec.java:136)
at com.mongodb.DBObjectCodec.decode(DBObjectCodec.java:61)
at com.mongodb.CompoundDBObjectCodec.decode(CompoundDBObjectCodec.java:43)
at com.mongodb.CompoundDBObjectCodec.decode(CompoundDBObjectCodec.java:27)
at com.mongodb.connection.ReplyMessage.<init>(ReplyMessage.java:57)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:305)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:54)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.query(DefaultServerConnection.java:209)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:493)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:480)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:239)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:212)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:480)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:77)
at com.mongodb.Mongo.execute(Mongo.java:772)
at com.mongodb.Mongo$2.execute(Mongo.java:759)
at com.mongodb.DBCursor.initializeCursor(DBCursor.java:851)
at com.mongodb.DBCursor.hasNext(DBCursor.java:152)
at com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:78)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:420)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:249)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
For reference, I'm running Spark 1.5.1 on four workers with the following submit arguments:
export PYSPARK_SUBMIT_ARGS="--master spark://<IP_ADDRESS>:7077 --executor-memory 18g --driver-memory 4g --num-executors 4 --executor-cores 6 --conf spark.cores.max=24 --conf spark.driver.maxResultSize=4g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/home/tao/eventLogging --jars /usr/local/spark/lib/mongo-hadoop-spark-1.5.0.jar --driver-class-path /usr/local/spark/lib/mongo-hadoop-spark-1.5.0.jar --packages com.stratio.datasource:spark-mongodb_2.10:0.11.0 --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
I launched an IPython notebook with ipython notebook --profile=pyspark to run the following code (simplified):
hdfs_path = 'hdfs://<IP address>/path/to/training_data_folder'

rdd1 = sc.mongoRDD(database_uri1)\
    .map(select_certain_fields1)\
    .filter(lambda doc: len(doc.keys()))\
    .map(add_a_field)\
    .map(doc_to_tuple)\
    .filter(len)
rdd2 = sc.mongoRDD(database_uri2)\
    .map(select_certain_fields2)\
    .filter(lambda doc: len(doc.keys()))\
    .map(add_a_field)\
    .map(doc_to_tuple)\
    .filter(len)
rdd3 = sc.mongoRDD(database_uri3)\
    .map(select_certain_fields3)\
    .filter(lambda doc: len(doc.keys()))\
    .map(add_a_field)\
    .map(doc_to_tuple)\
    .filter(len)

df1 = sqlContext.createDataFrame(rdd1, schema)
df2 = sqlContext.createDataFrame(rdd2, schema)
df3 = sqlContext.createDataFrame(rdd3, schema)

df_all = df2.unionAll(df3).unionAll(df1)\
    .dropDuplicates(['caseId', 'timestamp'])
df_all.write.format('com.databricks.spark.csv')\
    .save(hdfs_path + '/all_notes_1.csv')
Thanks!
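For what it's worth, the exception means the reader hit a BSON string whose length prefix was 0, i.e. at least one stored document is corrupt rather than merely oddly shaped, which is why the error surfaces inside the Java decoder before any Python code runs. A sketch of a client-side probe (connection details and names are placeholders, and this is my suggestion, not from the post) that walks the collection with plain pymongo, one document per batch, to count how many documents fail to decode:

from bson.errors import InvalidBSON
from pymongo import MongoClient

coll = MongoClient('mongodb://<IP_ADDRESS>:27017')['mydb']['mycollection']
cursor = coll.find().batch_size(1)  # decode one document per fetch
good, bad = 0, 0
while cursor.alive:
    try:
        next(cursor)
        good += 1
    except InvalidBSON:
        bad += 1  # this position holds a document the decoder rejects
    except StopIteration:
        break
print('decodable: %d, corrupt: %d' % (good, bad))

Once the bad documents are located, they can be repaired or deleted server-side, which sidesteps the need for error handling inside sc.mongoRDD.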

DIMSUM leading to memory issues

I'm trying to compute column similarities for a big data set using the DIMSUM algorithm.
The cluster configuration is 5 nodes with 32 GB RAM and 6 cores each.
spark-shell --driver-memory 21G --executor-memory 29G
--conf "spark.rdd.compress=true"
--conf "spark.shuffle.memoryFraction=0.5"
--conf "spark.storage.memoryFraction=0.3"
--conf "spark.kryoserializer.buffer.max=256m"
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
--num-executors 5 --executor-cores 5
DIMSUM code
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val rows = sc.textFile(filename).map { line =>
  val values = line.split(' ').map(_.toDouble)
  Vectors.dense(values)
}
val mat = new RowMatrix(rows)

// Compute similar columns with estimation using DIMSUM
val simsEstimate = mat.columnSimilarities(0.2)

val array = simsEstimate.entries.map { case MatrixEntry(row: Long, col: Long, sims: Double) =>
  Array(row, col, sims).mkString(",")
}
// Note: this coalesces every entry into a single partition (one task), and
// `coal` is never actually used -- the un-coalesced `array` is what is saved.
val coal = array.coalesce(1, true)
array.saveAsTextFile("/user/similarity")
The data set contains 3,000 rows and 10 million columns; it created 601 splits and took 46 minutes to compute the treeAggregate. When trying to persist the results to a file, it throws memory-allocation errors or tasks spill to disk.
Any pointers on how to fix it?