apache spark - delay on driver - scala

I am attaching an image from the Spark UI, and I am asking what is causing the delay (represented by the white space), based on the description of my code below.
Description:
1) isEmpty: an action triggered on a Dataset DS1. It takes a few milliseconds: 60 ms.
2) The white space between "isEmpty" and "run at ThreadPool...".
3) "collect at graphUtil": collection of the Datasets created between 1) and 2).
The script is running on a YARN cluster.
Between 1) and 2) I am declaring Datasets that use sqlContext.implicits._; I am not collecting them here, so this is supposed to be work done on the driver. Those Datasets contain join/filter/... operations.
Given that I am not collecting them between 1) and 2), what could be causing this delay?
Code between 1) and 2):
import sqlContext.implicits._
val intermediateInputFlowsIdsDS= intermediateInputFlowsDS
.map(x=>x.flow)
.toDF("flowid").distinct().as[Int].repartition($"flowid")
val df_exch_flow_interm_out=df_exch_flow.filter(df_exch_flow("flow_type")==="PRODUCT_FLOW"
&&df_exch_flow("is_input")==="0" )
val allproducersExchDS= intermediateInputFlowsIdsDS.join(df_exch_flow_interm_out,
intermediateInputFlowsIdsDS("flowid")===df_exch_flow_interm_out("f_flow") )
.repartition($"f_owner")
//proc{id,name,proctype}/inter{flowid}/df_exch{exch,proc,flow,direct,amount,provider,unit}/df_flow{id,name,type}/unit{id,src,factor,dest}
df_proc.join(allproducersExchDS,df_proc("Id")=== allproducersExchDS("f_owner"))
.map(row => {
/*(flowid,procid,value)*/
new FlowProducer( row.getInt(3),// flowid output of producer
row.getInt(0) ,// the process id of producer
row.getDouble(8),// value of the matrix A cell,
row.getDouble(16),//factor
row.getString(17),//destination unit
row.getString(2)//process type
)
}).repartition($"producer_flow")

Related

Spark vs scikit-learn

I use PySpark for traffic classification with a decision tree model, and I measure the time required to train the model. It took 2 min 17 s. I then performed the same task using scikit-learn, where training took 1 min 19 s. Why, given that Spark is supposed to perform the task in a distributed way?
This is the code for PySpark:
df = (spark.read.format("csv")\
.option('header', 'true')\
.option("inferSchema", "true")\
.load("D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv"))
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 10)
pModel = dt.fit(trainDF)
This is the code for scikit-learn:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
path = 'D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv'
df= pd.read_csv(path)
#df.info()
%%time
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

apache-spark count filtered words from textfilestream

Hi, I am new to Scala and I am using IntelliJ IDEA. I am trying to stream a text file with a Scala job running on a Hadoop-Spark cluster. My main goal is to count only words (without any special characters) in key-value format.
I found the article "apache-spark regex extract words from rdd", where they use the findAllIn() function with a regex, but I am not sure if I am using it correctly in my code.
When I build my project and generate the jar file to run it in Spark, I manually provide the text file. It seems to run and count words, but it also seems to enter a loop, as it processes the file indefinitely. I thought it should process it only once.
Can someone tell me why this may be happening? Or is there a better way to achieve my goal?
Part of my text is:
It wlll only take a day,' he said. The others disagreed.
It's too fragile," 289 they said disapprovingly 23 age, but he refused to listen. Not
quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.
My code is:
package streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: HdfsWordCount <directory>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create the FileInputDStream on the directory and use the
// stream to count words in new files created
val lines = ssc.textFileStream(args(0))
val sep_words = lines.flatMap("[a-zA-Z]+".r.findAllIn(_))
val wordCounts = sep_words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs:///user/test/task1.txt")
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
The output looks like the following, which is what I expect, but as mentioned before it keeps repeating:
22/10/24 05:11:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
22/10/24 05:11:40 INFO Executor: Finished task 0.0 in stage 27.0 (TID 20). 1529 bytes result sent to driver
22/10/24 05:11:40 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 20) in 18 ms on localhost (executor driver) (1/1)
22/10/24 05:11:40 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool
22/10/24 05:11:40 INFO DAGScheduler: ResultStage 27 (print at NetworkWordCount.scala:28) finished in 0.031 s
22/10/24 05:11:40 INFO DAGScheduler: Job 13 finished: print at NetworkWordCount.scala:28, took 0.036092 s
-------------------------------------------
Time: 1666588300000 ms
-------------------------------------------
(heaped,1)
(safe,1)
(became,1)
(For,1)
(ll,1)
(it,1)
(Let,1)
(Open,1)
(others,1)
(pack,1)
...
22/10/24 05:11:40 INFO JobScheduler: Finished job streaming job 1666588300000 ms.1 from job set of time 1666588300000 ms
22/10/24 05:11:40 INFO JobScheduler: Total delay: 0.289 s for time 1666588300000 ms (execution: 0.222 s)
22/10/24 05:11:40 INFO ShuffledRDD: Removing RDD 40 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 40
22/10/24 05:11:40 INFO MapPartitionsRDD: Removing RDD 39 from persistence list
22/10/24 05:11:40 INFO BlockManager: Removing RDD 39
Additionally, I noticed that when I use only the split() function, without any regex or findAllIn(), it processes the file only once as expected, but of course it only splits the text on spaces.
Something like this:
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
I almost forgot: if you could also explain the meaning of the underscores in the code, that would help me a lot. I am having some trouble understanding that part too.
Thanks in advance.
If your file doesn't change while you're parsing it, you don't need a StreamingContext.
Just create a SparkContext:
val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount").setMaster("local"))
and then process your file in the way you need.
You need a StreamingContext only if you have data sources that change and you need to process the changed delta continuously.
There are many different ways to use the underscore in Scala, but in your case the underscore is just syntactic sugar that shortens a lambda function. It stands for the lambda's argument, here the element of the collection (with several underscores, each one stands for the next argument in turn).
You can rewrite the code that uses underscores like this:
// both lines do the same thing
.flatMap("[a-zA-Z]+".r.findAllIn(_))
.flatMap((s: String) => "[a-zA-Z]+".r.findAllIn(s))
And
// both lines do the same thing
.map(x => (x, 1)).reduceByKey(_ + _)
.map(x => (x, 1)).reduceByKey((l, r) => l + r)
To better understand this case and the other uses of the underscore, read some articles on the topic.
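For readers who want to see what the regex pipeline computes without running Spark, here is a minimal sketch of the same logic in plain Python. It is an illustration only: the helper names extract_words and count_words are made up for this example, and the sample lines are taken from the text in the question.

```python
import re

# Same pattern as in the Scala code: runs of ASCII letters only.
WORD = re.compile(r"[a-zA-Z]+")

def extract_words(line):
    """Return every run of letters in the line (the role findAllIn plays)."""
    return WORD.findall(line)

def count_words(lines):
    """Plain-Python stand-in for flatMap -> map(x => (x, 1)) -> reduceByKey."""
    counts = {}
    # flatMap: each line yields several words, flattened into one sequence
    for word in (w for line in lines for w in extract_words(line)):
        # reduceByKey((l, r) => l + r): add 1 to the running total for this key
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = [
    "It's too fragile,\" 289 they said disapprovingly 23 age, but he refused to listen. Not",
    "quite so lazy, the second little pig went in search of planks of seasoned 12 hola, 1256 23.",
]
counts = count_words(sample)
print(counts["of"])          # number of times "of" appears in the sample
print(extract_words("It's")) # the apostrophe splits the word in two
```

Note that the pattern splits on apostrophes, so "It's" becomes "It" and "s". That matches entries such as (ll,1) in the output shown in the question.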
Well, I think I found my main error. I was telling Spark to monitor the same directory where I was writing my output. So my guess is that it was triggering a new event itself every time it saved the processed output. Once I created separate input and output directories, the repeating output stopped and everything worked as expected. It now generates new output only when I manually provide a new file to the server.
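The self-triggering effect described here can be illustrated with a toy simulation in plain Python. This is only a sketch of the hypothesis, not of how Spark's file stream actually discovers files; poll_new_files and run_batches are hypothetical helpers written for this example.

```python
import os
import tempfile

def poll_new_files(directory, seen):
    """Return files in `directory` not seen on earlier polls, and remember them."""
    current = set(os.listdir(directory))
    new = sorted(current - seen)
    seen |= current
    return new

def run_batches(input_dir, output_dir, batches):
    """Simulate polling intervals; write one output file per non-empty batch."""
    seen = set()
    processed_per_batch = []
    for i in range(batches):
        new = poll_new_files(input_dir, seen)
        processed_per_batch.append(len(new))
        if new:
            with open(os.path.join(output_dir, f"out-{i}.txt"), "w") as f:
                f.write("\n".join(new))
    return processed_per_batch

# Case 1: output is written into the monitored directory.
with tempfile.TemporaryDirectory() as shared:
    open(os.path.join(shared, "input.txt"), "w").close()
    same_dir = run_batches(shared, shared, batches=4)

# Case 2: input and output directories are separate.
with tempfile.TemporaryDirectory() as inp, tempfile.TemporaryDirectory() as out:
    open(os.path.join(inp, "input.txt"), "w").close()
    separate_dirs = run_batches(inp, out, batches=4)

print(same_dir)       # every batch finds the previous batch's output as a "new" file
print(separate_dirs)  # only the first batch has anything to process
```

In the shared-directory case every batch sees the previous batch's output as fresh input, so the job never goes idle. With separate directories only the first batch does any work.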

RejectedExecutionException: ReactorDispatcher instance is closed. - Azure Event Hubs & Databricks Spark

I am trying to consume data from Azure Event Hubs with Databricks PySpark and write it to an ADLS sink. Somehow, the Spark job is not able to finish and gets aborted after running for 2 hours. The error is: Caused by: java.util.concurrent.RejectedExecutionException: ReactorDispatcher instance is closed.
Here is the full error: https://gist.github.com/kingindanord/a5f585c6ee7053c275c714d1b07c6538#file-spark_error-log
And here is my Python script:
import json
from datetime import date, timedelta, datetime
from pyspark.sql import functions as F
KEY_VAULT_NAME="KEY_VAULT_NAME"
EVENT_HUBS_SECRET_NAME="EVENT_HUBS_SECRET_NAME"
EVENT_HUBS_CONSUMER_NAME="EVENT_HUBS_CONSUMER_NAME"
BATCH_START_DATE = datetime.strptime("2022-03-22 23:00:00", "%Y-%m-%d %H:%M:%S")
BATCH_END_DATE = datetime.strptime("2022-03-23 00:00:00", "%Y-%m-%d %H:%M:%S")
CONTAINER_NAME = "CONTAINER_NAME_AZ"
HUB_NAME = "HUB_NAME"
ROOT_FOLDER = "ROOT_FOLDER"
SINK_URI = 'abfss://{CONTAINER_NAME}#.dfs.core.windows.net/{SINK_ROOT_FOLDER}'.format(CONTAINER_NAME=CONTAINER_NAME, SINK_ROOT_FOLDER=ROOT_FOLDER)
connection = dbutils.secrets.get(scope = KEY_VAULT_NAME, key = EVENT_HUBS_SECRET_NAME)
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection)
ehConf['eventhubs.consumerGroup'] = EVENT_HUBS_CONSUMER_NAME
# Create the positions
startingEventPosition = {
"offset": None,
"seqNo": -1, #not in use
"enqueuedTime": BATCH_START_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
"isInclusive": True
}
endingEventPosition = {
"offset": None,
"seqNo": -1,
"enqueuedTime": BATCH_END_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
ehConf["eventhubs.endingPosition"] = json.dumps(endingEventPosition)
ehConf["eventhubs.MaxEventsPerTrigger"] = 1000
ehConf["eventhubs.UseExclusiveReceiver"] = True
df = spark.read.format("eventhubs").options(**ehConf).load()
df2 = df.withColumn("body", df["body"].cast("string")) \
.withColumn("year", F.date_format(df["enqueuedTime"], "yyyy")) \
.withColumn("month", F.date_format(df["enqueuedTime"], "MM")) \
.withColumn("day", F.date_format(df["enqueuedTime"], "dd"))\
.select("body", "year", "month", "day")
df2.write.partitionBy("year", "month", "day").mode("overwrite") \
.format("delta") \
.parquet(SINK_URI)
I am using a separate consumer group for this application. The Event Hub has 3 partitions; auto-inflate of throughput units is enabled and set to 21 units.
Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Worker type and driver type are Standard_E16_v3 (128 GB memory, 16 cores). Min workers: 1, max workers: 3.
As you can see in the code, startingEventPosition and endingEventPosition are only one hour apart, so the size of the data should be around 3 GB. I don't know why I am not able to consume it. Can you please help me with this issue?
You can try these two workarounds:
Set a different consumer group for each stream.
Restart the Databricks cluster and then try again.
Refer to this GitHub link.
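One more thing that may be worth checking, as an observation about the script in the question rather than a confirmed cause of the RejectedExecutionException: the enqueuedTime strings are built with a strftime pattern that hard-codes 00:00:00, so the hour in BATCH_START_DATE is dropped. The sketch below uses only the standard library and the two dates from the question. The variant with %H:%M:%S is a suggested alternative and is not part of the original script.

```python
from datetime import datetime

BATCH_START_DATE = datetime.strptime("2022-03-22 23:00:00", "%Y-%m-%d %H:%M:%S")
BATCH_END_DATE = datetime.strptime("2022-03-23 00:00:00", "%Y-%m-%d %H:%M:%S")

# Pattern used in the question: the time of day is a literal, not a format code.
start_as_written = BATCH_START_DATE.strftime("%Y-%m-%dT00:00:00.000Z")
end_as_written = BATCH_END_DATE.strftime("%Y-%m-%dT00:00:00.000Z")

# Suggested alternative (an assumption about the intent): keep the time of day.
start_with_time = BATCH_START_DATE.strftime("%Y-%m-%dT%H:%M:%S.000Z")
end_with_time = BATCH_END_DATE.strftime("%Y-%m-%dT%H:%M:%S.000Z")

print(start_as_written, end_as_written)  # window covered by the original pattern
print(start_with_time, end_with_time)    # window when the hour is preserved
```

With the original pattern the two positions are a full day apart instead of one hour, which would mean far more data than the estimated 3 GB is being read.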

Spark job keeps failing after running the same map 3 times

My job has a step where I convert the DataFrame to an RDD[(key, value)], but the step runs 3 times, gets stuck the third time, and fails.
The Spark UI shows:
Active Jobs(1)
Job Id (Job Group) Description Submitted Duration Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total
3 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:50:02 1.6 min 0/1 210/45623
Completed Jobs (2)
2 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:16:28 23 min 1/1 46742/46075 (21 failed)
1 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 04:47:58 17 min 1/1 47369/46795 (20 failed)
This is the code :
val eventsRDD = eventsDF.map {
r =>
val customerId = r.getAs[String]("customerId")
val itemId = r.getAs[String]("itemId")
val countryId = r.getAs[Long]("countryId").toInt
val timeStamp = r.getAs[String]("eventTimestamp")
val totalRent = r.getAs[Int]("totalRent")
val totalPurchase = r.getAs[Int]("totalPurchase")
val totalProfit = r.getAs[Int]("totalProfit")
val store = r.getAs[String]("store")
val rawItemName = r.getAs[String]("itemName")
val itemName = if (rawItemName != null && rawItemName.nonEmpty) rawItemName else "NA"
(itemId, (customerId, countryId, timeStamp, totalRent, totalProfit, totalPurchase, store,itemName ))
}
Can someone tell me what is wrong here? If I want to persist/cache, which one should I use?
Error :
16/10/25 23:28:55 INFO YarnClientSchedulerBackend: Asked to remove non-existent executor 181
16/10/25 23:28:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1477415847345_0005_02_031011 on host: ip-172-31-14-104.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1477415847345_0005_02_031011
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Your map operation is resulting in some error, which propagates to the driver and results in task failure.
By default, spark.task.maxFailures has the value 4, which is documented as:
Number of failures of any particular task before giving up on the job.
The total number of failures spread across different tasks will not
cause the job to fail; a particular task has to fail this number of
attempts. Should be greater than or equal to 1. Number of allowed
retries = this value - 1.
So when your task fails, Spark retries the map operation until that task has failed 4 times in all.
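The retry rule quoted above can be sketched as a small plain-Python model. It is a toy illustration of "fail the job once a particular task has failed max_failures times", not Spark's scheduler code; run_task and the two sample tasks are hypothetical.

```python
def run_task(task, max_failures=4):
    """Run task(attempt) until it succeeds or has failed max_failures times.

    Returns (succeeded, attempts_made). Allowed retries = max_failures - 1.
    """
    failures = 0
    attempt = 0
    while failures < max_failures:
        attempt += 1
        try:
            task(attempt)
            return True, attempt
        except Exception:
            failures += 1
    return False, attempt

def always_fails(attempt):
    # Stand-in for a task whose executor keeps dying.
    raise RuntimeError("executor lost")

def succeeds_on_third(attempt):
    # Stand-in for a task that hits two transient failures first.
    if attempt < 3:
        raise RuntimeError("transient failure")

print(run_task(always_fails))       # gives up after the fourth failed attempt
print(run_task(succeeds_on_third))  # recovers within the allowed retries
```

In this model a task that fails twice and then succeeds does not fail the job, while a task that fails on all four attempts does.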
If I want to persist/cache, which one should I use?
cache is just a specific persist operation in which the RDD is persisted with the default storage level (MEMORY_ONLY).

Shark getting started: all queries hanging

I am a newbie to Shark, though I do have some experience with Spark. Every attempt to retrieve data from Shark hangs.
As a preliminary step, let's ensure that Spark is up and healthy:
spark>
val tf = sc.textFile("hdfs://10.213.39.125:8020/hadoop/example/20417.txt")
val c = tf.count
..
14/04/10 19:44:34 INFO SparkContext: Job finished: count at <console>:14, took 0.161135127 s
c: Long = 12761
I have carefully checked that shark-env.sh points to the Spark installation correctly.
Now let us go to Shark and try (a) the same file read and (b) a Shark table read.
(a)
shark>
val tf = sc.textFile("hdfs://10.213.39.125:8020/hadoop/example/20417.txt")
tf: org.apache.spark.rdd.RDD[String] = MappedRDD[4] at textFile at <console>:17
scala> val c2 = tf.count
(wait for minutes ... finally press Ctrl-C)
(b)
shark>
sc.makeRDD("select * from dual")
res1: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[2] at makeRDD at <console>:18
scala> res1.collect
(Once again: wait for minutes ... finally press Ctrl-C)
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:62)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:313)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:725)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:744)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:758)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:772)
at org.apache.spark.rdd.RDD.collect(RDD.scala:560)
More Details
Here are pertinent sections of the shark-env.sh
export SPARK_MEM=2g
# (Required) Set the master program's memory
export SHARK_MASTER_MEM=1g
# (Required) Point to your Scala installation.
export SCALA_HOME="/usr/local/scala-2.9.3"
# (Required) Point to the patched Hive binary distribution
export HIVE_HOME="/home/guest/shark-0.8.0-bin-hadoop1/hive-0.9.0-shark-0.8.0-bin"
# For running Shark in distributed mode, set the following:
export HADOOP_HOME="/usr/local/hadoop"
export SPARK_HOME="/home/guest/spark-0.8.0"
export MASTER="spark://swlab-r03-16L:17087"
From shark-shell, let us ensure we are talking to the same spark server
scala> sc.sparkHome
res0: String = /home/guest/spark-0.8.0
scala> sc.isLocal
res1: Boolean = false
scala> sc.master
res2: String = spark://swlab-r03-16L:17087
It seems there were Hive metastore configuration issues. The metastore parameters are in shark-hive-/conf/hive-site.xml.