Determining/enabling parallelism in Spark - Scala

I have developed a Scala application and I am getting mostly correct results out of it, but I am not sure whether my code is taking advantage of Spark's parallelism.
I am running Spark in standalone mode with two virtual workers, each with 2 cores and 2 GB of memory.
Below is the code snippet from the application:
The initialization of the RDD:
for (i <- 0 to limit - 1) {
  data += new MyClass(dimension_limit) with Serializable
}
var example_rdd = sc.parallelize(data)
RDD Operations:
var temp_rdd: RDD[MyClass] = sc.emptyRDD[MyClass]
temp_rdd = example_rdd
var updated_rdd: RDD[MyClass] = sc.emptyRDD[MyClass]
for (i <- 0 to no_of_iterations - 1) {
  updated_rdd = temp_rdd.map{ x => update_function(x) }
  updated_rdd.count() // to trigger the map
  temp_rdd = updated_rdd
}
Update function:
def update_function(x: MyClass): MyClass = {
  x.property1 = "value"
  // ... all other property updates ...
  x
}
Below is the snapshot of the job from the history server,
and this is the stage detail:
Kindly help me determine whether my code is running in parallel; if not, what might be the issue in my implementation?
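For reference, a quick way to check how the work is split is to look at the partition count; this is a minimal sketch using the example_rdd and data collection from the snippets above:
// sc.parallelize splits the collection into spark.default.parallelism
// slices unless numSlices is given explicitly; each slice becomes one task.
println(s"partitions: ${example_rdd.getNumPartitions}")
// With two workers of 2 cores each, 4 slices let all cores work in parallel:
val explicit_rdd = sc.parallelize(data, 4)
println(s"partitions: ${explicit_rdd.getNumPartitions}")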

Related

Apache Spark Data Generator Function on Databricks Not working

I am trying to execute the Data Generator function provided by Microsoft to test streaming data to Event Hubs.
Unfortunately, I keep getting the error
Processing failure: No such file or directory
when I try to execute the function:
%scala
DummyDataGenerator.start(15)
Can someone take a look at the code and help me figure out why I'm getting the error?
class DummyDataGenerator:
  streamDirectory = "/FileStore/tables/flight"

None # suppress output
I'm not sure how the cell above gets used by the DummyDataGenerator function.
%scala
import scala.util.Random
import java.io._
import java.time._
// Notebook #2 has to set this to 8, we are setting
// it to 200 to "restore" the default behavior.
spark.conf.set("spark.sql.shuffle.partitions", 200)
// Make the username available to all other languages.
// "WARNING: use of the "current" username is unpredictable
// when multiple users are collaborating and should be replaced
// with the notebook ID instead.
val username = com.databricks.logging.AttributionContext.current.tags(com.databricks.logging.BaseTagDefinitions.TAG_USER);
spark.conf.set("com.databricks.training.username", username)
object DummyDataGenerator extends Runnable {
var runner : Thread = null;
val className = getClass().getName()
val streamDirectory = s"dbfs:/tmp/$username/new-flights"
val airlines = Array( ("American", 0.17), ("Delta", 0.12), ("Frontier", 0.14), ("Hawaiian", 0.13), ("JetBlue", 0.15), ("United", 0.11), ("Southwest", 0.18) )
val reasons = Array("Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft")
val rand = new Random(System.currentTimeMillis())
var maxDuration = 3 * 60 * 1000 // default to three minutes
def clean() {
System.out.println("Removing old files for dummy data generator.")
dbutils.fs.rm(streamDirectory, true)
if (dbutils.fs.mkdirs(streamDirectory) == false) {
throw new RuntimeException("Unable to create temp directory.")
}
}
def run() {
val date = LocalDate.now()
val start = System.currentTimeMillis()
while (System.currentTimeMillis() - start < maxDuration) {
try {
val dir = s"/dbfs/tmp/$username/new-flights"
val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath()+".csv"
val writer = new PrintWriter(tempFile)
for (airline <- airlines) {
val flightNumber = rand.nextInt(1000)+1000
val deptTime = rand.nextInt(10)+10
val departureTime = LocalDateTime.now().plusHours(-deptTime)
val (name, odds) = airline
val reason = Random.shuffle(reasons.toList).head
val test = rand.nextDouble()
val delay = if (test < odds)
rand.nextInt(60)+(30*odds)
else rand.nextInt(10)-5
println(s"- Flight #$flightNumber by $name at $departureTime delayed $delay minutes due to $reason")
writer.println(s""" "$flightNumber","$departureTime","$delay","$reason","$name" """.trim)
}
writer.close()
// wait a couple of seconds
//Thread.sleep(rand.nextInt(5000))
} catch {
case e: Exception => {
printf("* Processing failure: %s%n", e.getMessage())
return;
}
}
}
println("No more flights!")
}
def start(minutes:Int = 5) {
maxDuration = minutes * 60 * 1000
if (runner != null) {
println("Stopping dummy data generator.")
runner.interrupt();
runner.join();
}
println(s"Running dummy data generator for $minutes minutes.")
runner = new Thread(this);
runner.run();
}
def stop() {
start(0)
}
}
DummyDataGenerator.clean()
displayHTML("Imported streaming logic...") // suppress output
You should be able to use the Databricks Labs Data Generator on the Databricks community edition. I'm providing the instructions below.
Running Databricks Labs Data Generator on the community edition
The Databricks Labs Data Generator is a PySpark library, so the code to generate the data needs to be Python. But you should be able to create a view on the generated data and consume it from Scala if that's your preferred language.
You can install the framework on the Databricks community edition by creating a notebook with the cell
%pip install git+https://github.com/databrickslabs/dbldatagen
Once it's installed, you can use the library to define a data generation spec and, by calling build, generate a Spark DataFrame from it.
The following example shows generation of batch data similar to the data set you are trying to generate. It should be placed in a separate notebook cell.
Note: here we generate 10 million records to illustrate the ability to create larger data sets; the library can generate data sets much larger than that.
%python
import dbldatagen as dg

num_rows = 10 * 1000000 # number of rows to generate
num_partitions = 8 # number of Spark dataframe partitions
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]
# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
.withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
.withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
.withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
.withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
.withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
.withColumn("reason", "string", values=delay_reasons, random=True)
)
df_flight_data = flightdata_defn.build()
display(df_flight_data)
You can find information on how to generate streaming data in the online documentation at https://databrickslabs.github.io/dbldatagen/public_docs/using_streaming_data.html
You can create a named temporary view over the data so that you can access it from SQL or Scala using one of two methods:
1: use createOrReplaceTempView
df_flight_data.createOrReplaceTempView("delays")
2: use options for build. In this case the name passed to the data generator instance initializer will be the name of the view, i.e.
df_flight_data = flightdata_defn.build(withTempView=True)
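Either way, once the view exists you can read it from Scala; a minimal sketch, assuming the view name delays from the example above:
%scala
// Query the temp view created by the Python cells above.
val delays = spark.table("delays")
delays.printSchema()
display(delays.limit(10))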
This code will not work on the community edition because of this line:
val dir = s"/dbfs/tmp/$username/new-flights"
as there is no DBFS FUSE mount on the Databricks community edition (it's supported only on full Databricks). It's potentially possible to make it work by:
changing that directory to a local directory, like /tmp or similar, and
adding code (after writer.close()) to list the flights-* files in that local directory and using dbutils.fs.mv to move them into streamDirectory, as sketched below.
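A hypothetical sketch of that workaround (localDir is a made-up name; streamDirectory, writer and the flights-* prefix are the names used in the generator above):
// Write into a plain local directory instead of the /dbfs FUSE path.
val localDir = "/tmp/new-flights"
new java.io.File(localDir).mkdirs()
// ... the generator writes its flights-*.csv files into localDir ...
// Then, after writer.close(), move the generated files onto DBFS so the
// streaming job can pick them up from streamDirectory:
new java.io.File(localDir)
  .listFiles()
  .filter(_.getName.startsWith("flights-"))
  .foreach { f =>
    dbutils.fs.mv(s"file:$localDir/${f.getName}", s"$streamDirectory/${f.getName}")
  }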

Multiple Gatling simulations in parallel with different rampUsers over different times

I have multiple Gatling simulations defined in this manner (imports removed).
class MySimulation1 extends Simulation {
object SimulationObj1 {
var feeder = ...
var random = exec(...)
}
val httpProtocol = ...
val myScenario = scenario("Scenario name").exec(SimulationObj1.random)
setUp(myScenario.inject(
rampUsers(10) over (180 seconds)
)
)
.assert(...)
}
class MySimulation2 extends Simulation {
object SimulationObj2 {
var feeder = ...
var random = exec(...)
}
val httpProtocol = ...
val myScenario = scenario("Scenario name").exec(SimulationObj2.random)
setUp(myScenario.inject(
rampUsers(15) over (300 seconds)
)
)
.assert(...)
}
And then there's another AllSimulations class that simply calls all the simulations so that the scenarios in them can be executed in parallel.
class AllSimulations extends Simulation {
object AllSimulationsObj {
var feeder = ...
var random = exec(...)
}
val httpProtocol = ...
val myScenario = scenario("All scenarios").exec(
new MySimulation1().SimulationObj1.random,
new MySimulation2().SimulationObj2.random)
setUp(myScenario.inject(
rampUsers(10) over (180 seconds)
)
)
.assert(...)
}
The problem is that, in order to have different rampUsers counts over different durations, I'm removing the setUp block from the AllSimulations class, but that gives me the error "No scenario set up".
How can I run all the simulation scenarios in parallel with the rampUsers counts and durations defined in the respective simulation classes?
EDIT: Here's what I tried, but I'm not sure if it makes sense.
class AllSimulations extends Simulation {
setUp(
new MySimulation1().myScenario.inject(rampUsers(10) over (180 seconds)),
new MySimulation2().myScenario.inject(rampUsers(15) over (300 seconds))
)
.assert(...)
}
If you want to run two or more scenarios concurrently, let's say you have two files (EXAMPLE1.scala and EXAMPLE2.scala).
You have to create a separate file (Simulator.scala) as shown below.
EXAMPLE1.SCALA (FILE-1)
...
val Example1_scenario = scenario("EXAMPLE1").exec(RunningForAllTenants())
...
EXAMPLE2.SCALA (FILE-2)
...
val Example2_scenario = scenario("EXAMPLE2").exec(RunningForAllTenants())
...
Simulator.scala
class Simulator extends Simulation {
  setUp(
    new EXAMPLE1().Example1_scenario.inject(rampUsers(10) during (10)).protocols(httpConf1),
    new EXAMPLE2().Example2_scenario.inject(rampUsers(30) during (20)).protocols(httpConf1)
  )
}
Run Simulator.scala, which will automatically run EXAMPLE1.scala and EXAMPLE2.scala concurrently.
I don't think what you propose will work; it doesn't really make sense to execute simulations in parallel, as your results would no longer reflect the true number of concurrent users.
What would work would be to define your scenarios (in different files, if that suits), then have a simulation that injects users into each as desired.
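For example, a minimal sketch of that approach using the Gatling 3 DSL (the scenario bodies and base URL are placeholders; the ramp profiles are the ones from the question):
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class AllScenarios extends Simulation {

  val httpProtocol = http.baseUrl("http://example.com") // placeholder base URL

  // The scenario definitions could equally live in separate files/objects.
  val scenario1 = scenario("Scenario 1").exec(http("request 1").get("/one"))
  val scenario2 = scenario("Scenario 2").exec(http("request 2").get("/two"))

  setUp(
    scenario1.inject(rampUsers(10) during (180.seconds)),
    scenario2.inject(rampUsers(15) during (300.seconds))
  ).protocols(httpProtocol)
}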

Spark-Streaming broadcast variable to custom receiver

I created an application which uses Spark Streaming with a custom receiver for Google Pub/Sub.
I have hit my performance limit and am interested in dropping messages without processing them. My idea was to store() only a subset of the messages that are read.
I used the apache/bahir receiver:
val pullResponse = client.projects().subscriptions().pull(subscriptionFullName, pullRequest).execute()
val receivedMessages = pullResponse.getReceivedMessages.asScala.toList
Utils.LOG.info(s"receivedMessages from PUB/SUB ${receivedMessages.size}")
rateLimiter.acquire(receivedMessages.size)
var factor: Int = 0
if (dropFactorBroad != null) {
factor = dropFactorBroad.value
} else {
Utils.LOG.info("dropFactorBroad is null")
}
val endIndex = if (factor > receivedMessages.length) receivedMessages.length else factor
val messagesToStore = receivedMessages.slice(0, receivedMessages.length - endIndex)
store(messagesToStore.map(x => {
val sm = new SparkPubsubMessage
sm.message = x.getMessage
sm
})
.iterator)
val ackRequest = new AcknowledgeRequest()
ackRequest.setAckIds(receivedMessages.map(x => x.getAckId).asJava)
client.projects().subscriptions().acknowledge(subscriptionFullName, ackRequest).execute()
dropFactorBroad is a Broadcast variable which is updated on every onBatchCompleted (unpersisted and created again).
It is not working; I'm getting:
java.lang.NullPointerException
at com.mag.ingester.ReceiverDropFactorBroadcaster.value(ReceiverDropFactorBroadcaster.scala:20)
at com.mag.pubSubReceiver.PubsubReceiver.receive(PubsubInputDStream.scala:260)
at com.mag.pubSubReceiver.PubsubReceiver$$anon$1.run(PubsubInputDStream.scala:244)
(ReceiverDropFactorBroadcaster is the object that holds dropFactorBroad.)
How can I control what the receiver stores?
Should I kill the receivers, change the variable, and start them again? (How can that be done?)
Thanks

Difference between RoundRobinRouter and RoundRobinRoutingLogic

So I was reading a tutorial about Akka and came across this: http://manuel.bernhardt.io/2014/04/23/a-handful-akka-techniques/. I think he explained it pretty well; I just picked up Scala recently and am having difficulties with the tutorial above.
I wonder what the difference is between RoundRobinRouter and the current RoundRobinRoutingLogic? Obviously the implementation is quite different.
Previously, the implementation with RoundRobinRouter was
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
with processBatch
def processBatch(batch: List[BatchItem]) = {
if (batch.isEmpty) {
log.info(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
} else {
// reset processing state for the current batch
currentBatchSize = batch.size
allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
currentProcessedItemsCount = 0
allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
currentProcessingErrors = List.empty
// distribute the work
batch foreach { item =>
workers ! item
}
}
}
Here's my implementation with RoundRobinRoutingLogic:
var mappings : Option[ActorRef] = None
var router = {
val routees = Vector.fill(100) {
mappings = Some(context.actorOf(Props[Application3]))
context watch mappings.get
ActorRefRoutee(mappings.get)
}
Router(RoundRobinRoutingLogic(), routees)
}
and I changed processBatch as follows:
def processBatch(batch: List[BatchItem]) = {
if (batch.isEmpty) {
println(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
} else {
// reset processing state for the current batch
currentBatchSize = batch.size
allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
currentProcessedItemsCount = 0
allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
currentProcessingErrors = List.empty
// distribute the work
batch foreach { item =>
// println(item.id)
mappings.get ! item
}
}
}
I somehow cannot get this tutorial running; it gets stuck at the point where it iterates over the batch list. I wonder what I did wrong.
Thanks
First of all, you have to distinguish between them.
RoundRobinRouter is a router that uses round-robin to select a connection.
While
RoundRobinRoutingLogic uses round-robin to select a routee.
You can provide your own RoutingLogic (this helped me understand how Akka works under the hood):
import scala.collection.immutable
import akka.routing.{ Routee, RoutingLogic, RoundRobinRoutingLogic, SeveralRoutees }

class RedundancyRoutingLogic(nbrCopies: Int) extends RoutingLogic {
val roundRobin = RoundRobinRoutingLogic()
def select(message: Any, routees: immutable.IndexedSeq[Routee]): Routee = {
val targets = (1 to nbrCopies).map(_ => roundRobin.select(message, routees))
SeveralRoutees(targets)
}
}
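As a usage note: when you build a Router by hand like this, messages go through router.route rather than being sent to a routee directly. A minimal sketch, assuming the router and batch from the question:
// Dispatch each item through the router so the round-robin logic
// picks a (possibly different) routee for every message.
batch.foreach { item =>
  router.route(item, self)
}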
Link to the docs: http://doc.akka.io/docs/akka/2.3.3/scala/routing.html
P.S. This doc is very clear and it has helped me the most.
Actually, I had misunderstood the method and found out that the solution was to use RoundRobinPool, as stated in http://doc.akka.io/docs/akka/2.3-M2/project/migration-guide-2.2.x-2.3.x.html:
For example RoundRobinRouter has been renamed to RoundRobinPool or
RoundRobinGroup depending on which type you are actually using.
from
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
to
val workers = context.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
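With the pool variant the router is itself an ActorRef, so processBatch can keep sending messages directly to workers and the pool handles the round-robin distribution; a minimal sketch assuming the names from the question:
// workers is the pool router ActorRef created above; every item sent to it
// is forwarded to one of the 100 ItemProcessingWorker routees in round-robin order.
batch.foreach { item => workers ! item }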

Processing multiple files as independent RDDs in parallel

I have a scenario where a certain number of operations, including a group by, have to be applied to a number of small (~300 MB each) files. The operation looks like this:
df.groupBy(....).agg(....)
Now, to process multiple files I can use a wildcard "/**/*.csv"; however, that creates a single RDD and partitions it for the operations. Looking at the operations, though, there is a group by that involves a lot of shuffle, which is unnecessary if the files are mutually exclusive.
What I am looking for is a way to create independent RDDs from the files and operate on them independently.
It is more an idea than a full solution and I haven't tested it yet.
You can start with extracting your data processing pipeline into a function.
def pipeline(f: String, n: Int) = {
sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load(f)
.repartition(n)
.groupBy(...)
.agg(...)
.cache // Cache so we can force computation later
}
If your files are small, you can adjust the n parameter to use as small a number of partitions as possible to fit the data from a single file and avoid shuffling. It means you are limiting concurrency, but we'll get back to this issue later.
val n: Int = ???
Next you have to obtain a list of input files. This step depends on the data source, but most of the time it is more or less straightforward:
val files: Array[String] = ???
Next you can map the above list using the pipeline function:
val rdds = files.map(f => pipeline(f, n))
Since we limit concurrency at the level of a single file, we want to compensate by submitting multiple jobs. Let's add a simple helper which forces evaluation and wraps it in a Future:
import scala.concurrent._
import ExecutionContext.Implicits.global
def pipelineToFuture(df: org.apache.spark.sql.DataFrame) = Future {
df.rdd.foreach(_ => ()) // Force computation
df
}
Finally we can use the above helper on the rdds:
val result = Future.sequence(
rdds.map(rdd => pipelineToFuture(rdd)).toList
)
Depending on your requirements you can add onComplete callbacks or use reactive streams to collect the results.
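If you simply want to block until every per-file job has finished, a minimal sketch building on the result value above:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Block the driver until all the per-file pipelines have been computed.
val computed = Await.result(result, Duration.Inf)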
If you have many files, and each file is small (you say 300 MB above, which I would count as small for Spark), you could try using SparkContext.wholeTextFiles, which will create an RDD where each record is an entire file.
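A minimal sketch of that approach (the path is a placeholder):
// wholeTextFiles yields an RDD of (path, fileContents) pairs,
// so each file stays together as a single record.
val perFile = sc.wholeTextFiles("/data/flights/*.csv")
perFile
  .map { case (path, contents) => (path, contents.split("\n").length) } // e.g. line count per file
  .collect()
  .foreach(println)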
In this way we can write multiple RDDs in parallel:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelWriteService implements IApplicationEventListener {
private static final IprogramLogger logger = programLoggerFactory.getLogger(ParallelWriteService.class);
private static ExecutorService executorService = null;
private static List<Future<Boolean>> futures = new ArrayList<Future<Boolean>>();
public static void submit(Callable callable) {
if(executorService==null)
{
executorService=Executors.newFixedThreadPool(15);//Based on target tables increase this
}
futures.add(executorService.submit(callable));
}
public static boolean isWriteSucess() {
boolean writeFailureOccured = false;
try {
for (Future<Boolean> future : futures) {
try {
Boolean writeStatus = future.get();
if (writeStatus == false) {
writeFailureOccured = true;
}
} catch (Exception e) {
logger.error("Erorr - Scdeduled write failed " + e.getMessage(), e);
writeFailureOccured = true;
}
}
} finally {
resetFutures();
if (executorService != null)
executorService.shutdown();
executorService = null;
}
return !writeFailureOccured;
}
private static void resetFutures() {
logger.error("resetFutures called");
//futures.clear();
}
}