I have an array that I want to use for 2 feeders. I was expecting each feeder to be able to use all the values in the array, but the values seem to run out.
val baseArray = Array(
  Map("transactionId" -> "q-1"),
  Map("transactionId" -> "q-2"),
  Map("transactionId" -> "q-3"))
val feeder_getA = baseArray.clone.queue
val scn_getInsuredOrPrincipals = scenario("getInsuredOrPrincipals")
  .feed(feeder_getA)
  .exec(http("request_getA").get("/getA/${transactionId}"))

val feeder_getB = baseArray.clone.queue
val scn_getInsuredOrPrincipal = scenario("getInsuredOrPrincipal")
  .feed(feeder_getB)
  .exec(http("request_getB").get("/getB/${transactionId}"))
setUp(
  scn_getInsuredOrPrincipals.inject(
    atOnceUsers(3), // 2
    rampUsers(3) over (5 seconds)
  ),
  scn_getInsuredOrPrincipal.inject(
    atOnceUsers(3), // 2
    rampUsers(3) over (5 seconds)
  )
)
I get an error saying the feeder is empty after 3 values are consumed... I was assuming feeder_getA and feeder_getB would each get 3 values, so each scenario would get an equal number of values. That doesn't seem to be happening. It's almost as if the clone isn't working.
The issue is that your feeders are defined using the queue strategy, which runs through the elements and then fails if no more are available:
val feeder_getA = baseArray.clone.queue
You need to use the circular strategy, which goes back to the beginning:
val feeder_getA = baseArray.clone.circular
For more information see the docs.
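For reference, here is a minimal sketch of the question's two feeders rewritten with the circular strategy (same names as above); once the last record is reached, each feeder simply wraps around to the first one, so it never runs empty regardless of how many users are injected:
// Circular feeders: Gatling restarts from the first record instead of failing.
val feeder_getA = baseArray.clone.circular
val feeder_getB = baseArray.clone.circular

val scn_getInsuredOrPrincipals = scenario("getInsuredOrPrincipals")
  .feed(feeder_getA)
  .exec(http("request_getA").get("/getA/${transactionId}"))

val scn_getInsuredOrPrincipal = scenario("getInsuredOrPrincipal")
  .feed(feeder_getB)
  .exec(http("request_getB").get("/getB/${transactionId}"))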
I have the following code where I want to get a DataFrame dfDateFiltered from dfBackendInfo containing all rows with RowCreationTime greater than the timestamp latestRowCreationTime:
val latestRowCreationTime = dfVersion.agg(max("BackendRowCreationTime")).first.getTimestamp(0)
val dfDateFiltered = dfBackendInfo.filter($"RowCreationTime" > latestRowCreationTime)
The problem I see is that the first line triggers a job on the Databricks cluster, making it slower.
Is there any way to filter in a better way (for example, using only a transformation instead of an action)?
Below are the schemas of the 2 DataFrames:
case class Version(BuildVersion: String,
                   MainVersion: String,
                   Hotfix: String,
                   BackendRowCreationTime: Timestamp)

case class BackendInfo(SerialNumber: Integer,
                       NumberOfClients: Long,
                       BuildVersion: String,
                       MainVersion: String,
                       Hotfix: String,
                       RowCreationTime: Timestamp)
The below code worked:
val dfLatestRowCreationTime1 = dfVersion.agg(max($"BackendRowCreationTime").as("BackendRowCreationTime")).limit(1)
val latestRowCreationTime = dfLatestRowCreationTime1.withColumn(
  "BackendRowCreationTime",
  when($"BackendRowCreationTime".isNull, DefaultTime).otherwise($"BackendRowCreationTime")) // keep the real max when it is not null
val dfDateFiltered = dfBackendInfo.join(latestRowCreationTime,
  dfBackendInfo.col("RowCreationTime").gt(latestRowCreationTime.col("BackendRowCreationTime")))
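For what it's worth, here is a transformation-only sketch of the same idea (my illustration, assuming Spark 2.1+ for crossJoin; DefaultTime is the same placeholder used in the code above):
import org.apache.spark.sql.functions.{coalesce, lit, max}

// Single-row DataFrame holding the max timestamp (or DefaultTime when there is none).
val maxTime = dfVersion.agg(
  coalesce(max($"BackendRowCreationTime"), lit(DefaultTime)).as("BackendRowCreationTime"))

// Non-equi join keeps only the newer rows; nothing runs until dfDateFiltered is used.
val dfDateFiltered = dfBackendInfo
  .crossJoin(maxTime)
  .filter($"RowCreationTime" > $"BackendRowCreationTime")
  .drop("BackendRowCreationTime")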
Hi, I am writing Scala code for Apache Spark.
The value of my local variable country is not reflected after the RDD iteration is done.
I assign a value to country inside the RDD iteration after checking a condition. While the RDD is iterating, the value is available in country, but once control comes out of the loop the value is lost.
import org.apache.spark.sql.SparkSession
import java.lang.Long

object KPI1 {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\shivam docs\\hadoop-2.6.5.tar\\hadoop-2.6.5");
    val spark = SparkSession.builder().appName("KPI1").master("local").getOrCreate();
    val textFile = spark.read.textFile("C:\\shivam docs\\HADOOP\\sample data\\wbi.txt").rdd;

    val splitData = textFile.map {
      line => {
        val token = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        (token(0), token(10).replace("\"", "").replace(",", ""));
      }
    };

    // splitData.max()._2;
    var maxele = 0l;
    var index = 0;
    var country = "";

    splitData.foreach(println);

    for (ele <- splitData) {
      val data = Long.parseLong(ele._2);
      if (maxele < data) {
        maxele = data;
        println(maxele);
        country = ele._1;
        println(country);
      }
    };

    println("***************************** " + country + maxele);
    spark.close()
  }
}
The country variable should not still hold its default (empty) value after the loop.
Both the for comprehension and foreach execute their bodies on the executors, not on the driver, which is why the driver-side variables keep their default values. I ran my sample code on a single-node cluster with 4 executors, and you can see that the execution happened on two different executors (the thread ids make this evident).
Sample
val baseRdd = spark.sparkContext.parallelize(Seq((1, 2), (3, 4)))
for (h <- baseRdd) {
  println("Thread id " + Thread.currentThread().getId)
  println("Value " + h)
}
Output
Thread id 48
Value (1,2)
Thread id 50
Value (3,4)
If you still want your expected result, follow either of the options below (a sketch of option 2 follows this note).
1. Change your Spark context configuration to master("local[1]"). This will run your job with a single executor thread.
2. collect() your splitData before you perform for(ele<-splitData){...}
Note: both options are strictly for testing or experimental purposes only and will not work against large datasets.
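A minimal sketch of option 2, assuming splitData is the (country, value-string) pair RDD built in the question:
// collect() brings the whole dataset to the driver, so plain local variables work again.
// Only safe while splitData comfortably fits in driver memory.
val localData = splitData.collect()

var maxele = 0L
var country = ""
for (ele <- localData) {
  val data = ele._2.toLong
  if (maxele < data) {
    maxele = data
    country = ele._1
  }
}
println("***************************** " + country + maxele)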
When you use variables inside Executors, Spark (under YARN/Mesos etc.) creates a new instance of each variable per Executor. This is why you don't see any update to your variable (the updates occur only on the Executors; none is retrieved back to the Driver). If you want to accomplish this, you should use Accumulators:
Both 'maxele' & 'country' should be Accumulators.
You can read about it here and here
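As a side note (this is not the accumulator approach the answer above recommends, just an alternative sketch), a reduce over the RDD avoids driver-side mutable state entirely; it assumes splitData is the (country, value-string) RDD from the question:
// All comparisons happen on the executors; only the winning pair reaches the driver.
val (country, maxele) = splitData
  .map { case (c, v) => (c, v.toLong) }
  .reduce((a, b) => if (a._2 >= b._2) a else b)

println("***************************** " + country + maxele)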
I am trying to print the count of a DataFrame, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure; otherwise the results are simply non-deterministic. While the latest versions provide an escape hatch (it is possible to hint to Spark that the function shouldn't be re-executed), it won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go: Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decision based on such object.
Unfortunately, both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
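For illustration only, this is what a stateless version of the question's UDF could look like; note that it deliberately drops the change_cnt logic (it multiplies every non-null value), so it is a sketch of a pure UDF rather than a drop-in replacement:
import org.apache.spark.sql.functions.udf

// Pure: the output depends only on the input, so Spark can re-execute it on every
// count()/show()/collect() without changing the result.
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal] { sval =>
  Option(sval).map(_.multiply(new java.math.BigDecimal("1000")))
}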
Your Dataframe is a distributed dataset and trying to do a count() returns unpredictable results since the count() can be different in each node. Read the documentation about RDDs below. It is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd
I have defined a mutable map of maps
import scala.collection.mutable.Map
val default = Map.empty[String, Int].withDefaultValue(0)
val count = Map.empty[Any, Map[String, Int]].withDefaultValue(default)
which I populate/update as in
count("furniture")("table") += 1
count("furniture")("chair") = 6
count("appliance")("dishwasher") = 1
How can I iterate over all items in count? And why does count.keys return an empty Set()?
withDefaultValue does not create a new Map when no value exists in the collection; it just returns the default value for such requests, and your changes are applied to this one default value.
count("furniture")("table") += 1
count("furniture")("chair") = 6
count("appliance")("dishwasher") = 1
count("banana") // will return Map with "table", "chair" & "dishwasher"
is equivalent to
default("table") += 1
default("chair") = 6
default("dishwasher") = 1
And since this default value is returned for any missing key, the same default map is returned on every call.
Your code will work if you write it like this:
count("furniture") = Map.empty[String, Int].withDefaultValue(0)
count("appliance") = Map.empty[String, Int].withDefaultValue(0)
count("furniture")("table") += 1
count("furniture")("chair") = 6
count("appliance")("dishwasher") = 1
There are several problems with your approach:
Issue #1:
val default = Map.empty[String,Int].withDefaultValue(0)
defines a value default. There is only one instance of this map, and the name cannot be rebound to a new instance, since you defined it as a val.
That means that your count map has a default value which is always the same instance of an empty map. Since count is empty, count("furniture") or count("appliance") is exactly the same as just default.
Issue #2:
withDefaultValue does not add entries to a map; it just returns a default for undefined keys.
See #mavarazys answer
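Not shown in either answer above, but a common alternative is getOrElseUpdate, which creates and stores a fresh inner map the first time a key is seen, so count.keys and iteration behave as expected (a sketch reusing the names from the question):
import scala.collection.mutable

val count = mutable.Map.empty[Any, mutable.Map[String, Int]]

// Returns the inner map for `key`, inserting a new one (with default 0) on first access.
def bucket(key: Any): mutable.Map[String, Int] =
  count.getOrElseUpdate(key, mutable.Map.empty[String, Int].withDefaultValue(0))

bucket("furniture")("table") += 1
bucket("furniture")("chair") = 6
bucket("appliance")("dishwasher") = 1

// The outer map now really contains the keys, so iteration works:
for ((category, inner) <- count; (item, n) <- inner)
  println(s"$category / $item -> $n")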
I have tasks that I want to execute concurrently, and each task takes a substantial amount of memory, so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
    s.sum.toLong
  }
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected from the output was 120 => 84609924480, but the output was rather random. The returned collection size differed from execution to execution. Most of the time a count was missing, even though, looking at the console, all the tasks were executed. I thought flatMap waits for the parallel executions in map to complete before returning. What should I do to always get the right result using par? Thanks
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed from 2.11.6, so if you use that (or higher) Scala version, you should not see the strange behavior.
And about the overflow, I still think that your expected value is wrong. The sum is overflowing because the list contains integers (which are 32-bit) while the total sum exceeds the integer limit. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code
val correctValue = 1l * n * (n + 1) / 2 // use math formula
var bruteForceValue = 0l // in case you don't trust math :) It's Long because of 0l
for (i ← 1 to n) bruteForceValue += i // iterate through range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks #kaktusito.
It worked after I changed the grouped list to a Vector or Seq, i.e. from (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...
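For completeness, here is a sketch (my own, not from the thread) combining both points: grouping over a Vector to sidestep the pre-2.11.6 bug, and widening to Long before summing to avoid the Int overflow:
def runme(n: Int = 120): Vector[Long] =
  (1 to n).grouped(2).toVector.flatMap { tuple =>
    tuple.par.map { x =>
      println(s"Running $x")
      val s = (1 to 100000).toList      // sizeable allocation, as in the question
      s.map(_.toLong).sum               // widen to Long first: 5000050000 per task
    }.toList                            // back to a sequential collection before flatMap
  }

val result = runme()
println(result.size + " => " + result.sum) // 120 => 600006000000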