The following code works as expected: on each iteration the next value from the valueFeed is popped and written to the output.csv file.
class TestSimulation extends Simulation {
  val valueFeed = csv("input.csv")
  val writer = {
    val fos = new java.io.FileOutputStream("output.csv")
    new java.io.PrintWriter(fos, true)
  }
  val scn = scenario("Test Sim")
    .repeat(2) {
      feed(valueFeed)
        .exec(session => {
          writer.println(session("value").as[String])
          session
        })
    }
  setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
When the feeder creation is inlined in the feed call, the behaviour is still exactly the same:
class TestSimulation extends Simulation {
  val writer = {
    val fos = new java.io.FileOutputStream("output.csv")
    new java.io.PrintWriter(fos, true)
  }
  val scn = scenario("Test Sim")
    .repeat(2) {
      feed(csv("input.csv"))
        .exec(session => {
          writer.println(session("value").as[String])
          session
        })
    }
  setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
Since the feeder creation is not extracted, I would not expect each iteration to use the same feeder, but rather to create its own feeder instance.
Why, then, does the behaviour imply that the same feeder is being used, with the first value from the input file not always being written to the output?
Example input file (data truncated, tested with more lines to prevent empty feeder exception):
value
1
2
3
4
5
Because csv(...) is in fact a FeederBuilder, which is called once to produce the feeder used within the scenario.
The Gatling DSL defines builders - these are executed only once, at startup, so even when you inline the call you get a feeder shared between all users, as the same (and only) builder is used for every user.
If you want each user to have its own copy of the data, you can't use the .feed method, but you can get all the records and use other looping constructs to iterate through them:
val records = csv("foo.csv").records

foreach(records, "record") {
  exec(flattenMapIntoAttributes("${record}"))
}
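Applied to the simulation in the question, that approach would look roughly like this (a sketch only, reusing the writer and input.csv from the question and the same foreach / flattenMapIntoAttributes API as above):

val records = csv("input.csv").records

val scn = scenario("Test Sim")
  .foreach(records, "record") {
    // flatten each record's map into session attributes, then use them as before
    exec(flattenMapIntoAttributes("${record}"))
      .exec(session => {
        writer.println(session("value").as[String])
        session
      })
  }

This way every user runs through its own pass over the full record list instead of sharing a single feeder.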
Related
I am trying to upload data in a Spark job through API calls, where the API has a payload limit of 5 MB per call (a 3rd-party API limitation). I am accumulating data to build the API body up to the payload limit in order to minimize the number of API calls, and I am doing this inside the foreachPartition method of the RDD, with some comments added for analysis. This code runs perfectly fine when I run the Spark job locally on my machine (the APIs get called and the data gets uploaded), but it does not behave the same way in the GCP cluster. When running the job in a GCP Dataproc cluster, data is not uploaded through the APIs, so I believe the code inside foreachPartition is not being called.
While running locally I can see all the log messages (1, 2, 3, 4, 5, 6, 7), but while running the job in the GCP cluster I can see only a few of them (1, 3).
Sample code is below for your reference.
I would appreciate your suggestions to make it run in the GCP cluster as well.
def exportData(
    client: ApiClient,
    batchSize: Int,
): Unit = {
  val exportableDataRdd = getDataToUpload() //this is rdd of type RDD[UserDataObj]
  logger.info(s"1. exportableDataRdd count:${exportableDataRdd.count}") //not a good practice to call count here but calling just for debugging

  exportableDataRdd.foreachPartition { iterator =>
    logger.info(s"2. perPartition iteration")
    perPartitionMethod(client, iterator, batchSize)
  }

  logger.info(s"3. Data export completed")
}

def perPartitionMethod(
    client: ApiClient,
    iterator: Iterator[UserDataObj],
    batchSize: Int
): Unit = {
  logger.info(s"4. Inside perPartition")

  iterator.grouped(batchSize).foreach { userDataGroup =>
    val payLoadLimit = 5000000 //5 MB
    val groupSize = userDataGroup.size
    var counter = 0
    var batchedUsersData = Seq[UserDataObj]()

    userDataGroup.map { user =>
      counter = counter + 1
      val curUsersDataSet = batchedUsersData :+ user
      val body = Map[String, Any]("data" -> curUsersDataSet.map(_.toMap))
      val apiPayload = Serialization.write(body)
      val size = apiPayload.getBytes().length
      if (size > payLoadLimit) {
        val usersToUpload = batchedUsersData
        logger.info(s"5. API called with batch size: ${usersToUpload.size}")
        uploadDataThroughAPI(usersToUpload, client) //method to upload data through API
        batchedUsersData = Seq[UserDataObj](user)
      } else {
        batchedUsersData = batchedUsersData :+ user
      }
      //upload left out data
      if (counter == groupSize && batchedUsersData.size > 0) {
        uploadDataThroughAPI(batchedUsersData, client)
        logger.info(s"6. API called with batch size: ${batchedUsersData.size}")
      }
    }
  }
  logger.info(s"7. perPartition completed")
}
I have two feeders in my test scenario.
When I use one of them in the first request it works alright; however, when I use it in the next request block it doesn't work.
I have tried changing the position of feed(feederName), but I still have the same problem.
Here is a snippet of my test scenario with some comments explaining what's not working.
//the Two feeders
val kmPerYearFeeder = Iterator.continually(
Map("kmPerYear" -> Random.shuffle(List("10000", "15000", "20000", "25000", "30000", "35000", "40000", "45000", "50000")).head)
)
val customerTypes = Iterator.continually(
Map("customerType" -> Random.shuffle(List("P","B")).head)
)
//here the customerTypes feeder is working
val homepage = feed(customerTypes)
.exec(http("homepage")
.get("/?customer_type=${customerType}"))
//this block is not really important but working alright
val pdp = exec(http("homepage")
....
// the feeder here doesn't work
val calculate_rate = feed(kmPerYearFeeder)
.exec(http("calculate_random_rate")
.get(session => session("random_pdp_link").as[String] + "?inquiry_type=&km_per_year=${kmPerYear}")
.check(status.is(200)))
val pdp_scenario = scenario("PDP").exec(homepage).exec(pdp).exec(calculate_rate)
setUp(
pdp_scenario.inject(
rampUsers(10) during (5 seconds),
).protocols(httpProtocol),
)
These are the GET requests that are executed (taken from the logger):
GET ********?inquiry_type=&km_per_year=$%7BkmPerYear%7D
GET ********?inquiry_type=&km_per_year=$%7BkmPerYear%7D
GET ********?inquiry_type=&km_per_year=$%7BkmPerYear%7D
Your issue is where you attempt to use the Gatling EL to reference the kmPerYear session var in
.get(session => session("random_pdp_link").as[String] + "?inquiry_type=&km_per_year=${kmPerYear}")
this version of .get takes a session function (which you are using to get "random_pdp_link"), but the Gatling EL doesn't work in session functions.
You either need to get it manually using
session("kmPerYear").as[String]
or reference "random_pdp_link" via the EL and not use the session-function version of .get, e.g.:
.get("${random_pdp_link}?inquiry_type=&km_per_year=${kmPerYear}")
I am new to Scala and Gatling. I need to run a scenario only if the previous scenario passed, using doIf.
My code is:
HttpRequest
object CompanyProfileRequest {

  val check_company_profile: HttpRequestBuilder = http("Create Company Profile")
    .get(onboarding_url_perf + "/profile")
    .headers(basic_headers)
    .headers(auth_headers)
    .check(status.is(404).saveAs("NOT_FOUND"))

  val create_company_profile: HttpRequestBuilder = http("Create Company Profile")
    .post(onboarding_url_perf + "/profile")
    .headers(basic_headers)
    .headers(auth_headers)
    .body(RawFileBody("data/company/company_profile_corporation.json")).asJson
    .check(status.is(200))
    .check(jsonPath("$.id").saveAs("id"))
}
The scenario class is:
object ProfileScenarios {

  val createProfileScenarios: ScenarioBuilder = scenario("Create profile Scenario")
    .exec(TokenScenario.getCompanyUsersGwtToken)
    .exec(CompanyProfileRequest.check_company_profile)
    .doIf(session => session.attributes.contains("NOT_FOUND")) {
      exec(CompanyProfileRequest.create_company_profile).exitHereIfFailed
    }
}
And the simulation is:
private val createProfile = ProfileScenarios
  .createProfileScenarios
  .inject(constantUsersPerSec(1) during (Integer.getInteger("ramp", 1) second))

setUp(createProfile.protocols(httpConf))
Whenever I run this simulation, I am not able to get this condition to work:
.doIf(session => session.attributes.contains("NOT_FOUND"))
Any help is much appreciated.
Regards,
Vikram
I was able to get your example to work, but here's a better way...
The main issue with using
.check(status.is(404).saveAs("NOT_FOUND"))
and
.doIf(session => session.attributes.contains("NOT_FOUND"))
to implement conditional switching is that you've now got a check that will cause check_company_profile to fail when it really shouldn't (when you get a 200, for example).
A nicer way is to use a check transform to insert a boolean value into the "NOT_FOUND" variable. This way, your check_company_profile action can still pass when the office exists, and the doIf construct can just use the EL syntax and be much clearer as to why it's executing.
val check_company_profile: HttpRequestBuilder = http("Create Company Profile")
  .get(onboarding_url_perf + "/profile")
  .headers(basic_headers)
  .headers(auth_headers)
  .check(
    status.in(200, 404), // both statuses are valid for this request
    status.transform(status => 404.equals(status)).saveAs("OFFICE_NOT_FOUND") // if the office does not exist, set a boolean flag in the session
  )
Now that you've got a boolean session variable ("OFFICE_NOT_FOUND"), you can just use that in your doIf...
.doIf("${OFFICE_NOT_FOUND}") {
exec(CompanyProfileRequest.create_company_profile).exitHereIfFailed
}
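Putting it together, the scenario from the question would then read (a sketch reusing the object names from the question):

val createProfileScenarios: ScenarioBuilder = scenario("Create profile Scenario")
  .exec(TokenScenario.getCompanyUsersGwtToken)
  .exec(CompanyProfileRequest.check_company_profile)
  .doIf("${OFFICE_NOT_FOUND}") {
    exec(CompanyProfileRequest.create_company_profile).exitHereIfFailed
  }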
I am using Apache Spark to consume real-time data from Apache Kafka; the data comes from sensors in JSON format.
Example of the data format:
{
"meterId" : "M1",
"meterReading" : "100"
}
I want to apply rules to raise alerts in real time, i.e. if I have not received data from meter "M1" for the last 2 hours, or if the meter reading exceeds some limit, an alert should be created.
How can I achieve this in Scala?
I will respond here as an answer - too long for a comment.
As I said, the JSON in Kafka should be one message per line - send this instead: {"meterId":"M1","meterReading":"100"}
If you are using Kafka, there is KafkaUtils, with which you can create a stream:
JavaPairDStream<String, String> input = KafkaUtils.createStream(jssc, zkQuorum, group, topics);
The pair is <kafkaTopicName, jsonMessage>, so basically you can look only at the JSON message if you don't need the topic name.
On input you can use the many methods described in the JavaPairDStream documentation - e.g. you can use map to extract just the messages into a simple JavaDStream.
And of course you can use a JSON parser like Gson, Jackson, or org.json; it depends on your use case, performance requirements, and so on.
So you need to do something like this:
JavaDStream<String> messagesOnly = input.map(
    new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> message) {
            return message._2();
        }
    }
);
Now you have only the messages, without the Kafka topic name, and you can apply the logic you described in the question.
JavaDStream<String> alerts = messagesOnly.filter(
    new Function<String, Boolean>() {
        public Boolean call(String message) {
            // here use a gson parser, e.g.
            // keep only messages whose meterReading exceeds the limit
            // return true or false based on your logic
        }
    }
);
And here you have only the alert messages - you can send them somewhere else.
-- AFTER EDIT
Below is the example in Scala:
// batch every 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
def filterLogic(message: String): Boolean = {
  // here your logic for filtering
}
// map _._2 takes your json messages
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// filtered data after filter transformation
val filtered = messages.filter(m => filterLogic(m))
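As a rough illustration of filterLogic, here is a sketch that uses Jackson (one of the parsers mentioned above) to parse each message and flags readings above an arbitrary limit of 100; the parser choice and the threshold are assumptions, and this only covers the threshold rule, not the "no data for 2 hours" rule:

import com.fasterxml.jackson.databind.ObjectMapper

def filterLogic(message: String): Boolean = {
  // a new mapper per message keeps the sketch simple; in practice you would reuse one
  val mapper = new ObjectMapper()
  val node = mapper.readTree(message)
  // keep only readings above the (hypothetical) limit
  node.get("meterReading").asText().toInt > 100
}

// print the alerts for a quick check; in a real job you would push them elsewhere
filtered.print()

ssc.start()
ssc.awaitTermination()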
I'm new to scala and extremely new to scalaz. Through a different stackoverflow answer and some handholding, I was able to use scalaz.stream to implement a Process that would continuously fetch twitter API results. Now i'd like to do the same thing for the Cassandra DB where the twitter handles are stored.
The code for fetching the twitter results is here:
def urls: Seq[(Handle, URL)] = {
  Await.result(
    getAll(connection).map { List =>
      List.map(twitterToGet =>
        (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
      )
    },
    5 seconds)
}
val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
  url => Task.delay {
    val finalResult = callTwitter(url)
    if (finalResult.tweets.nonEmpty) {
      connection.updateTwitter(finalResult)
    } else {
      println("\n" + finalResult.handle + " does not have new tweets")
    }
    s"\ntwitter Fetch & database update completed"
  }
}

val P = Process

val process =
  (time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url).
    through(fetchUrl)

val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle,URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor to run the above twitter fetching code.
My question is: what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then have a delay before getting ALL the database results again. Should I use the same architecture as the twitter-fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance
Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the twitter processing. What I actually have is a bit more complex as it retrieves the database results continuously and sends them off to an actor for the twitter processing. The style of retrieving the results follows my original code from my question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")

def myEffect = channel.lift[Task, simpleClient, String] {
  connection: simpleClient => Task.delay {
    val results = Await.result(
      getAll(connection).map { List =>
        List.map(twitterToGet =>
          (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
        )
      },
      5 seconds)
    println("Query Successful, results= " + results + " at " + format.print(System.currentTimeMillis()))
    twitterFetch ! fetched(connection, results)
    s"database fetch completed"
  }
}

val P = Process

val process =
  (time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
    through(myEffect)))

val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the Cassandra connection.
The line
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
changed from zipWith to flatMap, because I wanted to retrieve the results continuously instead of just once.
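For reference, the "attach a sink to the end of the stream" variant mentioned at the top would look roughly like this; fetchHandles and processTwitter are hypothetical helpers wrapping the Cassandra query and the twitter-fetching code, and the connection and implicits are the same ones as in the code above:

// query through a channel, then hang the twitter processing off the end as a sink
val queryChannel = channel.lift[Task, simpleClient, Seq[(Handle, URL)]] { conn =>
  Task.delay(fetchHandles(conn)) // hypothetical: runs the Cassandra query
}

val twitterSink = channel.lift[Task, Seq[(Handle, URL)], Unit] { results =>
  Task.delay(processTwitter(results)) // hypothetical: runs the twitter-fetching code
}

val pipeline =
  time.awakeEvery(3.second)
    .flatMap(_ => P.emit(connection))
    .through(queryChannel)
    .to(twitterSink)

pipeline.run.run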