Problems joining 2 Kafka streams (using a custom TimestampExtractor) - Scala

I'm having problems joining 2 Kafka streams while extracting the date from a field of my events. The join works fine when I do not define a custom TimestampExtractor, but when I do, the join stops working. My topology is quite simple:
val builder = new StreamsBuilder()
val couponConsumedWith = Consumed.`with`(Serdes.String(),
getAvroCouponSerde(schemaRegistryHost, schemaRegistryPort))
val couponStream: KStream[String, Coupon] = builder.stream(couponInputTopic, couponConsumedWith)
val purchaseConsumedWith = Consumed.`with`(Serdes.String(),
getAvroPurchaseSerde(schemaRegistryHost, schemaRegistryPort))
val purchaseStream: KStream[String, Purchase] = builder.stream(purchaseInputTopic, purchaseConsumedWith)
val couponStreamKeyedByProductId: KStream[String, Coupon] = couponStream.selectKey(couponProductIdValueMapper)
val purchaseStreamKeyedByProductId: KStream[String, Purchase] = purchaseStream.selectKey(purchaseProductIdValueMapper)
val couponPurchaseValueJoiner = new ValueJoiner[Coupon, Purchase, Purchase]() {
  override def apply(coupon: Coupon, purchase: Purchase): Purchase = {
    val discount = (purchase.getAmount * coupon.getDiscount) / 100
    new Purchase(purchase.getTimestamp, purchase.getProductid, purchase.getProductdescription, purchase.getAmount - discount)
  }
}
val fiveMinuteWindow = JoinWindows.of(TimeUnit.MINUTES.toMillis(5))
val outputStream: KStream[String, Purchase] = couponStreamKeyedByProductId.join(purchaseStreamKeyedByProductId,
  couponPurchaseValueJoiner,
  fiveMinuteWindow
)
outputStream.to(outputTopic)
builder.build()
As I said, this code works like a charm when I do not use a custom TimestampExtractor, but when I do (by setting StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG to my custom extractor class; I've double-checked that the class extracts the date properly), the join does not work anymore.
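The extractor itself is nothing special; for reference, it does roughly the following (simplified sketch; the getTimestamp accessor on the Coupon Avro class is an assumption mirroring the Purchase one used above):
import java.text.SimpleDateFormat
import java.util.Locale

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.streams.processor.TimestampExtractor

class EventTimestampExtractor extends TimestampExtractor {

  override def extract(record: ConsumerRecord[AnyRef, AnyRef], previousTimestamp: Long): Long = {
    // SimpleDateFormat is not thread-safe, so this sketch creates one per call
    val format = new SimpleDateFormat("MMM dd yyyy HH:mm:ss.SSS zzz", Locale.ENGLISH)
    record.value() match {
      case coupon: Coupon     => format.parse(coupon.getTimestamp).getTime   // getTimestamp on Coupon is assumed
      case purchase: Purchase => format.parse(purchase.getTimestamp).getTime
      case _                  => record.timestamp() // fall back to the record's own timestamp
    }
  }
}

// registered in the streams/test configuration, e.g.:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, classOf[EventTimestampExtractor].getName)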
I'm testing the topology by running a unit test and passing the following events to it:
val coupon1 = new Coupon("Dec 05 2018 09:10:00.000 UTC", "1234", 10F)
// Purchase within the five minutes after the coupon - The discount should be applied
val purchase1 = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 25.00F)
val purchase1WithDiscount = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 22.50F)
val couponRecordFactory1 = couponRecordFactory.create(couponInputTopic, "c1", coupon1)
val purchaseRecordFactory1 = purchaseRecordFactory.create(purchaseInputTopic, "p1", purchase1)
testDriver.pipeInput(couponRecordFactory1)
testDriver.pipeInput(purchaseRecordFactory1)
val outputRecord1 = testDriver.readOutput(outputTopic,
new StringDeserializer(),
JoinTopologyBuilder.getAvroPurchaseSerde(
schemaRegistryHost,
schemaRegistryPort).deserializer())
OutputVerifier.compareKeyValue(outputRecord1, "1234", purchase1WithDiscount)
I'm not sure whether the step of selecting a new key is discarding the proper timestamp. I have tested a lot of combinations with no luck :(
Any help would be really appreciated!

I'm not sure, because I don't know how thoroughly you've tested your code, but my guess is:
1) your code works with the default timestamp extractor because it uses the time at which you send records into the pipes as the record timestamps, so it basically works because in your test you send the records one after another without a pause.
2) you are using the TopologyTestDriver to run your tests!
Note that it's very useful for testing your business code and the topology as a unit (given these inputs, what are the correct corresponding outputs), but there is no Kafka Streams app actually running in those tests.
In your case you can play with the method advanceWallClockTime(long) in the TopologyTestDriver class to simulate the system time advancing.
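For example, something along these lines (sketch; this is the long-based overload of that test-driver generation):
import java.util.concurrent.TimeUnit

// advance the test driver's wall clock by five minutes
testDriver.advanceWallClockTime(TimeUnit.MINUTES.toMillis(5))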
If you want to actually start the topology, you will have to write an integration test with an embedded Kafka cluster (there is one in the Kafka libraries that works just fine!).
Let me know if that helps :-)

Thank you for replying. I was working on this yesterday and I think I found the problem. As you said, I am using the TopologyTestDriver to run my tests, and when you initialize the TopologyTestDriver class it uses an initialWallClockTime; if you do not provide a value, the TopologyTestDriver will pick up currentTimeMillis:
public TopologyTestDriver(Topology topology, Properties config) {
this(topology, config, System.currentTimeMillis());
}
There is another constructor that allows you to pass in an initialWallClockTime. I've been testing that constructor, but for some reason it does not work for me.
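What I tried is roughly the following (sketch; topology and props stand for the Topology and Properties built elsewhere in the test, and the date parsing is just for illustration):
import java.text.SimpleDateFormat
import java.util.Locale

// Pin the driver's wall clock to the hard-coded event time used in the fixtures
val format = new SimpleDateFormat("MMM dd yyyy HH:mm:ss.SSS zzz", Locale.ENGLISH)
val initialWallClockTime = format.parse("Dec 05 2018 09:10:00.000 UTC").getTime

val testDriver = new TopologyTestDriver(topology, props, initialWallClockTime)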
So, to sum up, my solution has been to create the Purchase and Coupon objects with the current timestamp. I'm still using my custom timestamp extractor, but instead of hardcoding a date I always use the current timestamp, and this way the join works fine.
I'm not fully happy with my final solution because I don't know why the initialWallClockTime does not work for me, but at least the tests are passing now.

Related

Beam pipeline: Kafka to HDFS by time buckets

I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing the events by hour (the hour is read from a timestamp field of the event, not from processing time).
No assumption can be made about the timestamps of the events (they could span multiple days even though 99% of the time they arrive in real time), and there is absolutely no information about the order of the events. My first attempt is a pipeline running in processing time.
My pipeline looks like this:
val kafkaReader = KafkaIO.read[String, String]()
.withBootstrapServers(options.getKafkaBootstrapServers)
.withTopic(options.getKafkaInputTopic)
.withKeyDeserializer(classOf[StringDeserializer])
.withValueDeserializer(classOf[StringDeserializer])
.updateConsumerProperties(
ImmutableMap.of("receive.buffer.bytes", Integer.valueOf(16 * 1024 * 1024))
)
.commitOffsetsInFinalize()
.withoutMetadata()
val keyed = p.apply(kafkaReader)
.apply(Values.create[String]())
.apply(new WindowedByWatermark(options.getBatchSize))
.apply(ParDo.of[String, CustomEvent](new CustomEvent))
val outfolder = FileSystems.matchNewResource(options.getHdfsOutputPath, true)
keyed.apply(
"write to HDFS",
FileIO.writeDynamic[Integer, CustomEvent]()
.by(new SerializableFunction[CustomEvent, Integer] {
  override def apply(input: CustomEvent): Integer = {
    val eventZeroHoured = new Instant(input.eventTime * 1000L).toDateTime.withMinuteOfHour(0).withSecondOfMinute(0)
    (eventZeroHoured.getMillis / 1000).toInt
  }
})
.via(Contextful.fn(new SerializableFunction[CustomEvent, String] {
override def apply(input: CustomEvent): String = {
convertEventToStr(input)
}
}), TextIO.sink())
.withNaming(new SerializableFunction[Integer, FileNaming] {
override def apply(bucket: Integer): FileNaming = {
new BucketedFileNaming(outfolder, bucket, withTiming = true)
}
})
.withDestinationCoder(StringUtf8Coder.of())
.to(options.getHdfsOutputPath)
.withTempDirectory("hdfs://tlap/tmp/gulptmp")
.withNumShards(1)
.withCompression(Compression.GZIP)
)
And this is my WindowedByWatermark:
class WindowedByWatermark(bucketSize: Int = 5000000) extends PTransform[PCollection[String], PCollection[String]] {
val window: Window[String] = Window
.into[String](FixedWindows.of(Duration.standardMinutes(10)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(bucketSize))
)
.withAllowedLateness(Duration.standardMinutes(30))
.discardingFiredPanes()
override def expand(input: PCollection[String]): PCollection[String] = {
input.apply("window", window)
}
}
The pipeline runs flawlessly, but it suffers from incredibly high backpressure during the write phase (the group-by caused by writeDynamic). Most of the events arrive in real time, hence they belong to the same hour. I also tried bucketing the data by hour and minute, without much improvement.
After days of pain, I decided to replicate the same pipeline in Flink using a BucketingSink, and the performance is excellent.
val stream = env
.addSource(new FlinkKafkaConsumer011[String](options.kafkaInputTopic, new SimpleStringSchema(), properties))
.addSink(bucketingSink(options.hdfsOutputPath, options.batchSize))
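The bucketingSink helper is not shown here; a simplified sketch of such a sink built on the stock BucketingSink API would be along these lines (if the hour has to come from the event's own timestamp field, a custom Bucketer would replace the processing-time DateTimeBucketer):
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

// Sketch of a bucketing sink helper writing one directory per hour
def bucketingSink(basePath: String, batchSize: Long): BucketingSink[String] =
  new BucketingSink[String](basePath)
    .setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HH"))
    .setBatchSize(batchSize)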
According to my analysis (even using JMX), the threads in Beam are waiting during the write phase to HDFS (and this causes the pipeline to pause the retrieval of data from Kafka).
I have therefore the following questions:
Is it possible in Beam to push the bucketing down to the sink, the way the BucketingSink does in Flink?
Is there a smarter way to achieve the same thing in Beam?

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is as follows:
1. The Spark application reads a text file from an input HDFS directory.
2. It creates a data frame on top of the file by programmatically specifying the schema. This dataframe is an exact in-memory replica of the input file and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
3. It creates a filtered dataframe from the data frame constructed in step 2. This dataframe contains the unique account numbers, obtained with the distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
4. Using the two dataframes constructed in steps 2 & 3, we get all the records that belong to one account number and run some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
5. Finally, the JSON-parsed data is put into an HBase table.
Here we are facing performance issues when calling the collect method on these data frames: collect fetches all the data onto a single node and then processes it, so we lose the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records, and collecting all of them onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method, which fetches only a limited number of records at a time, can be used in our case, because we have to get all the unique account numbers from the whole data set.
I would appreciate any help on avoiding these collect calls and on other best practices to follow. Code snippets, suggestions, or Git links would be very helpful if anyone has faced a similar issue.
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism in Spark);
the function opens a connection to HBase (thus one per partition) and sends all the contained values through this connection.
This means that every executor opens a connection (which is not serializable, but lives within the boundaries of the function and thus never has to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
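To make that sketch a bit more concrete, filling in the placeholders with the plain HBase client might look roughly like this (the table name, column family and qualifier are made up for illustration, parseJson still stands for your parsing logic, and the row key is assumed to be the account number in the first column):
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

spark.read.csv(inputPath).foreachPartition { rows =>
  // One connection per partition, created on the executor, so it never has to be serialized
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("eqp_json"))   // hypothetical table name
  try {
    for (row <- rows) {
      val data = parseJson(row)                              // your parsing logic
      val put  = new Put(Bytes.toBytes(row.getString(0)))    // row key: account number column
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(data.toString))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}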
You can avoid collect in your code if you have a large data set.
collect returns all the elements of the dataset as an array at the driver program. It is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Passing arguments between Gatling scenarios and simulation

I'm currently creating some Gatling simulations to test a REST API. I don't really understand Scala.
I've created a scenario with several exec and pause steps:
object MyScenario {
val ccData = ssv("cardcode_fr.csv").random
val nameData = ssv("name.csv").random
val mobileData = ssv("mobile.csv").random
val emailData = ssv("email.csv").random
val itemData = ssv("item_fr.csv").random
val scn = scenario("My use case")
.feed(ccData)
.feed(nameData)
.feed(mobileData)
.feed(emailData)
.feed(itemData)
.exec(
http("GetCustomer")
.get("/rest/customers/${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
.pause(3, 5)
.exec(
http("GetOffers")
.get("/rest/offers")
.queryParam("customercode", "${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
}
And I have a simple Simulation:
class MySimulation extends Simulation {
setUp(MyScenario.scn
.inject(
constantUsersPerSec(1) during (1)))
.protocols(EsbHttpProtocol.httpProtocol)
.assertions(
global.successfulRequests.percent.is(100))
}
The application I'm trying to simulate is a multi-locale mobile app, so I've prepared a set of sample data for each locale (US, FR, IT...).
My REST API handles all the locales, therefore I want to make the simulation concurrently execute several instances of MyScenario, each with a different locale sample, to simulate the global load.
Is it possible to execute my simulation without having to create/duplicate the scenario and change the val ccData = ssv("cardcode_fr.csv").random for each one?
Also, each locale has its own load; how can I create a simulation that takes a single scenario and executes it several times concurrently with a different load and different feeders?
Thanks in advance.
From what you've said, I think this may be a good approach:
Start by grouping your data in such a way that you can look up each item you want to send based on the current locale. For this, I would recommend using a Map that maps a locale string (such as "FR") to the item that matches that locale for the field you're looking to fill in. Then, at the start of each iteration of the scenario, you just pick the locale you want to use for that iteration from a list. It would look something like this:
val locales = List("US", "FR", "IT")
val names = Map("US" -> "John", "FR" -> "Pierre", "IT" -> "Guillaume")
object MyScenario {
  private val rand = new scala.util.Random
  //These two lines pick a random locale from your list
  val random_index = rand.nextInt(locales.length)
  val currentLocale = locales(random_index)
  //This line gets the name
  val name = names(currentLocale)
  //Do the rest of your logic here
}
This is a very simplified example - you'll have to figure out how you actually want to retrieve the data from files and put it into a Map structure, as I assume you don't want to hard-code every item for every field.
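If your data lives in per-locale ssv files like the ones in your scenario, a rough sketch of building that map and doing the per-iteration pick on the Gatling session could look like this (file names follow your cardcode_fr.csv convention; file paths, headers and extra columns are glossed over):
import io.gatling.core.Predef._
import scala.io.Source
import scala.util.Random

val locales = List("us", "fr", "it")
val rand = new Random

// One list of card codes per locale, keyed by locale code
val cardCodes: Map[String, IndexedSeq[String]] =
  locales.map(l => l -> Source.fromFile(s"cardcode_$l.csv").getLines().toIndexedSeq).toMap

// At the start of each iteration, pick a locale and put the matching card code on the session
val pickLocaleAndCardCode = exec { session =>
  val locale = locales(rand.nextInt(locales.length))
  val codes = cardCodes(locale)
  session.set("CardCode", codes(rand.nextInt(codes.length)))
}
Your existing chain could then start with pickLocaleAndCardCode instead of the feed(ccData) call.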

Jsoup post, selecting different options

http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html
I would like to retrieve the prices for the different amounts, i.e. 1 kg and 2.5 kg.
The first option is quite straightforward:
val page = Jsoup.connect("http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html").get()
println(page.select("div.media").select("h2.price").text.replaceAll("[a-zA-Z\\s]", ""))
Getting the different options:
for (value <- page.getElementById("opts-7").getElementsByAttribute("value").asScala ) {
println(value.attr("value"))
}
However, I have no idea how to go on with the POST: setting the cookies, headers, etc. I prefer the speed and parsing abilities of Jsoup, but I am also considering taking the easy way out and using Selenium to select the other options and then parsing the page source with Jsoup.
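Something along these lines is what I assume I'd need, but I don't know which parameters, cookies and headers the site actually expects (the parameter name and value below are guesses based on the option element):
import org.jsoup.{Connection, Jsoup}

val url = "http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html"

// Initial GET, mainly to obtain the session cookies
val initial: Connection.Response = Jsoup.connect(url)
  .method(Connection.Method.GET)
  .execute()

// POST back with the cookies; the form parameter name/value are placeholders
val response: Connection.Response = Jsoup.connect(url)
  .cookies(initial.cookies())
  .header("X-Requested-With", "XMLHttpRequest")
  .data("opts-7", "someOptionValue")
  .method(Connection.Method.POST)
  .execute()

println(response.parse().select("div.media").select("h2.price").text)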
So I decided to switch to Selenium. It selects the right option with an emulated click, but the price isn't updating. If I avoid closing the WebDriver and select the 2.5 kg option manually, the price updates accordingly.
var driver: WebDriver = null

def main(args: Array[String]) {
  setupWebDriver()
  driver.navigate().to("http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html")
  driver.findElement(By.xpath("//*[@id=\"opts-7\"]/option[2]")).click()
  val temp = Jsoup.parse(driver.getPageSource)
  println(temp.select("div.product-price").text())
}
That's the main method; here is the WebDriver setup:
def setupWebDriver() {
  val binaryExe = "\\phantomjs.exe"
  val BinaryChrome = "\\chromedriver.exe"
  val caps: DesiredCapabilities = new DesiredCapabilities()
  caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, System.getProperty("user.dir") + binaryExe)
  System.setProperty("phantomjs.binary.path", System.getProperty("user.dir") + binaryExe)
  System.setProperty("webdriver.chrome.driver", System.getProperty("user.dir") + BinaryChrome)
  caps.setJavascriptEnabled(true)
  caps.setBrowserName(BrowserType.GOOGLECHROME)
  driver = new ChromeDriver(caps)
  driver.manage.timeouts.implicitlyWait(10, TimeUnit.SECONDS)
}
Can anyone enlighten me on how I would get the price to update this way?

Why are JodaTime timestamps not re-initializing in Squeryl?

Using Scala, JodaTime, and Squeryl for ORM. There's an annoying problem: once the application starts up, a Timestamp generated using JodaTime is not re-initialized every time it's used. Instead, the time is set once and, annoyingly, stays the same every time the SQL is executed.
Code below. First, the time parameter:
val todayEnd = new Timestamp(new DateMidnight(now, DateTimeZone.forID("America/Los_Angeles")).plusDays(1).getMillis())
And the Squeryl JOIN:
join(DB.jobs, DB.clients.leftOuter, DB.projects.leftOuter)((j,c,p) =>
where((j.teamId === teamId)
and (j.startTime < todayEnd)
and (j.userId isNotNull)
and (j.canceled === false)
and (j.completed === false))
select(j,c,p)
on(j.clientId === c.map(_.id), j.projectId === p.map(_.id)))
The strange part is that if I generate the todayEnd timestamp without JodaTime, then it re-initializes every time. So what is JodaTime doing differently?
Found the problem: apparently the thread managing the JOIN was never successfully shut down and was being re-referenced inside Akka. This meant that the todayEnd variable was never re-initialized.
So the take-home lesson is: manage your threads.
Update
As I have since learned, the time values in the original object were declared as val. As it turns out, they need to be def.
Bad:
val today = new Date()
lazy val today = new Date()
Good:
def today = new Date()
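Applied to the original snippet, that means declaring the boundary as a def so it is recomputed on every query (sketch; DateTime.now() stands in for the now value used above, which must also be re-evaluated rather than captured once in a val):
// Recomputed on every call instead of once at startup
def todayEnd = new Timestamp(
  new DateMidnight(DateTime.now(), DateTimeZone.forID("America/Los_Angeles"))
    .plusDays(1)
    .getMillis)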