Testing Twitter with Spark Streaming API - scala

I am new to Spark's streaming framework and am trying to process the Twitter stream.
I am writing test cases for it, and I understand that I can use StreamingSuiteBase, which lets me feed input to my functions as a stream.
But I have written a function that takes DStream[Status] as input and, after processing, gives DStream[String] as output.
The API I am using from StreamingSuiteBase is testOperation.
test("Filter only words Starting with #") {
  val inputTweet = List(List("this is #firstHash"), List("this is #secondHash"), List("this is #thirdHash"))
  val expected = List(List("#firstHash"), List("#secondHash"), List("#thirdHash"))
  testOperation(inputTweet, TransformTweets.getText _, expected, ordered = false)
}
And this is the function the input is sent to:
def getText(englishTweets: DStream[Status]): DStream[String] = {
  println(englishTweets.toString)
  val hashTags = englishTweets.flatMap(x => x.getText.split(" ").filter(_.startsWith("#")))
  hashTags
}
But I am getting a "type mismatch" error because of DStream[Status] versus DStream[String]. How do I mock DStream[Status]?

I resolved this by building the Twitter status with the createStatus API of TwitterObjectFactory. There was no need to mock Status. Even if you manage to mock it, there are serialization issues. So this is what worked best for me:
val rawJson = Source.fromURL(getClass.getResource("/tweetStatus.json")).getLines.mkString
val tweetStatus = TwitterObjectFactory.createStatus(rawJson)
Hope this helps someone!
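A complementary option, sketched here under the assumption that only the hashtag logic needs unit testing: keep that logic in a pure function on String, so most tests need neither a Status nor a DStream. The names HashTagLogic and extractHashTags are made up for this illustration; the DStream version would then be a thin wrapper along the lines of englishTweets.flatMap(s => HashTagLogic.extractHashTags(s.getText)).

```scala
object HashTagLogic {
  // Pure function: no Spark or Twitter types involved, so it can be
  // tested with plain strings. It mirrors the split/filter from getText.
  def extractHashTags(text: String): List[String] =
    text.split(" ").filter(_.startsWith("#")).toList
}
```

The Status built with TwitterObjectFactory is then only needed for the one test that covers the wrapper itself.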

Related

log error from catch block to cosmos db - spark

Objective: retrieve objects from an S3 bucket using a 'get' API call, write each retrieved object to Azure Data Lake and, in case of errors such as 404s (object not found), write the error message to Cosmos DB.
"my_dataframe" consists of a column (s3ObjectName) with object names like:
s3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
// retry function that writes the error to Cosmos DB in the event of failure
def retry[T](n: Int)(fn: => T): T = {
  Try {
    fn
  } match {
    case Success(x) => x
    case Failure(t: Throwable) => {
      Thread.sleep(1000)
      if (n > 1) {
        retry(n - 1)(fn)
      } else {
        val loggerDf = Seq((t.toString)).toDF("Description")
          .withColumn("Type", lit("Failure"))
          .withColumn("id", uuid())
        loggerDf.write.format("cosmos.oltp").options(ExceptionCfg).mode("APPEND").save()
        throw t
      }
    }
  }
}
// execute the S3 get API call
my_dataframe.rdd.foreachPartition(partition => {
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val clientRegion: Regions = Regions.US_EAST_1
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(clientRegion)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    retry (2) {
      val objectKey = x.getString(0)
      val i = s3client.getObject(s3bucket_name, objectKey).getObjectContent
      val inputS3String = IOUtils.toString(i, "UTF-8")
      val filePath = s"${data_lake_file_path}"
      val file = new File(filePath)
      val fileWriter = new FileWriter(file)
      val bw = new BufferedWriter(fileWriter)
      bw.write(inputS3String)
      bw.close()
      fileWriter.close()
    }
  })
})
When the above is executed, it fails with the following error:
Caused by: java.lang.NullPointerException
This error occurs in the retry function when it tries to create the dataframe loggerDf and write it to Cosmos DB.
Is there another way to write the error messages to Cosmos DB?
Maybe this isn't a good fit for Spark. There is already Hadoop tooling for this type of S3 file transfer that does what you are doing.
If you still feel that Spark is the right tool:
Split this into a reporting problem and a data transfer problem.
Create and test a list of the files to see whether they are valid. Write a UDF that does the dirty work of creating a dataframe of good/bad files.
Report the files that aren't valid (to Cosmos).
Transfer the files that are valid.
The NullPointerException most likely arises because toDF and the DataFrame writer need the SparkSession, which is only usable on the driver, while retry runs inside foreachPartition on the executors. If you want to write errors to Cosmos DB from there, you'll need to use an "out of band" method to initiate the connection from the executors. (Think: initiating a JDBC connection from inside the partition.foreach.)
As a lower standard, if you only want to know whether a failure happened, you could use accumulators. They aren't made for logging, but they do transfer information from the executors to the driver, which would let you write something back to Cosmos from the driver. They are really intended for simply counting whether something has happened, and they can double count if an executor's task is retried, so they're not perfect. (If this type of failure is extremely rare, it's likely suitable. If it happens a lot, it's not.)
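A minimal sketch of the out-of-band idea, assuming the Cosmos write can be expressed as a plain callback: retry takes a report function instead of building a DataFrame, so nothing in it touches the SparkSession. The RetrySketch object and the report and delayMillis parameters are names made up for this illustration; the real callback would wrap whatever Cosmos client you create inside foreachPartition.

```scala
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Runs fn up to `attempts` times. After the last failed attempt the error
  // is handed to `report` (a hypothetical hook, e.g. a write through a
  // client created on the executor) and then rethrown.
  def retry[T](attempts: Int, delayMillis: Long)(report: Throwable => Unit)(fn: => T): T =
    Try(fn) match {
      case Success(x) => x
      case Failure(_) if attempts > 1 =>
        Thread.sleep(delayMillis)
        retry(attempts - 1, delayMillis)(report)(fn)
      case Failure(t) =>
        report(t)
        throw t
    }
}
```

Inside partition.foreach the call would look roughly like RetrySketch.retry(2, 1000L)(t => writeErrorToCosmos(t)) { ... }, where writeErrorToCosmos is a placeholder for your own executor-side write.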

Akka FileIO.fromPath - How to deal with IOResult and get the data instead?

I have looked at a lot of examples and posts about this. I got it working one way, but I haven't quite grasped the idea yet: I'm still getting tripped up by Future[IOResult] when I try to read a file into a stream of record objects, one per line. What I want instead is a Future[List[LineRecordCaseClass]].
val source = FileIO.fromPath(Paths.get("/tmp/junk_data.csv"))
val flow = makeFlow() // Framing.delimiter->split(",")->map to LineRecordCaseClass
val sink = Sink.collection[LineRecordCaseClass, List[LineRecordCaseClass]]
val graph = source.via(flow).to(sink)
val typeMismatchError: Future[List[LineRecordCaseClass]] = graph.run()
Why does graph.run() return a Future[IOResult] instead? Perhaps I'm missing a Keep.left somewhere, or something similar? If so, what and where?
There is some concept I'm missing.
Here are the types of your vals:
val source: Source[ByteString, Future[IOResult]] =
val flow: Flow[ByteString, LineRecordCaseClass, NotUsed] =
val sink: Sink[LineRecordCaseClass, Future[List[LineRecordCaseClass]]] =
From the akka-stream docs, in a comment in one of the code snippets:
By default, the materialized value of the leftmost stage is preserved
The materialized value of your leftmost stage (the source) is Future[IOResult].
In source.via(flow).to(sink), if you look at the implementation of .to, it calls .toMat with a default of Keep.left.
With Keep.both, the type is:
val check: RunnableGraph[(Future[IOResult], Future[List[LineRecordCaseClass]])] = source.via(flow).toMat(sink)(Keep.both)
So if you want Future[List[LineRecordCaseClass]], you can do
source.via(flow).toMat(sink)(Keep.right)
I recommend this video, which explains materialized values.
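An end-to-end sketch of the Keep.right variant, assuming Akka 2.6 or later (where an implicit ActorSystem is enough to run a graph) and akka-stream on the classpath. To stay self-contained it uses an in-memory Source of lines instead of FileIO, and LineRecord, KeepRightSketch and collect are names made up for this illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Future

object KeepRightSketch {
  // Hypothetical record type standing in for LineRecordCaseClass.
  final case class LineRecord(fields: List[String])

  def collect(lines: List[String])(implicit system: ActorSystem): Future[Seq[LineRecord]] = {
    val source = Source(lines) // materializes NotUsed
    val flow = Flow[String].map(line => LineRecord(line.split(",").toList))
    val sink = Sink.seq[LineRecord] // materializes Future[Seq[LineRecord]]
    // Keep.right selects the sink's materialized value instead of the source's.
    source.via(flow).toMat(sink)(Keep.right).run()
  }
}
```

source.via(flow).runWith(sink) is a shorthand that also returns the sink's materialized value.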

What is the best way to execute "not transformation" actions on elements of a Dataset

New to Spark, I'm looking for a way to execute an action on every element of a Dataset with Spark Structured Streaming.
I know this is a special-purpose case. What I want is to iterate through all elements of the Dataset, perform an action on each, then continue working with the Dataset.
Example:
Given val df: Dataset[Person], I would like to be able to do something like:
def execute(df: Dataset[Person]): Dataset[Person] = {
  df.foreach((p: Person) => {
    someHttpClient.doRequest(httpPostRequest(p.asString)) // this is pseudo code / not compiling
  })
  df
}
Unfortunately, foreach is not available with Structured Streaming: I get the error "Queries with streaming sources must be executed with writeStream.start".
I tried to use map(), but then a "Task not serializable" error occurred, I think because the HTTP request, or the HTTP client, is not serializable.
I know Spark is mostly used for filtering and transforming, but is there a good way to handle this specific use case?
Thanks :)
val conf = new SparkConf().setMaster("local[*]").setAppName("Example")
val jssc = new JavaStreamingContext(conf, Durations.seconds(1)) // the second argument is the interval at which streaming data is divided into batches
Before concluding whether a solution exists or not, let's ask a few questions.
How does Spark Streaming work?
Spark Streaming receives live input data streams from an input source and divides the data into batches, which are then processed by the Spark engine; the final batch results are pushed to downstream applications.
How does batch execution start?
Spark evaluates all transformations applied to a DStream lazily. They run only when an output action is registered and the streaming context is started:
jssc.start() // Start the computation
jssc.awaitTermination() // Wait for the computation to terminate
Note: each batch of a DStream contains multiple partitions (it is like running a sequence of Spark batch jobs until the input source stops producing data).
So you can have custom logic like below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
  override def call(t: JavaRDD[Object]): Unit = {
    t.foreach(new VoidFunction[Object] {
      override def call(t: Object): Unit = {
        // pseudo code: someHttpClient.doRequest(httpPostRequest(t.asString))
      }
    })
  }
})
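But again, make sure your someHttpClient is serializable, or create it on the executor, once per partition, as shown below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
  override def call(rdd: JavaRDD[Object]): Unit = {
    rdd.foreachPartition(new VoidFunction[java.util.Iterator[Object]] {
      override def call(records: java.util.Iterator[Object]): Unit = {
        // create someHttpClient here: this body runs on the executor, so the client is never serialized
        while (records.hasNext) {
          val record = records.next()
          // pseudo code: someHttpClient.doRequest(httpPostRequest(record.toString))
        }
      }
    })
  }
})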
For Spark Structured Streaming, calling foreach directly on a streaming Dataset raises the same "must be executed with writeStream.start" error, so attach the per-row logic through writeStream.foreach with a ForeachWriter:
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
val spark = SparkSession
  .builder()
  .appName("example")
  .getOrCreate()
// example source, copied from the Spark streaming docs
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val query = lines.writeStream.foreach(new ForeachWriter[Row] {
  // open runs on the executor for each partition: create someHttpClient here to avoid serialization errors
  override def open(partitionId: Long, epochId: Long): Boolean = true
  override def process(row: Row): Unit = {
    // pseudo code: someHttpClient.doRequest(httpPostRequest(row.toString))
  }
  override def close(errorOrNull: Throwable): Unit = {
    // release someHttpClient here
  }
}).start() // start running the query
query.awaitTermination()

Making HTTP post requests on Spark using foreachPartition

I need some help understanding the behaviour of the code below in Spark (using Scala and Databricks).
I have a dataframe (read from S3, if that matters) and want to send its data via HTTP POST requests in batches of at most 1000. So I repartitioned the dataframe to make sure each partition has no more than 1000 records. I also created a json column for each row (so I only need to put them in an array later on).
The trouble is with making the requests. I created a Serializable object using the following code:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.HttpHeaders
import org.apache.http.entity.StringEntity
import org.apache.commons.io.IOUtils
object postObject extends Serializable {
  val client = HttpClientBuilder.create().build()
  val post = new HttpPost("https://my-cool-api-endpoint")
  post.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
  def makeHttpCall(row: Iterator[Row]) = {
    val json_str = """{"people": [""" + row.toSeq.map(x => x.getAs[String]("json")).mkString(",") + "]}"
    post.setEntity(new StringEntity(json_str))
    val response = client.execute(post)
    val entity = response.getEntity()
    println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
    println(IOUtils.toString(entity.getContent()))
  }
}
Now when I try the following:
postObject.makeHttpCall(data.head(2).toIterator)
It works like a charm. The requests go through, there is some output on the screen, and my API gets the data.
But when I try to put it in foreachPartition:
data.foreachPartition { x =>
  postObject.makeHttpCall(x)
}
Nothing happens. No output on screen, and nothing arrives at my API. If I try to rerun it, almost all stages are just skipped. I believe that, for some reason, it is lazily evaluating my requests but not actually performing them. I don't understand why, or how to force it.
postObject has two fields, client and post, which have to be serialized.
I'm not sure that client is serialized properly. The post object is potentially mutated from several partitions (on the same worker). So many things could go wrong here.
I propose removing postObject and inlining its body directly into foreachPartition.
Addition:
I tried to run it myself:
sc.parallelize((1 to 10).toList).foreachPartition(row => {
  val client = HttpClientBuilder.create().build()
  val post = new HttpPost("https://google.com")
  post.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
  val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
  post.setEntity(new StringEntity(json_str))
  val response = client.execute(post)
  val entity = response.getEntity()
  println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
  println(IOUtils.toString(entity.getContent()))
})
I ran it both locally and on a cluster.
It completes successfully and prints 405 errors to the worker logs.
So the requests definitely hit the server.
foreachPartition returns nothing as its result. To debug your issue, you can change it to mapPartitions:
val responseCodes = sc.parallelize((1 to 10).toList).mapPartitions(row => {
  val client = HttpClientBuilder.create().build()
  val post = new HttpPost("https://google.com")
  post.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
  val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
  post.setEntity(new StringEntity(json_str))
  val response = client.execute(post)
  val entity = response.getEntity()
  println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
  println(IOUtils.toString(entity.getContent()))
  Iterator.single(response.getStatusLine.getStatusCode)
}).collect()
println(responseCodes.mkString(", "))
This code returns the list of response codes so you can analyze it.
For me it prints 405, 405 as expected.
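On the "at most 1000 per request" requirement from the question: if partition sizes cannot be guaranteed, one option is to cap the batch size inside the partition instead of relying on repartitioning alone. A minimal sketch, assuming each row has already been turned into a JSON string; BatchPayloads and payloads are names made up for this illustration.

```scala
object BatchPayloads {
  // Groups already-serialized JSON strings into request bodies holding
  // at most batchSize elements each, in the {"people": [...]} shape
  // used in the question.
  def payloads(jsonRows: Iterator[String], batchSize: Int): Iterator[String] =
    jsonRows.grouped(batchSize).map(_.mkString("""{"people": [""", ",", "]}"))
}
```

Inside foreachPartition you would map the rows to their json column, call BatchPayloads.payloads(rows, 1000), and post each resulting string. An empty partition yields no payloads, so no request is sent for it.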
There is a way to do this without having to find out what exactly is not serializable. If you want to keep the structure of your code, you can make all fields @transient lazy val. Also, any call with side effects should be wrapped in a block. For example:
@transient lazy val post = {
  val httpPost = new HttpPost("https://my-cool-api-endpoint")
  httpPost.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
  httpPost
}
That will delay the initialization of all fields until they are used by the workers. Each worker will have its own instance of the object, and you will be able to invoke the makeHttpCall method.
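To make the deferred initialization visible, here is a small sketch with no Spark or HTTP dependency. FakeClient is a hypothetical stand-in for a non-serializable resource such as the HTTP client, and Poster, clientsBuilt and target are names made up for this illustration.

```scala
object LazyInitSketch {
  // Hypothetical stand-in for a resource that cannot be serialized.
  final class FakeClient(val endpoint: String)

  class Poster(endpoint: String) extends Serializable {
    // Counter that exists only to make the laziness observable.
    var clientsBuilt = 0

    // Not built in the constructor and not written out on serialization;
    // built on first access, then reused.
    @transient lazy val client: FakeClient = {
      clientsBuilt += 1
      new FakeClient(endpoint)
    }

    def target: String = client.endpoint
  }
}
```

Constructing a Poster builds nothing; the first call to target creates the client, and later calls reuse it.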

How to extract records from DStream and write into Cassandra (Spark Streaming)

I am fetching data from Kafka, processing it in Spark Streaming, and writing the data into Cassandra.
I am trying to filter the DStream records, but the filter has no effect and the complete set of records is written to Cassandra.
Any suggestion with sample code for filtering records on multiple columns would be highly appreciated. I have researched this but have not been able to find a solution.
class SparkKafkaConsumer1(val recordStream : org.apache.spark.streaming.dstream.DStream[String], val streaming : StreamingContext) {
  val internationalAddress = recordStream.map(line => line.split("\\|")(10).toUpperCase)
  def timeToStr(epochMillis: Long): String =
    DateTimeFormat.forPattern("YYYYMMddHHmmss").print(epochMillis)
  if(internationalAddress =="INDIA")
  {
    print("-----------------------------------------------")
    recordStream.print()
    val riskScore = "1"
    val timestamp: Long = System.currentTimeMillis
    val formatedTimeStamp = timeToStr(timestamp)
    var wc1 = recordStream.map(_.split("\\|")).map(r=>Row(r(0),r(1),r(2),r(3),r(4).toInt,r(5).toInt,r(6).toInt,r(7),r(8),r(9),r(10),r(11),r(12),r(13),r(14),r(15),r(16),riskScore.toInt,0,0,0,formatedTimeStamp))
    implicit val rowWriter = SqlRowWriter.Factory
    wc1.saveToCassandra("fraud", "fraudrating", SomeColumns("purchasetimestamp","sessionid","productdetails","emailid","productprice","itemcount","totalprice","itemtype","luxaryitem","shippingaddress","country","bank","typeofcard","creditordebitcardnumber","contactdetails","multipleitem","ipaddress","consumer1score","consumer2score","consumer3score","consumer4score","recordedtimestamp"))
  }
}
(Note: I do have records with internationalAddress = INDIA in Kafka, and I am very new to Scala.)
I'm not really sure what you're trying to do, but note that internationalAddress is a DStream, so comparing it with a String does not filter any records. If you are simply trying to filter on records pertaining to India, apply filter to the stream itself:
implicit val rowWriter = SqlRowWriter.Factory
recordStream
  .filter(_.split("\\|")(10).toUpperCase == "INDIA")
  .map(_.split("\\|"))
  .map(r => Row(...))
  .saveToCassandra(...)
As a side note, I think case classes would be really helpful for you.
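A rough sketch of what the case class idea could look like, assuming the pipe-delimited layout from the question (17 fields, sessionid at index 1, totalprice at index 6, country at index 10). Purchase is a hypothetical subset of the columns, and PurchaseParsing, parse and isFrom are names made up for this illustration.

```scala
import scala.util.Try

object PurchaseParsing {
  // Hypothetical subset of the record; field positions follow the
  // Row(...) mapping in the question.
  final case class Purchase(sessionId: String, totalPrice: Int, country: String)

  // Returns None for lines that are too short or have a non-numeric price.
  def parse(line: String): Option[Purchase] = {
    val f = line.split("\\|", -1)
    if (f.length < 17) None
    else Try(Purchase(f(1), f(6).toInt, f(10))).toOption
  }

  // Case-insensitive country check, usable as a filter predicate.
  def isFrom(country: String)(p: Purchase): Boolean =
    p.country.equalsIgnoreCase(country)
}
```

On the stream this would read something like recordStream.flatMap(line => PurchaseParsing.parse(line).toList).filter(PurchaseParsing.isFrom("INDIA")), and filtering on several columns becomes a matter of combining predicates on named fields rather than array indices.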