Is it possible to write tweets in real time into a relational database (like MySQL or PostgreSQL) in Scala without using Spark? Instead, I would like to use the streaming module of Akka (Akka Streams).
I would really appreciate some resources and advice on this topic.
EDIT: I have connected to the Twitter API using Twitter4J to fetch some tweets and filter them. Now I am at the stage where I want to write the extracted tweets into the database instead of printing them to the console.
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.event.Logging
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Sink, Source}
import twitter4j.{Status, StatusAdapter}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

class Counter extends StatusAdapter {
  implicit val system = ActorSystem("EmojiTrends")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher
  implicit val loggingAdapter = Logging(system, classOf[Counter])

  val overflowStrategy = OverflowStrategy.backpressure
  val bufferSize = 1000

  // Queue-backed source that is fed from the Twitter4J callback below
  val statusSource = Source.queue[Status](bufferSize, overflowStrategy)
  val flow: Flow[Status, String, NotUsed] = Flow[Status].map(status => status.getText)
  val sink: Sink[Any, Future[Done]] = Sink.foreach(println)

  val graph = statusSource via flow to sink
  val queue = graph.run()

  override def onStatus(status: Status) =
    Await.result(queue.offer(status), Duration.Inf)
}
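One way to write the extracted text into a relational database is to swap the console sink for a sink that performs the insert. Below is a minimal sketch using plain JDBC; the PostgreSQL URL, credentials, and the tweets(text) table are placeholder assumptions. For production you would rather batch the writes or use a dedicated connector (for example Alpakka's Slick/JDBC module).

import java.sql.DriverManager

import akka.Done
import akka.stream.scaladsl.Sink

import scala.concurrent.Future

// Placeholder connection details -- adjust URL, user, password and table to your setup.
val connection = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/twitter", "user", "password")
val insertTweet = connection.prepareStatement("INSERT INTO tweets (text) VALUES (?)")

// Writes every tweet text flowing through the stream as one row.
val dbSink: Sink[String, Future[Done]] = Sink.foreach[String] { text =>
  insertTweet.setString(1, text)
  insertTweet.executeUpdate()
}

// Plug it in instead of the console sink:
// val graph = statusSource via flow to dbSink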
I would like to know about the unit testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how to test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming with Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and the memory sink for the output.
The following code should help you get started:
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use sessions event stream to apply required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
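The snippet above assumes an Event case class and an eventGen helper that are not shown. Purely as an illustration, they could look something like this (the names and fields are hypothetical):

import java.sql.Timestamp

import scala.concurrent.duration._

// Hypothetical event type and generator, only to make the snippet above self-contained.
case class Event(userId: Long, time: Timestamp)

object eventGen {
  def generate(userId: Long, offset: FiniteDuration): Event =
    Event(userId, new Timestamp(System.currentTimeMillis() + offset.toMillis))
}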
So, I tried to implement the answer from @Jacek, but I couldn't figure out how to create the eventGen object, and I also wanted to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}
object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")

    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)

    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()

    query
  }
}
I test only the writeData method. This is why I split the query into two methods.
Then here is the spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, as shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite
class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)

    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
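SharedSparkSession here is a small helper from the linked project. A minimal sketch of what such a trait might look like (the original may differ in details):

import org.apache.spark.sql.{SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Minimal sketch: create one local SparkSession per suite and stop it afterwards.
trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  lazy val sparkSession: SparkSession = SparkSession.builder()
    .appName("test")
    .master("local[2]")
    .getOrCreate()

  // Gives the spec access to `import sqlImplicits._`
  lazy val sqlImplicits: SQLImplicits = sparkSession.implicits

  override def afterAll(): Unit = {
    sparkSession.stop()
    super.afterAll()
  }
}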
I'm coding a small Akka Streams sample where I want to write elements of a List to a local TXT file
implicit val ec = context.dispatcher
implicit val actorSystem = context.system
implicit val materializer = ActorMaterializer()
val source = Source(List("a", "b", "c"))
  .map(char => ByteString(s"${char} \n"))
val runnableGraph = source.toMat(FileIO.toPath(Paths.get("~/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
The file does get created at the location I set in the code.
I do not terminate the actor system, so it definitely has enough time to write all of the list's elements to the file.
But unfortunately, nothing happens.
Use the expanded path to your home directory instead of the tilde (~). For example:
val runnableGraph =
  source.toMat(
    FileIO.toPath(Paths.get("/home/YourUserName/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
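If you prefer not to hard-code the user name, one option (a sketch, reusing the same source and FileIO setup as above) is to resolve the home directory from the user.home system property:

import java.nio.file.Paths

// Resolve the home directory at runtime instead of relying on shell expansion of "~".
val resultsPath = Paths.get(System.getProperty("user.home"), "Downloads", "results.txt")

val runnableGraph =
  source.toMat(FileIO.toPath(resultsPath))(Keep.right)
runnableGraph.run()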
I am trying to write a simple consumer of messages from Kafka using Akka Streams.
build.sbt
"com.typesafe.akka" %% "akka-stream-kafka" % "0.17"
My code
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerMessage, ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.{Flow, GraphDSL, RunnableGraph, Sink}
import akka.stream.{ActorMaterializer, ClosedShape}
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}

import scala.concurrent.Await
import scala.concurrent.duration.Duration

object AkkaStreamskafka extends App {
  implicit val system = ActorSystem()
  implicit val actorMaterializer = ActorMaterializer()
  import system.dispatcher

  // consumer settings
  val consumerSettings = ConsumerSettings(system, Some(new ByteArrayDeserializer), Some(new StringDeserializer))
    .withBootstrapServers("foo:9092")
    .withGroupId("abhi")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")

  val source = Consumer
    .committableSource(consumerSettings, Subscriptions.topics("my-topic"))

  // commit the offset, then pass the record value downstream
  val flow = Flow[ConsumerMessage.CommittableMessage[Array[Byte], String]].mapAsync(1) { msg =>
    msg.committableOffset.commitScaladsl().map(_ => msg.record.value)
  }

  val sink = Sink.foreach[String](println)

  val graph = RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder => s =>
    import GraphDSL.Implicits._
    source ~> flow ~> s.in
    ClosedShape
  })

  val future = graph.run()
  Await.result(future, Duration.Inf)
}
But I get an error
[WARN] [09/28/2017 13:12:52.333] [default-akka.kafka.default-dispatcher-7]
[akka://default/system/kafka-consumer-1] Consumer interrupted with WakeupException after timeout.
Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
Edit:
I can ssh to foo and then run ./kafka-console-consumer --zookeeper localhost:2181 --topic my-topic on the server terminal, and I can see data. So I guess my server name foo is correct and Kafka is up and running on that machine.
Edit2:
On the Kafka server I am running Cloudera 5.7.1. The Kafka version is jars/kafka_2.10-0.9.0-kafka-2.0.0.jar.
I was able to solve the problem myself.
The library "com.typesafe.akka" %% "akka-stream-kafka" only works for Kafka 0.10 and beyond. it does not work for earlier versions of Kafka. When I listed the kafka jars on my kafka server I found that I am using Cloudera 5.7.1 which comes with Kafka 0.9.
In order to create a Akka Streams source for this version. I needed to use
"com.softwaremill.reactivekafka" % "reactive-kafka-core_2.11" % "0.10.0"
They also have an example at https://github.com/kciesielski/reactive-kafka.
This code worked perfectly for me:
implicit val actorSystem = ActorSystem()
implicit val actorMaterializer = ActorMaterializer()
import actorSystem.dispatcher

val kafka = new ReactiveKafka()

val consumerProperties = ConsumerProperties(
  bootstrapServers = "foo:9092",
  topic = "my-topic",
  groupId = "abhi",
  valueDeserializer = new StringDeserializer()
)

val source = Source.fromPublisher(kafka.consume(consumerProperties))
val flow = Flow[ConsumerRecord[Array[Byte], String]].map(r => r.value())
val sink = Sink.foreach[String](println)

val graph = RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder => s =>
  import GraphDSL.Implicits._
  source ~> flow ~> s.in
  ClosedShape
})

val future = graph.run()
future.onComplete { _ =>
  actorSystem.terminate()
}
Await.result(actorSystem.whenTerminated, Duration.Inf)
I have data that arrives from Kafka through a DStream. I want to perform feature extraction in order to obtain some keywords.
I do not want to wait for all the data to arrive (as it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate

  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData

  streamWithFeatures.print()
}

def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)

  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I just tried to scrape things together, and I understand that since I am not waiting for some data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate

  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)

  idfModel.write.save(idfModelFile.getAbsolutePath)
}
one that reads the IDF model from the file and simply runs it on all incoming data
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
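To tie the pre-trained model back into the stream, the second step can be applied per micro-batch, for example inside transform. A sketch, assuming the stream carries (update, document) string pairs and reusing the column names from the snippet above:

import java.io.File

import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Sketch: apply the pre-trained IDF model to every micro-batch via `transform`.
def extractFeatures(stream: DStream[(String, String)], idfModelFile: File): DStream[Row] = {
  val spark = SparkSession.builder.getOrCreate()
  val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)

  stream transform { rdd =>
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
    val wordsDf = tokenizer.transform(documentDf)
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val featurizedDf = hashingTF.transform(wordsDf)
    val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
    extractor.transform(featurizedDf).select("update", "features").rdd
  }
}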
I want to do simple machine learning in Spark.
First, the application should do some learning from historical data in a file and train the machine learning model, and then read input from Kafka to give predictions in real time. To do that, I believe I should use Spark Streaming. However, I'm afraid that I don't really understand how Spark Streaming works.
The code looks like this:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("test App")
  val sc = new SparkContext(conf)

  val fromFile = parse(sc, Source.fromFile("my_data_.csv").getLines.toArray)
  ML.train(fromFile)

  real_time(sc)
}
Where ML is a class with some machine learning code in it; train gives it the data to train on, and there is also a method classify which calculates predictions based on what it has learned.
The first part seems to work fine, but real_time is a problem:
def real_time(sc: SparkContext): Unit = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(1))
  val topic = "my_topic".split(",").toSet
  val params = Map[String, String](("metadata.broker.list", "localhost:9092"))
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)

  var lin = dstream.map(_._2)
  val str_arr = new Array[String](0)
  lin.foreach {
    str_arr :+ _.collect()
  }

  val lines = parse(sc, str_arr).map(i => i.features)
  ML.classify(lines)

  ssc.start()
  ssc.awaitTermination()
}
What I would like it to do is check the Kafka stream and compute on any new lines that arrive. This doesn't seem to happen: I added some prints, and nothing is printed.
How does Spark Streaming work, and how should it be used in my case?
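A common pattern is to do the per-batch work inside foreachRDD instead of collecting the DStream into a driver-side array. Below is a sketch that reuses the question's parse and ML.classify helpers (their exact signatures are assumed) and reuses the existing SparkContext for the StreamingContext:

def real_time(sc: SparkContext): Unit = {
  // Reuse the existing SparkContext instead of creating a second one.
  val ssc = new StreamingContext(sc, Seconds(1))
  val topic = "my_topic".split(",").toSet
  val params = Map[String, String]("metadata.broker.list" -> "localhost:9092")
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)

  // Process every micro-batch as it arrives.
  dstream.map(_._2).foreachRDD { rdd =>
    val batch = rdd.collect() // assumes small micro-batches
    if (batch.nonEmpty) {
      val lines = parse(sc, batch).map(i => i.features)
      ML.classify(lines)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}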