Spark Structured Streaming for appending to text file using foreach - scala

I want to append lines to a text file using Structured Streaming. This code results in SparkException: Task not serializable. I think toDF is not allowed. How can I get this code to work?
df.writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }

    override def process(row: Row): Unit = {
      val df = Seq(row.getString(0)).toDF
      df.write.format("text").mode("append").save(output)
    }

    override def close(errorOrNull: Throwable): Unit = {
    }
  }).start

You cannot call df.write.format("text").mode("append").save(output) inside the process method; it will run on the executor side, where the DataFrame API is not available. You can use the file sink instead, such as
df.writeStream.format("text")....
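For reference, a fuller file-sink configuration might look like the sketch below; the output and checkpoint paths are hypothetical placeholders. Note that the file sink writes new part files into the output directory per micro-batch rather than appending lines to a single file, and the text format expects a single string column.
// Minimal sketch of the file sink; paths are hypothetical placeholders.
val query = df.writeStream
  .format("text")
  .option("path", "/tmp/stream-output")            // output directory
  .option("checkpointLocation", "/tmp/checkpoint") // required for file sinks
  .outputMode("append")                            // file sink only supports append
  .start()

query.awaitTermination()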

Related

Creating stream from api in Apache Flink

Firstly, I describe what I want to do. I have an API that takes a function as an argument (it looks like this: dataFromApi => { /* do sth */ }) and I would like to process this data with Flink. I wrote this code to simulate the API:
val myIterator = new TestIterator
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val th1 = new Thread {
  override def run(): Unit = {
    for (i <- 0 to 10) {
      Thread sleep 1000
      myIterator.addToQueue("test" + i)
    }
  }
}
th1.start()

val texts: DataStream[String] = env
  .fromCollection(new TestIterator)

texts.print()
This is my iterator:
class TestIterator extends Iterator[String] with Serializable {
  private val q: BlockingQueue[String] = new LinkedBlockingQueue[String]

  def addToQueue(s: String): Unit = {
    println("Put")
    q.put(s)
  }

  override def hasNext: Boolean = true

  override def next(): String = {
    println("Wait for queue")
    q.take()
  }
}
My idea was to execute myIterator.addToQueue(dataFromApi) when I receive data, but this code doesn't work. Despite adding to the queue, execution blocks on q.take(). I tried writing my own SourceFunction based on the queue idea, and I also tried this: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/ but I can't achieve what I want.
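For what it's worth, a rough sketch of a queue-backed SourceFunction is shown below. The key point is that Flink serializes the source and runs it on a task manager, so the queue has to be filled from inside run() (for example by registering the API callback there); a queue filled from a thread in main() on the client is a different object. The api.onData callback is a hypothetical stand-in for the real API.
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Rough sketch: a source that emits whatever the (hypothetical) API pushes into its queue.
class ApiSource extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val queue = new LinkedBlockingQueue[String]()
    // Hypothetical: register the API callback here, on the task manager,
    // e.g. api.onData(dataFromApi => queue.put(dataFromApi))
    while (running) {
      val element = queue.poll(100, TimeUnit.MILLISECONDS)
      if (element != null) ctx.collect(element)
    }
  }

  override def cancel(): Unit = { running = false }
}

// Usage: val texts: DataStream[String] = env.addSource(new ApiSource)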

Spark connection pooling - Is this the right approach

I have a Spark job in Structured Streaming that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:
object InfluxConnectionPool {
  val queue = new LinkedBlockingQueue[InfluxDB]()

  def initialize(database: String): Unit = {
    while (!isConnectionPoolFull) {
      queue.put(createNewConnection(database))
    }
  }

  private def isConnectionPoolFull: Boolean = {
    val MAX_POOL_SIZE = 1000
    if (queue.size < MAX_POOL_SIZE)
      false
    else
      true
  }

  def getConnectionFromPool: InfluxDB = {
    if (queue.size > 0) {
      val connection = queue.take()
      connection
    } else {
      System.err.println("InfluxDB connection limit reached. ")
      null
    }
  }

  private def createNewConnection(database: String) = {
    val influxDBUrl = "..."
    val influxDB = InfluxDBFactory.connect(...)
    influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)
    influxDB.setDatabase(database)
    influxDB.setRetentionPolicy(database + "_rp")
    influxDB
  }

  def returnConnectionToPool(connection: InfluxDB): Unit = {
    queue.put(connection)
  }
}
In my Spark job, I do the following:
def run(): Unit = {
  val spark = SparkSession
    .builder
    .appName("ETL JOB")
    .master("local[4]")
    .getOrCreate()

  ...

  // This is where I create the connection pool
  InfluxConnectionPool.initialize("dbname")

  val sdvWriter = new ForeachWriter[record] {
    var influxDB: InfluxDB = _

    def open(partitionId: Long, version: Long): Boolean = {
      influxDB = InfluxConnectionPool.getConnectionFromPool
      true
    }

    def process(record: record) = {
      // this is where I use the connection object and save the data
      MyService.saveData(influxDB, record.topic, record.value)
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }

    def close(errorOrNull: Throwable): Unit = {
    }
  }

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Read data from Kafka
  val kafkaStreamingDF = spark
    .readStream
    ....

  val sdvQuery = kafkaStreamingDF
    .writeStream
    .foreach(sdvWriter)
    .start()
}
But when I run the job, I get the following exception:
18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)
at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)
at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
The NPE occurs when the connection is returned to the connection pool in queue.put(connection). What am I missing here? Any help appreciated.
P.S.: In the regular DStreams approach, I did it with the foreachPartition method. I am not sure how to do connection reuse/pooling with Structured Streaming.
I am using the ForeachWriter for Redis similarly, where the pool is referenced only in process. Your writer would look something like this:
def open(partitionId: Long, version: Long): Boolean = {
  true
}

def process(record: record) = {
  influxDB = InfluxConnectionPool.getConnectionFromPool
  // this is where I use the connection object and save the data
  MyService.saveData(influxDB, record.topic, record.value)
  InfluxConnectionPool.returnConnectionToPool(influxDB)
}
datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
    true
  }

  def process(record: String) = {
    // write string to connection
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
From the docs of ForeachWriter,
Each task will get a fresh serialized-deserialized copy of the provided object
So whatever you initialize outside the ForeachWriter will run only at the driver.
You need to initialize the connection pool and open the connection in the open method.
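Putting that together with the question's code, a sketch along these lines (reusing the question's InfluxConnectionPool, record and MyService names; the idempotent-initialize guard is an assumption) keeps all pool access on the executor side:
val sdvWriter = new ForeachWriter[record] {
  var influxDB: InfluxDB = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // Runs on the executor: make sure the pool exists in this JVM before using it.
    // Assumes initialize is made idempotent (e.g. skips if the queue is already filled).
    InfluxConnectionPool.initialize("dbname")
    influxDB = InfluxConnectionPool.getConnectionFromPool
    influxDB != null
  }

  override def process(record: record): Unit = {
    MyService.saveData(influxDB, record.topic, record.value)
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Return the connection once per task, not once per row.
    if (influxDB != null) InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
}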

Spark Structured Streaming, multiple queries are not running concurrently

I slightly modified the example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
I added a second writeStream (sink):
case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }

  override def close(errorOrNull: Throwable): Unit = true
}

case class MyWriter2() extends ForeachWriter[(String, Int)] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: (String, Int)): Unit = {
    println(s"custom2 - $value")
  }

  override def close(errorOrNull: Throwable): Unit = true
}

object Main extends Serializable {
  def main(args: Array[String]): Unit = {
    println("starting")

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("app-test")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query1 = wordCounts.writeStream
      .outputMode("update")
      .foreach(MyWriter1())
      .start()

    val ds = wordCounts.map(x => (x.getAs[String]("value"), x.getAs[Int]("count")))

    val query2 = ds.writeStream
      .outputMode("update")
      .foreach(MyWriter2())
      .start()

    spark.streams.awaitAnyTermination()
  }
}
Unfortunately, only the first query runs; the second never runs (MyWriter2 is never called).
Please advise what I'm doing wrong. According to the docs: You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources.
I had the same situation (but on the newer Structured Streaming API) and in my case it helped to call awaitTermination() on the last streaming query.
Something like:
query1.start()
query2.start().awaitTermination()
Update:
Instead of the above, this built-in method is better:
sparkSession.streams.awaitAnyTermination()
Are you using nc -lk 9999 to send data to Spark? Every query creates a connection to nc, but nc can only send data to the first connection (query). You can write a TCP server instead of nc.
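For example, a minimal broadcast server replacing nc could look like the sketch below (MultiClientServer is a hypothetical name); it sends every line to all connected clients, so both socket-source queries receive the data.
import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ArrayBuffer

// Hypothetical nc replacement: broadcasts a line per second to every connected client.
object MultiClientServer extends App {
  val clients = ArrayBuffer.empty[PrintWriter]
  val server = new ServerSocket(9999)

  // Accept connections in the background and remember each client's writer.
  new Thread(() => {
    while (true) {
      val socket = server.accept()
      clients.synchronized { clients += new PrintWriter(socket.getOutputStream, true) }
    }
  }).start()

  // Broadcast test data to all connected clients.
  var i = 0
  while (true) {
    Thread.sleep(1000)
    clients.synchronized { clients.foreach(_.println(s"hello world $i")) }
    i += 1
  }
}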
You are using .awaitAnyTermination(), which will terminate the application when the first stream returns; you have to wait for both streams to finish before you terminate.
Something like this should do the trick:
query1.awaitTermination()
query2.awaitTermination()
What you have done is right! Just go ahead and check the scheduler your Spark framework is using. Most probably it is using the FIFO scheduler, which means the first query takes up all the resources. Just change it to the FAIR scheduler and you should be good.
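For reference, the scheduler mode can be switched when building the session (or via spark-submit with --conf spark.scheduler.mode=FAIR); a minimal sketch:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("app-test")
  .config("spark.scheduler.mode", "FAIR") // default is FIFO
  .getOrCreate()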

Scala error: org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class scala.Some

I am trying to get the count of a mongo query result, but I am getting the error
org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class scala.Some. Can somebody help?
This is my code:
def fetchData() = {
  val mongoClient = MongoClient("mongodb://127.0.0.1")
  val database = mongoClient.getDatabase("assignment")
  val movieCollection = database.getCollection("movies")
  val ratingCollection = database.getCollection("ratings")
  val latch1 = new CountDownLatch(1)

  movieCollection.find().subscribe(new Observer[Document] {
    override def onError(e: Throwable): Unit = {
      println("Error while fetching data")
      e.printStackTrace()
    }

    override def onComplete(): Unit = {
      latch1.countDown()
      println("Completed fetching data")
    }

    override def onNext(movie: Document): Unit = {
      if (movie.get("movieId") != null) {
        ratingCollection.count(equal("movieId", movie.get("movieId"))).subscribe(new Observer[Long] {
          override def onError(e: Throwable): Unit = println(s"onError: $e")
          override def onNext(result: Long): Unit = { println(s"In count result : $result") }
          override def onComplete(): Unit = println("onComplete")
        })
      }
    }
  })

  latch1.await()
  mongoClient.close()
}
I am using mongo 3.2.12 and the Scala driver:
<dependency>
    <groupId>org.mongodb.scala</groupId>
    <artifactId>mongo-scala-driver_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
Use the code in this answer, and then add that codec to your codec registry. First, add
import org.bson.codecs.configuration.CodecRegistries.fromCodecs
You might already have other imports from that package; for example, if you're using providers, registries and codecs together:
import org.bson.codecs.configuration.CodecRegistries.{fromRegistries, fromProviders, fromCodecs}
Just make sure you have everything you need imported.
Then:
val codecRegistry = fromRegistries(/* ..., */ fromCodecs(new SomeCodec()), DEFAULT_CODEC_REGISTRY)
val mongoClient = MongoClient("mongodb://127.0.0.1")
val database = mongoClient.getDatabase("assignment").withCodecRegistry(codecRegistry)
This answer is a little bit old; after losing many hours solving the same issue, I am writing an update to it.
Using Macros, it's much easier now:
import org.bson.codecs.configuration.CodecRegistries.{fromProviders, fromRegistries}
import org.bson.codecs.configuration.{CodecProvider, CodecRegistry}
import org.mongodb.scala.MongoCollection
import org.mongodb.scala.bson.codecs._

val movieCodecProvider: CodecProvider = Macros.createCodecProviderIgnoreNone[Movie]()
val codecRegistry: CodecRegistry = fromRegistries(fromProviders(movieCodecProvider), DEFAULT_CODEC_REGISTRY)
val movieCollection: MongoCollection[Movie] = mongo.database.withCodecRegistry(codecRegistry).getCollection("movie_collection")
Pay attention when you write a "manual" query (i.e. a query in which you are not parsing an entire Movie object, like an update): you have to handle the Option field like a plain object.
So to set it to None you do:
movieCollection.updateOne(
  equal("_id", movie._id),
  unset("foo")
)
To set it to Some:
movieCollection.updateOne(
  equal("_id", movie._id),
  set("foo", "some_value")
)
Please make sure all fields are transformed into Strings, especially enums, where you want the field to be inserted as <your-enum>.map(_.toString).
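For instance, a small sketch of that idea (Color and Movie are hypothetical names used only for illustration):
// Hypothetical example: store the optional enum as Option[String] in the document class.
import org.mongodb.scala.bson.ObjectId

object Color extends Enumeration { val Red, Blue = Value }

case class Movie(_id: ObjectId, title: String, color: Option[String])

val maybeColor: Option[Color.Value] = Some(Color.Red)
val movie = Movie(new ObjectId(), "Example", maybeColor.map(_.toString))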
The code that causes the exception is this:
ratingCollection.count(equal("movieId", movie.get("movieId")))
Specifically movie.get(...), which has return type Option[BsonValue]. You cannot query collections with Option[T] values. Since you already checked against null, you could change the code to movie.get("movieId").get, but the Scala approach would be to utilize pattern matching, something akin to this:
override def onNext(movie: Document): Unit = {
  movie.get("movieId") match {
    case Some(movieId: BsonInt32) =>
      ratingCollection.count(equal("movieId", movieId)).subscribe(new Observer[Long] {
        override def onError(e: Throwable): Unit = println(s"onError: $e")
        override def onNext(result: Long): Unit = { println(s"In count result : $result") }
        override def onComplete(): Unit = println("onComplete")
      })
    case invalidId =>
      println(s"invalid id ${invalidId}")
  }
}
The underlying issue is how the mongo scala driver handles the Option[T] monad; it's not well documented. One of the answers already provided to this question shows how to solve the issue when querying case classes like Foo(bar: Option[BsonValue]), but be aware that it fails for other case classes such as Foo(bar: Seq[Option[BsonValue]]).
As mentioned in the answer I refer to, the createCodecProviderIgnoreNone and related codec providers only apply to full-document queries, like insert, findReplace etc. When doing field-operation queries you have to unpack the Option yourself. I prefer to do this using pattern matching, as shown in my example.
This works for me using the versions below:
scalaVersion := "2.13.1"
sbt.version = 1.3.8
import org.mongodb.scala.bson.ObjectId

object Person {
  def apply(firstName: String, lastName: String): Person =
    Person(new ObjectId(), firstName, lastName)
}

case class Person(_id: ObjectId, firstName: String, lastName: String)

import models.Person
import org.mongodb.scala.{Completed, MongoClient, MongoCollection, MongoDatabase, Observer}
import org.mongodb.scala.bson.codecs.Macros._
import org.mongodb.scala.bson.codecs.DEFAULT_CODEC_REGISTRY
import org.bson.codecs.configuration.CodecRegistries.{fromRegistries, fromProviders}

object PersonMain extends App {
  val codecRegistry = fromRegistries(fromProviders(classOf[Person]), DEFAULT_CODEC_REGISTRY)
  val mongoClient: MongoClient = MongoClient("mongodb://localhost")
  val database: MongoDatabase = mongoClient.getDatabase("mydb").withCodecRegistry(codecRegistry)
  val collection: MongoCollection[Person] = database.getCollection("people")

  def addDocument(doc: Person) = {
    collection.insertOne(doc)
      .subscribe(new Observer[Completed] {
        override def onNext(result: Completed): Unit = println(s"Inserted $doc")
        override def onError(e: Throwable): Unit = println(s"Failed $e")
        override def onComplete(): Unit = println(s"Completed inserting $doc")
      })
  }

  addDocument(Person("name", "surname"))
  mongoClient.close()
}

Not able to see RDD contents

I am using Scala to create an RDD, but when I try to see the contents of the RDD I get the result below:
MapPartitionsRDD[25] at map at <console>:96
I want to see the contents of the RDD. How can I do that?
Below is my Scala code:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    println(word)
  }
}
You need to apply an action to materialize the output, e.g. RDD.collect:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    word.collect().foreach(println)
  }
}
If you have an Array[Array[T]], you'll need to flatten before using foreach:
word.collect().flatten.foreach(println)