I slightly modified example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
I added a second writeStream (sink):
case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }

  override def close(errorOrNull: Throwable): Unit = true
}

case class MyWriter2() extends ForeachWriter[(String, Int)] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: (String, Int)): Unit = {
    println(s"custom2 - $value")
  }

  override def close(errorOrNull: Throwable): Unit = true
}
object Main extends Serializable {
  def main(args: Array[String]): Unit = {
    println("starting")
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("app-test")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query1 = wordCounts.writeStream
      .outputMode("update")
      .foreach(MyWriter1())
      .start()

    val ds = wordCounts.map(x => (x.getAs[String]("value"), x.getAs[Int]("count")))

    val query2 = ds.writeStream
      .outputMode("update")
      .foreach(MyWriter2())
      .start()

    spark.streams.awaitAnyTermination()
  }
}
Unfortunately, only the first query runs; the second never runs (MyWriter2 is never called).
Please advise what I'm doing wrong. According to the docs: "You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources."
I had the same situation (but on the newer Structured Streaming API), and in my case it helped to call awaitTermination() on the last streaming query.
Something like:
query1.start()
query2.start().awaitTermination()
Update:
Instead of the above, this built-in method is better:
sparkSession.streams.awaitAnyTermination()
Are you using nc -lk 9999 to send data to Spark? Each query creates its own connection to nc, but nc can only send data to the first connection (query). You can write a small TCP server instead of nc; a sketch follows below.
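As a hedged illustration of that suggestion (the class name and exact behaviour are my own, not from the original answer), a minimal line-broadcasting TCP server could look roughly like this: it accepts any number of client connections and writes every line typed on stdin to all of them, so each Spark socket-source query receives the same data.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ListBuffer
import scala.io.StdIn

// Minimal multi-client broadcaster: every connected client (e.g. each Spark
// socket-source query) receives every line typed on stdin.
object BroadcastServer extends App {
  val server  = new ServerSocket(9999)
  val clients = ListBuffer.empty[PrintWriter]

  // Accept connections on a background thread and remember each client's writer.
  new Thread(new Runnable {
    override def run(): Unit = while (true) {
      val socket = server.accept()
      clients.synchronized {
        clients += new PrintWriter(socket.getOutputStream, true) // autoflush on println
      }
    }
  }).start()

  // Read lines from stdin and broadcast each one to all connected clients.
  Iterator.continually(StdIn.readLine()).takeWhile(_ != null).foreach { line =>
    clients.synchronized(clients.foreach(_.println(line)))
  }
}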
You are using .awaitAnyTermination(), which returns as soon as the first stream terminates; you have to wait for both of the streams to finish before you terminate.
Something like this should do the trick:
query1.awaitTermination()
query2.awaitTermination()
What you have done is right! Just go ahead and check which scheduler your Spark framework is using. Most probably it is the FIFO scheduler, which means the first query takes up all the resources. Just change it to the FAIR scheduler and you should be good; a sketch follows below.
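A minimal sketch of that suggestion, assuming the allocation file and pool name are your own choices (they are not part of the original answer):

import org.apache.spark.sql.SparkSession

// Ask Spark to use the FAIR scheduler instead of the default FIFO one.
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("app-test")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "fairscheduler.xml") // optional pool definitions
  .getOrCreate()

// Optionally run the streaming queries in a dedicated pool (hypothetical pool name).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "streaming-pool")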
I have multiple files that are independent and need to be processed by Spark. How could I load them into separate RDDs in parallel? Thanks!
The coding language is Scala.
If you want concurrent reading/processing of RDDs, you can leverage scala.concurrent.Future (or effect systems such as ZIO or Cats Effect).
Sample code for the loading function is below:
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def load(paths: Seq[String], spark: SparkSession)
        (implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
  def loadSinglePath(path: String): Future[RDD[String]] = Future {
    spark.sparkContext.textFile(path)
  }
  paths map loadSinglePath
}
Sample code for using this function:
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.DurationInt

val spark = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec: ExecutionContext = ExecutionContext.global

val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), spark).zipWithIndex
  .map { case (rddFuture, idx) =>
    rddFuture.map(rdd =>
      println(s"Rdd with index $idx has ${rdd.count()} lines")
    )
  }

Await.result(Future.sequence(result), 1.hour)
For example purposes, the default global ExecutionContext is used, but you can run the code on a custom one (just replace the implicit val ec with your own ExecutionContext), as sketched below.
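For instance, a minimal sketch of a custom ExecutionContext backed by a fixed thread pool (the pool size is an arbitrary choice of mine):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A dedicated thread pool for the loading Futures instead of the global context.
val loaderPool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(loaderPool)

// ... call load(...) as before, then shut the pool down when done:
// loaderPool.shutdown()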
I am new to Scala and I have some questions about how it works.
I want to do the following: given a list of values, I want to construct some imitation of a dictionary in parallel, something like this: (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)). I know that if we deal with parallelized collections we should use accumulators. So here is my attempt:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ListBuffer
class DictAccumulatorV2 extends AccumulatorV2[Int, ListBuffer[(Int, Int)]] {
  private var dict: ListBuffer[(Int, Int)] = new ListBuffer[(Int, Int)]

  def reset(): Unit = {
    dict.clear()
  }

  def add(v: Int): Unit = {
    dict.append((v, v))
  }

  def value(): ListBuffer[(Int, Int)] = {
    return dict
  }

  def isZero(): Boolean = {
    return dict.isEmpty
  }

  def copy(): AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
    // I do not understand how to code it correctly
    return new DictAccumulatorV2
  }

  def merge(other: AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
    // I do not understand how to code it correctly without reinitializing dict from val to var
    dict = dict ++ other.value
  }
}

object FirstSparkApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local")
    val sc = new SparkContext(conf)

    val accum = new DictAccumulatorV2()
    sc.register(accum, "mydictacc")

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    var res = distData.map(x => accum.add(x))
    res.count()
    println(accum)
  }
}
So I wonder whether I am doing it right or there are any mistakes.
In general I also have questions about how sc.parallelize works. Does it actually parallelize the job on my machine, or is it just a fictional line of code? What should I put instead of "local" in setMaster? How can I see on which nodes the task is being performed? Is the task performed on all of the nodes at the same time, or is there some sequence?
(1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4) )
You can do this in Scala by doing
val list = List(1,2,3,4)
val dict = list.map(i => (i,i))
Spark accumulators are used as a means of communication from the Spark executors to the driver.
If you want to do the above in parallel, then you would construct an RDD out of this list and apply a map transformation to it, as shown above.
In the Spark shell it would look like this:
val list = List(1,2,3,4)
val listRDD = sc.parallelize(list)
val dictRDD = listRDD.map(i => (i,i))
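If you specifically want an accumulator rather than a map, a minimal sketch (my own illustration, not from the original answer) using Spark's built-in CollectionAccumulator avoids writing a custom AccumulatorV2:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Built-in accumulator that collects the elements added on the executors.
val dictAcc = sc.collectionAccumulator[(Int, Int)]("mydictacc")

sc.parallelize(Seq(1, 2, 3, 4))
  .foreach(i => dictAcc.add((i, i)))   // use an action (foreach), not a lazy map

println(dictAcc.value)                 // e.g. [(1,1), (2,2), (3,3), (4,4)] in some order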
how sc.parallelize works
It creates a distributed dataset (an RDD in Spark terms) from the collection that you pass in to the function. More information is in the Spark documentation.
It does parallelize your job.
If you are submitting your Spark job to a cluster, then you should be able to see a YARN application ID or URL after running the spark-submit command. You can visit the YARN application URL and see how many executors are processing that distributed dataset and in what sequence the tasks are performed.
What should I put instead of "local" in setMaster
From the Spark documentation -
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
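As a concrete illustration of those options (the host name is a placeholder):

import org.apache.spark.SparkConf

new SparkConf().setMaster("local")                // run locally with one thread
new SparkConf().setMaster("local[4]")             // run locally with 4 cores
new SparkConf().setMaster("spark://master:7077")  // run on a Spark standalone cluster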
I'm developing a client-server application using Akka Http and Akka Streams.
The main idea is that the server must feed the HTTP response with a Source from Akka Streams.
The problem is that the server accumulates some elements before sending the first message to the client, whereas I need it to send each element as soon as the source produces it.
Code example:
case class Test(id: Long, txt: String, number: Double)

object MyJsonProtocol extends SprayJsonSupport with DefaultJsonProtocol {
  implicit val exampleFormat = jsonFormat3(Test)
}

class BatchIterator(batchSize: Int, numberOfBatches: Int, pause: FiniteDuration) extends Iterator[Array[Test]] {
  val range = Range(0, batchSize * numberOfBatches).toIterator
  val numberOfBatchesIter = Range(0, numberOfBatches).toIterator

  override def hasNext: Boolean = range.hasNext

  override def next(): Array[Test] = {
    println(s"Sleeping for ${pause.toMillis} ms")
    Thread.sleep(pause.toMillis)
    println(s"Taking $batchSize elements")
    Range(0, batchSize).map { _ =>
      val count = range.next()
      Test(count, s"Text$count", count * 0.5)
    }.toArray
  }
}

object Server extends App {
  import MyJsonProtocol._

  implicit val jsonStreamingSupport: JsonEntityStreamingSupport = EntityStreamingSupport.json()
    .withFramingRenderer(
      Flow[ByteString].intersperse(ByteString(System.lineSeparator))
    )

  implicit val system = ActorSystem("api")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher

  def fetchExamples(): Source[Array[Test], NotUsed] =
    Source.fromIterator(() => new BatchIterator(5, 5, 2 seconds))

  val route =
    path("example") {
      complete(fetchExamples)
    }

  val bindingFuture = Http().bindAndHandle(route, "localhost", 9090)
  println("Server started at localhost:9090")

  StdIn.readLine()
  bindingFuture.flatMap(_.unbind()).onComplete(_ ⇒ system.terminate())
}
Then, if I execute:
curl --no-buffer localhost:9090/example
I get all the elements at the same time instead of receiving an element every 2 seconds.
Any idea how I can "force" the server to send every element as soon as it comes out of the source?
Finally, I've found the solution. The problem was that the source is synchronous and Akka Streams fuses the stages by default, so everything ran together as one synchronous unit; the fix is simply to add an asynchronous boundary to the source by calling .async:
complete(fetchExamples.async)
I want to append lines to a text file using Structured Streaming. This code results in SparkException: Task not serializable. I think toDF is not allowed here. How can I get this code to work?
df.writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }

    override def process(row: Row): Unit = {
      val df = Seq(row.getString(0)).toDF
      df.write.format("text").mode("append").save(output)
    }

    override def close(errorOrNull: Throwable): Unit = {
    }
  }).start
You cannot call df.write.format("text").mode("append").save(output) inside the process method, because it runs on the executor side. You can use the file sink instead, such as
df.writeStream.format("text")....
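A minimal sketch of the file sink, with placeholder paths of my own (the file sink needs append output mode, an output path and a checkpoint location, and the text format expects a single string column):

val query = df.writeStream
  .format("text")
  .outputMode("append")                                    // file sink supports append mode
  .option("path", "/tmp/output-text")                      // placeholder output directory
  .option("checkpointLocation", "/tmp/output-checkpoint")  // placeholder checkpoint directory
  .start()

query.awaitTermination()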
I am trying to read records from Kafka messages and put them into HBase. Though the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0), line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call it at the end of your blah() method; it looks like the Puts are currently being buffered but never executed, or executed at some random time. A sketch follows below.
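A minimal sketch of that change, keeping the rest of blah() as in the question (closing the table afterwards is my own addition):

object Blaher {
  def blah(row: Array[String]): Unit = {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    try {
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
      hTable.put(thePut)
      hTable.flushCommits()  // execute the buffered Put now
    } finally {
      hTable.close()         // release the table's resources
    }
  }
}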