What the best way to execute "not transformation" actions in elements of a Dataset - scala

Newly coming in spark, I'm looking for a way to execute actions in all elements of a Dataset with Spark structured streaming:
I know this is a specific purpose case, what I want is iterate through all elements of Dataset, do an action on it, then continue to work with Dataset.
Example:
I got val df = Dataset[Person], I would like to be able to do something like:
def execute(df: Dataset[Person]): Dataset[Person] = {
df.foreach((p: Person) => {
someHttpClient.doRequest(httpPostRequest(p.asString)) // this is pseudo code / not compiling
})
df
}
Unfortunately, foreach is not available with structured streaming since I got error "Queries with streaming sources must be executed with writeStream.start"
I tried to use map(), but then error "Task not serializable" occured, I think because http request, or http client, is not serializable.
I know Spark is mostly use for filter and transform, but is there a way to handle well this specific use case ?
Thanks :)

val conf = new SparkConf().setMaster(“local[*]").setAppName(“Example")
val jssc = new JavaStreamingContext(conf, Durations.seconds(1)) // second option tell about The time interval at which streaming data will be divided into batches
Before concluding on whether a solution exists or not
Let’s as few questions
How does Spark Streaming work?
Spark Streaming receives live input data streams from input source and divides the data into batches, which are then processed by the Spark engine and final batch results are pushed down to downstream applications
How Does the batch execution start?
Spark does lazy evaluations on all the transformation applied on Dstream.it will apply transformation on actions (i.e only when you start streaming context)
jssc.start(); // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate.
Note : Each Batch of Dstream contains multiple partitions ( it is just like running sequence of spark-batch job until input source stop producing data)
So you can have custom logic like below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
But again make sure your someHttpClient is serializable or
you can create that object As mentioned below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
// create someHttpClient object
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
Related to Spark Structured Streaming
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql._;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQuery
import java.util.Arrays;
import java.util.Iterator;
val spark = SparkSession
.builder()
.appName("example")
.getOrCreate();
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load(); // this is example source load copied from spark-streaming doc
lines.foreach(new ForeachFunction[Row] {
override def call(t: Row): Unit = {
//someHttpClient.doRequest(httpPostRequest(p.asString))
OR
// create someHttpClient object here and use it to tackle serialization errors
}
})
// Start running the query foreach and do mention downstream sink below/
val query = lines.writeStream.start
query.awaitTermination()

Related

Scala making parallel network calls using Futures

i'm new to Scala, i have a method, that reads data from the given list of files and does api calls with
the data, and writes the response to a file.
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
the above method, is taking too much time, during the network call, so tried to do using parallel
processing to reduce the time.
so i tried wrapping the block of code, which consumes more time, but the program ends quickly
and its not generating any output, as the above code.
import scala.concurrent.ExecutionContext.Implicits.global
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
it would be helpful, if you have any suggestions.
(I also tried using "par", it works fine,
I'm exploring other options other than 'par' and using frameworks like 'akka', 'cats' etc)
Based on Jatin instead of using default execution context which contains deamon threads
import scala.concurrent.ExecutionContext.Implicits.global
define execution context with non-deamon threads
implicit val nonDeamonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
Also you can use Future.traverse and Await like so
val resultF = Future.traverse(listOfFiles) { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
Await.result(resultF, Duration.Inf)
traverse converts List[Future[A]] to Future[List[A]].

Bulk Insert Data in HBase using Structured Spark Streaming

I'm reading data coming from a Kafka (100.000 line per second) using Structured Spark Streaming, and i'm trying to insert all the data in HBase.
I'm in Cloudera Hadoop 2.6 and I'm using Spark 2.3
I tried something like I've seen here.
eventhubs.writeStream
.foreach(new MyHBaseWriter[Row])
.option("checkpointLocation", checkpointDir)
.start()
.awaitTermination()
MyHBaseWriter looks like this :
class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
override def toPut(record: Row): Put = {
override val tableName: String = "hbase-table-name"
override def toPut(record: Row): Put = {
// Get Json
val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
val key = data.getOrElse(Map())("key")+ ""
val val = data.getOrElse(Map())("val")+ ""
val p = new Put(Bytes.toBytes(key))
//Add columns ...
p.addColumn(Bytes.toBytes(columnFamaliyName),Bytes.toBytes(columnName), Bytes.toBytes(val))
p
}
}
And the HBaseForeachWriter class looks like this :
trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
// I create HBase Connection Here
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(Variables.getTableName()))
}
override def process(record: RECORD): Unit = {
val put = toPut(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def toPut(record: RECORD): Put
}
So here I'm doing a put line by line, even if I allow 20 executors and 4 cores for each, I don't have the data inserted immediatly in HBase. So what I need to do is a bulk load ut I'm struggled because all what I find in the internet is to realize it with RDDs and Map/Reduce.
What I understand is slow rate of record ingestion in to hbase. I have few suggestions to you.
1) hbase.client.write.buffer .
the below property may help you.
hbase.client.write.buffer
Description Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory — on both the client and
server side since server instantiates the passed write buffer to
process it — but a larger buffer size reduces the number of RPCs made.
For an estimate of server-side memory-used, evaluate
hbase.client.write.buffer * hbase.regionserver.handler.count
Default 2097152 (around 2 mb )
I prefer foreachBatch see spark docs (its kind of foreachPartition in spark core) rather foreach
Also in your hbase writer extends ForeachWriter
open method intialize array list of put
in process add the put to the arraylist of puts
in close table.put(listofputs); and then reset the arraylist once you updated the table...
what it does basically your buffer size mentioned above is filled with 2 mb then it will flush in to hbase table. till then records wont go to hbase table.
you can increase that to 10mb and so....
In this way number of RPCs will be reduced. and huge chunk of data will be flushed and will be in hbase table.
when write buffer is filled up and a flushCommits in to hbase table is triggered.
Example code : in my answer
2) switch off WAL you can switch off WAL(write ahead log - Danger is no recovery) but it will speed up writes... if dont want to recover the data.
Note : if you are using solr or cloudera search on hbase tables you
should not turn it off since Solr will work on WAL. if you switch it
off then, Solr indexing wont work.. this is one common mistake many of
us does.
How to swtich off : https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
Basic architechture and link for further study :
http://hbase.apache.org/book.html#perf.writing
as I mentioned list of puts is good way... this is the old way (foreachPartition with list of puts) of doing before structured streaming example is like below .. where foreachPartition operates for each partition not every row.
def writeHbase(mydataframe: DataFrame) = {
val columnFamilyName: String = "c"
mydataframe.foreachPartition(rows => {
val puts = new util.ArrayList[ Put ]
rows.foreach(row => {
val key = row.getAs[ String ]("rowKey")
val p = new Put(Bytes.toBytes(key))
val columnV = row.getAs[ Double ]("x")
val columnT = row.getAs[ Long ]("y")
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("x"),
Bytes.toBytes(columnX)
)
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("y"),
Bytes.toBytes(columnY)
)
puts.add(p)
})
HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
})
}
To sumup :
What I feel is we need to understand the psycology of spark and hbase
to make then effective pair.

Inconsistency and abrupt behaviour of Spark filter, current timestamp and HBase custom sink in Spark structured streaming

I've a HBase table which look like following in a static Dataframe as HBaseStaticRecorddf
---------------------------------------------------------------
|rowkey|Name|Number|message|lastTS|
|-------------------------------------------------------------|
|266915488007398|somename|8759620897|Hi|1539931239 |
|266915488007399|somename|8759620898|Welcome|1540314926 |
|266915488007400|somename|8759620899|Hello|1540315092 |
|266915488007401|somename|8759620900|Namaskar|1537148280 |
--------------------------------------------------------------
Now I've a file stream source from which I'll get streaming rowkey. Now this timestamp(lastTS) for streaming rowkey's has to be checked whether they're older than one day or not. For this I've the following code where joinedDF is a streaming DataFrame, which is formed by joining another streaming DataFrame and HBase static dataframe as follows.
val HBaseStreamDF = HBaseStaticRecorddf.join(anotherStreamDF,"rowkey")
val newdf = HBaseStreamDF.filter(HBaseStreamDF.col("lastTS").cast("Long") < ((System.currentTimeMillis - 86400*1000)/1000))//records older than one day are eligible to get updated
Once the filter is done I want to save this record to the HBase like below.
newDF.writeStream
.foreach(new ForeachWriter[Row] {
println("inside foreach")
val tableName: String = "dummy"
val hbaseConfResources: Seq[String] = Seq("hbase-site.xml")
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
val hbaseConfig = HBaseConfiguration.create()
hbaseConfResources.foreach(hbaseConfig.addResource)
ConnectionFactory.createConnection(hbaseConfig)
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(tableName))
}
override def process(record: Row): Unit = {
var put = saveToHBase(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def saveToHBase(record: Row): Put = {
val p = new Put(Bytes.toBytes(record.getString(0)))
println("Now updating HBase for " + record.getString(0))
p.add(Bytes.toBytes("messageInfo"),
Bytes.toBytes("ts"),
Bytes.toBytes((System.currentTimeMillis/1000).toString)) //saving as second
p
}
}
).outputMode(OutputMode.Update())
.start().awaitTermination()
Now when any record is coming HBase is getting updated for the first time only. If the same record comes afterwards, it's just getting neglected and not working. However if some unique record comes which has not been processed by the Spark application, then it works. So any duplicated record is not getting processed for the second time.
Now here's some interesting thing.
If I remove the 86400 sec subtraction from (System.currentTimeMillis - 86400*1000)/1000) then everything is getting processed even if there's redundancy among the incoming records. But it's not intended and useful as it doesn't filter 1 day older records.
If I do the comparison in the filter condition in milliseconds without dividing by 1000(this requires HBase data also in millisecond) and save the record as second in the put object then again everything is processed. But If I change the format to seconds in the put object then it doesn't work.
I tried testing individually the filter and HBase put and they both works fine. But together they mess up if System.currentTimeMillis in filter has some arithmetic operations such as /1000 or -864000. If I remove the HBase sink part and use
newDF.writeStream.format("console").start().awaitTermination()
then again the filter logic works. And If I remove the filter then HBase sink works fine. But together, the custom sink for the HBase only works for the first time for the unique records. I tried several other filter logic like below but issue remains the same.
val newDF = newDF1.filter(col("lastTS").lt(LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30"))))
or
val newDF = newDF1.filter(col("lastTS").cast("Long") < LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30")))
How do I make the filter work and save the filtered records to the HBase with updated timestamp? I took reference of several other posts. But the result is same.

log from spark udf to driver

I have a simple UDF in databricks used in spark. I can't use println or log4j or something because it will get outputted to the execution, I need it in the driver. I have a very system log setup
var logMessage = ""
def log(msg: String){
logMessage += msg + "\n"
}
def writeLog(file: String){
println("start write")
println(logMessage)
println("end write")
}
def warning(msg: String){
log("*WARNING* " + msg)
}
val CleanText = (s: int) => {
log("I am in this UDF")
s+2
}
sqlContext.udf.register("CleanText", CleanText)
How can I get this to function properly and log to driver?
The closest mechanism in Apache Spark to what you're trying to do is accumulators. You can accumulate the log lines on the executors and access the result in driver:
// create a collection accumulator using the spark context:
val logLines: CollectionAccumulator[String] = sc.collectionAccumulator("log")
// log function adds a line to accumulator
def log(msg: String): Unit = logLines.add(msg)
// driver-side function can print the log using accumulator's *value*
def writeLog() {
import scala.collection.JavaConverters._
println("start write")
logLines.value.asScala.foreach(println)
println("end write")
}
val CleanText = udf((s: Int) => {
log(s"I am in this UDF, got: $s")
s+2
})
// use UDF in some transformation:
Seq(1, 2).toDF("a").select(CleanText($"a")).show()
writeLog()
// prints:
// start write
// I am in this UDF, got: 1
// I am in this UDF, got: 2
// end write
BUT: this isn't really recommended, especially not for logging purposes. If you log on every record, this accumulator would eventually crash your driver on OutOfMemoryError or just slow you down horribly.
Since you're using Databricks, I would check what options they support for log aggregation, or simply use the Spark UI to view the executor logs.
You can't... unless you want to go crazy and make some sort of log-back appender that sends logs over the network or something like that.
The code for the UDF will be run on all your executors when you evaluate a data frame. So, you might have 2000 hosts running it and each of them will log to their own location; that's how Spark works. The driver isn't the one running the code so it can't be logged to.
You can use YARN log aggregate to pull all the logs from the executors though for later analysis.
You could probably also write to a kafka stream or something creative like that with some work and write the logs contiguously later off the stream.

Apache Flink - Unable to get data From Twitter

I'm trying to get some messages with Twitter Streaming API using Apache Flink.
But, my code is not writing anything in the output file. I'm trying to count the input data for specific words.
Plese check my example:
import java.util.Properties
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.twitter._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import com.twitter.hbc.core.endpoint.{Location, StatusesFilterEndpoint, StreamingEndpoint}
import org.apache.flink.streaming.api.windowing.time.Time
import scala.collection.JavaConverters._
//////////////////////////////////////////////////////
// Create an Endpoint to Track our terms
class myFilterEndpoint extends TwitterSource.EndpointInitializer with Serializable {
#Override
def createEndpoint(): StreamingEndpoint = {
//val chicago = new Location(new Location.Coordinate(-86.0, 41.0), new Location.Coordinate(-87.0, 42.0))
val endpoint = new StatusesFilterEndpoint()
//endpoint.locations(List(chicago).asJava)
endpoint.trackTerms(List("odebrecht", "lava", "jato").asJava)
endpoint
}
}
object Connection {
def main(args: Array[String]): Unit = {
val props = new Properties()
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setParallelism(params.getInt("parallelism", 1))
props.setProperty(TwitterSource.CONSUMER_KEY, params.get("consumer-key"))
props.setProperty(TwitterSource.CONSUMER_SECRET, params.get("consumer-key"))
props.setProperty(TwitterSource.TOKEN, params.get("token"))
props.setProperty(TwitterSource.TOKEN_SECRET, params.get("token-secret"))
val source = new TwitterSource(props)
val epInit = new myFilterEndpoint()
source.setCustomEndpointInitializer(epInit)
val streamSource = env.addSource(source)
streamSource.map(s => (0, 1))
.keyBy(0)
.timeWindow(Time.minutes(2), Time.seconds(30))
.sum(1)
.map(t => t._2)
.writeAsText(params.get("output"))
env.execute("Twitter Count")
}
}
The point is, I have no error message and I can see at my Dashboard. My source is sending data to my TriggerWindow. But it is not receive any data:
I have two questions in once.
First: Why my source is sending bytes to my TriggerWindow if it is not received anything?
Seccond: Is something wrong to my code that I can't take data from twitter?
Your application source did not send actual records to the window which you can see by looking at the Records sent column. The bytes which are sent belong to control messages which Flink sends from time to time between the tasks. More specifically, it is the LatencyMarker message which is used to measure the end to end latency of a Flink job.
The code looks good to me. I even tried out your code and worked for me. Thus, I conclude that there has to be something wrong with the Twitter connection credentials. Please re-check whether you've entered the right credentials.