How long is a DataFrame cached? - scala

Please help me understanding the scope of the cached dataframe within another function.
Example:
def mydf(): DataFrame = {
val df = sparkSession.sql("select * from emp")
df.cache() // <-- cached here
df
}
def joinWithDept(): Unit = {
val deptdf1 = sparkSession.sql("select * from dept")
val deptdf2 = mydf().join(deptdf1,Seq("empid")) // <-- using the cached dataset?
deptdf2.show()
}
def joinWithLocation() : Unit = {
val locdf1 = sparkSession.sql("select * from from location")
val locdf2 = mydf().join(locdf1,Seq("empid")) // <-- using the cached dataset?
locdf2.show()
}
def run(): Unit = {
joinWithDept()
joinWithLocation()
}
All above functions are defined in same class. I not sure, if will get the benefit of dataframe caching performed in mydf() function? How to do I verify that it is getting the benefit of catching?

joinWithDept and joinWithLocation will both use the (cached logical query plan) of the DataFrame from mydf().
You can check the cached DataFrame in Storage tab of web UI.
You can also verify that the joins use the cached dataframe by reviewing physical query plans (by explain or in web UI) where you should see InMemoryRelations used.

Related

How to iterate Big Query TableResult correctly?

I have a complex join query in Big Query and need to run in a spark job. This is the current code:
val bigquery = BigQueryOptions.newBuilder().setProjectId(bigQueryConfig.bigQueryProjectId)
.setCredentials(credentials)
.build().getService
val query =
//some complex query
val queryConfig: QueryJobConfiguration =
QueryJobConfiguration.newBuilder(
query)
.setUseLegacySql(false)
.setPriority(QueryJobConfiguration.Priority.BATCH) //(tried with and without)
.build()
val jobId: JobId = JobId.newBuilder().setRandomJob().build()
val queryJob: Job = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build).waitFor()
val result = queryJob.getQueryResults()
val output = result.iterateAll().iterator().asScala.to[Seq].map { row: FieldValueList =>
//create case class from the row
}
It keeps running into this error:
Exceeded rate limits: Your project: XXX exceeded quota for tabledata.list bytes per second per project.
Is there a way to better iterate through the results? I have tried to do setPriority(QueryJobConfiguration.Priority.BATCH) on the query job configuration, but it doesn't improve results. Also tried to reduce the number of spark executors to 1, but of no use.
Instead of reading the query results directly, you can use the spark-bigquery-connector to read them into a DataFrame:
val queryConfig: QueryJobConfiguration =
QueryJobConfiguration.newBuilder(
query)
.setUseLegacySql(false)
.setPriority(QueryJobConfiguration.Priority.BATCH) //(tried with and without)
.setDestinationTable(TableId.of(destinationDataset, destinationTable))
.build()
val jobId: JobId = JobId.newBuilder().setRandomJob().build()
val queryJob: Job = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build).waitFor()
val result = queryJob.getQueryResults()
// read into DataFrame
val data = spark.read.format("bigquery")
.option("dataset",destinationDataset)
.option("table" destinationTable)
.load()
We resolved the situation by providing a custom page size on the TableResult

Bulk Insert Data in HBase using Structured Spark Streaming

I'm reading data coming from a Kafka (100.000 line per second) using Structured Spark Streaming, and i'm trying to insert all the data in HBase.
I'm in Cloudera Hadoop 2.6 and I'm using Spark 2.3
I tried something like I've seen here.
eventhubs.writeStream
.foreach(new MyHBaseWriter[Row])
.option("checkpointLocation", checkpointDir)
.start()
.awaitTermination()
MyHBaseWriter looks like this :
class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
override def toPut(record: Row): Put = {
override val tableName: String = "hbase-table-name"
override def toPut(record: Row): Put = {
// Get Json
val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
val key = data.getOrElse(Map())("key")+ ""
val val = data.getOrElse(Map())("val")+ ""
val p = new Put(Bytes.toBytes(key))
//Add columns ...
p.addColumn(Bytes.toBytes(columnFamaliyName),Bytes.toBytes(columnName), Bytes.toBytes(val))
p
}
}
And the HBaseForeachWriter class looks like this :
trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
// I create HBase Connection Here
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(Variables.getTableName()))
}
override def process(record: RECORD): Unit = {
val put = toPut(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def toPut(record: RECORD): Put
}
So here I'm doing a put line by line, even if I allow 20 executors and 4 cores for each, I don't have the data inserted immediatly in HBase. So what I need to do is a bulk load ut I'm struggled because all what I find in the internet is to realize it with RDDs and Map/Reduce.
What I understand is slow rate of record ingestion in to hbase. I have few suggestions to you.
1) hbase.client.write.buffer .
the below property may help you.
hbase.client.write.buffer
Description Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory — on both the client and
server side since server instantiates the passed write buffer to
process it — but a larger buffer size reduces the number of RPCs made.
For an estimate of server-side memory-used, evaluate
hbase.client.write.buffer * hbase.regionserver.handler.count
Default 2097152 (around 2 mb )
I prefer foreachBatch see spark docs (its kind of foreachPartition in spark core) rather foreach
Also in your hbase writer extends ForeachWriter
open method intialize array list of put
in process add the put to the arraylist of puts
in close table.put(listofputs); and then reset the arraylist once you updated the table...
what it does basically your buffer size mentioned above is filled with 2 mb then it will flush in to hbase table. till then records wont go to hbase table.
you can increase that to 10mb and so....
In this way number of RPCs will be reduced. and huge chunk of data will be flushed and will be in hbase table.
when write buffer is filled up and a flushCommits in to hbase table is triggered.
Example code : in my answer
2) switch off WAL you can switch off WAL(write ahead log - Danger is no recovery) but it will speed up writes... if dont want to recover the data.
Note : if you are using solr or cloudera search on hbase tables you
should not turn it off since Solr will work on WAL. if you switch it
off then, Solr indexing wont work.. this is one common mistake many of
us does.
How to swtich off : https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
Basic architechture and link for further study :
http://hbase.apache.org/book.html#perf.writing
as I mentioned list of puts is good way... this is the old way (foreachPartition with list of puts) of doing before structured streaming example is like below .. where foreachPartition operates for each partition not every row.
def writeHbase(mydataframe: DataFrame) = {
val columnFamilyName: String = "c"
mydataframe.foreachPartition(rows => {
val puts = new util.ArrayList[ Put ]
rows.foreach(row => {
val key = row.getAs[ String ]("rowKey")
val p = new Put(Bytes.toBytes(key))
val columnV = row.getAs[ Double ]("x")
val columnT = row.getAs[ Long ]("y")
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("x"),
Bytes.toBytes(columnX)
)
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("y"),
Bytes.toBytes(columnY)
)
puts.add(p)
})
HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
})
}
To sumup :
What I feel is we need to understand the psycology of spark and hbase
to make then effective pair.

Spark : how to parallelize subsequent specific work on each dataframe partitions

My Spark application is as follow :
1) execute large query with Spark SQL into the dataframe "dataDF"
2) foreach partition involved in "dataDF" :
2.1) get the associated "filtered" dataframe, in order to have only the partition associated data
2.2) do specific work with that "filtered" dataframe and write output
The code is as follow :
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")
for {
row <- dataDF.dropDuplicates("partition").collect
} yield {
val partition_str : String = row.getAs[String](0)
val filtered = dataDF.filter($"partition" .equalTo( lit( partition_str ) ) )
// ... on each partition, do work depending on the partition, and write result on HDFS
// Example :
if( partition_str == "category_A" ){
// do group by, do pivot, do mean, ...
val x = filtered
.groupBy("column1","column2")
...
// write final DF
x.write.parquet("some/path")
} else if( partition_str == "category_B" ) {
// select specific field and apply calculation on it
val y = filtered.select(...)
// write final DF
x.write.parquet("some/path")
} else if ( ... ) {
// other kind of calculation
// write results
} else {
// other kind of calculation
// write results
}
}
Such algorithm works successfully. The Spark SQL query is fully distributed. However the particular work done on each resulting partition is done sequentially, and the result is inneficient especially because each write related to a partition is done sequentially.
In such case, what are the ways to replace the "for yield" by something in parallel/async ?
Thanks
You could use foreachPartition if writing to data stores outside Hadoop scope with specific logic needed for that particular env.
Else map, etc.
.par parallel collections (Scala) - but that is used with caution. For reading files and pre-processing them, otherwise possibly considered risky.
Threads.
You need to check what you are doing and if the operations can be referenced, usewd within a foreachPartition block, etc. You need to try as some aspects can only be written for the driver and then get distributed to the executors via SPARK to the workers. But you cannot write, for example, spark.sql for the worker as per below - at the end due to some formatting aspect errors I just got here in the block of text. See end of post.
Likewise df.write or df.read cannot be used in the below either. What you can do is write individual execute/mutate statements to, say, ORACLE, mySQL.
Hope this helps.
rdd.foreachPartition(iter => {
while(iter.hasNext) {
val item = iter.next()
// do something
spark.sql("INSERT INTO tableX VALUES(2,7, 'CORN', 100, item)")
// do some other stuff
})
or
RDD.foreachPartition (records => {
val JDBCDriver = "com.mysql.jdbc.Driver" ...
...
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val connection = DriverManager.getConnection(ConnectionURL, jdbcUsername, jdbcPassword)
...
val mutateStatement = connection.createStatement()
val queryStatement = connection.createStatement()
...
records.foreach (record => {
val val1 = record._1
val val2 = record._2
...
mutateStatement.execute (s"insert into sample (k,v) values(${val1}, ${nIterVal})")
})
}
)

Inconsistency and abrupt behaviour of Spark filter, current timestamp and HBase custom sink in Spark structured streaming

I've a HBase table which look like following in a static Dataframe as HBaseStaticRecorddf
---------------------------------------------------------------
|rowkey|Name|Number|message|lastTS|
|-------------------------------------------------------------|
|266915488007398|somename|8759620897|Hi|1539931239 |
|266915488007399|somename|8759620898|Welcome|1540314926 |
|266915488007400|somename|8759620899|Hello|1540315092 |
|266915488007401|somename|8759620900|Namaskar|1537148280 |
--------------------------------------------------------------
Now I've a file stream source from which I'll get streaming rowkey. Now this timestamp(lastTS) for streaming rowkey's has to be checked whether they're older than one day or not. For this I've the following code where joinedDF is a streaming DataFrame, which is formed by joining another streaming DataFrame and HBase static dataframe as follows.
val HBaseStreamDF = HBaseStaticRecorddf.join(anotherStreamDF,"rowkey")
val newdf = HBaseStreamDF.filter(HBaseStreamDF.col("lastTS").cast("Long") < ((System.currentTimeMillis - 86400*1000)/1000))//records older than one day are eligible to get updated
Once the filter is done I want to save this record to the HBase like below.
newDF.writeStream
.foreach(new ForeachWriter[Row] {
println("inside foreach")
val tableName: String = "dummy"
val hbaseConfResources: Seq[String] = Seq("hbase-site.xml")
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
val hbaseConfig = HBaseConfiguration.create()
hbaseConfResources.foreach(hbaseConfig.addResource)
ConnectionFactory.createConnection(hbaseConfig)
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(tableName))
}
override def process(record: Row): Unit = {
var put = saveToHBase(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def saveToHBase(record: Row): Put = {
val p = new Put(Bytes.toBytes(record.getString(0)))
println("Now updating HBase for " + record.getString(0))
p.add(Bytes.toBytes("messageInfo"),
Bytes.toBytes("ts"),
Bytes.toBytes((System.currentTimeMillis/1000).toString)) //saving as second
p
}
}
).outputMode(OutputMode.Update())
.start().awaitTermination()
Now when any record is coming HBase is getting updated for the first time only. If the same record comes afterwards, it's just getting neglected and not working. However if some unique record comes which has not been processed by the Spark application, then it works. So any duplicated record is not getting processed for the second time.
Now here's some interesting thing.
If I remove the 86400 sec subtraction from (System.currentTimeMillis - 86400*1000)/1000) then everything is getting processed even if there's redundancy among the incoming records. But it's not intended and useful as it doesn't filter 1 day older records.
If I do the comparison in the filter condition in milliseconds without dividing by 1000(this requires HBase data also in millisecond) and save the record as second in the put object then again everything is processed. But If I change the format to seconds in the put object then it doesn't work.
I tried testing individually the filter and HBase put and they both works fine. But together they mess up if System.currentTimeMillis in filter has some arithmetic operations such as /1000 or -864000. If I remove the HBase sink part and use
newDF.writeStream.format("console").start().awaitTermination()
then again the filter logic works. And If I remove the filter then HBase sink works fine. But together, the custom sink for the HBase only works for the first time for the unique records. I tried several other filter logic like below but issue remains the same.
val newDF = newDF1.filter(col("lastTS").lt(LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30"))))
or
val newDF = newDF1.filter(col("lastTS").cast("Long") < LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30")))
How do I make the filter work and save the filtered records to the HBase with updated timestamp? I took reference of several other posts. But the result is same.

Spark: show and collect-println giving different outputs

I am using Spark 2.2
I feel like I have something odd going on here. Basic premise is that
I have a set of KIE/Drools rules running through a Dataset of profile objects
I am then trying to show/collect-print the resulting output
I then cast the output as a tuple to flatmap it later
Code below
implicit val mapEncoder = Encoders.kryo[java.util.HashMap[String, Any]]
implicit val recommendationEncoder = Encoders.kryo[Recommendation]
val mapper = new ObjectMapper()
val kieOuts = uberDs.map(profile => {
val map = mapper.convertValue(profile, classOf[java.util.HashMap[String, Any]])
val profile = Profile(map)
// setup the kie session
val ks = KieServices.Factory.get
val kContainer = ks.getKieClasspathContainer
val kSession = kContainer.newKieSession() //TODO: stateful session, how to do stateless?
// insert profile object into kie session
val kCmds = ks.getCommands
val cmds = new java.util.ArrayList[Command[_]]()
cmds.add(kCmds.newInsert(profile))
cmds.add(kCmds.newFireAllRules("outFired"))
// fire kie rules
val results = kSession.execute(kCmds.newBatchExecution(cmds))
val fired = results.getValue("outFired").toString.toInt
// collect the inserted recommendation objects and create uid string
import scala.collection.JavaConversions._
var gresults = kSession.getObjects
gresults = gresults.drop(1) // drop the inserted profile object which also gets collected
val recommendations = scala.collection.mutable.ListBuffer[Recommendation]()
gresults.toList.foreach(reco => {
val recommendation = reco.asInstanceOf[Recommendation]
recommendations += recommendation
})
kSession.dispose
val uIds = StringBuilder.newBuilder
if(recommendations.size > 0) {
recommendations.foreach(recommendation => {
uIds.append(recommendation.getOfferId + "_" + recommendation.getScore)
uIds.append(";")
})
uIds.deleteCharAt(uIds.size - 1)
}
new ORecommendation(profile.getAttributes().get("cId").toString.toLong, fired, uIds.toString)
})
println("======================Output#1======================")
kieOuts.show(1000, false)
println("======================Output#2======================")
kieOuts.collect.foreach(println)
//separating cid and and each uid into individual rows
val kieOutsDs = kieOuts.as[(Long, Int, String)]
println("======================Output#3======================")
kieOutsDs.show(1000, false)
(I have sanitized/shortened the id's below, they are much bigger but with a similar format)
What I am seeing as outputs
Output#1 has a set of uIds(as String) come up
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cId |rulesFired | eligibleUIds |
|842 | 17|123-25_2.0;12345678-48_9.0;28a-ad_5.0;123-56_10.0;123-27_2.0;123-32_3.0;c6d-e5_5.0;123-26_2.0;123-51_10.0;8e8-c1_5.0;123-24_2.0;df8-ad_5.0;123-36_5.0;123-16_2.0;123-34_3.0|
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Output#2 has mostly a similar set of uIds show up(usually off by 1 element)
ORecommendation(842,17,123-36_5.0;123-24_2.0;8e8-c1_5.0;df8-ad_5.0;28a-ad_5.0;660-73_5.0;123-34_3.0;123-48_9.0;123-16_2.0;123-51_10.0;123-26_2.0;c6d-e5_5.0;123-25_2.0;123-56_10.0;123-32_3.0)
Output#3 is same as #Output1
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|842 | 17 |123-32_3.0;8e8-c1_5.0;123-51_10.0;123-48_9.0;28a-ad_5.0;c6d-e5_5.0;123-27_2.0;123-16_2.0;123-24_2.0;123-56_10.0;123-34_3.0;123-36_5.0;123-6_2.0;123-25_2.0;660-73_5.0|
Every time I run it the difference between Output#1 and Output#2 is 1 element but never the same element (In the above example, Output#1 has 123-27_2.0 but Output#2 has 660-73_5.0)
Should they not be the same? I am still new to Scala/Spark and feel like I am missing something very fundamental
I think I figured this out, adding cache to kieOuts atleast got me identical outputs between show and collect.
I will be looking at why KIE gives me different output for every run of the same input but that is a different issue