I have a sample Spark Code where I am trying to access the Values for tables from the Spark Configurations provided by spark_conf Option by using the typeSafe application.conf and Spark Conf in the Databricks UI.
The code I am using is below,
When I hit the Run Button in the Databricks UI, the job is finishing successfully, but the println function is printing dummyValue instead of ThisIsTableAOne,ThisIsTableBOne...
I can see from the Spark UI that, the Configurations for TableNames are being passed to the Spark job, but these values are not getting reflected in the Code.
try {
val inputConfig = AppConfig.getConfig("input")
val outputConfig = AppConfig.getConfig("output")
val tableA = inputConfig.getString("tableA")
val tableB = inputConfig.getString("tableB")
val tableC = outputConfig.getString("tableC")
println(
s"""
|$tableA
|$tableB
|$tableC
|""".stripMargin)
val userDataInTable = sparkSession.createDataFrame(Seq(
(1, "dummy", "dummy", "dummy")
)).toDF("id", "col2", "col3", "col4")
userDataInTable.show(false)
println("Completed Entry ")
} catch {
case e: Exception =>
sparkSession.stop()
e.printStackTrace
}
//application.conf contains below text,
spark.tableNameA="dummyValue"
spark.tableNameB="dummyValue"
spark.tableNameC="dummyValue"
input{
tableA=${spark.tableNameA}
tableB=${spark.tableNameB}
}
output{
tableC=${spark.tableNameC}
}
//AppConfig
val appConfig = ConfigFactory.load("application.conf")
def getConfig(moduleName: String): Config = {
val config = appConfig.getConfig(moduleName)
config
}
Related
I would like to know about the unit testing side of Spark Structured Streaming. My scenario is, I am getting data from Kafka and I am consuming it using Spark Structured Streaming and applying some transformations on top of the data.
I am not sure about how can I test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala. I am new to streaming.
tl;dr Use MemoryStream to add events and memory sink for the output.
The following code should help to get started:
import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val sqlCtx = spark.sqlContext
import spark.implicits._
val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")
// use sessions event stream to apply required transformations
val transformedSessions = ...
val streamingQuery = transformedSessions
.writeStream
.format("memory")
.queryName(queryName)
.option("checkpointLocation", checkpointLocation)
.outputMode(queryOutputMode)
.start
// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
eventGen.generate(userId = 1, offset = 1.second),
eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])
// check the output
// The output is in queryName table
// The following code simply shows the result
spark
.table(queryName)
.show(truncate = false)
So, I tried to implement the answer from #Jacek and I couldn't find how to create the eventGen object and also test a small streaming application for write data on the console. I am also using MemoryStream and here I show a small example working.
The class that I testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}
object StreamingDataFrames {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.appName(StreamingDataFrames.getClass.getSimpleName)
.master("local[2]")
.getOrCreate()
val lines = readData(spark, "socket")
val streamingQuery = writeData(lines)
streamingQuery.awaitTermination()
}
def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
val lines: DataFrame = spark.readStream
.format(source)
.option("host", "localhost")
.option("port", 12345)
.load()
lines
}
def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
println(s"Is this a streaming data frame: ${df.isStreaming}")
val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)
val query = shortLines.writeStream
.format(sink)
.queryName(queryName)
.outputMode(outputMode)
.start()
query
}
}
I test only the writeData method. This is way I split the query into 2 methods.
Then here is the Spec to test the class. I use a SharedSparkSession class to facilitate the open and close of spark context. Like it is shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite
class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {
test("spark structured streaming can read from memory socket") {
// We can import sql implicits
implicit val sqlCtx = sparkSession.sqlContext
import sqlImplicits._
val events = MemoryStream[String]
val queryName: String = "calleventaggs"
// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
"this is a value to read",
"and this is another value"
)
val currentOffset = events.addData(batch)
val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])
val result: DataFrame = sparkSession.table(queryName)
result.show
streamingQuery.awaitTermination(1000L)
assertResult(batch.size)(result.count)
val values = result.take(2)
assertResult(batch(0))(values(0).getString(0))
assertResult(batch(1))(values(1).getString(0))
}
}
I want to use SparkContext and SQLContext inside foreachPartition, but unable to do it due to serialization error. I know that both objects are not serializable, but I thought that foreachPartition is executed on the master, where both Spark Context and SQLContext are available..
Notation:
`msg -> Map[String,String]`
`result -> Iterable[Seq[Row]]`
This is my current code (UtilsDM is an object that extends Serializable). The part of code that fails starts from val schema =..., where I want to write result to the DataFrame and then save it to Parquet. Maybe the way I organized the code is inefficient, then I'd like to here your recommendations. Thanks.
// Here I am creating df from parquet file on S3
val exists = FileSystem.get(new URI("s3n://" + bucketNameCode), sc.hadoopConfiguration).exists(new Path("s3n://" + bucketNameCode + "/" + pathToSentMessages))
var df: DataFrame = null
if (exists) {
df = sqlContext
.read.parquet("s3n://bucket/pathToParquetFile")
}
UtilsDM.setDF(df)
// Here I process myDStream
myDStream.foreachRDD(rdd => {
rdd.foreachPartition{iter =>
val r = new RedisClient(UtilsDM.getHost, UtilsDM.getPort)
val producer = UtilsDM.createProducer
var df = UtilsDM.getDF
val result = iter.map{ msg =>
// ...
Seq(msg("key"),msg("value"))
}
// HERE I WANT TO WRITE result TO S3, BUT IT FAILS
val schema = StructType(
StructField("key", StringType, true) ::
StructField("value", StringType, true)
result.foreach { row =>
val rdd = sc.makeRDD(row)
val df2 = sqlContext.createDataFrame(rdd, schema)
// If the parquet file is not created, then create it
var df_final: DataFrame = null
if (df != null) {
df_final = df.unionAll(df2)
} else {
df_final = df2
}
df_final.write.parquet("s3n://bucket/pathToSentMessages)
}
}
})
EDIT:
I am using Spark 1.6.2 and Scala 2.10.6.
It is not possible. SparkContext, SQLContext and SparkSession can be used only on the driver. You can use sqlContext in the top level of foreachRDD:
myDStream.foreachRDD(rdd => {
val df = sqlContext.createDataFrame(rdd, schema)
...
})
You cannot use it in transformation / action:
myDStream.foreachRDD(rdd => {
rdd.foreach {
val df = sqlContext.createDataFrame(...)
...
}
})
You probably want equivalent of:
myDStream.foreachRDD(rdd => {
val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
val df = sqlContext.createDataFrame(foo, schema)
df.write.parquet("s3n://bucket/pathToSentMessages)
})
I found out that using an existing SparkContext (assume I have created a sparkContext sc beforehand) inside a loop works i.e.
// this works
stream.foreachRDD( _ => {
// update rdd
.... = SparkContext.getOrCreate().parallelize(...)
})
// this doesn't work - throws a SparkContext not serializable error
stream.foreachRDD( _ => {
// update rdd
.... = sc.parallelize(...)
})
We have a text files of 100K records each and we need to read the file line by line and insert it's value into hbase.
The file is '|' delimited.
Sample textFile example:
SLNO|Name|City|Pincode
1|ABC|Pune|400104
2|BMN|Delhi|100065
Each column will have different column family.
We are trying to implement this in Spark-Scala using HBase Bulk load.
We came across this link suggesting bulk load :
http://www.openkb.info/2015/01/how-to-use-scala-on-spark-to-load-data.html
With the below syntax for inserting into single column family.
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(conf)
job.setMapOutputKeyClass (classOf[ImmutableBytesWritable])
job.setMapOutputValueClass (classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad (job, table)
// Generate 10 sample data:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x=>{
val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(),
"c1".getBytes(), "value_xxx".getBytes() )
(new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
// Directly bulk load to Hbase/MapRDB tables.
rdd.saveAsNewAPIHadoopFile("/tmp/xxxx19", classOf[ImmutableBytesWritable],
classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
Can anyone advice on the bulk load insertion for multi-column family.
Do have a look at rdd.saveAsNewAPIHadoopDataset, to insert the data into the hbase table.
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
import spark.implicits._
val config = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", "ip's")
config.set("hbase.zookeeper.property.clientPort","2181")
config.set(TableInputFormat.INPUT_TABLE, "tableName")
val newAPIJobConfiguration1 = Job.getInstance(config)
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")
val hbasePuts= df.rdd.map((row: Row) => {
val put = new Put(Bytes.toBytes(row.getString(0)))
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
(new ImmutableBytesWritable(), put)
})
hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
Ref : https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
I'm working on this use case that involves converting DStreams to Dataframes after some transformations. I've simplified my code into the following snippet so as to reproduce the error. Also, I've mentioned below my environment settings.
Environment:
Spark Version: 2.2.0
Java: 1.8
Execution mode: local/ IntelliJ
Code:
object Tests {
def main(args: Array[String]): Unit = {
val spark: SparkSession = ...
import spark.implicits._
val df = List(
("jim", "usa"),
("raj", "india"))
.toDF("name", "country")
df.rdd
.map(x => x.toSeq)
.map(x => new GenericRowWithSchema(x.toArray, df.schema))
.foreach(println)
}
}
This results in NullPointerException as I'm directly using df.schema in map().
What I don't understand is that if I use the following code (basically storing the schema as a value before transforming), it works just fine.
Modified Code:
object Tests {
def main(args: Array[String]): Unit = {
val spark: SparkSession = ...
import spark.implicits._
val df = List(
("jim", "usa"),
("raj", "india"))
.toDF("name", "country")
val sc = df.schema
df.rdd
.map(x => x.toSeq)
.map(x => new GenericRowWithSchema(x.toArray, sc))
.foreach(println)
}
}
I wonder why this is happening as df.rdd is not an action and there is visible change in state of DataFrame just yet.
Any thoughts on this?
This happens because Apache Spark doesn't permit accessing non-local Datasets from executors and behavior is expected.
In contrast, when you extract schema to variable, it is just a local object which can be safely serialized.
I want to use SparkContext and SQLContext inside foreachPartition, but unable to do it due to serialization error. I know that both objects are not serializable, but I thought that foreachPartition is executed on the master, where both Spark Context and SQLContext are available..
Notation:
`msg -> Map[String,String]`
`result -> Iterable[Seq[Row]]`
This is my current code (UtilsDM is an object that extends Serializable). The part of code that fails starts from val schema =..., where I want to write result to the DataFrame and then save it to Parquet. Maybe the way I organized the code is inefficient, then I'd like to here your recommendations. Thanks.
// Here I am creating df from parquet file on S3
val exists = FileSystem.get(new URI("s3n://" + bucketNameCode), sc.hadoopConfiguration).exists(new Path("s3n://" + bucketNameCode + "/" + pathToSentMessages))
var df: DataFrame = null
if (exists) {
df = sqlContext
.read.parquet("s3n://bucket/pathToParquetFile")
}
UtilsDM.setDF(df)
// Here I process myDStream
myDStream.foreachRDD(rdd => {
rdd.foreachPartition{iter =>
val r = new RedisClient(UtilsDM.getHost, UtilsDM.getPort)
val producer = UtilsDM.createProducer
var df = UtilsDM.getDF
val result = iter.map{ msg =>
// ...
Seq(msg("key"),msg("value"))
}
// HERE I WANT TO WRITE result TO S3, BUT IT FAILS
val schema = StructType(
StructField("key", StringType, true) ::
StructField("value", StringType, true)
result.foreach { row =>
val rdd = sc.makeRDD(row)
val df2 = sqlContext.createDataFrame(rdd, schema)
// If the parquet file is not created, then create it
var df_final: DataFrame = null
if (df != null) {
df_final = df.unionAll(df2)
} else {
df_final = df2
}
df_final.write.parquet("s3n://bucket/pathToSentMessages)
}
}
})
EDIT:
I am using Spark 1.6.2 and Scala 2.10.6.
It is not possible. SparkContext, SQLContext and SparkSession can be used only on the driver. You can use sqlContext in the top level of foreachRDD:
myDStream.foreachRDD(rdd => {
val df = sqlContext.createDataFrame(rdd, schema)
...
})
You cannot use it in transformation / action:
myDStream.foreachRDD(rdd => {
rdd.foreach {
val df = sqlContext.createDataFrame(...)
...
}
})
You probably want equivalent of:
myDStream.foreachRDD(rdd => {
val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
val df = sqlContext.createDataFrame(foo, schema)
df.write.parquet("s3n://bucket/pathToSentMessages)
})
I found out that using an existing SparkContext (assume I have created a sparkContext sc beforehand) inside a loop works i.e.
// this works
stream.foreachRDD( _ => {
// update rdd
.... = SparkContext.getOrCreate().parallelize(...)
})
// this doesn't work - throws a SparkContext not serializable error
stream.foreachRDD( _ => {
// update rdd
.... = sc.parallelize(...)
})