Flattening JSON into Tabular Structure using Spark-Scala RDD only function - scala

I have nested JSON and would like to have the output in a tabular structure. I am able to parse the JSON values individually, but I am having some problems tabularizing it. I can do it easily via a DataFrame, but I want to do it using "RDD ONLY" functions. Any help is much appreciated.
Input JSON:
{ "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}
Expected Output:
I tried the Spark code below, but I am getting output like this, and the Row() object is not able to parse it.
079562193,EA,List(SELLABLE, HELD),List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z),List(1400.0, 800.0),List(SINGLE, SINGLE)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

def main(Args : Array[String]): Unit = {
  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", StringType, true),
    StructField("effectiveDateTime", StringType, true),
    StructField("quantity", StringType, true),
    StructField("stockKeepingLevel", StringType, true)
  ))

  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
    parse(eachJsonMessages)
  }).map(insideEachJson => {
    implicit val formats = org.json4s.DefaultFormats

    val prodID = (insideEachJson \ "level" \ "productReference" \ "TPNB").extract[String].toString
    val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String].toString
    val state = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "state").extract[String]).toString()
    val effectiveDateTime = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "effectiveDateTime").extract[String]).toString
    val quantity = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "quantity").extract[Double]).toString
    val stockKeepingLevel = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "stockKeepingLevel").extract[String]).toString

    //Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
    println(prodID, unitOfMeasure, state, effectiveDateTime, quantity, stockKeepingLevel)
  }).collect()

  // sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}

Hi, below is the "DATAFRAME"-only solution which I developed. I am looking for a complete "RDD ONLY" solution.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

def main (Args : Array[String]): Unit = {
  val conf = new SparkConf().setAppName("JSON Read and Write using Spark DataFrame few more options").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val sourceJsonDF = sqlContext.read.json("product.json")

  val jsonFlatDF_level = sourceJsonDF.withColumn("explode_states", explode($"level.states"))
    .withColumn("explode_link", explode($"level._link"))
    .select($"level.productReference.TPNB".as("TPNB"),
      $"level.productReference.unitOfMeasure".as("level_unitOfMeasure"),
      $"level.locationReference.location".as("level_location"),
      $"level.locationReference.type".as("level_type"),
      $"explode_states.state".as("level_state"),
      $"explode_states.effectiveDateTime".as("level_effectiveDateTime"),
      $"explode_states.stockQuantity.quantity".as("level_quantity"),
      $"explode_states.stockQuantity.stockKeepingLevel".as("level_stockKeepingLevel"),
      $"explode_link.rel".as("level_rel"),
      $"explode_link.href".as("level_href"),
      $"explode_link.method".as("level_method"))

  jsonFlatDF_level.show()
}

DataFrames and Datasets are much more optimized than RDDs, and there are a lot of options to try with them to reach the solution we desire.
In my opinion, the DataFrame was developed to make developers comfortable viewing data in tabular form so that logic can be implemented with ease, so I always suggest using DataFrames or Datasets.
Without talking much, I am posting the solution below using a DataFrame. Once you have a DataFrame, switching to an RDD is very easy.
Your desired solution is below (you will have to find a way to read the JSON file the way it's done with the JSON string below: that's an assignment for you :) good luck).
import org.apache.spark.sql.functions._
val json = """ { "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}"""
val rddJson = sparkContext.parallelize(Seq(json))
var df = sqlContext.read.json(rddJson)
df = df.withColumn("prodID", df("level.productReference.prodID"))
.withColumn("unitOfMeasure", df("level.productReference.unitOfMeasure"))
.withColumn("states", explode(df("level.states")))
.drop("level")
df = df.withColumn("state", df("states.state"))
.withColumn("effectiveDateTime", df("states.effectiveDateTime"))
.withColumn("quantity", df("states.stockQuantity.quantity"))
.withColumn("stockKeepingLevel", df("states.stockQuantity.stockKeepingLevel"))
.drop("states")
df.show(false)
This will give output as:
+------+-------------+-----+-------------------------+--------+-----------------+
|prodID|unitOfMeasure|state|effectiveDateTime |quantity|stockKeepingLevel|
+------+-------------+-----+-------------------------+--------+-----------------+
|1234 |EA |SELL |2015-10-09T00:55:23.6345Z|1400.0 |A |
|1234 |EA |HELD |2015-10-09T00:55:23.6345Z|800.0 |B |
+------+-------------+-----+-------------------------+--------+-----------------+
Now that you have the desired output as a DataFrame, converting to an RDD is just a matter of calling .rdd:
df.rdd.foreach(println)
will give output as below
[1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A]
[1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B]
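If you then need typed values instead of generic Row objects, a small sketch (using the column names from the DataFrame above) is to pull each field out with getAs while mapping over df.rdd:

// Hypothetical follow-up: convert each Row of df.rdd into a typed tuple
val tupleRdd = df.rdd.map { row =>
  (row.getAs[String]("prodID"),
   row.getAs[String]("unitOfMeasure"),
   row.getAs[String]("state"),
   row.getAs[String]("effectiveDateTime"),
   row.getAs[Double]("quantity"),
   row.getAs[String]("stockKeepingLevel"))
}
tupleRdd.foreach(println)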
I hope this is helpful

There are two versions of the solution to your question.
Version 1:
def main(Args : Array[String]): Unit = {
  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", StringType, true),
    StructField("effectiveDateTime", StringType, true),
    StructField("quantity", StringType, true),
    StructField("stockKeepingLevel", StringType, true)
  ))

  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
    parse(eachJsonMessages)
  }).map(insideEachJson => {
    implicit val formats = org.json4s.DefaultFormats

    val prodID = (insideEachJson \ "level" \ "productReference" \ "prodID").extract[String].toString
    val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String].toString
    val state = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "state").extract[String]).toString()
    val effectiveDateTime = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "effectiveDateTime").extract[String]).toString
    val quantity = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "quantity").extract[Double]).toString
    val stockKeepingLevel = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "stockKeepingLevel").extract[String]).toString

    Row(prodID, unitOfMeasure, state, effectiveDateTime, quantity, stockKeepingLevel)
  })

  sqlContext.createDataFrame(x, salesSchema).show(truncate = false)
}
This would give you the following output:
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|1234 |EA |List(SELL, HELD)|List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z)|List(1400.0, 800.0)|List(A, B) |
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
Version 2:
def main(Args : Array[String]): Unit = {
  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", ArrayType(StringType, true), true),
    StructField("effectiveDateTime", ArrayType(StringType, true), true),
    StructField("quantity", ArrayType(DoubleType, true), true),
    StructField("stockKeepingLevel", ArrayType(StringType, true), true)
  ))

  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
    parse(eachJsonMessages)
  }).map(insideEachJson => {
    implicit val formats = org.json4s.DefaultFormats

    val prodID = (insideEachJson \ "level" \ "productReference" \ "prodID").extract[String].toString
    val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String].toString
    val state = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "state").extract[String])
    val effectiveDateTime = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "effectiveDateTime").extract[String])
    val quantity = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "quantity").extract[Double])
    val stockKeepingLevel = (insideEachJson \ "level" \ "states").extract[List[JValue]].
      map(x => (x \ "stockQuantity").extract[JValue]).map(x => (x \ "stockKeepingLevel").extract[String])

    Row(prodID, unitOfMeasure, state, effectiveDateTime, quantity, stockKeepingLevel)
  })

  sqlContext.createDataFrame(x, salesSchema).show(truncate = false)
}
This would give you the following output:
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|prodID|unitOfMeasure|state |effectiveDateTime |quantity |stockKeepingLevel|
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|1234 |EA |[SELL, HELD]|[2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z]|[1400.0, 800.0]|[A, B] |
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
The difference between Version 1 and Version 2 is the schema. In Version 1 every column is cast to a String, whereas in Version 2 the state-level columns are typed as Arrays.
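If what you actually need is one row per state (as in the DataFrame output shown in the other answer) rather than array columns, a possible sketch, reusing the json4s parsing, the Version 1 schema, and the names above while staying with RDD operations only, is to flatMap each parsed message into one Row per element of states:

val rowsPerState = ReadAlljsonMessageInFile_RDD.map(parse(_)).flatMap { insideEachJson =>
  implicit val formats = org.json4s.DefaultFormats

  val prodID = (insideEachJson \ "level" \ "productReference" \ "prodID").extract[String]
  val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String]
  val states = (insideEachJson \ "level" \ "states").extract[List[JValue]]

  // one Row per element of the "states" array
  states.map { s =>
    Row(prodID,
      unitOfMeasure,
      (s \ "state").extract[String],
      (s \ "effectiveDateTime").extract[String],
      (s \ "stockQuantity" \ "quantity").extract[Double].toString, // Version 1 schema expects String
      (s \ "stockQuantity" \ "stockKeepingLevel").extract[String])
  }
}
sqlContext.createDataFrame(rowsPerState, salesSchema).show(truncate = false)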

Related

Create JOIN condition on variable number of columns in Scala

Suppose I have two data frames and I would like to join them based on certain columns. The list of these join columns can differ based on the data frames that I'm joining, but we can always count on the fact that the two data frames we will join using joinDfs will always have the same column names as joinCols.
I am trying to figure out how to form the joinCondition given the assumptions/requirements above. Currently, it is returning (((a.colName1 = b.colName1) AND (a.colName2 = b.colName2)) AND (a.colName3 = b.colName3)), which is not quite returning what I'm expecting from the INNER JOIN in the example below.
Thank you in advance for helping me, a newbie to Scala and Spark, figure out how to form a proper joinCondition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.types._

def joinDfs(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame = {
  val firstDfAlias: String = "a"
  val secondDfAlias: String = "b"

  // This is what I am trying to figure out and need help with
  val joinCondition = joinCols
    .map(c => col(f"${firstDfAlias}.${c}") === col(f"${secondDfAlias}.${c}"))
    .reduce((x, y) => (x && y))

  secondDf.as(secondDfAlias).join(
    firstDf.as(firstDfAlias),
    joinCondition,
    "inner"
  ).select(cols.map(col): _*)
}
// This is an example of data frames that I'm trying to join
// But these data frames can change in terms of number of columns in each
// and data types, etc. The only thing we know for sure is that these
// data frames will contain some or all columns with the same name and
// we will use them to join the two data frames.
val firstDfSchema = StructType(List(
StructField(name = "colName1", dataType = LongType, nullable=true),
StructField(name = "colName2", dataType = StringType, nullable=true),
StructField(name = "colName3", dataType = LongType, nullable=true),
StructField(name = "market_id", dataType = LongType, nullable=true)
))
val firstDfData = Seq(
Row(123L, "123", 123L, 123L),
Row(234L, "234", 234L, 234L),
Row(345L, "345", 345L, 345L),
Row(456L, "456", 456L, 456L)
)
val firstDf = spark.createDataFrame(spark.sparkContext.parallelize(firstDfData), firstDfSchema)
val secondDfSchema = StructType(List(
StructField(name = "colName1", dataType = LongType, nullable=true),
StructField(name = "colName2", dataType = StringType, nullable = true),
StructField(name = "colName3", dataType = LongType, nullable = true),
StructField(name = "num_orders", dataType = LongType, nullable=true)
))
val secondDfData = Seq(
Row(123L, "123", 123L, 1L),
Row(234L, "234", 234L, 2L),
Row(567L, "567", 567L, 3L)
)
val secondDf = spark.createDataFrame(spark.sparkContext.parallelize(secondDfData), secondDfSchema)
// Suppose we are going to join the two data frames above on the following columns
val joinCols: Array[String] = Array("colName1", "colName2", "colName3")
val finalDf = joinDfs(firstDf, secondDf, joinCols)
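As a side note, if the goal is simply an inner join on a variable list of shared column names, Spark's join overload that takes a Seq of column names may be a simpler sketch of what is being attempted here (it also avoids duplicate join columns in the result; joinDfs2 below is just an illustrative name):

import org.apache.spark.sql.DataFrame

// Hypothetical alternative: join on whatever columns the two frames share by name
def joinDfs2(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame =
  firstDf.join(secondDf, joinCols.toSeq, "inner")

val finalDf2 = joinDfs2(firstDf, secondDf, joinCols)
finalDf2.show(false)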

Spark Structured Streaming output not displaying in IntelliJ console

I'm trying to emulate this example from Jacek Laskowski's book: read a CSV file and aggregate the data in the console. But for some reason, the output is not displayed in the IntelliJ console.
scala> spark.version
res4: String = 2.2.0
I found some references (1, 2, 3, 4, 5) here on SO and tried everything, but I didn't solve the problem.
This is the code:
package org.sample
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
object App {
def main(args : Array[String]): Unit = {
val DIR = new java.io.File(".").getCanonicalPath + "dataset/stream_in"
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Spark Structured Streaming Job")
val spark = SparkSession.builder()
.appName("Spark Structured Streaming Job")
.master("local[*]")
.getOrCreate()
val reader = spark.readStream
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("latestFirst", "true")
.schema(SchemaDefinition.csvSchema)
.load(DIR + "/*")
reader.createOrReplaceTempView("user_records")
val tranformation = spark.sql(
"""
SELECT carrier, marital_status, COUNT(1) as num_users
FROM user_records
GROUP BY carrier, marital_status
"""
)
val consoleStream = tranformation
.writeStream
.format("console")
.option("truncate", false)
.outputMode("complete")
.start()
consoleStream.awaitTermination()
}
}
My output is only:
18/11/30 15:40:31 INFO StreamExecution: Streaming query made progress: {
"id" : "9420f826-0daf-40c9-a427-e89ed42ee738",
"runId" : "991c9085-3425-4ea6-82af-4cef20007a66",
"name" : null,
"timestamp" : "2018-11-30T14:40:31.117Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 2,
"triggerExecution" : 2
},
"eventTime" : {
"watermark" : "1970-01-01T00:00:00.000Z"
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[file:/structured-streamming-taskdataset/stream_in/*]",
"startOffset" : null,
"endOffset" : null,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink#6a62e7ef"
}
}
I redefined the file and now it works for me.
Differences:
- Removed the unnecessary conf: with SparkSession we do not need to create a SparkConf.
- The .load(DIR + "/*") didn't work; what worked was keeping only the path dataset/stream_in.
- The data in the transformation was wrong (the fields didn't match the file).
Final code:
package org.sample
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
object StreamCities {
def main(args : Array[String]): Unit = {
// Turn off logs in console
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession.builder()
.appName("Spark Structured Streaming get CSV and agregate")
.master("local[*]")
.getOrCreate()
// 01. Schema Definition: We'll put the structure of our
// CSV file. Can be done using a class, but for simplicity
// I'll keep it here
import org.apache.spark.sql.types._
def csvSchema = StructType {
StructType(Array(
StructField("id", StringType, true),
StructField("name", StringType, true),
StructField("city", StringType, true)
))
}
// 02. Read the Stream: Create DataFrame representing the
// stream of the CSV according our Schema. The source it is
// the folder in the .load() option
val users = spark.readStream
.format("csv")
.option("sep", ",")
.option("header", true)
.schema(csvSchema)
.load("dataset/stream_in")
// 03. Aggregation of the Stream: To use the .writeStream()
// we must pass a DF aggregated. We can do this using the
// Untyped API or SparkSQL
// 03.1: Aggregation using untyped API
//val aggUsers = users.groupBy("city").count()
// 03.2: Aggregation using Spark SQL
users.createOrReplaceTempView("user_records")
val aggUsers = spark.sql(
"""
SELECT city, COUNT(1) as num_users
FROM user_records
GROUP BY city"""
)
// Print the schema of our aggregation
println(aggUsers.printSchema())
// 04. Output the stream: Now we'll write our stream in
// console and as new files will be included in the folder
// that Spark it's listening the results will be updated
val consoleStream = aggUsers.writeStream
.outputMode("complete")
.format("console")
.start()
.awaitTermination()
}
}
I had the same issue. I solved it by adding
.option("startingOffsets", "earliest")
in the reader:
def read_from_kafka(spark):
    df_sales = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "spark_topic_sales") \
        .option("startingOffsets", "earliest") \
        .load()

    return df_sales
This will make sure the data is read from the topic from the earliest offset. I hope it helps someone so they don't spend two hours on it like I did!
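For completeness, since the rest of this thread is in Scala, a rough Scala equivalent of that reader would look like the sketch below (the broker address and topic name are just the placeholders from the Python snippet):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical Scala version of read_from_kafka, with startingOffsets set to "earliest"
def readFromKafka(spark: SparkSession): DataFrame =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "spark_topic_sales")
    .option("startingOffsets", "earliest")
    .load()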

Spark Structured Streaming with Hbase integration

We are doing streaming on Kafka data which is being collected from MySQL. Now, once all the analytics has been done, I want to save my data directly to HBase. I have been through the Spark Structured Streaming documentation but couldn't find any sink for HBase. The code which I used to read the data from Kafka is below.
val records = spark.readStream.format("kafka").option("subscribe", "kaapociot").option("kafka.bootstrap.servers", "XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load
val jsonschema = StructType(Seq(StructField("header", StringType, true),StructField("event", StringType, true)))
val uschema = StructType(Seq(
  StructField("MeterNumber", StringType, true),
  StructField("Utility", StringType, true),
  StructField("VendorServiceNumber", StringType, true),
  StructField("VendorName", StringType, true),
  StructField("SiteNumber", StringType, true),
  StructField("SiteName", StringType, true),
  StructField("Location", StringType, true),
  StructField("timestamp", LongType, true),
  StructField("power", DoubleType, true)
))

val DF_Hbase = records.selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema = jsonschema).as("data"))
  .select("data.event")
  .select(from_json($"event", uschema).as("mykafkadata"))
  .select("mykafkadata.*")
Now, finally, I want to save the DF_Hbase dataframe to HBase.
1- Add these libraries to your project:
"org.apache.hbase" % "hbase-client" % "2.0.1"
"org.apache.hbase" % "hbase-common" % "2.0.1"
2- Add this trait to your code:
import java.util.concurrent.ExecutorService

import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.security.User
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.ForeachWriter

trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {

  val tableName: String
  val hbaseConfResources: Seq[String]

  def pool: Option[ExecutorService] = None
  def user: Option[User] = None

  private var hTable: Table = _
  private var connection: Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = createConnection()
    hTable = getHTable(connection)
    true
  }

  def createConnection(): Connection = {
    val hbaseConfig = HBaseConfiguration.create()
    hbaseConfResources.foreach(hbaseConfig.addResource)
    ConnectionFactory.createConnection(hbaseConfig, pool.orNull, user.orNull)
  }

  def getHTable(connection: Connection): Table = {
    connection.getTable(TableName.valueOf(tableName))
  }

  override def process(record: RECORD): Unit = {
    val put = toPut(record)
    hTable.put(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    hTable.close()
    connection.close()
  }

  def toPut(record: RECORD): Put
}
3- Use it in your logic:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

val ds = .... //anyDataset[WhatEverYourDataType]

val query = ds.writeStream
  .foreach(new HBaseForeachWriter[WhatEverYourDataType] {
    override val tableName: String = "hbase-table-name"
    //your cluster config files, I assume here they are on the classpath as resources
    override val hbaseConfResources: Seq[String] = Seq("core-site.xml", "hbase-site.xml")

    override def toPut(record: WhatEverYourDataType): Put = {
      val key = .....
      val columnFamilyName: String = ....
      val columnName: String = ....
      val columnValue = ....

      val p = new Put(Bytes.toBytes(key))
      //Add columns ...
      p.addColumn(Bytes.toBytes(columnFamilyName),
        Bytes.toBytes(columnName),
        Bytes.toBytes(columnValue))

      p
    }
  })
  .start()

query.awaitTermination()
This method worked for me even using pyspark: https://github.com/hortonworks-spark/shc/issues/205
package HBase

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase._

class HBaseSink(options: Map[String, String]) extends Sink with Logging {
  // String with HBaseTableCatalog.tableCatalog
  private val hBaseCatalog = options.get("hbasecat").map(_.toString).getOrElse("")

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val df = data.sparkSession.createDataFrame(data.rdd, data.schema)
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> hBaseCatalog,
        HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}

class HBaseSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    new HBaseSink(parameters)
  }

  def shortName(): String = "hbase"
}
I added a file named HBaseSinkProvider.scala to shc/core/src/main/scala/org/apache/spark/sql/execution/datasources/hbase and built it; the example works perfectly.
This is an example of how to use it (Scala):
inputDF
  .writeStream
  .queryName("hbase writer")
  .format("HBase.HBaseSinkProvider")
  .option("checkpointLocation", checkPointProdPath)
  .option("hbasecat", catalog)
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime(30.seconds))
  .start
And an example of how I use it in Python:
inputDF \
    .writeStream \
    .outputMode("append") \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', catalog_kafka) \
    .option("checkpointLocation", '/tmp/checkpoint') \
    .start()
Are you processing the data coming from Kafka? Or just pumping it to HBase? An option to consider is using Kafka Connect. This gives you a configuration-file based approach for integrating Kafka with other systems, including HBase.

Update Dataframe Schema Read Spark Scala

I am trying to read a schema from HDFS to load into my DataFrame. This allows the schema to be updated and to reside outside the Spark Scala code. I was wondering what the best way to do this is. Below is what I currently have inside the code.
val schema_example = StructType(Array(
  StructField("EXAMPLE_1", StringType, true),
  StructField("EXAMPLE_2", StringType, true),
  StructField("EXAMPLE_3", StringType, true)
))

def main(args: Array[String]): Unit = {
  val df_example = get_df("example.txt", schema_example)
}

def get_df(filename: String, schema: StructType): DataFrame = {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "~")
    .schema(schema)
    .option("quote", "'")
    .option("quoteMode", "ALL")
    .load(filename)
  df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
}
A better approach would be to read the schema from a HOCON config file, which can be updated as and when required.
schema = [
  {
    columnName = EXAMPLE_1
    type = string
  },
  {
    columnName = EXAMPLE_2
    type = string
  },
  {
    columnName = EXAMPLE_3
    type = string
  }
]
Then you can read this file using ConfigFactory. This is a better and cleaner way to maintain the file schema.
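As a rough sketch of that idea (assuming the Typesafe Config library is on the classpath and the file above is saved as, say, schema.conf somewhere the driver can read), the config can be turned into a StructType like this:

import java.io.File

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.types.{StringType, StructField, StructType}

import scala.collection.JavaConverters._

// Hypothetical helper: build a StructType from the HOCON "schema" list shown above
def schemaFromConf(path: String): StructType = {
  val config = ConfigFactory.parseFile(new File(path))
  val fields = config.getConfigList("schema").asScala.map { field =>
    val dataType = field.getString("type") match {
      case "string" => StringType
      // other type mappings would go here
      case other    => throw new IllegalArgumentException(s"Unsupported type: $other")
    }
    StructField(field.getString("columnName"), dataType, nullable = true)
  }
  StructType(fields.toArray)
}

// e.g. val df_example = get_df("example.txt", schemaFromConf("schema.conf"))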

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on the RDD. That is now gone! How do I save the data in a DataFrame in ORC file format?
def main(args: Array[String]) {
  println("Creating Orc File!")
  val sparkConf = new SparkConf().setAppName("orcfile")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val people = sc.textFile("/apps/testdata/people.txt")
  val schemaString = "name age"
  val schema = StructType(schemaString.split(" ").map(fieldName =>
    if (fieldName == "name") StructField(fieldName, StringType, true)
    else StructField(fieldName, IntegerType, true)))
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), new Integer(p(1).trim)))

  //# Infer table schema from RDD**
  val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)
  //# Create a table from schema**
  peopleSchemaRDD.registerTempTable("people")

  val results = hiveContext.sql("SELECT * FROM people")
  results.map(t => "Name: " + t.toString).collect().foreach(println)

  // Now I want to save this Dataframe(peopleSchemaRDD) in ORC Format. How do I do that?
}
Since Spark 1.4 you can simply use DataFrameWriter and set format to orc:
peopleSchemaRDD.write.format("orc").save("people")
or
peopleSchemaRDD.write.orc("people")
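If it helps, reading the files back is symmetric; a quick note assuming the same HiveContext setup as above:

// Read the ORC files back into a DataFrame
val peopleOrc = hiveContext.read.format("orc").load("people")
peopleOrc.show()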