Spark and HBase Snapshots - scala

Under the assumption that we could access the data much faster by pulling directly from HDFS instead of going through the HBase API, we're trying to build an RDD based on a Table Snapshot from HBase.
So, I have a snapshot called "dm_test_snap". I seem to be able to get most of the configuration working, but my RDD is null (despite there being data in the snapshot itself).
I'm having a hell of a time finding an example of anyone doing offline analysis of HBase snapshots with Spark, but I can't believe I'm alone in trying to get this working. Any help or suggestions are greatly appreciated.
Here is a snippet of my code:
object TestSnap {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir = config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")

    val sc = new SparkContext(sparkConf)

    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()
    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
    conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")

    val scan = new Scan
    val job = Job.getInstance(conf)

    TableSnapshotInputFormat.setInput(job, "dm_test_snap",
      new Path("hdfs://nameservice1/tmp"))

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    hBaseRDD.count()

    System.exit(0)
  }
}
Update to include the solution
The trick was, as @Holden mentioned below, that the conf wasn't getting passed through. To remedy this, I got it working by changing the call to newAPIHadoopRDD to this:
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
There was a second issue, also highlighted by @victor's answer: I was not passing in a Scan. To fix that, I added this line and method:
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
def convertScanToString(scan: Scan) = {
  val proto = ProtobufUtil.toScan(scan)
  Base64.encodeBytes(proto.toByteArray())
}
This also let me pull out this line from the conf.set commands:
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
*NOTE: This was for HBase version 0.96.1.1 on CDH5.0
Final full code for easy reference:
object TestSnap {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir = config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")

    val sc = new SparkContext(sparkConf)

    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()
    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))

    val scan = new Scan
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))

    val job = Job.getInstance(conf)
    TableSnapshotInputFormat.setInput(job, "dm_test_snap",
      new Path("hdfs://nameservice1/tmp"))

    val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    hBaseRDD.count()

    System.exit(0)
  }

  def convertScanToString(scan: Scan) = {
    val proto = ProtobufUtil.toScan(scan)
    Base64.encodeBytes(proto.toByteArray())
  }
}

Looking at the Job documentation, it makes a copy of the conf object you supply to it ("The Job makes a copy of the Configuration so that any necessary internal modifications do not reflect on the incoming parameter."), so most likely the information you need to set on the conf object isn't getting passed down to Spark. You could instead use TableSnapshotInputFormatImpl, which has a similar method that works on conf objects directly. There might be additional things needed, but on a first pass through the problem this seems like the most likely cause.
As pointed out in the comments, another option is to use job.getConfiguration to get the updated config from the job object.
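For illustration, here is a minimal sketch of that alternative, assuming the TableSnapshotInputFormatImpl.setInput(conf, snapshotName, restoreDir) method that the HBase MapReduce package exposes (verify it exists in your HBase build before relying on it):
// Sketch only: configure the snapshot directly on the Configuration that is
// handed to Spark, so nothing is lost in the Job's internal copy.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.{TableSnapshotInputFormat, TableSnapshotInputFormatImpl}

val snapConf = HBaseConfiguration.create()
TableSnapshotInputFormatImpl.setInput(snapConf, "dm_test_snap", new Path("hdfs://nameservice1/tmp"))
val hBaseRDD = sc.newAPIHadoopRDD(snapConf, classOf[TableSnapshotInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])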

You have not configured your M/R job properly:
This is an example in Java of how to configure M/R over snapshots:
Job job = new Job(conf);
Scan scan = new Scan();
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
    scan, MyTableMapper.class, MyMapKeyOutput.class,
    MyMapOutputValueWritable.class, job, true);
You definitely skipped the Scan. I suggest taking a look at TableMapReduceUtil's initTableSnapshotMapperJob implementation to get an idea of how to configure the job in Spark/Scala.

Here is the complete configuration in MapReduce (Java):
TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName,        // name of the snapshot
    scan,                // Scan instance to control CF and attribute selection
    DefaultMapper.class, // mapper class
    NullWritable.class,  // mapper output key
    Text.class,          // mapper output value
    job,
    true,
    restoreDir);
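For completeness, a rough Scala sketch of the same configuration for use with Spark. This is only a sketch: the 8-argument initTableSnapshotMapperJob overload and IdentityTableMapper come from the HBase 0.96+ MapReduce API, so verify them against your HBase version.
// Sketch: let TableMapReduceUtil put the snapshot scan into the Job's configuration,
// then hand that configuration to Spark. Assumes sc and the HBase settings from the
// question (rootdir, zookeeper quorum) are already in place.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{IdentityTableMapper, TableMapReduceUtil, TableSnapshotInputFormat}
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance(HBaseConfiguration.create())
TableMapReduceUtil.initTableSnapshotMapperJob(
  "dm_test_snap",                       // snapshot name
  new Scan(),                           // Scan controlling CF/attribute selection
  classOf[IdentityTableMapper],         // placeholder mapper (not actually run by Spark)
  classOf[ImmutableBytesWritable],      // mapper output key
  classOf[Result],                      // mapper output value
  job,
  true,                                 // addDependencyJars
  new Path("hdfs://nameservice1/tmp"))  // restore directory
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])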

Related

Spark Listener execute hook on onJobComplete on Executors?

I have a simple Spark job which reads CSV data from S3, transforms it, partitions it by date, and saves it to the local file system.
I have a CSV file on S3 with the below content.
sample input: japan, 01-01-2020, weather, provider, device
case class WeatherReport(country: String, date: String, event: String, provide: String, device: String)

object SampleSpark extends App {
  val conf = new SparkConf()
    .setAppName("processing")
    .setIfMissing("spark.master", "local[*]")
    .setIfMissing("spark.driver.host", "localhost")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._ // needed for .toDF()

  val baseRdd = sc.textFile("s3a://mybucket/sample/*.csv")

  val weatherDataFrame = baseRdd
    .filter(_.trim.nonEmpty)
    .map { line =>
      val f = line.split(",").map(_.trim)
      WeatherReport(f(0), f(1), f(2), f(3), f(4))
    }
    .toDF()

  weatherDataFrame.write.partitionBy("date")
    .mode(SaveMode.Append)
    .format("com.databricks.spark.csv")
    .save("outputDirectory")
}
The files get saved under "outputDirectory/date=01-01-2020/part-" with more than one part file.
I want to merge the part files, remove the date= prefix so the result looks like "outputDirectory/01-01-2020/output.csv", and copy this to S3.
How is it possible to do this?
I thought of using a SparkListener like below, but I guess it'll only run on the driver, while the files would be present on the executors.
sparkContext.addListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd) {
    renameDirectory()
    mergePartFilesToSingleFiles()
    uploadFileToS3()
  }
})
Is there a way to run a post-job-completion hook on the executors and the driver that would sync all the local files on them to S3?
You can run post-execution hooks on executors by registering a TaskCompletionListener:
// call this from the code that runs on the executor, such as your WeatherReport mapper
val taskContext = TaskContext.get
taskContext.addTaskCompletionListener(customTaskCompletionListener)
Reference:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/TaskContext.html#addTaskCompletionListener-scala.Function1-
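A minimal sketch of how that registration might look inside a mapPartitions call (the helper names here are hypothetical and not part of the original answer):
// Sketch: register a completion hook from code that runs on the executor.
import org.apache.spark.TaskContext

val processed = baseRdd.mapPartitions { rows =>
  TaskContext.get.addTaskCompletionListener { ctx =>
    // runs on the executor when this task finishes, e.g. push this task's
    // local output files to S3 (syncLocalFilesToS3 is a hypothetical helper)
    syncLocalFilesToS3(ctx.partitionId)
  }
  rows.map(parseWeatherReport) // hypothetical per-record transformation
}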

loading csv file to HBase through Spark

This is a simple "how to" question:
We can bring data into the Spark environment through com.databricks.spark.csv. I do know how to create an HBase table through Spark and write data to HBase tables manually. But is it even possible to load text/CSV/JSON files directly into HBase through Spark? I cannot see anybody talking about it, so I'm just checking. If possible, please guide me to a good website that explains the Scala code in detail to get it done.
Thank you,
There are multiple ways you can do that.
Spark HBase connector:
https://github.com/hortonworks-spark/shc
You can see a lot of examples at that link.
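For illustration, a minimal write sketch in the style of the shc examples. The catalog layout, table name, and column mappings here are assumptions on my part; follow the repository's README for the exact API of your shc version.
// Sketch: define a catalog mapping DataFrame columns to the HBase row key and a
// column family, then write an existing DataFrame df through the shc data source.
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val catalog =
  s"""{
     |  "table":   {"namespace": "default", "name": "payments"},
     |  "rowkey":  "key",
     |  "columns": {
     |    "PaymentNumber": {"cf": "rowkey", "col": "key",    "type": "string"},
     |    "VendorName":    {"cf": "cf",     "col": "vendor", "type": "string"},
     |    "Amount":        {"cf": "cf",     "col": "amount", "type": "string"}
     |  }
     |}""".stripMargin

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()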
Also, you can use Spark core to load the data into HBase using HBaseConfiguration.
Code Example:
val fileRDD = sc.textFile(args(0), 2)
val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("hbase.master", "localhost:60000")
conf.set("fs.default.name", "hdfs://localhost:8020")
conf.set("hbase.rootdir", "/hbase")

val jobConf = new Configuration(conf)
jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].getName)
jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)

transformedRDD.saveAsNewAPIHadoopDataset(jobConf)

def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
  // note: "|" must be escaped because String.split takes a regex
  val fields = line.split("\\|")
  val cfDataBytes = Bytes.toBytes("cf")
  val rowkey = Bytes.toBytes(fields(1))
  val put = new Put(rowkey)

  put.add(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
  put.add(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
  put.add(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
  put.add(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
  put.add(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))

  (new ImmutableBytesWritable(rowkey), put)
}
Also, you can use this one:
https://github.com/nerdammer/spark-hbase-connector
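A minimal sketch in the style of that connector's README (the table name, columns, and data are assumptions; check the project's documentation for your version, and set spark.hbase.host in your SparkConf):
// Sketch: write an RDD of tuples to HBase with the nerdammer connector.
// The first tuple element becomes the row key, the rest map to the listed columns.
import it.nerdammer.spark.hbase._

val rdd = sc.parallelize(Seq(
  ("row1", "2020-01-01", "ACME", "10.50"),
  ("row2", "2020-01-02", "Globex", "22.00")))

rdd.toHBaseTable("payments")
  .toColumns("PaymentDate", "VendorName", "Amount")
  .inColumnFamily("cf")
  .save()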

How to filter TwitterInputDStream results by user

I'm trying to make a Scala + Spark application that listens specifically for tweets from users I define in my filters array. Currently, editing the filters array only filters on the text of the tweets it receives. I want to do the same thing, but for users.
I was originally using
val statuses = TwitterUtils.createStream(ssc, None, filter)
to get my stream; however, another post recommended using TwitterInputDStream. I just implemented that, but I don't get how to filter based on users.
Any help is MUCH appreciated, as this is something I've struggled with for quite a while.
My code thus far:
val config = new twitter4j.conf.ConfigurationBuilder()
  .setOAuthConsumerKey("2pzru6Cd8aRumYHUumoCiKwYS")
  .setOAuthConsumerSecret("rKKQfISDzk715OOKS7wvGkSKdkqGsLFOdTrZ9QdJOfAjMXekNB")
  .setOAuthAccessToken("962554652-SU0aBc8Iyukka1gFJQY9Ux5XczDXSgczRpuih3ml")
  .setOAuthAccessTokenSecret("k0DWDIGeJ6u2ijYJeIv6dfCuCmTPIZT8xPJ5AG4P9PV72")
  .build

val twitter_auth = new TwitterFactory(config)
val a = new OAuthAuthorization(config)
val atwitter = twitter_auth.getInstance(a).getAuthorization()
//val auth = new Option[Authorization] {}
val auth = Option(atwitter)

val sparkConf = new SparkConf().setAppName("Main2")
  .setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// StorageLevel has no public no-arg constructor; use one of the predefined levels
val storage = StorageLevel.MEMORY_AND_DISK_SER_2
val twitter = new TwitterInputDStream(ssc, auth, filters, storage)
val rec = twitter.getReceiver()
twitter.print()
Thanks!
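For what it's worth, a minimal sketch of one way to filter by user on the resulting DStream[Status]. This is an assumption on my part rather than something from the original post; it filters client-side with Twitter4J's Status.getUser instead of relying on TwitterInputDStream's keyword filters.
// Sketch: filter the stream by screen name (hypothetical user list).
val wantedUsers = Set("user_one", "user_two")

val statuses = TwitterUtils.createStream(ssc, auth)
val fromWantedUsers = statuses.filter { status =>
  wantedUsers.contains(status.getUser.getScreenName.toLowerCase)
}
fromWantedUsers.print()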

Spark Dataframe content can be printed out but (e.g.) not counted

Strangely, this doesn't work. Can someone explain the background? I want to understand why it doesn't accept this.
The input files are Parquet files spread across multiple folders. When I print the results, they are structured as I want them to be. But when I use dataframe.count() on the joined DataFrame, the job runs forever. Can anyone help with the details on that?
import org.apache.spark.{SparkContext, SparkConf}

object TEST {
  def main(args: Array[String]) {
    val appName = args(0)
    val threadMaster = args(1)
    val inputPathSent = args(2)
    val inputPathClicked = args(3)

    // pass spark configuration
    val conf = new SparkConf()
      .setMaster(threadMaster)
      .setAppName(appName)

    // Create a new spark context
    val sc = new SparkContext(conf)

    // Specify a SQL context and pass in the spark context we created
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Create two dataframes for sent and clicked files
    val dfSent = sqlContext.read.parquet(inputPathSent)
    val dfClicked = sqlContext.read.parquet(inputPathClicked)

    // Join them
    val dfJoin = dfSent.join(dfClicked,
      dfSent.col("customer_id") === dfClicked.col("customer_id") &&
      dfSent.col("campaign_id") === dfClicked.col("campaign_id"), "left_outer")

    dfJoin.show(20) // perfectly shows the first 20 rows
    dfJoin.count()  // Here we run into trouble and it runs forever
  }
}
Use println(dfJoin.count()).
You will be able to see the count on your screen.
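As a side note (this is my assumption, not part of the original answer), persisting the join before counting avoids recomputing it after the show:
// Sketch: cache the joined DataFrame so show() and count() reuse the same computation.
dfJoin.persist()
dfJoin.show(20)
println(dfJoin.count())
dfJoin.unpersist()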

Good practice to ensure only one Spark context in application

I am looking for a good way to ensure that my app is only using one single SparkContext (sc). While developing, I often run into errors and have to restart my Play! server to re-test my modifications.
Would a singleton pattern be the solution?
object SparkContextSingleton {
  @transient private var instance: SparkContext = _

  private val conf: SparkConf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("myApp")

  def getInstance(): SparkContext = {
    if (instance == null) {
      instance = new SparkContext(conf)
    }
    instance
  }
}
This does not do the job well. Should I stop the SparkContext?
This should be enough to do the trick; the important thing is to use val and not var.
object SparkContextKeeper {
  val conf = new SparkConf().setAppName("SparkApp")
  val context = new SparkContext(conf)
  val sqlContext = new SQLContext(context)
}
In Play you should write a plugin that exposes the SparkContext. Use the plugin's start and stop hooks to start and stop the context.
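A rough sketch of what such a plugin might look like with the old play.api.Plugin API (Play 2.3-era; the class name and the way it exposes the context are assumptions, and newer Play versions would use a module with dependency injection instead):
// Sketch: start the SparkContext with the application and stop it on shutdown.
import org.apache.spark.{SparkConf, SparkContext}
import play.api.{Application, Plugin}

class SparkPlugin(app: Application) extends Plugin {
  private var sc: Option[SparkContext] = None

  def context: SparkContext =
    sc.getOrElse(sys.error("SparkContext not started"))

  override def onStart(): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("myApp")
    sc = Some(new SparkContext(conf))
  }

  override def onStop(): Unit = {
    sc.foreach(_.stop())
    sc = None
  }
}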