HBase Spark: query with a filter without loading the whole HBase table - Scala

I have to query HBase and then work with the data using Spark and Scala.
My problem is that my current solution reads ALL the data from my HBase table and only then filters it. This is not efficient because it takes too much memory. I would like to apply the filter directly on the HBase side. How can I do that?
def HbaseSparkQuery(table: String, gatewayINPUT: String, sparkContext: SparkContext): DataFrame = {
  val sqlContext = new SQLContext(sparkContext)
  import sqlContext.implicits._

  val conf = HBaseConfiguration.create()
  val tableName = table

  conf.set("hbase.zookeeper.quorum", "localhost")
  conf.set("hbase.master", "localhost:60000")
  conf.set(TableInputFormat.INPUT_TABLE, tableName)

  // This scans the entire table into Spark
  val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

  val DATAFRAME = hBaseRDD.map(x => {
    (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("rssi"))))
  }).toDF()
    .withColumnRenamed("_1", "GatewayIMEA")
    .withColumnRenamed("_2", "EventTime")
    .withColumnRenamed("_3", "ap")
    .withColumnRenamed("_4", "RSSI")
    .filter($"GatewayIMEA" === gatewayINPUT)   // filter applied only after the full scan

  DATAFRAME
}
As you can see in my code, the filter is applied after the DataFrame is created, i.e. after all of the HBase data has already been loaded.
Thank you in advance for your answers.

Here is the solution I found:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.filter._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object HbaseConnector {

  def main(args: Array[String]): Unit = {

    // System.setProperty("hadoop.home.dir", "/usr/local/hadoop")
    val sparkConf = new SparkConf().setAppName("CoverageAlgPipeline").setMaster("local[*]")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._

    val spark = org.apache.spark.sql.SparkSession.builder
      .master("local")
      .appName("Coverage Algorithm")
      .getOrCreate

    val GatewayIMEA = "123"
    val TABLE_NAME = "TABLE"

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("hbase.master", "localhost:60000")
    conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)

    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(TABLE_NAME))

    // Build a server-side filter so HBase only returns matching rows.
    // Note: by default SingleColumnValueFilter also emits rows where the column is missing;
    // call GatewayIDFilter.setFilterIfMissing(true) to exclude them.
    val scan = new Scan
    val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
    scan.setFilter(GatewayIDFilter)
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    // The filtered scan runs on the region servers; only matching rows reach Spark
    val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

    val DATAFRAME = hBaseRDD.map(x => {
      (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("Measure"))))
    }).toDF()
      .withColumnRenamed("_1", "GatewayIMEA")
      .withColumnRenamed("_2", "EventTime")
      .withColumnRenamed("_3", "ap")
      .withColumnRenamed("_4", "measure")

    DATAFRAME.show()
  }
}
What this does is: set the input table, build the filter, attach it to a Scan, pass that Scan to TableInputFormat so it is executed on the HBase side, load the result into an RDD, and then (optionally) transform the RDD into a DataFrame.
To apply multiple filters:
val timestampFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("eventTime"), CompareFilter.CompareOp.GREATER, Bytes.toBytes(String.valueOf(dateOfDayTimestamp)))
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
val filters = new FilterList(GatewayIDFilter, timestampFilter)
scan.setFilter(filters)
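By default, FilterList combines its filters with a logical AND (Operator.MUST_PASS_ALL). To keep rows that match either filter instead, the operator can be passed explicitly, for example:
val filters = new FilterList(FilterList.Operator.MUST_PASS_ONE, GatewayIDFilter, timestampFilter) // logical OR
scan.setFilter(filters)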

You can use a Spark-HBase connector with predicate pushdown, e.g. https://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
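To illustrate the predicate-pushdown idea (using the separate Hortonworks shc-core connector rather than the package linked above, so take this only as a rough sketch), the table name and column mapping below are assumptions based on the columns used in the question:
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog mapping HBase family/qualifier pairs to DataFrame columns (names taken from the question)
val catalog =
  """{
    |  "table":   {"namespace":"default", "name":"TABLE"},
    |  "rowkey":  "key",
    |  "columns": {
    |    "rowkey":      {"cf":"rowkey",  "col":"key",         "type":"string"},
    |    "GatewayIMEA": {"cf":"header",  "col":"gatewayIMEA", "type":"string"},
    |    "EventTime":   {"cf":"header",  "col":"eventTime",   "type":"string"},
    |    "ap":          {"cf":"node",    "col":"imei",        "type":"string"},
    |    "RSSI":        {"cf":"measure", "col":"rssi",        "type":"string"}
    |  }
    |}""".stripMargin

// spark is the SparkSession created in the answer above
val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// The connector can push this predicate down to HBase instead of scanning everything
df.filter(df("GatewayIMEA") === "123").show()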

Related

How to save a JSON file in ElasticSearch using Spark?

I am trying to save a JSON file to Elasticsearch but it's not working.
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._
import org.apache.spark.SparkConf

object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.registerTempTable("sensor_ptable")
    sen_p.saveToEs("sensor/metrics")
  }
}
I am also getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.elasticsearch.spark.sql.package$.sparkDataFrameFunctions(Lorg/apache/spark/sql/Dataset;)Lorg/elasticsearch/spark/sql/package$SparkDataFrameFunctions;
at learnscala.HelloEs$.main(HelloEs.scala:20)
at learnscala.HelloEs.main(HelloEs.scala)
There are multiple ways to save an RDD / DataFrame to Elasticsearch.
A Spark DataFrame can be written to Elasticsearch using:
df.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource","<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
An RDD can be written to ES using:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "<ES_RESOURCE_PATH>")
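For instance, here is a minimal self-contained sketch of the RDD route (the Metric case class, the host placeholder, and the resource path are illustrative, not taken from the question):
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

case class Metric(sensor: String, value: Double)

object WriteRddToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WriteRddToES")
      .setMaster("local[*]")
      .set("es.nodes", "<ES_HOST>")          // placeholder Elasticsearch host
      .set("es.port", "9200")
      .set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)

    // Each case class instance becomes one document in the "sensor/metrics" resource
    val rdd = sc.parallelize(Seq(Metric("s1", 1.0), Metric("s2", 2.0)))
    EsSpark.saveToEs(rdd, "sensor/metrics")
  }
}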
In your case, modify the code as below:
object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource", "<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
  }
}

Nanoseconds truncated while inserting data into Hive from Spark

While inserting data into a Hive TimestampType column from Spark, the nanoseconds are truncated. Does anyone have a solution for this? I have tried writing to both ORC and CSV format on Hive.
CSV: it appeared as 2018-03-20T13:04:20.123Z
ORC: 2018-03-20 13:04:20.123456
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructField
import java.util.Date
import org.apache.spark.sql.Row
import java.math.{BigDecimal, MathContext, RoundingMode}

/**
 * Main class to read Order, Route and Trade records and convert them to ORC File format
 * @author Shefali.Nema
 * @since 1.0.0
 */
object testDateAndDecimal {

  def main(args: Array[String]): Unit = {
    execute;
  }

  private def execute: Unit = {

    val sparkConf = new SparkConf().setAppName("Test");
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Define DataTypes
    val datetimestring: String = "2018-03-20 13:04:20.123456789"
    val dt = java.sql.Timestamp.valueOf(datetimestring)
    //val DecimalType = DataTypes.createDecimalType(18, 8)

    // Define Values
    val id = 1
    //System.out.println(new BigDecimal("135.69")); // 135.69
    val price = new BigDecimal("1234567890.1234567899")

    System.out.println("\n###################################################price###################################" + price + "\n")
    System.out.println("\n###################################################dt###################################" + dt + "\n")

    val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", TimestampType, true) :: StructField("price", DecimalType(18, 8), true) :: Nil)
    val values = List(id, dt, price)
    val row = Row.fromSeq(values)

    // Create `RDD` from `Row`
    val rdd = sc.makeRDD(List(row))
    val orcFolderName = "testDecimal"
    val hiveRowsDF = sqlContext.createDataFrame(rdd, schema)
    hiveRowsDF.write.mode(org.apache.spark.sql.SaveMode.Append).orc(orcFolderName)
  }
}

Error with spark Row.fromSeq for a text file

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._

object fixedLength {

  def main(args: Array[String]) {

    def getRow(x: String): Row = {
      val columnArray = new Array[String](4)
      columnArray(0) = x.substring(0, 3)
      columnArray(1) = x.substring(3, 13)
      columnArray(2) = x.substring(13, 18)
      columnArray(3) = x.substring(18, 22)
      Row.fromSeq(columnArray)
    }

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

    val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
    val sc = new SparkContext(conf)
    val fruits = sc.textFile("in/fruits.txt")

    val schemaString = "id,fruitName,isAvailable,unitPrice";
    val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
    val schema = StructType(fields)

    val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)

    df.show() // Error
    println("End of the program")
  }
}
I'm getting an error at the df.show() call.
My file content is:
56 apple TRUE 0.56
45 pear FALSE1.34
34 raspberry TRUE 2.43
34 plum TRUE 1.31
53 cherry TRUE 1.4
23 orange FALSE2.34
56 persimmon FALSE23.2
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
Can you please help?
You are creating the RDD the old way, with SparkContext(conf):
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
whereas you are creating the DataFrame the new way, using SparkSession:
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
Ultimately, you are mixing an RDD created with the old SparkContext with a DataFrame created through the new SparkSession.
I would suggest sticking to only one of them.
I guess that is the reason for the issue.
Update
Doing the following should work for you:
def getRow(x: String): Row = {
  val columnArray = new Array[String](4)
  columnArray(0) = x.substring(0, 3)
  columnArray(1) = x.substring(3, 13)
  columnArray(2) = x.substring(13, 18)
  columnArray(3) = x.substring(18, 22)
  Row.fromSeq(columnArray)
}

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

// Use the SparkSession's own SparkContext instead of creating a second context
val fruits = spark.sparkContext.textFile("in/fruits.txt")

val schemaString = "id,fruitName,isAvailable,unitPrice";
val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
val schema = StructType(fields)

val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it asked for an org.apache.spark.sql.Encoder. The data type I am using for the Dataset is an object of the case class Tp(s1: String, s2: String, s3: String).
All you need is to import the implicit encoders from your SparkSession instance before you create the empty Dataset: import spark.implicits._
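A minimal sketch using the Tp case class from the question (the SparkSession setup here is assumed):
import org.apache.spark.sql.{Dataset, SparkSession}

case class Tp(s1: String, s2: String, s3: String)

val spark = SparkSession.builder().master("local[*]").appName("EmptyDataset").getOrCreate()
import spark.implicits._   // brings the implicit Encoder[Tp] into scope

val emptyDs: Dataset[Tp] = spark.emptyDataset[Tp]
emptyDs.printSchema()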
See a full example below:
EmptyDataFrame
package com.examples.sparksql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object EmptyDataFrame {

  def main(args: Array[String]) {

    // Create Spark Conf
    val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")

    // Create Spark Context - sc
    val sc = new SparkContext(sparkConf)

    // Create Sql Context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Import Sql Implicit conversions
    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    // Create Schema RDD
    val schema_string = "name,id,dept"
    val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)))

    // Create Empty DataFrame
    val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)

    // Some operations on the empty DataFrame
    empty_df.show()
    println(empty_df.count())

    // You can register a table on the empty DataFrame; it's an empty table though
    empty_df.registerTempTable("empty_table")

    // Let's check it ;)
    val res = sqlContext.sql("select * from empty_table")
    res.show
  }
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()

Putting Multiple column names from a HBase Table into one SparkRDD

I have to put multiple column families from a table in HBase into one Spark RDD. I am attempting this using the following code (question edited after the first answer):
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._

object HBaseRead {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local").set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "TableName"

    // Setting up required stuff
    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))

    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }

    // Backticks are needed because "T-shirts" is not a plain Scala identifier
    case class Model(Shoes: String, Clothes: String, `T-shirts`: String)

    var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

    val transformedRDD = hBaseRDD2.map(tuple => {
      val result = tuple._2
      Model(Bytes.toString(result.getValue(Bytes.toBytes("Category"), Bytes.toBytes("Shoes"))),
        Bytes.toString(result.getValue(Bytes.toBytes("Category"), Bytes.toBytes("Clothes"))),
        Bytes.toString(result.getValue(Bytes.toBytes("Category"), Bytes.toBytes("T-shirts")))
      )
    })

    val totalcount = transformedRDD.count()
    println(totalcount)
  }
}
What I want to do is build a single RDD in which the values of the first row (and subsequent rows later on) from these column families are combined into a single array. Any help would be appreciated. Thanks.
You can do it a couple of ways: inside the RDD map you can get all the columns from the parent RDD [hBaseRDD2], transform them, and return them as another single RDD (see the sketch after the example below),
or you can create a case class and map the columns to it.
For example:
case class Model(column1: String,
                 column2: String,
                 column3: String)

var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

val transformedRDD = hBaseRDD2.map(tuple => {
  val result = tuple._2
  Model(Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("Columnname1"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf2"), Bytes.toBytes("Columnname2"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf3"), Bytes.toBytes("Columnname3")))
  )
})
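For the first approach (one Array[String] per row instead of a case class), here is a minimal sketch reusing the same hBaseRDD2 and the placeholder family/column names above:
val arrayRDD = hBaseRDD2.map { case (_, result) =>
  // Collect the wanted cells of each row into a single array
  Array(
    Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("Columnname1"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf2"), Bytes.toBytes("Columnname2"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf3"), Bytes.toBytes("Columnname3")))
  )
}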