Putting Multiple column names from a HBase Table into one SparkRDD - scala

I have to put multiple column families from a table in HBase into one sparkRDD. I am attempting this using the following code: (question edited after first aanswer)
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._
object HBaseRead {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local").set("spark.driver.allowMultipleContexts","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "TableName"
////setting up required stuff
System.setProperty("user.name", "hdfs")
System.setProperty("HADOOP_USER_NAME", "hdfs")
conf.set("hbase.master", "localhost:60000")
conf.setInt("timeout", 120000)
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
case class Model(Shoes: String,Clothes: String,T-shirts: String)
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
val result = tuple._2
Model(Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Shoes"))),
Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Clothes"))),
Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("T-shirts")))
)
})
val totalcount = transformedRDD.count()
println(totalcount)
}
}
What I want to do is to make a single rdd wherein values of first row (and subsequent rows later on) from these column families would be combined in a single array in the rdd. Any help would be appreciated. Thanks

You can do it couple of ways, inside rdd map you can get all the columns from the parent rdd[hBaseRDD2] and transform it and return it as another single rdd.
or you can create a case class and map it to that columns.
For example:
case class Model(column1: String,
column1: String,
column1: String)
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
val result = tuple._2
Model(Bytes.toString(result.getValue(Bytes.toBytes("cf1"),Bytes.toBytes("Columnname1"))),
Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2"))),
Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2")))
)
})

Related

Convert scala map to dataframe

I am trying to stream twitter data using Apache Spark and I want to save it as csv file into HDFS. I understand that I have to convert it to a dataframe but I am not able to do so.
Here is my full code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
//import com.google.gson.Gson
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
//import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//import org.apache.spark.sql.functions._
import sentimentAnalysis.sentimentScore
case class twitterCaseClass (userID: String = "", user: String = "", createdAt: String = "", text: String = "", sentimentType: String = "")
object twitterStream {
//private val gson = new Gson()
def main(args: Array[String]) {
//Twitter API
Logger.getLogger("org").setLevel(Level.ERROR)
System.setProperty("twitter4j.oauth.consumerKey", "#######")
System.setProperty("twitter4j.oauth.consumerSecret", "#######")
System.setProperty("twitter4j.oauth.accessToken", "#######")
System.setProperty("twitter4j.oauth.accessTokenSecret", "#######")
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
import spark.implicits._
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
englishTweets.print()
val tweets = englishTweets.map{ col => {
(
"userID" -> col.getId,
"user" -> col.getUser.getScreenName,
"createdAt" -> col.getCreatedAt.toInstant.toString,
"text" -> col.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
"sentimentType" -> sentimentScore(col.getText).toString
)
}
}
//val tweets = englishTweets.map(gson.toJson(_))
//tweets.saveAsTextFiles("hdfs://localhost:9000/usr/sparkApp/test/")
streamContext.start()
streamContext.awaitTermination()
}
}
I am not sure where did I possibly went wrong. There is another way to go about which is using case class. Is there a good example I can follow?
Update
The result of the Map function which is save into HDFS is like this:
((userID,1345940003533312000),(user,rei_yang),(createdAt,2021-01-04T03:47:57Z),(text,just posted a photo singapore),(sentimentType,NEUTRAL))
Is there a way to code it to a dataframe?

HBASE SPARK Query with filter without load all the hbase

I have to query HBASE and then work with the data with spark and scala.
My problem is that with my solution, i take ALL the data of my HBASE table and then i filter, it's not an efficient way because it takes too much memory. So i would like to do the filter directly, how can i do that ?
def HbaseSparkQuery(table: String, gatewayINPUT: String, sparkContext: SparkContext): DataFrame = {
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val conf = HBaseConfiguration.create()
val tableName = table
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("hbase.master", "localhost:60000")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val DATAFRAME = hBaseRDD.map(x => {
(Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("rssi"))))
}).toDF()
.withColumnRenamed("_1", "GatewayIMEA")
.withColumnRenamed("_2", "EventTime")
.withColumnRenamed("_3", "ap")
.withColumnRenamed("_4", "RSSI")
.filter($"GatewayIMEA" === gatewayINPUT)
DATAFRAME
}
As you can see in my code, I do the filter after the creation of the dataframe, after the loading of Hbase data ..
Thank you in advance for your answers
Here is the solution I found
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.filter._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
object HbaseConnector {
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "/usr/local/hadoop")
val sparkConf = new SparkConf().setAppName("CoverageAlgPipeline").setMaster("local[*]")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Coverage Algorithm")
.getOrCreate
val GatewayIMEA = "123"
val TABLE_NAME = "TABLE"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("hbase.master", "localhost:60000")
conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf(TABLE_NAME))
val scan = new Scan
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
scan.setFilter(GatewayIDFilter)
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val DATAFRAME = hBaseRDD.map(x => {
(Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("Measure"))))
}).toDF()
.withColumnRenamed("_1", "GatewayIMEA")
.withColumnRenamed("_2", "EventTime")
.withColumnRenamed("_3", "ap")
.withColumnRenamed("_4", "measure")
DATAFRAME.show()
}
}
What is done is to set your input table, set your filter, do the scan with the filter and get the scan to a RDD, and then transform the RDD to a dataframe (optional)
To do multiple filters :
val timestampFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("eventTime"), CompareFilter.CompareOp.GREATER, Bytes.toBytes(String.valueOf(dateOfDayTimestamp)))
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
val filters = new FilterList(GatewayIDFilter, timestampFilter)
scan.setFilter(filters)
You can use a spark-hbase connector with predicate pushdown. e.g.https://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

nanoseconds truncated while inserting data into hive from Spark

While inserting data into Hive TimestampType from spark, nanoseconds are truncated. does anyone has any solution towards it? I have tried writing to orc and csv format on hive.
CSV: it appeared as 2018-03-20T13:04:20.123Z
ORC: 2018-03-20 13:04:20.123456
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructField
import java.util.Date
import org.apache.spark.sql.Row
import java.math.{BigDecimal,MathContext,RoundingMode}
/**
* Main class to read Order, Route and Trade records and convert them to ORC File format
* #author Shefali.Nema
* #since 1.0.0
*/
object testDateAndDecimal {
def main(args: Array[String]): Unit = {
execute;
}
private def execute: Unit = {
val sparkConf = new SparkConf().setAppName("Test");
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Define DataTypes
val datetimestring: String = "2018-03-20 13:04:20.123456789"
val dt = java.sql.Timestamp.valueOf(datetimestring)
//val DecimalType = DataTypes.createDecimalType(18, 8)
//Define Values
val id = 1
//System.out.println(new BigDecimal("135.69")); // 135.69
val price = new BigDecimal("1234567890.1234567899")
System.out.println("\n###################################################price###################################" + price + "\n")
System.out.println("\n###################################################dt###################################" + dt + "\n")
val schema = StructType(StructField("id",IntegerType,true) :: StructField("name",TimestampType,true) :: StructField("price",DecimalType(18,8),true) :: Nil)
val values = List(id,dt,price)
val row = Row.fromSeq(values)
// Create `RDD` from `Row`
val rdd = sc.makeRDD(List(row))
val orcFolderName = "testDecimal"
val hiveRowsDF = sqlContext.createDataFrame(rdd, schema)
hiveRowsDF.write.mode(org.apache.spark.sql.SaveMode.Append).orc(orcFolderName)
}
}

How to declare an empty dataset in Spark?

I am new in Spark and Spark dataset. I was trying to declare an empty dataset using emptyDataset but it was asking for org.apache.spark.sql.Encoder. The data type I am using for the dataset is an object of case class Tp(s1: String, s2: String, s3: String).
All you need is to import implicit encoders from SparkSession instance before you create empty Dataset: import spark.implicits._
See full example here
EmptyDataFrame
package com.examples.sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object EmptyDataFrame {
def main(args: Array[String]){
//Create Spark Conf
val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")
//Create Spark Context - sc
val sc = new SparkContext(sparkConf)
//Create Sql Context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Import Sql Implicit conversions
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
//Create Schema RDD
val schema_string = "name,id,dept"
val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)) )
//Create Empty DataFrame
val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
//Some Operations on Empty Data Frame
empty_df.show()
println(empty_df.count())
//You can register a Table on Empty DataFrame, it's empty table though
empty_df.registerTempTable("empty_table")
//let's check it ;)
val res = sqlContext.sql("select * from empty_table")
res.show
}
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()

Scala - Groupby and Max on pair RDD

I am new in spark scala and want to find the max salary in each department
Dept,Salary
Dept1,1000
Dept2,2000
Dept1,2500
Dept2,1500
Dept1,1700
Dept2,2800
I implemented below code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object MaxSalary {
val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
case class Dept(dept_name : String, Salary : Int)
val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
val a = recs.max()???????
})
}
but stuck how to implement group by and max function. I am using pair RDD.
Thanks
This can be done using RDDs with the following code:
val emp = sc.textFile("file:///home/user/Documents/dept.txt")
.mapPartitionsWithIndex( (idx, row) => if(idx==0) row.drop(1) else row )
.map(x => (x.split(",")(0).toString, x.split(",")(1).toInt))
val maxSal = emp.reduceByKey(math.max(_,_))
Should give you:
Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
If you are using Dataset here is the solution
case class Dept(dept_name : String, Salary : Int)
val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._
val file = "resources/ip.csv"
val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt )).toDS()
recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()
Output:
+---------+------------+
|dept_name|max_solution|
+---------+------------+
| Dept2| 2800|
| Dept1| 2500|
+---------+------------+