Dump data to HBase table from Spark dataframe - scala

I want to write data from a DataFrame into an HBase table using Spark Scala code. I tried using HBaseTableCatalog.
I added the following dependencies:
shc-core-1.1.0.3.1.5.6-1.jar
HBase libraries (hbase-client.jar, hbase-common.jar, hbase-protocol.jar, hbase-server.jar, hbase-spark.jar, hbase-shaded*.jar, htrace-core, hbase-mapreduce.jar, hadoop-mapreduce-client-core-{version}.jar)
Below is the code:
case class HBaseRecord(col0: String, col1: String, col2: String)
val catalog = s"""{
"table":{"namespace":"default", "name":"shcExampleTable", "tableCoder":"PrimitiveType"},
"rowkey":"key",
"columns":{
"col0":{"cf":"rowkey", "col":"key", "type":"string"},
"col1":{"cf":"cf1", "col":"col1", "type":"string"},
"col2":{"cf":"cf2", "col":"col2", "type":"string"}
}
}""".stripMargin
val AFINN = sc.textFile("hdfs://sandbox-hdp.hortonworks.com:8020/Input/AFINN1.txt").map(x=> x.split("\t")).map(x => HBaseRecord(x(0).toString,x(1).toString,x(2).toString))
val AFINNDF = AFINN.toDF("col0","col1","col2")
AFINNDF.createOrReplaceTempView("rating")
val DF = AFINNDF.select($"col0",$"col1",$"col2")
DF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4")).format("org.apache.spark.sql.execution.datasources.hbase").save()
It throws the error below:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableOutputFormat
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:230)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:65)
I have already added hadoop-mapreduce*.jar, but it still throws the error.
Which library provides TableOutputFormat?
Thanks,
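A note on the missing class: as far as I know, in HBase 2.x TableOutputFormat ships in the hbase-mapreduce module (in HBase 1.x it was part of hbase-server), so the hadoop-mapreduce jars do not provide it. Since hbase-mapreduce.jar is already in the list above, one possibility is that the jar is not actually on the driver or executor classpath at runtime. The following is a minimal sketch, not a definitive diagnosis; the object and method names are hypothetical. It checks whether a class is visible to the running JVM and, where possible, reports the location it was loaded from. It could be pasted into spark-shell.

```scala
import scala.util.Try

// Hypothetical helper (not part of Spark or HBase) for classpath debugging.
object ClasspathCheck {
  // True if the class can be loaded by the current JVM.
  def isVisible(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  // Location (jar or directory) the class was loaded from, if known.
  // Returns None when the class is missing or has no code source (e.g. JDK classes).
  def locate(className: String): Option[String] =
    Try(Class.forName(className)).toOption.flatMap { cls =>
      for {
        source   <- Option(cls.getProtectionDomain.getCodeSource)
        location <- Option(source.getLocation)
      } yield location.toString
    }

  def main(args: Array[String]): Unit = {
    val name = "org.apache.hadoop.hbase.mapreduce.TableOutputFormat"
    println(s"$name visible: ${isVisible(name)}, loaded from: ${locate(name)}")
  }
}
```

If the class is reported as not visible, passing the jar explicitly (for example with the --jars option of spark-submit or spark-shell) would be the next thing to try.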

Related

Scala exception in thread "main" java.lang.NoSuchMethodError

I am new to Scala programming and am using IntelliJ IDE. I am getting the below exception when I run my Scala sample code. Not sure if I am missing any dependency.
Sample code
package com.assessments.example
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
object Example extends App {
//Create a spark context, using a local master so Spark runs on the local machine
val spark = SparkSession.builder().master("local[*]").appName("ScoringModel").getOrCreate()
//importing spark implicits allows functions such as dataframe.as[T]
import spark.implicits._
//Set logger level to Warn
Logger.getRootLogger.setLevel(Level.WARN)
case class CustomerData(
customerId: String,
forename: String,
surname: String
)
case class FullName(
firstName: String,
surname: String
)
case class CustomerModel(
customerId: String,
forename: String,
surname: String,
fullname: FullName
)
val customerData = spark.read.option("header","true").csv("src/main/resources/customer_data.csv").as[CustomerData]
val customerModel = customerData
.map(
customer =>
CustomerModel(
customerId = customer.customerId,
forename = customer.forename,
surname = customer.surname,
fullname = FullName(
firstName = customer.forename,
surname = customer.surname))
)
customerModel.show(truncate = false)
customerModel.write.mode("overwrite").parquet("src/main/resources/customerModel.parquet")
}
Exception message:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.mutable.Buffer$.empty()Lscala/collection/GenTraversable;
at org.apache.spark.sql.SparkSessionExtensions.<init>(SparkSessionExtensions.scala:103)
at org.apache.spark.sql.SparkSession$Builder.<init>(SparkSession.scala:793)
at org.apache.spark.sql.SparkSession$.builder(SparkSession.scala:984)
at com.assessments.example.Example$.delayedEndpoint$com$assessments$example$Example$1(Example.scala:10)
at com.assessments.example.Example$delayedInit$body.apply(Example.scala:6)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:76)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at com.assessments.example.Example$.main(Example.scala:6)
at com.assessments.example.Example.main(Example.scala)
I am using Spark version 3.1.2 and Scala version 2.12.10. From what I checked, this Scala version should be supported by this Spark version.
I would appreciate any guidance on how to resolve this. Thanks.
To solve your problem, I would not use a case class as the schema; I would work with Spark DataFrames instead. Define your schema as follows:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val dataSchema = StructType(Array(
StructField("customerId", StringType, true),
StructField("forename", StringType, true),
StructField("surname", StringType, true)
))
// Load data (spark is the SparkSession from the question)
val rawDf = spark.read.format("csv")
.option("delimiter", ",") //edit accordingly
.option("escape", "\"") //edit accordingly
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(dataSchema)
.load("src/main/resources") // Will read all the csv in directory
rawDf.show()
Once you can see your data, move on to the transformations. Create a struct or map column as suggested here:
PySpark - Combine DF columns into named StructType
That answer is in PySpark, but the idea is the same in Scala: use the Spark SQL functions.
Once you have done this, simply write to Parquet.
finalDf.write.mode("overwrite").parquet("src/main/resources/customerModel")
Note the output path: it contains no file name. Spark will write the data out into the customerModel directory.
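Separately from the schema question, the stack trace in the question mentions scala.collection.IterableOnceOps, which exists only in the Scala 2.13 standard library, while the missing method returns GenTraversable, a 2.12-era type. That suggests a 2.13 scala-library on the runtime classpath while the Spark 3.1.2 artifacts were built for 2.12. The sketch below is one way to check this from inside the application; the object and method names are hypothetical, and "2.12" is the binary version assumed for the Spark artifacts.

```scala
import scala.util.Properties

// Hypothetical helper for comparing the runtime Scala version with the
// binary version the Spark artifacts are assumed to be built for.
object ScalaVersionCheck {
  // Reduce a full version such as "2.12.10" to its binary version "2.12".
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  // True when the runtime scala-library has the expected binary version.
  def matches(expectedBinary: String,
              runtimeVersion: String = Properties.versionNumberString): Boolean =
    binaryVersion(runtimeVersion) == expectedBinary

  def main(args: Array[String]): Unit = {
    println(s"Runtime Scala version: ${Properties.versionNumberString}")
    println(s"Matches 2.12: ${matches("2.12")}")
  }
}
```

If the runtime version turns out to be 2.13.x, the Scala SDK configured for the IntelliJ module (or the scalaVersion in the build file) is the first place to look.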

Scala Spark job to upload CSV file data to mongo DB fails due to CodecConfigurationException

I'm new to both Spark and Scala. I'm trying to upload a CSV file to MongoDB using a Spark job in Scala.
On upload, the job fails with the following error:
org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class .
The path to the input file is passed at execution time.
I have been stuck on this issue for the past 2 days. Any help to overcome it is appreciated.
Thanks.
I have tried the same approach for uploading to Elasticsearch and it worked like a charm.
import org.apache.spark.sql.Row
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.{SaveMode, SparkSession}
import com.test.Config
object MongoUpload {
val host = <host>
val user = <user>
val pwd = <password>
val database = <db>
val collection = <collection>
val uri = s"mongodb://${user}:${pwd}@${host}/"
val NOW = java.time.LocalDate.now.toString
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("Mongo-Test-Upload")
.config("spark.mongodb.output.uri", uri)
.getOrCreate()
spark
.read
.format("csv")
.option("header", "true")
.load(args(0))
.rdd
.map(toEligibility)
.saveToMongoDB(
WriteConfig(
Map(
"uri" -> uri,
"database" -> database,
"collection" -> collection
)
)
)
}
def toEligibility(row: Row): Eligibility =
Eligibility(
row.getAs[String]("DATE_OF_BIRTH"),
row.getAs[String]("GENDER"),
row.getAs[String]("INDIVIDUAL_ID"),
row.getAs[String]("PRODUCT_NAME"),
row.getAs[String]("STATE_CODE"),
row.getAs[String]("ZIPCODE"),
NOW
)
}
case class Eligibility (
dateOfBirth: String,
gender: String,
recordId: String,
ProductIdentifier: String,
stateCode: String,
zipCode: String,
updateDate: String
)
The Spark job fails with the following error:
Caused by: org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class Eligibility
You can either map each record to a Document of the desired format, or convert the RDD to a Dataset and then save it, e.g.:
import spark.implicits._
spark
.read
.format("csv")
.option("header", "true")
.load(args(0))
.rdd
.map(toEligibility)
.toDS()
.write
.format("com.mongodb.spark.sql.DefaultSource")
.options(Map("uri" -> uri, "database" -> database, "collection" -> collection))
.save()

Spark: HBase Bulk Load using Scala

We have text files of 100K records each, and we need to read each file line by line and insert its values into HBase.
The file is '|' delimited.
Sample text file:
SLNO|Name|City|Pincode
1|ABC|Pune|400104
2|BMN|Delhi|100065
Each column will have a different column family.
We are trying to implement this in Spark with Scala using HBase bulk load.
We came across this link suggesting bulk load:
http://www.openkb.info/2015/01/how-to-use-scala-on-spark-to-load-data.html
It uses the syntax below for inserting into a single column family.
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(conf)
job.setMapOutputKeyClass (classOf[ImmutableBytesWritable])
job.setMapOutputValueClass (classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad (job, table)
// Generate 10 sample data:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x=>{
val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(),
"c1".getBytes(), "value_xxx".getBytes() )
(new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
// Directly bulk load to Hbase/MapRDB tables.
rdd.saveAsNewAPIHadoopFile("/tmp/xxxx19", classOf[ImmutableBytesWritable],
classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
Can anyone advise on bulk load insertion for multiple column families?
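Have a look at rdd.saveAsNewAPIHadoopDataset to insert the data into the HBase table.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
def main(args: Array[String]): Unit = {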
val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
import spark.implicits._
val config = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", "ip's")
config.set("hbase.zookeeper.property.clientPort","2181")
config.set(TableInputFormat.INPUT_TABLE, "tableName")
val newAPIJobConfiguration1 = Job.getInstance(config)
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")
val hbasePuts= df.rdd.map((row: Row) => {
val put = new Put(Bytes.toBytes(row.getString(0)))
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
(new ImmutableBytesWritable(), put)
})
hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
Ref : https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
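For the '|' delimited file in the question, the step before building the Put objects is turning each line into a row key plus one cell per column family. The following is a minimal plain-Scala sketch of that step only. It assumes the first column (SLNO) is the row key and that each remaining column gets its own family named "cf_<column>"; the family naming and all object and method names are hypothetical.

```scala
// Hypothetical helper: parses '|' delimited lines into row keys and cells.
object PipeDelimitedRows {
  // One HBase cell: column family, qualifier and value.
  final case class Cell(family: String, qualifier: String, value: String)

  // Assumed naming scheme: every column lives in its own family "cf_<column>".
  def familyFor(column: String): String = s"cf_${column.toLowerCase}"

  // split takes a regex, so '|' is escaped; -1 keeps trailing empty fields.
  def fields(line: String): Array[String] = line.split("\\|", -1)

  // First line is the header; the first column of each data line is the row key.
  def toRows(lines: Seq[String]): Seq[(String, Seq[Cell])] = lines match {
    case header +: data =>
      val columns = fields(header)
      data.filter(_.nonEmpty).map { line =>
        val values = fields(line)
        val cells = columns.zip(values).drop(1).map {
          case (column, value) => Cell(familyFor(column), column, value)
        }
        (values.head, cells.toSeq)
      }
    case _ => Seq.empty
  }
}
```

Each Cell would then map to one put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value)) call, in the same way as the two addColumn calls in the answer above. The column families have to exist in the table beforehand.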

Insert Spark dataframe into hbase

I have a DataFrame and I want to insert it into HBase. I am following this documentation.
This is what my DataFrame looks like:
+----+-------+---------+
| id | name  | address |
+----+-------+---------+
| 23 | marry | france  |
| 87 | zied  | italie  |
+----+-------+---------+
I create an HBase table using this code:
val tableName = "two"
val conf = HBaseConfiguration.create()
if(!admin.isTableAvailable(tableName)) {
print("-----------------------------------------------------------------------------------------------------------")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
admin.createTable(tableDesc)
}else{
print("Table already exists!!--------------------------------------------------------------------------------------")
}
Now, how can I insert this DataFrame into HBase?
In another example I succeeded in inserting into HBase using this code:
val myTable = new HTable(conf, tableName)
for (i <- 0 to 1000) {
var p = new Put(Bytes.toBytes(""+i))
p.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(""+(i*5)))
p.add("z1".getBytes(), "age".getBytes(), Bytes.toBytes("2017-04-20"))
p.add("z2".getBytes(), "job".getBytes(), Bytes.toBytes(""+i))
p.add("z2".getBytes(), "salary".getBytes(), Bytes.toBytes(""+i))
myTable.put(p)
}
myTable.flushCommits()
But now I am stuck on how to insert each record of my DataFrame into my HBase table.
Thank you for your time and attention.
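An alternative is to use rdd.saveAsNewAPIHadoopDataset to insert the data into the HBase table.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
def main(args: Array[String]): Unit = {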
val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
import spark.implicits._
val config = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", "ip's")
config.set("hbase.zookeeper.property.clientPort","2181")
config.set(TableInputFormat.INPUT_TABLE, "tableName")
val newAPIJobConfiguration1 = Job.getInstance(config)
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")
val hbasePuts= df.rdd.map((row: Row) => {
val put = new Put(Bytes.toBytes(row.getString(0)))
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
(new ImmutableBytesWritable(), put)
})
hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
Ref : https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
Below is a full example using the Spark HBase connector from Hortonworks, which is available in Maven.
This example shows how to:
check whether the HBase table exists
create the HBase table if it does not exist
insert a DataFrame into the HBase table
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
object Main extends App {
case class Employee(key: String, fName: String, lName: String, mName: String,
addressLine: String, city: String, state: String, zipCode: String)
// the table 'employee' with column families 'person' and 'address' is created below if it does not exist yet
val tableNameString = "default:employee"
val colFamilyPString = "person"
val colFamilyAString = "address"
val tableName = TableName.valueOf(tableNameString)
val colFamilyP = colFamilyPString.getBytes
val colFamilyA = colFamilyAString.getBytes
val hBaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hBaseConf)
val admin = connection.getAdmin()
println("Check if table 'employee' exists:")
val tableExistsCheck: Boolean = admin.tableExists(tableName)
println(s"Table $tableName exists? $tableExistsCheck")
if (!tableExistsCheck) {
println("Create Table employee with column families 'person' and 'address'")
val colFamilyBuild1 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyP).build()
val colFamilyBuild2 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyA).build()
val tableDescriptorBuild = TableDescriptorBuilder.newBuilder(tableName)
.setColumnFamily(colFamilyBuild1)
.setColumnFamily(colFamilyBuild2)
.build()
admin.createTable(tableDescriptorBuild)
}
// define schema for the dataframe that should be loaded into HBase
def catalog =
s"""{
|"table":{"namespace":"default","name":"employee"},
|"rowkey":"key",
|"columns":{
|"key":{"cf":"rowkey","col":"key","type":"string"},
|"fName":{"cf":"person","col":"firstName","type":"string"},
|"lName":{"cf":"person","col":"lastName","type":"string"},
|"mName":{"cf":"person","col":"middleName","type":"string"},
|"addressLine":{"cf":"address","col":"addressLine","type":"string"},
|"city":{"cf":"address","col":"city","type":"string"},
|"state":{"cf":"address","col":"state","type":"string"},
|"zipCode":{"cf":"address","col":"zipCode","type":"string"}
|}
|}""".stripMargin
// define some test data
val data = Seq(
Employee("1","Horst","Hans","A","12main","NYC","NY","123"),
Employee("2","Joe","Bill","B","1337ave","LA","CA","456"),
Employee("3","Mohammed","Mohammed","C","1Apple","SanFran","CA","678")
)
// create SparkSession
val spark: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("HBaseConnector")
.getOrCreate()
// serialize data
import spark.implicits._
val df = spark.sparkContext.parallelize(data).toDF
// write dataframe into HBase
df.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "3")) // create 3 regions
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
}
This worked for me once the relevant site XML files (core-site.xml, hbase-site.xml, hdfs-site.xml) were available in my resources directory.
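The catalog in the example above maps each DataFrame column to a column family and qualifier, with the row key using the reserved family name "rowkey". For wider tables it can be less error-prone to generate that JSON from a list of mappings. The following is a minimal sketch of such a generator; CatalogBuilder and ColumnMapping are hypothetical names, not part of the connector, and the sketch assumes the names contain no characters that need JSON escaping.

```scala
// Hypothetical helper that renders a catalog string in the layout used above.
object CatalogBuilder {
  // Maps one DataFrame column to an HBase column family and qualifier.
  final case class ColumnMapping(dfColumn: String,
                                 family: String,
                                 qualifier: String,
                                 dataType: String = "string")

  def build(namespace: String,
            table: String,
            rowKey: String,
            columns: Seq[ColumnMapping]): String = {
    // The row key column is always mapped to the reserved family "rowkey".
    val keyEntry = s""""$rowKey":{"cf":"rowkey","col":"$rowKey","type":"string"}"""
    val entries = columns.map { c =>
      s""""${c.dfColumn}":{"cf":"${c.family}","col":"${c.qualifier}","type":"${c.dataType}"}"""
    }
    Seq(
      "{",
      s""""table":{"namespace":"$namespace","name":"$table"},""",
      s""""rowkey":"$rowKey",""",
      """"columns":{""",
      (keyEntry +: entries).mkString(",\n"),
      "}",
      "}"
    ).mkString("\n")
  }
}
```

The resulting string would be passed as the value for HBaseTableCatalog.tableCatalog, in place of the hand-written catalog.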
Posting this as an answer for code formatting purposes.
The documentation says:
sc.parallelize(data).toDF.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
.format("org.apache.hadoop.hbase.spark")
.save()
where sc.parallelize(data).toDF is your DataFrame. The documentation example turns a Scala collection into a DataFrame using sc.parallelize(data).toDF.
You already have your DataFrame, so just call:
yourDataFrame.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
.format("org.apache.hadoop.hbase.spark")
.save()
And it should work. The documentation is pretty clear.
Update:
Given a DataFrame with specified schema, above will create an HBase
table with 5 regions and save the DataFrame inside. Note that if
HBaseTableCatalog.newTable is not specified, the table has to be
pre-created.
The number is about data partitioning. Each HBase table can have 1...X regions, and you should pick the number of regions carefully: too few regions limits parallelism across region servers, while too many adds per-region overhead.

Spark kryo encoder ArrayIndexOutOfBoundsException

I'm trying to create a Dataset with some geo data using Spark and Esri. If Foo only has a Point field, it works, but if I add other fields besides the Point, I get an ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
object Main {
case class Foo(position: Point, name: String)
object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
}
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
}
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
at Main$.main(Main.scala:24)
at Main.main(Main.scala)
Spark creates encoders for complex data types based on Spark SQL data types, whereas a Kryo encoder stores the whole object as a single binary value. So check the schema that the Kryo encoder creates:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema) // StructType(StructField(value,BinaryType,true))
val numCols = enc.schema.fieldNames.length // 1
So the Dataset has a single column, in binary format. It is strange that Spark attempts to show the Dataset with more than one column (which is where the error occurs). To fix this, upgrade Spark to version 2.0.0.
With Spark 2.0.0 you will still have a problem with the column data types. Writing the schema manually may work if you can write a StructType for the Esri Point class:
val schema = StructType(
Seq(
StructField("point", StructType(...), true),
StructField("name", StringType, true)
)
)
val rdd = sc.parallelize(Seq(Row(new Point(0,0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS