Read HBase in Scala - it.nerdammer - scala

I would like to read HBase data in a Spark stream code for looking up and further enhancement of streaming data. I am using spark-hbase-connector_2.10-1.0.3.jar.
In my code the following line is successful
val docRdd =
sc.hbaseTable[(Option[String], Option[String])]("hbase_customer_profile")
.select("id","gender").inColumnFamily("data")
docRdd.count returns the right count.
docRdd is of type
HBaseReaderBuilder(org.apache.spark.SparkContext#3a49e5,hbase_customer_profile,Some(data),WrappedArray(id,
gender),None,None,List())
How can I read all the rows in id, gender columns please. Also how can I convert docRdd into a data frame so that SparkSQL can be used.

You can read all rows from the RDD using
docRdd.collect().foreach(println)
To convert the RDD to a DataFrame you could define a case class:
case class Customer(rowKey: String, id: Option[String], gender: Option[String])
I have added the row key to the case class; that's not strictly necessary, so if you don't need it, you can omit it.
Then map over the RDD:
// Row key, id, gender
type Record = (String, Option[String], Option[String])
val rdd =
sc.hbaseTable[Record]("customers")
.select("id","gender")
.inColumnFamily("data")
.map(r => Customer(r._1, r._2, r._3))
and then - based on the case class - convert the RDD to a DataFrame
import sqlContext.implicits._
val df = rdd.toDF()
df.show()
df.printSchema()
The output from spark-shell looks like this:
scala> df.show()
+---------+----+------+
| rowKey| id|gender|
+---------+----+------+
|customer1| 1| null|
|customer2|null| f|
|customer3| 3| m|
+---------+----+------+
scala> df.printSchema()
root
|-- rowKey: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)

Related

How to wrap multiple sql functions into a UDF in Spark?

I am working with Spark 2.3.2.
On one column within my Dataframe I am performing many spark.sql.functions sequentually. How can I wrap this sequence of functions into a user-defined-function (UDF) to make it reusable?
Here is my example focusing on the one column "columnName". First I am creating my test data:
val testSchema = new StructType()
.add("columnName", new StructType()
.add("2020-11", LongType)
.add("2020-12", LongType)
)
val testRow = Seq(Row(Row(1L, 2L)))
val testRDD = spark.sparkContext.parallelize(testRow)
val testDF = spark.createDataFrame(testRDD, testSchema)
testDF.printSchema()
/*
root
|-- columnName: struct (nullable = true)
| |-- 2020-11: long (nullable = true)
| |-- 2020-12: long (nullable = true)
*/
testDF.show(false)
/*
+----------+
|columnName|
+----------+
|[1, 2] |
+----------+
*/
And here is the sequence of applied Spark SQL functions (just as an example):
val testResult = testDF
.select(explode(split(regexp_replace(to_json(col("columnName")), "[\"{}]", ""), ",")).as("result"))
I am failing to create a UDF "myUDF", such that I can get the same result when calling
val testResultWithUDF = testDF.select(myUDF(col("columnName"))
This is what I "would like" to do:
def parseAndExplode(spalte: Column): Column = {
explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ",")
}
val myUDF = udf(parseAndExplode _)
testDF.withColumn("udf_result", myUDF(col("columnName"))).show(false)
but it is throwing an Exception:
Schema for type org.apache.spark.sql.Column is not supported
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
Also tried with using a Row as input parameter but then again failed trying to apply built-in SQL functions.
There is no need to use an udf here. explode, split and most other functions from org.apache.spark.sql.functions return already an object of type Column.
def parseAndExplode(spalte: Column): Column = {
explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
testDF.withColumn("udf_result",parseAndExplode('columnName)).show(false)
prints
+----------+----------+
|columnName|udf_result|
+----------+----------+
|[1, 2] |2020-11:1 |
|[1, 2] |2020-12:2 |
+----------+----------+

Spark : converting Array[Byte] data to RDD or DataFrame

I have data in the form of Array[Byte] which I want to convert into Spark RDD or DataFrame so that I can write my data directly into a Google bucket in the form of a file. I am not able to write Array[Byte] data into Google bucket directly. So looking for this conversion.
My below code is able to write data into Local FS, but not Google bucket
val encrypted = encrypt(original, readPublicKey(pubKey), outFile, true, true)
val dfis = new FileOutputStream(outFile)
dfis.write(encrypted)
dfis.close()
def encrypt(clearData: Array[Byte], encKey: PGPPublicKey, fileName: String, withIntegrityCheck: Boolean, armor: Boolean): Array[Byte] = {
...
}
So how can I convert Array[Byte] data to RDD or DataFrame? I am using Scala.
just use .toDF() or .toDF().rdd
scala> val arr: Array[Byte] = Array(192.toByte, 168.toByte, 1.toByte, 4.toByte)
arr: Array[Byte] = Array(-64, -88, 1, 4)
scala> val df = arr.toSeq.toDF()
df: org.apache.spark.sql.DataFrame = [value: tinyint]
scala> df.show()
+-----+
|value|
+-----+
| -64|
| -88|
| 1|
| 4|
+-----+
scala> df.printSchema()
root
|-- value: byte (nullable = false)

create a spark dataframe from a nested json file in scala [duplicate]

This question already has an answer here:
How to access sub-entities in JSON file?
(1 answer)
Closed 5 years ago.
I have a json file that looks like this
{
"group" : {},
"lang" : [
[ 1, "scala", "functional" ],
[ 2, "java","object" ],
[ 3, "py","interpreted" ]
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
when I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on contents of "lang" key? I do not care about group{} all I need is, pull data out of "lang" and apply case class like this
case class ProgLang (id: Int, lang: String, type: String )
I have read this post Reading JSON with Apache Spark - `corrupt_record` and understand that each record needs to be on a newline but in my case I cannot change the file structure
The json format is wrong. The the json api of sqlContext is reading it as corrupt record. Correct form is
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
and supposing you have it in a file ("/home/test.json"), then you can use following method to get the dataframe you want
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
.show(false)
You should have
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
Updated
If you don't want to change your input json format as mentioned in your comment below, you can use wholeTextFiles to read the json file and parse it as below
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val readJSON = sc.wholeTextFiles("/home/test.json")
.map(x => x._2)
.map(data => data.replaceAll("\n", ""))
val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0).cast(IntegerType))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
df2.show(false)
df2.printSchema
It should give you dataframe as above and schema as
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)
As of Spark 2.2 you can use multiLine option to deal with the case of multi-line JSONs.
scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
|-- lang: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Before Spark 2.2 see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.

How to apply a function to a column of a Spark DataFrame?

Let's assume that we have a Spark DataFrame
df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame
with the following schema
df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
| |-- element: string (containsNull = true)
Given that each row of the tk column is an array of strings, how to write a Scala function that will return the number of elements in each row?
You don't have to write a custom function because there is one:
import org.apache.spark.sql.functions.size
df.select(size($"tk"))
If you really want you can write an udf:
import org.apache.spark.sql.functions.udf
val size_ = udf((xs: Seq[String]) => xs.size)
or even create custom a expression but there is really no point in that.
One way is to access them using the sql like below.
df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()
To get size of array column,
val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()
If your Spark version is older, you can use HiveContext instead of Spark's SQL Context.
I would also try for some thing that traverses.

Convert String to Double in Scala / Spark?

I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.
I think problem here:
.asInstanceOf[Array[String]]
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType is stored in Row as a WrappedArray hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)