Retrieve and format data from HBase into a Scala DataFrame - scala

I am trying to get data from an HBase table into the Apache Spark environment, but I am not able to figure out how to format it. Can somebody help me?
case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
type Record = (String, Option[String], Option[String])
val hBaseRDD_iacp = sc.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("test_fam")
scala> hBaseRDD_iacp.map(x => systems(x._1,x._2,x._3)).toDF().show()
+--------------+-----------------+--------------------+
| rowkey| iacp| temp|
+--------------+-----------------+--------------------+
| ab7|0.051,0.052,0.055| 17.326,17.344,17.21|
| k6c| 0.056,NA,0.054|17.277,17.283,17.256|
| ad| NA,23.0| 24.0,23.6|
+--------------+-----------------+--------------------+
However, I actually want it in the following format: each comma-separated value is on a new row, and each NA is replaced by a null value. Values in the iacp and temp columns should be of float type. Each row can have a varying number of comma-separated values.
Thanks in Advance!
+------+-----+------+
|rowkey| iacp|  temp|
+------+-----+------+
|   ab7|0.051|17.326|
|   ab7|0.052|17.344|
|   ab7|0.055| 17.21|
|   k6c|0.056|17.277|
|   k6c| null|17.283|
|   k6c|0.054|17.256|
|    ad| null|  24.0|
|    ad| 23.0|  23.6|
+------+-----+------+

Your hBaseRDD_iacp.map(x => systems(x._1, x._2, x._3)).toDF code line should generate a DataFrame equivalent to the following:
val df = Seq(
  ("ab7", Some("0.051,0.052,0.055"), Some("17.326,17.344,17.21")),
  ("k6c", Some("0.056,NA,0.054"), Some("17.277,17.283,17.256")),
  ("ad", Some("NA,23.0"), Some("24.0,23.6"))
).toDF("rowkey", "iacp", "temp")
To transform the dataset into the wanted result, you can apply a UDF that pairs up the elements of the iacp and temp CSV strings to produce an array of (Option[Double], Option[Double]), which is then exploded, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
def pairUpCSV = udf { (s1: String, s2: String) =>
  import scala.util.Try
  // Parse each CSV element into an Option[Double]; non-numeric values such as "NA" become None.
  def toNumericArr(csv: String) = csv.split(",").map {
    case s if Try(s.toDouble).isSuccess => Some(s.toDouble)
    case _ => None
  }
  // zipAll pads the shorter side with None when the two lists differ in length.
  toNumericArr(s1).zipAll(toNumericArr(s2), None, None)
}
df.
  withColumn("csv_pairs", pairUpCSV($"iacp", $"temp")).
  withColumn("csv_pair", explode($"csv_pairs")).
  select($"rowkey", $"csv_pair._1".as("iacp"), $"csv_pair._2".as("temp")).
  show(false)
// +------+-----+------+
// |rowkey|iacp |temp |
// +------+-----+------+
// |ab7 |0.051|17.326|
// |ab7 |0.052|17.344|
// |ab7 |0.055|17.21 |
// |k6c |0.056|17.277|
// |k6c |null |17.283|
// |k6c |0.054|17.256|
// |ad |null |24.0 |
// |ad |23.0 |23.6 |
// +------+-----+------+
Note that the value NA falls into the default case in method toNumericArr, hence it isn't singled out as a separate case. Also, zipAll (rather than zip) is used in the UDF to cover cases in which the iacp and temp CSV strings have different numbers of elements.
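As a side note, on Spark 2.4+ a similar result can be obtained without a UDF by splitting the CSV strings and zipping the resulting arrays with the built-in arrays_zip function, which, like zipAll, pads the shorter array with nulls. A minimal sketch, not part of the original answer; it relies on the fact that casting "NA" to double yields null:
import org.apache.spark.sql.functions._

df.
  withColumn("iacp_arr", split($"iacp", ",")).
  withColumn("temp_arr", split($"temp", ",")).
  withColumn("pair", explode(arrays_zip($"iacp_arr", $"temp_arr"))).
  select(
    $"rowkey",
    $"pair.iacp_arr".cast("double").as("iacp"),  // "NA" becomes null under the cast
    $"pair.temp_arr".cast("double").as("temp")
  ).
  show(false)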

Related

How to merge/join Spark/Scala RDD to List so each value in RDD gets a new row with each List item

Let's say I have a List[String] and I want to merge it with an RDD so that each object in the RDD gets each value in the List added to it:
val myBands: List[String] = List("Band1", "Band2")
Table: BandMembers
|name | instrument |
| ----- | ---------- |
| slash | guitar |
| axl | vocals |
case class BandMembers(name: String, instrument: String)
var myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument))
//join the myRDD to myBands
// how do I do this?
//var result = myRdd.join/merge/union(myBands);
Desired result:
|name | instrument | band |
| ----- | ---------- |------|
| slash | guitar | band1|
| slash | guitar | band2|
| axl | vocals | band1|
| axl | vocals | band2|
I'm not quite sure how to go about this in the best way for Spark/Scala. I know I can convert to a DataFrame and then use Spark SQL to do the joins, but there has to be a better way with the RDD and the List, or so I think.
The style is a bit off here, but assuming you really need RDDs instead of Datasets, here are both approaches.
So with RDD:
import org.apache.spark.sql.Row

case class BandMembers(name: String, instrument: String)
val myRDD = spark.sparkContext.parallelize(BandMembersTable.map(a => BandMembers(a.name, a.instrument)))
val myBands = spark.sparkContext.parallelize(Seq("Band1", "Band2"))
// cartesian pairs every band member with every band
val res = myRDD.cartesian(myBands).map { case (a, b) => Row(a.name, a.instrument, b) }
With Dataset:
import spark.implicits._

case class BandMembers(name: String, instrument: String)
val myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument)).toDS
val myBands = Seq("Band1", "Band2").toDS
val res = myRDD.crossJoin(myBands)
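If you want the extra column to be called band, as in the desired result, one small adjustment (a sketch) is to rename it before the cross join:
val res = myRDD.crossJoin(myBands.toDF("band"))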
Input data:
val BandMembersTable = Seq(BandMembers("a", "b"), BandMembers("c", "d"))
val myBands = Seq("Band1","Band2")
Output with Dataset:
+----+----------+-----+
|name|instrument|value|
+----+----------+-----+
|a |b |Band1|
|a |b |Band2|
|c |d |Band1|
|c |d |Band2|
+----+----------+-----+
println output with the RDD version (these are Rows):
[a,b,Band1]
[c,d,Band2]
[c,d,Band1]
[a,b,Band2]
Consider using RDD zip for this. From the official docs:
RDD<scala.Tuple2<T,U>> zip(RDD other, scala.reflect.ClassTag evidence$11)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
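Note that zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition, and it pairs elements positionally, so it does not produce the every-member-with-every-band combination shown above. A minimal sketch of what zip actually gives, using two hypothetical single-partition RDDs:
val members = spark.sparkContext.parallelize(Seq("slash", "axl"), numSlices = 1)
val bands = spark.sparkContext.parallelize(Seq("Band1", "Band2"), numSlices = 1)
// Pairs by position: (slash,Band1) and (axl,Band2), not every member with every band
members.zip(bands).collect.foreach(println)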

Spark creating a new column based on a mapped value of an existing column

I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example, I have a dataframe dfOriginal like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax: BigDecimal, binWidth: BigDecimal) {
  val binMin = binMax - binWidth

  // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
  def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax

  def rangeAsString(): String = {
    val sb = new StringBuilder()
    sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
    sb.toString()
  }
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with a UDF.
Here's the code with the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value: String, binList: List[Bin]) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of a UDF makes me think it should handle the conversion fine, but it clearly does not. Any ideas?
The problem is that all parameters passed to a UDF must be of Column type. One solution would be to convert binList into a Column and pass it to the UDF, similar to the current code.
However, it is simpler to adjust the UDF slightly by turning it into a def that takes the list and returns the UDF. That way you can easily pass in other non-column data:
def convertValueToBinRangeUDF(binList: List[Bin]) = udf((value: String) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
Usage:
val dfBin = valuesCounts.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"columnName"))
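Applied to the dfOriginal from the question, the call would look roughly like this (a sketch; the cast to string is only needed if high_scores is numeric):
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"high_scores".cast("string")))
dfBin.groupBy("bin_range").count().show()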
Try this -
scala> case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
| val binMin = binMax - binWidth
|
| // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
| def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
|
| def rangeAsString(): String = {
| val sb = new StringBuilder()
| sb.append(binMin).append(" - ").append(binMax)
| sb.toString()
| }
| }
defined class Bin
scala> val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
binList: List[Bin] = List(Bin(10,10), Bin(20,10), Bin(30,10), Bin(40,10), Bin(50,10))
scala> spark.udf.register("convertValueToBinRangeUDF", (value: String) => {
| val number = BigDecimal(value)
| val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
| bin.rangeAsString()
| })
res13: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
//-- Testing with one record
scala> val dfOriginal = spark.sql(s""" select "9" as `high_scores`, "1" as count """)
dfOriginal: org.apache.spark.sql.DataFrame = [high_scores: string, count: string]
scala> dfOriginal.createOrReplaceTempView("dfOriginal")
scala> val dfBin = spark.sql(s""" select high_scores, count, convertValueToBinRangeUDF(high_scores) as bin_range from dfOriginal """)
dfBin: org.apache.spark.sql.DataFrame = [high_scores: string, count: string ... 1 more field]
scala> dfBin.show(false)
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
|9 |1 |0 - 10 |
+-----------+-----+---------+
Hope this will help.

Case class mapping to csv

cat department
dept_id,dept_name
1,acc
2,finance
3,sales
4,marketing
Why is there a difference in the output of show() between df.show() and rdd.toDF.show()? Can someone please help?
scala> case class Department (dept_id: Int, dept_name: String)
defined class Department
scala> val dept = sc.textFile("/home/sam/Projects/department")
scala> val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
scala> mappedDpt.toDF.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 49| ,|
| 50| ,|
| 51| ,|
| 52| ,|
+-------+---------+
scala>
val dept_df = spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("mode","permissive")
.load("/home/sam/Projects/department")
scala> dept_df.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+
scala>
The problem is here
val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
p here is a String, not a Row (as you may think). To be more precise, p is each line of the text file; you can confirm that by reading the scaladoc:
"returns RDD of lines of the text file".
So, when you call the apply method (p(0)), you're accessing a character by its position on the line.
That is why you end up with 49 and ',': 49 comes from calling toInt on the first character (which returns its ASCII value), and ',' is simply the second character of the line.
Edit
If you need to reproduce the read method you can do the following:
object Department {
  /** The Option here is to handle errors. */
  def fromRawArray(data: Array[String]): Option[Department] = data match {
    case Array(raw_dept_id, dept_name) => Some(Department(raw_dept_id.toInt, dept_name))
    case _ => None
  }
}
// We use flatMap instead of map to unwrap the values from the Option; the Nones get removed.
val mappedDpt = dept.flatMap(line => Department.fromRawArray(line.split(",")))
However, I hope this is only for learning. In production code you should always use the read version, since it is more robust (handling missing values, doing better type casts, etc.).
For example, the above code will throw an exception if the first value can't be cast to Int, which includes the header line.
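If you also want the header line ("dept_id,dept_name") and any other malformed line to be dropped rather than throw, a minimal variation of the same idea using scala.util.Try (a sketch, not part of the original answer):
import scala.util.Try

object Department {
  /** Returns None for the header row or any line whose id is not an Int. */
  def fromRawArray(data: Array[String]): Option[Department] = data match {
    case Array(raw_dept_id, dept_name) => Try(raw_dept_id.toInt).toOption.map(Department(_, dept_name))
    case _ => None
  }
}

val mappedDpt = dept.flatMap(line => Department.fromRawArray(line.split(",")))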
Always use the spark.read.* variants, since they give you the DataFrame and can infer the schema as well.
Coming to your issue: in your RDD version, you have to filter out the header line and then split each line on the comma separator before mapping it to the case class Department.
Once you map it to Department, note that you are creating a typed collection, i.e. a Dataset, so you should use createDataset.
The below code worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object RDDSample {

  case class Department(dept_id: Int, dept_name: String)

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Spark_processing").master("local[*]").getOrCreate()
    import spark.implicits._

    val dept = spark.sparkContext.textFile("in/department.txt")
    // Drop the header line, then split each remaining line and map it to the case class.
    val mappedDpt = dept.filter(line => !line.contains("dept_id")).map { p =>
      val y = p.split(",")
      Department(y(0).toInt, y(1).toString)
    }
    spark.createDataset(mappedDpt).show
  }
}
Results:
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+

Spark (scala) - Iterate over DF column and count number of matches from a set of items

So I can now iterate over a column of strings in a dataframe and check whether any of the strings contain any items in a large dictionary (see here, thanks to #raphael-roth and #tzach-zohar). The basic udf (not including broadcasting the dict list) for that is:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The next thing I am trying to do is also COUNT the number of matches that occur from the dict set, in the most efficient way possible (I'm dealing with very large datasets and dict files).
I have been trying to use findAllMatchIn in the udf, using both count and map:
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s)) }
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s)) }
But the count version gives a type mismatch (found Iterator, required Boolean), and the map version returns a list of iterators (empty and non-empty). I am not sure how to count the non-empty iterators (count, size and length don't work).
Any idea what i'm doing wrong? Is there a better / more efficient way to achieve what I am trying to do?
You can just slightly change the answer from your other question, as follows:
import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
Given the dataframe as
+---+---------+
|id |words |
+---+---------+
|1 |foo |
|2 |barriofoo|
|3 |gitten |
|4 |baa |
+---+---------+
and the dict set as
val dict = Set("foo","bar","baaad")
You should have output as
+---+---------+----------+
| id| words|word_check|
+---+---------+----------+
| 1| foo| 1|
| 2|barriofoo| 2|
| 3| gitten| 0|
| 4| baa| 0|
+---+---------+----------+
I hope the answer is helpful
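If what you actually need is the total number of occurrences (rather than how many dictionary entries appear at least once), a sketch along the lines of your findAllMatchIn attempt is shown below; it assumes the dict entries contain no regex metacharacters (escape them with java.util.regex.Pattern.quote otherwise):
import org.apache.spark.sql.functions._

val totalMatchesUdf = udf { (s: String) =>
  // toSeq avoids Set deduplication of equal counts before summing
  dict.toSeq.map(_.r.findAllMatchIn(s).size).sum
}
df.withColumn("total_matches", totalMatchesUdf($"words")).show()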

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join.
Let's say I have two datasets with the fields date | value; in the case of DataFrames my join would look like:
val dfA : DataFrame
val dfB : DataFrame
dfA.join(dfB, dfB("date") === dfA("date") )
However for Dataset there is the .joinWith method, but the same approach does not work:
val dfA : Dataset
val dfB : Dataset
dfA.joinWith(dfB, ? )
What is the argument required by .joinWith ?
To use joinWith you first have to create a Dataset, and most likely two of them. To create a Dataset, you need to create a case class that matches your schema and call DataFrame.as[T] where T is your case class. So:
case class KeyValue(key: Int, value: String)
val df = Seq((1,"asdf"),(2,"34234")).toDF("key", "value")
val ds = df.as[KeyValue]
// org.apache.spark.sql.Dataset[KeyValue] = [key: int, value: string]
You could also skip the case class and use a tuple:
val tupDs = df.as[(Int,String)]
// org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]
Then if you had another case class / DF, like this say:
case class Nums(key: Int, num1: Double, num2: Long)
val df2 = Seq((1,7.7,101L),(2,1.2,10L)).toDF("key","num1","num2")
val ds2 = df2.as[Nums]
// org.apache.spark.sql.Dataset[Nums] = [key: int, num1: double, num2: bigint]
Then, while join and joinWith have similar syntax, the results are different:
df.join(df2, df.col("key") === df2.col("key")).show
// +---+-----+---+----+----+
// |key|value|key|num1|num2|
// +---+-----+---+----+----+
// | 1| asdf| 1| 7.7| 101|
// | 2|34234| 2| 1.2| 10|
// +---+-----+---+----+----+
ds.joinWith(ds2, df.col("key") === df2.col("key")).show
// +---------+-----------+
// | _1| _2|
// +---------+-----------+
// | [1,asdf]|[1,7.7,101]|
// |[2,34234]| [2,1.2,10]|
// +---------+-----------+
As you can see, joinWith leaves the objects intact as parts of a tuple, while join flattens out the columns into a single namespace. (Which will cause problems in the above case because the column name "key" is repeated.)
Curiously enough, I have to use df.col("key") and df2.col("key") to create the conditions for joining ds and ds2 -- if you use just col("key") on either side it does not work, and ds.col(...) doesn't exist. Using the original df.col("key") does the trick, however.
From https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html
it looks like you could just do
dfA.as("A").joinWith(dfB.as("B"), $"A.date" === $"B.date" )
For the above example, you can try the below:
Define a case class for your output
case class JoinOutput(key:Int, value:String, num1:Double, num2:Long)
Join the two Datasets with Seq("key"); this helps you avoid two duplicate key columns in the output, which also makes it easier to apply the case class or fetch the data in the next step:
val joined = ds.join(ds2, Seq("key")).as[JoinOutput]
// res27: org.apache.spark.sql.Dataset[JoinOutput] = [key: int, value: string ... 2 more fields]
The result will be flat instead:
joined.show
+---+-----+----+----+
|key|value|num1|num2|
+---+-----+----+----+
| 1| asdf| 7.7| 101|
| 2|34234| 1.2| 10|
+---+-----+----+----+
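Alternatively, if you want to stay with joinWith and still end up with a flat Dataset[JoinOutput], one option (a sketch) is to map the resulting tuple into the case class yourself:
val joinedWith = ds.as("a").
  joinWith(ds2.as("b"), $"a.key" === $"b.key").
  map { case (kv, n) => JoinOutput(kv.key, kv.value, n.num1, n.num2) }
joinedWith.show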