Spark Dataset API - join - scala

I am trying to use the Spark Dataset API, but I am having some issues doing a simple join.
Let's say I have two datasets with the fields date | value; in the case of DataFrame, my join would look like:
val dfA : DataFrame
val dfB : DataFrame
dfA.join(dfB, dfB("date") === dfA("date") )
However for Dataset there is the .joinWith method, but the same approach does not work:
val dfA : Dataset
val dfB : Dataset
dfA.joinWith(dfB, ? )
What is the argument required by .joinWith?

To use joinWith you first have to create a Dataset, and most likely two of them. To create a Dataset, you need to create a case class that matches your schema and call DataFrame.as[T] where T is your case class. So:
case class KeyValue(key: Int, value: String)
val df = Seq((1,"asdf"),(2,"34234")).toDF("key", "value")
val ds = df.as[KeyValue]
// org.apache.spark.sql.Dataset[KeyValue] = [key: int, value: string]
You could also skip the case class and use a tuple:
val tupDs = df.as[(Int,String)]
// org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]
Then say you had another case class / DF, like this:
case class Nums(key: Int, num1: Double, num2: Long)
val df2 = Seq((1,7.7,101L),(2,1.2,10L)).toDF("key","num1","num2")
val ds2 = df2.as[Nums]
// org.apache.spark.sql.Dataset[Nums] = [key: int, num1: double, num2: bigint]
Then, while the syntax of join and joinWith is similar, the results are different:
df.join(df2, df.col("key") === df2.col("key")).show
// +---+-----+---+----+----+
// |key|value|key|num1|num2|
// +---+-----+---+----+----+
// | 1| asdf| 1| 7.7| 101|
// | 2|34234| 2| 1.2| 10|
// +---+-----+---+----+----+
ds.joinWith(ds2, df.col("key") === df2.col("key")).show
// +---------+-----------+
// | _1| _2|
// +---------+-----------+
// | [1,asdf]|[1,7.7,101]|
// |[2,34234]| [2,1.2,10]|
// +---------+-----------+
As you can see, joinWith leaves the objects intact as parts of a tuple, while join flattens out the columns into a single namespace. (Which will cause problems in the above case because the column name "key" is repeated.)
Curiously enough, I have to use df.col("key") and df2.col("key") to create the conditions for joining ds and ds2 -- if you use just col("key") on either side it does not work, and ds.col(...) doesn't exist. Using the original df.col("key") does the trick, however.
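If you then want to flatten the joinWith result back into a single type, one option is to map over the resulting tuples. A minimal sketch, using a hypothetical Combined case class that is not part of the original answer:

// Hypothetical case class holding the flattened result
case class Combined(key: Int, value: String, num1: Double, num2: Long)

ds.joinWith(ds2, df.col("key") === df2.col("key"))
  .map { case (kv, nums) => Combined(kv.key, kv.value, nums.num1, nums.num2) }
  .show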

From https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html
it looks like you could just do
dfA.as("A").joinWith(dfB.as("B"), $"A.date" === $"B.date" )

For the above example, you can try the following:
Define a case class for your output:
case class JoinOutput(key: Int, value: String, num1: Double, num2: Long)
Join the two Datasets with Seq("key"). This avoids having two duplicate key columns in the output, which also makes it possible to apply the case class (or fetch the data) in the next step:
val joined = ds.join(ds2, Seq("key")).as[JoinOutput]
// res27: org.apache.spark.sql.Dataset[JoinOutput] = [key: int, value: string ... 2 more fields]
The result will be flat instead:
joined.show
+---+-----+----+----+
|key|value|num1|num2|
+---+-----+----+----+
| 1| asdf| 7.7| 101|
| 2|34234| 1.2| 10|
+---+-----+----+----+
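Since joined is a Dataset[JoinOutput], you can also keep working with it in a typed way; for example (an illustrative extra step, not part of the original answer):

joined.map(j => (j.key, j.num1 + j.num2)).show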

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

def myFunc(row: Row): String = {
  // process row
  // returns string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row))
  inputDF
}
But no new column got created in my case. My myFunc passes the row to a knowledge-base session object, which returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many Stack Overflow solutions using expr(), sqlfunc(col(udf(x))) and other techniques, but here my newcol is not derived directly from an existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val myFunc = (r: Row) => { r.getAs[String]("col1") + "xyz" } // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show

val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }

val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
import org.apache.spark.sql.Dataset

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }
transformed.show
As you can see, the Dataset version is more readable and also gives you strong typing.
Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues that you won't foresee.
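If you would rather stay in the DataFrame API, a withColumn plus udf sketch is another option. This reuses the same "append xyz" example transformation from above; swap in your knowledge-base call as needed:

import org.apache.spark.sql.functions.{col, udf}

// Sketch only: the body stands in for whatever string your rule engine returns
val newColUdf = udf { (col1: String) => col1 + "xyz" }

testDf.withColumn("newcol", newColUdf(col("col1"))).show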

Spark scala dataframe get value for each row and assign to variables

I have a dataframe like below:
val df=spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
I want to iterate with a for loop to get values like this:
val value1="A1"
val value2="B1"
val value3="C1"
function(value1,value2,value3)
Please help me.
You have 2 options:
Solution 1: Your data is big, so you must stick with DataFrames. To apply a function to every row, define a UDF.
Solution 2: Your data is small, so you can collect it to the driver machine and iterate with a map.
Example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("a", "b", "c")

def sum(a: Int, b: Int, c: Int) = a + b + c

// Solution 1
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show

// Solution 2
df.collect.map(r => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output for Solution 1 (Solution 2 returns the same sums as a local Array(6, 15)):
+---+
|sum|
+---+
| 6|
| 15|
+---+
EDIT:
val myUDF = udf((r: Row) => {
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)
  myFunction(value1, value2, value3)
})
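To apply it, wrap the columns in struct as in Solution 1 (column names taken from the example df above; myFunction stands for your own function):

df.withColumn("result", myUDF(struct($"a", $"b", $"c"))).show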

Retrieve and format data from HBase to scala Dataframe

I am trying to get data from an HBase table into the Apache Spark environment, but I am not able to figure out how to format it. Can somebody help me?
case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
type Record = (String, Option[String], Option[String])
val hBaseRDD_iacp = sc.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("test_fam")
scala> hBaseRDD_iacp.map(x => systems(x._1,x._2,x._3)).toDF().show()
+--------------+-----------------+--------------------+
| rowkey| iacp| temp|
+--------------+-----------------+--------------------+
| ab7|0.051,0.052,0.055| 17.326,17.344,17.21|
| k6c| 0.056,NA,0.054|17.277,17.283,17.256|
| ad| NA,23.0| 24.0,23.6|
+--------------+-----------------+--------------------+
However, I actually want it in the following format: each comma-separated value goes into a new row, and each NA is replaced by a null value. The values in the iacp and temp columns should be of float type. Each row can have a varying number of comma-separated values.
Thanks in Advance!
+--------------+-----------------+--------------------+
| rowkey| iacp| temp|
+--------------+-----------------+--------------------+
| ab7| 0.051| 17.326|
| ab7| 0.052| 17.344|
| ab7| 0.055| 17.21|
| k6c| 0.056| 17.277|
| k6c| null| 17.283|
| k6c| 0.054| 17.256|
| ad| null| 24.0|
|            ad|             23.0|                23.6|
+--------------+-----------------+--------------------+
Your hBaseRDD_iacp.map(x => systems(x._1, x._2, x._3)).toDF code line should generate a DataFrame equivalent to the following:
val df = Seq(
("ab7", Some("0.051,0.052,0.055"), Some("17.326,17.344,17.21")),
("k6c", Some("0.056,NA,0.054"), Some("17.277,17.283,17.256")),
("ad", Some("NA,23.0"), Some("24.0,23.6"))
).toDF("rowkey", "iacp", "temp")
To transform the dataset into the wanted result, you can apply a UDF that pairs up elements of the iacp and temp CSV strings to produce an array of (Option[Double], Option[Double]) pairs, which is then exploded, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
def pairUpCSV = udf { (s1: String, s2: String) =>
  import scala.util.Try
  def toNumericArr(csv: String) = csv.split(",").map {
    case s if Try(s.toDouble).isSuccess => Some(s.toDouble)
    case _ => None
  }
  toNumericArr(s1).zipAll(toNumericArr(s2), None, None)
}

df.
  withColumn("csv_pairs", pairUpCSV($"iacp", $"temp")).
  withColumn("csv_pair", explode($"csv_pairs")).
  select($"rowkey", $"csv_pair._1".as("iacp"), $"csv_pair._2".as("temp")).
  show(false)
// +------+-----+------+
// |rowkey|iacp |temp |
// +------+-----+------+
// |ab7 |0.051|17.326|
// |ab7 |0.052|17.344|
// |ab7 |0.055|17.21 |
// |k6c |0.056|17.277|
// |k6c |null |17.283|
// |k6c |0.054|17.256|
// |ad |null |24.0 |
// |ad |23.0 |23.6 |
// +------+-----+------+
Note that the value NA falls into the default case in method toNumericArr and hence isn't singled out as a separate case. Also, zipAll (rather than zip) is used in the UDF to cover cases in which the iacp and temp CSV strings have different numbers of elements.
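For example, in plain Scala (illustration only):

// zip truncates to the shorter array; zipAll pads the shorter side with the given default
Array(Some(1.0), None).zip(Array(Some(2.0)))                // Array((Some(1.0), Some(2.0)))
Array(Some(1.0), None).zipAll(Array(Some(2.0)), None, None) // Array((Some(1.0), Some(2.0)), (None, None))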

Read csv file in spark of varying columns

I would like to read a CSV file into a dataframe in Spark using Scala.
My CSV file's first record has three columns, and the remaining records have 5 columns. My CSV file does not come with column names. I have mentioned them here for understanding:
Ex:
I'dtype date       recordsCount
0       13-02-2015 300

I'dtype date       type location locationCode
1       13-02-2015 R    USA      Us
1       13-02-2015 T    London   Lon
My question is how I will read this file into a dataframe, as the first row and the remaining rows have different columns.
The solution I tried is to read the file as an RDD, filter out the header record, and then convert the remaining records into a dataframe.
Is there any better solution? Please help me.
You can load the files as raw text, and then use case classes, Either instances, and pattern matching to sort out what goes where. Example of that below.
import scala.util.{Failure, Success, Try}
import org.apache.spark.rdd.RDD

case class Col3(c1: Int, c2: String, c3: Int)
case class Col5(c1: Int, c2: String, c5_col3: String, c4: String, c5: String)
case class Header(value: String)

type C3 = Either[Header, Col3]
type C5 = Either[Header, Col5]

// assume sqlC & sc created
val path = "tmp.tsv"
val rdd = sc.textFile(path)

val eitherRdd: RDD[Either[C3, C5]] = rdd.map { s =>
  val spl = s.split("\t")
  spl.length match {
    case 3 =>
      val res = Try {
        Col3(spl(0).toInt, spl(1), spl(2).toInt)
      }
      res match {
        case Success(c3) => Left(Right(c3))
        case Failure(_)  => Left(Left(Header(s)))
      }
    case 5 =>
      val res = Try {
        Col5(spl(0).toInt, spl(1), spl(2), spl(3), spl(4))
      }
      res match {
        case Success(c5) => Right(Right(c5))
        case Failure(_)  => Right(Left(Header(s)))
      }
    case _ => throw new Exception("fail")
  }
}

val rdd3 = eitherRdd.flatMap(_.left.toOption)
val rdd3Header = rdd3.flatMap(_.left.toOption).collect().head
val df3 = sqlC.createDataFrame(rdd3.flatMap(_.right.toOption))

val rdd5 = eitherRdd.flatMap(_.right.toOption)
val rdd5Header = rdd5.flatMap(_.left.toOption).collect().head
val df5 = sqlC.createDataFrame(rdd5.flatMap(_.right.toOption))

df3.show()
df5.show()
Tested with the simple TSV below:
col1 col2 col3
0 sfd 300
1 asfd 400
col1 col2 col4 col5 col6
2 pljdsfn R USA Us
3 sad T London Lon
which gives the output:
+---+----+---+
| c1| c2| c3|
+---+----+---+
| 0| sfd|300|
| 1|asfd|400|
+---+----+---+
+---+-------+-------+------+---+
| c1| c2|c5_col3| c4| c5|
+---+-------+-------+------+---+
| 2|pljdsfn| R| USA| Us|
| 3| sad| T|London|Lon|
+---+-------+-------+------+---+
For simplicity's sake, I have ignored the date formatting, simply storing those fields as Strings. However, it would not be much more complicated to add a date parser to get a proper column type.
Likewise, I have relied on parsing failure to indicate a header row. You may substitute different logic if the parsing would not fail, or if a more complicated determination must be made. Similarly, more complicated logic would be needed to differentiate between different record types of the same length, or records which may contain (escaped) split characters.
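A minimal sketch of such a date parser, assuming the dd-MM-yyyy format from the sample data (the helper name is hypothetical):

import java.sql.Date
import java.text.SimpleDateFormat

// Parses "13-02-2015"-style strings into java.sql.Date, which Spark maps to a DateType column
def parseDate(s: String): Date = {
  val fmt = new SimpleDateFormat("dd-MM-yyyy")
  new Date(fmt.parse(s.trim).getTime)
}

The Col3/Col5 case classes would then declare their date fields as Date instead of String.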
It's a bit of a hack, but here is a solution that ignores the first line of the file.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val cols = Array("dtype", "date", "type", "location", "locationCode")
val schema = new StructType(cols.map(n => StructField(n, StringType, true)))

spark.read
  .schema(schema)          // we specify the schema
  .option("header", true)  // and tell spark that there is a header
  .csv("path/file.csv")
Because we tell Spark that the first line is a header but supply the schema ourselves, the header line is skipped and its values are never used for column names. The first line is thus effectively ignored.

Case class mapping to csv

cat department
dept_id,dept_name
1,acc
2,finance
3,sales
4,marketing
Why is there a difference in the output of show() between df.show() and rdd.toDF.show()? Can someone please help?
scala> case class Department (dept_id: Int, dept_name: String)
defined class Department
scala> val dept = sc.textFile("/home/sam/Projects/department")
scala> val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
scala> mappedDpt.toDF.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 49| ,|
| 50| ,|
| 51| ,|
| 52| ,|
+-------+---------+
scala>
val dept_df = spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("mode","permissive")
.load("/home/sam/Projects/department")
scala> dept_df.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+
scala>
The problem is here
val mappedDpt = dept.map(p => Department( p(0).toInt,p(1).toString))
p here is a String, not a Row (as you may think). To be more precise, p is each line of the text file; you can confirm that by reading the scaladoc:
"returns RDD of lines of the text file".
So, when you apply the apply method (p(0)), you're accessing a character by position on the line.
That is why you end up with values like 49 and ',': the 49 comes from calling toInt on the first character (which returns its ASCII value), and the ',' is simply the second character on the line.
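A quick REPL check makes this concrete (using a hypothetical line like the ones in the file):

val p = "1,acc"
p(0)       // Char = 1
p(0).toInt // Int = 49  (the character code of '1')
p(1)       // Char = ,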
Edit
If you need to reproduce the read method you can do the following:
object Department {
  /** The Option here is to handle errors. */
  def fromRawArray(data: Array[String]): Option[Department] = data match {
    case Array(raw_dept_id, dept_name) => Some(Department(raw_dept_id.toInt, dept_name))
    case _ => None
  }
}

// We use flatMap instead of map to unwrap the values from the Option; the Nones get removed.
val mappedDpt = dept.flatMap(line => Department.fromRawArray(line.split(",")))
However, I hope this is only for learning. In production code you should always use the read version, since it will be more robust (handling missing values, doing better type casting, etc.).
For example, the above code will throw an exception if the first value can't be cast to Int.
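A minimal sketch of a safer variant, assuming you want unparseable lines (such as the header) to simply become None instead of throwing:

import scala.util.Try

object Department {
  def fromRawArray(data: Array[String]): Option[Department] = data match {
    // Try(...).toOption turns a failed toInt (e.g. on the "dept_id" header line) into None
    case Array(rawDeptId, deptName) => Try(rawDeptId.toInt).toOption.map(Department(_, deptName))
    case _ => None
  }
}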
Always use the spark.read.* variants, since they give you a dataframe and you can infer the schema as well.
Coming to your issue: in your RDD version, you have to filter out the first line and then split the lines on the comma separator; then you can map them to the case class Department.
Once you map it to Department, note that you are creating a typed dataframe, so it is a dataset. So you should use createDataset.
The below code worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object RDDSample {

  case class Department(dept_id: Int, dept_name: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Spark_processing").master("local[*]").getOrCreate()
    import spark.implicits._

    val dept = spark.sparkContext.textFile("in/department.txt")

    // skip the header line, then split each remaining line on "," and map it to the case class
    val mappedDpt = dept.filter(line => !line.contains("dept_id")).map { p =>
      val y = p.split(",")
      Department(y(0).toInt, y(1).toString)
    }

    spark.createDataset(mappedDpt).show
  }
}
Results:
+-------+---------+
|dept_id|dept_name|
+-------+---------+
| 1| acc|
| 2| finance|
| 3| sales|
| 4|marketing|
+-------+---------+