Null pointer exception - Apache Spark Dataset left outer join - apache-spark-dataset

I am trying to learn Spark Datasets (Spark 2.0.1). The left outer join below throws a NullPointerException.
case class Employee(name: String, age: Int, departmentId: Int, salary: Double)
case class Department(id: Int, depname: String)
case class Record(name: String, age: Int, salary: Double, departmentId: Int, departmentName: String)
val employeeDataSet = sc.parallelize(Seq(Employee("Jax", 22, 5, 100000.0),Employee("Max", 22, 1, 100000.0))).toDS()
val departmentDataSet = sc.parallelize(Seq(Department(1, "Engineering"), Department(2, "Marketing"))).toDS()
val averageSalaryDataset = employeeDataSet.joinWith(departmentDataSet, $"departmentId" === $"id", "left_outer")
  .map(record => Record(record._1.name, record._1.age, record._1.salary, record._1.departmentId, record._2.depname))
averageSalaryDataset.show()
16/12/14 16:48:26 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 12)
java.lang.NullPointerException
This happens because the left outer join yields null for record._2 when there is no matching department, so record._2.depname fails.
How do I handle this? Thanks.

I solved this by using:
val averageSalaryDataset1 = employeeDataSet.joinWith(departmentDataSet, $"departmentId" === $"id", "left_outer")
  .selectExpr(
    "nvl(_1.name, ' ') as name",
    "nvl(_1.age, 0) as age",
    "nvl(_1.salary, 0.0D) as salary",
    "nvl(_1.departmentId, 0) as departmentId",
    "nvl(_2.depname, ' ') as departmentName")
  .as[Record]
averageSalaryDataset1.show()

The null can also be handled with an if..else condition:
val averageSalaryDataset = employeeDataSet.joinWith(departmentDataSet, $"departmentId" === $"id", "left_outer")
  .map(record => Record(record._1.name, record._1.age, record._1.salary, record._1.departmentId,
    if (record._2 == null) null else record._2.depname))
After the join operation, each row of the resulting Dataset is a pair of columns (_1, _2) holding the two joined records. For employees without a matching department, _2 is null, so calling record._2.depname dereferences a null value, which is why you get the exception.
val averageSalaryDataset = employeeDataSet.joinWith(departmentDataSet, $"departmentId" === $"id", "left_outer")
Dataset after the left join (output not shown).
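As a further variant (not from the original answers), the null check can also be written with Option instead of an explicit if..else; this is a minimal sketch assuming the same Employee, Department and Record case classes defined above:

val safeJoin = employeeDataSet
  .joinWith(departmentDataSet, $"departmentId" === $"id", "left_outer")
  .map { case (emp, dept) =>
    // dept is null when the employee has no matching department
    Record(emp.name, emp.age, emp.salary, emp.departmentId,
      Option(dept).map(_.depname).getOrElse(""))
  }
safeJoin.show()

Here safeJoin is just an illustrative name; the empty-string default mirrors the nvl(..., ' ') used in the selectExpr answer.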

Related

How to implement Window Function in Apache Flink?

Hi everyone,
I have a Kafka topic as a source, and I group it by a 1-minute window.
Within that window I want to create new columns with window functions as in SQL, for example:
SUM(amount) OVER(PARTITION BY
COUNT(user) OVER(PARTITION BY
ROW_NUMBER() OVER(PARTITION BY
Can I use DataStream functions for these operations? Or,
how can I operate on my Kafka data to convert it to a Table and use sqlQuery?
The destination is another Kafka topic.
val stream = senv
.addSource(new FlinkKafkaConsumer[String]("flink", new SimpleStringSchema(), properties))
I've tried to do this
val tableA = tableEnv.fromDataStream(stream, 'user, 'product, 'amount)
but I get the following error back
Exception in thread "main" org.apache.flink.table.api.ValidationException: Too many fields referenced from an atomic type.
test data
1,"beer",3
1,"beer",1
2,"beer",3
3,"diaper",4
4,"diaper",1
5,"diaper",5
6,"rubber",2
Query example
SELECT
user, product, amount,
COUNT(user) OVER(PARTITION BY product) AS count_product
FROM table;
expected output
1,"beer",3,3
1,"beer",1,3
2,"beer",3,3
3,"diaper",4,3
4,"diaper",1,3
5,"diaper",5,3
6,"rubber",2,1
You need to parse the string into fields and then rename them afterwards.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)
val stream = env.fromElements(
  "1,beer,3", "1,beer,1", "2,beer,3", "3,diaper,4", "4,diaper,1", "5,diaper,5", "6,rubber,2")
val parsed = stream.map(x => {
  val arr = x.split(",")
  (arr(0).toInt, arr(1), arr(2).toInt)
})
val tableA = tEnv.fromDataStream(parsed, $"_1" as "user", $"_2" as "product", $"_3" as "amount")
// example query
val result = tEnv.sqlQuery(s"SELECT user, product, amount from $tableA")
val rs = result.toAppendStream[(Int, String, Int)]
rs.print()
I'm not sure how we can implement the desired window function in Flink SQL. Alternatively, it can be implemented with the plain DataStream API as follows:
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

parsed.keyBy(x => x._2) // key by product
  .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
  .process(new ProcessWindowFunction[
    (Int, String, Int), (Int, String, Int, Int), String, TimeWindow]() {
    override def process(key: String, context: Context,
                         elements: Iterable[(Int, String, Int)],
                         out: Collector[(Int, String, Int, Int)]): Unit = {
      val lst = elements.toList
      // emit every element together with the per-product count in this window
      lst.foreach(x => out.collect((x._1, x._2, x._3, lst.size)))
    }
  })
  .print()
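For completeness, here is a rough sketch (not from the original answer) of what the OVER aggregation might look like in Flink SQL. It assumes a processing-time attribute called proctime is declared when converting the stream to a table, since a streaming OVER window must be ordered by a time attribute; tableB, overResult and proctime are illustrative names, and the exact syntax for declaring the proctime attribute may vary between Flink versions:

// Hypothetical sketch: declare a processing-time attribute, then use an OVER window.
val tableB = tEnv.fromDataStream(parsed,
  $"_1" as "user", $"_2" as "product", $"_3" as "amount", $"proctime".proctime())
val overResult = tEnv.sqlQuery(
  s"""
     |SELECT `user`, product, amount,
     |  COUNT(`user`) OVER (
     |    PARTITION BY product
     |    ORDER BY proctime
     |    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_product
     |FROM $tableB
     |""".stripMargin)
overResult.toAppendStream[(Int, String, Int, Long)].print()

Note that an unbounded OVER window keeps a running count per product rather than a final per-window count, so the values update as rows arrive.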

Spark DataFrame not supporting Char datatype

I am creating a Spark DataFrame from a text file, say an Employee file that contains String, Int, and Char fields.
I created a case class:
case class Emp (
Name: String,
eid: Int,
Age: Int,
Sex: Char,
Sal: Int,
City: String)
I created RDD1 using split, then created RDD2:
val textFileRDD2 = textFileRDD1.map(attributes => Emp(
attributes(0),
attributes(1).toInt,
attributes(2).toInt,
attributes(3).charAt(0),
attributes(4).toInt,
attributes(5)))
And the final one as:
finalRDD = textFileRDD2.toDF
When I create the final one, it throws this error:
java.lang.UnsupportedOperationException: No Encoder found for scala.Char
Can anyone help me understand why this happens and how to resolve it?
Spark SQL doesn't provide Encoders for Char and generic Encoders are not very useful.
You can either use a StringType:
attributes(3).slice(0, 1)
or ShortType (or BooleanType, ByteType if you accept only binary response):
attributes(3)(0) match {
  case 'F' => 1: Short
  ...
  case _ => 0: Short
}
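Putting the StringType suggestion together with the case class from the question, a minimal sketch could look like the following (Sex becomes a one-character String; finalDF is just an illustrative name):

// Store Sex as a String so Spark can derive an Encoder for the case class
case class Emp(
  Name: String,
  eid: Int,
  Age: Int,
  Sex: String, // was Char
  Sal: Int,
  City: String)

val textFileRDD2 = textFileRDD1.map(attributes => Emp(
  attributes(0),
  attributes(1).toInt,
  attributes(2).toInt,
  attributes(3).slice(0, 1), // keep only the first character, as a String
  attributes(4).toInt,
  attributes(5)))

val finalDF = textFileRDD2.toDF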

filtering dataframe in scala

Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame.
id, Type, qt, P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried it this way but it did not work. Code snapshot:
case class MyClass(id: Int, type: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "X" })}.toList
val yList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "Y" })}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") // your data here
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()

val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList

println(productList)

productList.foreach(id => {
  val sqlDF = df3.filter(df3("id") === id)
  sqlDF.show()
})
sqlDF in the loop is the DataFrame with the relevant data; you can then run your calculations on it.
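As an illustration (not part of the original answer), a per-id calculation could be added inside the loop, or the whole thing could be done with a single groupBy. The column names "Type", "qt" and "P" below are taken from the sample data and are assumptions about the actual schema:

import org.apache.spark.sql.functions.{avg, sum}

// Filter by both id and Type inside the loop, then aggregate
productList.foreach(id => {
  val xDF = df3.filter(df3("id") === id && df3("Type") === "X")
  xDF.agg(sum("qt"), avg("P")).show()
})

// Or compute the same aggregates for all ids and Types at once, without a loop
df3.groupBy("id", "Type").agg(sum("qt"), avg("P")).show()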

How to convert datatypes in Spark SQL to specific datatypes, with the RDD result mapped to a specific class

I am reading a CSV file and need to create an RDD with a specific schema.
I read the file using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I want to pick some of the fields and return an RDD of those fields.
For example: class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I want to change col1 to a double, col2 to a date, and col3 to a string.
Is there a way to do this in sqlContext.sql, or do I have to run a map function over the result and then turn it back into an RDD?
I tried to do it in one statement and I got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is that the assignment does not result in an RDD[Test], where Test is a defined class.
The error says that the map produces an Array, not an RDD:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile using the default parameters, it doesn't perform any schema inference, and your data is stored as plain strings. Let's assume there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
the function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but at this point your table shouldn't contain any Double values
since you collect (which doesn't make sense here), you fetch all data to the driver, and the result is a local Array, not an RDD
finally, even if all previous issues were resolved, (String, Date, String, Double) is clearly not a Test
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use the new Dataset API, but it is far from stable:
casted.as[Test].rdd
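As a small usage sketch (not part of the original answer), the Dataset route needs the implicits in scope for the $-column syntax and the case class Encoder; testsDS and testsRDD are illustrative names:

import sqlContext.implicits._ // brings in the $"..." syntax and Encoders for case classes

val testsDS = casted.as[Test]          // typed Dataset[Test]
val testsRDD: RDD[Test] = testsDS.rdd  // back to an RDD[Test] if you really need one
testsRDD.take(2).foreach(println)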

spark join operation based on two columns

I'm trying to join two datasets based on two columns. It works when I use one column, but fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD

def zipWithIndex[T](rdd: RDD[T]) = {
  val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect

  val ranges = partitionSizes.foldLeft(List((0, 0))) { case (accList, count) =>
    val start = accList.head._2
    val end = start + count
    (start, end) :: accList
  }.reverse.tail.toArray

  rdd.mapPartitionsWithIndex((index, partition) => {
    val start = ranges(index)._1
    val end = ranges(index)._2
    val indexes = Iterator.range(start, end)
    partition.zip(indexes)
  })
}
val dimension = sc.
  textFile("dimension.txt").
  map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1), parts(2), parts(3), parts(4), parts(5))
  }

val dimensionWithSK =
  zipWithIndex(dimension).map { case ((nk1, nk2, prop3, prop4, prop5, prop6), idx) =>
    (nk1, nk2, (prop3, prop4, prop5, prop6, idx + nextSurrogateKey)) }

val fact = sc.
  textFile("fact.txt").
  map { line =>
    val parts = line.split("\t")
    // we need to output (Naturalkey, (FactId, Amount)) in
    // order to be able to join with the dimension data.
    (parts(0), parts(1), (parts(2), parts(3), parts(4), parts(5), parts(6).toDouble))
  }
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
I'd appreciate someone's help here.
Thanks,
Sridhar
If you look at the signature of join it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map your triple to a pair, where the first element of the pair is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
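Applied to the code in the question, a minimal sketch could look like this (it reuses the fact and dimensionWithSK RDDs defined above and assumes the first two fields form the composite natural key; factByKey and dimensionByKey are illustrative names):

// Re-key both RDDs on the (nk1, nk2) pair before joining
val factByKey = fact.map {
  case (nk1, nk2, measures) => ((nk1, nk2), measures)
}
val dimensionByKey = dimensionWithSK.map {
  case (nk1, nk2, props) => ((nk1, nk2), props)
}

val finalFact = factByKey.join(dimensionByKey).map {
  // note: in dimensionWithSK the surrogate key is the LAST element of the value tuple
  case ((nk1, nk2), ((parts1, parts2, parts3, parts4, amount), (prop1, prop2, prop3, prop4, sk))) =>
    (sk, amount)
}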
If the two datasets are DataFrames rather than RDDs, you can join on several columns directly (the join(other, Seq(...), joinType) variant below is a DataFrame method, not an RDD one).
rdd1 schema:
field1, field2, field3, fieldX, ...
rdd2 schema:
field1, field2, field3, fieldY, ...
val joinResult = rdd1.join(rdd2,
  Seq("field1", "field2", "field3"), "outer")
joinResult schema:
field1, field2, field3, fieldX, fieldY, ...
val emp = sc.
  textFile("emp.txt").
  map { line =>
    val parts = line.split("\t")
    // key by the composite natural key (parts(0), parts(2)) so that the two
    // RDDs can be joined on that pair; the value is parts(1)
    ((parts(0), parts(2)), parts(1))
  }

val emp_new = sc.
  textFile("emp_new.txt").
  map { line =>
    val parts = line.split("\t")
    // same composite key and value layout as emp
    ((parts(0), parts(2)), parts(1))
  }

val finalemp =
  emp_new.join(emp).
  map { case ((nk1, nk2), (parts1, val1)) => (nk1, parts1, val1) }