NullPointerException when using UDF in Spark - scala

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check whether the value contains a substring of the form {NUM.XXXX}, extract the XXXX number, find the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, but no error when I just show df2. I believe the error appears because I access the DataFrame df within the UDF, but I need to access it on every iteration, so I can't pass it in as an input. I've also tried saving a copy of df inside the UDF, but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!

I wrote something that works, though it is probably not very optimized. I do recursive joins on the initial DataFrame to replace the NUMs by END. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)
  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")
  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")
    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")
    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
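If the reference table is small enough to collect to the driver, a simpler alternative is also possible: build a plain lookup Map from the question's df outside the UDF and resolve the references iteratively, never touching the DataFrame from inside the UDF. This is only a minimal sketch (it assumes df fits in driver memory and that the codes contain no cycles):

import scala.annotation.tailrec
import org.apache.spark.sql.functions.{col, udf}

// Collect the small lookup table to the driver once.
val lookup: Map[Int, String] = df.collect()
  .map(r => r.getAs[Int]("CODE") -> r.getAs[String]("VALUE"))
  .toMap

val numPattern = """\{NUM\.(\d+)\}""".r

// Keep substituting {NUM.XXXX} until nothing is left to resolve.
val resolve = udf((value: String) => {
  @tailrec
  def loop(current: String): String =
    numPattern.findFirstMatchIn(current) match {
      case Some(m) =>
        lookup.get(m.group(1).toInt) match {
          case Some(replacement) => loop(current.replace(m.matched, replacement))
          case None              => current // unknown code: stop instead of looping forever
        }
      case None => current
    }
  loop(value)
})

val df2 = df.withColumn("VALUE", resolve(col("VALUE")))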
If you need more explanation, don't hesitate.

Related

Assigning elements in an array into the same DataFrame using scala and spark

I pass in an array of emojis, and I want to get their Unicode code points and store them in a DataFrame. Here is my code:
def getUnicodeOfEmoji (emojiArray : Array[String]) : DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  var result: DataFrame = null
  var df : DataFrame = null
  for (i <- 0 until emojiArray.length) {
    df = Seq(emojiArray(i)).toDF("emoji")
    df.show()
    result = df.selectExpr(
      "emoji",
      "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
    )
  }
  result.show(false)
  return result
}
Input: val emojis = "😃😜😍"
Actual output:
+-----+-------+
|emoji|result |
+-----+-------+
|😍   |U+1F60D|
+-----+-------+
But I need all 3 emojis with their Unicode code points in the DataFrame.
You don't need a for loop to construct the DataFrame (in your loop, result is reassigned on every iteration, so only the last emoji survives). You can convert the array to a Seq and use the toDF method of a Seq to construct the resulting DataFrame.
def getUnicodeOfEmoji (emojiArray : Array[String]) : DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toSeq.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toArray
val converted = getUnicodeOfEmoji(input)
+-----+-------+
|emoji|result |
+-----+-------+
|😃 |U+1F603|
|😜 |U+1F61C|
|😍 |U+1F60D|
+-----+-------+
A slight improvement is to convert your string of emojis to a Seq[String] directly before feeding into the function, e.g.
def getUnicodeOfEmoji (emojiArray : Seq[String]) : DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toSeq
val converted = getUnicodeOfEmoji(input)
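For reference, the same expression can also be written with the DataFrame API instead of selectExpr. A small sketch (assuming Spark 2.3+, where trim with an explicit trim string is available, and a DataFrame df with an emoji column built as above):

import org.apache.spark.sql.functions.{concat, encode, hex, lit, trim}

// hex(encode(...)) returns the UTF-32 code units as a hex string;
// trimming the leading zeros and prefixing "U+" gives the code point.
val result = df.select(
  $"emoji",
  concat(lit("U+"), trim(hex(encode($"emoji", "utf-32")), "0")).as("result")
)
result.show(false)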

Processing List of nested column in spark huge dataframe in scala

I want to store a list of nested JSON objects in a Spark DataFrame and also process that column. I also need operations like updating some values or deleting entries.
{
  "studentName": "abc",
  "mailId": "abc#gmail.com",
  "class": 7,
  "scoreBoard": [
    {"subject":"Math","score":90,"grade":"A"},
    {"subject":"Science","score":82,"grade":"A"},
    {"subject":"History","score":80,"grade":"A"},
    {"subject":"Hindi","score":75,"grade":"B"},
    {"subject":"English","score":80,"grade":"A"},
    {"subject":"Geography","score":80,"grade":"A"}
  ]
}
I am trying to process the scoreBoard field from the above data: find the top five subjects, delete the row with the lowest score, and also change the grade of some subjects.
case class Student(subject: String, score: Long, grade: String)
var studentTest = sc.read.json("**/testStudent.json")
val studentSchema = ArrayType(new StructType().add("subject", StringType).add("score", LongType).add("grade", StringType))
val parseStudentUDF = udf((scoreBoard: Seq[Row]) => {
  // do data processing and return the updated data, e.g.
  ListBuffer(Student(subject, score, grade), ...)
}, studentSchema)
studentTest = studentTest.withColumn("scoreBoard", parseStudentUDF(col("scoreBoard")))
I am not sure how to convert the Seq[Row] to a DataFrame inside the UDF, or how to process the Seq to sort the data and delete a row.
Is there any way to do this?
Any different approach also acceptable.
This approach uses Spark DataFrames and Spark SQL. I hope this can help you.
package tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ProcessingList {

  val spark = SparkSession
    .builder()
    .appName("ProcessingList")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "ProcessingList")    // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val input = "/home/cloudera/files/tests/list_processing.json"

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      import org.apache.spark.sql.functions._
      val studentTest = sqlContext
        .read
        .json(input)

      studentTest
        .filter(col("grade").isNotNull)
        .select(col("grade"), col("score"), col("subject"))
        .cache()
        .createOrReplaceTempView("student_test")

      sqlContext
        .sql(
          """SELECT grade, score, subject
            |FROM student_test
            |ORDER BY score DESC
            |LIMIT 5
            |""".stripMargin)
        .show()

      // To have the opportunity to view the web console of Spark: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
+-----+-----+---------+
|grade|score| subject|
+-----+-----+---------+
| A| 90| Math|
| A| 82| Science|
| A| 80| History|
| A| 80| English|
| A| 80|Geography|
+-----+-----+---------+
Regards.
Firstly, the comment by mvasyliv is good in my opinion.
To modify it, you can use:
plain Scala collection methods like .filter(); you don't need Spark for that. See the Scala collections API for how to use them.
You can also write UnaryTransformers, which transform specific columns and insert a new one. See this simple special-chars remover as an example. Notice the outputDataType method and createTransformFunc, which is based on the .map collection method.
class SpecialCharsRemover(override val uid: String)
  extends UnaryTransformer[Seq[String], Seq[String], SpecialCharsRemover] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("tokenPermutationGenerator"))

  override protected def createTransformFunc: Seq[String] => Seq[String] = (tokensWithSpecialChars: Seq[String]) => {
    tokensWithSpecialChars.map(token => {
      removeSpecialCharsImpl(token)
    })
  }

  private def removeSpecialCharsImpl(token: String): String = {
    if (token.equals("")) {
      return ""
    }
    // remove special characters
    var tempToken = token;
    tempToken = tempToken.replace(",", "")
    tempToken = tempToken.replace("'", "")
    tempToken = tempToken.replace("'", "")
    tempToken = tempToken.replace("_", "")
    tempToken = tempToken.replace("-", "")
    tempToken = tempToken.replace("!", "")
    tempToken = tempToken.replace(".", "")
    tempToken = tempToken.replace("?", "")
    tempToken = tempToken.replace(":", "")
    tempToken = tempToken.replace(")", "")
    tempToken = tempToken.replace("(", "")
    tempToken = tempToken.replace(",", "")
    tempToken = tempToken.replace("‘", "")
    tempToken = tempToken.replace("}", "")
    tempToken = tempToken.replace("{", "")
    tempToken = tempToken.replace("[", "")
    tempToken = tempToken.replace("]", "")
    tempToken = tempToken.replace("]", "")
    tempToken = tempToken.replace("®", "")
    tempToken = ThesaurusUtils.stemToken(tempToken);
    tempToken
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, false)
}
Or you can register an arbitrary function as a UDF (Java code):
ds.sparkSession().sqlContext().udf().register("THE_BOB", (UDF1<String, String>) this::getSomeBob, DataTypes.StringType);

private String getSomeBob(String text) {
    return "bob";
}
then call it with:
bobColumn = functions.callUDF("THE_BOB", bobColumn);
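Coming back to the original question about processing the Seq[Row] inside a UDF: a minimal sketch (assuming Spark 2.x, where udf accepts an explicit return schema, and the scoreBoard schema shown in the question) can keep the nested array and sort, trim and rewrite it with plain collection methods:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

val scoreBoardSchema = ArrayType(new StructType()
  .add("subject", StringType)
  .add("score", LongType)
  .add("grade", StringType))

// Sort by score descending and keep the top five subjects, which also drops the
// lowest-scoring entry; grades could be rewritten in the same map step.
val topFiveUdf = udf((scoreBoard: Seq[Row]) => {
  scoreBoard
    .map(r => (r.getAs[String]("subject"), r.getAs[Long]("score"), r.getAs[String]("grade")))
    .sortBy { case (_, score, _) => -score }
    .take(5)
}, scoreBoardSchema)

val processed = studentTest.withColumn("scoreBoard", topFiveUdf(col("scoreBoard")))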

Spark Rdd handle different fields of each row by field name

I am new to Spark and Scala, and I'm now stuck on a problem: how do I handle different fields of each row by field name, and then put the result into a new RDD?
This is my pseudo code:
val newRdd = df.rdd.map(x => {
  def Random1 => random(1, 10000)     // pseudo
  def Random2 => random(10000, 20000) // pseudo
  x.schema.map(y => {
    if (y.name == "XXX1")
      x.getAs[y.dataType](y.name)) = Random1
    else if (y.name == "XXX2")
      x.getAs[y.dataType](y.name)) = Random2
    else
      x.getAs[y.dataType](y.name)) // pseudo, keep the same
  })
})
There are at least 2 errors in the above:
in the second map, "x.getAs" is invalid syntax;
I don't know how to turn the result into a new RDD.
I have been searching the net for a long time, but with no luck. Please help, or give some ideas on how to achieve this.
Thanks Ramesh Maharjan, it works now.
def randomString(len: Int): String = {
  val rand = new scala.util.Random(System.nanoTime)
  val sb = new StringBuilder(len)
  val ab = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  for (i <- 0 until len) {
    sb.append(ab(rand.nextInt(ab.length)))
  }
  sb.toString
}

def testUdf = udf((value: String) => randomString(2))

val df = sqlContext.createDataFrame(Seq((1, "Android"), (2, "iPhone")))
df.withColumn("_2", testUdf(df("_2")))
+---+---+
| _1| _2|
+---+---+
| 1| F3|
| 2| Ag|
+---+---+
If you are intending to filter certain fields such as "XXX1" and "XXX2", then a simple select should do the trick:
df.select("XXX1", "XXX2")
and then convert that to an RDD.
If you are intending something else, then your x.getAs should look as below:
val random1 = x.getAs(y.name)
It seems that you are trying to change values in the columns "XXX1" and "XXX2".
For that a simple udf function and withColumn should do the trick
Simple udf function is as below
def testUdf = udf((value: String) => {
//do your logics here and what you return from here would be reflected in the value you passed from the column
})
And you can call the udf function as
df.withColumn("XXX1", testUdf(df("XXX1")))
Similarly, you can do the same for "XXX2".
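For the specific ranges in the question's pseudo code, the built-in rand function can stand in for a custom UDF. A minimal sketch (the column names XXX1/XXX2 are the placeholders from the question):

import org.apache.spark.sql.functions.{floor, rand}

val newDf = df
  .withColumn("XXX1", floor(rand() * 10000) + 1)      // roughly random(1, 10000)
  .withColumn("XXX2", floor(rand() * 10000) + 10000)  // roughly random(10000, 20000)

val newRdd = newDf.rdd // back to an RDD, if one is really needed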

Splitting columns in a Spark dataframe into new rows [Scala]

I have output from a spark data frame like below:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|[2014-12-26, 2013-12-12]|[232323,45466]|
43.45|19840|A345|[2010-03-16, 2013-16-12]|[34343,45454]|
My requirement is to generate output in the below format from the above output:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|2014-12-26|232323
43.45|19840|A345|2013-12-12|45466
43.45|19840|A345|2010-03-16|34343
43.45|19840|A345|2013-16-12|45454
Can somebody help me achieve this?
Is this the thing you're looking for?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkSession = ...
import sparkSession.implicits._

val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323, 45466)),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343, 45454))
)).toDF("amt", "id", "num", "start_date", "identifier")

val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
  dates.zip(identifiers)
}

val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
  .select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))

output.show()
Which returns:
+-----+-----+----+----------+----------+
| amt| id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26| 232323|
|43.45|19840|A345|2013-12-12| 45466|
|43.45|19840|A345|2010-03-16| 34343|
|43.45|19840|A345|2013-16-12| 45454|
+-----+-----+----+----------+----------+
EDIT:
Since you would like to have multiple columns that should be zipped, you should try something like this:
val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323", "45466"), Seq("123", "234")),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343", "45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")

val zipArrays = udf { seqs: Seq[Seq[String]] =>
  for (i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}

val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
  $"col".getItem(index).as(column.toString())
}

val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns: _*)

output.show()
/*
+-----+-----+----+----------+----------+--------------+
| amt| id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26| 232323| 123|
|43.45|19840|A345|2013-12-12| 45466| 234|
|43.45|19840|A345|2010-03-16| 34343| 345|
|43.45|19840|A345|2013-16-12| 45454| 456|
+-----+-----+----+----------+----------+--------------+
*/
If I understand correctly, you want the first elements of col 3 and 4.
Does this make sense?
val newDataFrame = for {
  row <- oldDataFrame
} yield {
  val zro = row(0)     // 43.45
  val one = row(1)     // 19840
  val two = row(2)     // A345
  val dates = row(3)   // [2014-12-26, 2013-12-12]
  val numbers = row(4) // [232323,45466]
  Row(zro, one, two, dates(0), numbers(0))
}
You could use SparkSQL.
First you create a view with the information we need to process:
df.createOrReplaceTempView("tableTest")
Then you can select the data with the expansions:
sparkSession.sqlContext.sql(
"SELECT Amt, id, num, expanded_start_date, expanded_id " +
"FROM tableTest " +
"LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
"LATERAL VIEW explode(Identifier) AS expanded_id")
.show()
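On Spark 2.4 or later, the same zipping can also be done without a custom UDF by combining the built-in arrays_zip with a single explode. A minimal sketch against the input DataFrame from the first answer (assuming Spark 2.4+ and the implicits import shown there):

import org.apache.spark.sql.functions.{arrays_zip, explode}

// arrays_zip pairs the arrays element-wise into an array of structs,
// so one explode yields one row per (start_date, identifier) pair.
val zipped = input
  .withColumn("zipped", explode(arrays_zip($"start_date", $"identifier")))
  .select($"amt", $"id", $"num",
    $"zipped.start_date".as("start_date"),
    $"zipped.identifier".as("identifier"))

zipped.show()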

How to count a certain character in Spark

I'd like to count the occurrences of the character 'a' in spark-shell.
I have a somewhat troublesome method: split by 'a', and "length - 1" is what I want.
Here is the code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val test_data = sqlContext.read.json("music.json")
test_data.registerTempTable("test_data")
val temp1 = sqlContext.sql("select user.id_str as userid, text from test_data")
val temp2 = temp1.map(t => (t.getAs[String]("userid"),t.getAs[String]("text").split('a').length-1))
However, someone told me this is not safe at all. I don't know why; can you give me a better way to do this?
It is not safe because the value can be NULL:
val df = Seq((1, None), (2, Some("abcda"))).toDF("id", "text")
getAs[String] will return null:
scala> df.first.getAs[String]("text") == null
res1: Boolean = true
and split will give NPE:
scala> df.first.getAs[String]("text").split("a")
java.lang.NullPointerException
...
which is most likely the situation you got in your previous question.
One simple solution:
import org.apache.spark.sql.functions._
val aCnt = coalesce(length(regexp_replace($"text", "[^a]", "")), lit(0))
df.withColumn("a_cnt", aCnt).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// | 1| null| 0|
// | 2|abcda| 2|
// +---+-----+-----+
If you want to make your code relatively safe, you should either check for null:
def countAs1(s: String) = s match {
  case null  => 0
  case chars => s.count(_ == 'a')
}
countAs1(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs1(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
or catch possible exceptions:
import scala.util.Try
def countAs2(s: String) = Try(s.count(_ == 'a')).toOption.getOrElse(0)
countAs2(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs2(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
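If you specifically want a UDF applied to the whole column (closer to your original map-based attempt), here is a null-safe sketch equivalent to the coalesce/regexp_replace version above:

import org.apache.spark.sql.functions.udf

// Option(s) turns a NULL input into None, so count is never called on null.
val countA = udf((s: String) => Option(s).map(_.count(_ == 'a')).getOrElse(0))

df.withColumn("a_cnt", countA($"text")).show
// produces the same a_cnt values (0 and 2) as the coalesce version above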