decode vs string.getBytes - scala

I am trying to decode a string using the built-in Spark function. It doesn't work; the string is still unreadable.
df
  .withColumn("col", decode($"col", "ISO-8859-1"))
When I try doing the same using a UDF, it works.
val decodeUdf: UserDefinedFunction = udf((name: String) => decode(name))

def decode(name: String): String = {
  val s = name.getBytes("ISO-8859-1")
  new String(s)
}
df
  .withColumn("col", decodeUdf($"col"))
What's the difference?

Dynamic Query on Scala Spark

I'm basically trying to do something like this, but Spark doesn't recognize it.
val colsToLower: Array[String] = Array("col0", "col1", "col2")
val selectQry: String = colsToLower.map((x: String) => s"""lower(col(\"${x}\")).as(\"${x}\"), """).mkString.dropRight(2)
df
  .select(selectQry)
  .show(5)
Is there a way to do something like this in Spark/Scala?
If you need to lowercase the names of your columns, there is a simple way of doing it. Here is one example:
// note: df must be declared as a var for the reassignment below to compile
df.columns.foreach(c => {
  val newColumnName = c.toLowerCase
  df = df.withColumnRenamed(c, newColumnName)
})
This will lowercase the column names and update them in the Spark dataframe.
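A variant of the same renaming logic that avoids reassigning df (so it can stay a val) is to fold over the column names; a sketch:
val lowered = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase)
}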
I believe I found a way to build it:
def lowerTextColumns(cols: Array[String])(df: DataFrame): DataFrame = {
  val remainingCols: String = (df.columns diff cols).mkString(", ")
  val lowerCols: String = cols.map((x: String) => s"""lower(${x}) as ${x}, """).mkString.dropRight(2)

  val selectQry: String =
    if (remainingCols.nonEmpty) lowerCols + ", " + remainingCols
    else lowerCols

  df.selectExpr(selectQry.split(","): _*)
}
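If the goal is to lowercase the values of selected columns (as in the original question) rather than the column names, a sketch that builds Column expressions directly instead of an SQL string, assuming colsToLower holds the target column names, would be:
import org.apache.spark.sql.functions.{col, lower}

// wrap the selected columns in lower(), pass the rest through unchanged
val exprs = df.columns.map { c =>
  if (colsToLower.contains(c)) lower(col(c)).as(c) else col(c)
}
df.select(exprs: _*).show(5)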

spark scala - UDF usage for creating new column

I need to create a new column called hash_id from the uid column of my dataframe. Below is my code:
// 1. Define a hashing function
def calculate_hashid(uid: String): BigInteger = {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  val ha = new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
  return ha
}
// 2. Convert the function to a UDF
val calculate_hashidUDF = udf(calculate_hashid)

// 3. Apply the UDF on the Spark dataframe
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))
I am getting an error at udf(calculate_hashid) saying
missing arguments for the method calculate_hashid(string)
I have gone through many examples online but could not resolve it. What am I missing here?
You can register your UDF as
val calculate_hashidUDF = udf[BigInteger, String](calculate_hashid _)
(the return type comes first in the type parameters, and the trailing underscore turns the method into a function value).
You can also rewrite your UDF as
def calculate_hashidUDF = udf(((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
}): String => BigInteger)
Or even without the explicit return type:
def calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})
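For completeness, the snippets above assume these imports (DatatypeConverter comes from the JAXB API, which ships with Java 8 but is a separate dependency on Java 11+):
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf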

Why does a SparkSQL UDF return a dataframe with column names in the format UDF("Original Column Name")?

So the dataframe I get after running the following code is exactly how I want it to be. It is the same dataframe as the original but all cells with purely numeric data have had all brackets and slashes removed (brackets are replaced with a minus sign at the front).
stringModifierIterator takes in a dataframe and returns a List[Column]. That list can then be used in a call like dataframe.select(listOfColumns: _*) to create a new dataframe.
Unfortunately, the column names have been altered to something like UDF("Original Column Name") and I can't figure out why.
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  if (dataFrameColumns.isEmpty) {
    Nil
  } else {
    uDF(dataFrame(dataFrameColumns.head)) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
  }
}
val stringModifierFunction: (String => String) = { s: String => Option(s).map(modifier).getOrElse("0") }

def modifier(inputString: String): String = {
  ???
}
When I use df.show(), every column header appears as UDF(Original Column Name).
You can solve this by explicitly naming the columns you create with the UDF in stringModifierIterator using Column.as:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  if (dataFrameColumns.isEmpty) {
    Nil
  } else {
    val col = dataFrameColumns.head
    uDF(dataFrame(col)).as(col) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
  }
}
BTW, this method can be much shorter and simpler without recursion:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] = {
  dataFrameColumns.toList.map(col => uDF(dataFrame(col)).as(col))
}
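Hypothetical usage (df stands in for the original DataFrame, and the UDF is built from the stringModifierFunction shown in the question):
import org.apache.spark.sql.functions.udf

val myUdf = udf(stringModifierFunction)
val cleaned = df.select(stringModifierIterator(df, df.columns, myUdf): _*)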

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, give the matching data:
200,300 (from input line 1)
300 (from input line 2)
123 (from input line 4)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map { r => parseLine(r) }
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong.
The problem is solved. It is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")

case class pSchema(id: Int, pName: String)

// toDF() needs the SparkSession implicits in scope (import spark.implicits._)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt, p(1).trim())).toDF()
pDF.select("id", "pName").show()
Define the UDF:
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simple answer, why are we making it so complex? In this case we don't require a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")

// note: split('|') takes a Char; the String overload split("|") would treat "|" as a regex
rawrdd.map(_.split('|'))
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
  .foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063), whose description explains:
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
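In practice, the fix for the original code is to keep the reference values as a plain driver-side (or broadcast) collection instead of creating RDDs inside map, along the lines of the earlier answer; a sketch, reusing sc and filePathName from the question:
val matchSet = sc.broadcast("123,200,300".split(",").toSet)
val matches = sc.textFile(filePathName)
  .map(line => line.split(",").toSet.intersect(matchSet.value).mkString(","))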

How to convert Spark's TableRDD to RDD[Array[Double]] in Scala?

I am trying to perform a Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))

val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0))
}
But this gives me an RDD[Unit], which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted it to an RDD[Array[Double]]:
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))
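As a side note, the earlier attempt produced RDD[Unit] because the last statement inside the map block is a val definition, so the function returns Unit; returning the parsed array directly gives the desired type. A sketch, reusing tmp_2 from above:
val tmp_3 = tmp_2.map { row =>
  // row is the duration string, e.g. "296.98567000000003"
  Array[Double](row.toDouble)
}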