Decoding Base64 in Spark Scala

I have created the following DataFrame:
val data = spark.sparkContext.parallelize(Seq(("SnVsZXMgTmV3b25l"), ("Jason Kidd"), ("TXIgUm9uYWxkIE0=")))
val df_data = data.toDF()
val decoded_got = df_data.withColumn("xxx", unbase64(col("value")).cast("String"))
And I get the following:
+----------------+------------+
|name |xxx |
+----------------+------------+
|SnVsZXMgTmV3b25l|Jules Newone|
|Jason Kidd |%�(��� |
|TXIgUm9uYWxkIE0=|Mr Ronald M |
+----------------+------------+
What I want to do is leave unchanged the values of the name column that are not in base64. For example, to get the following DF:
+----------------+------------+
|name |xxx |
+----------------+------------+
|SnVsZXMgTmV3b25l|Jules Newone|
|Jason Kidd |Jason Kidd |
|TXIgUm9uYWxkIE0=|Mr Ronald M |
+----------------+------------+
I am trying something like this, but it is not working for me:
val regex1 = """^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"""
val check = df_data.withColumn("xxx", when(regex1 matches col("value"), unbase64(col("value"))).otherwise(col("value")))
Is there an option in Spark Scala to check whether a value is base64-encoded, or how else could I do this?

To check whether a value is a valid base64-encoded string or not, you can decode it and then encode it again; you should get back the initial value. If you don't, then it's not a base64 string:
val decoded_got = df_data.withColumn(
  "xxx",
  when(
    base64(unbase64(col("value"))) === col("value"),
    unbase64(col("value")).cast("string")
  ).otherwise(col("value"))
)
decoded_got.show
//+----------------+------------+
//| value| xxx|
//+----------------+------------+
//|SnVsZXMgTmV3b25l|Jules Newone|
//| Jason Kidd| Jason Kidd|
//|TXIgUm9uYWxkIE0=| Mr Ronald M|
//+----------------+------------+
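As an aside, the regex attempt from the question can also be made to work: a Column has no matches method, but rlike accepts a regex pattern string. A rough sketch, reusing df_data and regex1 from the question:
val regex1 = """^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"""
val checked = df_data.withColumn(
  "xxx",
  when(col("value").rlike(regex1), unbase64(col("value")).cast("string"))
    .otherwise(col("value"))
)
Keep in mind that this regex also classifies many ordinary strings (for example any 4-letter word without spaces) as base64, so the encode/decode round trip above is usually the more reliable check.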

Related

Why "withColumn" transformation on spark dataframe is not checking records from an external list?

I am using Spark and Scala for learning purposes. I came across a situation where I need to check the validity of records present in one of the columns of a Spark dataframe.
This is how I created one dataframe, "dataframe1":
import sparkSession.implicits._
val dataframe1 = Seq("AB","BC","CD","DA","AB","BC").toDF("col1")
dataframe1:
+----+
|col1|
+----+
| AB|
| BC|
| CD|
| DA|
| AB|
| BC|
+----+
A record is valid if it is "AB" or "BC". Here is my first attempt:
val dataframe2 = dataframe1.withColumn("col2", when('col1.contains("AB") or 'col1.contains("BC"), "valid").otherwise("invalid"))
dataframe2:
+----+-------+
|col1| col2|
+----+-------+
| AB| valid|
| BC| valid|
| CD|invalid|
| DA|invalid|
| AB| valid|
| BC| valid|
+----+-------+
But I don't think this is a good way of doing it, because if I need to add more valid records I have to add more conditions to the "when" clause, which increases the code length and hurts readability.
So I tried to put all the valid records in one list and check whether the record string is present in the list: if it is present then it is a valid record, otherwise not. Here is the code snippet for this attempt:
val validRecList = Seq("AB", "BC").toList
val dataframe3 = dataframe1.withColumn("col2", if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid"))
But somehow it is not working as expected, as the result of this is:
+----+-------+
|col1| col2|
+----+-------+
| AB|invalid|
| BC|invalid|
| CD|invalid|
| DA|invalid|
| AB|invalid|
| BC|invalid|
+----+-------+
Can anybody tell me what mistake I am making here? Any other general suggestions for such a scenario are also welcome.
Thank you.
Try this:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataframe1 = Seq("AB","BC","CD","DA","AB","BC", "XX").toDF("col1").as[(String)]
val validRecList = List("AB", "BC")
val dataframe2 = dataframe1.withColumn("col2", when($"col1".isin(validRecList: _*), lit("valid")).otherwise (lit("invalid")))
dataframe2.show(false)
returns:
+----+-------+
|col1|col2 |
+----+-------+
|AB |valid |
|BC |valid |
|CD |invalid|
|DA |invalid|
|AB |valid |
|BC |valid |
|XX |invalid|
+----+-------+
The dataframe3 code is not working because, if we look at the documentation for the "withColumn" function on Dataset (https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset),
we see that withColumn takes a "String" and a "Column" as its parameter types.
So this code
val dataframe3 = dataframe1.withColumn("col2", if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid"))
will use "col2" as the new column name, but the second argument is reduced to either lit("valid") or lit("invalid") before Spark ever sees it. The if(validRecList.contains('col1.toString())) lit("valid") else lit("invalid") expression is executed as plain Scala code on the driver, not as a Dataset operation or a Column operation.
In other words, if(validRecList.contains('col1.toString())) is evaluated by Scala, not by Spark: it is false, and hence every row gets "invalid", because validRecList does not contain the string form of the symbol 'col1. Only if you defined val validRecList = Seq('col1, "AB", "BC") would validRecList.contains('col1) return true.
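To see concretely why every row comes out "invalid", here is a quick plain-Scala check (no Spark involved), reusing the validRecList from the question:
// evaluated on the driver -- Spark never sees this comparison
val validRecList = Seq("AB", "BC").toList
'col1.toString                          // "'col1": the Symbol's string form, not the column's data
validRecList.contains('col1.toString)   // false, so the whole expression reduces to lit("invalid")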
Also, the if operator is not supported on Dataset or Column.
If you want a condition inside the withColumn function, you need to express it as a Column-type expression, like this:
dataframe3.withColumn("isContainRecList", $"col1".isin(validRecList: _*))
this $"col1".isin(validRecList: _*) is a Column type expression because it will return Column (based on the documentation) or you can use when(the_condition, value_if_true, value_if_false).
So, I think it is important to understand the types that the spark engine will work with our data, if we are not give the Column type expression, it will not refer to the 'col1 data but it will refer to 'col1 as a scala symbol.
Also, when you want to use IF, maybe you could create a User Defined Functions.
import org.apache.spark.sql.functions.udf
def checkValidRecList(needle: String): String = if(validRecList.contains(needle)) "valid" else "invalid"
val checkUdf = udf[String, String](checkValidRecList)
val dataframe3 = dataframe1.withColumn("col2", checkUdf('col1))
the result is:
scala> dataframe3.show(false)
+----+-------+
|col1|col2 |
+----+-------+
|AB |valid |
|BC |valid |
|CD |invalid|
|DA |invalid|
|AB |valid |
|BC |valid |
+----+-------+
But remember that this UDF approach is not always recommended, since Spark cannot optimize what happens inside a UDF the way it can optimize built-in Column expressions.

replacing strings inside df using dictionary scala

I'm new to Scala. I'm trying to replace parts of strings using a dictionary.
my dictionary would be:
val dict = Seq(("fruits", "apples"),("color", "red"), ("city", "paris")).
toDF(List("old", "new").toSeq:_*)
+------+------+
| old| new|
+------+------+
|fruits|apples|
| color| red|
| city| paris|
+------+------+
I would then like to translate the fields of a column in another df, which is:
+--------------------------+
|oldCol |
+--------------------------+
|I really like fruits |
|they are colored brightly |
|i live in city!! |
+--------------------------+
the desired output:
+------------------------+
|newCol |
+------------------------+
|I really like apples |
|they are reded brightly |
|i live in paris!! |
+------------------------+
Please help! I've tried to convert dict to a map and then use the replaceAllIn() function, but I really can't solve this one.
I've also tried foldLeft, following this answer: Scala replace an String with a List of Key/Values.
Thanks
Create a Map from the dict dataframe, and then you can easily do this using a udf as shown below:
import org.apache.spark.sql.functions._

//Creating a Map from the dict dataframe (collecting first, so no Encoder is needed)
val oldNewMap = dict.collect.map(row => row.getString(0) -> row.getString(1)).toMap

//Creating the udf; note that replaceAll treats each key as a regular expression
val replaceUdf = udf((str: String) => oldNewMap.foldLeft(str) { case (acc, (key, value)) => acc.replaceAll(key, value) })

//Select the old column from oldDf, apply the udf and drop the original column
oldDf.withColumn("newCol", replaceUdf(oldDf.col("oldCol"))).drop("oldCol").show
//Output:
+--------------------+
| newCol|
+--------------------+
|I really like apples|
|they are reded br...|
| i live in paris!!|
+--------------------+
I hope this will help you
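As an alternative sketch that avoids a UDF (assuming the same oldNewMap and oldDf as above), you can fold the dictionary into a chain of the built-in regexp_replace calls; since the dictionary is collected to the driver anyway, this only makes sense for a small lookup table:
import org.apache.spark.sql.functions.{col, regexp_replace}

//fold each (old, new) pair into a chained regexp_replace on the column
val replaced = oldNewMap.foldLeft(oldDf.withColumn("newCol", col("oldCol"))) {
  case (df, (key, value)) => df.withColumn("newCol", regexp_replace(col("newCol"), key, value))
}
replaced.drop("oldCol").show(false)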

Convert Map(key-value) into spark scala Data-frame

I want to convert myMap = Map([Col_1->1],[Col_2->2],[Col_3->3]) into a Spark Scala DataFrame with the keys as column names and the values as column values. I am not getting the expected result; please check my code and provide a solution.
import scala.collection.mutable.ListBuffer

var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap: Map[String, String] = Map.empty[String, String]

for ((k, v) <- myMap) {
  println(k + "->" + v)
  finalBufferList += v
  //finalDfColumnList += "\"" + k + "\""
  finalDfColumnList += k
}

val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result :
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
if you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you should create an RDD[Row] from the values as
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then you create a schema using the keys as
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
then finally use the createDataFrame function to create the dataframe as
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
finally you should have
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful
But remember that all of this is overkill if you are only working with a small dataset.
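As a side note, sc and sqlContext above belong to the older API; with a Spark 2.x SparkSession the same approach can be sketched like this (assuming a spark session variable is in scope):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
val rowRDD = spark.sparkContext.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
val df = spark.createDataFrame(rowRDD, schema)
df.show(false)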

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a Dataframe in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a Dataframe as follows:
---------------------------------
|ID | words |
---------------------------------
1 | ['apple','ball','ballon'] |
2 | ['cat','camel','james'] |
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["id", "words"]).show(truncate=False)
+---------------------+-----+
|id |words|
+---------------------+-----+
|[apple, ball, ballon]|0 |
|[cat, camel, james] |1 |
|[none, focus, cake] |2 |
+---------------------+-----+

How to Handle ValueError from Dataframe use Scala

I am developing with Spark using Scala, and I don't have any background in Scala. I haven't hit the ValueError yet, but I am preparing a ValueError handler for my code.
|location|arrDate|deptDate|
|JFK |1201 |1209 |
|LAX |1208 |1212 |
|NYC | |1209 |
|22 |1201 |1209 |
|SFO |1202 |1209 |
If we have data like this, I would like to store the third and fourth rows in Error.dat and then continue processing with the fifth row. In the error log, I would like to record information about the data, such as which file it came from, the row number, and the details of the error. For logging I am currently using log4j.
What is the best way to implement that function? Can you guys help me?
I am assuming all three columns are of type String. In that case I would solve this using the snippet below. I have created two udfs to check for the error records:
whether a field has only numeric characters [isNumber]
and whether the string field is empty [isEmpty]
code snippet
import org.apache.spark.sql.functions.udf

// sample data as an RDD of (location, arrDate, deptDate) tuples, matching the table above
val rdd = sc.parallelize(Seq(
  ("JFK", "1201", "1209"), ("LAX", "1208", "1212"), ("NYC", "", "1209"),
  ("22", "1201", "1209"), ("SFO", "1202", "1209")))

val df = rdd.zipWithIndex.map({ case ((x, y, z), index) => (index + 1, x, y, z) }).toDF("row_num", "c1", "c2", "c3")

val isNumber = udf((x: String) => x.replaceAll("\\d", "") == "")
val isEmpty = udf((x: String) => x.trim.length == 0)

val errDF = df.filter(isNumber($"c1") || isEmpty($"c2"))
val validDF = df.filter(!(isNumber($"c1") || isEmpty($"c2")))
scala> df.show()
+-------+---+-----+-----+
|row_num| c1| c2| c3|
+-------+---+-----+-----+
| 1|JFK| 1201| 1209|
| 2|LAX| 1208| 1212|
| 3|NYC| | 1209|
| 4| 22| 1201| 1209|
| 5|SFO| 1202| 1209|
+-------+---+-----+-----+
scala> errDF.show()
+-------+---+----+----+
|row_num| c1| c2| c3|
+-------+---+----+----+
| 3|NYC| |1209|
| 4| 22|1201|1209|
+-------+---+----+----+
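To actually persist the bad rows to Error.dat and carry on with the clean ones, a minimal sketch (the output path, CSV format and logger name are assumptions, not something from the question):
// write the error records out and keep processing the valid rows
errDF.write.mode("overwrite").option("header", "true").csv("Error.dat")

// or log each bad row with log4j; collect() is acceptable here only because errDF is expected to be small
val logger = org.apache.log4j.Logger.getLogger("error-records")
errDF.collect().foreach(row => logger.error(s"bad record, row ${row.getAs[Long]("row_num")}: $row"))

validDF.show()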