Encrypt a CSV column via UDF (Spark / Scala)

I am trying to encrypt a column in my CSV file using a UDF, but I am getting a compilation error. Here is my code:
import org.apache.spark.sql.functions.{col, udf}
val upperUDF1 = udf { str: String => Encryptor.aes(str) }
val rawDF = spark
  .read
  .format("csv")
  .option("header", "true")
  .load(inputPath)
rawDF.withColumn("id", upperUDF1("id")).show() //Compilation error.
I am getting the compilation error on the last line. Am I using incorrect syntax? Thanks in advance.

You should pass a Column, not a String. You can reference a column with different syntaxes:
$"<columnName>"
col("<columnName>")
So you should try this:
rawDF.withColumn("id", upperUDF1($"id")).show()
or this:
rawDF.withColumn("id", upperUDF1(col("id"))).show()
Personally I like the dollar syntax the most; it seems more elegant to me.
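Note that the dollar syntax needs the implicits from your SparkSession in scope (assuming your session variable is named spark):
// Required for the $"..." column syntax
import spark.implicits._
rawDF.withColumn("id", upperUDF1($"id")).show()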

In addition to the answer from SCouto, you could also register your UDF as a Spark SQL function:
spark.udf.register("upperUDF2", upperUDF1)
Your subsequent select expression could then look like this
rawDF.selectExpr("id", "upperUDF2(id)").show()
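If you prefer writing plain SQL, the registered function works there too; a small sketch, assuming a temporary view name of your choice:
rawDF.createOrReplaceTempView("raw_data") // hypothetical view name
spark.sql("SELECT upperUDF2(id) AS id FROM raw_data").show()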

Related

Spark Scala: How to use a wildcard as a literal in a LIKE statement

I have a simple use case.
I have to use a wildcard as a value in a LIKE condition.
I am trying to filter out records from a string column that contain _A_.
It's a simple LIKE use case, but since _ in _A_ is a wildcard, a plain LIKE returns wrong results.
In SQL we can use ESCAPE to achieve this.
How can I achieve this in Spark?
I have not tried regular expressions; I wanted to know if there is any other, simpler workaround.
I am using Spark 1.5 with Scala.
Thanks in advance!
You can use the .contains, like, or rlike functions for this case, and use \\ to escape the _ in like:
import spark.implicits._ // for toDF
import org.apache.spark.sql.functions.col
val df = Seq("apo_A_", "asda", "aAc").toDF("str")
//using like
df.filter(col("str").like("%\\_A\\_%")).show()
//using rlike
df.filter(col("str").rlike(".*_A_.*")).show()
//using contains
df.filter(col("str").contains("_A_")).show()
//+------+
//| str|
//+------+
//|apo_A_|
//+------+
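For comparison, without the escape the _ acts as a single-character wildcard, so aAc would match as well:
// Unescaped pattern: _ matches any single character, so both "apo_A_" and "aAc" match
df.filter(col("str").like("%_A_%")).show()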
If you can use Spark DataFrames, the code would be as simple as:
object EscapeChar {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = List("_A_", "A").toDF()
    df.printSchema()
    df.filter($"value".contains("_A_")).show()
  }
}

How to read a CSV with quotes using SparkContext

I've recently started to use Scala Spark; in particular, I'm trying to use GraphX to build a graph from a CSV. To read a CSV file with the Spark context I always do this:
val rdd = sc.textFile("file/path")
.map(line => line.split(","))
In this way I obtain an RDD of Array[String] objects.
My problem is that the CSV file contains strings delimited by quotes ("") and numbers without quotes; an example of some lines from the file is the following:
"Luke",32,"Rome"
"Mary",43,"London"
"Mario",33,"Berlin"
If I use the split(",") method I obtain String objects that still contain the quotes; for instance the string Luke is stored as "Luke" rather than Luke.
How can I ignore the quotes and obtain the correct string objects?
I hope I have explained my problem clearly.
You can let Spark's DataFrame-level CSV parser resolve that for you:
val rdd = spark.read.csv("file/path").rdd.map(_.mkString(",")).map(_.split(","))
By the way, you can transform each Row directly into a (VertexId, (String, String)) pair in the first map, based on the Row fields.
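A minimal sketch of that, assuming the column layout of the sample rows (name, age, city) and using a synthetic Long index as the VertexId:
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Default column names are _c0, _c1, _c2; quotes are already stripped by the CSV parser
val people = spark.read.csv("file/path")
val vertices: RDD[(VertexId, (String, String))] = people.rdd
  .zipWithIndex() // generate a synthetic vertex id per row
  .map { case (row, id) => (id, (row.getString(0), row.getString(2))) } // name and city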
Try the example below.
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    val filePath = "C://zipcodes.csv"
    // Chaining multiple options
    val df2 = spark.read.options(Map("inferSchema" -> "true", "sep" -> ",", "header" -> "true")).csv(filePath)
    df2.show(false)
    df2.printSchema()
  }
}
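If you still need an RDD[Array[String]] (e.g. for GraphX) after letting the parser handle the quotes, you can convert the DataFrame from the example above back into one; a small sketch:
// Each Row becomes an Array[String]; the quotes were already handled by the CSV parser
val rdd = df2.rdd.map(row => row.toSeq.map(v => if (v == null) "" else v.toString).toArray)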

Spark UDF - Return type DataFrame

I have created a UDF which adds a Flag column to a DataFrame and returns the new DataFrame.
def find_mismatch = udf((df: DataFrame) => {
  df.withColumn("Flag", when(df("T_RTR_NUM").isNull && df("P_RTR_NUM").isNull,
    "Present in Flex but missing Trn and Platform"))
})
I am able to create the UDF, but when I pass a DataFrame into it, it errors out.
It works as a normal function, but as a Spark UDF it errors out.
Also, please help me understand what difference it makes if I use a normal function instead of a Spark UDF.
Please help. I have attached a screenshot of the code.
You can't pass a DataFrame to a UDF: a DataFrame is handled by the Spark context, i.e. at the driver, and you can't hand it to a UDF, which runs on the executors (each of which holds only a fraction of the dataframe).
Specifically about the problem you're trying to solve: as mentioned by @Manoj, you don't actually need a UDF to get the result you need.
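For reference, a UDF operates on plain column values, row by row, not on the DataFrame itself; a minimal sketch of the same flag logic written that way (column names taken from the question):
import org.apache.spark.sql.functions.{col, udf}

// The UDF receives the two String values of the current row (null when the column is null)
val flagUdf = udf { (t: String, p: String) =>
  if (t == null && p == null) "Present in Flex but missing Trn and Platform" else null
}
val flagged = df.withColumn("Flag", flagUdf(col("T_RTR_NUM"), col("P_RTR_NUM")))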
You can do this without a UDF, like below:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when

def findMismatch(df: Dataset[Row]): Dataset[Row] = {
  val transDF = df.withColumn("Flag", when(df("T_RTR_NUM").isNull && df("P_RTR_NUM").isNull,
    "Present in Flex but missing Trn and Platform"))
  transDF
}

val transDF = findMismatch(df)
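A quick way to inspect the result (column names from the question):
transDF.select("T_RTR_NUM", "P_RTR_NUM", "Flag").show(false)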

How to load a csv directly into a Spark Dataset?

I have a csv file [1] which I want to load directly into a Dataset. The problem is that I always get errors like
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, specifically for the phrases field (see the case class [2]), I get
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String then everything works fine but this is not what I want. Is there a simple way to do it [3]?
References
[1] An example row
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet is as follows
import spark.implicits._
val INPUT_TF = "<SOME_URI>/my_file.csv"
final case class TFFormat(
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)
val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  .as[TFFormat]
ds.take(1).map(println)
[3] I have found ways to do it by first defining columns at the DataFrame level and then converting things to a Dataset (like here, here or here), but I am almost sure this is not the way things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I don't have a clue how.
TL;DR With CSV input, transforming with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format that is more expressive (Parquet or even JSON).
In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to do that is to provide the schema argument to the CSV reader:
val schema: StructType = ???
val ds = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(path)
  .as[T]
where schema could be inferred by reflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
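Applied to the case class from the question, that would read (a sketch):
val schema = ScalaReflection.schemaFor[TFFormat].dataType.asInstanceOf[StructType]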
Unfortunately it won't work with your data and class, because the CSV reader doesn't support ArrayType (it would work for atomic types like FloatType), so you have to do it the hard way. A naive solution could be expressed as below:
import org.apache.spark.sql.functions._
val df: DataFrame = ??? // Raw data
df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases",
    split(regexp_replace($"phrases", "[\\['\\]]", ""), ","))
  .as[TFFormat]
but you may need something more sophisticated depending on the content of phrases.
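For example, if you need more control over how the bracketed list is parsed (trimming quotes and whitespace, handling nulls), a parsing UDF is one option; a naive sketch based only on the sample row in [1]:
import org.apache.spark.sql.functions.udf

// Strips the surrounding [ ] and the single quotes around each item; naive split on ","
val parsePhrases = udf { s: String =>
  Option(s)
    .map(_.stripPrefix("[").stripSuffix("]"))
    .map(_.split(",").map(_.trim.stripPrefix("'").stripSuffix("'")).toSeq)
    .getOrElse(Seq.empty[String])
}

val ds = df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases", parsePhrases($"phrases"))
  .as[TFFormat]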

Customise the way in which Spark data is read

I have the following data in a CSV file (actually, my real data is larger, but this is a good simplification):
ColumnA,ColumnB
1,X
5,G
9,F
I am reading it the following way, where url is the location of the file:
val rawData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(url)
For reading, I am using https://github.com/databricks/spark-csv
Then, I am applying a map on it:
val formattedData = rawData.map(me => me("ColumnA") match {
//some other code
})
However, when I am referencing the column like this: me("ColumnA") I am getting a type mismatch:
Type mismatch, expected: Int, actual: String
Why is this occurring? Shouldn't every row of rawData be a Map?
When you reference a particular column in a DataFrame's Row, you have several ways to do it.
If you use the apply method, you need to pass the index of the column.
If you want to get a column by name, you need to use the getAs[T] function of Row.
So you can use:
me(0)
or
me.getAs[T]("ColumnA")
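Applied to the snippet from the question, that would look roughly like this (a sketch; the match arms are placeholders since the original cases were elided):
val formattedData = rawData.map(me => me.getAs[Int]("ColumnA") match {
  case 1 => "small"          // placeholder logic
  case other => other.toString
})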
Hope it will help you.