Applying function to Spark Dataframe Column - scala

Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in scala
def round_tenths_place( un_rounded:Double ) : Double = {
val rounded = BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
return rounded
}
And apply it to a one column of a dataframe - kind of what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to and RDD and then selecting from rdd of rows to get the right field and mapping the function across all of the values, yeah? And also something more succinct creating a SQL table and then doing this with a sparkSQL UDF?

You can define an UDF as follows:
val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
"bid_price_bucket", round_tenths_place_udf($"bid_price"))
although built-in Round expression is using exactly the same logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
See also following:
Updating a dataframe column in spark
How to apply a function to a column of a Spark DataFrame?

Related

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
x = list(df["string"])
result = []
for i in x:
if "pass" in i:
result.append("YES")
else:
result.append("NO")
df["result"] = result
return df
The code is super simple all I'm trying to do is iterate through a column and in each row contains a sentence. I want to check if the word pass is in that sentence and if so append that to a list that will later become a column right next to the df["string"] column. Ive tried to do this using Pandas UDF but the error messages I'm getting are something that I don't understand because I'm new to spark. Could someone point me in the correct direction?
There is no need to use a UDF. This can be done in pyspark as follows. Even in pandas, I would advice you dont do what you have done. use np.where()
df.withColumn('result', when(col('store')=='target','YES').otherwise('NO')).show()

Applying transformations with filter or map which one is faster Scala spark

Iam trying to do some transformations on the dataset with spark using scala currently using spark sql but want to shift the code to native scala code. i want to know whether to use filter or map, doing some operations like matching the values in column and get a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
Used to write something like this earlier in spark sql can someone tell me an alternative way to write the same using map or filter on the dataset, and even which one is much faster when compared.
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand you question correctly, you need to rewrite your SQL query to DataFrame API. Your query reads all columns from table TABLE and filter rows where COLUMN is empty. You can do this with DF in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/Dataframe with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column to this dataset of type ml.linalg.Vector to this data set (so I will have both types of Vectors). The reason is I am evaluating few algorithms and some of those expect mllib vector and some expect ml vector. Also, I have to feed o/p of one algorithm to another and each use different types.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the data set in hand. I tried using MLUtils.convertVectorColumnsToML() inside an UDF and regular functions but not able to get it to working. I am trying to avoid creating a new dataset and then doing inner join and dropping the columns as the data set will be huge eventually and joins are expensive.
You can use the method toML to convert from mllib to ml vector. An UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) = > {
mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

Spark's toDS vs to DF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However there also exists rdd.toDF. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I find out that almost any operation takes me out to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find out that another conversion to a DataSet is needed, because something brought me to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a DataSet in the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example
spark
.read
.schema (...)
.json (...)
.rdd
.zipWithUniqueId
.map[(Integer,String,Double)] { case (row,id) => ... }
.toDS // now with a Dataset API (should use toDF here?)
.withColumnRenamed ("_1", "id" ) // now back to a DataFrame, not type safe :(
.withColumnRenamed ("_2", "text")
.withColumnRenamed ("_2", "overall")
.as[ParsedReview] // back to a Dataset
Michael Armburst nicely explained that shift to dataset and dataframe and the difference between the two. Basically in spark 2.x they converged dataset and dataframe API into one with slight difference.
"DataFrame is just DataSet of generic row objects. When you don't know all the fields, DF is the answer".

Spark Scala - Apply ML/Complex functions on a GroupBy DataFrame

I have a large DataFrame (Spark 1.6 Scala) which looks like this:
Type,Value1,Value2,Value3,...
--------------------------
A,11.4,2,3
A,82.0,1,2
A,53.8,3,4
B,31.0,4,5
B,22.6,5,6
B,43.1,6,7
B,11.0,7,8
C,22.1,8,9
C,3.2,9,1
C,13.1,2,3
From this I want to group by Type and apply machine learning algorithms and/or perform complex functions on each group.
My objective is perform complex functions on each group in parallel.
I have tried the following approaches:
Approach 1) Convert Dataframe to Dataset and then use ds.mapGroups() api. But this is giving me an Iterator of each group values.
If i want to perform RandomForestClassificationModel.transform(dataset: DataFrame), i need a DataFrame with only a particular group values.
I was not sure converting Iterator to a Dataframe within mapGroups is a good idea.
Approach 2) Distinct on Type, then map on them and then filter for each Type with in the map loop:
val types = df.select("Type").distinct()
val ff = types.map(row => {
val type = row.getString(0)
val thisGroupDF = df.filter(col("Type") == type)
// Apply complex functions on thisGroupDF
(type, predictedValue)
})
For some reason, the above is never completing (seems to be getting into some kind of infinite loop)
Approach 3) Exploring Window functions, but did not find a method which can provide dataframe of particular group values.
Please help.