Transforming Spark Dataframe Column - scala

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable - Only pick the top few levels which has greater than n observations (say,1000). Club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
# Extract all levels having > 1000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))
# Extract the level names
val level_names = level_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
# Define UDF
val var_transform = udf((x: String) => {
if (level_names contains x) x
else "others"
})
# Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!

Here is a solution that would be, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust catalyst and Tungsten to be optimised (e.g. making a broadcast join):
val levels_count = df
.groupBy($"Col_name".as("new_col_name"))
.count
.filter("count >10000")
val df_new = df
.join(levels_count,$"Col_name"===$"new_col_name", joinType="leftOuter")
.drop("Col_name")
.withColumn("new_col_name",coalesce($"new_col_name", lit("other")))

Related

Apply function to subset of Spark Datasets (Iteratively)

I have a Dataset of geospatial data that I need to sample in a grid like fashion. I want divide the experiment area into a grid, and use a sampling function called "sample()" that takes three inputs, on each square of the grid, and then merge the sampled datasets back together. My current method utilized a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
val latMin = row._1
val latMax = latMin + 0.0001
val lonMin = row._2
val lonMax = lonMin + 0.0001
val filterDF = featuresDS.filter($"Latitude" > latMin).filter($"Latitude"< latMax).filter($"Longitude">lonMin).filter($"Longitude"< lonMin)
val sampleDS = filterDF.sample(false, 0.05, 1234)
(sampleDS)
})
val output = sampleDataDS.reduce(_ union _)
I've tried various ways of dealing with this. Converting sampleDS to an RDD and to a List, but I still continue to get a NullPointerExcpetion when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates aMap: Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ingore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.iteritems() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs

how to use select() and map() in spark - scala?

Im writing a code for data migration from mysql to cassandra using spark. I m trying to generalize it so that given a conf file it can migrate any table. Here im stuck at 2 places:
val dataframe2 = dataframe.select("a","b","c","d","e","f")
After Loading the table from mysql i wish to select only a few columns, i have the names of these columns as a list. How can it be used here?
val RDDtuple = dataframe2.map(r => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5)))
Here again every table may have a different number of columns, so how can this be achieved?
To use variable number of columns in select(), your list of columns can be converted like this:
val columns = List("a", "b", "c", "d")
val dfSelectedCols = dataFrame.select(columns.head, columns.tail :_*)
Explanation: the first param in DataFrame's select(String, String...) is mandatory, so use columns.head. The remaining part of the list need to be converted to varargs using columns.tail :_*.
It's not very clear from your example, but I suppose that x is a RDD[Row] and that you are trying to convert into a RDD of Tuples, right ? Please give more details and also use meaningful variable names. x, y or z are bad choices, especially if there is no explicit typing.

Spark Build Custom Column Function, user defined function

I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array , iterate through each element and make a calculation.
To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here is what it looks like in Scala
def getMax(inputArray: Array[Int]): Int = {
var maxValue = inputArray(0)
for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
maxValue = inputArray(i)
}
maxValue
}
This is what I have so far, and get this error
"value length is not a member of org.apache.spark.sql.column",
and I don't know how else to iterate through the column.
def getMax(col: Column): Column = {
var maxValue = col(0)
for (i <- 1 until col.length if col(i) > maxValue){
maxValue = col(i)
}
maxValue
}
Once I am able to implement my own method, I will create a column function
val value_max:org.apache.spark.sql.Column=getMax(df.col(“value”)).as(“value_max”)
And then I hope to be able to use this in a SQL statement, for example
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected output would be 9, given input column [3,8,2,5,9]
I am following an answer from another thread Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame where they create a private method for standard deviation.
The calculations I will do will be more complex than this, (e.g I will be comparing each element in the column) , am I going in the correct directions or should I be looking more into User Defined Functions?
In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.
However, to process the values of a column, you have some options and the right one depends on your task:
1) Using the existing built-in functions
Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them you can find in the functions package (documentation here). Some others (binary functions in general) you can find directly in the Column object (documentation here). So, if you can use them, it's usually the best option. Note: don't forget the Window Functions.
2) Creating an UDF
If you can't complete your task with the built-in functions, you may consider defining an UDF (User Defined Function). They are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first, you define a simple function, then you register it as an UDF, then you use it. Example:
def myFunc: (String => String) = { s => s.toLowerCase }
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
val newDF = df.withColumn("newCol", myUDF(df("oldCol")))
For more information, here's a nice article.
3) Using an UDAF
If your task is to create aggregated data, you can define an UDAF (User Defined Aggregation Function). I don't have a lot of experience with this, but I can point you to a nice tutorial:
https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/
4) Fall back to RDD processing
If you really can't use the options above, or if you processing task depends on different rows for processing one and it's not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:
val singleColumnDF = df("column")
val myRDD = singleColumnDF.rdd
// process myRDD
So, there was the options I could think of. I hope it helps.
An easy example is given in the excellent documentation, where a whole section is dedicated to UDFs:
import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))

Merge multiple RDD generated in loop

I am calling a function in scala which gives an RDD[(Long,Long,Double)] as its output.
def helperfunction(): RDD[(Long, Long, Double)]
I call this function in loop in another part of the code and I want to merge all the generated RDDs. The loop calling the function looks something like this
for (i <- 1 to n){
val tOp = helperfunction()
// merge the generated tOp
}
What I want to do is something similar to what StringBuilder would do for you in Java when you wanted to merge the strings. I have looked at techniques of merging RDDs, which mostly point to using union function like this
RDD1.union(RDD2)
But this requires both RDDs to be generated before taking their union. I though of initializing a var RDD1 to accumulate the results outside the for loop but I am not sure how can I initialize a blank RDD of type [(Long,Long,Double)]. Also I am starting out with spark, so I am not even sure if this is the most elegant method to solve this problem.
Instead of using vars, you can use functional programming paradigms to achieve what you want :
val rdd = (1 to n).map(x => helperFunction()).reduce(_ union _)
Also, if you still need to create an empty RDD, you can do it using :
val empty = sc.emptyRDD[(long, long, String)]
You're correct that this might not be the optimal way to do this, but we would need more info on what you're trying to accomplish with generating a new RDD with each call to your helper function.
You could define 1 RDD prior to the loop and assign it a var then run it through your loop. Here's an example:
val rdd = sc.parallelize(1 to 100)
val rdd_tuple = rdd.map(x => (x.toLong, (x*10).toLong, x.toDouble))
var new_rdd = rdd_tuple
println("Initial RDD count: " + new_rdd.count())
for (i <- 2 to 4) {
new_rdd = new_rdd.union(rdd_tuple)
}
println("New count after loop: " + new_rdd.count())