Add column conditionally in Spark dataframe - scala

I would like to know if there is a way to conditionally add a column to a Spark dataframe in the middle of a chain.
I would like to do something like this:
val newDf = df
.long
.chain
.of
.instructions
.withColumn("newColumn", operation(col("anotherColumn"))) if "anotherColumn" exists otherwise don't add a column
.another
.long
.chain
.of
.instructions
I know that I could break my chain like this:
val dfTmp = df.long
.chain
.of
.instructions
val dfTmp2 = if (dfTmp.columns.contains("anotherColumn"))
dfTmp.withColumn("newColumn", operation(col("anotherColumn")))
else
dfTmp
val newDf = dfTmp2
.long
.chain
.of
.instructions
But it affects code readability quite a lot, as it creates useless intermediate variables. I would like an easy-to-read way to somehow include the test in the chain.
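One way to keep the test inline without breaking the chain is to wrap it in a function and splice it in with Dataset.transform (a sketch for Spark 2.x; the column names, the operation helper and the placeholder chain steps below are hypothetical):

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("conditional column").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the real operation from the question.
def operation(c: Column): Column = upper(c)

// Package the conditional step as DataFrame => DataFrame so it can sit inside a chain.
def addColumnIfPresent(df: DataFrame): DataFrame =
  if (df.columns.contains("anotherColumn"))
    df.withColumn("newColumn", operation(col("anotherColumn")))
  else
    df

val df = Seq(("a", "b")).toDF("someColumn", "anotherColumn")

val newDf = df
  .select($"someColumn", $"anotherColumn") // stands in for the first long chain
  .transform(addColumnIfPresent)           // the conditional step stays in the chain
  .filter(lit(true))                       // stands in for the rest of the chain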

Related

How to filter an rdd by data type?

I have an RDD that I am trying to filter for only float type. Do Spark RDDs provide any way of doing this?
I have a CSV where I need only the float values greater than 40 in a new RDD. To achieve this, I am checking whether each value is an instance of type Float and filtering on that. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.
Since you have plain strings and you are trying to get float values from them, you are not actually filtering by type; you are filtering by whether the values can be parsed to a float.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either perform another filter afterwards, or filter the inner Option.
(Both should perform more or less the same due to Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
Let me know if you have any questions.

Converting a Spark DataFrame for ML processing

I have written the following code to feed data to a machine learning algorithm in Spark 2.3. The code below runs fine. I need to enhance this code to be able to convert not just 3 columns but any number of columns uploaded via the CSV file. For instance, if I had loaded 5 columns, how can I put them automatically into the Vectors.dense call below, or generate the same end result some other way? Does anyone know how this can be done?
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // needed for the map to (Double, Vector) and toDF

val data2 = spark.read.format("csv")
  .option("header", "true")
  .load("/data/c7.csv")

val goodBadRecords = data2.map(
  row => {
    val n0 = row(0).toString.toLowerCase().toDouble
    val n1 = row(1).toString.toLowerCase().toDouble
    val n2 = row(2).toString.toLowerCase().toDouble
    val n3 = row(3).toString.toLowerCase().toDouble
    (n0, Vectors.dense(n1, n2, n3))
  }
).toDF("label", "features")
Thanks
Regards,
Adeel
A VectorAssembler can do the job:
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
Based on your code, the solution would look like:
val data2 = spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true") //1
.load("/data/c7.csv")
val fields = data2.schema.fieldNames
val assembler = new VectorAssembler()
.setInputCols(fields.tail) //2
.setOutputCol("features") //3
val goodBadRecords = assembler.transform(data2)
.withColumn("label", col(fields(0))) //4
.drop(fields:_*) //5
Remarks:
1) A schema is necessary for the input data, as the VectorAssembler only accepts the following input column types: all numeric types, boolean type, and vector type (same link). You seem to have a CSV with doubles, so inferring the schema should work. But of course, any other method of converting the string data to doubles is also fine.
2) Use all but the first column as input for the VectorAssembler.
3) Name the result column of the VectorAssembler features.
4) Create a new column called label as a copy of the first column.
5) Drop all original columns. This last step is optional, as the learning algorithm usually only looks at the label and features columns and ignores all other columns.
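To see the effect of steps //2 to //5 on a tiny in-memory DataFrame instead of the CSV (a sketch with hypothetical column names, assuming a SparkSession named spark is in scope):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import spark.implicits._

val demo = Seq((1.0, 2.0, 3.0, 4.0)).toDF("c0", "c1", "c2", "c3")
val fields = demo.schema.fieldNames

val assembled = new VectorAssembler()
  .setInputCols(fields.tail)            // c1, c2, c3
  .setOutputCol("features")
  .transform(demo)
  .withColumn("label", col(fields(0)))  // copy of c0
  .drop(fields: _*)

assembled.show()
// prints roughly:
// +-------------+-----+
// |     features|label|
// +-------------+-----+
// |[2.0,3.0,4.0]|  1.0|
// +-------------+-----+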

Select 2000+ columns as Features for Classification ML

I'm looking at how I can select a lot of columns (2000+) as features from a DataFrame. I don't want to write the names one by one.
I'm doing classification and I have around 2000 features.
data is a DataFrame with around 2000 columns.
First, I get all the column names of my DF and drop 9 columns because I don't need them.
My idea was to use all the column names to feed the VectorAssembler. The result should be something like [Value of the 1st feature, value of the 2nd feature, value of the 3rd feature, ...] for the first row, and the same for the whole DataFrame.
But I have this error :
java.lang.IllegalArgumentException: Field "features" does not exist.
EDIT: If something is unclear, please let me know so that I can fix it.
I deleted some transformers (StringIndexer, VectorIndexer, IndexToString) because they are not the point of my question.
val array = data.columns.drop(9)
val assembler = new VectorAssembler()
.setInputCols(array)
.setOutputCol("features")
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setNumTrees(50)
val pipeline = new Pipeline()
.setStages(Array(assembler, rf))
val model = pipeline.fit(trainingData)
EDIT 2: I fixed my problem. I took out the VectorIndexer and used array in the VectorAssembler, and it worked perfectly.
Well, at least I get a result.

Transforming Spark Dataframe Column

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable: only pick the top few levels which have more than n observations (say, 1000), and club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
// Extract all levels having > 1000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))

// Extract the level names
val level_names = levels_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
// Define UDF
val var_transform = udf((x: String) => {
  if (level_names contains x) x
  else "others"
})

// Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that would, in my opinion, be better for such a simple transformation: stick to the DataFrame API and trust Catalyst and Tungsten to optimise it (e.g. by making a broadcast join):
val levels_count = df
  .groupBy($"Col_name".as("new_col_name"))
  .count
  .filter("count > 10000")

val df_new = df
  .join(levels_count, $"Col_name" === $"new_col_name", joinType = "leftOuter")
  .drop("Col_name")
  .withColumn("new_col_name", coalesce($"new_col_name", lit("other")))

Spark Build Custom Column Function, user defined function

I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array, iterate through each element and make a calculation.
To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here is what it looks like in Scala
def getMax(inputArray: Array[Int]): Int = {
var maxValue = inputArray(0)
for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
maxValue = inputArray(i)
}
maxValue
}
This is what I have so far, and I get this error:
"value length is not a member of org.apache.spark.sql.column",
and I don't know how else to iterate through the column.
def getMax(col: Column): Column = {
var maxValue = col(0)
for (i <- 1 until col.length if col(i) > maxValue){
maxValue = col(i)
}
maxValue
}
Once I am able to implement my own method, I will create a column function
val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")
And then I hope to be able to use this in a SQL statement, for example
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected output would be 9, given input column [3,8,2,5,9]
I am following an answer from another thread, "Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame", where they create a private method for standard deviation.
The calculations I will do will be more complex than this (e.g. I will be comparing each element in the column), so am I going in the correct direction, or should I be looking more into user-defined functions?
In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.
However, to process the values of a column, you have some options and the right one depends on your task:
1) Using the existing built-in functions
Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them you can find in the functions package (documentation here). Some others (binary functions in general) you can find directly in the Column object (documentation here). So, if you can use them, it's usually the best option. Note: don't forget the Window Functions.
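For instance, the getMax from the question needs no custom code at all; the built-in max aggregation covers it (a sketch assuming a Spark 2.x SparkSession named spark and a numeric column called value):

import org.apache.spark.sql.functions.max

// Aggregated maximum of the column, returned as a one-row DataFrame.
val valueMax = df.agg(max(df("value")).as("value_max"))

// Or via SQL, after registering the DataFrame as a temporary view.
df.createOrReplaceTempView("table")
val sample = spark.sql("SELECT max(value) AS value_max FROM table")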
2) Creating a UDF
If you can't complete your task with the built-in functions, you may consider defining a UDF (User Defined Function). They are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first you define a simple function, then you register it as a UDF, then you use it. Example:
def myFunc: (String => String) = { s => s.toLowerCase }
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
val newDF = df.withColumn("newCol", myUDF(df("oldCol")))
For more information, here's a nice article.
3) Using a UDAF
If your task is to create aggregated data, you can define a UDAF (User Defined Aggregate Function). I don't have a lot of experience with these, but I can point you to a nice tutorial:
https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/
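For completeness, here is a minimal sketch of what such an aggregate could look like with Spark 2.x's UserDefinedAggregateFunction API (a max over a Double column, mirroring the getMax idea from the question; an illustration, not the tutorial's exact code):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Aggregates a Double column to its maximum value.
class MaxUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("max", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Double.MinValue
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = math.max(buffer.getDouble(0), input.getDouble(0))
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = math.max(buffer1.getDouble(0), buffer2.getDouble(0))
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// Usage: val maxOfValue = df.agg(new MaxUDAF()(df("value")))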
4) Fall back to RDD processing
If you really can't use the options above, or if your processing task depends on different rows to process a single one and it's not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:
val singleColumnDF = df("column")
val myRDD = singleColumnDF.rdd
// process myRDD
So, those were the options I could think of. I hope this helps.
An easy example is given in the excellent documentation, where a whole section is dedicated to UDFs:
import org.apache.spark.sql._
import org.apache.spark.sql.functions.callUDF

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))