How to select a column in a dataframe by its number instead of its name - scala

I'd like to select a column in a Spark dataframe by its number instead of its name. Is it possible?
Thank you

If you want to write your own method for this you can do:
package utils

import org.apache.spark.sql.DataFrame

object Extensions {
  implicit class DataFrameExtensions(df: DataFrame) {
    // select columns by their positional index instead of their name
    def selecti(indices: Int*) = {
      val cols = df.columns
      df.select(indices.map(cols(_)): _*)
    }
  }
}
Now you can import and use this method as:
import utils.Extensions._
df.selecti(1,2,3)
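If you only need this occasionally, a minimal sketch without the extension method (assuming a dataframe df) is to index into df.columns directly:
// positions are 0-based; df.columns returns the column names in order
df.select(df.columns(1), df.columns(2))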

Related

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one (say, the n-th) column on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features, put them into a Pipeline, and finally fit my trainData and get predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features
val featureEncoder = new OneHotEncoderEstimator()
  .setInputCols(Array(sIndexer1.getOutputCol, ...))
  .setOutputCols(Array("encoded_feature1", ...))
// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below
val vectorAssembler = new VectorAssembler()
  .setInputCols(featureEncoder.getOutputCols + ???)
  .setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ..., featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works just as expected separately, e.g. on trainData. Basically, oheEncodingTopN applies one-hot encoding to the input column, but only for the top N most frequent values (e.g. N = 50), and puts all the remaining infrequent values in a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
  df.createOrReplaceTempView("data")
  val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
  val pivotTopNDF = topNDF.
    groupBy(colName).
    pivot(colName).
    count().
    withColumn("default", lit(1))
  val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
  val oheEncodedDF = joinedTopNDF.
    na.fill(0, joinedTopNDF.columns).
    withColumn("default", flip(col("default")))
  oheEncodedDF
}
I think the cleanest way would be to create your own class that extends Spark ML's Transformer, so that you can use it just as you would any other transformer (like OneHotEncoder). Your class would look like this:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Dataset}

class OHEncodingTopN(n: Int, override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  def this(n: Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))

  def copy(extra: ParamMap): OHEncodingTopN = {
    defaultCopy(extra)
  }

  override def transformSchema(schema: StructType): StructType = {
    // Check that the input type is what you want if needed
    // val idx = schema.fieldIndex($(inputCol))
    // val field = schema.fields(idx)
    // if (field.dataType != StringType) {
    //   throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
    // }
    // Add the return field
    schema.add(StructField($(outputCol), IntegerType, false))
  }

  def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

  def transform(df: Dataset[_]): DataFrame = {
    df.createOrReplaceTempView("data")
    val colName = $(inputCol)
    val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
    val pivotTopNDF = topNDF.
      groupBy(colName).
      pivot(colName).
      count().
      withColumn("default", lit(1))
    val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
    val oheEncodedDF = joinedTopNDF.
      na.fill(0, joinedTopNDF.columns).
      withColumn("default", flip(col("default")))
    oheEncodedDF
  }
}
Now on an OHEncodingTopN object you should be able to call .getOutputCol to perform what you want. Good luck.
EDIT: the method of yours that I copy-pasted into the transform method should be slightly modified so that it outputs a column of type Vector with the name given in setOutputCol.
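To tie this back to the original pipeline, here is a rough sketch (not from the original answer) of how the custom transformer might be wired in, assuming it has been modified per the EDIT above to emit a single Vector column; the name myEncoder and the column name "encoded_featureN" are made up for illustration:
// hypothetical wiring of the custom transformer into the question's pipeline
val myEncoder = new OHEncodingTopN(50)
  .setInputCol("my_featureN")
  .setOutputCol("encoded_featureN")
val vectorAssembler = new VectorAssembler()
  .setInputCols(featureEncoder.getOutputCols :+ "encoded_featureN")  // this fills in the "???" above
  .setOutputCol("assembled_features")
val pipeline = new Pipeline()
  .setStages(Array(sIndexer1, featureEncoder, myEncoder, vectorAssembler, myClassifier))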

Test sparksql query

I have a Dataframe that I want to run a simple query on like this:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
  df.createOrReplaceTempView("myDataFrame")
  spark.sql(queryString)
}
Where queryString can be something like
"SELECT name, age FROM myDataFrame WHERE age > 30"
But I'd really like to know ahead of time whether the query will work, without it throwing an Exception. For instance, what if df doesn't have the columns name and age? I want to write something like this to handle it:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
  if (/*** df and queryString are compatible ***/) {
    df.createOrReplaceTempView("myDataFrame")
    spark.sql(queryString)
  } else {
    spark.createDataFrame(sc.emptyRDD[Row], df.schema)
  }
}
Is there a way to check this in an 'if' statement?
I wouldn't worry too much about exceptions. Just wrap it with Try:
import scala.util.Try
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def runQuery(df: DataFrame, queryString: String): DataFrame = Try {
  df.createOrReplaceTempView("myDataFrame")
  df.sparkSession.sql(queryString)
}.getOrElse(df.sparkSession.emptyDataset(RowEncoder(df.schema)))
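A quick usage sketch (the queries and column names are illustrative only): a query that fails analysis, for example one referencing a missing column, is caught by the Try and simply yields an empty dataframe with the original schema instead of throwing.
// assuming df has columns "name" and "age"
val ok = runQuery(df, "SELECT name, age FROM myDataFrame WHERE age > 30")  // normal result
val bad = runQuery(df, "SELECT salary FROM myDataFrame")                   // analysis fails, empty dataframe returned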
You can also check whether all the required columns are present in the dataframe without triggering a Spark job:
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def runQuery(df: DataFrame, queryString: String): DataFrame =
  if (Array("name", "age", "address").forall(df.columns.contains)) {
    df.createOrReplaceTempView("myDataFrame")
    df.sparkSession.sql(queryString)
  } else {
    df.sparkSession.emptyDataset(RowEncoder(df.schema))
  }
You can use df.schema to match data types as well.
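For that data-type check, a rough sketch (the expected column names and types here are just an assumption for illustration):
import org.apache.spark.sql.types.{IntegerType, StringType}

// require specific columns with specific types before running the query
val expected = Map("name" -> StringType, "age" -> IntegerType)
val schemaOk = expected.forall { case (colName, dt) =>
  df.schema.fields.exists(f => f.name == colName && f.dataType == dt)
}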

passing UDF to a method or class

I have a UDF say
val testUDF = udf { s: String => s.toUpperCase }
I want to create this UDF in a separate method, or maybe something else like an implementation class, and pass it to another class which uses it. Is that possible?
Suppose I have a class A:
class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(testUDF(col("col1")))
    demo
  }
}
Class A should be able to use the UDF. Can this be achieved?
Given a dataframe as
+----+
|col1|
+----+
|abc |
|dBf |
|Aec |
+----+
And a udf function
import org.apache.spark.sql.functions._
val testUDF = udf{s: String=>s.toUpperCase}
You can definitely use that udf function from another class as
val demo = df.select(testUDF(col("col1")).as("upperCasedCol"))
which should give you
+-------------+
|upperCasedCol|
+-------------+
|ABC |
|DBF |
|AEC |
+-------------+
But I would suggest using built-in functions where possible, as a udf requires the column values to be serialized and deserialized, which consumes more time and memory than the built-in alternatives. A udf should be the last choice.
You can use the built-in upper function for your case:
val demo = df.select(upper(col("col1")).as("upperCasedCol"))
This will generate the same output as the original udf function.
I hope the answer is helpful
Updated
Since your question asks how to call a udf function defined in another class or object, here is one way.
Suppose you have an object where you define the udf function, or the built-in alternative I suggested, as:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object UDFs {
  def testUDF = udf { s: String => s.toUpperCase }
  def testUpper(column: Column) = upper(column)
}
Your A class stays as in your question; I just added another function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(UDFs.testUDF(col("col1")))
    demo
  }

  def usingUpper() = {
    df.select(UDFs.testUpper(col("col1")))
  }
}
Then you can call the functions from main as below
import org.apache.spark.sql.SparkSession

object TestUpper {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("Simple Application")
      .master("local")
      .config("", "")
      .getOrCreate()
    import sparkSession.implicits._

    val df = Seq(
      ("abc"),
      ("dBf"),
      ("Aec")
    ).toDF("col1")

    val a = new A(df)
    // calling udf function
    a.testMethod().show(false)
    // calling upper function
    a.usingUpper().show(false)
  }
}
I guess this is more than helpful
If I understand correctly you would actually like some kind of factory to create this user-defined-function for a specific class A.
This could be achieved using a type class which gets injected implicitly.
E.g. (I had to define UDF and DataFrame to be able to test this)
type UDF = String => String

case class DataFrame(col: String) {
  def select(in: String) = s"col:$col, in:$in"
}

trait UDFFactory[A] {
  def testUDF: UDF
}

implicit object UDFFactoryA extends UDFFactory[AClass] {
  def testUDF: UDF = _.toUpperCase
}

class AClass(df: DataFrame) {
  def testMethod(implicit factory: UDFFactory[AClass]) = {
    val demo = df.select(factory.testUDF(df.col))
    println(demo)
  }
}

val a = new AClass(DataFrame("test"))
a.testMethod // prints 'col:test, in:TEST'
As you mentioned, define a value exactly like your UDF in your object body or companion class:
val myUDF = udf((str: String) => { str.toUpperCase })
Then for some dataframe df do this:
val res = df.withColumn("NEWCOLNAME", myUDF(col("OLDCOLNAME")))
This will change something like this,
+-------------------+
| OLDCOLNAME |
+-------------------+
| abc |
+-------------------+
to
+-------------------+-------------------+
| OLDCOLNAME | NEWCOLNAME |
+-------------------+-------------------+
| abc | ABC |
+-------------------+-------------------+
Let me know if this helped, Cheers.
Yes, that's possible, as functions are objects in Scala which can be passed around:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.col

class A(df: DataFrame, testUdf: UserDefinedFunction) {
  def testMethod(): DataFrame = {
    df.select(testUdf(col("col1")))
  }
}
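A brief usage sketch (the udf and the column name "col1" are illustrative, assuming a dataframe df as above):
import org.apache.spark.sql.functions.udf

// build the udf once and inject it through the constructor
val upperUdf = udf { s: String => s.toUpperCase }
val a = new A(df, upperUdf)
a.testMethod().show(false)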

Display column names into a List[Column] - scala

I want to insert the list of columns from a dataframe into a List[Column] so I can perform a select request. In other words, I want to get the list of column names and insert them automatically into a List[Column]. Any help? Thanks.
object PCA extends App {
  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  val strPath = "C:/Users/mhattabi/Desktop/testBis2.txt"
  val intial_Data = spark.read.option("header", true).csv(strPath)
  // array of strings containing the names of the columns
  val arrayList = intial_Data.columns
  var colsList = List[Column]()
  // want to insert the name of each column into colsList
  arrayList.foreach(p => colsList.)
  // i want to have something like
  // val colsList = List(col("col1"), col("col2"))
  // intial_Data.select(colsList:_*).show
}
You could use the col function as follows:
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
Remember to import the sql functions to use col:
import org.apache.spark.sql.functions._
I would rather use an immutable list than a mutable var, via a transformation like below:
val arrayList = intial_Data.columns
val colsList = arrayList.map(col)
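Either way, you can then pass the resulting columns to a select (a quick sketch using the dataframe from the question):
intial_Data.select(colsList: _*).show()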

How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set of String?
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  val containsAction = udf((action: String) => {
    actions.contains(action)
  })
  myDF.filter(containsAction('action))
}
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')
How about this:
myDF.filter("action in (1,2)")
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
Additional support will be added to make this cleaner in 1.5
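As a follow-up to that note: newer Spark versions expose Column.isin, which lets you filter directly against the Set from the question (a small sketch reusing the myFilter signature above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// keep only rows whose "action" value is in the given set
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame =
  myDF.filter(col("action").isin(actions.toSeq: _*))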