Pass parameter to select dataframe spark - scala

I want to pass the columns to be selected from a DataFrame as a parameter, because they change on each run. For the moment I have done this, and it works:
object PCA extends App{
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val strPath="C:/Users/mhattabi/Desktop/testBis2.txt"
val intial_Data=spark.read.option("header",true).csv(strPath)
val inputData=intial_Data.select("col1","col2").show
}
I want to pass a parameter to the select, so I tried this:
object PCA extends App{
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val strPath="C:/Users/mhattabi/Desktop/testBis2.txt"
val columns="col1","col2"
val intial_Data=spark.read.option("header",true).csv(strPath)
val listcolu=intial_Data.columns
foreach(string s in listcolu)
{create the list insert the column name
}
}
This does not even compile. The aim is to pass the columns as a parameter each time.

You can build a list of Column objects and expand it into select with the varargs syntax, something like this:
import org.apache.spark.sql.functions.col
val colsList = List(col("col1"),col("col2"))
intial_Data.select(colsList:_*).show
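If the column names arrive as plain strings (for example from program arguments), a minimal sketch along the same lines could look like the following. The object SelectByName, the helper selectColumns, and the in-memory sample data are hypothetical and only stand in for the CSV read in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelectByName {
  // Hypothetical helper: turn each name into a Column and expand the Seq as varargs.
  def selectColumns(df: DataFrame, names: Seq[String]): DataFrame =
    df.select(names.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("select-by-name").getOrCreate()
    import spark.implicits._

    // Stand-in for the CSV read in the question (assumed column names).
    val initialData = Seq(("a", "b", "c")).toDF("col1", "col2", "col3")

    // The column list is now an ordinary value that can change on each run.
    val wanted = if (args.nonEmpty) args.toSeq else Seq("col1", "col2")
    selectColumns(initialData, wanted).show()

    spark.stop()
  }
}
```

Passing a name that does not exist in the DataFrame fails at analysis time, so it may be worth checking the names against df.columns first.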

Related

Scala/Spark - Create Dataset with one column from another Dataset

I am trying to create a Dataset with only one column from Case Class.
Below is the code:
case class vectorData(value: Array[String], vectors: Vector)
def main(args: Array[String]) {
val spark = SparkSession.builder
.appName("Hello world!")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//blah blah and read data etc.
val word2vec = new Word2Vec()
.setInputCol("value").setOutputCol("vectors")
.setVectorSize(5).setMinCount(0).setWindowSize(5)
val dataset = spark.createDataset(data)
val model = word2vec.fit(dataset)
val encoder = org.apache.spark.sql.Encoders.product[vectorData]
val result = model.transform(dataset).as(encoder)
//val output: Dataset[Vector] = ???
}
As shown in last line of the code, I want the output to be only the 2nd column which has Vector type with vectors data.
I tried with:
val output = result.map(o => o.vectors)
But this line is highlighted with the error "No implicit arguments of type: Encoder[Vector]".
How to resolve this?
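I think an implicit Encoder[Vector] in scope should resolve this. Encoders.product only accepts case classes (subtypes of Product), which Vector is not, so a Kryo-based encoder is one option:
implicit val vectorEncoder: org.apache.spark.sql.Encoder[Vector] = org.apache.spark.sql.Encoders.kryo[Vector]
With that in scope,
val output = result.map(o => o.vectors)
should compile. Note that a Kryo encoder stores each vector as a single binary column.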

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to apply this function (and many more) to Dataset[CleanRow] rather than to DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into a DataFrame using the .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and DataFrame, since a DataFrame is just a Dataset[Row]. Plus, I think that since Spark 2.x the APIs for DataFrame and Dataset have been unified, so I expected to be able to pass either of them interchangeably, but that's not the case.
If changing the signature is possible, make the function generic in the element type:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF())
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]
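Applied to the appendDF function from the question, one possible sketch is to accept Dataset[_] on both sides and return an untyped DataFrame. The names DatasetUnion and appendAny are hypothetical, and the result loses the element type, so a final .as[CleanRow] would be needed to get a typed Dataset back.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object DatasetUnion {
  // Hypothetical variant of appendDF: works for any Dataset, including DataFrame (Dataset[Row]).
  def appendAny(left: Dataset[_], right: Dataset[_]): DataFrame = {
    val targetCols = left.columns.toSeq

    // Project a source onto the target columns, filling missing ones with null.
    def expr(sourceCols: Seq[String]): Seq[Column] = targetCols.map {
      case c if sourceCols.contains(c) => col(c)
      case c => lit(null).as(c)
    }

    // Both sides go through expr so the column order matches for the union.
    left.select(expr(targetCols): _*)
      .union(right.select(expr(right.columns.toSeq): _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("dataset-union").getOrCreate()
    import spark.implicits._

    val typed = spark.range(3)                        // Dataset[java.lang.Long], column "id"
    val untyped = Seq((10L, "x")).toDF("id", "name")  // DataFrame

    appendAny(untyped, typed).show()
    spark.stop()
  }
}
```

Because select with Column arguments already returns a DataFrame, no explicit .toDF() call is needed inside the function.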

Spark Scala | create Dataframe Dynamically

I would like to create dataframe names dynamically from a collection.
Please see below:
val set1 = Set("category1","category2","category3")
The following is a function that takes a string x from the set as input and generates the DataFrame accordingly:
def catDfgen(x: String): DataFrame = {
spark.sql(s"select * from table where col1 = '$x'")
}
Now I need help here: not only the DataFrame but also its name should be generated dynamically, in order to achieve:
val category1DF = catDfgen($x)
val category2DF = catDfgen($x)
...etc. Would it be possible to do it using the code below?
set1.map( x => val $x+"DF" = catDfgen($x))
If not please suggest an effective method.
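Suman, I believe the below might help your use case. Scala cannot create val names at runtime, so the DataFrames are kept in a Map keyed by the generated name: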
import org.apache.spark.sql.{DataFrame, SparkSession}
object Test extends App {
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val set1 = Set("category1","category2","category3")
val dfs: Map[String, DataFrame] = set1.map(x =>
(s"${x}DF", spark.sql(s"select * from table where col1 = '$x'").alias(s"${x}DF").toDF())
).toMap
dfs("category1DF").show()
spark.stop()
}

How to add a new method to DataFrame type?

Imagine I have this Scala function that operates upon a Spark dataframe:
object MyClass {
def makeColumnNull(df: DataFrame, columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
I call it like so:
val df = spark.range(0,10).toDF()
val df2 = MyClass.makeColumnNull(df, "id")
That works fine, but it doesn't work in the same fluent manner as Spark's API. What I'd like to do is rewrite my function in a way that enables me to do this:
val df2 = df.makeColumnNull("id")
Can anyone help?
Implicit classes are the way to go; I've used them to extend several Spark classes. So you need this:
package com.mycompany.utils.spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
object DataFrameExtensions {
implicit class DataFrameWrapper(df: DataFrame) {
def makeColumnNull(columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
}
Then import com.mycompany.utils.spark.DataFrameExtensions._ and you will be able to invoke makeColumnNull() on any DataFrame object.
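For illustration, here is a self-contained sketch of the extension together with a call site. The package declaration is dropped so it fits in one file, FluentDemo is a hypothetical driver object, and the column type is looked up with df.schema(name) as a small variation on the code above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object DataFrameExtensions {
  // Wraps any DataFrame so the extra method can be called with dot syntax.
  implicit class DataFrameWrapper(df: DataFrame) {
    def makeColumnNull(columnToMakeNull: String): DataFrame = {
      // Keep the original column type so the schema does not change.
      val colType = df.schema(columnToMakeNull).dataType
      df.withColumn(columnToMakeNull, lit(null).cast(colType))
    }
  }
}

object FluentDemo {
  // Bringing the implicit class into scope enables df.makeColumnNull(...).
  import DataFrameExtensions._

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("fluent-demo").getOrCreate()

    val df = spark.range(0, 10).toDF()
    val df2 = df.makeColumnNull("id")

    df2.printSchema()
    df2.show(3)
    spark.stop()
  }
}
```

The compiler rewrites df.makeColumnNull("id") into new DataFrameWrapper(df).makeColumnNull("id"), so the method is only available where the import is in scope.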

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, loaded from a MySQL table:
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to an RDD with
val sourceRdd = source.rdd
but it is an RDD[Row], and I need an RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc.
Thank you
You can use the Row.mkString(sep: String): String method in a map call like this:
val sourceRdd = source.rdd.map(_.mkString(","))
You can replace the "," separator with whatever you want.
Hope this helps. Best regards.
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6. spark is a SparkSession, a class introduced in Spark 2.0 as the new entry point to SQL functionality; it should be used instead of SQLContext in Spark 2.x.
You can also create your own case classes.
You can also map the rows directly. Here source is a DataFrame, and we pass a partial function to map (the final split yields RDD[Array[String]]; drop it to keep RDD[String]):
val sourceRdd = source.rdd.map { case x : Row => x(0).asInstanceOf[String] }.map(s => s.split(","))
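As a rough end-to-end sketch of the mkString approach, the following builds a small in-memory DataFrame in place of the JDBC table and then keys each line by its first field, as in the question. The object RowsToStrings, the helper toLines, and the sample rows are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowsToStrings {
  // Join every field of each Row with the separator to get one String per row.
  def toLines(df: DataFrame, sep: String = ","): RDD[String] =
    df.rdd.map(_.mkString(sep))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("rows-to-strings").getOrCreate()
    import spark.implicits._

    // Stand-in for the JDBC read in the question (assumed schema: id, name).
    val source = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    val sourceRdd: RDD[String] = toLines(source)

    // Key each line by its first field, which assumes that field is an integer.
    val keyed = sourceRdd.map(rec => (rec.split(",")(0).toInt, rec))
    keyed.collect().foreach(println)

    spark.stop()
  }
}
```

If a field can itself contain the separator, splitting the joined string will not recover the original columns, so in that case it may be safer to read the values from the Row directly.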