udf spark column names - scala

I need to specify a sequence of columns. If I pass two strings, it works fine:
val cols = array("predicted1", "predicted2")
but if I pass a sequence or an array, I get an error:
val cols = array(Seq("predicted1", "predicted2"))
Could you please help me? Many thanks!

You have at least two options here:
Using a Seq[String]:
val columns: Seq[String] = Seq("predicted1", "predicted2")
array(columns.head, columns.tail: _*)
Using a Seq[ColumnName]:
val columns: Seq[ColumnName] = Seq($"predicted1", $"predicted2")
array(columns: _*)

The function signature is def array(colName: String, colNames: String*): Column, which means that it takes one string followed by zero or more additional strings. If you want to use a sequence, do it like this:
array("predicted1", Seq("predicted2"):_*)
From what I can see in the code, there are a couple of overloaded versions of this function, but none of them takes a Seq directly. So converting the sequence into varargs as described should be the way to go.
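As a minimal sketch of the head/tail pattern in plain Scala (no Spark needed): joinNames below is a hypothetical stand-in that only copies the parameter shape of array(colName: String, colNames: String*). It is not part of Spark, and the column names are just sample values.

```scala
object VarargsDemo {
  // Hypothetical stand-in with the same parameter shape as
  // array(colName: String, colNames: String*); not a Spark function.
  def joinNames(colName: String, colNames: String*): String =
    (colName +: colNames).mkString(",")

  val columns: Seq[String] = Seq("predicted1", "predicted2", "predicted3")

  // `: _*` can only expand a Seq into the varargs parameter, so the first
  // element is passed separately as the fixed parameter.
  val viaHeadTail: String = joinNames(columns.head, columns.tail: _*)

  // String* accepts zero or more values, so a single name is also valid.
  val single: String = joinNames("predicted1")
}
```

Note that columns.head throws on an empty sequence, so this pattern assumes at least one column name.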

You can also use the overload def array(cols: Column*): Column when the column names start out as plain strings, so the $ column name notation is not available, and you specifically want a Seq[ColumnName] built from those strings. Here is one way to do that:
import org.apache.spark.sql.ColumnName
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val some_states: Seq[String] = Seq("state_AK","state_AL","state_AR","state_AZ")
val some_state_cols: Seq[ColumnName] = some_states.map(s => symbolToColumn(scala.Symbol(s)))
val some_array = array(some_state_cols: _*)
The example above uses Spark's symbolToColumn method. Alternatively, call the ColumnName constructor directly:
val some_state_cols: Seq[ColumnName] = some_states.map(s => new ColumnName(s))

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A]{
def set[A](ds: Dataset[A]): List[A] = {
ds.collect().toList
}
}
The return type of the method looks correct, but at runtime it fails with an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performative way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark derives columns from def schema, which is also defined on Dataset: columns simply returns the names of the schema's fields.
Maybe you have a DataFrame rather than a Dataset. Try using as to convert the DataFrame into a Dataset, like this:
val year = Year(1,1,1)
val years = Array(year,year).toList
import spark.implicits._
val df = spark.sparkContext
.parallelize(years)
.toDF("day","month","Year")
.as[Year]
println(df.collect().toList)
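For the generic method from the question, one possible sketch is to require an Encoder for the target type and call as[A] before collecting. The names CollectAs, toTypedList and demo are made up for this illustration, and it assumes a local SparkSession is available.

```scala
import org.apache.spark.sql.{DataFrame, Encoder, SparkSession}

case class Year(day: Int, month: Int, Year: Int)

object CollectAs {
  // The context bound asks the caller for an Encoder[A], which as[A]
  // needs in order to turn each Row into an A before collecting.
  def toTypedList[A: Encoder](df: DataFrame): List[A] =
    df.as[A].collect().toList

  // Small demonstration with made-up rows.
  def demo(spark: SparkSession): List[Year] = {
    import spark.implicits._ // supplies Encoder[Year] and toDF
    val df = Seq((1, 1, 2022), (31, 1, 2022)).toDF("day", "month", "Year")
    toTypedList[Year](df)
  }
}
```

Without the as[A] step the elements are still GenericRowWithSchema instances at runtime, which is what the ClassCastException in the question reports.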

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to use this function (and many more) with Dataset[CleanRow] rather than DataFrames. CleanRow is a simple class that defines the names and types of the columns.
My educated guess is to convert the Dataset into a DataFrame using the .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and DataFrame, since a DataFrame is just a Dataset[Row]. Plus, I think that since Spark 2.x the APIs for DataFrame and Dataset have been unified, so I expected to be able to pass either of them interchangeably, but that's not the case.
If changing the signature is possible:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF())
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]
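If the signature cannot be changed, the .toDF() route from the question can be sketched as follows. The fields of CleanRow and the function keepPositiveIds are assumptions made up for this example, since the original does not show them.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical fields; the real CleanRow is not shown in the question.
case class CleanRow(id: Int, name: String)

object DatasetToDataFrame {
  // Stand-in for a library function that only accepts DataFrames.
  def keepPositiveIds(df: DataFrame): DataFrame = df.filter(col("id") > 0)

  def run(spark: SparkSession): List[CleanRow] = {
    import spark.implicits._
    val ds: Dataset[CleanRow] = Seq(CleanRow(0, "a"), CleanRow(1, "b")).toDS()
    // toDF() drops the static element type; as[CleanRow] restores it,
    // which only works if the columns still match the case class.
    val result: Dataset[CleanRow] = keepPositiveIds(ds.toDF()).as[CleanRow]
    result.collect().toList
  }
}
```

The conversion itself is cheap because a DataFrame is only a Dataset[Row] view of the same data; the cost is that the compiler no longer checks the row type inside the DataFrame function.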

Add new columns

Hi, I am trying to add a new column in Spark. I have a data set to which I want to add the percentage of all games made by each publisher.
The data set looks like this:
Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales
val vgdataLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/t1/vgsales-small.csv")
val vgdata = vgdataLines.map(_.split(";"))
def toPercentage(x: Double): Double = {x * 100}
val countPubl = vgdata.map(r => (r(4),1)).reduceByKey(_+_)
val addpercen = countPubl.withColumn("count", toPercentage($"count"/countPubl.count(_._2)))
I used withColumn() to add the new column 'count' and expected the output to look like:
(Ubisoft,3,15.0)
Can anyone tell me what's wrong here?
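You cannot use withColumn with an RDD.
You could do the following instead:
val total = countPubl.map(_._2).sum() // total number of games across all publishers
val addpercen = countPubl.map({case(key, value) => (key, value, toPercentage(value / total))})
Use map to add the calculated value as a new column, and convert the result to a DataFrame if you want: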
import spark.implicits._
val myDf = addpercen.toDF("key","value","myNewColumn")
myDf.show()
Hope it helps.
You cannot use withColumn on an RDD, so convert it to a DataFrame as below and then use it:
val countPubl : DataFrame = vgdata.map(r => (r(4),1)).reduceByKey(_+_).toDF()
If you still want to work with an RDD, convert the result back to an RDD after adding the column with withColumn:
val javaRdd : JavaRDD[Row] = countPubl.withColumn("...",col("...")).toJavaRDD
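As a rough sketch of the DataFrame route, the percentage can be computed with withColumn once the counts live in a DataFrame. The sample rows and the names PublisherShare, publisher and percentage are made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

object PublisherShare {
  def share(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Made-up rows standing in for the publisher field of the CSV.
    val games = Seq("Ubisoft", "Ubisoft", "Ubisoft", "Nintendo").toDF("publisher")
    val total = games.count() // total number of games, as a Long

    games
      .groupBy("publisher")
      .agg(count(lit(1)).as("count"))
      // Multiplying by 100.0 first keeps the division in floating point.
      .withColumn("percentage", col("count") * 100.0 / lit(total))
  }
}
```

Calling PublisherShare.share(spark).show() prints one row per publisher with its count and its share of all games.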

calling a scala method passing each row of a dataframe as input

I have a DataFrame with two columns, created by importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
Created a dataframe importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined lets say like this ::
def getFormattedName(contentName : String, VersionDetails:String): Option[String] = {
Option(contentName+VersionDetails)
}
Now I need to take each row of the DataFrame and call getFormattedName, passing the row's two column values as its arguments.
I tried this and many other variations, but nothing worked ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
Your data has a structured schema, so it can be represented as a DataFrame.
DataFrames support reading CSV input directly.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import spark.implicits._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on a Dataset, you can create a UDF (user-defined function):
def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try:
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that each row of a DataFrame is a Row, not a tuple, so you need to read its elements through the Row accessors (for example row.getString(i)) rather than destructuring it like a tuple.
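A small sketch of reading Row fields by position before calling the method. The wrapper object RowAccess and the helper format are hypothetical names for this illustration; Row instances can be built directly, so the helper can be tried without reading a file.

```scala
import org.apache.spark.sql.Row

object RowAccess {
  def getFormattedName(contentName: String, versionDetails: String): Option[String] =
    Option(contentName + versionDetails)

  // Read both string fields from the Row by position, then call the method.
  def format(row: Row): Option[String] =
    getFormattedName(row.getString(0), row.getString(1))
}
```

With the DataFrame b from the question, this could be applied as b.rdd.map(RowAccess.format), assuming both columns are strings.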

Providing a Sequence as a parameter to a method accepting varargs

In the Spark DataFrame we have a select method that has a varargs second parameter:
@scala.annotation.varargs
def select(col: String, cols: String*): DataFrame =
select((col +: cols).map(Column(_)) : _*)
I would like to invoke the select using a Sequence:
val ProductCols = Seq("prdct_id", "prdct_tag")
The preference would be to invoke as follows:
myDataFrame.select(ProductCols: _*)
However that does not resolve to the method shown above, so the following has been used:
myDataFrame.select(ProductCols.head, ProductCols.tail: _*)
Is there a way to send the ProductCols only once - and the varargs would accept it?
You should wrap each String in a Column, which resolves to the select(cols: Column*) overload:
myDataFrame.select(ProductCols.map(x => col(x)): _*)
This also requires import org.apache.spark.sql.functions._
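A minimal sketch of that approach against a throwaway DataFrame. The object SelectBySeq, the sample row and the extra price column are made up for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SelectBySeq {
  def pick(spark: SparkSession): Array[String] = {
    import spark.implicits._
    // Made-up data; only the first two column names come from the question.
    val df = Seq((1, "a", 9.5)).toDF("prdct_id", "prdct_tag", "price")
    val productCols = Seq("prdct_id", "prdct_tag")

    // Mapping each name to a Column lets the whole Seq be expanded at once
    // into select(cols: Column*), so the Seq is only mentioned one time.
    df.select(productCols.map(col): _*).columns
  }
}
```

The returned array holds the names of the selected columns, which makes it easy to check that only the requested ones were kept.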