Spark select expression available only in R but not in Scala - scala

What is the name of SparkR's selectExpr (https://docs.databricks.com/spark/latest/sparkr/functions/selectExpr.html) when using Spark with Scala?
Edit:
How can I use it in a withColumn statement?
val scalarInput = 123
df.withColumn("foo", selectExpr("""someHiveUDF( ${scalarInput}, column)"""))
fails with
selectExpr not Found

The name is exactly the same, selectExpr:
def selectExpr(exprs: String*): DataFrame
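A quick usage sketch (the column names here are made up):
// selectExpr takes SQL expression strings, one per output column
df.selectExpr("colA", "colB AS renamed", "abs(colC) AS absC")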

expr from org.apache.spark.sql.functions._ is what you want!
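A minimal sketch of how that looks in the withColumn call from the question (assuming someHiveUDF is already registered; note the s interpolator, which the original snippet was missing):
import org.apache.spark.sql.functions.expr

val scalarInput = 123
// expr parses a SQL expression string into a Column, which withColumn accepts
val result = df.withColumn("foo", expr(s"someHiveUDF($scalarInput, column)"))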

Related

Scala _* to select a list of dataframe columns

I have a dataframe and a list of columns like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(("Java", "20000"), ("Python", "100000"))).toDF("language","users_count")
val data_columns = List("language","users_count").map(x=>col(s"$x"))
Why does this work:
df.select(data_columns: _*).show()
But not this?
df.select($"language", data_columns: _*).show()
Gives the error:
error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
And how do I get it to work so that I can use _* to select all the columns in a list while also specifying some other columns in the select?
Thanks!
Update:
Based on @chinayangyangyong's answer below, this is how I solved it:
df.select($"language" +: data_columns: _*)
It is because there is no method on DataFrame with the signature select(col: Column, cols: Column*): DataFrame, but there is one with the signature select(cols: Column*): DataFrame, which is why your first example works.
Interestingly, your second example would work if you were using String to select the columns, since there is a method select(col: String, cols: String*): DataFrame:
val columnNames = List("language", "users_count")
df.select(columnNames.head, columnNames.tail: _*).show()

cast string column to decimal in scala dataframe

I have a dataframe (Scala).
I am using both PySpark and Scala in a notebook.
#pyspark
spark.read.csv(output_path + '/dealer', header = True).createOrReplaceTempView('dealer_dl')
%scala
import org.apache.spark.sql.functions._
val df = spark.sql("select * from dealer_dl")
How do I convert a string column (amount) to decimal in a Scala dataframe?
I tried as below.
%scala
df = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
But I am getting an error as below:
error: reassignment to val
I am used to PySpark and quite new to Scala. I need to do this in Scala to proceed further. Please let me know. Thanks.
In Scala you can't reassign a reference defined as a val; a val is an immutable reference. If you want to reassign, you can declare the reference as a var, but the better solution is not to reassign at all and to bind the result to a new val.
For example:
import org.apache.spark.sql.types.DecimalType

val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9, 2)))
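If you really do need to reassign (for example, while building up a pipeline step by step in a notebook), declare the reference as a var; this sketch reuses the table and column names from the question:
import spark.implicits._
import org.apache.spark.sql.types.DecimalType

var mutableDf = spark.sql("select * from dealer_dl")
// Reassignment compiles here because mutableDf is a var, not a val
mutableDf = mutableDf.withColumn("amount", $"amount".cast(DecimalType(9, 2)))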

spark dataframe filter and select

I have a Spark Scala dataframe and need to filter the elements based on a condition and select the count.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter(col("user") === ("subscriber").select("cnt")
The error I am facing is: value select is not a member of org.apache.spark.sql.Column
Also, for some reason count is Dataset[Row].
Any thoughts to get the count in a single line?
Dataset[Row] is a DataFrame: DataFrame is just a type alias for Dataset[Row], so no need to worry, it is a dataframe.
See this for a better understanding: Difference between DataFrame, Dataset, and RDD in Spark
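A tiny sketch of why that is: DataFrame is a type alias for Dataset[Row] (declared in the org.apache.spark.sql package object), so the two types are interchangeable:
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// A Dataset[Row] can be passed wherever a DataFrame is expected, and vice versa
def takesDataFrame(df: DataFrame): Long = df.count()
def alsoAccepts(ds: Dataset[Row]): Long = takesDataFrame(ds)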
Regarding select is not a member of org.apache.spark.sql.Column: it is purely a compile error.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter (col("user") === ("subscriber"))
.select("cnt")
will work, since you were missing the ) that closes the filter call.
You are missing ")" before .select; please check the code below.
The Column class doesn't have a .select method; you have to invoke select on a DataFrame.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter(col("user") === "subscriber").select("cnt")
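If all you need is the number of rows for that one user, a simpler single-line sketch (assuming the same df and column names as the question) skips the groupBy entirely:
// count() on the filtered DataFrame returns the row count directly as a Long
val subscriberCount: Long = df.filter(col("user") === "subscriber").count()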

scala: select column where not contains elements into dataframe

I have this line of code that should create a dataframe from a list of columns that do not contain a string. I tried this, but it doesn't work:
val exemple = hiveObj.sql("show tables in database").select("tableName")!==="ABC".collect()
Try using the filter method:
import org.apache.spark.sql.functions._
import spark.implicits._
val exemple = hiveObj.sql("your query here").filter($"columnToFilter" =!= "ABC").show
NOTE: the inequality operator =!= is only available for Spark 2.0.0+. If you're using an older version, you must use !==. You can see the documentation here.
If you need to filter several columns you can do so:
.filter($"columnToFilter" =!= "ABC" and $"columnToFilter2" =!= "ABC")
Another alternative answer to my question:
val exemple1 = hiveObj.sql("show tables in database").filter(!$"tableName".contains("ABC")).show()

How to set column names to toDF() function in spark dataframe using a string array?

For example,
val columns=Array("column1", "column2", "column3")
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns)
How can I set the column names using a string Array?
Is it possible to mention data types inside toDF()?
toDF() takes a repeated parameter of type String, so you can use the _* type annotation to pass a sequence:
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns: _*)
For more on repeated parameters - see section 4.6.2 in the Scala Language Specification.
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF("column1", "column2", "column3")
toDF() takes comma-separated strings.
toDF() is defined in Spark documentation as:
def toDF(colNames: String*): DataFrame
And so you need to turn your array into varargs, as also described here. That means you need to do the following:
val columns=Array("column1", "column2", "column3")
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns: _*)
(Add : _* to columns in toDF)
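As for the second part of the question: toDF() only takes column names, not types. One way to specify data types explicitly (a sketch, not covered by the answers above) is to build a StructType schema and use createDataFrame instead:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema matching the example data, with an explicit type for each column
val schema = StructType(Seq(
  StructField("column1", IntegerType, nullable = false),
  StructField("column2", StringType, nullable = true),
  StructField("column3", ArrayType(IntegerType), nullable = true)
))

val rows = sc.parallelize(Seq(
  Row(1, "example1", Seq(0, 2, 5)),
  Row(2, "example2", Seq(1, 20, 5))
))
val typedDf = spark.createDataFrame(rows, schema)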