Scala Spark DataFrame Hive conversion

Hi, in a Hive query I am using the syntax below for displaying decimal values, e.g.
cast(column as decimal(10,6))
How do I express the same cast on a DataFrame?
$"column".cast("decimal(10,6)")
Will that work?

It will. It is totally legit to cast it like that:
df.withColumn("new_column_name", $"old_column_name".cast("decimal(10,6)"))
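If you prefer, you can also pass the Catalyst type object instead of the type string (same column names as above):
import org.apache.spark.sql.types.DecimalType
df.withColumn("new_column_name", $"old_column_name".cast(DecimalType(10, 6)))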

Related

Column value not properly passed to hive udf spark scala

I have created a Hive UDF like below:
import org.apache.hadoop.hive.ql.exec.UDF

class customUdf extends UDF {
  def evaluate(col: String): String = {
    col + "abc"
  }
}
I then registered the UDF in the SparkSession with:
sparksession.sql("""CREATE TEMPORARY FUNCTION testUDF AS 'testpkg.customUdf'""");
When I try to query the Hive table with the query below from Scala code, it neither makes progress nor throws an error:
SELECT testUDF(value) FROM t;
However, when I pass a string literal from the Scala code as below, it works:
SELECT testUDF('str1') FROM t;
I am running the queries via sparksession. I tried with GenericUDF, but I am still facing the same issue. This happens only when I pass a Hive column. What could be the reason?
Try referencing your jar from HDFS:
create function testUDF as 'testpkg.customUdf' using jar 'hdfs:///jars/customUdf.jar';
I am not sure about the implementation of UDFs in Scala, but when I faced a similar issue in Java I noticed a difference: if you plug in a literal
select udf("some literal value")
then it is received by the UDF as a String.
But when you select from a Hive table
select udf(some_column) from some_table
you may get what's called a LazyString, for which you need to call getObject to retrieve the actual value. I am not sure whether Scala handles these lazy values automatically.
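If that turns out to be the problem, a GenericUDF that resolves its argument through an ObjectInspector should treat lazily deserialized column values and literals the same way. Below is a rough sketch, not the original poster's code; the class and package names would still have to match whatever is registered in CREATE FUNCTION:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, PrimitiveObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class customGenericUdf extends GenericUDF {
  private var inputOI: PrimitiveObjectInspector = _

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1) throw new UDFArgumentException("expects exactly one argument")
    inputOI = arguments(0).asInstanceOf[PrimitiveObjectInspector]
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val raw = arguments(0).get()  // may be a lazy object such as LazyString
    if (raw == null) null
    else inputOI.getPrimitiveJavaObject(raw).toString + "abc"
  }

  override def getDisplayString(children: Array[String]): String =
    "customGenericUdf(" + children.mkString(", ") + ")"
}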

How to get the SQL representation for the query logic of a (derived) Spark DataFrame?

One can convert a raw SQL string into a DataFrame. But is it also possible the other way around, i.e., get the SQL representation for the query logic of a (derived) Spark DataFrame?
// Source data
val a = Seq(7, 8, 9, 7, 8, 7).toDF("foo")
// Query using DataFrame functions
val b = a.groupBy($"foo").agg(count("*") as "occurrences").orderBy($"occurrences")
b.show()
// Convert a SQL string into a DataFrame
val sqlString = "SELECT foo, count(*) as occurrences FROM a GROUP BY foo ORDER BY occurrences"
a.createOrReplaceTempView("a")
val c = currentSparkSession.sql(sqlString)
c.show()
// "Convert" a DataFrame into a SQL string
b.toSQLString() // Error: This function does not exist.
It is not possible to "convert" a DataFrame into an SQL string because Spark does not know how to write SQL queries and it does not need to.
I find it useful to recall how DataFrame code or an SQL query gets handled by Spark. This is done by Spark's Catalyst optimizer, which goes through four transformational phases: analysis, logical optimization, physical planning, and code generation.
In the first phase (analysis), the Spark SQL engine generates an abstract syntax tree (AST) for the SQL or DataFrame query. This tree is the main data type in Catalyst (see section 4.1 in the white paper Spark SQL: Relational Data Processing in Spark) and it is used to create the logical plan and eventually the physical plan. You get a representation of those plans if you use the explain API that Spark offers.
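As a minimal sketch, reusing a and b from the question and assuming currentSparkSession is the active SparkSession: there is no toSQLString, but the plans Catalyst builds for a DataFrame can be inspected through explain or the queryExecution field.
import currentSparkSession.implicits._
import org.apache.spark.sql.functions.count

val a = Seq(7, 8, 9, 7, 8, 7).toDF("foo")
val b = a.groupBy($"foo").agg(count("*") as "occurrences").orderBy($"occurrences")

b.explain(true)                      // prints the parsed, analyzed, optimized and physical plans
println(b.queryExecution.analyzed)   // just the analyzed logical plan as a tree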
Although it is clear to me what you mean with "One can convert a raw SQL string into a DataFrame", it helps to be more precise. We are not really converting an SQL string into a DataFrame (hence you put quotation marks around that word yourself); rather, SQL is a syntax Spark can parse to understand your intentions. In addition, you cannot just type in any SQL query, as it could still fail in the analysis phase when it is checked against the catalog. So the SQL string is just an agreement on how Spark allows you to give instructions. The SQL query gets parsed, transformed into an AST (as described above) and, after going through the other three phases, ends up as RDD-based code. The result of executing SQL through the sql API is a DataFrame, which you can easily turn into an RDD with df.rdd.
Overall, there is no need for Spark to render any code, in particular DataFrame code, as SQL syntax that you could then get out of Spark. The AST is the internal abstraction, and Spark does not convert DataFrame code into an SQL query first; it converts the DataFrame code directly into an AST.
No. There is no method that can get the SQL query from a DataFrame.
You will have to create the query yourself by looking at all the filters and selects you used to build the DataFrame.

Datatype conversion of Parquet using Spark SQL - dynamically, without specifying a column name explicitly

I am looking for a way to handle the data type conversion dynamically. I am loading data into a DataFrame using a Hive SQL query and then writing it out to a Parquet file. Hive is unable to read some of the data types, and I want to convert the decimal data types to Double. Instead of specifying each column name separately, is there any way to handle the data types dynamically? Let's say my DataFrame has 50 columns, of which 8 are decimals, and I need to convert all 8 of them to the Double data type without specifying the column names. Can we do that directly?
There is no direct way to convert the data types, but here are some options:
Either cast those columns in the Hive query,
or
create/use a case class with the data types you require, populate it, and use it to generate the Parquet file,
or
read the data types from the Hive query metadata and generate the casts dynamically, as in option one or two.
There are two options:
1. Use the schema from the dataframe and dynamically generate the cast statements (see the sketch below)
2. Use the create table ... select * option with Spark SQL
This is already answered, and this post has the details, with code.
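As a rough sketch of option 1 (df stands in for the DataFrame loaded from the Hive query, and the output path is hypothetical), you can walk the schema and cast every DecimalType column to DoubleType without naming the columns:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

val converted = df.schema.fields.foldLeft(df) { (acc, field) =>
  field.dataType match {
    case _: DecimalType => acc.withColumn(field.name, col(field.name).cast(DoubleType))
    case _              => acc
  }
}
converted.write.parquet("/tmp/output_path")   // hypothetical output path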

Column having list datatype : Spark HiveContext

The following code does aggregation and creates a column with a list datatype:
groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) AS column_name_3")
)
So it seems it is possible to have 'list' as a column datatype in a dataframe.
I was wondering if I can write a UDF that returns a custom datatype, for example a Python dict?
The list is a representation of Spark's Array datatype. You can try using the Map datatype (pyspark.sql.types.MapType).
An example of something which creates it is pyspark.sql.functions.create_map, which creates a map from several columns.
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.
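For illustration, a minimal Scala sketch of building a MapType column (create_map in PySpark corresponds to functions.map on the Scala side; df and the column names k and v are hypothetical):
import org.apache.spark.sql.functions.{col, map}

val withMap = df.withColumn("as_map", map(col("k"), col("v")))
withMap.printSchema()   // as_map is a MapType column: map<key_type,value_type>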

Applying function to Spark Dataframe Column

Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in Scala,
def round_tenths_place(un_rounded: Double): Double = {
  BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
}
and apply it to one column of a dataframe? Kind of like what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to an RDD, then selecting from the RDD of rows to get the right field and mapping the function across all of the values, yeah? And also something more succinct than creating a SQL table and then doing this with a Spark SQL UDF?
You can define a UDF as follows:
val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
"bid_price_bucket", round_tenths_place_udf($"bid_price"))
although the built-in round expression uses exactly the same logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
See also the following:
Updating a dataframe column in spark
How to apply a function to a column of a Spark DataFrame?