I have a SchemaRDD created from a Hive query:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sqlContext.sql("Select * from mytime")
My RDD contains the following schema
StructField(id,StringType,true)
StructField(t,TimestampType,true)
We have our own custom database and want to save the TimestampType as a string, but I could not find a way to extract the value and save it as a string.
Can you help? Thanks!
What happens if you change your query to:
SELECT id, cast(t as STRING) from mytime
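If that resolves it, the same cast can also be issued through the HiveContext in Scala; a minimal sketch, reusing the sqlContext and mytime table from the question:
val rdd = sqlContext.sql("SELECT id, cast(t as STRING) as t FROM mytime")
rdd.printSchema() // t should now appear as StringType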
Related
I am reading data from the Store table in Snowflake. I want to pass the date from the dataframe maxdatefromtbl to my query in Spark SQL to filter records.
This condition (s"CREATED_DATE!='$maxdatefromtbl'") is not working as expected.
var retail = spark.read.format("snowflake").options(options).option("query","Select MAX(CREATED_DATE) as CREATED_DATE from RSTORE").load()
val maxdatefromtbl = retail.select("CREATED_DATE").toString
var retailnew = spark.read.format("snowflake").options(options).option("query","Select * from RSTORE").load()
var finaldataresult = retailnew.filter(s"CREATED_DATE!='$maxdatefromtbl'")
Select a single value from the retail dataframe to use in the filter.
val maxdatefromtbl = retail.select("CREATED_DATE").collect().head.getString(0)
var finaldataresult = retailnew.filter(col("CREATED_DATE") =!= maxdatefromtbl)
The type of retail.select("CREATED_DATE") is DataFrame, and DataFrame.toString returns the schema rather than the value of the single row you have. Please see the following example from a Spark shell.
scala> val s = Seq(1, 2, 3).toDF()
scala> s.select("value").toString
res0: String = [value: int]
In the first line of the code snippet above, collect() wraps the dataframe, which in your case has a single row, in an array; head takes the first element of that array, and getString(0) gets the value of the cell at index 0 as a String. Please see the DataFrame and Row documentation pages for more information.
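Continuing the same spark-shell example, collecting the rows and indexing into the result returns the actual value rather than the schema; a small sketch (getInt is used here because the toy column is an Int, whereas your CREATED_DATE would use getString or getDate depending on its type):
scala> s.select("value").collect().head.getInt(0)
res1: Int = 1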
I have a dataframe (Scala).
I am using both PySpark and Scala in a notebook.
#pyspark
spark.read.csv(output_path + '/dealer', header = True).createOrReplaceTempView('dealer_dl')
%scala
import org.apache.spark.sql.functions._
val df = spark.sql("select * from dealer_dl")
How do I convert a string column (amount) into a decimal in a Scala dataframe?
I tried the following:
%scala
df = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
But I am getting an error as below:
error: reassignment to val
I am used to pyspark and quite new to scala. I need to do by scala to proceed further. Please let me know. Thanks.
In Scala you can't reassign a reference defined as a val, because a val is an immutable reference. If you want to reassign, you can declare the reference as a var, but a better solution is not to reassign to the same name at all and to use another val instead.
For example:
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
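Note that DecimalType comes from org.apache.spark.sql.types, so the %scala cell needs that import as well; a minimal sketch, assuming the dealer_dl view registered in the PySpark cell above:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType
import spark.implicits._ // for the $"..." column syntax

val df = spark.sql("select * from dealer_dl")
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9, 2)))
dfWithDecimalAmount.printSchema() // amount: decimal(9,2)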
I need to insert some values into my Hive table using Spark SQL. I'm using the code below.
val filepath:String = "/user/usename/filename.csv"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
First I tried using INSERT INTO ... VALUES, but then I found this feature is not available in Spark SQL.
val ds = sparksession.sql(s"insert into mytable(filepath,filename,Start_Time) values('${filepath}','${fn}','${e}')")
Is there any other way to insert these values using Spark SQL? (mytable is empty; I need to load this table every day.)
You can directly use the Spark DataFrame write API to insert data into the table.
If you do not have a DataFrame yet, first create one using spark.createDataFrame(), then write the data as follows:
df.write.insertInto("name of hive table")
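For example, a minimal sketch of that approach, assuming the target Hive table dbname.tablename has three string columns (file_path, filename, Start_Time) in that order, since insertInto writes by position:
import java.time.LocalDateTime

val filepath: String = "/user/usename/filename.csv"
val fn = filepath.split("/")(3) // filename
val e = LocalDateTime.now().toString // timestamp as a string

val df = sparksession.createDataFrame(Seq((filepath, fn, e)))
  .toDF("file_path", "filename", "Start_Time")
df.write.insertInto("dbname.tablename")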
Hi, the code below worked for me. Since I need to use variables in my dataframe, I first created a dataframe from the selected data and then saved it into the Hive table using df.write.insertInto(tablename).
val filepath:String = "/user/usename/filename.csv"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
val df1=sparksession.sql(s" select '${filepath}' as file_path,'${fn}' as filename,'${e}' as Start_Time")
df1.write.insertInto("dbname.tablename")
Description
When I try to select a column that is cast to unix_timestamp and then to timestamp from a dataframe, I get an AnalysisException. See the link below.
However, when I concatenate two columns, cast the combination to unix_timestamp and then to the timestamp type, and select that from the dataframe, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder().
appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String, time:String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13", "11:11:11")).toDF()
//solution.show()
//column modification
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
An AnalysisException: can't find unix_timestamp(concat(....)) in mydf
Actual:
+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
| 2013-09-16 00:00:...|
These do not seem to be disparate cases. In the erroneous case, you had a new dataframe with changed column names. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the select_df dataframe's column name has changed from date to something like CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP), so year($"date") can no longer be resolved.
While in the case mentioned above, you are just defining a new Column when you write:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
And then you use this column to select from your original dataframe, so it gives the expected results.
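For illustration, one way to make the first case work is to alias the casted column so it keeps a predictable name; a sketch reusing mydf from above (the alias date_ts is an arbitrary name):
val select_df: DataFrame = mydf.select(
  unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).alias("date_ts"))
select_df.select(year($"date_ts")).show()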
Hope this makes things clearer.
I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD containing just the transformed column, not a DataFrame.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
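For completeness, here is one possible getTimestamp to plug into either option; a sketch that assumes the input strings look like "yyyy-MM-dd HH:mm", which you would replace with your actual format:
import java.text.SimpleDateFormat
import java.sql.Timestamp

def getTimestamp: (String => Timestamp) = { s =>
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm") // assumed input pattern
  new Timestamp(format.parse(s).getTime)
}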
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
.withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
.cast("timestamp"))
Note: for Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
.withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L)
.cast("timestamp"))
Edit: Added udf option