I have an XML schema of the following structure.
I am interested in _EventStartDate and E; these values have to be converted into an AVRO record, with the date as the key and E as the value.
To convert this I use the Databricks XML parser, then create a temporary table and repartition it on date.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._
import sqlContext.implicits._  // needed for the $"colName" syntax; sqlContext is predefined in spark-shell

// Parse the XML into a DataFrame and register it as a temp table
val xmlDF = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Events").load("/hdfsPath").cache
xmlDF.registerTempTable("event_tbl")

// Select the two columns of interest, repartition by date, and write out
val xmlDF_Tbl = sqlContext.sql("SELECT `#EventStartDate` AS eventDate, E AS event FROM event_tbl")
val xmlDF_Tbl_Part = xmlDF_Tbl.repartition($"eventDate").rdd.map(s => (s(0), s(1).toString))
xmlDF_Tbl_Part.saveAsTextFile("path to Hdfs")
I get output in the following format: 2015,WrappedArray([E1], [E2])
I want output in the following format: 2015,E1###E2###E3 and so on.
The DF looks something like this:
[#BL: string, #EventStartDate: string, #MaterialNumber: bigint, #SerialNumber: bigint, #UTCOfs: bigint, E:array[struct<#C:string,#EID:bigint,#Hst:string,#L:bigint,#MID:string,#S:string,#Src:string,#T:string,#Tgt:string,#Usr:string,#VALUE:string>]]
Now how do I convert this WrappedArray to a delimited string?
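One way that might work (a minimal sketch reusing xmlDF_Tbl from above; it assumes each element of E should be rendered by joining its struct fields with ","): instead of calling toString on the whole array, join the structs with "###" before saving.

import org.apache.spark.sql.Row

// Sketch: build "E1###E2###..." per row instead of the WrappedArray's toString
val delimited = xmlDF_Tbl.repartition($"eventDate").rdd.map { r =>
  val events = r.getSeq[Row](1).map(_.mkString(",")).mkString("###")
  (r.getString(0), events)
}
delimited.saveAsTextFile("path to Hdfs")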
I have a Spark DataFrame where a few columns have different date formats.
To handle this I have written the code below to keep a consistent format across all the date columns.
Since the date format of these columns may change from load to load, I have defined a set of candidate date formats in dt_formats.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, date_format, to_timestamp}

def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
  coalesce(formats.map(fmt => to_timestamp(s, fmt)): _*)
}
val dt_formats = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")
val newDF = df.withColumn("ETD1", date_format(to_timestamp_multiple($"ETD", dt_formats).cast("date"), "yyyy-MM-dd")).drop("ETD").withColumnRenamed("ETD1", "ETD")
But here I have to create a new column, then drop the old column, then rename the new column. That makes the code unnecessarily clumsy, so I want to get rid of this pattern.
I am trying to implement similar functionality with the Scala expression below, but it throws org.apache.spark.sql.catalyst.parser.ParseException, and I am unable to identify what change I should make to get it to work.
val CleansedData = rawDF.selectExpr(rawDF.columns.map { x =>
  x match {
    case "ETA" => s"""date_format(to_timestamp_multiple($x, dt_formats).cast("date"), "yyyy-MM-dd") as ETA"""
    case _ => x
  }
}: _*)
Hence seeking help.
Thanks in advance.
Create a UDF to use with select. The select method takes columns and produces another DataFrame.
Also, instead of using coalesce, it might be more straightforward simply to build a parser that handles all of the formats. You can use DateTimeFormatterBuilder for this.
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.sql.Date
import scala.util.Try
import org.apache.spark.sql.functions.{col, udf}

val dtFormatStrings: Seq[String] = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")
// use foldLeft with appendOptional method, which for each format,
// returns a new builder with that additional possible format
val initBuilder = new DateTimeFormatterBuilder()
val builder: DateTimeFormatterBuilder = dtFormatStrings.foldLeft(initBuilder)(
  (b: DateTimeFormatterBuilder, s: String) => b.appendOptional(DateTimeFormatter.ofPattern(s)))
val formatter = builder.toFormatter()
// Create the UDF, which just takes any function returning a
// sql-compatible type (java.sql.Date, here)
def toTimeStamp2(dateString: String): Date = {
  val dateTry: Try[Date] = Try(java.sql.Date.valueOf(LocalDate.parse(dateString, formatter)))
  dateTry.toOption.getOrElse(null)
}
val timeConversionUdf = udf(toTimeStamp2 _)
// example DF and new DF
val df = Seq(("05/08/20"), ("2020-04-03"), ("unparseable")).toDF("ETD")
df.select(timeConversionUdf(col("ETD"))).toDF("ETD2").show
Output:
+----------+
| ETD2|
+----------+
|2020-05-08|
|2020-04-03|
| null|
+----------+
Note that unparseable values end up null, as shown.
Try withColumn(...) with the same column name and coalesce, as below:
import org.apache.spark.sql.functions.{coalesce, to_date}

val dt_formats = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")
val newDF = df.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)): _*))
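For reference, a quick usage sketch of this approach on a small hypothetical frame (assuming a SparkSession named spark and the dt_formats Seq defined above):

// Hypothetical example frame; unmatched formats simply fall through coalesce
import spark.implicits._

val sample = Seq("05/08/20", "2020-04-03", "03-Apr-2020").toDF("ETD")
sample.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)): _*)).show()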
I'm using Spark 2.1.2.
I'm working with datetime data and would like to get the year from a date string using Spark SQL functions.
The code I use is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder().
appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13")).toDF()
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
I expect to get the year of the date, which would be 13 in the example above.
Actual: org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP)];;
'Project [year('date) AS year(date)#11]
case class Person(id: Int, date: String)
val mydf = Seq(Person(1,"9/16/13")).toDF
val solution = mydf.withColumn("year", year(to_timestamp($"date", "MM/dd/yy")))
scala> solution.show
+---+-------+----+
| id| date|year|
+---+-------+----+
| 1|9/16/13|2013|
+---+-------+----+
It looks like year gives you four digits for the year, not two. I'm leaving the string truncation as a home exercise for you :)
Actual: org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP)];; 'Project [year('date) AS year(date)#11]
The reason for the exception is that you try to access the "old" date column (in select(year($"date"))), which is no longer available after the earlier select(unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType)).
You could use alias or as to change the weird-looking auto-generated name into something else like date again, and that would work.
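For example, a minimal sketch of that fix, reusing mydf from the question, might look like this:

// Alias the converted column back to "date" so year($"date") can resolve it
val select_df = mydf.select(
  unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).as("date"))
select_df.select(year($"date")).show()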
I am trying to insert a dataframe into a Hive table using the following code:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val empfile = sc.textFile("empfile")
val empdata = empfile.map(p => p.split(","))
case class empc(id:Int, name:String, salary:Int, dept:String, location:String)
val empRDD = empdata.map(p => empc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.registerTempTable("emptab")
I have a table in Hive with following DDL:
# col_name data_type comment
id int
name string
salary int
dept string
# Partition Information
# col_name data_type comment
location string
I'm trying to insert the temporary table into the hive table as follows:
hiveCont.sql("insert into parttab select id, name, salary, dept from emptab")
This is giving an exception:
org.apache.spark.sql.AnalysisException: Table not found: emptab ('emptab' is the temp table created from the DataFrame above).
Here I understand that the HiveContext runs the query against Hive from Spark, doesn't find the table there, and hence throws the exception. But I don't understand how I can fix this issue. Could anyone tell me how to fix this?
registerTempTable("emptab") : This line of code is used to create a table temporary table in spark, not in hive.
For storing data to hive, you have to first create a table in hive explicitly. For storing a table value data to hive table, please use the below code:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val empfile = sc.textFile("empfile")
val empdata = empfile.map(p => p.split(","))
case class empc(id:Int, name:String, salary:Int, dept:String, location:String)
val empRDD = empdata.map(p => empc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.write.saveAsTable("emptab")
You are implicitly converting the RDD into a DataFrame, but you are not importing the implicit objects, so the RDD is not getting converted into a DataFrame. Include the line below in your imports.
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
Also, the case classes must be defined at the top level; they cannot be nested. So your final code should look like this:
import org.apache.spark._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
// sqlContext is predefined in spark-shell; outside the shell use hiveCont.implicits._ instead
import sqlContext.implicits._

val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)

case class Empc(id: Int, name: String, salary: Int, dept: String, location: String)

val empFile = sc.textFile("/hdfs/location/of/data/")
val empData = empFile.map(p => p.split(","))
// trim whitespace before converting String fields to Int
val empRDD = empData.map(p => Empc(p(0).trim.toInt, p(1), p(2).trim.toInt, p(3), p(4)))
val empDF = empRDD.toDF()
empDF.registerTempTable("emptab")
Also, trim all whitespace if you are converting a String to an Integer. I have included that in the code above as well.
I have txt files with semi-structured data, and I have to write them to Cassandra through spark-cassandra. But first I want to parse them in plain Scala.
My code:
import java.io.File
import scala.io.Source

object parser extends App {
  val path = "somepath"
  val fileArray = new File(path).listFiles()
  for (file <- fileArray)
    for (line <- Source.fromFile(file).getLines())
      println(line) // TODO: parse each line here
}
So how can I parse each line and get the values to put into Cassandra?
For example, each line contains (int, text, timestamp, int, text, char, int, text).
Do I have to split each line on the delimiter (" ") and put the values into a tuple, or convert each of them into a readable format?
What you could probably do is handle it as a CSV file with delimiter " ", and let Spark do the parsing for you.
val spark = SparkSession.builder.config(conf).getOrCreate()
val dataFrame = spark.read.option("inferSchema", "true").option("delimiter", " ").csv(csvfilePath)
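Once you have the DataFrame, writing it to Cassandra with the Spark Cassandra connector could look roughly like the sketch below. The connector must be on the classpath, and "my_keyspace" and "my_table" are placeholders for your own schema.

// Sketch: write the parsed DataFrame to an existing Cassandra table
dataFrame.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .mode("append")
  .save()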
I have a CSV file with a datetime column: "2011-05-02T04:52:09+00:00".
I am using Scala; the file is loaded into a Spark DataFrame and I can use Joda-Time to parse the date:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
// note: MM = month, mm = minute, so the month part of the pattern must use MM
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'kk:mm:ssZ")
I would like to create new columns based on the datetime field for time series analysis.
In a DataFrame, how do I create a column based on the value of another column?
I notice DataFrame has the following function: df.withColumn("dt", column). Is there a way to create a column based on the value of an existing column?
Thanks
import java.util.Date
import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

// MM = month, mm = minute; the pattern must use MM for the month part
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'kk:mm:ssZ")
val dtFunc: (String => Date) = (arg1: String) => DateTime.parse(arg1, d).toDate
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))
callUDF and col are included in functions, as the imports show.
The dt_string inside col("dt_string") is the original column name in your df, i.e. the column you want to transform from.
Alternatively, you could replace the last statement with:
val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))