Is it possible to manipulate timestamps/dates in Spark SQL (1.3.0)?

Since I'm new to Spark (1.3.0), I'm trying to figure out what it can do, especially Spark SQL.
I'm stuck on timestamp/date formats and can't get past this obstacle when it comes to operating on these datatypes.
Are there any operations available for these datatypes?
All I can do at the moment is a simple cast from string to timestamp:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Log(visitor: String, visit_date: String, page: String)
val log = (triple.map(p => Log(p._1,p._2,p._3))).toDF()
log.registerTempTable("logs")
val logSessions= sqlContext.sql("SELECT visitor" +
" ,cast(visit_date as timestamp)" +
" ,page" +
" FROM logs"
)
logSessions.foreach(println)
I'm trying to use different "custom SQL" operations on this timestamp (cast from string), but I get nothing but errors.
For example: can I add 30 minutes to my timestamps? How?
Maybe I'm missing something but I can't find any documentation on this topic.
Thanks in advance!
FF

I am looking for the same thing. Spark 1.5 has added some built-in functions:
https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
But for earlier versions, or for anything more specific, it looks like you need to implement a UDF.
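As a minimal sketch of the UDF route for the "add 30 minutes" case: the object and function names below (TimestampShift, addMinutes, add_minutes) are made up for illustration, and the registration lines are left as comments because they assume the sqlContext and the logs table from the question. The arithmetic itself is plain java.sql.Timestamp handling and drops any sub-millisecond precision.

```scala
import java.sql.Timestamp

object TimestampShift {
  // Shift a timestamp by a number of minutes (negative values move back in time).
  // Works on epoch milliseconds, so sub-millisecond nanos are not preserved.
  def addMinutes(ts: Timestamp, minutes: Int): Timestamp =
    new Timestamp(ts.getTime + minutes * 60L * 1000L)
}

// Assumed usage with the SQLContext from the question (not verified on 1.3.0):
// sqlContext.udf.register("add_minutes", TimestampShift.addMinutes _)
// sqlContext.sql(
//   "SELECT visitor, add_minutes(cast(visit_date as timestamp), 30), page FROM logs")
```

On Spark 1.5 and later the built-in functions from the linked post are likely a better fit than a hand-written UDF.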

Adding a column to a Spark DataFrame

Hi, I am trying to add a column to my Spark DataFrame, with its value calculated from an existing column. I have written the code below.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always fails with "error: unclosed character literal". Can someone please guide me on how to add this new column or fix the existing code?
The syntax is incorrect in many places. First, I suggest you look at a few Spark SQL examples online and at the org.apache.spark.sql.functions API documentation, because your uses of WHEN, CONCAT, and IN are all incorrect.
Scala strings are enclosed in double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")
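Putting those points together, here is one possible rewrite of the whole expression with the functions API. It is a sketch under assumptions: dt1 is taken to be a string in dd-MM-yyyy form, the intent is read as a label such as 2018-19 for January to March and 2019-20 otherwise, and the names FiscalYear, withFiscalYear, and yy are invented for this example.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

object FiscalYear {
  // Adds "new_date", e.g. "2018-19" for 15-02-2019 and "2019-20" for 15-07-2019.
  // Assumes the input has a string column "dt1" formatted as dd-MM-yyyy.
  def withFiscalYear(df: DataFrame): DataFrame = {
    val d: Column = to_date(from_unixtime(unix_timestamp(col("dt1"), "dd-MM-yyyy")))
    val y: Column = year(d)

    // Last two digits of a four-digit year column.
    def yy(c: Column): Column = substring(c.cast("string"), 3, 2)

    df.withColumn(
      "new_date",
      when(month(d).isin(1, 2, 3), concat((y - 1).cast("string"), lit("-"), yy(y)))
        .otherwise(concat(y.cast("string"), lit("-"), yy(y + 1)))
    )
  }
}
```

With the DataFrame from the question this would be called as FiscalYear.withFiscalYear(df1). Binding the parsed date to a value once avoids repeating the long to_date(...) chain in every branch.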

Scala/Spark determine the path of external table

I have an external table on a GCS (gs) bucket, and to implement some compaction logic I want to determine the full path at which the table was created.
val tableName="stock_ticks_cow_part"
val primaryKey="key"
val versionPartition="version"
val datePartition="dt"
val datePartitionCol=new org.apache.spark.sql.ColumnName(datePartition)
import spark.implicits._
val compactionTable = spark.table(tableName).withColumnRenamed(versionPartition, "compaction_version").withColumnRenamed(datePartition, "date_key")
compactionTable. <code for determining the path>
Let me know if anyone knows how to determine the table path in scala.
I think you can use .inputFiles, which according to the documentation "Returns a best-effort snapshot of the files that compose this Dataset".
Be aware that this returns an Array[String], so you should loop through it to get all information you're looking for.
So actually just call
compactionTable.inputFiles
and look at each element of the Array
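If the goal is the table's root directory rather than the individual files, one rough way to get there from that array is to take the longest directory prefix the files share. The helper below is only a sketch (TableRoot and commonDir are invented names), and it has an obvious limit: if every file sits in the same partition, the result includes that partition directory.

```scala
object TableRoot {
  // Longest common directory prefix of a set of file URIs.
  // Returns "" for an empty input.
  def commonDir(files: Seq[String]): String = {
    // Split each URI on "/" and drop the last segment (the file name).
    val dirs = files.map(_.split("/").dropRight(1))
    if (dirs.isEmpty) ""
    else
      dirs
        .reduce((a, b) => a.zip(b).takeWhile { case (x, y) => x == y }.map(_._1))
        .mkString("/")
  }
}

// Assumed usage with the table from the question:
// val root = TableRoot.commonDir(compactionTable.inputFiles.toSeq)
```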
Here is the correct answer:
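import org.apache.spark.sql.catalyst.TableIdentifier
// catalog is presumably the session catalog (spark.sessionState.catalog); schema is the database name
lazy val tblMetadata = catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
// location is a java.net.URI here; getPath returns only the path part, without scheme or bucket
lazy val s3location: String = tblMetadata.location.getPath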
You can use SQL commands SHOW CREATE TABLE <tablename> or DESCRIBE FORMATTED <tablename>. Both should return the location of the external table, but they need some logic to extract this path...
See also How to get the value of the location for a Hive table using a Spark object?
Use the DESCRIBE FORMATTED SQL command and collect the path back to the driver.
In Scala:
val location = spark.sql("DESCRIBE FORMATTED table_name").filter("col_name = 'Location'").select("data_type").head().getString(0)
The same in Python:
location = spark.sql("DESCRIBE FORMATTED table_name").filter("col_name = 'Location'").select("data_type").head()[0]

How can I Count the Number of Rows of the JSON File?

My JSON file below contains six rows:
[
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
"apps":[],
"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
},
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]
The JSON looks like the below records:
JSON
0
1
2
3
4
5
Required Output:
Count
6
OK, you are in Spark, so you need to turn your JSON into a Dataset and use the appropriate operation on it. Below is the general workflow for going from JSON to a Dataset, with the required steps and examples. I think answering this way is more useful, because you can see the steps and then decide what to do with the information.
Input data: You have the JSON; that is the data you start from. Then decide which fields are important. Counting is usually only a small part of the job, and you don't want to load fields that may not be necessary.
Create a case class: case classes let you deserialize your input data into typed objects. To keep it simple, say I have a doctor who belongs to a department, and I receive the data as JSON. I could have the following case classes:
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)
As you can see from the code above, I work bottom-up to build the data I want to work with. Your JSON has many fields (e.g., v) whose meaning I can't tell, so be careful not to mix them up.
Have a Dataset: the code below deserializes JSON into the case class we defined:
spark.read.json("doctorsData.json").as[Doctor]
A couple of points: spark is a SparkSession, which you need to create. Here the instance is named spark, but it could be called anything. You also need to import spark.implicits._.
In business: now you are in the Spark world, and it is just a matter of calling count() on your Dataset. The following method shows how:
def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
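For completeness, the steps above can be strung together roughly as follows. This is a sketch rather than a reference implementation: the two JSON records are invented sample data, the local[*] master and the DoctorCount name are assumptions, and reading JSON from a Dataset[String] needs Spark 2.2 or later.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes must live at the top level so Spark can derive encoders for them.
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)

object DoctorCount {
  def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("doctor-count").getOrCreate()
    import spark.implicits._

    // Hypothetical sample records, one JSON object per element.
    val json = Seq(
      """{"name":"Ann","department":{"name":"Cardiology","address":"1 Main St"}}""",
      """{"name":"Bob","department":{"name":"Oncology","address":"2 High St"}}"""
    ).toDS()

    val doctors = spark.read.json(json).as[Doctor]
    println(recordsCount(doctors)) // number of records in the Dataset
    spark.stop()
  }
}
```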
Here is what I have with a file of three correctly formatted records on Spark 2.x, read into a DataFrame/Dataset:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = spark.read
.option("multiLine", true)
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.json("/FileStore/tables/json_01.txt")
df.select("*").show(false)
df.printSchema()
df.count()
If you only need the total count, the last line is enough:
res15: Long = 3

Write Header only CSV record from Spark Scala DataFrame

My requirement is to write only the CSV header record from a Spark Scala DataFrame. Can anyone help me with this?
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
The code above works and creates the header with a tab delimiter. Since I am using a Spark session, I obtain the sparkContext from it on the second line. csvDF is my DataFrame, created before these statements.
Two things are outstanding; can one of you help me?
1. The working code above does not overwrite existing files, so I have to delete them manually every time. I could not find an overwrite option.
2. Since I call select and then schema, will this be considered an action and start another lineage for this statement? If so, it would degrade performance.
If you only need to output the header, you can use this code:
df.schema.fieldNames.reduce(_ + "," + _)
It creates a single CSV line containing the column names.
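On the overwrite question, one option is to skip Spark for this step and write the single header line from the driver. The sketch below uses java.nio, whose Files.write truncates an existing file by default; it assumes the target is a local filesystem path (not HDFS or an object store), and HeaderWriter and writeHeader are names invented for this example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object HeaderWriter {
  // Writes one header line, replacing the file if it already exists.
  // With no explicit options, Files.write creates or truncates the target.
  def writeHeader(path: Path, columns: Seq[String], sep: String = "\t"): Unit = {
    val line = columns.mkString(sep) + System.lineSeparator()
    Files.write(path, line.getBytes(StandardCharsets.UTF_8))
  }
}

// Assumed usage with the DataFrame from the question (the path is a placeholder):
// HeaderWriter.writeHeader(java.nio.file.Paths.get("header.tsv"),
//   csvDF.select("col_01", "col_02", "col_03").schema.fieldNames.toSeq)
```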
I tested and the solution below did not affect any performance.
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
I found a way to handle this situation: define the columns in the configuration file and write those columns to a file. Here is the snippet.
val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")

I need help parsing a file in scala for running a spark job

I'm running a Spark job in Scala and I'm stuck on parsing the input file.
The input file (TAB-separated) looks something like this:
date=20160701 name=mike age=26
date=20160402 name=john age=33
I want to parse it and extract only the values, not the keys, like this:
20160701 mike 26
20160402 john 33
How can this be achieved in Scala?
I'm using Scala 2.11.
You can use CSVParser(); since you know the position of each key, it will be easy and clean.
Test data
val data = "date=20160701\tname=mike\tage=26\ndate=20160402\tname=john\tage=33\n"
One statement to do what you asked
val rdd = sc.parallelize(data.split('\n'))
.map(_.split('\t') // split into key=value
.map(_.split('=')(1))) // split those at "=" and select only the value
Display what we got
rdd.collect().foreach(r=>println(r.mkString(",")))
// 20160701,mike,26
// 20160402,john,33
But don't do this for real code. It's very fragile in the face of data format errors, etc. Use CSVParser or something instead as Narendra Parmar suggests.
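To illustrate what "less fragile" could look like without pulling in a parser library, here is a small sketch that rejects malformed lines instead of throwing. KeyValueLine and values are invented names, and the rule that every field must contain an "=" is an assumption about the format.

```scala
object KeyValueLine {
  // Parse one TAB-separated line of key=value fields.
  // Returns the values in order, or None if any field has no '=' separator.
  def values(line: String): Option[Seq[String]] = {
    // Limit 2 keeps any further '=' characters inside the value.
    val parsed = line.split('\t').toSeq.map(_.split("=", 2))
    if (parsed.forall(_.length == 2)) Some(parsed.map(_(1))) else None
  }
}

// Assumed usage on an RDD[String] of input lines; malformed lines are dropped:
// val rows = lines.flatMap(line => KeyValueLine.values(line))
```

Returning an Option makes the handling of bad lines explicit, so the caller can drop them, as in the usage comment, or count and log them instead.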
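val rdd = sc.textFile() // pass the input path here
// split each line into fields, keep the value after "=", and re-join with tabs
rdd.map(x => x.split("\t")).map(x => x.map(_.split("=")(1))).map(x => x.mkString("\t")).saveAsTextFile("")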