How to get integer value with leading zero in Spark (Scala)

I have a Spark dataframe and am trying to add Year, Month and Day columns to it.
The problem is that after adding these columns, the Month and Day values do not keep their leading zero.
import org.apache.spark.sql.functions._
import spark.implicits._

val cityDF = Seq(("Delhi","India"), ("Kolkata","India"), ("Mumbai","India"),
  ("Nairobi","Kenya"), ("Colombo","Srilanka"), ("Tibet","China")).toDF("City", "Country")
val dateString = "2020-01-01"
val dateCol = to_date(lit(dateString))
val finaldf = cityDF.select($"*", year(dateCol).alias("Year"), month(dateCol).alias("Month"), dayofmonth(dateCol).alias("Day"))
I want to keep the leading zero in the Month and Day columns, but the result is 1 instead of 01.
I am using the Year, Month and Day columns to create Spark partitions, so I want the leading zeros kept intact.
So my question is: how do I keep the leading zero in my dataframe columns?

An integer column can be converted to a string column, where leading zeroes are possible, with the format_string function:
val finaldf = cityDF.select($"*",
  year(dateCol).alias("Year"),
  format_string("%02d", month(dateCol)).alias("Month"),
  format_string("%02d", dayofmonth(dateCol)).alias("Day")
)

Why not simply use date_format for that?
val finaldf = cityDF.select(
  $"*",
  year(dateCol).alias("Year"),
  date_format(dateCol, "MM").alias("Month"),
  date_format(dateCol, "dd").alias("Day")
)
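For the partitioned write mentioned in the question, here is a minimal sketch of how these string columns could be used (the output path /tmp/cities is just a placeholder):
// Partitioning by the zero-padded string columns yields directories
// like Year=2020/Month=01/Day=01 instead of Month=1/Day=1.
finaldf.write
  .partitionBy("Year", "Month", "Day")
  .parquet("/tmp/cities")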

Related

How to calculate an hourly count (grouped by a timeStamp type) in Spark dataframe?

For a dataframe df1 where col1 is of type DateType, I do the following to get the daily count.
val df1_new=df1.groupBy("col1").count()
However, for my dataframe df2 where col2 is of type TimestampType, I want to get the count on a per-hour basis. Replicating the above code for this results in a separate count for every timestamp that differs by even a second.
What should I be doing to achieve the count on an hourly basis for df2?
You can use date_trunc to truncate the timestamps to the hour level:
val df2_new = df2.groupBy(date_trunc("hour", col("col2"))).count()
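A slightly fuller sketch (the alias "hour" is just a placeholder name) that labels the truncated column so the grouping key stays readable:
import org.apache.spark.sql.functions.{col, date_trunc}

val df2_new = df2
  .groupBy(date_trunc("hour", col("col2")).alias("hour")) // floor each timestamp to the start of its hour
  .count()
  .orderBy("hour")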

How to add a list or array of strings as a column to a Spark Dataframe

So, I have n strings that I can keep either in an array or in a list, like this:
val checks = Array("check1", "check2", "check3", "check4", "check5")
val checks: List[String] = List("check1", "check2", "check3", "check4", "check5")
Now, I have a Spark dataframe df and I want to add a column with the values present in this List/Array. (It is guaranteed that the number of items in my List/Array is exactly equal to the number of rows in the dataframe, i.e. n.)
I tried doing:
df.withColumn("Value", checks)
But that didn't work. What would be the best way to achieve this?
You need to add it as an array column as follows:
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
If you want a single value for each row, you can get the array element:
import org.apache.spark.sql.expressions.Window

val df2 = df.withColumn("Value", array(checks.map(lit): _*))
  .withColumn("rn", row_number().over(Window.orderBy(lit(1))) - 1) // 0-based row index
  .withColumn("Value", expr("Value[rn]"))                          // pick the array element matching the row
  .drop("rn")
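Note that Window.orderBy(lit(1)) moves all rows into a single partition to number them. If that is a concern, here is a sketch of an alternative (assuming a SparkSession named spark) that pairs rows and checks by index with zipWithIndex and joins them:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}

// Index the dataframe rows and the checks, join on the index, then append the check value.
val indexedRows   = df.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }
val indexedChecks = spark.sparkContext.parallelize(checks).zipWithIndex.map { case (v, idx) => (idx, v) }
val joined = indexedRows.join(indexedChecks).map { case (_, (row, check)) => Row.fromSeq(row.toSeq :+ check) }
val df2 = spark.createDataFrame(joined, df.schema.add(StructField("Value", StringType)))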

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
import org.apache.commons.lang3.StringUtils

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to leave just metric1, metric2, metric3
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
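After that rename, the columns can be selected without backticks, for example:
df2.select("column_a_b").show(5)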
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +         // Lazy match on the leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" + // Optional 2nd-to-last group
  "([^.]+)$"      // Last group
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
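Alternatively, a compact sketch that renames every column at once with toDF, keeping only the part after the last dot (assuming the shortened names do not collide; columns without dots, like date, are left untouched):
val salesDf_new = salesDf.toDF(salesDf.columns.map(_.split('.').last): _*)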

Spark add_months doesn't work as expected [duplicate]

This question already has answers here:
Spark Dataframe Random UUID changes after every transformation/action
(4 answers)
Closed 5 years ago.
In a dataframe, I'm generating column A in DateType format "yyyy-MM-dd". Column A is generated by a UDF (the udf generates a random date within the last 24 months).
From that generated date I try to calculate column B: column B is column A minus 6 months, e.g. 2017-06-01 in A becomes 2017-01-01 in B.
To achieve this I use the function add_months(columnname, -6).
When I do this using another column (not generated by the udf) I get the right result. But when I do it on the generated column I get random, totally wrong values.
I checked the schema; the column is of DateType.
This is my code:
val test = df.withColumn("A", to_date(callUDF("randomUDF")))
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))
Code of my UDF:
import java.text.SimpleDateFormat
import java.util.Calendar

sqlContext.udf.register("randomUDF", () => {
  // prepare the date format
  val formatter = new SimpleDateFormat("yyyy-MM-dd")
  // get today's date as reference
  val today = Calendar.getInstance()
  val now = today.getTime()
  // set "from" to 2 years before now
  val from = Calendar.getInstance()
  from.setTime(now)
  from.add(Calendar.MONTH, -24)
  // convert both dates to Long millis
  val valuefrom = from.getTimeInMillis()
  val valueto = today.getTimeInMillis()
  // generate a random Long between from and to
  val value3 = valuefrom + Math.random() * (valueto - valuefrom)
  // put the generated value on a Calendar and format the date
  val calendar3 = Calendar.getInstance()
  calendar3.setTimeInMillis(value3.toLong)
  formatter.format(calendar3.getTime())
})
UDF works as expected, but I think there is something going wrong here.
I tried the add_months function on another column (not generated) and it worked fine.
Example of the results I get with this code:
A | B
2017-10-20 | 2016-02-27
2016-05-06 | 2015-05-25
2016-01-09 | 2016-03-14
2016-01-04 | 2017-04-26
Using Spark version 1.5.1 and Scala 2.10.4.
The creation of the test2 dataframe in your code
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))
is treated by Spark as
val test2 = df.withColumn("A", to_date(callUDF("randomUDF"))).select(col("*"), add_months(to_date(callUDF("randomUDF")), -6).as("B"))
So you can see that the udf function is called twice: df.withColumn("A", to_date(callUDF("randomUDF"))) generates the date that appears in column A, while add_months(to_date(callUDF("randomUDF")), -6).as("B") calls the udf again, generating a new date, subtracting 6 months from it, and showing that date in column B.
That's the reason you are getting random dates.
The solution is to persist or cache the test dataframe:
val test = df.withColumn("A", callUDF("randomUDF")).cache()
val test2 = test.as("table").withColumn("B", add_months($"table.A", -6))
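As a quick sanity check (a sketch, assuming Spark 1.5+ where months_between is available), every row should now show exactly 6 months between A and B:
import org.apache.spark.sql.functions.months_between

test2.select($"A", $"B", months_between($"A", $"B").as("diff_months")).show()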

How to get full timestamp value from dataframes? values being truncated

I have a function "toDate(v:String):Timestamp" that takes a string an converts it into a timestamp with the format "MM-DD-YYYY HH24:MI:SS.NS".
I make a udf of the function:
val u_to_date = sqlContext.udf.register("u_to_date", toDate _)
The issue happens when applying the UDF to a dataframe: the resulting dataframe loses the last 3 digits of the nanoseconds.
For example, when using the argument "0001-01-01 00:00:00.123456789", the resulting value looks like
[0001-01-01 00:00:00.123456]
I have even tried a dummy function that returns Timestamp.valueOf("1234-01-01 00:00:00.123456789"); when applying the udf of the dummy function, the last 3 digits are still truncated.
I have looked into the sqlContext conf: spark.sql.parquet.int96AsTimestamp is set to true (I also tried with it set to false).
I am at a loss here. What is causing the truncation of the last 3 digits?
Example
The function could be:
import java.sql.Timestamp

def date123(v: String): Timestamp = {
  Timestamp.valueOf("0001-01-01 00:00:00.123456789")
}
It's just a dummy function that should return a timestamp with full nanosecond precision.
Then I would make a udf:
val u_date123 = sqlContext.udf.register("u_date123", date123 _)
Example df:
val theRow = Row("blah")
val theRdd = sc.makeRDD(Array(theRow))
case class X(x: String)
val df = theRdd.map { case Row(s0) => X(s0.asInstanceOf[String]) }.toDF()
If I apply the udf to the dataframe df with a string column, it will return a dataframe that looks like '[0001-01-01 00:00:00.123456]'
df.select(u_date123($"x")).collect.foreach(println)
I think I found the issue.
In Spark 1.5 the timestamp datatype was changed from 12 bytes to 8 bytes (microseconds since the epoch), so anything beyond microsecond precision is dropped:
https://fossies.org/diffs/spark/1.4.1_vs_1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala-diff.html
I tested on Spark 1.4.1, and it preserves the full nanosecond precision.
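If the full nanosecond value matters downstream, one workaround (just a sketch, assuming the source strings parse with Timestamp.valueOf) is to keep the sub-microsecond part in a separate column alongside the truncated timestamp:
import java.sql.Timestamp

// Hypothetical helper UDFs: one returns the Timestamp (Spark keeps microseconds),
// the other returns the nanosecond-of-second component as a Long.
val u_ts    = sqlContext.udf.register("u_ts",    (s: String) => Timestamp.valueOf(s))
val u_nanos = sqlContext.udf.register("u_nanos", (s: String) => Timestamp.valueOf(s).getNanos.toLong)
// Usage sketch on a dataframe with a timestamp string column "raw":
// df.select(u_ts($"raw").as("ts"), u_nanos($"raw").as("nanos"))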