Convert epochmilli to DDMMYYYY - Spark Scala - scala

I have a dataframe with one of the column containing timestamps respresented in epochmilli (column type is long) and I need to convert them to a column with DDMMYY using withColumn
Something like:
1528102439 ---> 040618
How do I achieve this?

val df_DateConverted = df.withColumn("Date",from_unixtime(df.col("timestamp").divide(1000),"ddMMyy"))

Related

how to find length of string of array of json object in pyspark scala?

I have one column in DataFrame with format =
'[{jsonobject},{jsonobject}]'. here length will be 2 .
I have to find length of this array and store it in another column.
I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))

Filter Scala dataframe by column of arrays

My scala dataframe has a column that has the datatype array(element: String). I want to display those rows of the dataframe that has the word "hello" in that column.
I have this:
display(df.filter($"my_column".contains("hello")))
I get an error because of data mismatch. It says that argument 1 requires string type, however, 'my:column' is of array<string> type.
You can use array_contains function
import org.apache.spark.sql.functions._
df.filter(array_contains(df.col("my_column"), "hello")).show

IllegalArgumentException: u'Data type StringType of column is not supported [duplicate]

I have a dataframe that contains string columns and I am planning to use it as input for k-means using spark and scala. I am converting my string typed columns of the dataframe using the method below:
val toDouble = udf[Double, String]( _.toDouble)
val analysisData = dataframe_mysql.withColumn("Event", toDouble(dataframe_mysql("event"))).withColumn("Execution", toDouble(dataframe_mysql("execution"))).withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
.setInputCols(Array("execution", "event", "info"))
.setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())
when I print the analysisData schema the convertion is correct. but I am getting an exception: VectorAssembler does not support the StringType type
which means that my values are still strings! how can I convert the values and not only the schema type?
thanks
Indeed, the VectorAssembler Transformer does not take strings. So you need to make sure that your columns match numerical, boolean, vector types. Make sure that your udf is doing the right thing and be sure that none of the columns has StringType.
To convert a column in a Spark DataFrame to another type, make it simple and use the cast() DSL function like so:
val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))
It should work!

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to make the next operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Columns].
I have then a dataframe "expr" containing the next:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The column sequence is of string type and I want to use the values of each row as input of the agg, but I am not capable to reach those.
Any idea of how to give it a try?
If you can change the strings in the sequences column to be SQL commands, then it would be possible to solve. Spark provides a function expr that takes a SQL string and converts it into a column. Example dataframe with working commands:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to Seq[Column]s do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)

How to get full timestamp value from dataframes? values being truncated

I have a function "toDate(v:String):Timestamp" that takes a string an converts it into a timestamp with the format "MM-DD-YYYY HH24:MI:SS.NS".
I make a udf of the function:
val u_to_date = sqlContext.udf.register("u_to_date", toDate_)
The issue happens when you apply the UDF to dataframes. The resulting dataframe will lose the last 3 nanoseconds.
For example when using the argument "0001-01-01 00:00:00.123456789"
The resulting dataframe will be in the format
[0001-01-01 00:00:00.123456]
I have even tried a dummy function that returns Timestamp.valueOf("1234-01-01 00:00:00.123456789"). When applying the udf of the dummy function, it will truncate the last 3 nanoseconds.
I have looked into the sqlContext conf and
spark.sql.parquet.int96AsTimestamp is set to True. (I tried when it's set to false)
I am at lost here. What is causing the truncation of the last 3 digits?
example
The function could be:
def date123(v: String): Timestamp = {
Timestamp.valueOf("0001-01-01 00:00:00.123456789")
}
It's just a dummy function that should return a timestamp with full nanosecond precision.
Then I would make a udf:
`val u_date123 = sqlContext.udf.register("u_date123", date123 _)`
example df:
val theRow =Row("blah")
val theRdd = sc.makeRDD(Array(theRow))
case class X(x: String )
val df = theRdd.map{case Row(s0) => X(s0.asInstanceOf[String])}.toDF()
If I apply the udf to the dataframe df with a string column, it will return a dataframe that looks like '[0001-01-01 00:00:00.123456]'
df.select(u_date123($"x")).collect.foreach(println)
I think I found the issue.
On spark 1.5.1, they changed the size of the timestamp datatype from 12 bytes to 8 bytes
https://fossies.org/diffs/spark/1.4.1_vs_1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala-diff.html
I tested on spark 1.4.1, and it produces the full nanosecond precision.