Convert a String column to timestamp using Spark/Scala

I want to convert a String column to a timestamp column, but it always returns null values.
val t = unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")
val df= df2.withColumn("ts", t)
Any idea?
Thank you.

Make sure your String column matches the specified format MM/dd/yyyy.
If it does not match, null will be returned.
Example:
val df2=Seq(("12/12/2020")).toDF("tracking_time")
val t = unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")
df2.withColumn("ts", t).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
df2.withColumn("ts",unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
//(or) by using the to_timestamp function:
df2.withColumn("ts",to_timestamp(col("tracking_time"),"MM/dd/yyyy")).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
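For contrast, here is a quick sketch of the null case described above, using a hypothetical value that does not match the pattern:
Seq("12-12-2020").toDF("tracking_time")
  .withColumn("ts", unix_timestamp(col("tracking_time"), "MM/dd/yyyy").cast("timestamp"))
  .show()
//ts should be null here, since "12-12-2020" does not match MM/dd/yyyy.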

As @Shu mentioned, the cause might be an invalid format in the tracking_time column. It is worth mentioning, though, that Spark looks for the pattern as a prefix of the column's value. Study these examples for better intuition:
Seq(
  "03/29/2020 00:00",
  "03/29/2020",
  "00:00 03/29/2020",
  "03/29/2020somethingsomething"
).toDF("tracking_time")
  .withColumn("ts", unix_timestamp(col("tracking_time"), "MM/dd/yyyy").cast("timestamp"))
  .show()
//+--------------------+-------------------+
//| tracking_time| ts|
//+--------------------+-------------------+
//| 03/29/2020 00:00|2020-03-29 00:00:00|
//| 03/29/2020|2020-03-29 00:00:00|
//| 00:00 03/29/2020| null|
//|03/29/2020somethi...|2020-03-29 00:00:00|
//+--------------------+-------------------+

Related

Spark read partition columns showing up null

I have an issue when trying to read partitioned data with Spark.
If the data in the partitioned column is in a specific format, it will show up as null in the resulting dataframe.
For example:
case class Alpha(a: String, b:Int)
val ds1 = Seq(Alpha("2020-02-11_12h32m12s", 1), Alpha("2020-05-21_10h32m52s", 2), Alpha("2020-06-21_09h32m38s", 3)).toDS
ds1.show
+--------------------+---+
| a| b|
+--------------------+---+
|2020-02-11_12h32m12s| 1|
|2020-05-21_10h32m52s| 2|
|2020-06-21_09h32m38s| 3|
+--------------------+---+
ds1.write.partitionBy("a").parquet("test")
val ds2 = spark.read.parquet("test")
ds2.show
+---+----+
| b| a|
+---+----+
| 2|null|
| 3|null|
| 1|null|
+---+----+
Do you have any idea how I could instead make that data show up as a String (or Timestamp)?
Thanks for the help.
I just needed to set the parameter spark.sql.sources.partitionColumnTypeInference.enabled to false:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
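With type inference disabled, re-reading the same path should bring the partition column back as a plain string (a quick sketch reusing the "test" path from above; ds3 is just an illustrative name, and the expected schema is shown as comments):
val ds3 = spark.read.parquet("test")
ds3.printSchema()
//root
// |-- b: integer (nullable = true)
// |-- a: string (nullable = true)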

Spark: change DF schema column names from dot to underscore

I have a dataframe with column names that contain dots.
Example: df.printSchema
user.id_number
user.name.last
user.phone.mobile
etc., and I want to rename the schema by replacing the dots with _:
user_id_number
user_name_last
user_phone_mobile
Note: the input data for this DF is in JSON format (from a non-relational/NoSQL source).
Use .toDF (with .map over the column names), selectExpr, or .withColumnRenamed to replace . with _.
Example:
1. Using toDF:
val df=Seq(("1","2","3")).toDF("user.id_number","user.name.last","user.phone.mobile")
df.toDF(df.columns.map(x =>x.replace(".","_")):_*).show()
//using replaceAll
df.toDF(df.columns.map(x =>x.replaceAll("\\.","_")):_*).show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+
2. Using selectExpr:
val expr=df.columns.map(x =>col(s"`${x}`").alias(s"${x}".replace(".","_")).toString)
df.selectExpr(expr:_*).show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+
3. Using .withColumnRenamed:
df.columns.foldLeft(df){(tmpdf,col) =>tmpdf.withColumnRenamed(col,col.replace(".","_"))}.show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+
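To double-check, printing the schema after the first approach should show the underscored names (a quick sketch; renamed is just an illustrative name):
val renamed = df.toDF(df.columns.map(_.replace(".", "_")):_*)
renamed.printSchema()
//root
// |-- user_id_number: string (nullable = true)
// |-- user_name_last: string (nullable = true)
// |-- user_phone_mobile: string (nullable = true)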

Infinity values in a dataframe (Spark/Scala)

I have a dataframe with infinity values. How can I replace them with 0.0?
I tried the following but it doesn't work:
val Nan= dataframe_final.withColumn("Vitesse",when(col("Vitesse").isin(Double.NaN,Double.PositiveInfinity,Double.NegativeInfinity),0.0))
Example of dataframe:
+------------------+
|           Vitesse|
+------------------+
| 8.171069002316942|
|          Infinity|
| 4.290418664272539|
| 16.19811830014666|
+------------------+
How can I replace "Infinity" with 0.0?
Thank you.
scala> df.withColumn("Vitesse", when(col("Vitesse").equalTo(Double.PositiveInfinity),0.0).otherwise(col("Vitesse")))
res1: org.apache.spark.sql.DataFrame = [Vitesse: double]
scala> res1.show
+-----------------+
| Vitesse|
+-----------------+
|8.171069002316942|
| 0.0|
|4.290418664272539|
+-----------------+
You can try it like above.
Your approach using when().otherwise() is correct.
Add the missing .otherwise() clause so that Vitesse values are kept as-is when the value is not in Infinity, -Infinity, NaN.
Example:
val df=Seq(("8.171".toDouble),("4.2904".toDouble),("16.19".toDouble),(Double.PositiveInfinity),(Double.NegativeInfinity),(Double.NaN)).toDF("Vitesse")
df.show()
//+---------+
//| Vitesse|
//+---------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| Infinity|
//|-Infinity|
//| NaN|
//+---------+
df.withColumn("Vitesse", when(col("Vitesse").isin(Double.PositiveInfinity,Double.NegativeInfinity,Double.NaN),0.0).
otherwise(col("Vitesse"))).
show()
//+-------+
//|Vitesse|
//+-------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| 0.0|
//| 0.0|
//| 0.0|
//+-------+
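Applied to the code from the question (reusing the dataframe_final and Nan names from there), the only change is the added .otherwise:
val Nan = dataframe_final.withColumn("Vitesse",
  when(col("Vitesse").isin(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity), 0.0).
    otherwise(col("Vitesse")))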

Splitting the file name

Good morning guys,
I have the following dataframe:
+--------------+--------------------------------------------------------+-----------------+
|co_tipo_arquiv|filename                                                |count_tipo_arquiv|
+--------------+--------------------------------------------------------+-----------------+
|05            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_01|2                |
|01            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_02|2                |
+--------------+--------------------------------------------------------+-----------------+
I would like to get only the file name in the filename column, like this:
+--------------+--------+-----------------+
|co_tipo_arquiv|filename|count_tipo_arquiv|
+--------------+--------+-----------------+
|05            |files_01|2                |
|01            |files_02|2                |
+--------------+--------+-----------------+
I thought about doing a split, but I don't know how to get the last value:
split(col("filename"), "/")
but .last doesn't work:
+--------------+-------------------------------------------------------------+
|co_tipo_arquiv|filename |
+--------------+-------------------------------------------------------------+
|05 |[hdfs:, , spbrhdpdev1.br.experian.local:8020,files, files_01]|
|01 |[hdfs:, , spbrhdpdev1.br.experian.local:8020,files, files_02]|
+--------------+-------------------------------------------------------------+
From Spark 2.4+:
We can use the element_at function to get the last element of the array.
1. Using the element_at function:
df.withColumn("filename",element_at(split(col("filename"),"/"),-1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
For Spark < 2.4:
2. Using the substring_index function:
df.withColumn("filename",substring_index(col("filename"),"/",-1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
3. Using the regexp_extract function:
df.withColumn("filename",regexp_extract(col("filename"),"([^\\/]+$)",1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+

Spark Dataframe - Method to take a row as input & a dataframe as output

I need to write a method that iterates over all the rows of DF2 and generates a Dataframe based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
  List("2016-10-31","1000000","1000","0","1"),
  List("2016-12-01","100000","950","1","1"),
  List("2017-01-01","50000","50","2","1"),
  List("2017-03-01","50000","100","3","1"),
  List("2017-03-30","80000","300","4","1")
).map(row => (row(0), row(1), row(2), row(3), row(4))).toDF(df1Columns:_*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
  List("2017-02-01","0","400")
).map(row => (row(0), row(1), row(2))).toDF(df2Columns:_*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date value from each row of DF2.
For example, the first row of df2 has Eftv_Date = Feb 01 2017, so I need to filter df1 to the records with Eftv_Date less than or equal to Feb 01 2017. This will generate 3 records as below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using the map function.
def transformRows(row: Row) = {
  val dateEffective = row.getAs[String]("Eftv_Date")
  val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
  df1 = df1LayerMet
  df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: we could implement this using a join, but I need to implement a custom Scala method to do this, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
The map fails because transformRows returns a DataFrame, for which Spark has no Encoder (and a DataFrame cannot be referenced inside an executor-side map anyway). What you describe is essentially a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
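If a plain Scala method is really required (as the note above says), a minimal sketch, assuming df2 is small enough to collect to the driver, is to collect its Eftv_Date values and filter df1 once per value. This avoids referencing df1 inside df2.map, which is what triggers the encoder error. The helper name filterByEftvDate is hypothetical:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: for each Eftv_Date collected from df2, return the rows of
// df1 whose Eftv_Date is less than or equal to it. df2 is collected to the driver,
// so this sketch assumes df2 is small.
def filterByEftvDate(df1: DataFrame, df2: DataFrame): Seq[DataFrame] = {
  df2.select("Eftv_Date").collect().toSeq.map { row =>
    val dateEffective = row.getAs[String]("Eftv_Date")
    df1.where(col("Eftv_Date").leq(dateEffective))
  }
}

// Usage: the first element corresponds to the first row of df2.
filterByEftvDate(df1, df2).head.show()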