Spark, String to Timestamp - scala

I've got the following String column:
+-------------------+
| Data_Time|
+-------------------+
|22.07.2017 10:06:51|
|22.07.2017 10:06:51|
|22.07.2017 10:06:51|
|22.07.2017 10:03:45|
|22.07.2017 10:03:45|
+-------------------+
I want to convert it to Timestamp.
I'm trying to do it with:
val dfSorted = df.withColumn("Data_Time_Pattern", to_timestamp(col("Data_Time"), "dd.MM.yyyy HH:mm:ss"))
But I'm getting null values:
+-------------------+-----------------+
| Data_Time|Data_Time_Pattern|
+-------------------+-----------------+
|22.07.2017 10:06:51| null|
|22.07.2017 10:06:51| null|
|22.07.2017 10:06:51| null|
|22.07.2017 10:03:45| null|
|22.07.2017 10:03:45| null|
+-------------------+-----------------+
What am I doing wrong?

I have solved the problem. I had "\t" as the separator, and changing it to " " made it work.
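For reference, a minimal sketch of the working conversion, assuming a SparkSession named spark and a single space between the date and time parts of both the data and the pattern:

import spark.implicits._
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical reproduction of the sample column
val df = Seq("22.07.2017 10:06:51", "22.07.2017 10:03:45").toDF("Data_Time")

// The separator in the pattern (here a space) has to match the one actually present in the data
val dfSorted = df.withColumn("Data_Time_Pattern", to_timestamp(col("Data_Time"), "dd.MM.yyyy HH:mm:ss"))
dfSorted.show(false)  // Data_Time_Pattern now holds non-null timestamps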

Related

How do I count the occurrences of a string in a PySpark dataframe column?

Suppose I have the following PySpark Dataframe:
+---+------+-------+-----------------+
|age|height| name| friends |
+---+------+-------+-----------------+
| 10| 80| Alice| 'Grace, Sarah'|
| 15| null| Bob| 'Sarah'|
| 12| null| Tom|'Amy, Sarah, Bob'|
| 13| null| Rachel| 'Tom, Bob'|
+---+------+-------+-----------------+
How do I count the number of people who have 'Sarah' as a friend without creating another column?
I have tried df.friends.apply(lambda x: x[x.str.contains('Sarah')].count()) but got TypeError: 'Column' object is not callable
You can try the following code:
df = df.withColumn('sarah', lit('Sarah'))
df.filter(df['friends'].contains(df['sarah'])).count()
Thanks pault
df.where(df.friends.like('%Sarah%')).count()
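Since the rest of this thread is in Scala, here is a comparable hypothetical sketch with the Scala DataFrame API (assuming a df with a string column friends):

import org.apache.spark.sql.functions.col

// Count the rows whose friends string mentions 'Sarah'
val sarahCount = df.filter(col("friends").contains("Sarah")).count()
// or, with a SQL LIKE pattern
val sarahCountLike = df.where(col("friends").like("%Sarah%")).count()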

Problems to convert date to timestamp, Spark date to timestamp from unix_timestamp return null

I'm having problems converting a date to a timestamp: unix_timestamp returns null.
scala> import org.apache.spark.sql.functions.unix_timestamp
scala> spark.sql("select from_unixtime(unix_timestamp(('2017-08-13 00:06:05'),'yyyy-MM-dd HH:mm:ss')) AS date").show(false)
+----+
|date|
+----+
|null|
+----+
The problem was the daylight-saving time change in Chile, thank you very much. This is the data that was returning null:
+-------------------+---------+
| DateIntermedia|TimeStamp|
+-------------------+---------+
|13-08-2017 00:01:07| null|
|13-08-2017 00:10:33| null|
|14-08-2016 00:28:42| null|
|13-08-2017 00:04:43| null|
|13-08-2017 00:33:51| null|
|14-08-2016 00:28:08| null|
|14-08-2016 00:15:34| null|
|14-08-2016 00:21:04| null|
|13-08-2017 00:34:13| null|
+-------------------+---------+
The solution was to set the time zone:
spark.conf.set("spark.sql.session.timeZone", "UTC-6")
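A sketch of how the fix could be applied to the DateIntermedia column shown above (the dd-MM-yyyy HH:mm:ss pattern is an assumption read off the sample values):

// Set the session time zone first, as in the fix above, so the conversion
// does not hit the local daylight-saving gap
spark.conf.set("spark.sql.session.timeZone", "UTC-6")

import org.apache.spark.sql.functions.{col, unix_timestamp}
val withTs = df.withColumn(
  "TimeStamp",
  unix_timestamp(col("DateIntermedia"), "dd-MM-yyyy HH:mm:ss").cast("timestamp")
)
withTs.show(false)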

Remove Nulls in specific Rows in Dataframe and combine rows

I need to do the below activity in Spark DataFrames using Scala.
I have tried some basic filters with isNotNull conditions and others, but no luck.
Input
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Output
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
|[DD,DE,QQ]|[AH,EE,CC]|[AE,AA,CV]|
+----------+----------+----------+
If the input dataframe is limited to only
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Then doing the following should get you the desired final dataframe:
import org.apache.spark.sql.functions._
df.select(
  collect_list("Amber")(0).as("Amber"),
  collect_list("Green")(0).as("Green"),
  collect_list("Red")(0).as("Red")
).show(false)
You should be getting
+------------+------------+------------+
|Amber |Green |Red |
+------------+------------+------------+
|[DD, DE, QQ]|[AH, EE, CC]|[AE, AA, CV]|
+------------+------------+------------+
The collect_list built-in function ignores null values.
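An alternative sketch, under the same assumption of at most one non-null value per column, that avoids indexing into the collected list by using first with ignoreNulls:

import org.apache.spark.sql.functions.first

df.select(
  first("Amber", ignoreNulls = true).as("Amber"),
  first("Green", ignoreNulls = true).as("Green"),
  first("Red", ignoreNulls = true).as("Red")
).show(false)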

Spark DataFrame Add Column with Value

I have a DataFrame with below data
scala> nonFinalExpDF.show
+---+----------+
| ID| DATE|
+---+----------+
| 1| null|
| 2|2016-10-25|
| 2|2016-10-26|
| 2|2016-09-28|
| 3|2016-11-10|
| 3|2016-10-12|
+---+----------+
From this DataFrame I want to get below DataFrame
+---+----------+----------+
| ID| DATE| INDICATOR|
+---+----------+----------+
| 1| null| 1|
| 2|2016-10-25| 0|
| 2|2016-10-26| 1|
| 2|2016-09-28| 0|
| 3|2016-11-10| 1|
| 3|2016-10-12| 0|
+---+----------+----------+
Logic -
For the latest DATE (max date) of an ID, the indicator value should be 1, and the others 0.
For a null DATE value, the indicator should be 1.
Please suggest simple logic to do that.
Try:
df.createOrReplaceTempView("df")
spark.sql("""
  SELECT id, date,
         CAST(LEAD(COALESCE(date, TO_DATE('1900-01-01')), 1)
              OVER (PARTITION BY id ORDER BY date) IS NULL AS INT) AS INDICATOR
  FROM df""")
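If you prefer the DataFrame API over SQL, a hedged equivalent sketch using a window per ID (marking the maximum DATE, and null DATEs, with 1):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, max, when}

val byId = Window.partitionBy("ID")
val result = df.withColumn(
  "INDICATOR",
  when(col("DATE").isNull || col("DATE") === max(col("DATE")).over(byId), lit(1))
    .otherwise(lit(0))
)
result.show()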

Scala: Spark SQL to_date(unix_timestamp) returning NULL

Spark Version: spark-2.0.1-bin-hadoop2.7
Scala: 2.11.8
I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format is a list of the column names that need to be converted to yyyy-MM-dd format.
In the following code, I first load the date columns as StringType via the schema, then check whether date_format is non-empty (i.e. there are columns that need to be converted from String to Date), and cast each such column using unix_timestamp and to_date. However, csv_df.show() returns rows that are all null.
def read_csv(csv_source: String, delimiter: String, is_first_line_header: Boolean,
             schema: StructType, date_format: List[String]): DataFrame = {
  println("|||| Reading CSV Input ||||")
  var csv_df = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", is_first_line_header)
    .option("delimiter", delimiter)
    .load(csv_source)
  println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
  if (date_format.length > 0) {
    for (i <- 0 until date_format.length) {
      csv_df = csv_df.select(to_date(unix_timestamp(
        csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
      csv_df.show()
    }
  }
  csv_df
}
Returned Top 20 rows:
+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-MM-DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------------------------------------------------------------------+
Why am I getting all null?
To convert yyyyMMdd to yyyy-MM-dd you can:
spark.sql("""SELECT DATE_FORMAT(
CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")
Or, using the functions API:
date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
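Applied to the question's loop, a sketch that overwrites each listed date column in place instead of selecting only the converted expression (normalizeDates is a hypothetical helper; note that the question's date_format parameter would shadow the date_format function if org.apache.spark.sql.functions._ were imported, hence the qualified calls):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.{functions => F}

def normalizeDates(df: DataFrame, dateColumns: List[String]): DataFrame =
  dateColumns.foldLeft(df) { (acc, name) =>
    // Parse the yyyyMMdd string and re-render it as yyyy-MM-dd
    acc.withColumn(name,
      F.date_format(F.unix_timestamp(F.col(name), "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd"))
  }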