I have the following String column:
+-------------------+
| Data_Time|
+-------------------+
|22.07.2017 10:06:51|
|22.07.2017 10:06:51|
|22.07.2017 10:06:51|
|22.07.2017 10:03:45|
|22.07.2017 10:03:45|
+-------------------+
I want to convert it to a Timestamp.
I'm trying to do it with:
val dfSorted = df.withColumn("Data_Time_Pattern", to_timestamp(col("Data_Time"), "dd.MM.yyyy HH:mm:ss"))
But I'm getting null values:
+-------------------+-----------------+
| Data_Time|Data_Time_Pattern|
+-------------------+-----------------+
|22.07.2017 10:06:51| null|
|22.07.2017 10:06:51| null|
|22.07.2017 10:06:51| null|
|22.07.2017 10:03:45| null|
|22.07.2017 10:03:45| null|
+-------------------+-----------------+
What am I doing wrong?
I have solved the problem. I had "\t" as the separator, and changing the value to " " made it work.
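For reference, a minimal sketch of the working call, assuming the tab was the separator between the date and time parts; the separator in the pattern must match the data exactly, otherwise every value parses to null:
import org.apache.spark.sql.functions.{col, to_timestamp}
// A pattern such as "dd.MM.yyyy\tHH:mm:ss" does not match values like
// "22.07.2017 10:06:51", so to_timestamp returns null for every row.
// With a plain space the same values parse as expected:
val dfSorted = df.withColumn("Data_Time_Pattern", to_timestamp(col("Data_Time"), "dd.MM.yyyy HH:mm:ss"))
dfSorted.show(false)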
Related
Suppose I have the following PySpark Dataframe:
+---+------+-------+-----------------+
|age|height| name| friends |
+---+------+-------+-----------------+
| 10| 80| Alice| 'Grace, Sarah'|
| 15| null| Bob| 'Sarah'|
| 12| null| Tom|'Amy, Sarah, Bob'|
| 13| null| Rachel| 'Tom, Bob'|
+---+------+-------+-----------------+
How do I count the number of people who have 'Sarah' as a friend without creating another column?
I have tried df.friends.apply(lambda x: x[x.str.contains('Sarah')].count()) but got TypeError: 'Column' object is not callable
You can try the following code:
from pyspark.sql.functions import lit
df = df.withColumn('sarah', lit('Sarah'))
df.filter(df['friends'].contains(df['sarah'])).count()
Thanks pault
df.where(df.friends.like('%Sarah%')).count()
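Note that contains also accepts a plain string literal, so the helper column is not strictly required. A hedged Scala sketch of the same count, assuming a DataFrame df with the friends column shown above:
import org.apache.spark.sql.functions.col
// Count the rows whose friends string contains the literal "Sarah".
val sarahCount = df.filter(col("friends").contains("Sarah")).count()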
Problem converting a date to a timestamp: unix_timestamp in Spark returns null.
scala> import org.apache.spark.sql.functions.unix_timestamp
scala> spark.sql("select from_unixtime(unix_timestamp(('2017-08-13 00:06:05'),'yyyy-MM-dd HH:mm:ss')) AS date").show(false)
+----+
|date|
+----+
|null|
+----+
The problem was the daylight-saving time change in Chile: the local clock skips forward at midnight on those dates, so the affected timestamps do not exist in that time zone and parse to null. Thank you very much. The rows that failed looked like this:
+-------------------+---------+
| DateIntermedia|TimeStamp|
+-------------------+---------+
|13-08-2017 00:01:07| null|
|13-08-2017 00:10:33| null|
|14-08-2016 00:28:42| null|
|13-08-2017 00:04:43| null|
|13-08-2017 00:33:51| null|
|14-08-2016 00:28:08| null|
|14-08-2016 00:15:34| null|
|14-08-2016 00:21:04| null|
|13-08-2017 00:34:13| null|
+-------------------+---------+
The solution was to set the session time zone:
spark.conf.set("spark.sql.session.timeZone", "UTC-6")
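A minimal sketch of the combined fix, assuming the source DataFrame is called df and the DateIntermedia values are dd-MM-yyyy HH:mm:ss strings, as the sample rows suggest:
// Parse in a fixed-offset zone so local times that fall inside a
// daylight-saving gap are not rejected as non-existent.
spark.conf.set("spark.sql.session.timeZone", "UTC-6")
import org.apache.spark.sql.functions.{col, unix_timestamp}
val withTs = df.withColumn("TimeStamp",
  unix_timestamp(col("DateIntermedia"), "dd-MM-yyyy HH:mm:ss").cast("timestamp"))
withTs.show(false)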
I need to do the following transformation on Spark DataFrames using Scala.
I have tried some basic filters with isNotNull conditions and other approaches, but no luck.
Input
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Output
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
|[DD,DE,QQ]|[AH,EE,CC]|[AE,AA,CV]|
+----------+----------+----------+
If the input dataframe is limited to only
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Then doing the following should get you the desired final dataframe
import org.apache.spark.sql.functions._
df.select(
  collect_list("Amber")(0).as("Amber"),
  collect_list("Green")(0).as("Green"),
  collect_list("Red")(0).as("Red")
).show(false)
You should be getting
+------------+------------+------------+
|Amber |Green |Red |
+------------+------------+------------+
|[DD, DE, QQ]|[AH, EE, CC]|[AE, AA, CV]|
+------------+------------+------------+
The collect_list built-in function ignores null values, which is why element 0 of each collected list is the single non-null array of that column.
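A self-contained sketch of the same approach, assuming a spark-shell session (so spark.implicits is available) and that the three columns hold optional arrays of strings as in the example:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
// Sample input: each row has exactly one non-null array.
val df = Seq(
  (Option.empty[Seq[String]], Option.empty[Seq[String]], Option(Seq("AE", "AA", "CV"))),
  (Option.empty[Seq[String]], Option(Seq("AH", "EE", "CC")), Option.empty[Seq[String]]),
  (Option(Seq("DD", "DE", "QQ")), Option.empty[Seq[String]], Option.empty[Seq[String]])
).toDF("Amber", "Green", "Red")
// collect_list drops the nulls, so element 0 of each collected list is
// the single non-null array per column.
df.select(
  collect_list("Amber")(0).as("Amber"),
  collect_list("Green")(0).as("Green"),
  collect_list("Red")(0).as("Red")
).show(false)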
I have a DataFrame with the data below:
scala> nonFinalExpDF.show
+---+----------+
| ID| DATE|
+---+----------+
| 1| null|
| 2|2016-10-25|
| 2|2016-10-26|
| 2|2016-09-28|
| 3|2016-11-10|
| 3|2016-10-12|
+---+----------+
From this DataFrame I want to get the DataFrame below:
+---+----------+----------+
| ID| DATE| INDICATOR|
+---+----------+----------+
| 1| null| 1|
| 2|2016-10-25| 0|
| 2|2016-10-26| 1|
| 2|2016-09-28| 0|
| 3|2016-11-10| 1|
| 3|2016-10-12| 0|
+---+----------+----------+
Logic:
For the latest DATE (MAX date) of an ID, the INDICATOR value should be 1, and the others 0.
For an ID whose DATE is null, the INDICATOR should be 1.
Please suggest a simple way to do that.
Try
df.createOrReplaceTempView("df")
spark.sql("""
SELECT id, date,
CAST(LEAD(COALESCE(date, TO_DATE('1900-01-01')), 1)
OVER (PARTITION BY id ORDER BY date) IS NULL AS INT)
FROM df""")
Spark Version: spark-2.0.1-bin-hadoop2.7
Scala: 2.11.8
I am loading a raw CSV into a DataFrame. In the CSV, although the columns are supposed to be in date format, they are written as 20161025 instead of 2016-10-25. The parameter date_format contains the names of the columns that need to be converted to yyyy-MM-dd format.
In the following code, I first load the CSV with the date columns as StringType via the schema. Then, if date_format is not empty (that is, there are columns that need to be converted from String to Date), I cast each of those columns using unix_timestamp and to_date. However, csv_df.show() returns all nulls.
def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean,
schema:StructType, date_format:List[String]): DataFrame = {
println("|||| Reading CSV Input ||||")
var csv_df = sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", is_first_line_header)
.option("delimiter", delimiter)
.load(csv_source)
println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
if(date_format.length > 0) {
for (i <- 0 until date_format.length) {
csv_df = csv_df.select(to_date(unix_timestamp(
csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
csv_df.show()
}
}
csv_df
}
Returned Top 20 rows:
+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-MM-DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------------------------------------------------------------------+
Why am I getting all null?
The nulls come from a pattern mismatch: the values are in yyyyMMdd form but are being parsed with a yyyy-MM-dd pattern. To convert yyyyMMdd to yyyy-MM-dd you can:
spark.sql("""SELECT DATE_FORMAT(
CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")
or, with the DataFrame functions:
date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
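A hedged sketch of how the conversion loop inside read_csv could apply this, assuming the raw values really are yyyyMMdd strings and that the goal is to overwrite each date column in place (withColumn keeps the other columns, unlike the select in the original loop); the list parameter is renamed here so it does not shadow the date_format function:
import org.apache.spark.sql.functions.{col, date_format, unix_timestamp}
// dateColumns is a hypothetical List[String] of the columns to convert.
for (dateCol <- dateColumns) {
  csv_df = csv_df.withColumn(dateCol,
    date_format(unix_timestamp(col(dateCol), "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd"))
}
csv_df.show()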