I have a date field in a CSV.
Dates can be in different formats in the same field in the same file.
For example, one record may have '14-Dec-2022' while the next record may have '21-04-2022'.
How do I change the date in the first record (14-Dec-2022) to 14-12-2022 in PySpark?
Based on your comment, it's easy if it is known that there are only 2 formats present in the field when the file is read in PySpark. Here's an example based on that.
Option 1 - if you want to output the date in a different format, parse it first and then format it
from pyspark.sql import functions as func

# coalesce picks the first pattern that parses; a non-matching pattern yields null
spark.sparkContext.parallelize([('14-Dec-2022',), ('14-12-2022',)]).toDF(['date_field_str']). \
    withColumn('date_field',
               func.coalesce(func.to_date('date_field_str', 'dd-MMM-yyyy'),
                             func.to_date('date_field_str', 'dd-MM-yyyy')
                             )
               ). \
    withColumn('date_field_formatted', func.date_format('date_field', 'dd-MM-yyyy')). \
    show()
# you can change the output format used within `date_field_formatted`'s `withColumn`.
# +--------------+----------+--------------------+
# |date_field_str|date_field|date_field_formatted|
# +--------------+----------+--------------------+
# | 14-Dec-2022|2022-12-14| 14-12-2022|
# | 14-12-2022|2022-12-14| 14-12-2022|
# +--------------+----------+--------------------+
Option 2 - reformat the 'dd-MMM-yyyy' values and keep the already formatted strings as they are
spark.sparkContext.parallelize([('14-Dec-2022',), ('14-12-2022',)]).toDF(['date_field_str']). \
    withColumn('date_field_formatted',
               func.coalesce(func.date_format(func.to_date('date_field_str', 'dd-MMM-yyyy'), 'dd-MM-yyyy'),
                             func.col('date_field_str')
                             )
               ). \
    show()
# +--------------+--------------------+
# |date_field_str|date_field_formatted|
# +--------------+--------------------+
# | 14-Dec-2022| 14-12-2022|
# | 14-12-2022| 14-12-2022|
# +--------------+--------------------+
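If more than 2 formats can show up, the same coalesce idea extends to a whole list of candidate patterns. A minimal sketch, assuming an example list of formats (swap in whatever patterns your file actually contains):
from pyspark.sql import functions as func

# candidate patterns -- example list only; use the formats actually present in your file
fmts = ['dd-MMM-yyyy', 'dd-MM-yyyy', 'yyyy-MM-dd']

spark.sparkContext.parallelize([('14-Dec-2022',), ('14-12-2022',)]).toDF(['date_field_str']). \
    withColumn('date_field', func.coalesce(*[func.to_date('date_field_str', f) for f in fmts])). \
    withColumn('date_field_formatted', func.date_format('date_field', 'dd-MM-yyyy')). \
    show()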
I am using PySpark and I have data like this in the dataframe,
and I want the output like this.
The logic goes like this - from table 1 above,
the first date of category B for id=1 is 08/06/2022 and the first date of category A is 13/06/2022. So any date on or after 13/06/2022 should have both categories A and B.
So, for 08/06/2022 there is category B only, and for 13/06/2022 there are categories A and B. For 24/06/2022 there is just category A in table 1, but the output should have category B too, since the first date of category B (08/06/2022) is before it. And for 26/07/2022 there is just category B in table 1, but the output should have both category A and category B for 26/07/2022.
How do I achieve this in PySpark?
# input dataframe creation (data_ls is the sample data posted in the question)
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'cat', 'dt']). \
    withColumn('dt', func.col('dt').cast('date'))
# required solution
data_sdf. \
    withColumn('min_dt', func.min('dt').over(wd.partitionBy('id'))). \
    withColumn('all_cats', func.collect_set('cat').over(wd.partitionBy('id'))). \
    withColumn('cat_arr',
               func.when(func.col('min_dt') == func.col('dt'), func.array(func.col('cat'))).
               otherwise(func.col('all_cats'))
               ). \
    drop('cat', 'min_dt', 'all_cats'). \
    dropDuplicates(). \
    withColumn('cat', func.explode('cat_arr')). \
    drop('cat_arr'). \
    orderBy('id', 'dt', 'cat'). \
    show()
# +---+----------+---+
# |id |dt |cat|
# +---+----------+---+
# |1 |2022-06-08|B |
# |1 |2022-06-13|A |
# |1 |2022-06-13|B |
# |1 |2022-06-24|A |
# |1 |2022-06-24|B |
# +---+----------+---+
I've used a subset of the posted data. The idea of the approach is that you create an array of distinct categories and apply that to all dates except the minimum date. The minimum date will only have that row's category (not all categories). The array can then be exploded to get the desired result for all dates.
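Since the posted data isn't reproduced here, a minimal data_ls that matches the subset used above (my assumption, reconstructed from the description in the question) would look like this:
# assumed sample input -- reconstructed, replace with the actual posted data
data_ls = [
    (1, 'B', '2022-06-08'),
    (1, 'A', '2022-06-13'),
    (1, 'A', '2022-06-24'),
]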
Hi, I am wondering how I can count the elements of each row across the whole data set.
I have the following column:
col
(‘a’, ‘b’),(‘a’, ‘c’),(‘b’, ‘c’)
(‘g’, ‘h’),(‘a’, ‘c’),(‘a’, ‘b’)
I want to count how many times each of the above pairs exists in the data set!
Output:
(‘a’, ‘b’)  2
(‘a’, ‘c’)  2
(‘b’, ‘c’)  1
(‘g’, ‘h’)  1
I know in pandas I can do this:
h=data['col'].str.findall(r'(\([^()]+\))').explode().value_counts()
Assuming your input data is a string (based on your regex), I'd suggest replacing the comma between pairs with another separator, which you can then split on. I'm using a pipe here for demo purposes only; you can change it to any other unique pattern.
from pyspark.sql import functions as F
df = (spark
.sparkContext
.parallelize([
("('a', 'b'),('a', 'c'),('b', 'c')",),
("('g', 'h'),('a', 'c'),('a', 'b')",),
])
.toDF(['col'])
)
# Output
# +--------------------------------+
# |col |
# +--------------------------------+
# |('a', 'b'),('a', 'c'),('b', 'c')|
# |('g', 'h'),('a', 'c'),('a', 'b')|
# +--------------------------------+
(df
 .withColumn('col', F.regexp_replace('col', r'\),\(', ')|('))
 .withColumn('col', F.explode(F.split('col', r'\|')))
 .groupBy('col')
 .count()
 .show(10, False)
)
# Output
# +----------+-----+
# |col |count|
# +----------+-----+
# |('g', 'h')|1 |
# |('a', 'c')|2 |
# |('b', 'c')|1 |
# |('a', 'b')|2 |
# +----------+-----+
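If your Spark version has regexp_extract_all (added as a SQL function around Spark 3.1, if I remember correctly), you could also mirror the pandas findall approach directly instead of the replace-and-split step. A sketch, assuming default string-literal escaping in the SQL parser (the pair column name is just my choice here):
(df
 .withColumn('pair', F.explode(F.expr(r"regexp_extract_all(`col`, '(\\([^()]+\\))', 1)")))
 .groupBy('pair')
 .count()
 .show(10, False)
)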
I'm using the following code in order to convert a date/timestamp into a string with a specific format:
when(to_date($"timestamp", fmt).isNotNull, date_format(to_timestamp($"timestamp", fmt), outputFormat))
The "fmt" is coming from a list of possible formats because we have different formats in the source data.
The issue here is that when we apply the "to_timestamp" function, the milliseconds part is lost. Is there any other possible (and not overly complicated) way to do this without losing the milliseconds detail?
I remember having to mess with this a while back. This will work as well:
from pyspark.sql.functions import col, substring, unix_timestamp

df = (
    spark
    .createDataFrame(['2021-07-19 17:29:36.123',
                      '2021-07-18 17:29:36.123'], "string").toDF("ts")
    .withColumn('ts_with_mili',
                # unix_timestamp drops the sub-second part, so add it back from the string
                (unix_timestamp(col('ts'), "yyyy-MM-dd HH:mm:ss.SSS")
                 + substring(col('ts'), -3, 3).cast('float') / 1000).cast('timestamp'))
)
df.show(truncate=False)
# +-----------------------+-----------------------+
# |ts |ts_with_mili |
# +-----------------------+-----------------------+
# |2021-07-19 17:29:36.123|2021-07-19 17:29:36.123|
# |2021-07-18 17:29:36.123|2021-07-18 17:29:36.123|
# +-----------------------+-----------------------+
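It is also worth checking whether you need the workaround at all: on more recent Spark versions (3.x, if I recall correctly), to_timestamp with a fractional-seconds pattern keeps the milliseconds on its own, since Spark timestamps have microsecond precision. A quick check, using the same sample value:
from pyspark.sql.functions import col, to_timestamp

chk = spark.createDataFrame(['2021-07-19 17:29:36.123'], "string").toDF("ts")
chk.withColumn('ts_parsed', to_timestamp(col('ts'), 'yyyy-MM-dd HH:mm:ss.SSS')).show(truncate=False)
# if ts_parsed still shows .123 on your version, the unix_timestamp trick above isn't needed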
I am trying to generate a date sequence
from pyspark.sql import functions as F
df1 = df.withColumn("start_dt", F.to_date(F.col("start_date"), "yyyy-mm-dd")) \
.withColumn("end_dt", F.to_date(F.col("end_date"), "yyyy-mm-dd"))
df1.select("start_dt", "end_dt").show()
print("type(start_dt)", type("start_dt"))
print("type(end_dt)", type("end_dt"))
df2 = df1.withColumn("lineoffdate", F.expr("""sequence(start_dt,end_dt,1)"""))
Below is the output
+---------------+----------+
| start_date | end_date|
+---------------+----------+
| 2020-02-01|2020-03-21|
+---------------+----------+
type(start_dt) <class 'str'>
type(end_dt) <class 'str'>
cannot resolve 'sequence(start_dt, end_dt, 1)' due to data type mismatch: sequence only supports integral, timestamp or date types; line 1 pos 0;
Even after converting start_dt and end_dt to date or timestamp, I see the type of the column is still str, and I get the above-mentioned error while generating the date sequence.
You are correct in saying it should work with date or timestamp (calendar types). However, the only mistake you were making was passing the "step" in sequence as an integer, when it should be a calendar interval (like interval 1 day):
df.withColumn("start_date", F.to_date("start_date")) \
  .withColumn("end_date", F.to_date("end_date")) \
  .withColumn(
      "lineofdate",
      F.expr("""sequence(start_date, end_date, interval 1 day)""")
  ) \
  .show()
# output:
# +----------+----------+--------------------+
# |start_date| end_date| lineofdate|
# +----------+----------+--------------------+
# |2020-02-01|2020-03-21|[2020-02-01, 2020...|
# +----------+----------+--------------------+
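If you then want one row per date rather than an array column, the generated sequence can simply be exploded, for example (dt is just the column name I picked here):
df.withColumn("start_date", F.to_date("start_date")) \
  .withColumn("end_date", F.to_date("end_date")) \
  .withColumn("dt", F.explode(F.expr("sequence(start_date, end_date, interval 1 day)"))) \
  .select("dt") \
  .show()
# one row per day: 2020-02-01, 2020-02-02, ... up to 2020-03-21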
I have a pyspark dataframe, called df.
ONE LINE EXAMPLE:
df.take(1)
[Row(data=u'2016-12-25',nome=u'Mauro',day_type="SUN")]
I have a list of holidays day:
holydays=[u'2016-12-25',u'2016-12-08'....]
I want to switch day_type to "HOLIDAY" if "data" is in the holydays list; otherwise I want to leave the day_type field as it is.
This is my non-working attempt:
df=df.withColumn("day_type",when(col("data") in holydays, "HOLIDAY").otherwise(col("day_type")))
PySpark does not like the expression "in holydays".
It returns this error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
Regarding your first question - you need isin:
spark.version
# u'2.2.0'
from pyspark.sql import Row
from pyspark.sql.functions import col, when
df=spark.createDataFrame([Row(data=u'2016-12-25',nome=u'Mauro',day_type="SUN")])
holydays=[u'2016-12-25',u'2016-12-08']
df.withColumn("day_type",when(col("data").isin(holydays), "HOLIDAY").otherwise(col("day_type"))).show()
# +----------+--------+-----+
# | data|day_type| nome|
# +----------+--------+-----+
# |2016-12-25| HOLIDAY|Mauro|
# +----------+--------+-----+
Regarding your second question - I don't see any issue:
df.withColumn("day_type",when(col("data")=='2016-12-25', "HOLIDAY").otherwise(col("day_type"))).filter("day_type='HOLIDAY'").show()
# +----------+--------+-----+
# | data|day_type| nome|
# +----------+--------+-----+
# |2016-12-25| HOLIDAY|Mauro|
# +----------+--------+-----+
BTW, it's always a good idea to provide a little more than a single row of sample data...
Use the isin function on the column instead of the in clause to check whether the value is present in a list. Sample code:
df = df.withColumn("day_type", when(df.data.isin(holydays), "HOLIDAY").otherwise(df.day_type))