So I have a DataFrame object called 'df' and I'm trying to convert the 'timestamp' column into an actual readable date.
timestamp
0 1465893683657
1 1457783741932
2 1459730006393
3 1459744745346
4 1459744756375
I've tried
df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')
but this gives
timestamp
0 1970-01-01 00:24:25.893683657
1 1970-01-01 00:24:17.783741932
2 1970-01-01 00:24:19.730006393
3 1970-01-01 00:24:19.744745346
4 1970-01-01 00:24:19.744756375
which is clearly wrong, since I know the dates should be either this year or last year.
What am I doing wrong?
Solution with unit='ms':
print (pd.to_datetime(df.timestamp, unit='ms'))
0 2016-06-14 08:41:23.657
1 2016-03-12 11:55:41.932
2 2016-04-04 00:33:26.393
3 2016-04-04 04:39:05.346
4 2016-04-04 04:39:16.375
Name: timestamp, dtype: datetime64[ns]
You can drop the millisecond digits (integer-divide by 1000 and keep unit='s'), or better, use @jezrael's unit='ms'.
In [133]: pd.to_datetime(df.timestamp // 10**3, unit='s')
Out[133]:
0 2016-06-14 08:41:23
1 2016-03-12 11:55:41
2 2016-04-04 00:33:26
3 2016-04-04 04:39:05
4 2016-04-04 04:39:16
Name: timestamp, dtype: datetime64[ns]
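As a quick aside (not part of the original answers): for recent dates, the number of digits in a Unix epoch value usually tells you the unit, roughly 10 digits for seconds and 13 for milliseconds, so a single value can be spot-checked before converting the whole column:
import pandas as pd
pd.to_datetime(1465893683657, unit='ms')
# Timestamp('2016-06-14 08:41:23.657000')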
I have the following abstracted DataFrame (my original DF has more than 60 billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes between two dates, every date in between must carry the values from the earlier date. (To see this more clearly, look at Id 2.)
I know I can do this in many ways (window function, UDF, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach for this processing?
I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest running them in batches (split by record count, by date, or even by a random key). The general idea is simply to avoid processing everything at once.
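For reference, a minimal PySpark sketch of the window-function route the question mentions (assumptions: a Spark DataFrame named df with columns Id, Date, Val1, Val2, where Date is already a DateType column; whether this or the batched join wins at 60 billion rows is something you would have to benchmark):
from pyspark.sql import functions as F, Window

# One row per Id per calendar day between that Id's first and last observation
calendar = (df.groupBy("Id")
              .agg(F.min("Date").alias("start"), F.max("Date").alias("end"))
              .select("Id", F.explode(F.sequence("start", "end")).alias("Date")))

# Join the observations back on and forward-fill the gaps with the last known values
w = Window.partitionBy("Id").orderBy("Date").rowsBetween(Window.unboundedPreceding, 0)
filled = (calendar.join(df, ["Id", "Date"], "left")
                  .withColumn("Val1", F.last("Val1", ignorenulls=True).over(w))
                  .withColumn("Val2", F.last("Val2", ignorenulls=True).over(w)))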
I created the following dataframe:
import pandas as pd
import databricks.koalas as ks
df = ks.DataFrame(
{'Date1': pd.date_range('20211101', '20211110', freq='1D'),
'Date2': pd.date_range('20201101', '20201110', freq='1D')})
df
Out[0]:
       Date1      Date2
0 2021-11-01 2020-11-01
1 2021-11-02 2020-11-02
2 2021-11-03 2020-11-03
3 2021-11-04 2020-11-04
4 2021-11-05 2020-11-05
5 2021-11-06 2020-11-06
6 2021-11-07 2020-11-07
7 2021-11-08 2020-11-08
8 2021-11-09 2020-11-09
9 2021-11-10 2020-11-10
When trying to get the minimum of Date1 I get the correct result:
df.Date1.min()
Out[1]:
Timestamp('2021-11-01 00:00:00')
Also, when trying to get the minimum value of each row, the correct result is returned:
df.min(axis=1)
Out[2]:
0 2020-11-01
1 2020-11-02
2 2020-11-03
3 2020-11-04
4 2020-11-05
5 2020-11-06
6 2020-11-07
7 2020-11-08
8 2020-11-09
9 2020-11-10
dtype: datetime64[ns]
However, using the same function on the columns fails:
df.min(axis=0)
Out[3]:
Series([], dtype: float64)
Does anyone know why this is and if there's an elegant way around it?
Try this:
df.apply(min, axis=0)
Out[1]:
Date1 2021-11-01
Date2 2020-11-01
dtype: datetime64[ns]
This was indeed a bug in the code, but Koalas has since been merged into PySpark and the pandas-on-Spark API was born. More information here.
Using Spark 3.2.0 and above, one needs to replace
import databricks.koalas as ks
with
import pyspark.pandas as ps
and replace ks.DataFrame with ps.DataFrame. This completely eliminates the issue.
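For completeness, here is the same example rewritten against the pandas-on-Spark API (a sketch; requires Spark 3.2+, where, per the note above, the column-wise min works):
import pandas as pd
import pyspark.pandas as ps

df = ps.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})

df.min(axis=0)  # per the note above, this now returns the per-column minima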
I am trying to use a CASE WHEN statement like below to add 1 day to a timestamp based on the time part of the timestamp:
CASE WHEN to_char(pickup_date, 'HH24:MI') between 0 and 7 then y.pickup_date else dateadd(day,1,y.pickup_date) end as ead_target
pickup_Date is a timestamp with default format YYYY-MM-DD HH:MM:SS
My output
pickup_Date ead_target
2020-07-01 10:00:00 2020-07-01 10:00:00
2020-07-02 3:00:00 2020-07-02 3:00:00
When the hour of the day is between 0 and 7, ead_target = pickup_Date; otherwise add 1 day.
Expected output
pickup_Date ead_target
2020-07-01 10:00:00 2020-07-02 10:00:00
2020-07-02 3:00:00 2020-07-02 3:00:00
You will want to use the date_part() function to extract the hour of the day: https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_PART_function.html
Your CASE statement should work if you extract the hour from the timestamp and compare it to the range 0-7, for example something like CASE WHEN date_part(hour, y.pickup_date) BETWEEN 0 AND 7 THEN y.pickup_date ELSE dateadd(day, 1, y.pickup_date) END AS ead_target. The original version misbehaves because to_char(pickup_date, 'HH24:MI') returns a string such as '10:00' rather than a numeric hour, so comparing it to 0 and 7 does not do what you expect.
I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of val chronologically by date: for each user, take the most recent row on or before a given date; if a user has no newer row, keep the older one.
For example, if I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce this off all of you, as I have been trying to think of a simple way of doing this.
Try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
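Wrapped up as the function the question asks for (a sketch; val_sum_asof is a hypothetical name, and df is assumed to be the question's DataFrame with the date column parsed as datetime64 so the string cutoff compares correctly):
def val_sum_asof(df, as_of):
    # keep rows on or before the cutoff, take each user's latest row, then sum val
    latest = (df.loc[df.date <= as_of]
                .sort_values('date')
                .groupby('user_id')
                .last())
    return latest.val.sum()

val_sum_asof(df, '2015-03-15')   # 36
val_sum_asof(df, '2015-04-15')   # 37  (user 3's newer row replaces the older one)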
This question is far from unique, but I cannot find a way to convert the strings contained in this df column to datetime (and date-only) objects in order to use them as the index of my DataFrame.
How can I convert this string to datetime or date format to use it as the index of my df?
The format of this column in particular is as follows:
>>> data['DateTime']
0 20140101 00:00:00
1 20140101 00:00:00
3 20140101 00:00:00
4 20140101 00:00:00
5 20140101 00:00:00
6 20140101 00:00:00
7 20140101 00:00:00
8 20140101 00:00:00
9 20140101 00:00:00
10 20140101 00:00:00
Name: DateTime, Length: 3779, dtype: object
Use to_datetime to convert the strings to datetimes. You can pass a format string, but in this case it handles the format fine on its own. Then, if you want a date, call apply and use a lambda to call .date() on each datetime entry:
In [59]:
df = pd.DataFrame({'DateTime':['20140101 00:00:00']*10})
df
Out[59]:
DateTime
0 20140101 00:00:00
1 20140101 00:00:00
2 20140101 00:00:00
3 20140101 00:00:00
4 20140101 00:00:00
5 20140101 00:00:00
6 20140101 00:00:00
7 20140101 00:00:00
8 20140101 00:00:00
9 20140101 00:00:00
In [60]:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.dtypes
Out[60]:
DateTime datetime64[ns]
dtype: object
In [61]:
df['DateTime'] = df['DateTime'].apply(lambda x:x.date())
print(df)
df.dtypes
DateTime
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2014-01-01
5 2014-01-01
6 2014-01-01
7 2014-01-01
8 2014-01-01
9 2014-01-01
Out[61]:
DateTime object
dtype: object
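Since the original goal was to use the column as the index, a short follow-up sketch starting again from the raw strings in the question's 'data' frame (the explicit format string is optional, but it can speed up parsing on a long column):
import pandas as pd

# parse '20140101 00:00:00'-style strings and use the result as the index
data['DateTime'] = pd.to_datetime(data['DateTime'], format='%Y%m%d %H:%M:%S')
data = data.set_index('DateTime')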