I have this code:
import pantab
import pandas as pd
import datetime
df = pd.DataFrame([
[datetime.date(2018,2,20), 4],
[datetime.date(2018,2,20), 4],
], columns=["date", "num_of_legs"])
pantab.frame_to_hyper(df, "example.hyper", table="animals")
which causes this error:
TypeError: Invalid value "datetime.date(2018, 2, 20)" found (row 0 column 0)
Is there a fix?
This has apparently been a problem since 2020. You have to know yourself which columns are date columns, because pandas stores both dates and strings with the object dtype. There's a world where data scientists don't care about dates, but not mine, apparently. Anyway, here's the solution, awaiting the day when pandas adds a date dtype:
https://github.com/innobi/pantab/issues/100
And just to reduce it down to do what I did:
def createHyper(xlsx, filename, tablename):
    # Convert every column whose name mentions "date" to a real datetime column.
    for name in xlsx.columns:
        if 'date' in name.lower():
            xlsx[name] = pd.to_datetime(xlsx[name], errors='coerce', utc=True)
    pantab.frame_to_hyper(xlsx, filename, table=tablename)
    return open(filename, 'rb')
errors='coerce' means a value that pandas cannot parse or represent as a timestamp (for me, dates like 1/1/3000) becomes NaT instead of raising an error, which is handy in a world of SCD2 tables.
utc=True was necessary for me because my dates were timezone-sensitive; yours may not be.
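To see why the conversion is needed, here is a minimal sketch, assuming only pandas is installed (pantab is left out on purpose). The frame and the `mixed` series are made-up illustration data. It shows that a column of `datetime.date` objects starts out as object dtype, and that `errors="coerce"` turns an unparseable value into NaT rather than raising.

```python
import datetime

import pandas as pd

# Illustrative data only: a "date" column holding plain datetime.date objects.
df = pd.DataFrame(
    [[datetime.date(2018, 2, 20), 4], [datetime.date(2018, 2, 21), 2]],
    columns=["date", "num_of_legs"],
)

# pandas has no dedicated date dtype, so the column is stored as object.
dtype_before = df["date"].dtype
print(dtype_before)

# After to_datetime the column has a datetime64 dtype.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
print(df["date"].dtype)

# errors="coerce" replaces values that cannot be parsed with NaT.
mixed = pd.to_datetime(pd.Series(["2018-02-20", "not a date"]), errors="coerce")
print(mixed.isna().tolist())
```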
Here is a trivial benchmark based on a real-life workload.
import gc
import time
import numpy as np
import polars as pl
df = ( # I have a dataframe like this from reading a csv.
pl.Series(
name="x",
values=np.random.choice(
["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
),
)
.to_frame()
.with_column(
pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x"))
)
)
start = time.time()
df.lazy().with_column(
pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")
Cast then fill_null took 0.93 seconds.
gc.collect()
start = time.time()
df.lazy().with_column(
pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")
Fill_null then cast took 1.36 seconds.
(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday polars will do this rearrangement automatically? Or is it actually impossible to try all these sorts of permutations in a general query optimizer?
Thanks.
1: yes
2: Somewhat. The logical categorical representation will always be the same. The physical representation changes with the order in which the string values occur: doing fill_null before the cast means "MISSING" will be found earlier. But this should be seen as an implementation detail.
3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883
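To illustrate the logical versus physical point from answer 2, here is a toy sketch in plain Python. It is not how polars implements categoricals; `encode_categorical` and `decode` are hypothetical helpers that assign integer codes in order of first occurrence, which is enough to show the effect.

```python
def encode_categorical(values):
    """Toy dictionary encoding: each new string gets the next integer code."""
    categories = []  # code -> string, in order of first occurrence
    lookup = {}      # string -> code
    codes = []
    for value in values:
        if value not in lookup:
            lookup[value] = len(categories)
            categories.append(value)
        codes.append(lookup[value])
    return codes, categories


def decode(codes, categories):
    """Map integer codes back to their strings (the logical values)."""
    return [categories[code] for code in codes]


raw = ["ASPARAGUS", None, "BROCCOLI", None]

# Order A: fill nulls first, then encode. "MISSING" is seen early.
filled_first = ["MISSING" if v is None else v for v in raw]
codes_a, cats_a = encode_categorical(filled_first)

# Order B: encode the non-null values first, then fill nulls afterwards.
non_null_codes, cats_b = encode_categorical([v for v in raw if v is not None])
missing_code = len(cats_b)
cats_b = cats_b + ["MISSING"]
remaining = iter(non_null_codes)
codes_b = [missing_code if v is None else next(remaining) for v in raw]

print(codes_a, cats_a)  # physical representation for order A
print(codes_b, cats_b)  # physical representation for order B
print(decode(codes_a, cats_a) == decode(codes_b, cats_b))  # logical values match
```

The integer codes differ between the two orders, but decoding gives the same strings either way.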
I started in the PySpark world some time ago and I'm racking my brain over an algorithm. I want to create a function that calculates the difference in months between two dates. I know there is a function for that (months_between), but it works a little differently from what I want: I want to extract the year and month from the two dates and subtract them, ignoring the days. I can do this by manipulating the base, creating new columns with the months and subtracting them, but I want to do it as a UDF, like below:
from datetime import datetime
import pyspark.sql.functions as f
base_study = spark.createDataFrame([("1", "2009-01-31", "2007-01-31"),("2","2009-01-31","2011-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
    RecentDate = f.month("RecentDate")
    PreviousDate = f.month("PreviousDate")
    months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    return months_diff
intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())
base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A') ))
What am I doing wrong?
Another question: when I pass parameters to a UDF, are they sent one value at a time, or is the entire column passed? And is that column a series?
Thank you!
I found a solution and extended it to handle missing values too.
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType
base_study = spark.createDataFrame([("1", None, "2015-01-01"),("2","2015-01-31","2015-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
    # Each argument is a single date value (or None), not a column.
    if RecentDate is not None and PreviousDate is not None:
        months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
        return months_diff
    else:
        return None
intckSasFuncUDF = f.udf(lambda x,y:intckSasFunc(x,y) , IntegerType())
display(base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A'))))
For those who have doubts, as I had: the function handles one record at a time, as if it were a normal Python function. I couldn't use pyspark.sql functions inside the UDF (it gives an error); it seems those functions work only on PySpark columns, while inside the UDF the transformation happens row by row.
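Because the UDF body is just a normal Python function applied to one row at a time, the logic can be checked without Spark at all. A minimal sketch, assuming the UDF receives `datetime.date` values (or None) for date columns; `months_apart` is a hypothetical name for the same calculation:

```python
from datetime import date


def months_apart(recent, previous):
    """Month difference that looks only at year and month, ignoring the day."""
    if recent is None or previous is None:
        return None
    return (recent.year - previous.year) * 12 + (recent.month - previous.month)


# One "row" at a time, exactly as the UDF would see the values.
print(months_apart(date(2011, 1, 31), date(2009, 1, 31)))
print(months_apart(date(2015, 2, 1), date(2015, 1, 31)))
print(months_apart(date(2015, 1, 1), None))
```

The second call shows the difference from a day-aware calculation: the dates are one day apart, but they fall in different months, so the result is one month.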
Here's my chunk of code
y = yahoo;
% get some data for Apple
from = '2014-01-01';
to = '2016-04-01';
aapl = fetch(y,'AAPL','Adj Close', from, to);
close(y)
I get two columns - the second are the prices, the first are... Date values? But they cannot be! I expected Unix dates, but my dates start with 1455950 and a conversion gives me:
>> datetime(1455950, 'convertFrom', 'posixtime')
ans =
17-Jan-1970 20:25:50
So that clearly cannot be it. Also, what's crazy is if I get today's date, I get an even smaller value:
>> fetch(y, 'AAPL', 'Date')
ans =
Date: 736449
Can someone help me understand this madness?
As it turns out, Yahoo was simply giving out false values. All working now, and datetime(x, 'ConvertFrom', 'datenum') works perfectly, as can be seen here.
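For anyone checking such values outside MATLAB, the datenum arithmetic can be sketched in Python. This assumes the usual datenum convention (days counted from 1 January of year 0, with fractional days for the time of day); `datenum_to_datetime` is a hypothetical helper, not a library function.

```python
from datetime import datetime, timedelta

# MATLAB's datenum 1 is 1 January of year 0, while Python's ordinal 1 is
# 1 January of year 1. Year 0 counts as a leap year, hence 366 days.
MATLAB_TO_PYTHON_ORDINAL_OFFSET = 366


def datenum_to_datetime(datenum):
    """Convert a MATLAB serial date number to a naive Python datetime."""
    whole_days = int(datenum)
    fraction = datenum - whole_days
    return (datetime.fromordinal(whole_days - MATLAB_TO_PYTHON_ORDINAL_OFFSET)
            + timedelta(days=fraction))


print(datenum_to_datetime(736449))    # 2016-04-29 00:00:00
print(datenum_to_datetime(736449.5))  # 2016-04-29 12:00:00
```

So 736449 is a perfectly sensible "today" for late April 2016, which is why reading it as a Unix timestamp lands in January 1970.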
Right now I have a datetime object, but it doesn't contain all the fields. It's missing minutes, seconds, and microseconds. I want to use it to fetch data from MongoDB through Python. I'm just wondering whether pymongo will automatically fill in the missing fields and convert it into an ISODate, or whether it's going to produce an error.
EDIT:
an example:
time1='2015-12-17 23'
#assuming already imported all necessary libs
time2=datetime.strptime(time1, '%Y-%m-%d %H')
#for mongo
sq = {'$and': [something.some, {'field1':{'$ne':1}}]}
sq['$and'].append({'field2': {'$gt': time2}})
In Python, there's no such thing as a datetime object with missing fields. All the datetime fields (relevant to this question) always have values.
The way you created time2, the fields you didn't specify get the value 0, as demonstrated here:
% time2
=> datetime.datetime(2015, 12, 17, 23, 0)
% time2.minute, time2.second, time2.microsecond
=> (0, 0, 0)
% time2 == datetime(2015, 12, 17, 23, 0, 0, 0)
=> True
As demonstrated above, your time2 object is identical to the object you would have gotten if you had manually filled in the values with zeros.
Now that you see your datetime object is not missing anything, it should be clear how MongoDB treats it: pymongo encodes it as a BSON date like any other datetime, without an error.
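The same check as a small standalone script, using only the standard library. This is just a sketch of the point above, with no MongoDB connection involved:

```python
from datetime import datetime

time1 = '2015-12-17 23'
time2 = datetime.strptime(time1, '%Y-%m-%d %H')

# Fields that the format string did not mention default to zero.
print(time2.minute, time2.second, time2.microsecond)  # 0 0 0

# The parsed value equals a datetime built with explicit zeros.
print(time2 == datetime(2015, 12, 17, 23, 0, 0, 0))  # True

# The ISO string shows a complete timestamp.
print(time2.isoformat())  # 2015-12-17T23:00:00
```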
I know this is elementary but I can't seem to figure it out, even after reading other posts.
In a dataset, I want to convert an entire column into a date. The current class is factor.
The value in the field looks like this: 12/25/2012
This is what I've tried.
C$DateofDeath=as.Date(C$DateofDeath,'%m/%d/%Y')
Error in as.Date.default(C$DateofDeath, "%m/%d/%Y") :
do not know how to convert 'C$DateofDeath' to class “Date”
C$DateofDeath=as.Date(C$DateofDeath,"%m/%d/%Y")
Error in as.Date.default(C$DateofDeath, "%m/%d/%Y") :
do not know how to convert 'C$DateofDeath' to class “Date”
Claims$DateofDeath=strptime(as.character(Claims$DateofDeath),format= '%m/%d/%Y')
Error in `$<-.data.frame`(`*tmp*`, "DateofDeath", value = list(sec = numeric(0), :
replacement has 0 rows, data has 71616
Claims$DateofDeath=strptime(as.character(Claims$DateofDeath),format= "%m/%d/%Y")
Error in `$<-.data.frame`(`*tmp*`, "DateofDeath", value = list(sec = numeric(0), :
replacement has 0 rows, data has 71616
Use as.POSIXct, with a format that matches month/day/year values such as 12/25/2012:
C$DateofDeath<-as.POSIXct(as.character(C$DateofDeath), format = "%m/%d/%Y")
There are lots of R experts here but you have to specify R as one of your tags to get them to notice your question.
Looks like you have tried a bunch of combinations but not the right one.
> C <- data.frame(DateofDeath="12/25/2012",other=TRUE)
> as.Date(as.character(C$DateofDeath),format="%m/%d/%Y")
[1] "2012-12-25"
Notice that as.Date() takes a character input, not a factor. So you need to convert to character, then to Date.
Your strptime() versions seem fine to me, except that you are referring to the data frame Claims instead of C. Actually, strptime() should convert the factor to character for you, so you don't need the as.character() part with those.