I am dumping a Postgres table to CSV using a COPY command.
The CSV contains timestamps formatted like this: 2011-01-01 12:30:10.123456+00.
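For reference, the dump is produced with something along these lines (my_table is just a placeholder):

-- Dump the whole table as CSV; timestamptz columns come out in the format shown above
COPY (SELECT * FROM my_table) TO STDOUT WITH (FORMAT CSV, HEADER);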
I'm reading the CSV like
df = spark.read.csv(
    "s3://path/to/csv",
    inferSchema=True,
    timestampFormat="yyyy-MM-dd HH:mm:ss.SSSSSSX",
    ...
)
but this doesn't work (as expected). The timestampFormat option uses java.text.SimpleDateFormat-style patterns, which have no support for fractional seconds beyond milliseconds.
I've tried a lot of variations on timestampFormat, and they all either produce String columns or mangle the timestamp. It seems the six-digit microseconds get parsed as milliseconds, overflow into the seconds, and add time to my timestamp.
I can't apply a schema to the CSV because I don't always know it, and I can't cast the columns because I don't always know which will be timestamps. I also can't cast the timestamp on the way out of Postgres, because I'm just doing select * ....
How can I solve this so I can ingest the CSV with the proper timestamp format?
My first thought was that I just needed to modify timestampFormat, but that doesn't seem to be possible. My second thought is to use sed to truncate the timestamps as I dump from Postgres.
I'm using Spark 2.3.1.
Thanks for the help!
Related
I'm trying to import data from a large CSV into MongoDB using MongoImport. The date column type is giving me problems.
The CSV has the date in epoch time, but I want it saved into MongoDB as a normal date type.
I'm using --columnsHaveTypes with --fieldFile, but I can't figure out or find any answers anywhere for how to convert the date format on import.
The documentation makes it seem like I can use the Go Lang format, e.g. columnName.date(1136239445) (that's the reference time in the documentation), but that won't work. And I can't find any help on date_ms(<arg>) or date_oracle(<arg>).
As much as possible, this needs to be a hands-off operation because the large DB dump (SQLite3 format) will be automatically converted to CSV and imported to MongoDB without human input.
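If there's no way to convert on import, my fallback would be to convert the epoch column to a plain date string during the SQLite3-to-CSV step, so mongoimport only ever sees a normal date. A rough sketch, assuming the column is called event_time and holds epoch seconds (both of those are placeholders/guesses):

-- Emit the epoch column as a text date (e.g. 2006-01-02 22:04:05) while generating the CSV
SELECT id,
       datetime(event_time, 'unixepoch') AS event_time
FROM events;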
Thanks in advance!
I'm trying to create a table to import data from Kaggle (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset) using SQL Shell. I'm having problems with the date import.
I've altered the date to the correct format yyyy-mm-dd in Excel and saved it as .csv, but when I try to copy in the data (using https://www.postgresqltutorial.com/import-csv-file-into-posgresql-table/ as a guide) it's recognising it as an integer. How can I overcome this?
I know I can enter a date in inverted commas, but I can't do that manually for 70K entries.
I usually use 'yyyy-MM-ddThh:mm:ssZ' and it works fine for timestamps. Be careful with the 24-hour (HH) versus 12-hour (hh) format.
I'd like to import some data into a Redshift database using COPY. For reasons passing understanding one of the columns in the data is a timestamp that's given in seconds since 2000-01-01 00:00:00. Is there any way to turn these into proper timestamps on import?
Unfortunately, you cannot transform data in a Redshift COPY load. I think you will have to stage these to a load table and then do the transform in the insert to the final table.
Worth noting though that you could do this if they had used the standard Unix epoch (seconds since 1970-01-01 00:00:00) by adding TIMEFORMAT 'epochsecs' to your COPY.
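A minimal, untested sketch of that staged-load approach, with placeholder table and column names (the seconds-since-2000 column is called secs_since_2000 here):

-- 1. Load the raw CSV into a staging table
CREATE TABLE staging_events (id BIGINT, secs_since_2000 BIGINT);

COPY staging_events
FROM 's3://<bucket>/<prefix>/'
IAM_ROLE '<your-iam-role-arn>'
CSV;

-- 2. Convert seconds-since-2000 to a real timestamp on the way into the final table
INSERT INTO events (id, event_ts)
SELECT id,
       DATEADD(second, secs_since_2000, TIMESTAMP '2000-01-01 00:00:00')
FROM staging_events;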
I have a large dataset, and one of its fields looks like Wed Sep 15 19:17:44 +0100 2010. I need to load that field into Hive.
I'm having trouble choosing a data type. I tried both timestamp and date, but I get null values when loading from the CSV file.
The data type is a String, as it is text. If you want to convert it, I would suggest a TIMESTAMP. However, you will need to do this conversion yourself while loading the data or (even better) afterwards.
To convert to a timestamp, you can use the following syntax:
CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(<date_column>,'FORMAT')) as TIMESTAMP)
Your format seems complex though. My suggestion is to load it as a string and then just do a simple query on the first record until you get it working.
SELECT your_column AS string_representation,
       CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(your_column, 'FORMAT')) AS TIMESTAMP) AS timestamp_representation
FROM your_table
LIMIT 1
You can find more information on the format here: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
My advice would be to concatenate some substrings first and try to convert only the day, month, and year parts before you look at the time and timezone.
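If it helps as a starting point: for a value like Wed Sep 15 19:17:44 +0100 2010, the pattern should be something close to 'EEE MMM dd HH:mm:ss Z yyyy'. Treat that as a guess and sanity-check it against the literal sample first, e.g.:

-- Quick check of the guessed pattern against the sample value
SELECT CAST(FROM_UNIXTIME(UNIX_TIMESTAMP('Wed Sep 15 19:17:44 +0100 2010', 'EEE MMM dd HH:mm:ss Z yyyy')) AS TIMESTAMP) AS parsed_ts;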
I am reading a CSV file with date fields formatted as mm/dd/yyyy. I expected the same kind of format from the Postgres table after the import, but I see yyyy-mm-dd hh:mm:ss.
The date fields in my table are defined as timestamp without time zone data type.
How do I maintain the same format of data? I am using PostgreSQL 9.3.
PostgreSQL only stores the value; it doesn't store the formatting (which would waste space).
You can use the to_char function in your query if you'd like the output formatted in a particular way. Details are in the manual.
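For example, assuming the timestamp column is called created_at (just a placeholder), something like:

-- Render the stored timestamp as mm/dd/yyyy on output only; the stored value is unchanged
SELECT to_char(created_at, 'MM/DD/YYYY') AS created_at_formatted
FROM my_table;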