I am facing an issue while loading data from S3. During the conversion from Parquet to CSV, a default date changes from 0001-01-03 to 001-01-01.
The source data value is getting changed.
In the Spark config, we added the two settings below, but they did not help:
spark.sql.legacy.parquet.datetimeRebaseModeInWrite CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
Does anyone have an idea?
Related
I have several CSV files in a folder; please refer to the screenshot below.
The files with '20221205' in the name are delta files that were newly uploaded to the folder today.
I want to read only these 2 delta CSV files, apply some transformations, and then append the result to an existing table.
Every day I will upload 2 files with the current date as a suffix, then run the notebook to handle only the files uploaded that day.
Question: how can I read only today's files with PySpark?
How should I load the delta
What you call delta is actually a normal CSV file with a different suffix, not to be confused with the Delta data format.
You can select the files with a glob pattern: put the date into the path string and Spark will read only the files whose names end with that date:
spark.read.csv("path/to/folder/*20221205.csv")
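To avoid editing the date by hand every day, the pattern can be built from the run date. This is only a minimal sketch: the helper names `todays_glob` and `read_todays_files` are made up for illustration, `spark` is assumed to be an existing SparkSession, and `header=True` is an assumption about the files.

```python
from datetime import date

def todays_glob(folder: str, day: date) -> str:
    """Build a glob that matches only the CSV files whose name ends with the given date."""
    return f"{folder.rstrip('/')}/*{day:%Y%m%d}.csv"

def read_todays_files(spark, folder: str, day: date):
    """Read only the files uploaded for `day`; `spark` is an existing SparkSession."""
    # header=True is an assumption about the files; drop it if they have no header row.
    return spark.read.csv(todays_glob(folder, day), header=True)

pattern = todays_glob("path/to/folder", date(2022, 12, 5))
print(pattern)  # path/to/folder/*20221205.csv
```

In the daily notebook run you would pass `date.today()` instead of a fixed date.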
However, I recommend, if possible, storing the CSV files partitioned in your file system, meaning that each date gets its own folder.
The file system will look something like:
folder
  date=2022-01-01
  date=2022-01-02
  ....
Then you can simply run:
from pyspark.sql.functions import col
spark.read.csv('folder').filter(col('date') == '2022-01-02')
The filter on the date is cheap because the data is partitioned: behind the scenes, Spark knows that rows with date = X are stored ONLY in the date=X folder, so it can skip every other folder.
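If you control the upload step, placing each day's file into its `date=...` folder could look roughly like this. It is a sketch with hypothetical names (`partition_dir`, `store_partitioned`, `sales_20220102.csv`) that uses a temporary local directory in place of your real storage.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

def partition_dir(root: Path, day: date) -> Path:
    """Return the folder for one day, e.g. root/date=2022-01-02."""
    return root / f"date={day.isoformat()}"

def store_partitioned(csv_file: Path, root: Path, day: date) -> Path:
    """Copy an uploaded CSV into the folder of its date and return the new path."""
    target_dir = partition_dir(root, day)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(csv_file, target_dir / csv_file.name))

with tempfile.TemporaryDirectory() as tmp:
    upload = Path(tmp) / "sales_20220102.csv"
    upload.write_text("id,amount\n1,10\n")
    stored = store_partitioned(upload, Path(tmp) / "folder", date(2022, 1, 2))
    relative = stored.relative_to(tmp).as_posix()
    print(relative)  # folder/date=2022-01-02/sales_20220102.csv
```

With this layout Spark derives the `date` column from the folder names, so the column does not need to exist inside the CSV files themselves.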
I am working on an ETL job in DataStage, a simple one: Source ---> Transformer ---> Destination.
The source is a CSV file and the destination is a DB2 database. The problem is that the CSV file contains a string timestamp like this:
and I need to load it into the DB2 stage. This is the table that I created with a script:
This is the Transformer configuration:
And this is my problem: this error appears,
which means this in English:
update_or_insert, 3: Unhandled conversion error in the "SEC_DAT_DATE_INSERT" zone from the source
type "timestamp" to the target type "timestamp [microseconds]":
source value = "*****************". The result does not accept a NULL value and there is no
handle_null to specify a default value
I don't know what it means; that's the problem. If anyone could help, that would be nice. Thanks.
First off, verify how Excel has handled the timestamp. Change the display format so that it conforms to ISO 8601, namely YYYY-MM-DD HH:NN:SS format, before you export it to CSV. Check the CSV file using a text editor (Notepad or Wordpad) to confirm that the timestamp format is correct.
Then change the StringToTimestamp() function so that it reflects the new format (or leave out the format entirely if this is your default format).
Note that the Else part of your expression uses a string. Perhaps you need to wrap that in a StringToTimestamp() function.
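Instead of checking the file only by eye, you could also scan it before the DataStage run. The sketch below is not part of DataStage; it is a stand-alone Python check that lists every row whose timestamp does not match YYYY-MM-DD HH:MM:SS. The column name is taken from the error message, the `ID` column and the sample rows are made up, and the line numbers assume no quoted field spans several lines.

```python
import csv
import io
from datetime import datetime

ISO_FORMAT = "%Y-%m-%d %H:%M:%S"

def find_bad_timestamps(csv_text, column, fmt=ISO_FORMAT):
    """Return (line number, value) for every row whose timestamp does not parse."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((line_no, value))
    return bad

sample = "ID,SEC_DAT_DATE_INSERT\n1,2022-12-05 10:15:00\n2,05/12/2022 10:15\n3,\n"
bad_rows = find_bad_timestamps(sample, "SEC_DAT_DATE_INSERT")
print(bad_rows)  # [(3, '05/12/2022 10:15'), (4, '')]
```

Empty values are reported too, since those are the ones that end up as NULL in a non-nullable target column.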
I suggest you check whether you have marked that column as a key in the source (this can happen by mistake). If so, deselect the key check box. Also check whether Nullable is set to Yes for that column in the source; if not, try running with Nullable set to Yes for that column. Hope this helps.
When importing a seemingly valid flat file (csv, text etc) into a SQL Server database using the SSMS Import Flat File option, the following error appears:
Microsoft SQL Server Management Studio
Error inserting data into table. (Microsoft.SqlServer.Import.Wizard)
Error inserting data into table. (Microsoft.SqlServer.Prose.Import)
Object reference not set to an instance of an object. (Microsoft.SqlServer.Prose.Import)
The target table may contain rows that imported just fine. The first row that is not imported appears to have no formatting errors.
What's going wrong?
Check the following:
that there are no blank lines at the end of the file (leaving the last line's line terminator intact) - this seems to be the most common issue
there are no unexpected blank columns
there are no badly escaped quotes
It looks like the import process loads lines in chunks. This means that the lines following the last successfully loaded chunk may appear to have no errors. You need to look at subsequent lines, that are part of the failing chunk, to find the offending line(s).
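Since the wizard does not point at the offending line, a quick scan outside SSMS can help narrow it down. This is a rough sketch with a made-up helper name and sample data: it flags blank lines and rows whose column count differs from the header. Badly escaped quotes usually, but not always, show up as a wrong column count.

```python
import csv
import io

def find_suspect_lines(csv_text, delimiter=","):
    """Return (line number, reason) for blank lines and rows with an unexpected column count."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    expected = None  # column count of the first non-blank row (the header)
    suspects = []
    for row in reader:
        line_no = reader.line_num  # last physical line consumed for this row
        if not row:
            suspects.append((line_no, "blank line"))
        elif expected is None:
            expected = len(row)
        elif len(row) != expected:
            suspects.append((line_no, f"{len(row)} columns, expected {expected}"))
    return suspects

sample = 'id,name,city\n1,"Smith, John",Oslo\n2,Ann,Paris,\n3,Bob,Rome\n\n'
suspects = find_suspect_lines(sample)
print(suspects)  # [(3, '4 columns, expected 3'), (5, 'blank line')]
```

For a real file, read it with `open(path, newline="")` and pass the text in, or adapt the function to take the file object directly.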
This cost me hours of hair pulling while dealing with large files. Hopefully this saves someone some time.
If the file you're importing is already open, SSMS will throw this error. Close the file and try again.
Make sure that when you create your flat file, IF you have text (varchar) values in any of your columns, you DO NOT choose comma "," as the delimiter. Instead, choose the vertical bar "|" or some other character that you are SURE cannot appear in those values. Commas are super common in nvarchar fields.
I had this issue and none of the recommendations from the other answers helped me!
I hope this saves someone some time; it took me hours to figure out!
None of these other ones worked for me, however this did:
When you import a flat file, SSMS gives you a brief summary of the data types within each column. Whenever you see an nvarchar on a column that actually holds int or double values, change it to int or double. And change all remaining nvarchars to nvarchar(max). This worked for me.
I've been working with CSV data for a long time. I encountered similar problems when I first started this job; however, as a novice, I couldn't work out the precise fault from the exceptions.
Here are a few things you should look at before importing anything.
Your CSV file must not be open in any software, such as Excel.
Your CSV file cells should not include commas or quotation marks.
There are no unnecessary blanks at the end of your data.
No reserved term is used as data.
In Excel, open your file and save it as a new file.
After considering all the suggestions, if anyone is still having issues, check the length of the data type for your columns. It took me hours to figure this out, but increasing the nvarchar length from (50) to (100) worked for me.
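To pick the nvarchar lengths from the data rather than by trial and error, you could measure the longest value in each column first. A minimal sketch with made-up column names and sample data; it counts characters, not bytes.

```python
import csv
import io

def max_lengths(csv_text):
    """Return the length of the longest value in each column, keyed by header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    longest = {name: 0 for name in reader.fieldnames or []}
    for row in reader:
        for name in longest:
            longest[name] = max(longest[name], len(row.get(name) or ""))
    return longest

sample = "id,comment\n1,short\n2," + "x" * 72 + "\n"
lengths = max_lengths(sample)
print(lengths)  # {'id': 1, 'comment': 72}
```

Here a 72-character value would be cut off by nvarchar(50) but fits in nvarchar(100).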
One thing that worked for me: you can change the error range to 1 in "Modify Columns".
You then get an error message that names the specific problematic line in your file instead of "ran out of memory".
I fixed these errors by playing around with the data types. For instance, I changed tinyint to smallint and smallint to int, and increased my nvarchar() lengths to reasonable values, or else set them to nvarchar(MAX). Since most real-life data has missing values, I ticked the option that allows missing values (nulls) in all columns. Everything then worked, with a warning message.
I'm trying to import a .csv into a table in a database.
All dates in the .csv are in the format dd.mm.yyyy (e.g. 18.10.2017).
I'm importing via pgAdmin and always get an invalid input error.
I've tried almost all date formatting options for the column, but without any luck.
I would rather not change the csv manually.
Can anyone help me with this?
I almost always import data into a staging table where all the columns are strings.
Then I use queries to load the final table.
This has several advantages:
It gives me much more control over how the data is transformed.
It makes it easier to debug problems -- the entire staging table can be queried to find all rows with a particular issue (for instance).
Additional validations can be performed before loading into the final table.
This is just a suggestion, but you might find that overall this takes less time.
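Here is a rough sketch of that workflow. It uses Python's built-in SQLite instead of PostgreSQL only so that it runs on its own, and the table names, column names, and sample rows are hypothetical. In PostgreSQL the conversion step would normally be written with to_date(order_date, 'DD.MM.YYYY') instead of the substr() calls.

```python
import sqlite3

# Hypothetical tables: "staging" holds raw text, "orders" is the final table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id TEXT, order_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT)")

# Step 1: load the CSV values untouched, as strings.
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [("1", "18.10.2017"), ("2", "31.12.2017"), ("3", "2017-10-18")],
)

# Step 2: validate. List every row whose date is not shaped like dd.mm.yyyy.
DDMMYYYY = "[0-3][0-9].[0-1][0-9].[0-9][0-9][0-9][0-9]"
bad_rows = conn.execute(
    "SELECT id, order_date FROM staging WHERE order_date NOT GLOB ?", (DDMMYYYY,)
).fetchall()

# Step 3: convert the good rows to ISO dates while loading the final table.
conn.execute(
    """
    INSERT INTO orders
    SELECT CAST(id AS INTEGER),
           substr(order_date, 7, 4) || '-' || substr(order_date, 4, 2)
               || '-' || substr(order_date, 1, 2)
    FROM staging
    WHERE order_date GLOB ?
    """,
    (DDMMYYYY,),
)
loaded = conn.execute("SELECT id, order_date FROM orders ORDER BY id").fetchall()
print(bad_rows)  # [('3', '2017-10-18')]
print(loaded)    # [(1, '2017-10-18'), (2, '2017-12-31')]
```

The query in step 2 is the debugging advantage mentioned above: all problem rows are visible at once instead of the import stopping at the first one.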
The DateStyle setting is probably set to MDY. You can check this by running:
show datestyle;
Although dd.mm.yyyy isn't listed as a standard input format, if you expect it to work, you will need the DateStyle to line up with the ordering here (DMY).
The date/time style can be selected by the user using the SET datestyle command, the DateStyle parameter in the postgresql.conf configuration file, or the PGDATESTYLE environment variable on the server or client.
See section "Date Order Conventions":
https://www.postgresql.org/docs/current/static/datatype-datetime.html
I am using the import and export wizard and imported a large csv file. I get the following error.
Error 0xc02020a1: Data Flow Task 1: Data conversion failed. The data
conversion for column "firms" returned status value 2 and status text "The
value could not be converted because of a potential loss of data.".
(SQL Server Import and Export Wizard)
Upon importing, I use the Advanced tab and make all of the adjustments. As for the field in question, I set it to numeric (8,0). I have since gone through this process multiple times and tried 7, 8, 9, 10, and 11, to no avail. I imported the CSV into Excel and looked at the respective column, firms. It shows no entry with more than 5 characters. I thought about making it DT_String, but I will eventually need to manipulate that column by averaging it. I have also searched for spaces or strange characters and found none.
Any other ideas?
1) Try changing the numeric precision to numeric(30,20) in both the source and destination tables.
2) Change the data type to str/wstr and adjust the output column width while importing. It will run fine. This happened to me as well while loading a large CSV file of approx. 5 GB. After the load, use the TRY_CONVERT function to convert the column back to numeric and check which values became NULL during conversion; you will then find the root cause.
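If you would rather find the offending values before loading, the same idea can be approximated outside SQL Server. This sketch is not TRY_CONVERT itself and its parsing rules differ slightly (Python's Decimal accepts forms such as 1e3, for example). The column name "firms" comes from the error message; the other column and the sample rows are made up.

```python
import csv
import io
from decimal import Decimal, InvalidOperation

def find_unconvertible(csv_text, column):
    """Return (line number, raw value) for every value that is not a plain finite number."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        raw = row.get(column)
        try:
            ok = Decimal((raw or "").strip()).is_finite()
        except InvalidOperation:
            ok = False
        if not ok:
            bad.append((line_no, raw))
    return bad

sample = 'firms,state\n12345,NY\n"1,204",CA\n,TX\n87,WA\n'
bad_values = find_unconvertible(sample, "firms")
print(bad_values)  # [(3, '1,204'), (4, '')]
```

Thousands separators and empty cells like these are easy to miss when the file is viewed in Excel.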