I am new to PostgreSQL (Redshift).
I am copying CSV files from S3 to Redshift, and there is an error about trying to save the number 2.35555E7 into a numeric(18,0) column. What is the right data type for this value?
Thanks.
numeric (18,0) implies a scale of zero, which is a way of saying no decimals -- it's a bit like a smaller bigint.
http://www.postgresql.org/docs/current/static/datatype-numeric.html
If you want to keep it as numeric, declare the column as plain numeric instead -- with no precision or scale.
If not, just use a real or a double precision type, depending on the number of significant digits (6 vs 15, respectively) you want to keep around.
Your example data (2.35555E7) suggests you're using real, so probably try that one first.
Note: select 2.35555E7::numeric(18,0) works fine per the comments, but I assume there's some other data in your set that is causing issues.
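To find which values in a CSV really do not fit a numeric(18,0) column, a quick check outside the database can help. The following is a minimal Python sketch, assuming the values arrive as strings. `fits_numeric` is a hypothetical helper, not a Redshift or PostgreSQL function, and it only tests whether a value is representable without rounding.

```python
from decimal import Decimal

def fits_numeric(text, precision=18, scale=0):
    """Return True if `text` is exactly representable as numeric(precision, scale).

    Hypothetical helper for illustration only; it mimics the declared limits,
    not the rounding a database may apply on load.
    """
    value = Decimal(text)
    # Digits after the decimal point, ignoring trailing zeros.
    frac_digits = max(0, -value.normalize().as_tuple().exponent)
    if frac_digits > scale:
        return False
    # Digits before the decimal point must fit in precision - scale.
    int_part = abs(int(value))
    int_digits = len(str(int_part)) if int_part != 0 else 0
    return int_digits <= precision - scale

print(fits_numeric("2.35555E7"))   # 23555500 has no fractional part
print(fits_numeric("2.35555E-1"))  # 0.235555 needs a scale of 6
```

Running a check like this over the offending column should show whether the failures come from fractional values, from values with too many digits, or from something else entirely.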
We often have columns that can contain values of varying sizes. For these, I like to set the data type to VARCHAR with a size well beyond the current maximum length. For example, if I have a column where the current minimum length for a value is 10 and the maximum length is 35, I might set the data type to VARCHAR(64). My rationale is that Db2 stores the 2-byte length followed by the exact value; therefore, from a storage perspective, there is no difference between defining the data type as VARCHAR(64) and VARCHAR(35). And I don't get an error if a value with a length of 36 comes along.
Is there a nuance that I'm missing and should I not be so glib about my VARCHAR assignments?
The exact formula to calculate row length is described in the docs for CREATE TABLE. VARCHAR(64) or VARCHAR(35) should not make a difference.
Be aware that rows are stored in data pages in tablespaces. Database systems usually pre-allocate pages for performance reasons. Moreover, pages might not be completely filled, or they might be compressed. You might also have defined indexes, which require their own pages and structures. On top of that, there is metadata in the system catalog.
In PostgreSQL, I have a column with people's height in meters. If the height is, say, 1.75 m, it shows properly, but if the height is 1.70 m, it shows as 1.7. I would like to have this already formatted to two decimal places, showing as 1.70, without formatting in each and every SQL call. Can I specify this in the table creation? Or a stored procedure, or something? I've seen a few things about timestamps, but not for real fields. Knowing how to format the decimal point as a comma (1,70) would be a plus.
Basically, presentation and "cosmetics" are the job of the application, not the database.
Having a default number of decimal places for floats would also create a problem, because the data returned by the database would not be the actual data in the column. So if you did a SELECT and it returned a value of 1.75, then if you searched for this value, you might not find it because the actual value stored was not 1.75 but 1.7499999999 and it was only rounded for display.
Potential solutions:
If you want to store a specified number of digits, use NUMERIC. This will solve the 1.7499999999 problem above. If you use NUMERIC, when doing a SELECT you get the actual contents of the column.
In your app, if you use an ORM, use a Decimal (or similar) type for the column with the appropriate settings so it displays the way you want.
Or create a view with the format applied to the column, but in this case if you want the trailing zero, the type will be text and not float, and it will not be searchable unless you create an extra index on it.
Or use a generated column with the number formatted as you want, which may be easier than a view.
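The difference between a binary float and a fixed-scale decimal, and the kind of formatting an application would apply, can be sketched in Python. This is only an illustration of the points above using the standard `decimal` module; the name `format_height` is hypothetical.

```python
from decimal import Decimal

def format_height(value, decimal_sep="."):
    """Format a height with exactly two decimal places (hypothetical helper)."""
    text = f"{value:.2f}"
    return text.replace(".", decimal_sep)

# A binary float cannot store 1.70 exactly; printing more digits shows the approximation.
print(f"{1.70:.20f}")

# A decimal with two places keeps its trailing zero, much like a NUMERIC column with scale 2.
print(Decimal("1.70"))            # 1.70

# Presentation is applied at display time, leaving the stored value untouched.
print(format_height(1.7))         # 1.70
print(format_height(1.7, ","))    # 1,70
```

Keeping the stored value numeric and formatting only on output means comparisons and searches still work on the real data.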
I have files on S3 where two columns contain only positive integers, which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports values in a range up to 2^63-1 (approx. 10^19). So at the moment these columns are represented as strings.
When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21 and 6e^21 my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx. 300M rows (approx. 12 GB in Parquet) and it takes about 7 seconds, so performance-wise it is OK.
However, sometimes I would like to perform some math operation on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach the problem of having numerical records that exceed the available range? Also, what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?
You can cast numbers of the form 5e21 to an approximate 64bit double or an exact numeric 128bit decimal. First you'll need to remove the caret ^, with the replace function. Then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
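The approximation error mentioned in the question, and why a 38-digit decimal avoids it, can be illustrated outside Athena. The following Python sketch uses made-up values near 10^26 and the standard `decimal` module with a 38-digit context as a stand-in for a 128-bit decimal; it is not Athena code.

```python
from decimal import Decimal, Context

# Two made-up values around 10^26 that differ by exactly 1, held as strings
# the way the columns are stored today.
a_text = str(10**26 + 1)
b_text = str(10**26)

# A 64-bit double keeps roughly 15-17 significant digits, so the difference is lost.
print(float(a_text) - float(b_text))   # 0.0

# A 38-digit decimal context has room for all 27 digits, so subtraction is exact.
ctx = Context(prec=38)
diff = ctx.subtract(Decimal(a_text), Decimal(b_text))
print(diff)                            # 1
```

The same reasoning suggests that values up to 10^26 fit comfortably in a 38-digit decimal, so arithmetic on the cast columns should stay exact.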
I have a pandas dataframe with a "year" column. However, some rows have a np.NaN value due to an outer merge. The data type of the column in pandas is therefore converted to float64 instead of integer (integer cannot store NaNs?). Next, I want to store the dataframe in a PostgreSQL database. For this I use:
df.to_sql()
Everything works fine, but my PostgreSQL column is now of type "double precision" and the np.NaN values are now [null]. This all makes sense, since the input column type was float64 and not an integer type.
I was wondering if there is a way to store the results in an integer type column with [nans].
Example Notebook
Result of Ami's answer:
(integer cannot store NaNs?)
No, they cannot. If you look at the PostgreSQL numeric documentation, you can see that the number of bytes and the ranges are completely specified, and integers cannot store this.
A common solution in this case is to decide, by convention, that some number is logically a nan. In your case, if it is year, you might choose a negative value (or just -1) as that. Before writing, you could use
df.year = df.year.fillna(-1).astype(int)
Alternatively, you can define another column as year_is_none.
Alternatively, you can store them as floats.
These solutions range from most efficient, to least efficient in terms of memory.
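A further option, not mentioned in the answer above, is to convert the NaN values to NULL and keep the column as an integer type, since an integer column can hold NULL even though it cannot hold NaN. The sketch below shows the idea with Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the table name `people` and the helper `year_or_none` are hypothetical.

```python
import math
import sqlite3

def year_or_none(value):
    """Map a float year to int, and NaN to None (written to SQL as NULL)."""
    if value is None or math.isnan(value):
        return None
    return int(value)

years = [1999.0, float("nan"), 2015.0]   # float column produced by an outer merge

conn = sqlite3.connect(":memory:")       # stand-in for a PostgreSQL connection
conn.execute("CREATE TABLE people (year INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO people (year) VALUES (?)",
    [(year_or_none(y),) for y in years],
)
rows = [r[0] for r in conn.execute("SELECT year FROM people ORDER BY rowid")]
print(rows)   # [1999, None, 2015]
```

pandas also offers a nullable integer dtype, `Int64`, which may achieve the same thing before calling `to_sql`; how it is mapped to a database column type depends on the pandas version, so check the result.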
You should use this:
df.year = df.year.fillna(-1)  # or fillna(0)
I am supporting an ETL process that transforms flat-file inputs into a SQL Server database table. The code is almost 100% T-SQL and runs inside the DB. I do not own the code and cannot change the workflow. I can only help configure the "translation" SQL that takes the file data and converts it to table data (more on this later).
Now that the disclaimers are out of the way...
One of our file providers recently changed how they represent a monetary amount from '12345.67' to '12,345.67'. Our SQL that transforms the value looks like SELECT FLOOR( CAST([inputValue] AS DECIMAL(24,10))) and no longer works. I.e., the comma breaks the cast.
Given that I have to store the final value as a DECIMAL(24,10) data type (yes, I realize the FLOOR wipes out all post-decimal-point precision - the designer was not in sync with the customer), what can I do to cast this string efficiently?
Thank you for your ideas.
Try using REPLACE (Transact-SQL):
SELECT REPLACE('12,345.67',',','')
OUTPUT:
12345.67
so it would be:
SELECT FLOOR( CAST(REPLACE([input value],',','') AS DECIMAL(24,10)))
This works for me:
DECLARE @foo NVARCHAR(100)
SET @foo='12,345.67'
SELECT FLOOR(CAST(REPLACE(@foo,',','') AS DECIMAL(24,10)))
This is probably only valid for collations/cultures where the comma is not the decimal separator (unlike, for example, Spanish).
While not necessarily the best approach for my situation, I wanted to leave a potential solution for future use that we uncovered while researching this problem.
It appears that the SQL Server data type MONEY can be used as a direct cast for strings with a comma separating the non-decimal portion. So, where SELECT CAST('12,345.56' AS DECIMAL(24,10)) fails, SELECT CAST('12,345.56' AS MONEY) will succeed.
One caveat is that the MONEY datatype has a precision of 4 decimal places and would require an explicit cast to get it to DECIMAL, should you need it.
SELECT FLOOR (CAST(REPLACE([inputValue], ',', '') AS DECIMAL(24,10)))
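For checking the expected result outside the database, the same strip-then-cast-then-floor logic can be sketched in Python with the standard `decimal` module. The function name `floor_amount` is hypothetical, and like the REPLACE approach it assumes the comma is a thousands separator and the period is the decimal separator.

```python
from decimal import Decimal, ROUND_FLOOR

def floor_amount(text):
    """Strip thousands separators, parse exactly, and floor (hypothetical helper)."""
    cleaned = text.replace(",", "")
    return Decimal(cleaned).to_integral_value(rounding=ROUND_FLOOR)

print(floor_amount("12,345.67"))   # 12345
print(floor_amount("12345.67"))    # 12345
print(floor_amount("-1,234.50"))   # -1235
```

Both the old and the new file formats give the same result, so a transformation written this way should keep working if the provider switches back.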