I have a pandas dataframe with a "year" column. However, some rows have an np.NaN value due to an outer merge. The data type of the column in pandas is therefore converted to float64 instead of integer (integer cannot store NaNs?). Next, I want to store the dataframe in a PostgreSQL database. For this I use:
df.to_sql()
Everything works fine, but my PostgreSQL column is now of type "double precision" and the np.NaN values are now [null]. This all makes sense, since the input column type was float64 and not an integer type.
I was wondering if there is a way to store the results in an integer-type column while keeping the [null] values.
(integer cannot store NaNs?)
No, they cannot. If you look at the PostgreSQL numeric types documentation, you can see that the byte sizes and ranges of the integer types are completely specified, and NaN is not representable within them.
A common solution in this case is to decide, by convention, that some number logically means NaN. In your case, since the column is a year, you might choose a negative value (or just -1) for that. Before writing, you could use
df.year = df.year.fillna(-1).astype(int)
Alternatively, you can define another column as year_is_none.
Alternatively, you can store them as floats.
These solutions are listed from most to least memory-efficient.
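Putting the sentinel idea together with to_sql, a minimal sketch (the connection string and table name are assumptions; the dtype= hint just makes the INTEGER column type explicit on the PostgreSQL side):

import numpy as np
import pandas as pd
import sqlalchemy

df = pd.DataFrame({"city": ["a", "b", "c"], "year": [2018, np.nan, 2020]})
print(df["year"].dtype)  # float64, because of the NaN

# By convention, -1 means "year unknown"
df["year"] = df["year"].fillna(-1).astype(int)

# Hypothetical connection string
engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/mydb")
df.to_sql("cities", engine, if_exists="replace", index=False,
          dtype={"year": sqlalchemy.types.Integer()})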
You could also use:
df.year = df.year.fillna(-1)  # or fillna(0), if 0 works as the sentinel
I have an imported CSV file with string values.
In this file there are amounts, several of which equal 0,00.
I want to create a TotalCA column by adding several fields in my table and to convert the result to a numeric value.
I use the toDecimal function, but the values are all returned as NULL and the created column is grayed out.
I have done a lot of research and I can't find a solution. Can you help me?
Thank you
Lea
I made some example CSV data, if I understand you correctly:
As you said, some rows contain values greater than 0, and others contain "0.00" for zero values. So the rows actually contain different data types: int and decimal.
For that reason, and as I tested, none of toDecimal(), toFloat() or toDouble() works. I used a Derived Column expression to do the data conversion.
We can't keep both representations and can only choose one type. If you choose decimal or float, the integer rows would be converted to values like '11.0', which I think is also not what you want.
Source Projection: I preset the column type to double:
(Decimal can't keep '0.00'; it only returns '0'.)
In short, the only way is to use the String data type to keep the data as-is, and also to use the String data type to receive the data in the sink dataset.
HTH.
Thank you all for your answers.
Here is my CSV file
If I go to the Source Projection module and change the type of my column LFC1_UM01S to decimal, this is what I get:
Why are some values considered as NULL?
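One likely explanation for the NULLs, assuming the file really uses a comma as the decimal separator (as the "0,00" values suggest): "0,00" is not a valid decimal literal to the converter, so the conversion returns NULL. A quick way to see the same effect outside of Data Factory, sketched in Python (the file name, field separator and column name are assumptions):

import pandas as pd

# A naive conversion fails on a comma-decimal string:
try:
    float("0,00")
except ValueError as err:
    print(err)  # could not convert string to float: '0,00'

# Telling the parser about the comma decimal separator fixes it:
df = pd.read_csv("data.csv", sep=";", decimal=",")
print(df["LFC1_UM01S"].dtype)  # float64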
I have files on S3 where two columns contain only positive integers which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports values in a range up to 2^63-1 (approx 10^19). So at the moment these columns are represented as strings.
When it comes to filtering, it is not that big of an issue, as I can use a regex. For example, if I want to get all records between 5e^21 and 6e^21, my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx 300M rows (approx 12 GB in Parquet) and it takes about 7 seconds, so performance-wise it's OK.
However, sometimes I would like to perform some math operations on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach problem of having numerical records that exceed available range? Also what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?
You can cast numbers of the form 5e21 to an approximate 64-bit double or an exact numeric 128-bit decimal. First you'll need to remove the caret ^ with the replace function. Then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
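On the subtraction use case: the DECIMAL route is the exact one. Athena's DECIMAL supports up to 38 digits of precision, so 26-digit values fit, and something along the lines of CAST(col_1 AS DECIMAL(38,0)) - CAST(col_2 AS DECIMAL(38,0)) should stay exact (column names assumed from the question). A quick Python illustration of why DOUBLE is not enough, with arbitrary-precision ints standing in for DECIMAL:

a = 5 * 10**21 + 3
b = 5 * 10**21 + 2
# As 64-bit doubles, the two values collapse to the same number...
print(float(a) == float(b))  # True: the difference is below double precision
# ...while exact integer arithmetic preserves it
print(a - b)  # 1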
I want to visualise the below Excel table in Tableau.
When I add this table to Tableau, it treats the Salary values as Strings, so they appear under Dimensions and not under Measures, and I cannot make a proper graph from them.
How can I convert these Salary range values to Int?
As @Alexandru Porumb suggested, the best solution is to have a min_salary column and a max_salary column (unless you really have the actual salary available, which is even better).
If you don't want to revise the incoming data, you can get the same effect using the Split() function in a calculated field in Tableau to derive two integer fields from the original string field.
For example, you could define a calculated field called min_salary as INT(SPLIT([Salary], '-', 1)), and similarly max_salary as INT(SPLIT([Salary], '-', 2)). Split() extracts part of a string based on a separator string. Int() converts the string to an integer.
You could simplify the way Tableau sees the data by separating the salary column into Min and Max columns; that way you wouldn't have the hyphen that makes Tableau treat the entry as a string.
Simplistic idea, I know, but it may help until a better solution is provided.
Hope it helps
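If you would rather revise the incoming data instead, a minimal pandas sketch of the same split, done before Tableau sees the file (the file name and column name are assumptions, and real salary strings may need extra cleanup for currency symbols or thousands separators):

import pandas as pd

df = pd.read_excel("salaries.xlsx")  # hypothetical source file
parts = df["Salary"].str.split("-", expand=True)  # "30000-40000" -> ["30000", "40000"]
df["min_salary"] = parts[0].astype(int)
df["max_salary"] = parts[1].astype(int)
df.to_excel("salaries_split.xlsx", index=False)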
I am new to PostgreSQL (Redshift).
I am copying CSV files from S3 to Redshift, and there's an error about trying to save the number 2.35555E7 into a numeric(18,0) column. What is the right data type for this value?
thanks
numeric(18,0) implies a scale of zero, which is a way of saying no decimals -- it's a bit like a smaller bigint.
http://www.postgresql.org/docs/current/static/datatype-numeric.html
If you want to keep it as numeric, you want to use numeric instead -- with no precision or scale.
If not, just use a real or a double precision type, depending on the number of significant digits (6 vs 15, respectively) you want to keep around.
Your example data (2.35555E7) suggests you're using real, so probably try that one first.
Note: select 2.35555E7::numeric(18,0) works fine per the comments, but I assume there's some other data in your set that is causing issues.
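For intuition, a quick Python check of the example value, mirroring the note above (this only shows that the literal itself is an exact integer, not what Redshift does with it):

v = float("2.35555E7")
print(v)               # 23555500.0
print(v.is_integer())  # True: the value itself would fit numeric(18,0)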
I have built a query which contains a UNION ALL, but the two parts of it do not have the same data type. I mean, I have to display one column, but the two columns the data comes from have different formats.
For example:
select a,b
from c
union all
select d,b
from e
a and d are numbers, but they have different formats: a's length is 15 and d's length is 13. There are no digits after the decimal point.
Using digits, varchar, integer and decimal didn't work.
I always get the message : Data conversion or data mapping error.
How can i convert these fields in the same format?
I've no DB2 experience, but can't you just cast 'a' and 'd' to the same type, one that is large enough to handle both formats (e.g. DECIMAL(15,0) in both branches of the union)?
I used the CAST function to convert both columns to the same type (VARCHAR with a large enough length), so I could use the UNION without problems. When I needed their original type back again, I used the same CAST function (this time converting the values to FLOAT), and I got the result I wanted.