AWS Athena: Handling big numbers - biginteger

I have files on S3 where two columns contain only positive integers which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports values up to 2^63-1 (approx. 10^19). So at the moment these columns are represented as strings.
When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21 and 6e^21, my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx. 300M rows (approx. 12 GB in Parquet) and it takes about 7 seconds, so performance-wise it's OK.
However, sometimes I would like to perform some math operations on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach the problem of having numerical records that exceed the available range? Also, what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?

You can cast numbers of the form 5e21 to an approximate 64-bit DOUBLE or an exact 128-bit DECIMAL. First you'll need to remove the caret ^ with the replace function. Then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
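If the stored values are plain digit strings (the question describes positive integers up to 10^26), you may not need the replace step at all: Athena's DECIMAL type supports up to 38 digits of precision, which covers 10^26, so you can cast the columns directly for both exact arithmetic and range filtering. A rough sketch using the column names from the question:
SELECT CAST(col_1 AS DECIMAL(38, 0)) - CAST(col_2 AS DECIMAL(38, 0)) AS diff
FROM database.table
WHERE CAST(col_1 AS DECIMAL(38, 0))
      BETWEEN DECIMAL '5000000000000000000000'   -- 5e21
          AND DECIMAL '6000000000000000000000';  -- 6e21, an exact comparison instead of regex
And if you rewrite the table with these columns stored as DECIMAL(38, 0), the casts go away entirely.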

Related

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the outcome of the process.
You can do this, but it does take a bit of extra work in your Recipe--set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter rows on whichever single value between 1 and 1,000 you decide to keep, and your output will contain only that randomized subset. Just have the last part of the recipe remove this temporary column.
So indeed there are two approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number from a range: randbetween or rand.
These methods can be used to slice an "even" portion of your data. As suggested, derive randbetween(1,1000) in a new column, then pick a single value between 1 and 1,000 and keep only the rows that match it, since that is 1/1000 of the data (a million out of a billion).
Alternatively, if you just want a million records in your output but either don't want to rely on knowing the size of the entire table, or simply want the first million rows regardless of how many rows there are, you can use the positional row-filtering methods instead (top rows / range).
P.S.
By using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the "Search transformation" pane (accessed via Ctrl+K). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Postgres resampling time series data

I have OHLCV data of stocks stored in 1-minute increments inside Postgres.
I am trying to resample the data to a 5-minute interval, and I have used this answer to generate the following SQL query:
SELECT
avg('open') AS open,
avg('high') AS high,
avg('low') AS low,
avg('close') AS close,
avg('volume') AS volume,
avg('open_interest') AS open_interest,
to_timestamp(floor(EXTRACT(epoch FROM 'timestamp') / 300) * 300) AS interval_alias
WHERE 'symbol'='IRFC-N8' GROUP BY interval_alias
I am getting this error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) function avg(unknown) is not unique
LINE 1: SELECT avg('open') AS open, avg('high') AS high, avg('low') ...
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
Could you tell me what went wrong?
Edit 1: Code formatted for better rendering.
Edit 2: According to the answer below, I need to use double quotes around the parameter to the avg function. I am using SQLAlchemy to generate the expressions, and it is creating single-quoted strings. Here is the part of the code which generates the avg query:
cols = list()
cols.append(func.avg(self.p.open).label(self.p.open))
cols.append(func.avg(self.p.high).label(self.p.high))
cols.append(func.avg(self.p.low).label(self.p.low))
cols.append(func.avg(self.p.close).label(self.p.close))
cols.append(func.avg(self.p.volume).label(self.p.volume))
cols.append(func.avg(self.p.openinterest).label(self.p.openinterest))
seconds = self._get_seconds()
cols.append(func.to_timestamp(func.floor(func.extract("epoch", "timestamp") / seconds) * seconds).label("interval_alias"))
SQLAlchemy should have known better and used double quotes, but it's generating single quotes.
db<>fiddle
Your error is the use of single quotes ' instead of double quotes " in your AVG() calls. Single quotes mark string literals, but you want to name the columns whose average you take. So you need double quotes, or you can leave the quotes off entirely (both variants are shown in the db<>fiddle).
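For reference, with the quoting fixed the raw SQL would look roughly like this (the table name is a placeholder, since the question's query doesn't show its FROM clause):
SELECT
    avg("open")          AS "open",
    avg("high")          AS "high",
    avg("low")           AS "low",
    avg("close")         AS "close",
    avg("volume")        AS "volume",
    avg("open_interest") AS "open_interest",
    to_timestamp(floor(EXTRACT(epoch FROM "timestamp") / 300) * 300) AS interval_alias
FROM ohlcv                   -- placeholder table name
WHERE "symbol" = 'IRFC-N8'   -- the symbol value is data, so it keeps its single quotes
GROUP BY interval_alias;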
Edit (The real problem):
It seems that self.p.columnname gives just the column's name (a string), not the column itself. In SQLAlchemy, the reference to a specific column is table.c.columnname, so reference the columns through c rather than through p.
Attention:
If you average all of your data you may lose important information, such as the real minimum and maximum. You may want to aggregate with other functions such as MIN or MAX; maybe the ordered-set (WITHIN GROUP) aggregate functions could help you as well.
https://www.postgresql.org/docs/current/static/functions-aggregate.html
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
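To make that concrete: for OHLCV bars you usually want the first open, the maximum high, the minimum low, the last close and the summed volume per bucket, rather than the average of each. A sketch under the same assumptions as above (placeholder table name; an ordered array_agg is just one way to get first/last values in Postgres):
SELECT
    to_timestamp(floor(EXTRACT(epoch FROM "timestamp") / 300) * 300) AS interval_alias,
    (array_agg("open"  ORDER BY "timestamp" ASC))[1]  AS "open",   -- first open in the bucket
    max("high")                                       AS "high",
    min("low")                                        AS "low",
    (array_agg("close" ORDER BY "timestamp" DESC))[1] AS "close",  -- last close in the bucket
    sum("volume")                                     AS "volume"
FROM ohlcv                   -- placeholder table name
WHERE "symbol" = 'IRFC-N8'
GROUP BY interval_alias
ORDER BY interval_alias;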

rethinkdb: group documents by price range

I want to group documents in rethinkdb by price range (0-100, 100-200, 200-300, and so on), instead of a single price value. How do I do that?
Unfortunately, ReQL doesn't support rounding at the moment (see github issue #866), but you can get something similar through some minor annoyances.
First of all, I would recommend making this an index on the given table if you're going to be running this regularly or on large data sets. The function I have here is not the most efficient because we can't round numbers, and an index would help mitigate that a lot.
These code samples are in Python, since I didn't see any particular language referenced. To create the index, run something like:
r.db('foo').table('bar').index_create('price_range',
    lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
                            .do(lambda x: x.sub(x.mod(100)))).run()
This will create a secondary index based on the price where 0 indicates [0-100), 100 is [100-200), and so on. At this point, a group-by is trivial:
r.db('foo').table('bar').group(index='price_range').run()
If you would really rather not create an index, the mapping can be done during the group in a single query:
r.db('foo').table('bar').group(
    lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
                            .do(lambda x: x.sub(x.mod(100)))).run()
This query is fairly straight-forward, but to document what is going on:
coerce_to('STRING') - we obtain a string representation of the number, e.g. 318.12 becomes "318.12".
split('.') - we split the string on the decimal point, e.g. "318.12" becomes ["318", "12"]. If there is no decimal point, everything else should still work.
[0] - we take the first value of the split string, which is equivalent to the original number rounded down, e.g. "318".
coerce_to('NUMBER') - we convert the string back into an integer, which allows us to do modulo arithmetic on it so we can round, e.g. "318" becomes 318.
.do(lambda x: x.sub(x.mod(100))) - we round the resulting integer down to the nearest 100 by running (essentially) x = x - (x % 100), e.g. 318 becomes 300.

saving data like 2.3214E7 into postgresql

I am new to PostgreSQL (Redshift).
I am copying CSV files from S3 to Redshift and there's an error about trying to save the number 2.35555E7 into a numeric(18,0) column. What is the right data type for this datum?
Thanks
numeric (18,0) implies a scale of zero, which is a way of saying no decimals -- it's a bit like a smaller bigint.
http://www.postgresql.org/docs/current/static/datatype-numeric.html
If you want to keep it as numeric, use plain numeric instead -- with no precision or scale.
If not, just use a real or a double precision type, depending on the number of significant digits (6 vs 15, respectively) you want to keep around.
Your example data (2.35555E7) suggests the value came from a real (six significant digits), so probably try that one first.
Note: select 2.35555E7::numeric(18,0) works fine per the comments, but I assume there's some other data in your set that is causing issues.
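If it helps to see the candidate types side by side, here is a quick comparison using the literal from the question (the aliases are just for illustration):
SELECT
    CAST(2.35555E7 AS real)             AS as_real,          -- ~6 significant digits
    CAST(2.35555E7 AS double precision) AS as_double,        -- ~15 significant digits
    CAST(2.35555E7 AS numeric)          AS as_plain_numeric, -- exact; Redshift may apply a default precision/scale here
    CAST(2.35555E7 AS numeric(18, 0))   AS as_numeric_18_0;  -- exact with scale 0: 23555500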

MS SQL Float Decimal Comparison Problems

I'm in the process of normalising a database, and part of this involves converting a column from one table from a FLOAT to a DECIMAL(28,18). When I then try to join this converted column back to the source column, it returns no results in some cases.
It seems to be something to do with the way it's converted. For example, the FLOAT converted to a DECIMAL(28,18) produces:
51.051643260000006000
The original FLOAT is
51.05164326
I have tried various ways of modifying the FLOAT, and none of these work either:
CAST(51.05164326 AS DECIMAL(28,18)) = 51.051643260000000000
STR(51.05164326 , 28,18) = 51.0516432599999990
The reason for the conversion is due to improving the accuracy of these fields.
Has anyone got a consistent strategy to convert these numbers, and be able to ensure subsequent joins work?
Thanks in advance
CM
For your application, you need to consider how many decimal places you need. It looks like in reality you require about 8-14 decimal places, not 18.
One way to do the conversion is cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)).
To do a join between a decimal and float column, you can do something like this:
ON cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)) = decimalColumn
Provided the double-cast is the same double-cast used to create the decimalColumn, this will allow you to make use of an index on the decimalColumn.
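To see the effect of the double-cast on the value from the question (the exact noise digits can vary between servers, so treat the comments as illustrative):
DECLARE @f float = 51.05164326;

SELECT
    CAST(@f AS decimal(28,18))                         AS direct_cast,  -- noise appears, e.g. 51.051643260000006000 in the question
    CAST(CAST(@f AS decimal(28,14)) AS decimal(28,18)) AS double_cast;  -- rounds at 14 decimal places first to strip the float noise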
Alternatively you can use a range join:
ON floatColumn > decimalColumn - @epsilon AND floatColumn < decimalColumn + @epsilon
This should still make use of the index on decimalColumn.
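Written out with hypothetical table names and a T-SQL variable for the tolerance, that range join might look like:
DECLARE @epsilon decimal(28,18) = 0.000000005;  -- tolerance; pick a value that matches the precision you actually need

SELECT f.*, d.*
FROM dbo.FloatSide AS f                         -- hypothetical table holding floatColumn
JOIN dbo.DecimalSide AS d                       -- hypothetical table holding decimalColumn
    ON f.floatColumn > d.decimalColumn - @epsilon
   AND f.floatColumn < d.decimalColumn + @epsilon;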
However, it is unusual to join on decimals. Unless you actually need to join on them or need to do a direct equality comparison (as opposed to a range comparison), it may be better to simply do the conversion as you are and document the fact that there is a small loss of accuracy due to the initial choice of an inappropriate data type.
For more information see:
Is it correct to compare two rounded floating point numbers using the == operator?
Dealing with accuracy problems in floating-point numbers