rethinkdb: group documents by price range

I want to group documents in rethinkdb by price range (0-100, 100-200, 200-300, and so on), instead of a single price value. How do I do that?

Unfortunately, ReQL doesn't support rounding at the moment (see GitHub issue #866), but you can work around it with a little extra effort.
First of all, I would recommend making this an index on the given table if you're going to be running this regularly or on large data sets. The function I have here is not the most efficient because we can't round numbers, and an index would help mitigate that a lot.
These code samples are in Python, since I didn't see any particular language referenced. To create the index, run something like:
r.db('foo').table('bar').index_create('price_range',
    lambda row: row['price'].coerce_to('STRING').split('.')[0]
                .coerce_to('NUMBER')
                .do(lambda x: x.sub(x.mod(100)))).run()
This will create a secondary index based on the price where 0 indicates [0-100), 100 is [100-200), and so on. At this point, a group-by is trivial:
r.db('foo').table('bar').group(index='price_range').run()
If you would really rather not create an index, the mapping can be done during the group in a single query:
r.db('foo').table('bar').group(
    lambda row: row['price'].coerce_to('STRING').split('.')[0]
                .coerce_to('NUMBER')
                .do(lambda x: x.sub(x.mod(100)))).run()
This query is fairly straightforward, but to document what is going on:
coerce_to('STRING') - we obtain a string representation of the number, e.g. 318.12 becomes "318.12".
split('.') - we split the string on the decimal point, e.g. "318.12" becomes ["318", "12"]. If there is no decimal point, everything else should still work.
[0] - we take the first value of the split string, which is equivalent to the original number rounded down, e.g. "318".
coerce_to('NUMBER') - we convert the string back into an integer, which allows us to do modulo arithmetic on it so we can round, e.g. "318" becomes 318.
.do(lambda x: x.sub(x.mod(100))) - we round the resulting integer down to the nearest 100 by running (essentially) x = x - (x % 100), e.g. 318 becomes 300.

Related

AWS Athena: Handling big numbers

I have files on S3 where two columns contain only positive integers which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports values in a range up to 2^63-1 (approx 10^19). So at the moment these columns are represented as strings.
When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21 and 6e^21 my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx 300M rows (approx 12GB in Parquet) and it takes about 7 seconds, so performance-wise it's OK.
However, sometimes I would like to perform math operations on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach the problem of having numerical records that exceed the available range? Also, what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?
You can cast numbers of the form 5e^21 to an approximate 64-bit double or an exact 128-bit decimal. First you'll need to remove the caret ^ with the replace function. Then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
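For the subtraction you mention, since the stored values are plain integer strings and Athena's DECIMAL goes up to 38 digits of precision, casting the columns directly should be enough for exact arithmetic. A sketch only, reusing col_1, col_2 and database.table from your question (any other columns omitted):
SELECT CAST(col_1 AS DECIMAL(38,0)) - CAST(col_2 AS DECIMAL(38,0)) AS diff
FROM database.table;
If you do rewrite the table, a CTAS statement along these lines could persist the converted columns once (the target table name and the choice of Parquet here are just placeholders):
CREATE TABLE database.table_decimal
WITH (format = 'PARQUET') AS
SELECT CAST(col_1 AS DECIMAL(38,0)) AS col_1,
       CAST(col_2 AS DECIMAL(38,0)) AS col_2
FROM database.table;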

Rounding to a variable number of decimal places in SSRS

I am trying to find a way to round a field in SSRS to a dynamic number of decimal places. I know I can format it dynamically, and it may eventually come to that, but many of my users are going to take this report directly to Excel and are going to want to have actual numeric fields.
My t-SQL code includes these declared variables:
NumLong01 DECIMAL(23,8)
, NumLongDP01 INTEGER
The first set of entries in this table is for headers and rounding parameters. So I add values for these two as:
NULL
,4
and then I add the actual table values as:
543210987654321.87654321
,NULL
That way I can put a whole series of numbers into the table but they all have to be formatted the same way.
Running this query yields:
When I go to ReportBuilder, my field has this expression:
=Fields!NumLong01.Value
If I want to format a certain number of decimal places, I can just do this:
=Round(Fields!NumLong01.Value,2) or some such. What I tried to do, though, was to make it dynamic:
=Round(Fields!NumLong01.Value,First(Fields!NumLongDP01.Value, "DataSet1"))
This ended up rounding to 0 decimal places. I subsequently learned--by just using the second half in my field--that this was a NULL value. So I tried Sum instead of First--again, just in my field--and got the 4 that I expected. Great, so now I had my number, and I just put that in as my rounding:
=Round(Fields!NumLong01.Value,Sum(Fields!NumLongDP01.Value, "DataSet1"))
Only problem is, this yields an error. Next I asked myself if maybe it wasn't seeing this as a number for some reason. So I just added it onto my field. No problems. So I really don't know what it's doing. Is it thinking that this field might become so long that it will round to an illegal number of decimal places?
Now, I can do this:
=IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 8,Round(Fields!NumLong01.Value,8),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 7,Round(Fields!NumLong01.Value,7),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 6,Round(Fields!NumLong01.Value,6),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 5,Round(Fields!NumLong01.Value,5),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 4,Round(Fields!NumLong01.Value,4),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 3,Round(Fields!NumLong01.Value,3),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 2,Round(Fields!NumLong01.Value,2),IIf(Sum(Fields!NumLongDP01.Value, "DataSet1") = 1,Round(Fields!NumLong01.Value,1),Round(Fields!NumLong01.Value,0)))))))))
...and that works. But it seems like such a ridiculous way to go about it.
I'm also comfortable passing only rounded numbers out of t-SQL. But then I run into the problem of showing only a certain number of decimals on the report, because in the number formatting it doesn't allow for a dynamic number of decimal places for some reason.
Any ideas would be appreciated.
This isn't an exhaustive list of ways to accomplish dynamic rounding or number formatting, as you can also achieve this using custom code in the report or by adapting your dataset's SQL query.
Using Rounding:
The first set of entries in this table is for headers and rounding parameters. That way I can put a whole series of numbers into the table but they all have to be formatted the same way.
To avoid building expressions in your report that require aggregate functions such as First and Sum, and generating a blank row that you then have to remove, consider just entering the number of decimal places for every row instead of using a header row. The costs (storage and expression evaluation) are low even if it seems redundant.
This means that instead of using: =Round(Fields!NumLong01.Value,First(Fields!NumLongDP01.Value, "DataSet1")) you can use =Round(Fields!NumLong01.Value,Fields!NumLongDP01.Value) either as an expression or as a calculated field in DataSet1 or whatever your dataset is called.
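For illustration, a minimal T-SQL sketch of that per-row shape (the table variable is hypothetical and the second sample row is made up; the first value comes from your question):
DECLARE @Nums TABLE (NumLong01 DECIMAL(23,8), NumLongDP01 INT);
INSERT INTO @Nums (NumLong01, NumLongDP01)
VALUES (543210987654321.87654321, 4),
       (12345.67800000, 4);
SELECT NumLong01, NumLongDP01 FROM @Nums;
Every row then carries its own rounding parameter, so the report expression needs no aggregate function or scope argument.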
Using Number Formatting:
But then I run into the problem of showing only a certain number of decimals on the report, because in the number formatting it doesn't allow for a dynamic number of decimal places for some reason.
You can define custom formatting for the NumLong01 field in the report and make it dynamic using an expression to build your custom formatting string.
Open the Text Box Properties for the NumLong01 textbox or tablix field
Open Number tab and select Custom from the Category list
Click the fx button and use the following expression ="0." + StrDup(First(Fields!NumLongDP01.Value, "DataSet1"), "0")
Using your example data, this expression would produce the custom formatting string 0.0000 which changes 543210987654321.87654321 to 543210987654321.8765. For your information, StrDup duplicates the specified string X number of times.
In cases where the fractional part of the number is less than the decimal precision required, this formatting string will pad it with 0s. If that's not desired, change the string to be duplicated to "#" like so: StrDup(First(Fields!NumLongDP01.Value, "DataSet1"), "#").
You can also use this method as a calculated field in the dataset but only if you have removed the header row and are entering the decimal places for every row as mentioned earlier. This is because you can't use the aggregate function in the calculated field expression.
To do this, add a calculated field to your dataset with the expression: =Format(Fields!NumLong01.Value, "0." + StrDup(Fields!NumLongDP01.Value, "0"))

MS SQL Float Decimal Comparison Problems

I'm in the process of normalising a database, and part of this involves converting a column from one table from a FLOAT to a DECIMAL(28,18). When I then try to join this converted column back to the source column, it returns no results in some cases.
It seems to be something to do with the way it's converted. For example, the FLOAT converted to a DECIMAL(28,18) produces:
51.051643260000006000
The original FLOAT is
51.05164326
I have tried various ways of modifying the FLOAT, and none of these work either:
CAST(51.05164326 AS DECIMAL(28,18)) = 51.051643260000000000
STR(51.05164326 , 28,18) = 51.0516432599999990
The reason for the conversion is due to improving the accuracy of these fields.
Has anyone got a consistent strategy to convert these numbers, and be able to ensure subsequent joins work?
Thanks in advance
CM
For your application, you need to consider how many decimal places you actually need. It looks like in reality it's about 8-14 decimal places, not 18.
One way to do the conversion is cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)).
To do a join between a decimal and float column, you can do something like this:
ON cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)) = decimalColumn
Provided the double-cast is the same double-cast used to create the decimalColumn, this will allow you to make use of an index on the decimalColumn.
Alternatively you can use a range join:
ON floatColumn > decimalColumn - #epsilon AND floatColumn < decimalColumn + #epsilon
This should still make use of the index on decimalColumn.
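Putting the two approaches together, here is a sketch with hypothetical table names (SourceTable holding floatColumn, TargetTable holding decimalColumn) and #epsilon written as a T-SQL variable you would tune to your data:
-- exact join via the same double-cast used to populate decimalColumn
SELECT s.floatColumn, t.decimalColumn
FROM dbo.SourceTable AS s
JOIN dbo.TargetTable AS t
  ON CAST(CAST(s.floatColumn AS DECIMAL(28,14)) AS DECIMAL(28,18)) = t.decimalColumn;

-- range join with a small tolerance
DECLARE @epsilon DECIMAL(28,18) = 0.00000001;
SELECT s.floatColumn, t.decimalColumn
FROM dbo.SourceTable AS s
JOIN dbo.TargetTable AS t
  ON s.floatColumn > t.decimalColumn - @epsilon
 AND s.floatColumn < t.decimalColumn + @epsilon;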
However, it is unusual to join on decimals. Unless you actually need to join on them or need to do a direct equality comparison (as opposed to a range comparison), it may be better to simply do the conversion as you are, and document the fact that there is a small loss of accuracy due to the initial choice of an inappropriate data type.
For more information see:
Is it correct to compare two rounded floating point numbers using the == operator?
Dealing with accuracy problems in floating-point numbers

Solr date field tdate vs date?

So I have a question about Solr's date field types which is pretty straightforward: what's the difference between a 'date' field and a 'tdate' one?
The schema.xml claims that 'For faster range queries, consider the tdate type' and 'A Trie based date field for faster date range queries and date faceting.'
Fair enough... but what's the precisionStep="6" all about? Should I change it? Does it change the way I would create the query if I use tdate? What's the real advantage, or what does Solr do that makes it better?
P.S. I went through Google, the Solr manual, the Solr wiki and the Java docs without any luck, so I'd appreciate a kind and explanatory answer :)...
Also checked:
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
http://web.archiveorange.com/archive/v/AAfXfqRYyLnDFtskmLRi
Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens.
12345678
123456xx
1234xxxx
12xxxxxx
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.
12345678
12345xxx
12xxxxxx
A precision step of 4:
12345678
1234xxxx
A precision step of 1:
12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 26 entries (1250, 1251, 1252, ..., 1275) and combine the search results. With a trie field (and a precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 26 individual entries.
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token, and a precision step of 8 would trim two hex digits. The precisionStep="6" you're asking about works the same way: each extra token masks 6 more low-order bits, so a 64-bit date value is indexed as roughly 64/6 ≈ 11 tokens.
Basically trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index can grow to get the performance benefits. To quote from the link you are referring to:
More importantly, it is not dependent on the index size, but instead the precision chosen.
and
the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed
Your best bet is to just look at the source code. Some of the things for Solr aren't well documented and the fastest way to get a trustworthy answer is to simply look at the code. If you haven't been in the code yet, that too is to your benefit. At least in the long run.
Here's a link to the TrieTokenizerFactory.
http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/analysis/TrieTokenizerFactory.java?format=ok
The javadoc in the class at least hints at the purpose of the precisionStep. You could dig further.
EDIT: I dug a bit further for you. It's passed off directly to Lucene's NumericTokenStream class, which uses the value when parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between size in the index and speed.

Sql query to populate iPhone quick-access-bar (count rows by first letter)

I have a sqlite database, with potentially tens of thousands of rows, which is exposed as a UITableView on the iPhone. There are many columns in the relevant table, but for now I only care about a 'name' column. (The database is in a server app; the iPhone app is a client, talking over a socket.) Naturally, to make this usable, I want to support the 'quick access bar', and hence to count the number of rows that start with A, B, C .. Z (and ideally handle the rows that start with an awkward character such as # or a digit).
I'm having trouble formulating a SQL query (supported by SQLite) to count the rows starting with a given letter; if necessary I could make 26 separate queries, but I wonder if there's some nested query magic to compute all the values at once. (Obviously COUNT(*) with WHERE name LIKE 'A%' would work, repeated twenty-six times, but it feels very crude, and wouldn't handle digits or symbols.)
Relevant facts: modifications to the database are infrequent, and I can easily cache the query, and refresh it when the DB is modified. The DB is much larger than the RAM on the server device, so avoiding paging the whole DB file would be good. There is an index on the relevant 'name' column of the DB already. Performance is more important than brevity, so if separate queries is faster than one complex query, that's fine.
Use substr(X,Y,Z) instead of LIKE.
The substr(X,Y,Z) function returns a substring of input string X that begins with the Y-th character and which is Z characters long. If Z is omitted then substr(X,Y) returns all characters through the end of the string X beginning with the Y-th. The left-most character of X is number 1. If Y is negative then the first character of the substring is found by counting from the right rather than the left. If Z is negative then the abs(Z) characters preceding the Y-th character are returned. If X is a string then character indices refer to actual UTF-8 characters. If X is a BLOB then the indices refer to bytes.
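For example, a single grouped query along these lines returns all the counts at once (the table name items is hypothetical, and anything that doesn't start with a letter is bucketed under '#'):
SELECT CASE
         WHEN UPPER(substr(name, 1, 1)) BETWEEN 'A' AND 'Z'
           THEN UPPER(substr(name, 1, 1))
         ELSE '#'
       END AS bucket,
       COUNT(*) AS row_count
FROM items
GROUP BY bucket
ORDER BY bucket;
Because only the name column is referenced, SQLite may be able to answer this from the existing index on name without paging in the rest of the table, though that is up to the query planner; the result is small enough to cache and refresh whenever the database is modified, as you suggest.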