Solr date field tdate vs date?

So I have a question about Solr's date field types which is pretty straightforward: what's the difference between a 'date' field and a 'tdate' one?
The schema.xml claims that 'For faster range queries, consider the tdate type' and 'A Trie based date field for faster date range queries and date faceting.'
Fair enough... but what's the precisionStep="6" all about? Should I change it? Does it change the way I would create the query if I use tdate? What's the real advantage, or what does Solr do that makes it better?
P.S. I went through Google, the Solr manual, the Solr wiki and the Javadocs without any luck, so I'd appreciate a kind and explanatory answer :)...
Also checked:
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
http://web.archiveorange.com/archive/v/AAfXfqRYyLnDFtskmLRi

Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens.
12345678
123456xx
1234xxxx
12xxxxxx
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 more digits to create each extra token. If I were to use a precision step of 3, I would get these tokens:
12345678
12345xxx
12xxxxxx
A precision step of 4:
12345678
1234xxxx
A precision step of 1:
12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
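To make this concrete, here's a minimal Python sketch of the token generation, using decimal digits as in the example above (the function name and digit masking are illustrative only; real trie fields work on bits):

def trie_tokens(number, precision_step):
    # Emit the exact value plus progressively coarser "range" tokens,
    # masking precision_step more trailing digits each time.
    s = str(number)
    tokens = [s]
    masked = precision_step
    while masked < len(s):
        tokens.append(s[:-masked] + 'x' * masked)
        masked += precision_step
    return tokens

print(trie_tokens(12345678, 2))
# ['12345678', '123456xx', '1234xxxx', '12xxxxxx']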
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 26 entries (1250, 1251, 1252, ..., 1275) and combine the search results. With a trie field (and a precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 26 individual entries.
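Here's a hedged sketch of that decomposition, again in decimal with a precision step of 1 (my own helper, not Lucene's actual algorithm): peel off unaligned values at both edges of the range, then move up one level of granularity.

def range_terms(lo, hi):
    # Split [lo, hi] into the fewest aligned blocks; a (start, size) pair
    # of size 10 stands for a precomputed token like 125x.
    terms = []
    block = 1
    while lo <= hi:
        # Values at the low edge that don't align to the next level
        # must be fetched at this granularity.
        while lo % (block * 10) != 0 and lo <= hi:
            terms.append((lo, block))
            lo += block
        # Likewise for the high edge.
        while (hi + 1) % (block * 10) != 0 and lo <= hi:
            terms.append((hi - block + 1, block))
            hi -= block
        block *= 10
    return terms

print(sorted(range_terms(1250, 1275)))
# [(1250, 10), (1260, 10), (1270, 1), ..., (1275, 1)] -- 8 terms:
# the blocks 125x and 126x plus the six exact values 1270-1275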
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.

Basically, trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index can grow to get the performance benefits. To quote from the link you are referring to:
More importantly, it is not dependent on the index size, but instead the precision chosen.
and
the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed

Your best bet is to just look at the source code. Some things in Solr aren't well documented, and the fastest way to get a trustworthy answer is to simply look at the code. If you haven't been in the code yet, that will benefit you too, at least in the long run.
Here's a link to the TrieTokenizerFactory.
http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/analysis/TrieTokenizerFactory.java?format=ok
The javadoc in the class at least hints at the purpose of the precisionStep. You could dig further.
EDIT: I dug a bit further for you. It's passed directly to Lucene's NumericTokenStream class, which uses the value when parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between index size and speed.

Related

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not the output of the process.
You can do this, but it does take a bit of extra work in your Recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter rows to keep only those matching one chosen integer between 1 and 1,000, so your output will only have your randomized subset. Just have the last part of the recipe remove this temporary column.
So indeed there are two approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number in a range:
randbetween: returns a random integer within the range you give it.
rand: returns a random decimal between 0 and 1.
These methods can be used to slice an "even" portion of your data. As suggested, generate randbetween(1,1000), then filter to keep rows where the value equals one chosen x (1 <= x <= 1000), because that keeps 1/1000 of the data (a million out of a billion).
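Outside Dataprep, the same sampling idea in a short pandas sketch (the frame and column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(10_000)})  # stand-in for the big table

# Tag each row with a random integer 1-1000, keep one bucket, drop the tag:
df['bucket'] = np.random.randint(1, 1001, size=len(df))
sample = df[df['bucket'] == 1].drop(columns='bucket')
print(len(sample))  # roughly 1/1000 of the rows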
Alternatively, if you just want a million records in your output, but either don't want to rely on knowing the size of the entire table, or just want the first million rows regardless of how many rows there are, you can use the row-filtering methods that keep the top N rows or a range of rows.
P.S.
By using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the search transformation pane (accessed via Ctrl-K). By searching for "filter", you'll get most of the relevant options for your problem.
Cheers!

AWS Athena: Handling big numbers

I have files on S3 where two columns contain only positive integers which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports values in a range up to 2^63-1 (approx 10^19), so at the moment these columns are represented as strings.
When it comes to filtering, it is not that big of an issue, as I can use a regex. For example, if I want to get all records between 5e^21 and 6e^21, my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx 300M rows (approx 12GB in Parquet) and it takes about 7 seconds, so performance-wise it's OK.
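That regex works because every integer in the range [5x10^21, 6x10^21) is exactly 22 digits long and starts with 5. A quick Python check of the pattern (made-up values):

import re

pat = re.compile(r'^5[\d]{21}$')
assert pat.match('5' + '0' * 21)      # 5 x 10^21, the lower bound
assert pat.match('5' + '9' * 21)      # largest value below 6 x 10^21
assert not pat.match('6' + '0' * 21)  # 6 x 10^21 is excluded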
However, sometimes I would like to perform math operations on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach the problem of having numerical records that exceed the available range? Also, what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?
You can cast numbers of this form to an approximate 64-bit DOUBLE or an exact 128-bit DECIMAL. First you'll need to remove the caret ^ with the replace function; then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
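To see why a DOUBLE is unsafe for arithmetic at this magnitude while a DECIMAL stays exact, here's a quick Python illustration (Python's decimal module standing in for SQL DECIMAL):

from decimal import Decimal

big = 10**26
# A 64-bit double carries only ~15-16 significant digits, so nearby
# huge integers collapse to the same value and the difference is lost:
print(float(big + 1) - float(big))      # 0.0
# Decimal keeps every digit, so the subtraction is exact:
print(Decimal(big + 1) - Decimal(big))  # 1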

rethinkdb: group documents by price range

I want to group documents in RethinkDB by price range (0-100, 100-200, 200-300, and so on) instead of by a single price value. How do I do that?
Unfortunately, ReQL doesn't support rounding at the moment (see GitHub issue #866), but you can get something similar with some minor annoyances.
First of all, I would recommend making this an index on the given table if you're going to be running this regularly or on large data sets. The function I have here is not the most efficient because we can't round numbers, and an index would help mitigate that a lot.
These code samples are in Python, since I didn't see any particular language referenced. To create the index, run something like:
r.db('foo').table('bar').index_create('price_range',
    lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
                   .do(lambda x: x.sub(x.mod(100)))).run()
This will create a secondary index based on the price where 0 indicates [0-100), 100 is [100-200), and so on. At this point, a group-by is trivial:
r.db('foo').table('bar').group(index='price_range').run()
If you would really rather not create an index, the mapping can be done during the group in a single query:
r.db('foo').table('bar').group(
    lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
                   .do(lambda x: x.sub(x.mod(100)))).run()
This query is fairly straightforward, but to document what is going on:
coerce_to('STRING') - we obtain a string representation of the number, e.g. 318.12 becomes "318.12".
split('.') - we split the string on the decimal point, e.g. "318.12" becomes ["318", "12"]. If there is no decimal point, everything else should still work.
[0] - we take the first value of the split string, which is equivalent to the original number rounded down, e.g. "318".
coerce_to('NUMBER') - we convert the string back into an integer, which allows us to do modulo arithmetic on it so we can round, e.g. "318" becomes 318.
.do(lambda x: x.sub(x.mod(100))) - we round the resulting integer down to the nearest 100 by running (essentially) x = x - (x % 100), e.g. 318 becomes 300.
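As a sanity check outside ReQL, here's the same bucketing in plain Python (hypothetical prices, just to verify the arithmetic):

def price_bucket(price):
    # Truncate to an integer, then round down to the nearest 100,
    # mirroring the ReQL expression x - (x % 100).
    x = int(price)
    return x - (x % 100)

assert price_bucket(318.12) == 300
assert price_bucket(99.99) == 0
assert price_bucket(100.0) == 100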

MS SQL Float Decimal Comparison Problems

I'm in the process of normalising a database, and part of this involves converting a column from one table from a FLOAT to a DECIMAL(28,18). When I then try to join this converted column back to the source column, it returns no results in some cases.
It seems to be something to do with the way it's converted. For example, the FLOAT converted to a DECIMAL(28,18) produces:
51.051643260000006000
The original FLOAT is
51.05164326
I have tried various ways of modifying the FLOAT, and none of these work either:
CAST(51.05164326 AS DECIMAL(28,18)) = 51.051643260000000000
STR(51.05164326 , 28,18) = 51.0516432599999990
The reason for the conversion is to improve the accuracy of these fields.
Has anyone got a consistent strategy to convert these numbers and ensure subsequent joins work?
Thanks in advance
CM
For your application, consider how many decimal places you actually need. It looks like in reality you require about 8-14 decimal places, not 18.
One way to do the conversion is cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)).
To do a join between a decimal and float column, you can do something like this:
ON cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)) = decimalColumn
Provided the double-cast is the same double-cast used to create the decimalColumn, this will allow you to make use of an index on the decimalColumn.
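The noise in the question comes from FLOAT being binary: 51.05164326 has no exact base-2 representation, so widening it to 18 decimal places exposes the stored approximation. A quick Python sketch of the same effect and of the round-then-widen fix (Python's Decimal standing in for SQL DECIMAL):

from decimal import Decimal

f = 51.05164326
# The double actually stored is not exactly 51.05164326; converting it
# to Decimal exposes the binary approximation around the 15th decimal place:
print(Decimal(f))
# Rounding to 14 places first discards that noise, mirroring
# cast(cast(floatColumn as decimal(28,14)) as decimal(28,18)):
print(Decimal(f).quantize(Decimal('1.' + '0' * 14)))  # 51.05164326000000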
Alternatively you can use a range join:
ON floatColumn > decimalColumn - @epsilon AND floatColumn < decimalColumn + @epsilon
This should still make use of the index on decimalColumn.
However, it is unusual to join on decimals. Unless you actually need to join on them or need to do a direct equality comparison (as opposed to a range comparison), it may be better to simply do the conversion as you are, and document the fact that there is a small loss of accuracy due to the initial choice of an inappropriate data type.
For more information see:
Is it correct to compare two rounded floating point numbers using the == operator?
Dealing with accuracy problems in floating-point numbers

Filemaker: making queries of large data more efficient

OK, I have a Master Table of shipments, and a separate Charges table. There are millions of records in each, and it's come into FileMaker from a legacy system, so all the fields are defined as Text even though they may be Date, Number, etc.
There's a date field in the charges table. I want to create a number field to represent just the year. I can use the Middle function to parse the field and get just the year in a Calculation field. But wouldn't it be faster to have the year as a literal number field, especially since I'm going to be filtering and sorting? So how do I turn this calculation into its value? I've tried just changing the Calculation field to Number, but it just renders blanks.
There's something wrong with your calculation; it should not turn blank just because the field type is different. I.e.:
Middle("10-12-2010", 7, 4)
should suffice, provided the calc result is set to Number. You may also wrap it in GetAsNumber(...), but, really, there's no difference as long as the field type is right.
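For reference, here's the equivalent extraction outside FileMaker, assuming dates arrive as text in DD-MM-YYYY form as in the example above (a hypothetical Python helper, just to show the indexing):

def year_from_text_date(s):
    # Characters 7-10 hold the year, matching Middle(s, 7, 4);
    # Python slices are 0-based, so that's s[6:10].
    return int(s[6:10])

assert year_from_text_date("10-12-2010") == 2010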
If you have FM Advanced, try setting up your calc in the Data Viewer (Tools -> Data Viewer) rather than in Define Fields; this is faster, and once you like the result, you can transfer it into a field or do a replace. From a searching/sorting standpoint, though, there's no difference between a (stored) calculation and a regular field, so replacing is pointless and actually more dangerous, as there's no way to undo a wrong replace.
Here's what I was looking for, from http://help.filemaker.com/app/answers/detail/a_id/3366/~/converting-unstored-calculation-fields-to-store-data :
Basically, instead of using a Calculation field, you create an EMPTY Number, Date or Text field and use Replace Field Contents from the Records menu, and put your calculation (or reference, or both) there.
Not dissing FileMaker at all, but millions of records means FileMaker is probably the wrong choice here. Your system will be slow, slow, slow. FileMaker is great for workgroups and there is no way to develop a database app faster. But one thing FileMaker is not good at is handling huge numbers of records.
BTW, Mikhail Edoshin is exactly right.