A question about the pgcrypto documentation in PostgreSQL

The PostgreSQL documentation says:
F. s2k-mode
Which S2K algorithm to use. Values:
0 - Without salt. Dangerous!
1 - With salt but with fixed iteration count.
3 - Variable iteration count.
Default: 3
Applies to: pgp_sym_encrypt
F. s2k-count
The number of iterations of the S2K algorithm to use. It must be a
value between 1024 and 65011712, inclusive.
Default: A random value between 65536 and 253952
Applies to: pgp_sym_encrypt, only with s2k-mode=3
If I want to use:
pgp_sym_encrypt(data, psw, 'compress-algo=0,disable-mdc=1,s2k-mode=1,cipher-algo=aes256')
s2k-mode=1 is described as:
1 - With salt but with fixed iteration count.
Does anybody know exactly what iteration count is used in that mode?
I'm just curious, since the documentation doesn't state this clearly.


In postgres with a non-unique column storing a wide range of values, is it possible to index it and gain in performance?

I receive large amounts of data on a growing number of users who attempt a physical feat but then lose interest and leave. Each user is given a unique id. Each attempt is given a unique id. Data flows to me in the form of a table relating the users to attempts (rel_user_attempts). Note that the attempts arrive in batches, but not always chronologically.
id (pk)  archived  userid  attemptid (unique)
1        false     152     4001
2        false     152     4002
3        false     152     4003
4        false     19      4004
5        false     19      4005
6        false     19      4006
7        false     2409    3301
8        true      2409    3302
9        false     2409    3303
... etc
The most common search my analytics team will perform is by user (for example, user 19):
SELECT * FROM rel_user_attempts WHERE userid=19 AND archived=false;
In postgres with a non-unique column (userid) storing a wide range of values is it possible to index it and gain in performance?
The benefit of using any index, and whether Postgres will even choose to use a particular index, depends on several things, including the cardinality of the underlying data. Indices help the most on columns whose values are unique or nearly unique. You may find the following index helpful here:
CREATE INDEX idx ON rel_user_attempts (userid, archived);
Judging from your sample data, the cardinality of the userid column is moderate: each user has several rows, but there are many distinct users. Given that archived is a boolean column, its cardinality is low, and if true and false occur with roughly equal probability it does little to narrow the search on its own. We can still include it in the above index so that the index completely covers the WHERE clause of your query. Consider adding the above index and then checking the execution plan.
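To see what "checking the execution plan" looks like in practice, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for Postgres (so booleans are stored as 0/1 and the plan comes from EXPLAIN QUERY PLAN rather than Postgres's EXPLAIN), and it loads only the sample rows from the question. On a real Postgres table the planner may choose differently depending on table size and statistics.

```python
import sqlite3

# Assumption: SQLite (in-memory) is only a stand-in for Postgres here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rel_user_attempts ("
    "id INTEGER PRIMARY KEY, archived INTEGER, "
    "userid INTEGER, attemptid INTEGER UNIQUE)"
)

# Sample rows from the question; archived is stored as 0 (false) or 1 (true).
rows = [
    (1, 0, 152, 4001), (2, 0, 152, 4002), (3, 0, 152, 4003),
    (4, 0, 19, 4004), (5, 0, 19, 4005), (6, 0, 19, 4006),
    (7, 0, 2409, 3301), (8, 1, 2409, 3302), (9, 0, 2409, 3303),
]
conn.executemany("INSERT INTO rel_user_attempts VALUES (?, ?, ?, ?)", rows)

# The composite index suggested in the answer.
conn.execute("CREATE INDEX idx ON rel_user_attempts (userid, archived)")

query = "SELECT * FROM rel_user_attempts WHERE userid = 19 AND archived = 0"

# The fourth column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
matches = conn.execute(query).fetchall()

print(plan)     # the detail text should mention the index named idx
print(matches)  # the three non-archived attempts of user 19
```

In Postgres the equivalent check would be to prefix the query with EXPLAIN (or EXPLAIN ANALYZE) and look for an index scan on idx.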

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element), e.g. each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
.Q.ind on large days fails with the same error.
i.e. given that I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
'./2015.02.10/table/column2#: Cannot allocate memory
I should note that I am using the free 32-bit version.
I think this is just a combination of the memory limit of the free 32-bit version, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column, whether it is the column itself (in the non-nested case) or the nested-index column.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table contains only one more row than yesterday's, the memory requirement per column could double. Take a simple example:
In the free 32-bit version (Windows) you can create this many floats and it only uses ~1.07 GB of memory:
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
So even a small difference in row count between one day and the next can be very significant if you're near the boundary of an allocatable power of two.
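The arithmetic behind that boundary can be sketched in Python. This is only a model of power-of-two rounding: the 8 bytes per float is standard, but the 16-byte per-vector header is an assumption made here for illustration and not something stated above.

```python
def buddy_block_bytes(n_floats, item_size=8, header_bytes=16):
    """Round the bytes needed for a float vector up to the next power of two.

    header_bytes=16 is an assumed per-vector overhead, used for illustration.
    """
    needed = header_bytes + item_size * n_floats
    # (needed - 1).bit_length() is the exponent of the smallest power of
    # two that is greater than or equal to needed.
    return 1 << (needed - 1).bit_length()


GIB = 1 << 30

fits = buddy_block_bytes(134217726)     # exactly fills a 1 GiB block
too_big = buddy_block_bytes(134217727)  # one more float needs a 2 GiB block

print(fits // GIB, too_big // GIB)
```

Under these assumptions the first vector lands exactly on a 1 GiB block, while one extra float pushes the request to a 2 GiB block, which a 32-bit process cannot allocate.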
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)`:test/ set tab
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are index 3, index 5, index 10, etc.
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10
then I know that I need the first 10 floats from the col1# file and need to split them at index 3 and 5. Then I can read the col1# file partially and split it correctly
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
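For readers less familiar with q, the splitting logic of the hack above can be sketched in plain Python. The function name and the in-memory lists are hypothetical stand-ins for the two files; this does not read kdb files.

```python
def cut_rows(end_offsets, flat_values, n_rows):
    """Rebuild the first n_rows nested rows from a flat list of values.

    end_offsets[i] is the index in flat_values where row i ends (exclusive),
    mirroring the 3 5 10 12 offsets read from the nested-index file above.
    """
    ends = end_offsets[:n_rows]
    starts = [0] + ends[:-1]  # same idea as 0,-1_myrows in the q version
    # Only the first ends[-1] values are needed, as in the partial file read.
    needed = flat_values[:ends[-1]]
    return [needed[s:e] for s, e in zip(starts, ends)]


offsets = [3, 5, 10, 12]
flat = [float(v) for v in range(101, 113)]  # 101.0 .. 112.0

first_three = cut_rows(offsets, flat, 3)
print(first_three)
```

This reproduces the three rows shown in the q session: 101 102 103, then 104 105, then 106 through 110.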
Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to select map a date partition in, and then select within that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
So, following the comments, I guess the best way is to go through each partition a number of rows at a time with .Q.ind.
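The chunking itself is simple bookkeeping. As a rough sketch in Python (the helper name is hypothetical, and the row counts are made-up numbers for illustration), the row ranges to request one at a time could be generated like this:

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (start, end) row ranges, end exclusive, covering total_rows."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


# Hypothetical partition of 2.5 million rows, read a million rows at a time.
ranges = list(chunk_ranges(2_500_000, 1_000_000))
print(ranges)
```

Each range would then be turned into the list of row indices passed to .Q.ind for that partition, so that only one chunk of the nested column is held in memory at a time.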
Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be defragmented, as the instance had been running for almost a year.
Bouncing the instance solved the issue and released a lot of virtual memory.

rethinkdb: group documents by price range

I want to group documents in rethinkdb by price range (0-100, 100-200, 200-300, and so on), instead of a single price value. How do I do that?
Unfortunately, ReQL doesn't support rounding at the moment (see github issue #866), but you can get something similar through some minor annoyances.
First of all, I would recommend making this an index on the given table if you're going to be running this regularly or on large data sets. The function I have here is not the most efficient because we can't round numbers, and an index would help mitigate that a lot.
These code samples are in Python, since I didn't see any particular language referenced. To create the index, run something like:
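# 'products' and 'price_range' are placeholder names; the original table and index names are not given.
r.table('products').index_create('price_range',
lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
.do(lambda x: x.sub(x.mod(100)))).run()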
This will create a secondary index based on the price where 0 indicates [0-100), 100 is [100-200), and so on. At this point, a group-by is trivial:
If you would really rather not create an index, the mapping can be done during the group in a single query:
# 'products' is a placeholder table name.
r.table('products').group(
lambda row: row['price'].coerce_to('STRING').split('.')[0].coerce_to('NUMBER')
.do(lambda x: x.sub(x.mod(100)))).run()
This query is fairly straight-forward, but to document what is going on:
coerce_to('STRING') - we obtain a string representation of the number, e.g. 318.12 becomes "318.12".
split('.') - we split the string on the decimal point, e.g. "318.12" becomes ["318", "12"]. If there is no decimal point, everything else should still work.
[0] - we take the first value of the split string, which is equivalent to the original number rounded down, e.g. "318".
coerce_to('NUMBER') - we convert the string back into an integer, which allows us to do modulo arithmetic on it so we can round, e.g. "318" becomes 318.
.do(lambda x: x.sub(x.mod(100))) - we round the resulting integer down to the nearest 100 by running (essentially) x = x - (x % 100), e.g. 318 becomes 300.
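The same steps can be mirrored in plain Python to check the arithmetic without a database. This is only a sketch: the function names and sample documents are made up for illustration, and it assumes prices print in ordinary decimal notation (no exponent form).

```python
from collections import defaultdict


def price_bucket(price, width=100):
    """Lower bound of the price range, following the steps listed above."""
    whole = int(str(price).split(".")[0])  # coerce, split, [0], coerce back
    return whole - whole % width           # x - (x % 100)


def group_by_price_range(docs, width=100):
    """Group documents by the bucket of their 'price' field."""
    groups = defaultdict(list)
    for doc in docs:
        groups[price_bucket(doc["price"], width)].append(doc)
    return dict(groups)


# Hypothetical sample documents.
docs = [{"price": 318.12}, {"price": 99.99}, {"price": 100}, {"price": 250}]
grouped = group_by_price_range(docs)
print(sorted(grouped))
```

Here the key 0 stands for [0-100), 100 for [100-200), and so on, matching the index values described earlier.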

reasons behind decreasing of "last value" of a SEQUENCE automatically in postgresql?

I am using a sequence with start value = 1 and increment = 1. Suppose I insert 10 elements into the table, so that "last_value" goes to 10. If I then delete the 10th element, "last_value" still points to 10. My question is: is there any possibility that "last_value" may decrease to 9 after killing postgres, or after taking a dump and restoring the database (or in any other case)?
In my case it happened (I don't know how). Please provide possible reasons for this.
There are only 2 cases in which a sequence is "decreased":
The sequence reached its MAXVALUE and was reset to MINVALUE. This can be controlled with [ NO ] CYCLE in CREATE SEQUENCE.
The sequence was explicitly reset with setval().
There are no other ways to get the same value from a sequence twice.
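The two cases can be illustrated with a toy model in Python. This is purely a sketch of the behaviour described above, with made-up class and variable names; it says nothing about how PostgreSQL stores or logs sequences internally.

```python
class ToySequence:
    """Toy in-memory model of a sequence; not PostgreSQL's implementation."""

    def __init__(self, minvalue=1, maxvalue=2**63 - 1, cycle=False):
        self.minvalue = minvalue
        self.maxvalue = maxvalue
        self.cycle = cycle
        self.last_value = None  # nothing handed out yet

    def nextval(self):
        if self.last_value is None:
            self.last_value = self.minvalue
        elif self.last_value < self.maxvalue:
            self.last_value += 1
        elif self.cycle:
            self.last_value = self.minvalue  # case 1: wrap around at MAXVALUE
        else:
            raise OverflowError("sequence reached MAXVALUE and NO CYCLE is set")
        return self.last_value

    def setval(self, value):
        self.last_value = value  # case 2: explicit reset


# Case 1: a cycling sequence hands out 1 again after reaching its maximum.
wrapping = ToySequence(minvalue=1, maxvalue=3, cycle=True)
wrapped = [wrapping.nextval() for _ in range(4)]

# Case 2: after ten values, an explicit reset to 9 makes 10 come out twice.
plain = ToySequence()
first_ten = [plain.nextval() for _ in range(10)]
plain.setval(9)
repeated = plain.nextval()

print(wrapped, repeated)
```

Deleting rows from the table never appears in this model, which matches the point above: deletes do not touch the sequence.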
You can set the value of a sequence.
SELECT setval('public.sequence_name', 9, true);
However, if you need the values to be contiguous, I'm not sure that a sequence is the most appropriate way to go about that. Perhaps using rank()?

Solr date field tdate vs date?

So I have a question about Solr's date field types which is pretty straightforward: what's the difference between a 'date' field and a 'tdate' one?
The schema.xml claims that 'For faster range queries, consider the tdate type' and 'A Trie based date field for faster date range queries and date faceting.'
Fair enough... but what's the precisionStep="6" all about? Should I change this? Does it change the way I would create the query if I use tdate? What's the real advantage, or what does Solr do that makes it better?
P.S. I went through Google, the Solr manual, the Solr wiki and the Java docs without any luck, so I'd appreciate a kind and explanatory answer :)
Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens.
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.
A precision step of 4:
A precision step of 1:
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
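The base-ten tokenization described above can be sketched in a few lines of Python. This follows the simplified decimal model of this answer, not Lucene's actual bit-level encoding, and the function name is made up for illustration.

```python
def trie_tokens(value, step):
    """Decimal trie tokens for value, trimming `step` digits per extra token.

    Simplified base-ten model; real trie fields trim bits, not digits.
    """
    digits = str(value)
    return [
        digits[: len(digits) - trimmed] + "x" * trimmed
        for trimmed in range(0, len(digits), step)
    ]


for step in (2, 3, 4, 1):
    print(step, trie_tokens(12345678, step))
```

With a step of 2 this yields four tokens for 12345678, with a step of 4 only two, and with a step of 1 eight, which is the index-size trade-off mentioned above.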
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 26 entries (1250, 1251, 1252, ..., 1275) and combine the search results. With a trie field (and a precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 26 individual entries.
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.
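How a range query picks its entries can be sketched with a small greedy cover in Python. As before, this is a base-ten model with a precision step of one digit, written for illustration under that assumption; it is not Lucene's algorithm, which works on bits.

```python
def range_terms(lo, hi, base=10):
    """Cover the inclusive range [lo, hi] with as few trie terms as possible.

    Each term is the largest aligned block (1, 10, 100, ... values wide)
    that starts at the current position and still fits inside the range.
    """
    terms = []
    while lo <= hi:
        size = 1
        # Grow the block while it stays aligned and inside the range.
        while lo % (size * base) == 0 and lo + size * base - 1 <= hi:
            size *= base
        trimmed = len(str(size)) - 1  # number of digits replaced by x
        terms.append(str(lo // size) + "x" * trimmed)
        lo += size
    return terms


terms = range_terms(1250, 1275)
print(len(terms), terms)
```

For 1250 to 1275 this produces the eight entries listed above: 125x and 126x for the two full tens, then 1270 through 1275 individually.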
Basically trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index can grow to get the performance benefits. To quote from the link you are referring:
More importantly, it is not dependent on the index size, but instead the precision chosen.
the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed
Your best bet is to just look at the source code. Some of the things for Solr aren't well documented and the fastest way to get a trustworthy answer is to simply look at the code. If you haven't been in the code yet, that too is to your benefit. At least in the long run.
Here's a link to the TrieTokenizerFactory.
The javadoc in the class at least hints at the purpose of the precisionStep. You could dig further.
EDIT: I dug a bit further for you. It's passed off directly to Lucene's NumericTokenStream class, which uses the value while parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between index size and speed.