I wonder what the difference is between the query and command fields in mongostat output. The documentation just says that command is the number of commands, which doesn't explain much.
insert  query  update  delete  getmore  command  flushes  mapped  vsize  res   faults  locked db  idx miss %  qr|qw  ar|aw  netIn  netOut  conn  set       repl  time
15      161    72      *0      194      113|0    0        45g     90.8g  290m  6       Site:2.2%  0           0|0    0|0    66k    157k    105   sitename  PRI   11:25:48
From the MongoDB manual:
query The number of query operations per second.
update The number of update operations per second.
delete The number of delete operations per second.
getmore The number of get more (i.e. cursor batch) operations per
second.
command The number of commands per second. On slave and secondary
systems, mongostat presents two values separated by a pipe character
(e.g. |), in the form of local|replicated commands.
In the PostgreSQL log file we can find some statistics about how VACUUM works on a table.
Why does VACUUM need to write WAL when cleaning up a table?
pages: 0 removed, 1640 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 99960 remain, 0 are dead but not yet removable, oldest xmin: 825
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 8.878 MB/s, avg write rate: 0.000 MB/s
buffer usage: 48 hits, 1 misses, 0 dirtied
**WAL usage: 1 records**, 0 full page images, 229 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM modifies the table by deleting old row versions, freezing rows and truncating empty pages at the end of the table.
All modifications to a table are logged in WAL, and that applies to VACUUM as well. After all, you need to replay these activities after a crash.
When I run the query below to get the count of a table, the time taken and the memory used are almost the same on every run. / table t has 97029 records and 124 cols
Q.1. Does column i in the query below use the unique attribute internally to return the output in constant time using a hash function?
\ts select last i from t where date=.z.d-5 / 3j, 1313248j
/ time taken to run the query and memory used are always the same no matter how many times we run the same query
When I run the query below:
The first run needs a lot of time and memory, but subsequent runs need very little of either.
Q.2. Does kdb cache the output when the query is run for the first time and serve it from the cache on later runs?
Q.3. Is an attribute applied to column i while running the query below, and if so, which one?
\ts select count i from t where date=.z.d-5 / 1512j, 67292448j
\ts select count i from t where date=.z.d-5 / 0j, 2160j
When running the query below:
Q.4. Is any attribute applied to column i when running the query below?
\ts count select from t where date=.z.d-5 / 184j, 37292448j
/ time taken to run the query and memory used are always the same no matter how many times we run it
Q.5. Which of the queries above should be used to get the count of tables with a very high number of records? Is there any other query that gets the same result faster and with less memory?
There isn't a u# attribute applied to the i column. To see this, compare count distinct on a column that has u# with the same operation on i:
q)n:100000
q)t:([]a:`u#til n)
q)
q)\t:1000 select count distinct a from t
2
q)\t:1000 select count distinct i from t
536
The timing of those queries isn't constant; there just aren't enough significant figures to see the variability. Using
\ts:100 select last i from t where date=.z.d-5
will run the query 100 times and highlight that the timing is not constant.
The first query will request more memory be allocated to the q process, which will remain allocated to the process unless garbage collection is called (.Q.gc[]). The memory usage stats can be viewed with .Q.w[]. For example, in a new session:
q).Q.w[]
used| 542704
heap| 67108864
peak| 67108864
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
6569
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)b:0
q)
q).Q.w[]
used| 542768
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
q)
q)\t b: til 500000000
877
q)
q).Q.w[]
used| 4295510048
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 16827965440
syms| 1044
symw| 48993
Also, assuming the table in question is partitioned, the query shown will populate .Q.pn which can then be used to get the count afterwards, for example
q).Q.pn
quotes|
trades|
q)\ts select count i from quotes where date=2014.04.25
0 2656
q).Q.pn
quotes| 85204 100761 81724 88753 115685 125120 121458 97826 99577 82763
trades| ()
In more detail, .Q.ps does some of the select operations under the hood. If you look at the third line:
if[$[#c;0;(g:(. a)~,pf)|(. a)~,(#:;`i)];f:!a;j:dt[d]t;...
this checks the "a" (select) part of the query, and if it is
(#:;`i)
(which is count i) it ends up running .Q.dt, which runs .Q.cn, which gets the partition counts. So the first time it is running this, it runs .Q.cn, getting the count for all partitions. The next time .Q.cn is run, it can just look up the values in the dictionary .Q.pn which is a lot faster.
See above.
See above about attributes on i. count is a separate operation, not part of the query, and won't be affected by attributes on columns; it sees the table as a list.
For tables on disk, each column should contain a header where the count of the vector is available at very little expense:
q)`:q set til 123
`:q
q)read1 `:q
0xfe200700000000007b000000000000000000000000000000010000000000000002000000000..
q)9#read1 `:q
0xfe200700000000007b
q)`int$last 9#read1 `:q
123i
q)
q)`:q set til 124
`:q
q)9#read1 `:q
0xfe200700000000007c
q)`int$last 9#read1 `:q
124i
Still, reading any file usually takes ~1ms at least, so the counts are cached as mentioned above.
I am pulling down stock market data and inserting it into a PostgreSQL database. I have 500 stocks with 60 days of historical data. Each day has 390 trading minutes, and each minute is a row in the database table. The summary of the issue is that the first 20-50 minutes of each day are missing for each stock. Sometimes it's fewer than 50, but it is never more than 50. Every minute after that for each day is fine (EDIT: on further inspection there are missing minutes all over the place). The maximum matches the max number of concurrent goroutines (https://github.com/korovkin/limiter).
The hardware is set up in my home. I have a laptop that pulls the data, and an 8-year-old gaming computer that has been repurposed as a Postgres database server running Ubuntu. They are connected through a Netgear Nighthawk X6 router and communicate over the LAN.
The laptop is running a Go program that pulls the data down and performs concurrent inserts. I loop through the 60 days; for each day I loop through each stock, and for each stock I loop through each minute and insert it into the database via an INSERT statement. Inside the minute loop I use a library that limits the max number of goroutines.
I am fixing it by grabbing the data again and, for each stock, inserting until the Postgres server first responds that the entry is a duplicate and violates the unique constraint on the table, at which point I break out of the loop for that stock.
However, I'd like to know what happened, as I want to better understand how these problems can arise under load. Any ideas?
limit := NewConcurrencyLimiter(50)
for _, m := range ms {
	limit.Execute(func() {
		m.Insert()
	})
}
limit.Wait()
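The issue is that the closure refers to the loop variable m rather than to a copy of its value. Before Go 1.22 that variable is shared by every iteration, so a goroutine that runs after the loop has moved on calls Insert on a later element, and the element it was meant to insert is skipped. I needed to copy the value I wanted inserted inside the for loop, and change Insert from a method with a receiver to a function with input parameters: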
for i := range ms {
	value := ms[i] // fresh copy on every iteration, so each closure keeps its own element
	limit.Execute(func() {
		Insert(value)
	})
}
limit.Wait()
I'm using a simple Lua program to do a batch insert. The expire time of each item is set to 86400 seconds, so nothing should expire for a whole day.
Right now I have about 1,000,000 curr_items, but if I dump them with memcached-tool and grep for '^add', I get only 27235 items.
%> memcached-tool 127.0.0.1:11211 display
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
8 480B 4s 464 1012907 no 0 0 0
%> memcached-tool 127.0.0.1:11211 dump | grep ^add -c
Dumping memcache contents
Number of buckets: 1
Number of items : 1012907
Dumping bucket 8 - 1012907 total items
27235
%> memcached-tool 127.0.0.1:11211 stats | egrep '(curr|bytes)'
bytes 447704894
bytes_read 407765187
bytes_written 78574999
curr_connections 10
curr_items 1012907
limit_maxbytes 2147483648
I need this to estimate the memory my system may need, but now I'm not sure about the item count. Which one is correct?
Okay, there's a hard limit on cachedump output, which is 2 MB according to the mailing list.
That's what stopped the tool from dumping more keys, so curr_items is the count to trust, not the number of lines in the dump.
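For the memory estimate itself, a rough sketch is to divide the bytes counter by curr_items from the stats output and scale that to the expected item count. The function names and the target of 5,000,000 items below are assumptions for illustration only. The bytes counter covers stored items, so the memory actually reserved will be somewhat higher because items live in fixed-size slab chunks (480B in the display output above).

```go
package main

import "fmt"

// bytesPerItem returns the average number of bytes stored per item,
// computed from the "bytes" and "curr_items" stats counters.
func bytesPerItem(storedBytes, currItems uint64) float64 {
	if currItems == 0 {
		return 0
	}
	return float64(storedBytes) / float64(currItems)
}

// projectedBytes scales the per-item average to a target item count.
func projectedBytes(storedBytes, currItems, targetItems uint64) float64 {
	return bytesPerItem(storedBytes, currItems) * float64(targetItems)
}

func main() {
	// Counter values copied from the stats output in the question.
	const storedBytes, currItems = 447704894, 1012907
	// Hypothetical target; replace with the expected number of items.
	const targetItems = 5000000

	fmt.Printf("%.1f bytes per item\n", bytesPerItem(storedBytes, currItems))
	fmt.Printf("%.0f bytes for %d items\n",
		projectedBytes(storedBytes, currItems, targetItems), targetItems)
}
```

With the numbers from the question this works out to 442 bytes per item, which is consistent with every item landing in the 480B slab class.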
I have around 105 million records similar to this:
{
"post_id": 1314131221,
"date": 1309187001,
"upvotes": 2342
}
in MongoDB collection.
I also have an index on "post_id" and "date".
Then I need to do this:
db.fb_pages_fans.find({
    post_id: 1314131221,
    date: {"$gt": 1309117001, "$lt": 1309187001}
}).sort({date: 1});
If I set "date" to specific values:
when it returns 30 records, it takes ~130 ms
when it returns 90 records, it takes ~700 ms
when it returns 180 records, it takes ~1200 ms
Of course I'm talking about the first query; the second and later queries are very fast, but I need the first query to be fast.
From 90 records upward it is much slower than PostgreSQL, which I use now. Why is it so slow?
By the way, creating the index on the two fields mentioned took around 24 hours for 105 million records.
It runs on one machine with 12 GB of RAM. Here is the mongostat output from when I was executing the query:
insert  query  update  delete  getmore  command  flushes  mapped  vsize  res  faults  locked %  idx miss %  qr|qw  ar|aw  netIn  netOut  conn  time
0       0      0       0       0        1        0        23.9g   24.1g  8m   0       0         0           0|0    0|0    62b    1k      1     18:34:04
0       1      0       0       0        1        0        23.9g   24.1g  8m   21      0         0           0|0    0|0    215b   3k      1     18:34:05
If your first query is slow and all consecutive, similar queries are fast, then MongoDB is moving the queried data from disk to memory. That's relatively hard to avoid with data sets of that size. Use mongostat and check the faults statistic to see if you're getting page faults during your queries. Alternatively, it might be that your indexes do not fit into memory, in which case you can try to right-balance them so that the relevant, high-throughput parts are consistently in physical memory.
Also, are we talking a single physical database or a sharded setup?