We are getting conflicting reports on the maximum size of JSON blobs that DB2 will allow queries on (e.g. functions like JSON_VAL and JSON_TABLE).
There is evidence that it is limited to 16M, but I have found nothing conclusive. For example, here is a link to an IBM tech note regarding DB2 11 for z/OS. The Setup & Configuration section shows DB2’s definitions for SYSTOOLS.BSON2JSON, which seems to declare the BSON value as a 16M CLOB.
On the other hand, one source told me that larger sizes ARE allowed but it will not perform well due to DB2's inability to cache a value larger than 16M. If true, this would at least allow us to run BSON queries in our development environment, or for one-time data extraction.
Can anyone point to a more definitive answer?
IBM confirmed in response to a support request that there is a 16 MB limit, imposed by the JSON2BSON and BSON2JSON functions. They note that MongoDB has the same limit.
They implied that if you implemented your own versions of JSON2BSON and BSON2JSON in C or Java, you could get around the limit. But they have no plans to increase the limit themselves, presumably because the values would not be cacheable by DB2.
I have questions about InsertManyAsync vs. BulkWriteAsync in the MongoDB .NET driver (NuGet package below):
https://www.nuget.org/packages/MongoDB.Driver/2.12.3?_src=template
I want to export 300,000 rows of data (around 20 MB), convert them into JSON, and import them into MongoDB Atlas.
My questions:
1. Which operation, InsertManyAsync or BulkWriteAsync, is transactional, i.e. all or nothing?
2. What is the maximum number of rows or maximum size allowed for each operation?
The links below don't have the answers:
MongoDB C# driver 2.0 InsertManyAsync vs BulkWriteAsync
https://mongodb.github.io/mongo-csharp-driver/2.12/apidocs/html/M_MongoDB_Driver_IMongoCollection_1_BulkWrite.htm
https://mongodb.github.io/mongo-csharp-driver/2.12/apidocs/html/Overload_MongoDB_Driver_IMongoCollection_1_InsertManyAsync.htm
There is no difference between InsertManyAsync and BulkWriteAsync, other than that BulkWriteAsync can handle operations besides inserts. In fact, InsertManyAsync calls BulkWriteAsync internally.
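As a rough illustration (the collection and documents here are placeholders, not from the original question), an insert-only batch can be written either way and ends up doing the same server-side work:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class InsertExamples
{
    public static async Task InsertBothWaysAsync(
        IMongoCollection<BsonDocument> collection, IReadOnlyList<BsonDocument> docs)
    {
        // Option 1: InsertManyAsync with the ordered flag.
        await collection.InsertManyAsync(docs, new InsertManyOptions { IsOrdered = true });

        // Option 2: the equivalent BulkWriteAsync call - each document becomes an InsertOneModel.
        var models = docs.Select(d => new InsertOneModel<BsonDocument>(d));
        await collection.BulkWriteAsync(models, new BulkWriteOptions { IsOrdered = true });
    }
}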
Is the operation one transactional operation, i.e. all or nothing?
Not by themselves - see the MongoDB documentation on transactions if you need all-or-nothing behaviour. Also see the ordered/unordered (IsOrdered) option, which only controls whether the driver stops at the first error; writes that have already succeeded are not rolled back.
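If you do need all-or-nothing behaviour, a minimal sketch (assuming MongoDB 4.0+ on a replica set, which Atlas provides; the names here are placeholders) is to wrap the insert in a client session transaction:

using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class TransactionalInsert
{
    public static async Task InsertAllOrNothingAsync(
        IMongoClient client, IMongoCollection<BsonDocument> collection, IEnumerable<BsonDocument> docs)
    {
        using (var session = await client.StartSessionAsync())
        {
            session.StartTransaction();
            try
            {
                // The session overload ties the inserts to the transaction.
                await collection.InsertManyAsync(session, docs);
                await session.CommitTransactionAsync();
            }
            catch
            {
                // Nothing is kept if any insert fails.
                await session.AbortTransactionAsync();
                throw;
            }
        }
    }
}

Keep in mind that large multi-document transactions have their own size and time limits, so for a one-off 20 MB import a plain bulk insert plus a retry/cleanup strategy may be simpler.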
What is the maximum number of rows or maximum size allowed?
I don't think there is any limitation other than the restriction on document size, which is 16 MB. Note that bulk items can be merged into a single command before being sent to the server (in which case they must not exceed 16 MB in total), or they can be sent separately one by one. This logic depends on whether you use the IsOrdered option and whether all the items in your batch have the same bulk operation type.
I upgraded my Cloud SQL machine from a 'db-f1-micro' 0.6GB RAM machine to a 'db-n1-standard-1' 3.75GB RAM machine last week. Running:
SELECT @@innodb_buffer_pool_size;
The output is:
1375731712
which I believe is 1.38GB. Here's the memory utilization for the primary and replica:
This seems oddly low for this machine type, but based on my research (How to set innodb_buffer_pool_size in mysql in google cloud sql?) it doesn't appear that I can alter innodb_buffer_pool_size. Is it somehow set dynamically and slowly increasing over time? It doesn't appear to be anywhere near the 75-80% of RAM that Google seems to aim for on these instances.
What is the value of innodb_buffer_pool_chunk_size and innodb_buffer_pool_instances?
innodb_buffer_pool_size must always be equal to, or a multiple of, innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, and MySQL will automatically adjust it so that it is. For example, with the default chunk size of 128M and 8 buffer pool instances, the pool size has to be a multiple of 1G. The chunk size can only be modified at startup, as explained in the docs page for configuring the InnoDB buffer pool size.
For Google Cloud SQL in particular, not only the absolute but also the relative size of innodb_buffer_pool_size depends on the instance type. I work for GCP support, and after some research in our documentation I can tell you that the pool size is automatically configured based on an internal formula, which is subject to change. Improvements are being made to make instances more resilient against OOMs, and the buffer pool size plays an important role in this.
So it is expected behaviour that with your new instance type, and possibly different innodb_buffer_pool_chunk_size and innodb_buffer_pool_instances values, you see quite different memory usage. Currently the user does not have control over innodb_buffer_pool_size.
I recently upgraded a Postgres 9.6 instance to 11.1 on Google Cloud SQL. Since then I've begun to notice a large number of the following error across multiple queries:
org.postgresql.util.PSQLException: ERROR: could not resize shared
memory segment "/PostgreSQL.78044234" to 2097152 bytes: No space left
on device
From what I've read, this is probably due to changes that came in PG10, and the typical solution involves increasing the instance's shared memory. To my knowledge this isn't possible on Google Cloud SQL though. I've also tried adjusting work_mem with no positive effect.
This may not matter, but for completeness: the instance is configured with 30 GB of RAM, 120 GB of SSD storage and 8 CPUs. I'd assume that Google would provide an appropriate shared memory setting for those specs, but perhaps not? Any ideas?
UPDATE
Setting the database flag random_page_cost to 1 appears to have reduced the impact of the issue. This isn't a full solution though, so I would still love to get a proper fix if one is out there.
Credit goes to this blog post for the idea.
UPDATE 2
The original issue report was closed and a new internal issue, which isn't viewable by the public, was created. However, according to a GCP Account Manager's email reply, a fix was rolled out by Google on 8/11/2019.
This worked for me. I think Google needs to change a flag on how they start the Postgres container on their end, which is something we can't influence from inside Postgres.
https://www.postgresql.org/message-id/CAEepm%3D2wXSfmS601nUVCftJKRPF%3DPRX%2BDYZxMeT8M2WwLSanVQ%40mail.gmail.com
Bingo. Somehow your container tech is limiting shared memory. That
error is working as designed. You could figure out how to fix the
mount options, or you could disable parallelism with
max_parallel_workers_per_gather = 0.
show max_parallel_workers_per_gather;
-- 2
-- Run your query
-- Query fails
alter user ${MY_PROD_USER} set max_parallel_workers_per_gather=0;
-- Run query again -- query should work
alter user ${MY_PROD_USER} set max_parallel_workers_per_gather=2;
-- Run query again -- fails
You may consider increasing the tier of the instance; that will affect the memory, vCPU cores, and other resources available to your Cloud SQL instance. Check the available machine types.
In Google Cloud SQL for PostgreSQL it is also possible to change database flags that influence memory consumption:
max_connections: some memory resources can be allocated per-client, so the maximum number of clients suggests the maximum possible memory use
shared_buffers: determines how much memory is dedicated to PostgreSQL to use for caching data
autovacuum: should be on.
I recommend lowering these limits to reduce memory consumption.
We are speccing out a system that will index and store zillions of Syslog messages. These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
We generate 2 to 10 GB of these messages per day, and need to retain at least 30 days of them.
Splunk has a really great indexing and document compression system.
What to use?
I thought of mongodb, but it seems inappropriate for documents of this small size.
SQL Server is a possibility, but seems perhaps not super efficient for this purpose.
Text files with lucene?
-- The windows file system doesn't always like dirs with zillions of files
Suggestions?
Thanks!
I thought of mongodb, but it seems inappropriate for documents of this small size
There's a company called Boxed Ice that actually builds a server monitoring system using MongoDB. I would argue that it's definitely appropriate.
These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
From a MongoDB perspective, we would say that you are storing lots of small documents with a few attributes. In a case like this, MongoDB has several benefits:
It can handle changing attributes seamlessly.
It will flexibly handle different types.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
This is well within the range of data that MongoDB can handle. There are several different methods of handling the 30-day retention period; which one you choose will depend on your reporting needs. I would poke around on the groups for ideas here.
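As one sketch of a retention mechanism (using the C# driver as an example; the collection and the "timestamp" field are assumptions, and the field must hold a BSON date), a TTL index lets MongoDB expire documents automatically:

using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class RetentionSetup
{
    public static Task CreateTtlIndexAsync(IMongoCollection<BsonDocument> logs)
    {
        // Documents are removed roughly 30 days after the date stored in "timestamp".
        var keys = Builders<BsonDocument>.IndexKeys.Ascending("timestamp");
        var options = new CreateIndexOptions { ExpireAfter = TimeSpan.FromDays(30) };
        return logs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(keys, options));
    }
}

Capped collections are another option if you prefer a size-based rather than time-based cutoff.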
Based on the people I've worked with, this type of insert-heavy logging is one of the places where Mongo tends to be a very good fit.
Graylog2 is an open-source log management tool that is built on top of MongoDB. I believe Loggy, a logging-as-a-service provider, also uses MongoDB as their backend store. So there are quite a few products using MongoDB for logging.
It should be possible to store the n-grams returned by a Lucene analyzer for better text searching. I'm not sure about the feasibility, though, given the large number of documents. What is the primary reporting use case?
It seems that you want something like a MongoDB full-text search server, which will enable you to search on different attributes without losing performance. You may try MongoLantern: http://sourceforge.net/projects/mongolantern/. Though it's still in alpha, it has given very good results for me on 5M records.
Let me know whether this serves your purpose.
I would strongly consider using something like Lucene or Solr.
Lucene is built specifically for full-text search and provides a ton of additional features that you may find useful in your application. As a bonus, Solr is dead simple to set up and configure. (And it's super fast for searching.)
They do not keep a file per entry, so you shouldn't have to worry much about zillions of files.
None of the free database options specializes in full-text search - don't try to force them to do what you want.
I think you should deploy your own (intranet-wide) stack of Grafana, Logstash + ElasticSearch.
Once set up, you have a flexible schema, retention, and a wonderful UI for your data with Grafana.
What scenario makes more sense - host several EC2 instances with MongoDB installed, or much rather use the Amazon SimpleDB webservice?
When having several EC2 instances with MongoDB I have the problem of setting the instance up by myself.
When using SimpleDB I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layer to write to either MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations: you can only scale by sharding (and the sharding is manual), it has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than the other options.
If you need richer query options, have a high read rate and don't have that much data, MongoDB is better. But for durability you need to use at least two MongoDB server instances as master/slave; otherwise you can lose the last minute of your data. Scaling is manual, though auto-sharding is implemented in version 1.6. It's much faster than SimpleDB.
Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as Mongo and faster at larger data sizes. Writes are faster than reads in Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify the config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet; there is no need to shard your data, as it was designed to be distributed from day one. You can keep any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and redistribute the dead servers' data to the others. It's highly fault tolerant. You can add any number of instances, so it's much easier to scale than the other options. It has strong .NET and Java client options, with connection pooling, load balancing, marking of dead servers, and so on.
Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use a relational database or SimpleDB you may also need a data cache (e.g. memcached).
For web apps, if your data is small I recommend Mongo; if it is large, Cassandra is better. You don't need a caching layer with Mongo or Cassandra; they are already fast. I don't recommend SimpleDB; as you said, it also locks you in to Amazon.
If you are using C#, Java or Scala, you can write an interface and implement it for Mongo, MySQL, Cassandra or anything else as your data access layer (it's even simpler in dynamic languages such as Ruby, Python or PHP). You can write a provider for two of them and change the storage with just a configuration change, maybe even at runtime. Development with Mongo, Cassandra and SimpleDB is easier than with a traditional database, and they are schema-free; a lot also depends on the client library/connector you're using. The simplest one is Mongo. Cassandra has only one index per table, so you have to manage other indexes yourself, although as far as I know secondary indexes will be possible with the 0.7 release of Cassandra. You can also start with any of them and replace it later if you have to.
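As a rough sketch of what such an interface might look like in C# (all names here are hypothetical, not tied to any particular library):

using System.Collections.Generic;
using System.Threading.Tasks;

// Storage-agnostic data access interface; implement it once per store.
public interface IUserStore
{
    Task SaveAsync(User user);
    Task<User> FindByIdAsync(string id);
    Task<IReadOnlyList<User>> FindByCountryAsync(string country);
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Country { get; set; }
}

// You would then write MongoUserStore, SimpleDbUserStore, CassandraUserStore, etc.,
// and choose the implementation through configuration (for example an IoC container binding).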
I think you have a question of both time and speed here.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / set up server instances for all of them and figure out how they work.
On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight, here's what you'll find (based on testing I've personally been involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, and for that you need to run a Hadoop layer. This was a pain to get configured and a bigger pain to make performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now, 5 years later, it is not hard to set up Mongo on any OS. The documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers have addressed the question of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and runs only on Amazon AWS. You also have to pay for a whole bunch of stuff with S.
S has a whole bunch of strange limitations; M's limitations are way more reasonable. The strangest ones are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (though it is straightforward to learn the basics)
with M you still have to learn how to architect your database. Many people think that schemaless means you can throw any junk into the database and extract it with ease, so they might be surprised that the "junk in, junk out" maxim still applies. I assume the same is true of S, but I cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex (ugly, and it cannot use an index) to work around this limitation without introducing an additional lowercase field or extra application logic; see the sketch after this list.
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely: if 5 seconds pass and the query has not finished, you end up with a partial count and a token which allows you to continue the query. The application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers and dates) in S. M's type support is way richer.
both do not have transactions and joins
M supports compression, which is really helpful for NoSQL stores, where the same field names are stored over and over again.
S supports just a single index; M has single-field, compound, multi-key, geospatial indexes, etc.
both support replication and sharding
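On the case-insensitivity point above, this is roughly what the regex workaround looks like with the C# driver (a sketch; the collection and the "name" field are made up), keeping in mind that it generally cannot use an index efficiently:

using System.Text.RegularExpressions;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class CaseInsensitiveSearch
{
    public static Task<BsonDocument> FindByNameAsync(IMongoCollection<BsonDocument> users, string name)
    {
        // "i" makes the match case-insensitive; anchoring with ^...$ keeps it an exact-value match.
        var pattern = "^" + Regex.Escape(name) + "$";
        var filter = Builders<BsonDocument>.Filter.Regex("name", new BsonRegularExpression(pattern, "i"));
        return users.Find(filter).FirstOrDefaultAsync();
    }
}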
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like GROUP BY, SUM, AVG and DISTINCT, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. Mongo, on the other hand, supports a rich query language.
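For instance, a group-by with a sum - something SimpleDB cannot express - is only a few lines with Mongo's aggregation framework (a sketch using the C# driver; the collection and field names are made up):

using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class AggregationExample
{
    // Roughly: SELECT country, SUM(amount) AS total, COUNT(*) AS count FROM orders GROUP BY country
    public static Task<List<BsonDocument>> TotalsPerCountryAsync(IMongoCollection<BsonDocument> orders)
    {
        var pipeline = new[]
        {
            new BsonDocument("$group", new BsonDocument
            {
                { "_id", "$country" },
                { "total", new BsonDocument("$sum", "$amount") },
                { "count", new BsonDocument("$sum", 1) }
            })
        };
        return orders.Aggregate<BsonDocument>(pipeline).ToListAsync();
    }
}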