I'm trying to compose some rather large (~5-10GB) objects on cloud storage, and the compose-object limit seems arbitrarily low: only 32 components. So I'm currently using 350MB chunks. Why is the limit so low?
I read somewhere that an iterative compose doesn't work. Is that correct? I.e., if I wanted to compose 64 objects into two, then those two into one, would that not work?
A single request can only compose 32 components, but you can compose composed objects, so the iterative approach does work. The overall limit is 1024 components per composite object:
https://developers.google.com/storage/docs/composite-objects#_Compose
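To make the iterative approach concrete, here is a minimal sketch using the google-cloud-storage Python client (bucket and object names are made up for illustration). Keep in mind that component counts accumulate: a composite built from composites still counts all of its leaf components toward the 1024 limit.

from google.cloud import storage

MAX_COMPONENTS = 32  # per-request compose limit

def compose_in_batches(bucket, chunk_names, final_name):
    # Compose any number of chunks by composing at most 32 at a time.
    blobs = [bucket.blob(name) for name in chunk_names]
    round_no = 0
    while len(blobs) > 1:
        next_round = []
        for i in range(0, len(blobs), MAX_COMPONENTS):
            batch = blobs[i:i + MAX_COMPONENTS]
            target = bucket.blob("%s.tmp-%d-%d" % (final_name, round_no, i))
            target.compose(batch)  # one compose request, <= 32 components
            next_round.append(target)
        blobs = next_round
        round_no += 1
    bucket.rename_blob(blobs[0], final_name)  # promote the last intermediate

client = storage.Client()
compose_in_batches(client.bucket("my-bucket"),
                   ["chunk-%d" % i for i in range(64)],
                   "big-object")

With 64 chunks this issues three compose requests (32 + 32, then 2), and the temporary intermediates can be deleted afterwards.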
I'm reading "Seven Databases in Seven Weeks". Could you please explain the following passage to me:
One downside of a distributed system can be the lack of a single coherent filesystem. Say you operate a website where users can upload images of themselves. If you run several web servers on several different nodes, you must manually replicate the uploaded image to each web server's disk or create some alternative central system. Mongo handles this scenario by its own distributed filesystem called GridFS.
Why do you need to manually replicate uploaded images? Do they mean that some of the servers will run Linux and some of them Windows?
Do all replicated data stores tend to implement their own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images on the local machine under /home/server/app/uploads, and the users use the application. So far, so good.
Now your application becomes the next big thing: you have tens of thousands of concurrent users, and your single server simply cannot handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that lets you easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – a bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which will sooner or later run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application, and this is where GridFS comes to help. It uses MongoDB's rather simple means of distributing data across many machines, a process called sharding. Basically it works like this: instead of accessing the local filesystem, you access GridFS (and the files contained within) via the MongoDB database driver.
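To make that concrete, here is a minimal sketch of storing and retrieving an upload through GridFS with PyMongo (the connection string, database name, and file names are made up for illustration):

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["app"])  # backed by the fs.files / fs.chunks collections

# Any application instance pointed at the same MongoDB can store a file...
with open("avatar.png", "rb") as f:
    file_id = fs.put(f, filename="uploads/avatar.png")

# ...and any other instance can read it back by its id:
data = fs.get(file_id).read()

The point is that the upload no longer lives on one web server's local disk; every instance that can reach MongoDB sees the same files.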
As for the OS: usually I would avoid mixing different OSes within a deployment, if at all possible. Nowadays there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text; this only refers to the fact that multiple machines are involved. They can perfectly well all run the same OS.
Sharding vs. replication
Note: the following is vastly simplified, because going into detail would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Let's first make the two concepts involved a bit clearer.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: each machine holds only a subset of the data, although you can access the whole data set (files, in this case) transparently, and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file).
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: the data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an odd number of MongoDB instances that all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the number of non-hidden members per replica set.
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB 3.2, they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via the mongos sharded-cluster query router, of which you usually have one per instance of your application (and most often on the same server as your application instance). But most drivers can be given multiple mongos instances to connect to, so if one of those mongos instances fails, the driver will happily use the next one you configured (see the sketch below).
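For illustration, a hedged sketch of handing PyMongo several mongos addresses (the hostnames are made up); the driver treats them as a seed list and fails over between them:

from pymongo import MongoClient

# Two mongos routers; if one goes down, the driver uses the other.
client = MongoClient("mongodb://mongos-a:27017,mongos-b:27017/app")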
Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
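As a rough sketch of what enabling that looks like from the driver side (run against a mongos; the database name "app" is an assumption, and {files_id: 1, n: 1} is the shard key the MongoDB docs suggest for GridFS chunks):

from pymongo import MongoClient

admin = MongoClient("mongodb://mongos-a:27017").admin
admin.command("enableSharding", "app")
admin.command("shardCollection", "app.fs.chunks",
              key={"files_id": 1, "n": 1})

After that, MongoDB balances the chunk documents across the shards on its own.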
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup is only feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data-bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data stores tend to implement their own filesystem?
Replicated? Not necessarily. There is DRBD, for example, which could be described as "RAID1 over Ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case, IMHO, the author was stating that each web server has its own disk storage, not shared with the others. Given that, the upload path could be /home/server/app/uploads, and since it is part of the server's local filesystem, it is not shared at all. To populate the other servers, we would need a script or job which syncs the data to the other machines behind the scenes.
This scenario could be a case for using GridFS with Mongo.
How GridFS works:
GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
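You can see this chunking for yourself by peeking at the two underlying collections (a PyMongo sketch; host and database name assumed):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["app"]
meta = db["fs.files"].find_one()             # one metadata document per file
print(meta["length"], meta["chunkSize"])     # chunkSize defaults to 255 * 1024
for chunk in db["fs.chunks"].find({"files_id": meta["_id"]}).sort("n"):
    print(chunk["n"], len(chunk["data"]))    # ordered chunks, <= 255 kB each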
In reply to a comment:
BSON is a binary format, and MongoDB has a special replication mechanism for replicating collection data (GridFS is just a special pair of collections). It uses the oplog to send diffs to the other servers in the replica set.
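If you are curious, you can peek at the oplog on a replica-set member (a PyMongo sketch; host assumed):

from pymongo import MongoClient

local = MongoClient("mongodb://localhost:27017")["local"]
for op in local["oplog.rs"].find().sort("$natural", -1).limit(3):
    print(op["op"], op["ns"])  # operation type and namespace of recent writes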
Any comments welcome!
A part of the application I am working on allows a user to build a query (visually) to export data from the system. Currently I am building an Elasticsearch query based on the user input and writing a web service that will collect all the results of that query (after applying some basic filters).
After doing some basic benchmarking on the production Elasticsearch cluster, I am starting to doubt whether Elasticsearch is the right tool for this. It takes about 22 minutes to export 1 million contacts (from an index of 11 million). There are 3 nodes in the cluster, each with 4 cores, 16 GB RAM (heap size 8 GB), and EBS storage (I know this is not the most efficient storage for ES).
Is Elasticsearch really not well suited for this kind of large-volume (and frequent) export? In my setup Elasticsearch follows Mongo (via the transporter plugin), so it's possible to get the data out of Mongo as well. Would Mongo be a better option? I am a bit skeptical about using Mongo, considering its not-so-good memory management and the potential for polluting the working set when running exports.
Currently I am pulling data from Elasticsearch over HTTP REST using scroll (and the scan API). It's possible that I might get more throughput using Elasticsearch's Java node client or a plugin like knapsack (https://github.com/jprante/elasticsearch-knapsack), although with the knapsack plugin I lose the ability to record timing (and other book-keeping) for each export.
Any suggestions would be highly appreciated.
You can experiment with the scroll and size parameters. It depends on the size of your documents, but you could try something like 30s for scroll and 1000 for size.
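For example, with the official Python client a scrolled export looks roughly like this (index name and query are made up; process() stands in for your own export logic):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

hits = helpers.scan(
    es,
    index="contacts",
    query={"query": {"match_all": {}}},
    scroll="30s",   # how long to keep the scroll context alive between batches
    size=1000,      # documents fetched per shard per request
)
for doc in hits:
    process(doc["_source"])  # hypothetical export step, e.g. write to a file

Raising size reduces round trips but increases per-request memory; it's worth benchmarking a few values.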
You are retrieving docs at a rate of roughly 760 docs/s (1,000,000 docs in 22 minutes ≈ 1,320 seconds), which is not great, but as you say, EBS is not fast. Did you try to calculate what the current throughput is in bytes per second and compare it to the EBS bandwidth?
A Java client could be a bit faster but I would guess the overhead is somewhere else in this case.
In our project we have two distinct classes of data that we want to store in a Redshift cluster. One class requires a lot of space (terabytes), but we only run ad-hoc queries against it, so performance is not a big issue. The other class is much smaller (tens of gigabytes), but we need to run a significantly larger number of queries against it, so performance is an important factor. Given that distinction, the first class of data is best suited for HDD-backed nodes, while the second class is better on SSD-powered, more performance-oriented nodes. We could potentially separate this data into two different clusters, but I wonder if it is possible to mix nodes of different types in the same cluster and somehow control which tables reside on which nodes.
No, you cannot mix two types of nodes in the same cluster as of now.
I have the same scenario as you. We also have two clusters: one is SSD-backed with less space, used for the many queries that need to be fast, and the other is HDD-backed with terabytes of data, used for historical reports where query performance is not an issue.
What is the largest known Neo4j cluster (in db size, graph stats, or # of machines)?
The number of nodes and relationships was recently (with the 1.3 release) expanded to 32 billion each, plus another 64 billion for properties. If you look at the mailing list, there have been recent inquiries about quite large data stores.
As an approach to an answer, you might want to check out this interview with Emil Eifrem (Neo's founder): http://www.infoq.com/interviews/eifrem-graphdbs. In particular, check out the part on "From a data complexity perspective, how does Neo4j help remove some of the implementation complexity in storing your data?": "hundreds of millions is probably a large one. And billions, that's definitely a large one."
I was recently in conversation with Neo Technologies, in which they shared that the largest installations they know of, machine-wise, have no more than 3-5 machines.
Also, they said that the size of graph Neo4j can handle efficiently depends on the number of nodes and edges in the graph: if they can all be kept in memory, most queries will be fast. You can find the in-memory sizes for nodes and edges at http://wiki.neo4j.org/content/Configuration_Settings (it's 9 bytes per node and 33 bytes per relationship).
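For a rough feel for those numbers: a graph with 1 billion nodes and 5 billion relationships would need about 1e9 × 9 B + 5e9 × 33 B ≈ 174 GB just for the node and relationship stores (illustrative figures, ignoring properties and other overhead), so keeping a graph of that size fully in memory already requires a very large box or several machines.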
We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine, amounting to about 100 bytes of data. Then I read it back by row key as fast as I could. I can read it back at 160,000 reads/second. Great.
Then I put in a million similar records, all with keys of the form X.Y where X in (1..10) and Y in (1..100,000), and queried for random records. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec).
Finally I put in ten million records, with keys from 1.1 up through 10.1000000, and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second, and my disk is thrashing like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then, as the records get cached, speeds right up to 20,000 queries per second, and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10 million records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6 GB of RAM, so I don't think it's the machine.
Here's my code to fetch records, which I'm spawning into 8 threads, each asking for one value from one column via row key:
// Thrift API (Cassandra 0.6 era): read a single column by row key.
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
// Random key of the form X.Y with X in 1..10 and Y in 1..1,000,000.
// Note: Next(9) yields 0..8, so the original code only ever hit X in 1..9;
// Next(10) is needed to cover the full 1..10 range described above.
string key = (1 + sRand.Next(10)) + "." + (1 + sRand.Next(1000000));
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);
Thanks for any insights
Purely random reads are about the worst-case behavior for the caching that your OS (and Cassandra, if you set up the key or row cache) tries to do.
If you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads with some keys hotter than others. That is more representative of most real-world workloads.
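py_stress itself is Python; a tiny sketch of the idea it implements, drawing keys from a Gaussian rather than a uniform distribution so some keys are hotter than others (all numbers made up):

import random

NUM_KEYS = 10_000_000
STDEV = 0.1 * NUM_KEYS  # smaller stdev => hotter hot spot

def next_key():
    # Cluster reads around the middle of the keyspace.
    i = int(random.gauss(NUM_KEYS / 2, STDEV))
    return str(min(max(i, 0), NUM_KEYS - 1))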
Add more Cassandra nodes and give them lots of memory (-Xms/-Xmx). The more Cassandra instances you have, the more the data will be partitioned across the nodes, making it much more likely to be in memory or easily accessible from disk. You'll be very limited trying to scale with a single workstation-class CPU. Also, check the default -Xms/-Xmx setting; I think the default is 1GB.
It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially if you are doing random reads.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.
VoltDB can certainly handle this level of read performance as well as writes, and it operates using a cluster of servers. As an in-memory solution, you need to build a large enough cluster to hold all of your data in RAM.