storage engine: how to quickly find that a key does not exist - memcached

Our distributed storage project uses LevelDB as the storage engine and memcached as the cache layer. We have one scenario: 95% of queries are for keys that do not exist in the storage engine.
In the memcached layer, if the key can't be found, we query LevelDB.
In LevelDB, we use the default bloom filter to figure out whether a key exists, but it still has a 1% false positive rate. For that 1% we have to look the value up through disk IO, which the client cannot tolerate. (95% of keys do not exist.)
Is there any better way to know that a key does not exist?
Update:
1. Keys are generated every day (userid+date); once a key cannot be found, the client puts the value into the storage layer.
2. The client wants read latency (TP99) < x ms (the client is latency sensitive).

I think there are two methods that can be used to improve your solution:
1. Assuming that all the keys that may be requested come from a limited set, you could put every key from that set into the store, giving the ones that do not exist a sentinel value like "FALSE".
2. Improve your LevelDB performance: adjust the table-cache and block sizes, or use an SSD as the storage medium.
We use LevelDB as persistent KV storage in a production environment, supporting applications like blacklists, which are similar to your scenario.
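For approach 1, here is a minimal sketch of the lookup path using the spymemcached client and a sentinel value for keys that are known not to exist (the LevelDbStore interface, the address and the TTL are placeholders for your own code and tuning):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class NegativeCacheLookup {
    /** Hypothetical wrapper around the LevelDB engine; returns null when the key is absent. */
    interface LevelDbStore {
        String get(String key);
    }

    private static final String NOT_FOUND = "FALSE";   // sentinel for keys that do not exist

    private final MemcachedClient cache;
    private final LevelDbStore store;

    public NegativeCacheLookup(LevelDbStore store) throws Exception {
        this.cache = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        this.store = store;
    }

    /** Returns the value, or null when the key is known not to exist. */
    public String get(String key) {
        Object cached = cache.get(key);
        if (NOT_FOUND.equals(cached)) {
            return null;                                // answered from memory, no LevelDB IO
        }
        if (cached != null) {
            return (String) cached;
        }
        String value = store.get(key);                  // slow path: LevelDB lookup
        // Cache misses as well, so repeated requests for the same nonexistent key stay in memcached.
        cache.set(key, 86400, value != null ? value : NOT_FOUND);
        return value;
    }
}

The important part is that a miss is cached too, so a bloom-filter false positive costs at most one LevelDB read per key per TTL window.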

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (on GCP) for storing a large set of text files of different sizes (1 KB~100 MB) about different subjects. One fileset could be more than 10 GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP and, since the pre-processed data are saved separately, they will not be accessed frequently after pre-processing. So saving all the files in MongoDB seems unnecessary.
I'm considering:
saving all files into some cloud storage,
saving file information like name and path in MongoDB as JSON.
The above folders turn into:
{
  "name": "dataset_about_some_subject",
  "path": "path_to_cloud_storage",
  "files": [
    {
      "name": "file1.txt",
      ...
    },
    ...
  ]
}
When a fileset is needed, search for its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum document size of 16 MB, which is below the maximum size of your files (100 MB). This means that, unless you split the files, storing the originals as plain JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure is common, not very expensive, and does not require a lot of maintenance compared to having your own HDFS cluster. I/O will be best if you perform the computations on machines of the same provider and make sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
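A minimal sketch of that layout with the MongoDB Java driver (the database, collection and field names are just examples):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class FileMetadataStore {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> files = client.getDatabase("nlp").getCollection("files");

            // One small document per file instead of one huge array per dataset.
            files.insertOne(new Document("dataset", "dataset_about_some_subject")
                    .append("name", "file1.txt")
                    .append("path", "gs://my-bucket/dataset_about_some_subject/file1.txt"));

            // Index for fast lookup by dataset (and optionally file name).
            files.createIndex(Indexes.ascending("dataset", "name"));

            // Fetch the metadata of every file in one dataset.
            for (Document d : files.find(Filters.eq("dataset", "dataset_about_some_subject"))) {
                System.out.println(d.getString("path"));
            }
        }
    }
}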
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend storing the large files using one of the storage services below. They also support multi-regional access through a CDN to ensure fast file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that speed will not be a problem for the metadata storage you propose in MongoDB.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), data transfer over the public network for every access (might be a performance problem)
MongoDB GridFS: already in place, operating cost varies, data transfer is just as fast as from MongoDB itself
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer in the local network (provided you build it on-premises). Specialized administration skills needed. Possibility to use the cluster for parallel calculations (i.e. this is not only storage, this is also computing power). (As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.)
If you are not sure about the amount of data you will cover and just want to get started, I recommend starting out with GridFS, but encapsulate it in a way that lets you easily swap out the storage backend, as sketched below.
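A rough sketch of such an encapsulation, assuming a hypothetical FileStorage interface and using the GridFSBucket API from the MongoDB Java driver:

import java.io.InputStream;
import java.io.OutputStream;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;

/** Hypothetical abstraction so the storage backend can be swapped later. */
interface FileStorage {
    void put(String name, InputStream data);
    void get(String name, OutputStream target);
}

/** GridFS-backed implementation; a cloud-storage implementation could replace it later. */
class GridFsStorage implements FileStorage {
    private final GridFSBucket bucket;

    GridFsStorage(MongoDatabase db) {
        this.bucket = GridFSBuckets.create(db);
    }

    @Override
    public void put(String name, InputStream data) {
        bucket.uploadFromStream(name, data);
    }

    @Override
    public void get(String name, OutputStream target) {
        bucket.downloadToStream(name, target);
    }
}

Calling code only ever sees FileStorage, so moving from GridFS to a cloud bucket later means writing one new class, not touching the application.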
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
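A minimal sketch of that idea with Spark's Java API, assuming a simplified plain-text metadata file with one file path per line (the file name and the word-count step are placeholders for real metadata and NLP work):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalParallelProcessing {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("nlp-local").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // One path per line; Spark splits the lines across the local cores.
            JavaRDD<String> paths = sc.textFile("metadata.txt");
            JavaRDD<Long> tokenCounts = paths.map(path -> {
                String text = new String(Files.readAllBytes(Paths.get(path)));
                return (long) text.split("\\s+").length;   // stand-in for real NLP processing
            });
            System.out.println("total tokens: " + tokenCounts.reduce(Long::sum));
        }
    }
}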
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

how to manage user sessions in distributed databases?

Since I have zero experience in developing web applications that can be scaled horizontally, I need someone with experience to guide me in the right direction.
I had difficulty figuring out the right way to store login sessions in a database, which led me to the question: is it even right to store them in a database when I am planning to use replication in the future? And if not, what are the alternatives?
I need different clients (Android, Windows, ...) to be connected to the server with their own sessions for the same user, and I am using:
1 - CentOS as the OS
2 - PostgreSQL as the DBMS
3 - TomEE as the HTTP server and Servlet container
4 - Partitioned tables (inherited tables in PostgreSQL) to improve performance, increase the chance of in-memory index scans, prevent fragmentation, etc.
My problem arises from the fact that I need to check session availability on every request received from clients (every session has its own encryption keys), and it is possible to have millions of sessions. In a distributed environment I cannot be sure that a newly created session will be available in the replicated database at the right time.
Thanks for helping
Storing user sessions in an RDBMS will eventually decrease the performance of your application. You should take a look at distributed caching mechanisms for storing and reading user sessions. I strongly recommend a NoSQL solution for user sessions, such as Redis, which gives you in-memory storage of your key-value pairs and responds at very high speed. In addition, you can tune the configuration file to persist the in-memory key-value pairs to disk. You also need to think about how your key-value pairs are distributed across multiple instances and about internal data structures such as hash maps. You can use your own hashing algorithm to distribute the key-value pairs over multiple instances, which will give you high availability. Keep in mind that sharding your data horizontally brings its own issues; see the CAP theorem (availability and partition tolerance) for details.
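A minimal sketch of that pattern with the Jedis client, storing each session under its own key with a TTL (the key naming, values and TTL are just examples):

import redis.clients.jedis.Jedis;

public class SessionStore {
    private static final int SESSION_TTL_SECONDS = 30 * 60;   // expire idle sessions after 30 minutes

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String sessionId = "a1b2c3";   // illustrative; normally a random token per client

            // Create the session: the value could be a serialized key set or user id.
            jedis.setex("session:" + sessionId, SESSION_TTL_SECONDS, "user:42");

            // On every request: one in-memory lookup instead of a SQL query.
            String owner = jedis.get("session:" + sessionId);
            if (owner != null) {
                jedis.expire("session:" + sessionId, SESSION_TTL_SECONDS);   // sliding expiration
                System.out.println("session belongs to " + owner);
            } else {
                System.out.println("session expired or unknown");
            }
        }
    }
}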

How to performance test MongoDB Storage Engine for website Session data?

I'm looking to utilize MongoDB for session data storage, so we don't need sticky sessions in our load balanced environment.
As of 3.0, we can use different storage engines within MongoDB.
While MMapV1 and WiredTiger come out of the box, it's also possible to run other storage engines (RocksDB?).
What I would like to do is test out my website using MongoDB with the different storage engines backed behind it.
I currently have a JMeter script that will hit multiple pages on the site for many different users.
Between tests I can switch out the Mongo connection, to different Mongod instances on different storage engines.
All I can really take out of this is the average latency for the page loads in JMeter.
Are there better results I could gather, possibly using different tools or techniques?
Or, for session data, which is heavily read/write, is there one storage engine that would be preferred over another?
I'm not sure if this question is too open-ended or not, but I thought I'd ask here to maybe get more direction about how to test this out.
An important advantage of WiredTiger over the default MMAP storage engine is that while MMAP locks the whole collection for a write, WiredTiger locks only the affected document(s). That means multiple users can change multiple documents at the same time. This is especially interesting in your case of session data, because you will likely have many website visitors at the same time, each one regularly updating their own session document. But when you want to test if this feature really provides a benefit in your use-case, you will have to build a more sophisticated test setup which simulates many simultaneous updates and requests from multiple users.
Another interesting feature of WiredTiger is that it compresses both data and indexes, which greatly reduces filesize. But this feature does of course cost performance. So when you only want to compare performance, you should switch off compression to have a fair comparison. The relevant config keys are:
storage.wiredTiger.collectionConfig.blockCompressor = none
storage.wiredTiger.indexConfig.prefixCompression = false
Keep in mind that changes to these keys will only take effect on newly created collections and indexes.
Another factor which could skew your results is cache size. The MMAP engine always uses all the RAM it can get to cache data. But WiredTiger is far more conservative and only uses half of the available RAM, unless you set a different value in
storage.wiredTiger.engineConfig.cacheSizeGB
So when you want a fair comparison, you should set this to the RAM size of the machine it runs on, minus the RAM required by other processes running on the same machine. But this will of course only make a difference when your test uses more test data than fits into memory, so that the cache handling of both engines starts to matter.
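For reference, here is a sketch of how these three settings could look together in a mongod.conf YAML file (the cache size value is only an example; choose it as described above):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8          # example value: machine RAM minus what other processes need
    collectionConfig:
      blockCompressor: none   # disable data compression for a fair performance comparison
    indexConfig:
      prefixCompression: false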

Deciding suitable key value store : Voldemort vs Cassandra vs Memcached vs Redis

I am using a triple store database for one of my projects (a semantic search engine for healthcare) and it works pretty well. I am considering giving it a performance boost by adding a key-value store layer above the triple store. Triple store querying is slow since we do deep semantic processing.
This is how I am planning to improve performance:
1) Running a Hadoop job for all query terms every day by querying the triple store.
2) Caching these results in a key-value store in a cluster.
3) When a user searches for a query term, the key-value store is searched first instead of the triple store. The triple store is searched only when the query term is not found in the key-value store.
The key-value pair I plan to save is a "String" to "List of POJOs" mapping. I can save it as a BLOB.
I am confused about which key-value store to use. I am looking mainly for failover and load-balancing support. All I need is a simple key-value store that provides those features. I do not need to sort/search within values or any other functionality.
Please correct me if I am wrong. I am assuming memcached and Redis will be faster since they are in memory, but I do not know whether the Java clients for Redis (Jredis) or memcached (Spymemcached) support failover. I am not sure whether to go with in-memory or persistent storage. I am also considering Voldemort, Cassandra and HBase. Overall the key-value data will be around 2 GB to 4 GB in size. Any pointers on this would be really helpful.
I am very new to NoSQL and key-value stores. Please let me know if you need any more details.
Have you gone over this memcached tutorial series? It explains the load-balancing aspects (memcached instances balance load based on your key hash) and also discusses how spymemcached handles connectivity failures:
Use Memcached for Java enterprise performance, Part 1: Architecture and setup http://www.javaworld.com/javaworld/jw-04-2012/120418-memcached-for-java-enterprise-performance.html
Use Memcached for Java enterprise performance, Part 2: Database-driven web apps http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html
For enterprise-grade failover / cross-data-center replication support on top of the memcached protocol, you should use Couchbase, which offers these features. The product evolved from a memcached base.
Before you build infrastructure to load your cache, you might just try adding memcached on top of your existing system. First, measure your current performance well; I suggest JMeter or similar tools. Here's the workflow in your application: check memcached; if the result is there, you're done. If not, run the query against the triple store and save the result in memcached. This will improve performance if you have queries that are repeated. Memcached will use the memory you give it efficiently, throwing away things that don't get used very often. Failover is handled by your application (if it's not in memcached, you use your existing infrastructure).
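A minimal sketch of that get-or-compute workflow with the spymemcached client (the TripleStore interface, the address and the TTL are placeholders for your own code and tuning):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class QueryCache {
    /** Hypothetical wrapper around your triple store / SPARQL endpoint. */
    interface TripleStore {
        String query(String term);
    }

    private final MemcachedClient cache;
    private final TripleStore tripleStore;

    public QueryCache(TripleStore tripleStore) throws Exception {
        this.cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        this.tripleStore = tripleStore;
    }

    public String search(String queryTerm) {
        String cached = (String) cache.get(queryTerm);
        if (cached != null) {
            return cached;                             // fast path: served from memory
        }
        String result = tripleStore.query(queryTerm);  // slow path: deep semantic processing
        cache.set(queryTerm, 3600, result);            // keep for an hour; tune to your workload
        return result;
    }
}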
We use a triple store and cache data in the memcache provided by Google App Engine, and it works fine. It reduced the overhead of SPARQL queries against the triple store.
Of those options, only Cassandra has the features you mention plus full CQL support, which helps with maintenance; otherwise maybe you should look in another direction:
Write heavy, replicated, bigger-than-memory key-value store
Since you just want to cache data in front of your triple store, going with disk-based or replicated/distributed key-value stores seems pointless. All you essentially need is to cache data in front of your queries, right on the machines where those queries are executed. No "key-value stores", just vanilla Java caching solutions.
In 2016 the best cache for Java is Caffeine.
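A minimal sketch with Caffeine as an in-process cache in front of the triple store (the TripleStore interface and the sizing values are only illustrative):

import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class LocalQueryCache {
    /** Hypothetical slow backend (your triple store). */
    interface TripleStore {
        String query(String term);
    }

    private final LoadingCache<String, String> cache;

    public LocalQueryCache(TripleStore tripleStore) {
        this.cache = Caffeine.newBuilder()
                .maximumSize(100_000)                      // bound memory use
                .expireAfterWrite(24, TimeUnit.HOURS)      // results are regenerated daily anyway
                .build(tripleStore::query);                // loader runs only on a cache miss
    }

    public String search(String queryTerm) {
        return cache.get(queryTerm);                       // in-process, no network hop
    }
}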

local UUID vs network unique counter id

Which one would you choose as your surrogate key implementation?
Local UUID
It is generated locally in the application, with no network trip to retrieve it
But it is long, which can increase your storage usage
URLs become lengthy with the long UUID
There is a tiny fear that a UUID collision will happen
Or... a network-unique counter id (not sure what the proper term for this is)
I imagine a remote Redis with the atomic INCR, or Mongo with $inc (see the sketch after this list)
There is the cost of a network trip
It is much shorter, takes up less space, and results in a much shorter URL
No fear of collision, even in clustered applications
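For concreteness, the counter approach could look roughly like this with the Jedis client (the key name is just an example):

import redis.clients.jedis.Jedis;

public class CounterIdGenerator {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // INCR is atomic, so concurrent clients never receive the same id.
            long id = jedis.incr("ids:article");
            System.out.println("next id: " + id);   // short, sequential, URL-friendly
        }
    }
}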
If you are using MongoDB, you should look into using BSON ObjectIDs:
http://www.mongodb.org/display/DOCS/Object+IDs
They are created by default as the _id field unless you specify otherwise and create the _id field yourself (which can also be an ObjectID, just created by you). No fear of collision, and you could get a natively supported ID type in the DB that you can also use in your application. Seems like a win-win, as long as you use MongoDB of course ;)
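For illustration, generating one client-side with the driver's BSON types looks like this (a trivial sketch):

import org.bson.types.ObjectId;

public class ObjectIdExample {
    public static void main(String[] args) {
        ObjectId id = new ObjectId();              // generated locally, no network trip
        System.out.println(id.toHexString());      // 24-char hex string, short enough for URLs
        System.out.println(id.getDate());          // creation timestamp embedded in the id
    }
}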
You can combine both approaches. Have a look at Twitter's Snowflake algorithm. It produces globally unique 64-bit integers without any runtime coordination; it is a purely local algorithm (each node only needs a unique worker id).
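A minimal sketch of a Snowflake-style generator (the 41/10/12 bit layout is the commonly used one; the epoch and worker id handling are illustrative, and clock rollback is not handled here):

public class SnowflakeIdGenerator {
    private static final long EPOCH = 1420070400000L;   // custom epoch (2015-01-01), illustrative
    private static final long WORKER_BITS = 10;
    private static final long SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;      // must be unique per node, assigned out of band
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                  // sequence exhausted within this millisecond
                while (now <= lastTimestamp) {    // spin until the next millisecond
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;                         // note: a clock moving backwards is not handled
        }
        lastTimestamp = now;
        // 41 bits of timestamp | 10 bits of worker id | 12 bits of sequence
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}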
For a low-concurrency app, you can probably use a network counter id.
But apart from shorter URLs, there is little benefit at low concurrency (= not a lot of data).
In the case of heavy concurrent access, meaning a lot of data and a lot of cluster nodes, your Redis engine plus the associated network round trips will probably be too slow for this solution.
In conclusion:
- a network counter seems sexy but, in my opinion, useless with MongoDB.
Regarding MongoDB collisions: due to the id generation algorithm, the collision risk is near zero. Part of the ObjectId is built from the machine address, which should be unique, and you can check this address before putting your cluster into production.