gcloud run that requires a large database - gcloud

I hope this is the right place to ask this. I want to perform a sequence search against a large database and expose it through an API. I expect this service to be accessed VERY rarely, so I thought about Cloud Run (gcloud run), because it only bills me per use (and I don't use a lot). I already have a Docker container configured that does what I expect it to, but I have an issue with the data that's required: I need a database that's roughly 100 GB. Is there a way to access this in Cloud Run?
What would be the optimal way for me to get there? I think downloading 100 GB of data every time a request is made is a waste. Maybe I could fetch a zip file from a storage bucket and inflate it in the run instance? But I am not sure there is even that much space available.
Thank you

I believe the simpler way to do this is to take the weight off Cloud Run's shoulders.
I'm assuming it is some sort of structured data (JSON, CSV, etc.). If it really is, it is simpler to import this data into BigQuery and have your Cloud Run service query BigQuery.
This way your API will answer much faster, you will save the cost of running very large Cloud Run instances just to load part of those 100 GB into memory, and you will also separate your architecture into layers (i.e. an application layer and a data layer).

Related

Saving query results back into elastic stack

I am absolutely new to the Elastic Stack.
My problem space: I have a utility which runs on client machines (thousands of them), and a few logs are generated on these machines. So we have three data sources: CSV files, log files (generated by my application) and the Windows event log. I want to combine these three and generate some useful information out of them. I also want to generate a dashboard with some graphs which will be used by managers.
I have zeroed in on the ELK stack. The idea is that I install Beats on the client machines, push data to Elasticsearch and then use Kibana for visualization. Since I might have thousands of clients pushing data to the Elasticsearch server, it might not be feasible to keep this data on the server forever. But I need up-to-date visualizations to be available at all times. So I was planning to run periodic queries on the indexed data, save the results (which are the real information I need) back into Elasticsearch in a separate index, and set up the Kibana visualizations on that index. All the original data can then be cleared. This way I extract the real info, keep it, and delete the unnecessary info.
My questions to the experts are:
Is my thinking or design correct (with respect to the ELK stack), given the problem statement?
Is it feasible in the ELK stack, and are there any examples or utilities to achieve this?
Thanks
Gaurav
Saving the results of your aggregations back into Elasticsearch is a perfectly valid option. You should also consider cold storage as an option for storing large amounts of data with long retention.
You tagged logz.io in your question, so it's worth mentioning that there is a logz.io feature called 'Timeless accounts' which uses Optimizers to define query results that should be saved for longer than the retention periods of the underlying logs.
For the record, I work at logz.io

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have a backend that receives, stores and serves 10-20 MB JSON files. Which service should I use for superfast put and get (I cannot break the files into smaller chunks)? I don't have to run queries on these files, just get them, store them and supply them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put a file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second best. DynamoDB doesn't allow such an object size. What should I use? Also, is there any other service? MongoDB is a possible solution but I don't see it on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve the file, not to look into its content, which is what both Dynamo and ES would waste time doing.
My suggestion is to use AWS Lambda + S3 to optimize for cost.
S3 can take a short while after a put before the file is available, though (it can get longer, minutes even, when you have millions of objects in a bucket).
If that delay matters for your operation and total throughput at any given moment is not too large, you can create a server (preferably EC2) that serves as a temporary file stash. It will:
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file from disk
If the file is successfully uploaded to S3, serve the S3 URL and delete the file on disk
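
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FileStash` and its methods are made-up names, `upload` is a hypothetical callable that wraps whatever S3 client you use and raises on failure, and keys are assumed to be flat file names without path separators.

```python
import os
import tempfile
from typing import Callable, Dict, Tuple


class FileStash:
    """Temporary on-disk stash in front of a slower object store.

    `upload` is a hypothetical callable (key, local_path) -> URL that raises
    on failure; in practice it would wrap an S3 client call.
    """

    def __init__(self, upload: Callable[[str, str], str], stash_dir: str) -> None:
        self._upload = upload
        self._stash_dir = stash_dir
        self._urls: Dict[str, str] = {}  # key -> URL, set once the upload succeeded

    def _path(self, key: str) -> str:
        return os.path.join(self._stash_dir, key)

    def receive(self, key: str, data: bytes) -> None:
        # Step 1: receive the file and keep it on local disk.
        with open(self._path(key), "wb") as fh:
            fh.write(data)

    def try_upload(self, key: str) -> bool:
        # Step 2: try to upload; on failure the local copy stays in place.
        try:
            url = self._upload(key, self._path(key))
        except Exception:
            return False
        # Step 4: the upload succeeded, so remember the URL and free the disk.
        self._urls[key] = url
        os.remove(self._path(key))
        return True

    def locate(self, key: str) -> Tuple[str, str]:
        # Steps 3 and 4: prefer the remote URL, fall back to the local copy.
        if key in self._urls:
            return ("url", self._urls[key])
        if os.path.exists(self._path(key)):
            return ("disk", self._path(key))
        raise KeyError(key)
```

A caller would invoke `receive` when a file arrives, retry `try_upload` in the background, and use `locate` to decide whether to stream from disk or redirect to the stored URL.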

Is there a way to configure Heroku PostgreSQL to not bother loading a particular column into RAM?

This may be a long shot, but I thought I'd ask anyway.
I am looking at using Heroku's new Crane Postgres DB (400 MB RAM cache) in conjunction with an app I'm deploying on Heroku. The 400 MB cache size should be plenty for our needs... except for one column of one table, in which we store a cached PDF file as a string. The PDFs could easily use up the 400 MB of RAM pretty quickly if Heroku uses its cache for them.
If I were on an actual server, I'd just store the PDF as a file, but given Heroku's ephemeral file system, my life is much simpler if I just store the PDF in the DB rather than rigging up a connection to S3 just for this one thing. (It is further complicated by the fact that we're looking at deploying multiple Heroku instances, one for each client, so using the DBs is simpler than creating a new bucket for each one.) I don't really care about the speed on this. If people are getting the file, they will expect speeds as if it were coming from a file system anyhow, since that's how most file downloads are done. Is there any way to tell Postgres to not bother caching this column?
Or maybe I'm asking the wrong question, and there is some other way to solve the problem or design alternatives that make it irrelevant.
You don't have to do anything. PostgreSQL automatically applies TOAST to large values: it kicks in once a row grows beyond roughly 2 kB, well below the 8 kB page size.
From http://www.postgresql.org/docs/9.1/static/storage-toast.html
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
PostgreSQL caching is also done at the page level so TOAST does not have to be cached with the rest of the row (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
The fact that Postgres can TOAST large field values doesn't mean it's the best thing to do.
If you store big fields in your main database, it will make many things harder, such as creating forks or followers, and creating and restoring backups in particular. I would strongly reconsider using S3 to store the PDF files, and simply invest in automated onboarding of new clients (create Heroku app, provision database, provision/create S3 bucket).
I'm not quite sure how you're managing to store large PDFs, since Postgres imposes a maximum field size (or at least a maximum page size). However, you might be able to get around this by using TOAST. TOASTed items are stored in a separate (physical) table, so if you're not selecting them frequently they shouldn't be cached.
If you are selecting them frequently, then I'm not sure if what you want is possible. Remember that Postgres only supplies one "level" of caching - the Linux VFS does caching also.

How to build local environment with large databases

I have two data stores (PostgreSQL, MongoDB), and as I need to develop my application locally on my computer (ideally offline), I need data from those stores to be copied to my HDD.
However, these are massive databases with hundreds of gigabytes of data.
I don't need all the data stored there, just a sample of it so I can launch my app locally against that data. Both stores have capable tools for data export (pg_dump, mongodump, mongoexport etc.).
But I don't know how to easily and effectively export a small sample of the data. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited in number of rows, there would still be trouble with triggers, functions, indexes etc.
I don't know about testing for MongoDB, but for PostgreSQL here's what I do.
I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
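
As a rough illustration of that delete/update/insert reset, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for a PostgreSQL stored procedure, and the `users` table, the `FIXTURE` data and the `reset_users` function are all hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical fixture: the known-good rows, keyed by primary key.
FIXTURE = {1: "alice", 2: "bob"}


def reset_users(conn: sqlite3.Connection) -> None:
    """Bring the `users` table back to FIXTURE without dropping it."""
    keys = list(FIXTURE)
    placeholders = ",".join("?" for _ in keys)
    with conn:  # one transaction: commit on success, roll back on error
        # 1. Delete records whose key is not in the fixture.
        conn.execute(f"DELETE FROM users WHERE id NOT IN ({placeholders})", keys)
        # 2. Update records whose key matches but whose data has changed.
        #    `IS NOT` is SQLite's NULL-safe inequality.
        conn.executemany(
            "UPDATE users SET name = ? WHERE id = ? AND name IS NOT ?",
            [(name, key, name) for key, name in FIXTURE.items()],
        )
        # 3. Insert fixture records that are missing.
        conn.executemany(
            "INSERT OR IGNORE INTO users (id, name) VALUES (?, ?)",
            list(FIXTURE.items()),
        )


# Usage: start from a dirty table, then reset it before a test runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "changed"), (3, "stray")])
reset_users(conn)
rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```

In PostgreSQL the same three statements would live inside the stored procedure in the test schema, with `IS DISTINCT FROM` in place of `IS NOT`.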
For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.
What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proved base functionality and want to prove integration between components, and probably performance too. For integration testing, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.
The direction you seem to be going in is going to be difficult because, as you've already noticed, it's hard to handle that volume of data and hard to generate representative data sets (most CI systems actually use production data that's been "cleaned" of sensitive information), which is why most of the places I've worked have not gone that way.
Just copy it all. Several hundred gigabytes is not very much by today's standards; you can buy a 2000 GB disk for $80.
If you test your code on small sample data, how do you know whether it will be efficient enough for the full database?
Just remember to encrypt it with a strong password if it leaves your company building.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx. 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived (zipped or tar.gz) while others reside in plain text format.
I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API in front and allow people to get a file listing, the latest files, and the files themselves.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally by adding more machines.
Ankur
Since you don't need querying features, you can use Apache Hadoop.
I believe HDFS and HBase would be a nice fit for this.
You can find many large-scale storage stories on Hadoop's "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention. Most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
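
To get a feel for what that means in raw capacity, here is a back-of-the-envelope sketch. It assumes the question's 25 TB per year figure and treats 1 TB as 2^40 bytes; the function names are made up for the example.

```python
TIB = 2 ** 40  # assumption: 1 TB is treated as 2**40 bytes
MIB = 2 ** 20


def raw_bytes(logical_bytes: int, replication: int) -> int:
    """Disk space consumed across the cluster for a given replication factor."""
    return logical_bytes * replication


def min_block_count(logical_bytes: int, block_bytes: int = 64 * MIB) -> int:
    """Lower bound on block count (ceiling division).

    The real number is higher, because every file occupies at least one
    block no matter how small it is.
    """
    return -(-logical_bytes // block_bytes)


yearly = 25 * TIB
for replication in (2, 3):
    print(f"{replication}-way replication: {raw_bytes(yearly, replication) // TIB} TB raw per year")
print(f"at least {min_block_count(yearly)} blocks of 64 MB per year")
```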
If you are to choose a document database:
On CouchDB you can use the _attachments API to attach the file as is to a document; the document itself could contain only metadata (timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
Also HDFS is a very nice choice.
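
For the CouchDB option above, uploading a log file as an attachment comes down to an HTTP PUT to /{db}/{docid}/{attachment name}. The sketch below only builds the request and does not send it; the server URL, database name, document id and the `attachment_request` helper are assumptions made for illustration.

```python
import urllib.parse
import urllib.request
from typing import Optional


def attachment_request(
    base_url: str,
    db: str,
    doc_id: str,
    filename: str,
    data: bytes,
    rev: Optional[str] = None,
    content_type: str = "text/plain",
) -> urllib.request.Request:
    """Build a PUT request that stores `data` unchanged as a document attachment."""
    path = "/".join(urllib.parse.quote(part, safe="") for part in (db, doc_id, filename))
    url = f"{base_url.rstrip('/')}/{path}"
    if rev is not None:
        # Updating an existing document requires its current revision.
        url += "?" + urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Content-Type": content_type}
    )


# Hypothetical server, database and document names.
req = attachment_request(
    "http://localhost:5984", "logs", "app-2024-01-01", "app.log", b"raw log line\n"
)
```

Passing `req` to `urllib.request.urlopen` would perform the upload against a running CouchDB instance; the log content is sent byte for byte, so nothing has to be converted to JSON.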