How to collect binary [log] files over https from 1000+ Win/Linux machines? - elastic-stack

I need to collect performance/tracing/config binary [log] files (they use proprietary formats, like winperf .BLG or sqlserverxe .XEL) over HTTPS from 1000+ Win/Linux servers distributed across public and private clouds (i.e. not in the same LAN/VPN/Active Directory). For political reasons I'm nailed to these binary formats as well as to HTTPS, and I'm not allowed to transform the files on site and send out the resulting data instead. I can ensure the files are rolled over every ~10 minutes to keep their sizes manageable (up to 100 MB, averaging < 10 MB).
QUESTION: Is there any FOSS solution out there which can be modified or is already capable of collecting all these files in a centralized location?
I'm looking for something that already covers as many of these requirements as possible:
ensure upload channel security
endure network outages/problems of all kinds
pass through corporate proxies in all ways that a web browser can
throttle upload speed based on local network usage, or at least based on preconfigured limits (to avoid overloading the upload server, as well as the local network problems caused by hogging all the upload bandwidth)
be configurable in how it selects which files to upload (e.g. upload all files in a directory, ordered by rule X, except the newest file / any file modified in the last 2 minutes; a sketch of this rule appears at the end of this question).
Basically, I'm looking for MS BITS on steroids, open sourced and able to run on Linux.
This is a DevOps question. The solution I'm looking for is nowhere near as simple as writing a script around wput and running it from cron. It's more like writing a custom ELK Beats module or Logstash variant, and that's why I'm asking here.
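For concreteness, the kind of file-selection rule I mean in the last requirement looks roughly like this Python sketch (the directory path and thresholds are just placeholders, not part of any existing tool):

    # Illustrative sketch only: pick every file in a directory except the
    # newest one and anything modified in the last 2 minutes.
    import os
    import time

    LOG_DIR = "/var/log/myapp"      # hypothetical directory of rolled-over files
    RECENT_SECONDS = 2 * 60         # "modified in the last 2 minutes"

    def files_ready_for_upload(directory=LOG_DIR):
        paths = [os.path.join(directory, name) for name in os.listdir(directory)]
        paths = [p for p in paths if os.path.isfile(p)]
        if not paths:
            return []
        paths.sort(key=os.path.getmtime)      # oldest first
        newest = paths[-1]
        cutoff = time.time() - RECENT_SECONDS
        return [p for p in paths
                if p != newest and os.path.getmtime(p) < cutoff]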

Related

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have a backend that receives, stores and serves 10-20 MB JSON files. Which service should I use for superfast put and get (I cannot break the files into smaller chunks)? I don't have to run queries on these files, just get them, store them and supply them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put a file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second best option. DynamoDB doesn't allow such object sizes. What should I use? Also, is there any other service? MongoDB is a possible solution, but I don't see it on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve the files, not to look into their contents, which both Dynamo and ES would waste time doing.
My suggestion is to use AWS Lambda + S3 to optimize for cost.
S3 does have a small delay after a put until the file is available, though (it gets longer, minutes even, when you have millions of objects in a bucket).
If that delay matters for your operation and total throughput at any given moment is not too high, you can create a server (preferably EC2) that serves as a temporary file stash. It will (roughly as sketched after this list):
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file from disk
If the file has been successfully uploaded to S3, serve the S3 URL and delete the file on disk
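A minimal sketch of that stash pattern, assuming Flask and boto3 with a hypothetical bucket name (error handling, authentication and large/multipart uploads are omitted):

    # Hypothetical sketch of the "temporary stash" pattern described above.
    import os
    import threading

    import boto3
    from botocore.exceptions import ClientError
    from flask import Flask, redirect, request, send_file

    BUCKET = "my-stash-bucket"      # hypothetical bucket name
    STASH_DIR = "/var/tmp/stash"    # local staging directory
    os.makedirs(STASH_DIR, exist_ok=True)

    app = Flask(__name__)
    s3 = boto3.client("s3")

    def upload_async(key, path):
        # Push the stashed file to S3 in the background.
        threading.Thread(target=s3.upload_file, args=(path, BUCKET, key),
                         daemon=True).start()

    @app.route("/files/<key>", methods=["PUT"])
    def put_file(key):
        # Receive the file and keep it on local disk first.
        path = os.path.join(STASH_DIR, key)
        with open(path, "wb") as f:
            f.write(request.get_data())
        upload_async(key, path)    # then try to upload it to S3
        return "", 201

    @app.route("/files/<key>", methods=["GET"])
    def get_file(key):
        path = os.path.join(STASH_DIR, key)
        try:
            # Already visible on S3: hand out the S3 URL, drop the local copy.
            s3.head_object(Bucket=BUCKET, Key=key)
            if os.path.exists(path):
                os.remove(path)
            return redirect(f"https://{BUCKET}.s3.amazonaws.com/{key}")
        except ClientError:
            # Not on S3 yet: serve the copy still sitting on disk.
            return send_file(path)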

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails on Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it was having a hard time scaling over that many users' data: hitting various resource limits etc.)
Is it better to store one email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage I see folks operating on large files, but it seems more logical to keep one file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
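For illustration, a minimal sketch of the manifest approach, assuming the google-cloud-storage Python client; the bucket name, object paths and manifest location are all made up:

    # Upload objects and record exactly what was uploaded, so the analytics
    # job can read the manifest instead of listing the bucket.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("email-analytics-input")   # hypothetical bucket

    manifest = []
    for path in ["user1/mail-0001.eml", "user1/mail-0002.eml"]:  # example inputs
        bucket.blob(path).upload_from_filename(path)
        manifest.append(path)

    # Hand the explicit object list to the computation.
    bucket.blob("manifests/run-001.txt").upload_from_string("\n".join(manifest))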

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is, what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?
tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions as in the typical use case with text files, the storage cost should be inexpensive. However, if the entirety of the contents change as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately, there's a background "deltafication" job that runs nightly to compute the deltas to trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
Client storage: Clients will need to have enough disk space to contain these files also (although only at the revision that they've downloaded.) This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.
If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL Server has the capacity to house the storage, I believe you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. How severe this is depends greatly on your network setup and stability.
The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd advise keeping everything in one team project).
If you have control of the XML format, then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed: shorten element and attribute names, reduce the number of decimal places you output for floating-point numbers, etc. You will find that simple schemes like this will knock many megabytes off the size of GB-sized files, and it's easy to knock up a quick XSLT transform or some code to convert the files over to the new format.
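As a rough illustration of those tweaks (not the poster's actual transform), here is a standard-library Python sketch; the tag map and precision are invented, and for truly GB-sized files you would want a streaming approach such as iterparse instead:

    # Shorten verbose tag names and trim floating-point precision in an XML file.
    import xml.etree.ElementTree as ET

    TAG_MAP = {"MeasurementValue": "mv", "TimestampUtc": "ts"}  # hypothetical renames

    def shrink(in_path, out_path, decimals=3):
        tree = ET.parse(in_path)
        for elem in tree.iter():
            elem.tag = TAG_MAP.get(elem.tag, elem.tag)
            if elem.text and "." in elem.text:
                try:
                    # Round numeric text down to a few decimal places.
                    elem.text = f"{float(elem.text):.{decimals}f}"
                except ValueError:
                    pass   # not a number, leave it alone
        tree.write(out_path)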

What's a suitable storage RDBMS,NoSQL, for caching web site responses?

We're in the process of building an internal, Java-based RESTful web services application that exposes domain-specific data in XML format. We want to supplement the architecture and improve performance by leveraging a cache store. We expect to host the cache on separate but collocated servers, and since the web services are Java/Grails, a Java or HTTP API to the cache would be ideal.
As requests come in, unique URIs and their responses would be cached using a simple key/value convention, for example...
KEY                                          VALUE
http://prod1/financials/reports/JAN/2007 --> XML response of 50 MB
http://prod1/legal/sow/9004              --> XML response of 250 KB
Response values for a single request can be quite large, perhaps up to 200 MB, but could be as small as 1 KB. And the number of requests per day is small; not more than 1000, but averaging 250; we don't have a large number of consumers; again, it's an internal app.
We started looking at MongoDB as a potential cache store, but given that MongoDB has a max document size of 8 or 16 MB, we did not feel it was the best fit.
Based on the limited details I provided, any suggestions on other types of stores that could be suitable in this situation?
The way I understand your question, you basically want to cache the files, i.e. you don't need to understand the files' contents, right?
In that case, you can use MongoDB's GridFS to cache the XML as a file. This way, you can smoothly stream the file in and out of the database. You could use the URI as a 'file name' and, well, that should do the job.
There are no (reasonable) file size limits and it is supported by most, if not all, of the drivers.
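To make that concrete, a minimal sketch using pymongo's gridfs module (the database name is a placeholder, and cache eviction/expiry is left out):

    # Cache XML responses in GridFS, keyed by the request URI.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["response_cache"]   # hypothetical cache database
    fs = gridfs.GridFS(db)

    def put_response(uri, xml_bytes):
        # Use the URI as the GridFS "filename" for the cached payload.
        fs.put(xml_bytes, filename=uri)

    def get_response(uri):
        cached = fs.find_one({"filename": uri})
        return cached.read() if cached else None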
Twitter's engineering team just blogged about their SpiderDuck project that does something like what you're describing. They use Cassandra and Scribe+HDFS for their backends.
http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
The simplest solution here is just caching these pieces of data in a file system. You can use tmpfs to ensure everything stays in main memory, or any normal file system if you want your cache to be larger than the memory you have. Don't worry: even in the latter case the OS kernel will efficiently cache everything that is used frequently in main memory. You still have to delete old files via cron if you're using Linux.
It seems like an old-school solution, but it can be simpler to implement and less error-prone than many others.
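A bare-bones Python sketch of such a file-system cache (the cache directory, assumed to be a tmpfs mount, and the TTL are placeholders; evict_stale is what the cron job would run):

    # Cache responses as files keyed by a hash of the URI.
    import hashlib
    import os
    import time

    CACHE_DIR = "/mnt/response-cache"   # e.g. a tmpfs mount
    MAX_AGE_SECONDS = 24 * 3600         # hypothetical TTL

    def _path_for(uri):
        # Hash the URI so it becomes a safe, fixed-length file name.
        return os.path.join(CACHE_DIR, hashlib.sha256(uri.encode()).hexdigest())

    def put(uri, payload):
        with open(_path_for(uri), "wb") as f:
            f.write(payload)

    def get(uri):
        try:
            with open(_path_for(uri), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def evict_stale():
        # Drop cache entries older than the TTL.
        cutoff = time.time() - MAX_AGE_SECONDS
        for name in os.listdir(CACHE_DIR):
            path = os.path.join(CACHE_DIR, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)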

Attaching Informix .dat and .idx files

We are trying to duplicate one of our Informix databases on a test server, but without Informix expertise in house we can only guess what we need to do. I am learning this stuff on the fly myself and am nowhere near the expertise level needed to operate Informix efficiently or even inefficiently. Anyhow...
We managed to copy the .dat and .idx files from the live server to another location. We installed Linux and the latest Informix Dynamic Server on the test server and have it up and running.
Now what should we do with the .dat and .idx files from the live server? Do we copy them somewhere so they will be recognized automatically?
Or is there something equivalent to MS SQL Server's attach-database operation to register the database files in the new instance?
At the end of my rope...
You've asked a pretty complicated question without realizing it. Informix is architected as a shared-everything database engine, meaning all resources available to the instance are available to every database in that instance. This means that more than one database can store data in any given dbspace (.dat or .idx file, in your case). Most DBAs know better than to do that, but it's something to be aware of. Given this, you now know that the .dat and .idx files do not belong to a database but to the instance. The dbspaces and files were created to contain your database's data, but they technically belong to the instance. It's worth noting that the .dat and .idx files are known to the database by the logical dbspace name.
Armed with this background info and assuming that the production and development servers are running the same OS and that your hardware is relatively the same, not a combination of PARISC, Itanium or x86/x64, I'll throw a couple of options out for you.
1. Create the dbspaces that you need in the new instance and use onunload and onload to copy the database from production to development.
2. Use ontape or onbar to back up the entire production instance and restore it over your development instance.
Option 1 requires that you know what the dbspaces are named and how large they are. Use onstat -d on the production instance to find this out. BTW, the numbers listed in onstat -d are in pages; I believe Linux uses a 2 KB page, so a dbspace showing 500,000 pages is roughly 1 GB.
Option 2 simply requires that the paths for the data files are the same on both servers. This means that the ROOTDBS needs to be the same in both instances. That can be found by executing onstat -c | grep ROOTDBS
There's a lot that has been left out but I hope that this gives you the info that you need to move forward with your task.
The .dat and .idx files are associated with C-ISAM; when they are organized in a directory called dbase.dbs (where dbase is the name of your database), they are associated with Informix Standard Engine, aka Informix SE. SE uses C-ISAM to manage its storage. SE is rather different from (and much simpler than) Informix Dynamic Server (IDS). It is not impossible that the .dat and .idx files are associated with IDS; it is just extremely unlikely.
From the information available, it sounds as though your production server is running SE. To get the data from SE to IDS, you will probably want to use DB-Export at the SE end and DB-Import at the Linux/IDS end. Certainly, that is the simplest way to do it.
There are other possible solutions - C-ISAM datablade being one such - but they are more expensive and probably not warranted. There are other possible loading solutions, such as HPL (High-Performance Loader).
For more information about Informix, either use the various web sites already referenced (http://www.informix.com is a link to the Informix section of IBM's web site), or use the International Informix User Group (IIUG) web site. There are mailing lists available (which require you to belong, but membership is free) for discussing Informix in detail.
Those Informix-SE datafiles (.DAT) and their associated index files (.IDX) are useless unless you also have all the associated catalog files, such as SYSTABLES.DAT, SYSTABLES.IDX, SYSCOLUMNS, SYSINDEXES, etc.
Then you also have to worry about which version of Informix-SE created them, as some have a 2K or 4K index file node size.
Your best approach is to obtain all the .DAT and .IDX files from the source db, plus the correct standard engine, installed on the same hardware and operating system it came from.
Long story short: on the source machine, run "dbexport" to unload all the data to ASCII files, and run "dbschema" to generate all the table schemas and indexes. It also wouldn't hurt to run "bcheck" on all the files before unloading them to ASCII flat files.
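Purely as an illustration of that sequence (not an official procedure), a tiny Python wrapper; the database name is a placeholder and the utilities' exact arguments should be checked against your SE version's documentation:

    # Run the schema dump and data unload on the source (SE) machine.
    import subprocess

    DBNAME = "stores"  # hypothetical database name

    # Generate the table and index schemas as SQL.
    subprocess.run(["dbschema", "-d", DBNAME, f"{DBNAME}_schema.sql"], check=True)

    # Unload every table to ASCII flat files (creates a <DBNAME>.exp directory).
    subprocess.run(["dbexport", DBNAME], check=True)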
I don't have any Informix-specific advice but for situations like this you can usually find the answer by looking up how to move a database (a common admin task, and usually well described in the manual) and just skipping the steps that would remove the old database.
Also, be careful of problems caused by different system architectures; some DBs fail spectacularly if you move them from a big-endian system (such as Solaris) to a little-endian system (such as x86 Linux). Again, the manual section on moving a DB would cover any extra steps that are needed.