Streaming data to OpenTSDB

I have played a bit with OpenTSDB, each time reading data from a text file and writing it to a .txt file.
Is there a way to bypass the .txt file, i.e. stream data directly to OpenTSDB?
I have seen the implementation with tcollector, but I am looking for something more general that would scale to fairly large volumes of data.
Thanks for your help.
Philippe C.
PS: I am being as specific as I can, but I know it isn't really clear. If you have any questions that I haven't thought of, ask away!

The general way to get data into OpenTSDB is to use the HTTP put API.
To scale with large amounts of data, you can run multiple OpenTSDB servers in a cluster and round-robin the data between them with an HTTP load balancer.
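As a rough illustration of the answer above, here is a minimal sketch of posting data points to the HTTP put endpoint (/api/put) without going through a file. The metric name, tag values, and base URL are hypothetical; in a clustered setup the base URL would point at the load balancer rather than at a single TSD.

```python
import json
import time
import urllib.request


def build_put_payload(metric, value, tags, timestamp=None):
    """Build one data point in the JSON shape that /api/put accepts."""
    if not tags:
        # OpenTSDB rejects data points that carry no tags.
        raise ValueError("at least one tag is required per data point")
    return {
        "metric": metric,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "value": value,
        "tags": dict(tags),
    }


def send_points(points, base_url="http://localhost:4242"):
    """POST a batch of points. base_url is a placeholder (assumption)."""
    body = json.dumps(points).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical usage: batch several points into one request.
# points = [build_put_payload("sensor.temperature", 21.5, {"host": "web01"})]
# send_points(points)
```

Batching several points per request, as the usage comment suggests, usually reduces per-request overhead compared with one request per point.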


Writing Custom Extensions in Druid

I am new to Druid.
Problem Statement
We currently push raw event data to Druid. I have a requirement to apply certain calculations to the data (for example, certain statistical techniques) that are not supported by Druid or by the extensions it provides out of the box.
I have two questions:
What would be a better way to achieve this? (Have some external script that reads data from Druid, performs the calculations, and writes the results back to Druid?)
Can I take the route of writing custom extensions for Druid? I could not find any good documentation on how to go about writing and testing Druid extensions.
These links do not provide any in-depth information:
http://druid.io/docs/latest/development/modules.html
https://github.com/apache/incubator-druid (Druid repo that has some core and community contrib extensions)
Appreciate any help on this. Thank you.
You can achieve this both ways; it's up to you how comfortable you are with writing an extension yourself and then maintaining it. That is certainly more time-consuming than the other way.
If you read data from Druid, perform your calculation, and write the data back to Druid, you will end up writing to a separate table. If you are not storage-bound on your Druid cluster, you can certainly take this path, and it's less time-consuming.
Yes, this is the recommended way to perform any custom computation on data. You can certainly write a simple extension easily. Here's an example GitHub repo that helps with writing a custom Druid extension: https://github.com/implydata/druid-example-extension

We are trying to persist logs in S3 using Kinesis Firehose. However, I would like to merge each stream of data into one big file. How would I do that?

Should I use Lambda or Spark Streaming to merge each incoming streamed file into one big file in S3?
Thanks
Sandip
You can't really append to files in S3; you would have to read in the entire file, add the new data, and then write the file back out, either with a new name or the same name.
However, I don't think you really want to do this. Sooner or later, unless you have a trivial amount of data coming in on Firehose, your S3 file is going to be too big to keep reading, appending new text to, and sending back to S3 in an efficient and cost-effective manner.
I would recommend you set the Firehose limits to the longest time/largest size interval (to at least cut down on the number of files you get), and then rethink whatever processing you had in mind that makes you think you need to constantly merge everything into a single file.
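To make the cost of the read-append-rewrite cycle in this answer concrete, here is a minimal sketch. It assumes a boto3-style S3 client is passed in (get_object and put_object are real boto3 client methods); the function name and the returned byte count are illustrative only, and the sketch assumes the object already exists.

```python
def append_to_object(s3_client, bucket, key, new_data):
    """Emulate an append: download the whole object, add bytes, upload it all again.

    Returns the number of bytes moved over the network for this one "append",
    which grows with the size of the existing object, not with new_data.
    """
    existing = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    merged = existing + new_data
    s3_client.put_object(Bucket=bucket, Key=key, Body=merged)
    return len(existing) + len(merged)


# Hypothetical usage (bucket and key names are placeholders):
# import boto3
# client = boto3.client("s3")
# append_to_object(client, "my-log-bucket", "merged/all.log", b"new line\n")
```

Note that this sequence is not atomic: two writers running it at the same time can silently overwrite each other's data, which is one more reason to prefer fewer, larger Firehose deliveries over a single ever-growing file.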
You will want to use an AWS Lambda to transfer your Kinesis Stream data to the Kinesis Firehose. From there, you can use Firehose to append the data to S3.
See the AWS Big Data Blog for a real-life example. The GitHub page provides a sample KinesisToFirehose Lambda.

Can I put large binary files in a RabbitMQ queue?

I'm trying to design a multi-server update deployment system, and I was wondering whether there is any limitation on big binary strings. For example, what if I put the contents of a 100MB file into the queue as a string?
Thanks,
Pedro
I've done it and I would not necessarily recommend it. It's probably better to store the file in something like GridFS (MongoDB) and then reference the _id in the RabbitMQ message. You can then pull the file on the consumer using Mongo's interface and delete it once done.
I have this running with about 20M objects in GridFS and it's been rock solid.
Searching for "RabbitMQ Large Files" turned up a significant amount of advice on the subject.
The standard response seems to be that it should, in theory, be able to handle it, but you may find that your broker becomes unresponsive.
If you own both sides of the queue (sender/receiver), then you may consider splitting the data into more manageable chunks, e.g. 100KB each. This will be nicer to your broker. One of the search hits from above had a link to a 'streaming' sender written in Ruby, which did chunking.
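A minimal sketch of the chunking idea, independent of any RabbitMQ client library: the sender tags each chunk with its index and the total count, and the receiver reassembles once everything has arrived. The tuple layout and function names are assumptions for illustration; a real system would also carry a transfer id so that chunks from different files do not mix.

```python
from typing import Iterable, Iterator, Tuple

CHUNK_SIZE = 100 * 1024  # 100KB, the example size from the answer above

Chunk = Tuple[int, int, bytes]  # (index, total, data)


def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Chunk]:
    """Yield (index, total, data) tuples so the receiver can detect gaps."""
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    for index in range(total):
        start = index * chunk_size
        yield index, total, payload[start:start + chunk_size]


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Rebuild the payload; chunks may arrive in any order."""
    received = list(chunks)
    if not received:
        raise ValueError("no chunks received")
    total = received[0][1]
    by_index = {index: data for index, _, data in received}
    missing = [i for i in range(total) if i not in by_index]
    if missing:
        raise ValueError(f"missing chunks: {missing}")
    return b"".join(by_index[i] for i in range(total))
```

Each tuple would be serialized into one queue message; the receiver buffers messages until it has all of them and then calls reassemble.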
If you do not own both sides of the queue, then consider using a form of 'claim check', where your message contains the location of the large blob/file/data in a storage location better suited to it.
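The claim-check idea can be sketched as follows. The in-memory store is only a stand-in for real storage such as GridFS or S3, and the message fields (blob_id, filename, size) are hypothetical names chosen for this example.

```python
import json
import uuid


class InMemoryBlobStore:
    """Stand-in for real blob storage (GridFS, S3, a shared filesystem)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self._blobs[blob_id] = data
        return blob_id

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def delete(self, blob_id: str) -> None:
        del self._blobs[blob_id]


def make_claim_check(store, data: bytes, filename: str) -> bytes:
    """Store the large payload and return the small message to enqueue."""
    blob_id = store.put(data)
    claim = {"blob_id": blob_id, "filename": filename, "size": len(data)}
    return json.dumps(claim).encode("utf-8")


def redeem_claim_check(store, message: bytes):
    """Consumer side: fetch the payload by reference, then clean up."""
    claim = json.loads(message)
    data = store.get(claim["blob_id"])
    store.delete(claim["blob_id"])
    return claim["filename"], data
```

Only the small JSON message travels through the broker, so its size stays roughly constant regardless of how large the file is.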
This could be pretty interesting background information: http://rabbitmq.1065348.n5.nabble.com/Can-RabbitMQ-handle-big-messages-tt566.html#a569

Is Cassandra good for storing files?

I'm developing a PHP platform that will make heavy use of images, documents, and any other file format that comes to mind, so I was wondering whether Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and replicates automatically among nodes.
Thanks for your help.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities --
any value written or fetched has to fit in memory. This is inherent to Thrift's
design and is therefore unlikely to change. So adding large object support to
Cassandra would need a special API that manually split the large objects up
into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever
size you are comfortable with -- at least one person is using 64MB -- and making a file correspond
to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way; this isn't an ad.)
As more recent information, Netflix provides utilities for its Cassandra client, Astyanax, for storing files as handled object stores. A description and examples can be found here. Writing some tests with Astyanax can be a good starting point for evaluating Cassandra as file storage.

Storing millions of log files - Approx 25 TB a year

As part of my work we receive approx 25TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others are kept as plain text.
I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put up a small REST API that lets people get a file listing, get the latest files, and fetch a file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase would be a nice fit for this.
You can see a lot of huge storage stories on Hadoop's "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at Gluster? It is scalable and provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention. Most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
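As a back-of-the-envelope check on that suggestion, the sketch below estimates the raw disk footprint and block count for one year of logs. It treats 25 TB as 25 TiB and assumes the data packs perfectly into 64 MB blocks, so the block count is a lower bound: every file occupies at least one block entry, and millions of small files would push the real number much higher.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2


def hdfs_footprint(logical_bytes, replication=3, block_size=64 * MIB):
    """Estimate block count and raw storage for data stored with N-way replication."""
    blocks = -(-logical_bytes // block_size)  # ceiling division
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": logical_bytes * replication,
    }


yearly = hdfs_footprint(25 * TIB)
print(yearly["blocks"])            # 409600 blocks at minimum
print(yearly["raw_bytes"] // TIB)  # 75 TiB of raw disk with 3-way replication
```

With 2-way replication the raw figure drops to 50 TiB, at the price of tolerating only a single replica loss per block.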
If you are to choose a document database:
On CouchDB you can use the _attachments API to attach the file as is to a document; the document itself could contain only metadata (such as timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would have to build the API yourself.
HDFS is also a very nice choice.