I am looking for a recommendation for a free Linux tool that can collect NetFlow v9 traffic and store the parsed data in a time-series database for further analysis. I don't need analysis capabilities, just good, reliable collection and storage performance.
Related
We're trying to design a real-time sentiment analysis system (on paper) for a school project. We got some (very vague) negative feedback on how we store our data, but it isn't fully clear why this would be a bad idea or how we could best improve it.
The setup of the system is as follows:
Data from real-time news RSS feeds is gathered in a Kafka messaging queue, which connects to our preprocessing platform. This preprocessing step transforms all the news articles into semi-structured data on which we can run a sentiment analysis.
We then want to store both the sentiment analysis and the preprocessed, semi-structured news article for reference.
We were thinking of using MongoDB as the database, since it gives you a lot of freedom in defining different fields in the value (of the key:value pair you store), rather than Cassandra (which would be faster).
The basic use case is for people to look up an institution and get the sentiment analysis of a bunch of news articles in a certain timeframe.
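For illustration, here is a minimal sketch of how we imagine that lookup working with MongoDB and pymongo; the collection name, field names and the sentiment representation are placeholders for the example, not part of our actual design.

```python
# Hypothetical sketch of the lookup use case with pymongo.
# Collection name, field names and the sentiment scale are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
articles = client["news"]["articles"]

# Compound index so "institution within a timeframe" lookups stay fast as the collection grows.
articles.create_index([("institution", ASCENDING), ("published_at", ASCENDING)])

# One denormalized document per analyzed article: sentiment and the preprocessed article live together.
articles.insert_one({
    "institution": "ACME Corp",
    "published_at": datetime(2023, 5, 2, 14, 30, tzinfo=timezone.utc),
    "sentiment": {"label": "negative", "score": -0.62},
    "article": {"title": "...", "body": "...", "source_feed": "..."},
})

# The basic use case: all sentiment results for an institution in a timeframe.
cursor = articles.find(
    {
        "institution": "ACME Corp",
        "published_at": {
            "$gte": datetime(2023, 5, 1, tzinfo=timezone.utc),
            "$lt": datetime(2023, 6, 1, tzinfo=timezone.utc),
        },
    },
    {"sentiment": 1, "article.title": 1, "published_at": 1},
)
for doc in cursor:
    print(doc["published_at"], doc["sentiment"]["label"])
```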
As a possible improvement: do we need to use a NoSQL database, or would it make sense to use a SQL database? I think our system could benefit from being denormalized (as is the default in NoSQL), and we wouldn't need any operations such as joins, which are significantly faster in SQL systems.
Does anyone know of existing systems that do similar things, for comparison?
Any input would be highly appreciated.
I am currently streaming IoT data to my MongoDB instance, which is running in a Docker container (hosted on AWS). I am getting a couple of thousand data points per day.
I will be using the gathered data for some intensive data analysis and ML, which will run on a day-to-day basis.
Is this how big data is normally stored? What are the industry standards and best practices?
It depends on a lot of factors, for example the type of data you are analyzing, how much data you have, and how quickly you need it.
For applications such as user behavior analysis, a relational DB is best.
Well, if the data fits into a spreadsheet, then it is better suited to a SQL-type database such as Postgres or BigQuery, as relational databases are good at analyzing data in rows and columns.
For semi-structured data (think social media, text or geographical data that requires a large amount of text mining or image processing), a NoSQL-type database such as MongoDB or CouchDB works best.
On the other hand, relational databases can be queried with SQL. SQL as a language is well known among data analysts and engineers and is also easier to learn than most programming languages.
Databases that are commonly used in the industry to store big data are:
Relational Database Management System (RDBMS): the storage engine uses a B-Tree structure. B-Trees organize both the index and the data, and reads and writes take logarithmic time.
MongoDB: you can use this platform if you need to de-normalize tables. It is apt if you want documents that hold all the related nested structures in a single document for maintaining consistency (see the sketch after this list).
Cassandra: this database platform is perfect for queries that are known upfront and for fast writes. Query performance is somewhat lower, but that trade-off makes it well suited to time-series data. Cassandra uses the Log-Structured Merge-Tree (LSM-tree) format in its storage engine.
Apache HBase: this data management platform is similar to Cassandra in its formatting, and HBase also has comparable performance characteristics.
OpenTSDB: this platform is perfect for IoT use cases where data points arrive by the thousands within seconds and the collected metrics feed dashboards.
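To make the MongoDB point above concrete, here is a minimal, hypothetical sketch of a denormalized document written with pymongo: device metadata and a day's readings nested in a single document. The database, collection and field names are made up for the example.

```python
# Hypothetical denormalized document: device metadata and a bucket of readings
# nested in a single document (names and structure are illustrative only).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client["iot"]["readings"]

readings.insert_one({
    "device": {"id": "sensor-042", "site": "warehouse-3", "type": "temperature"},
    "day": datetime(2023, 5, 2, tzinfo=timezone.utc),
    # All samples for the day are kept with the device they belong to,
    # so a single read returns everything without joins.
    "samples": [
        {"ts": datetime(2023, 5, 2, 0, 0, tzinfo=timezone.utc), "value": 21.4},
        {"ts": datetime(2023, 5, 2, 0, 5, tzinfo=timezone.utc), "value": 21.6},
    ],
})
```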
Hope it helps.
I have read about various so-called time-series NoSQL databases. But NoSQL has only 4 types: key-value, document, columnar and graph.
For example, InfluxDB does not state which NoSQL type it is, but from the documentation it seems like a simple key-value store to me.
Are these time-series databases just specialized databases of one of those 4 types, or are they a new type of NoSQL database?
So to make it short: you can find both pure time-series databases and engines that run on top of a more generic engine like Redis, HBase, Cassandra, Elasticsearch, ...
Time-series databases (TSDBs) are data engines that focus on saving and retrieving time-based information very efficiently.
Very often, since these databases capture "events" (system, device/IoT and application ticks), they support highly concurrent writes, and they usually do a lot more writes than reads.
TSDBs store data points within a time series, and the timestamp is usually the main index/key, allowing very efficient time-range queries (give me the data points from this time to this time).
Data points can be multi-dimensional and can carry tags/labels.
TSDBs provide mathematical operations on data points: SUM, DIV, AVG, ... to combine data over time.
So based on these characteristics you can find databases that offer this. As you mention, you can use specialized solutions like InfluxDB, Druid or Prometheus; or you can find more generic database engines that provide native time-series support or an extension. Let me list some of them:
Redis TimeSeries
Elasticsearch
OpenTSDB: runs on top of Apache HBase
Warp 10: runs on top of Apache HBase
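As a concrete illustration of the timestamp-as-key, tagging and aggregation points above, here is a minimal sketch using the InfluxDB 2.x Python client; the URL, token, org, bucket and measurement/tag names are placeholders.

```python
# Minimal sketch: write a tagged data point, then run an aggregated time-range query.
# URL, token, org, bucket and measurement/tag names are placeholders.
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# Write: the timestamp is the key, "host" is a tag/label, "usage" is the value.
write_api = client.write_api(write_options=SYNCHRONOUS)
point = Point("cpu").tag("host", "server01").field("usage", 12.5).time(datetime.now(timezone.utc))
write_api.write(bucket="metrics", record=point)

# Read: a time-range query with a mean aggregation over 5-minute windows.
flux = '''
from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r.host == "server01")
  |> aggregateWindow(every: 5m, fn: mean)
'''
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())
```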
I'm involved in a project with 2 phases and I'm wondering if this is a big data project (I'm a newbie in this field).
In the first phase I have this scenario:
I have to collect a huge amount of data
I need to store it
I need to build a web application that shows the data to users
In the second phase I need to analyze the stored data, build reports and do some analysis on it.
As an example of the data quantity: in one day I may need to collect and store around 86,400,000 records.
Now I was thinking of this kind of architecture:
to collect data, some asynchronous technology like ActiveMQ and the MQTT protocol (see the sketch after this list)
to store data, I was thinking about a NoSQL DB (MongoDB, HBase or another)
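To make the first-phase idea concrete, here is a minimal sketch of an MQTT subscriber that writes each incoming message into MongoDB; the broker address, topic, collection names and JSON payload format are assumptions, and the constructor follows the paho-mqtt 1.x API.

```python
# Hypothetical phase-1 ingestion: subscribe to an MQTT topic and store each
# message as a document in MongoDB (topic/collection names are placeholders).
import json
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
from pymongo import MongoClient

records = MongoClient("mongodb://localhost:27017")["ingest"]["records"]

def on_connect(client, userdata, flags, rc):
    client.subscribe("sensors/#")  # placeholder topic

def on_message(client, userdata, msg):
    # One document per message; the payload is assumed to be JSON.
    records.insert_one({
        "topic": msg.topic,
        "received_at": datetime.now(timezone.utc),
        "payload": json.loads(msg.payload),
    })

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()
```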
This would solve my first-phase problems.
But what about the second phase?
I was thinking about some big data software (like Hadoop or Spark) and some machine learning software, so I can retrieve the data from the DB, analyze it, and build or store it in a better way in order to produce good reports and do some specific analysis.
I was wondering if this is the best approach.
How would you solve this kind of scenario? Am I on the right track?
Thank you,
Angelo
As answered by siddhartha, whether your project can be tagged as a big data project or not depends on the context and the business domain/case of your project.
Coming to the tech stack, each of the technologies you mentioned has a specific purpose. For example, if you have structured data, you can use any modern database with query support. NoSQL databases come in different flavours (columnar, document-based, key-value, etc.), so the technology choice again depends on the kind of data and the use case that you have. I suggest you do some POCs and analysis of the technologies before making the final call.
The definition of big data varies from user to user. For Google, 100 TB might be small data, but for me it is big data because of the difference in available commodity hardware. For example, Google can have a cluster of 50,000 nodes, each with 64 GB of RAM, for analysing 100 TB of data, so for them this is not big data. But I cannot have a cluster of 50,000 nodes, so for me it is big data.
The same goes for your case: if you have commodity hardware available, you can go ahead with Hadoop. As you have not mentioned the size of the files you are generating each day, I cannot be certain about your case. But Hadoop is always a good choice for processing your data, because newer projects like Spark can help you process data in much less time and also give you real-time analysis features. So in my opinion it is better if you can use Spark or Hadoop, because then you can play with your data. Moreover, since you want to use a NoSQL database, you can use HBase, which is available with Hadoop, to store your data.
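As a rough idea of what the second phase could look like with Spark, here is a minimal PySpark sketch that aggregates one day of records into a report; the input path, output path and column names are assumptions.

```python
# Minimal phase-2 sketch with PySpark: load a day's records and compute
# per-device daily statistics (paths and column names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-report").getOrCreate()

df = spark.read.json("hdfs:///data/2023-05-02/*.json")  # placeholder path

report = (
    df.groupBy("device_id")
      .agg(
          F.count("*").alias("records"),
          F.avg("value").alias("avg_value"),
          F.max("value").alias("max_value"),
      )
)
report.write.mode("overwrite").parquet("hdfs:///reports/2023-05-02")
spark.stop()
```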
Hope this answers your question.
I have a strong use case for mixing scientific data, i.e. double matrices and vectors, with relational data, and using this as the data source for a distributed computation, e.g. MapReduce, Hadoop, etc. Up to now I have been storing my scientific data in HDF5 files with custom HDF5 schemas and the relational data in Postgres, but since this setup does not scale very well, I was wondering whether there is a more NoSQL/hybrid approach to support the heterogeneity of this data.
e.g. my use case would be to distribute a complex process that involves:
loading GBs of data from a time-series database provider
linking the time series to static data, e.g. symbol information, expiry, maturity dates, etc.
launching a series of scientific computations, e.g. covariance matrices, distribution fitting, MC simulations
distributing the computations across many separate HPC nodes and storing the intermediate results for traceability.
These steps require a distributed database that can handle both relational and scientific data. One possibility would be to store the scientific data in HDF5 and put it in BLOB columns within a relational database, but this is a misuse. Another would be to store the HDF5 results on disk and have the relational database link to them, but we lose self-containment. However, neither of these two approaches accounts for distributing the data for direct access on the HPC nodes, as the data would need to be pulled from a central node, and this is not ideal.
I am not sure if I can give a proper solution, but we have a similar setup.
We have meta-information stored in an RDBMS (PostgreSQL) and the actual scientific data in HDF5 files.
We have a couple of analyses that are run on our HPC. The way it is done is as follows:
A user wants to run an analysis (from a web frontend).
A message is sent to a central message broker (AMQP, RabbitMQ) containing the type of analysis and some additional information
A worker machine (VM) picks up the message from the central message broker (see the worker sketch after this list). The worker uses REST to retrieve meta-information from the RDBMS, stages the files on the HPC and then creates a PBS job on the cluster.
Once the PBS job is submitted, a message with the job ID is sent back to the message broker to be stored in the RDBMS.
The HPC job runs the scientific analysis and then stores the result in an HDF5 file.
Once the job is finished, the worker machine stages out the HDF5 files to an NFS share and stores the link in the RDBMS.
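Here is a minimal sketch of what that worker step could look like, using pika to consume from RabbitMQ and submit a PBS job; the queue name, message format and qsub invocation are simplified placeholders, not our exact code.

```python
# Hypothetical worker: consume an analysis request from RabbitMQ and submit a PBS job.
# Queue name, message format and the qsub call are simplified for illustration.
import json
import subprocess

import pika

def on_request(channel, method, properties, body):
    request = json.loads(body)  # e.g. {"analysis": "covariance", "dataset_id": 42}
    # Stage-in and job script generation are omitted; we just submit a prepared job script.
    result = subprocess.run(
        ["qsub", f"/opt/jobs/{request['analysis']}.pbs"],
        capture_output=True, text=True, check=True,
    )
    job_id = result.stdout.strip()
    print("submitted", job_id)  # in the real setup the job ID goes back to the broker/RDBMS
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="analysis-requests")
channel.basic_consume(queue="analysis-requests", on_message_callback=on_request)
channel.start_consuming()
```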
I would recommend against storing binary files in an RDBMS as BLOBs. I would keep them in HDF5 format. You can have different backup policies for the database and the filesystem.
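To illustrate the "data in HDF5, link in the database" pattern, here is a minimal sketch with h5py and psycopg2; the table layout, connection string and paths are placeholders rather than our actual schema.

```python
# Hypothetical sketch: write results to HDF5, keep only metadata and the file path in PostgreSQL.
# Table layout, connection string and paths are placeholders.
import h5py
import numpy as np
import psycopg2

result_path = "/nfs/results/job_12345.h5"

# The scientific payload stays in HDF5.
with h5py.File(result_path, "w") as f:
    f.create_dataset("covariance", data=np.random.rand(100, 100))
    f.attrs["job_id"] = "12345"

# Only the link and searchable metadata go into the RDBMS.
conn = psycopg2.connect("dbname=experiments user=worker")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO results (job_id, analysis_type, hdf5_path)
        VALUES (%s, %s, %s)
        """,
        ("12345", "covariance", result_path),
    )
conn.close()
```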
A couple of additional pointers:
You could hide everything (both the RDBMS and the HDF5 storage) behind a REST interface. This might solve your containment issue.
If you want to store everything in a NoSQL DB, I would recommend having a look at Elasticsearch. It works well with time-series data, it is distributed out of the box, and it also has a Hadoop plugin.