Best way of pushing data into distributed file systems - streaming?

I am writing an abstraction layer that will abstract the back-end implementation of a (yet to be decided) distributed file system.
Possible choices for the file system are HDFS, GlusterFS, Ceph, etc.
The front end will be SOAP/REST services.
The abstraction layer will receive a stream of data from the web services and send it to the back-end distributed file system.
The file sizes will be multiple gigabytes.
My question:
What is the best approach to push data into a distributed file system if we need maximum throughput, no loss of data, and want to leverage the distributed nature of the back-end file system?
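If HDFS ends up being the chosen back end, one option is to stream the incoming request body straight into the file system rather than buffering whole files in the abstraction layer. Below is a minimal sketch using the Hadoop FileSystem client; the name-node URI, target path, and buffer size are placeholders, and the same pattern (copy an InputStream into the store's output stream) applies to the other candidates through their own client APIs.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DfsUploader {

    // Streams the request body from the web service directly into HDFS,
    // so multi-gigabyte uploads never have to be materialized locally.
    public static void streamToHdfs(InputStream requestBody, String fileName) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for your cluster's name-node address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path target = new Path("/ingest/" + fileName);

        try (FSDataOutputStream out = fs.create(target)) {
            // Copy in 128 KB chunks; the HDFS client splits the file into blocks
            // and distributes them across data nodes as it writes.
            IOUtils.copyBytes(requestBody, out, 128 * 1024);
        }
    }
}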

Related

System architecture - ETL

We are in the process of designing an ETL process where we'll be getting a daily account file (maybe half a million records, and it could grow) from a client, and we'll be loading that file into our database.
Our current process splits the file into smaller files and loads them into staging tables. Sometimes, if the process fails, we try to figure out how many records we have processed and then start again from that point. Is there a better alternative to this?
We are thinking about using Kafka. I'm pretty new to Kafka. I would really appreciate some feedback on whether Kafka is the way to go, or whether we're just over-killing a simple ETL process where we just load the data into a staging table and finally into the destination table.
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
https://kafka.apache.org/intro
If you encounter errors that make you check the last committed record in your staging database, and you need the system to manage this automatically, Kafka can help ease the process.
Although Kafka is built to work with massive data loads spread across a cluster, you can certainly use it for smaller problems and take advantage of its queuing functionality and offset management, even with one broker (server) and a small number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest considering RabbitMQ.
RabbitMQ is a message-queueing software also known as a message broker or queue manager. Simply said; it is software where queues are defined, to which applications connect in order to transfer a message or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process that iterates over all of its lines and sends them to Kafka (Kafka producer).
Create another process that continuously receives events from Kafka (Kafka consumer) and writes them in mini-batches to the database (similar to your small files). A minimal sketch of both sides follows the links below.
Setup Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
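As a rough illustration of the two processes above (not a production design), here is a minimal sketch using the standard Kafka Java client. The broker address, topic name, and file path are placeholders, and the database write is reduced to a stand-in print statement.

import java.io.BufferedReader;
import java.io.FileReader;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AccountFileEtl {

    // Producer side: read the daily account file line by line and publish each record.
    public static void produce(String filePath) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>("account-records", line));
            }
        }
    }

    // Consumer side: poll records and write them to the database in mini-batches.
    public static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "account-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("account-records"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    // Stand-in for the staging-table insert; Kafka tracks the committed
                    // offset, so a restart resumes where the consumer left off.
                    System.out.println(record.value());
                }
            }
        }
    }
}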
Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes put unnecessary wear on the hard drive. Do not stage the data into one table prior to inserting it into its native table. Your process should import the data only once, into its native table, to protect the hardware.
Second, you don't need third-party software to middle-man the work. You need control, so you're not manually inspecting what was inserted. This means your process should first clean and transform the data prior to import. You want to prevent all problems before the load by cleaning, structuring, and even processing the data. The load itself should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to validate whether the record is valid to go into our system. That result is its own column, and if that column is true, a concatenation column displays an insert script.
=if(J1, concatenate("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
1. Log what is there before you update.
2. Download and compare local vs. remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system.
3. Log what you're updating with, and timestamp when you're updating it.
4. Save and close the logs.
Only when steps 1-4 are done should you post an update to production.
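As a rough sketch of steps 1 through 3 (not taken from the guide linked below), the comparison can be as simple as hashing the local and downloaded copies and writing a timestamped log line before anything is posted; the file paths and log location here are placeholders.

import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.time.Instant;

public class SyncCheck {

    // Hash a file so the local and downloaded-remote copies can be compared byte-for-byte.
    static String sha256(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
        return String.format("%064x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        Path local = Paths.get("data/local-copy.csv");                 // placeholder paths
        Path remote = Paths.get("data/downloaded-remote-copy.csv");
        Path log = Paths.get("logs/sync.log");

        String before = sha256(local);
        String incoming = sha256(remote);

        // Steps 1 and 3: log the current state and what we are about to update with, timestamped.
        String entry = Instant.now() + " before=" + before + " incoming=" + incoming + System.lineSeparator();
        Files.write(log, entry.getBytes(), StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // Step 2: only proceed to the production update if the copies actually differ.
        if (!before.equals(incoming)) {
            System.out.println("Differences found - safe to continue with the update steps.");
        }
    }
}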
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files

For Data Transfer, REST API vs SFTP, which is more secure?

Two different applications need to transfer data to one another for certain activities. The options for this data transfer are either to prepare a file of data and push it through SFTP at a certain point in time, or to push/pull the changes through a REST API in real time.
Which approach will be more secure if the data in one system is completely encrypted and in the other it is raw?
When choosing an integration pattern, as always, the answer is: "It depends".
How much data, and how frequently?
What is the security classification for the data (and the potential impact of a data breach)?
How mature are the IT Engineering/Operations/DevOps teams that will be involved in implementing the integration points, on both ends?
What facilities are available for ensuring that data is encrypted at-rest, as well as in-transit?
What is the business requirement regarding data latency?
What is the physical distance between the two systems?
What is the sustainable network connection speed between the two systems?
What are the hours during which the integration needs to be active/scheduled?
Is the data of a bulk/reference type, or is it event/transaction in nature?
What is the size per message/transaction?
What is the total size of the data (MB? GB? PB? ...?) that needs to be transferred during a given processing cycle (per hour? per day?)?
Additional considerations that should be examined:
Network timeout/retry scenarios
Batch reruns
CPU, Memory, Network bandwidth utilization
Peak hour processing vs. off peak hour processing
Infrastructure/environmental limits/constraints - and cost factors (e.g. Cloud hosting limits on transactions, data, file size, Cloud-hosted API Gateway pricing strategies, ...)
Network latency introduced by the number of messages that must be sent to complete the transfer of all data via an API vs. SFTP
Data Latency introduced by SFTP batch scheduling
It is hard to compare SSH (SFTP) with SSL/TLS (a RESTful API over HTTPS) as they serve different functions.
SFTP has a broader attack surface because it uses SSH for tunneling, which means there are more areas that could be compromised. That does not mean it is less secure.

NAS to store Firebird database

I read this blog saying that it is not recommended to use a NAS as Firebird data media. Does anyone have experience using a NAS as Firebird data storage?
I plan to buy a Synology NAS to store the Firebird database.
http://www.ibexpert.net/ibe/index.php?n=Doc.FirebirdPerformanceRecommendations
A NAS is in fact a file server that uses network protocols for data exchange.
Firebird performs intensive small-block read/write file operations at different offsets.
Some network protocols, like FTP, FTPS, and SFTP, work at the whole-file level, which makes them incompatible with Firebird.
Other network protocols, like SMB and NFS, do support block-level file operations. Their problem is high latency, which results in poor performance because of the multiple layers and long chain of components involved compared to local/direct storage, and a low guarantee that your database keeps its logical integrity and atomicity given network communication, multiple caches, and power failures.
Recent protocol versions (SMBv3, NFSv4) have added many improvements and optimizations for working with small blocks and reducing latency, including RDMA support with network cards that support it, 10 Gbps+ bandwidth over Ethernet or Fibre Channel, devices with persistent caches, and even SAN rather than NAS solutions. But these cost a lot and are currently used mainly at the enterprise level.
In conclusion, it is better to keep the database close to your Firebird server, using direct file access.

How good are ZooKeeper and Etcd?

Disclaimer: I'm quite new to both the etcd project and the ZooKeeper project.
I have recently become interested in distributed open source products.
I found that they seem to require configuration (coordination?) systems, such as ZooKeeper for Presto DB and Hive, and etcd for Kubernetes, and I think that understanding the role of etcd and ZooKeeper is the first step to understanding distributed systems.
But now I feel like I'm getting lost... I could not yet understand what the good and unique points of etcd and ZooKeeper are. To me they look like well-distributed key-value stores or file systems.
Here is the impression I have of the products. I know these impressions don't reflect the full feature set of the products, but I don't know what the remaining features are that I should know about.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
Sequential consistency and atomicity are unique features that are not supported by most file systems, but the others are common among other file systems.
Etcd: According to the README of etcd, it focuses on:
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of these seem common to Amazon S3 (though S3 doesn't offer such fast access).
I know those products are very good, because most distributed open source products depend on them, but what is the key, unique feature for which distributed open source products choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.
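To make the "coordination primitives, not file storage" point concrete, here is a minimal sketch of a distributed lock using one of the Apache Curator recipes mentioned above; the connection string and lock path are placeholders.

import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LockExample {

    public static void main(String[] args) throws Exception {
        // "localhost:2181" is a placeholder for your ZooKeeper ensemble.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The lock is backed by a tiny znode, not by file contents -
        // ZooKeeper holds coordination state, not the data being protected.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // Critical section: only one process across the cluster runs this at a time.
            } finally {
                lock.release();
            }
        }

        client.close();
    }
}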

Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls

We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think Flume would suit my current use case, because I am not using a continuous data firehose like the Twitter one, but rather discrete, regularly scheduled invocations.
The idea I have right now is to use custom Java code that takes care of the REST API calls and saves the data to HDFS, and then use an Oozie coordinator on that Java jar.
I would like to hear suggestions / alternatives (if there is something easier than what I'm thinking right now) about the design and which Hadoop-based component(s) to use for this use case. If you feel I can stick to Flume, then kindly also give me an idea of how to do this.
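For reference, a bare-bones version of the custom-Java option might look like the sketch below (the endpoint URL, name-node address, and output path are placeholders); an Oozie coordinator would then simply run this class on the daily or hourly schedule.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RestToHdfs {

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; in practice the URL and credentials come from configuration.
        URL endpoint = new URL("https://api.example.com/data");
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("GET");

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        // One file per run, named by timestamp, so scheduled re-runs don't overwrite earlier pulls.
        Path target = new Path("/data/rest-ingest/pull-" + System.currentTimeMillis() + ".json");

        try (InputStream in = conn.getInputStream();
             FSDataOutputStream out = fs.create(target)) {
            IOUtils.copyBytes(in, out, 64 * 1024);
        }
    }
}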
As stated in the Apache Flume web:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like" or "emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be, let's say, "actively pulled" from an HTTP-based service, the integration is not so obvious, but it can be done, for instance by using the ExecSource, which runs a script that gets the data and pushes it to the Flume agent.
If you use custom code in charge of pulling the data and writing it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (which you will probably have to implement yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through the use of an internal channel that buffers data both at the input (absorbing peaks in load) and the output (retaining data until it is effectively persisted), together with the transaction concept.
Performance. The use of transactions and the possibility of configuring multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). And if some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the new sink.