As a part of our architecture, we are having Kinesis streams which will send data streams to Redshift, the current thought process is to create external external schema on top of the Kinesis streams and then use materialized views to persist the data with minimal transformations as needed.
However In addition to this there is a need to perform a series of transformations (which are quite complex) on this data and load it into target tables. Was thinking about using stored procedures to perform these transformations and load into a target table. So the flow is like Kinesis Streams -> External View (Real time) -> Batch Processing (Materialized view and stored procedure).
To call the stored procedures (SP) on a timely schedule, the thought process was to schedule the SQL queries. Being fairly new to AWS Redshift and the Kinesis streaming ingestion and exploring on the available options, would like to get thoughts on the above approach.
Further understand that there are limitations with stored procedures, (https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-constraints.html) and scheduling of SQL queries might not give the feasibility of implementing a good orchestration flow. Hence would like to understand alternate methods that are available to implement the above specifically post the availability of data in external view.
Related
I'm designing a new way for my company to stream data from multiple MongoDB databases, perform some arbitrary initial transformations, and sink them into BigQuery.
There are various requirements but the key ones are speed and ability to omit or redact certain fields before they reach the data warehouse.
We're using Dataflow to basically do this:
MongoDB -> Dataflow (Apache Beam, Python) -> BigQuery
We basically need to just wait on the collection.watch() call as the input, but from the docs and existing research it may not be possible,
At the moment, the MongoDB connector is bounded and there seems to be no readily-available solution to read from a changeStream, or a collection in an unbounded way.
Is it possible to read from a changeStream and have the pipeline wait until the task is killed rather than being out of records?
In this instance I decided to go via Google Pub/Sub which serves as the unbounded data source.
I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS postgres database table. The records will need to undergo small transformations on the way in, like filter (not all records will be inserted, depending on a key), and parsed for an insert into postgres. Currently, the data from the Kinesis stream is being dumped by firehose into S3 buckets as parquet.
I'm a bit lost in the many possible ways there seems to be of doing this, like maybe:
Kinesis streams -> Firehose -> Lambda -> RDS
Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
Database migration for S3 -> RDS?
AWS Glue?
others...?
In a non serverless world, I would run a chron job every, say, one hour which would take the files in the most recent S3 bucket partition (which is year/month/day/hour), so the latest hour, and filter out the records not needed in RDS, and do a bulk insert the rest into the RDS. I don't want to have a EC2 instance that sits idle 95% of the time to do this. Any advice?
Thanks for the clarification. Doing it in traditional ETL way with servers has some drawbacks. Either you'll need to keep a machine idle most of the time or you'll need to wait every time before the machine is created on demand - exactly as you're saying.
For Firehose, IMO it's interesting when you have a lot of real-time data to ingest. Regarding to AWS Glue, for me it's more like a "managed" Apache Spark, hence if you have some data processing logic to implement in a big amount of batch data, it can be interesting. But according to your description, it's not the case right ?
To sum up, if you think the amount of inserted data will always be still a few mb at a time, for me the simplest solution is the best, i.e. Kinesis -> Lambda -> RDS with maybe another Lambda to backup data on S3 (Kinesis retention period is limited to 7 days). It's especially interesting from the pricing point of view - apparently you have not a lot data, Lambda is executed at demand, for instance by batching 1000 Kinesis records, so it's a good occasion to save some money. Otherwise, if you expect to have more and more data, the use of "Firehose -> Lambda" version seems t be a better fit for me because you don't load the database with a big amount of data at once.
My project requires to expose a REST service from HDFS, currently we are processing huge amount of data on HDFS, we are using MR jobs to store all the data from HDFS to Apache-Impala database for our reporting needs.
At present we have a REST endpoint hitting the Impala database but the problem is the Impala database is not fully updated with the latest data from HDFS.
We run MR jobs periodically to update the Impala database, but as we know the MR will consume lot-of time due to this we are not able to perform real-time queries on HDFS.
Use case/Scenario : Okay let me explain in detail; We have one application called "duct" built on top of hadoop, this application process huge amount of data and creates individual archives (serialized avro files) on HDFS for every run.We have another application (lets say the name is Avro-To-Impala) which takes these AVRO archives as input, process them using MR jobs and populates a new schema on Impala for every "duct" run.This tool reads the AVRO files and creates and populates the tables on Impala schema. Inorder to expose the data outside (REST endpoint) we are relaying on the Impala database.In this case whenever we have output from "duct" eventually to update the database we explicitly run "Avro-To-Impala" tool.This processing is taking long time because of this the REST endpoint returning obsolete or old data to the consumers of the web service.
Can anyone suggest solution for this kind of problem ?
Many Thanks
We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think using Flume would suit my current use-case because I am not using a continuous data firehose like this one in Twitter, but rather discrete regular time-bound invocations.
The idea I have right now, is to use custom Java that takes care of REST API calls and saves to HDFS, and then use Oozie coordinator on that Java jar.
I would like to hear suggestions / alternatives (if there's easier than what I'm thinking right now) about design and which Hadoop-based component(s) to use for this use-case. If you feel I can stick to Flume, then kindly give me also an idea how to do this.
As stated in the Apache Flume web:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like or emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSurce, ThriftSource, etc. In your case, where the data must be let's say "actively pulled" from a http-based service, the integration is not so obvious, but can be done. For instance, by using the ExecSource, which runs a script getting the data and pushing it to the Flume agent.
If you use a proprietary code in charge of pulling the data and writting it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (that probably you will have to implement by yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until is is effectively written. This is achieved through the usage of an internal channel buffering data both at the input (ingesting peaks of loads) and the output (retaining data until it is effecively persisted) and the transaction concept.
Performance. The usage of transactions and the possibility to configure multiple parallel sinks (data processors) will your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. HDFS API). Even, if some day you decide to change the final storage you only have to reconfigure the Flume agent for using the new related sink.
From lots of sources i am planning to use Amazon kinesis to catch the stream and after certain level of data transformation i want to direct the stream to Redshift Cluster in some table schema. Here i am not sure as is it right way to do this or not ?
From the Kineis documentation i have found that they have direct connector to redshift. However i have also found that Redshift looks better if we take bulk upload as data ware house system needs indexing. So the recommendation was to store all stream to S3 and then COPY command to make bulk push on redshift . Could someone please add some more view ?
When you use the connector library for Kinesis you will be pushing data into Redshift, both through S3 and in batch.
It is true that calling INSERT INTO Redshift is not efficient as you are sending all the data through a single leader node instead of using the parallel power for Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the juice from Kinesis and Redshift, you can calculate exactly how many shards you need, how many nodes in Redshift you need and how many temporary files in S3 you need to accumulate from Kinisis, before calling the COPY command to Redshift.