Can I use Amazon Kinesis to connect to Amazon Redshift for data loads every couple of minutes? - amazon-redshift

I am planning to use Amazon Kinesis to capture a stream from many sources and, after a certain level of data transformation, direct the stream into a Redshift cluster under some table schema. I am not sure whether this is the right way to do it.
From the Kinesis documentation I have found that there is a direct connector to Redshift. However, I have also read that Redshift performs better with bulk uploads, as is typical for a data warehouse system, so the recommendation was to store the whole stream in S3 and then use the COPY command to bulk-load it into Redshift. Could someone please add some more perspective?

When you use the Kinesis connector library you will be pushing data into Redshift through S3 and in batches anyway.
It is true that calling INSERT INTO Redshift is not efficient, as you are sending all the data through a single leader node instead of using the parallel power of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the most out of Kinesis and Redshift, you can calculate exactly how many shards you need, how many Redshift nodes you need, and how many temporary files you need to accumulate in S3 from Kinesis before calling the COPY command against Redshift.
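As a rough illustration, this is what the periodic bulk load could look like once the transformed records have been accumulated as gzipped JSON files under an S3 prefix. It's only a sketch: the table, bucket prefix, IAM role and connection details below are made-up placeholders.

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    # Hypothetical names: replace the table, S3 prefix and IAM role with your own.
    COPY_SQL = """
        COPY streaming_events
        FROM 's3://my-stream-bucket/kinesis-batches/2015/10/07/14/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto'
        GZIP;
    """

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)  # one COPY loads the whole batch in parallel across the slices
    conn.close()

Running something like this every few minutes keeps the load path going through COPY rather than through row-by-row INSERTs.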

Related

Databricks Delta Lake Structured Streaming performance with Event Hubs and ADLS Gen2

I'm currently attempting to process telemetry data which has a volume of around 4TB a day using Delta Lake on Azure Databricks.
I have a dedicated Event Hubs cluster that the events are written to, and I am attempting to ingest this event hub into Delta Lake with Databricks Structured Streaming. There's a relatively simple job that takes the Event Hubs output, extracts a few columns, and then writes with a stream writer to ADLS Gen2 storage mounted on DBFS, partitioned by date and hour.
Initially, on a clean Delta table directory, the performance keeps up with the event hub, writing around 18k records a second, but after a few hours this drops to 10k a second and then falls further until it seems to stabilize around 3k records a second.
I tried a few things on the Databricks side with different partition schemes, and the day/hour partitioning seemed to perform best for the longest, but still, after a pause and restart, the performance dropped and started to lag behind the event hub.
I'm looking for suggestions as to how I might maintain performance.
I had a similar issue once, and it was not Delta Lake but the Spark Azure Event Hubs connector, which was extremely slow and used up a lot of resources.
I solved this problem by switching to the Kafka interface of Azure EventHubs: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
It's a little tricky to set up but it has been working very well for a couple of months now.
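For reference, reading the same event hub through its Kafka surface from Databricks Structured Streaming looks roughly like the sketch below. The namespace, event hub name, connection string and output/checkpoint paths are placeholders, and the kafkashaded JAAS class name assumes you are on a Databricks runtime.

    from pyspark.sql.functions import col, to_date, hour

    # Placeholders: replace the namespace, event hub name and connection string.
    connection_string = (
        "Endpoint=sb://my-namespace.servicebus.windows.net/;"
        "SharedAccessKeyName=listen;SharedAccessKey=..."
    )

    raw = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
        .option("subscribe", "telemetry")  # the event hub name acts as the Kafka topic
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config",
                'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
                'username="$ConnectionString" password="{}";'.format(connection_string))
        .load())

    # Same downstream logic as before: extract a few columns and append to the Delta table.
    parsed = (raw.selectExpr("CAST(value AS STRING) AS body", "timestamp")
        .withColumn("date", to_date(col("timestamp")))
        .withColumn("hour", hour(col("timestamp"))))

    (parsed.writeStream
        .format("delta")
        .partitionBy("date", "hour")
        .option("checkpointLocation", "/mnt/datalake/_checkpoints/telemetry")
        .start("/mnt/datalake/telemetry"))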

Push billions of records spread across CSV files in S3 to MongoDB

I have an S3 bucket that gets almost 14-15 billion records spread across 26,000 CSV files every day.
I need to parse these files and push them to MongoDB.
Previously, with just 50 to 100 million records, I was using bulk upserts with multiple parallel processes on an EC2 instance and it was fine. But since the number of records has increased drastically, the previous method is no longer efficient.
So what will be the best method to do this?
You should look at mongoimport, which is written in Go and can make effective use of threads to parallelize the uploading. It's pretty fast. You would have to copy the files from S3 to local disk prior to uploading, but if you put the node in the same region as the S3 bucket and the database, it should run quickly. Also, you could use MongoDB Atlas and its API to turn up the IOPS on your cluster while you are loading and dial them down afterwards to speed up the uploads.
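To make that concrete, here is a minimal sketch of one such worker, assuming the files are pulled down to local disk first and that your mongoimport version supports --numInsertionWorkers; the bucket, key, URI and collection names are made up.

    import subprocess
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical locations; in practice you would iterate over the day's ~26,000 keys
    # and spread them across several workers like this one.
    bucket = "my-ingest-bucket"
    key = "2020/01/01/part-00000.csv"
    local_path = "/data/part-00000.csv"

    s3.download_file(bucket, key, local_path)  # copy from S3 to local disk first

    subprocess.run([
        "mongoimport",
        "--uri", "mongodb+srv://loader:secret@cluster0.example.mongodb.net/ingest",
        "--collection", "events",
        "--type", "csv",
        "--headerline",                 # first line of each CSV holds the field names
        "--numInsertionWorkers", "8",   # parallel insert workers inside mongoimport
        "--file", local_path,
    ], check=True)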

Transfer data from Kinesis (or S3) to RDS Postgres with a cron job

I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS Postgres database table. The records will need to undergo small transformations on the way in, such as filtering (not all records will be inserted, depending on a key), and to be parsed for the insert into Postgres. Currently, the data from the Kinesis stream is being dumped by Firehose into S3 buckets as Parquet.
I'm a bit lost in the many possible ways there seems to be of doing this, like maybe:
Kinesis streams -> Firehose -> Lambda -> RDS
Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
Database migration for S3 -> RDS?
AWS Glue?
others...?
In a non-serverless world, I would run a cron job every hour, say, that would take the files in the most recent S3 partition (which is year/month/day/hour), so the latest hour, filter out the records not needed in RDS, and bulk-insert the rest into RDS. I don't want an EC2 instance that sits idle 95% of the time to do this. Any advice?
Thanks for the clarification. Doing it the traditional ETL way with servers has some drawbacks: either you'll need to keep a machine idle most of the time, or you'll need to wait every time for a machine to be created on demand, exactly as you're saying.
Firehose, IMO, is interesting when you have a lot of real-time data to ingest. As for AWS Glue, to me it's more like a "managed" Apache Spark, so it can be interesting if you have data-processing logic to run over a large amount of batch data. But according to your description, that's not the case, right?
To sum up, if you think the amount of inserted data will always stay at a few MB at a time, the simplest solution is the best for me, i.e. Kinesis -> Lambda -> RDS, with perhaps another Lambda to back the data up to S3 (the Kinesis retention period is limited to 7 days). It's especially interesting from the pricing point of view: apparently you don't have a lot of data, and Lambda is executed on demand, for instance by batching 1000 Kinesis records, so it's a good opportunity to save some money. Otherwise, if you expect to have more and more data, the "Firehose -> Lambda" version seems to be a better fit for me, because you don't load the database with a big amount of data at once.
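To make the Kinesis -> Lambda -> RDS option concrete, here is a minimal sketch of the Lambda handler. It assumes psycopg2 is packaged with the function (e.g. as a layer), and the filter key, column names and target table are made-up examples.

    import base64
    import json
    import os

    import psycopg2  # must be bundled with the function or provided via a layer

    # Connect once so warm invocations reuse the connection.
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )

    def handler(event, context):
        rows = []
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Hypothetical filter: only keep the record types destined for RDS.
            if payload.get("record_type") != "order":
                continue
            rows.append((payload["id"], payload["amount"], payload["created_at"]))

        if rows:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO orders (id, amount, created_at) VALUES (%s, %s, %s)",
                    rows,
                )
            conn.commit()
        return {"inserted": len(rows)}

Setting the Kinesis event source mapping to a batch size of around 1000 records keeps each invocation doing one reasonably sized multi-row insert instead of one INSERT per event.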

How to expose a REST service from HDFS?

My project requires exposing a REST service over data in HDFS. Currently we are processing a huge amount of data on HDFS and using MR jobs to load all the data from HDFS into an Apache Impala database for our reporting needs.
At present we have a REST endpoint hitting the Impala database, but the problem is that the Impala database is not fully updated with the latest data from HDFS.
We run MR jobs periodically to update the Impala database, but since the MR jobs consume a lot of time, we are not able to run real-time queries against the data on HDFS.
Use case/scenario: Let me explain in detail. We have one application called "duct" built on top of Hadoop; it processes a huge amount of data and creates individual archives (serialized Avro files) on HDFS for every run. We have another application (let's call it Avro-To-Impala) which takes these Avro archives as input, processes them using MR jobs, and populates a new Impala schema for every "duct" run. This tool reads the Avro files and creates and populates the tables in the Impala schema. In order to expose the data externally (via the REST endpoint) we are relying on the Impala database. Whenever we have output from "duct", we explicitly run the Avro-To-Impala tool to update the database. This processing takes a long time, and because of it the REST endpoint returns obsolete or stale data to the consumers of the web service.
Can anyone suggest a solution for this kind of problem?
Many Thanks

Analytics implementation in Hadoop

Currently, we have MySQL-based analytics in place. We read our logs every 15 minutes, process them, and add them to a MySQL database.
As our data is growing (in one case, 9 million rows added so far and 0.5 million rows being added each month), we are planning to move the analytics to a NoSQL database.
From my study, Hadoop seems to be a better fit, as we need to process the logs and it can handle very large data sets.
However, it would be great if I could get some suggestions from experts.
I agree with the other answers and comments. But if you want to evaluate the Hadoop option, then one solution could be the following.
Apache Flume with Avro for log collection and aggregation. Flume can ingest data into the Hadoop Distributed File System (HDFS).
Then you can have HBase as a distributed, scalable data store.
With Cloudera Impala on top of HBase you can have a near-real-time query engine. Impala uses SQL as its query language, so it will be beneficial for you.
This is just one option. There are multiple alternatives, e.g. Flume + HDFS + Hive.
This is probably not a good question for this forum, but I would say that 9 million rows plus 0.5 million per month hardly seems like a good reason to go to NoSQL. This is a very small database, and your best action would be to scale up the server a little (more RAM, more disks, a move to SSDs, etc.).