Question on Microsoft Sentinel Retention period - azure-sentinel

One of our customers raised the following queries:
What is the recommended retention period for UEBA and Machine Learning data associated with Defender for Cloud when alerts are being forwarded to Sentinel? Will the additional 90 days of storage that come with Sentinel be enough?
What is the optimum retention period for UEBA and Machine Learning data associated with Defender for Cloud, i.e. at what age does the data stop being useful or get ignored?
Any guidance would be of great help!
Thank you in advance.

Related

How to estimate RAM and CPU per Kubernetes Pod for a Spring Batch processing job?

I'm trying to estimate hardware resources for a Kubernetes cluster to be able to handle the following scenarios:
On a daily basis I need to read 46.3 million XML messages of approximately 10 KB each from a queue and then insert them into a Spark instance and a Sybase DB instance. I need to come up with an estimate of how many pods I will need to process this amount of data, and how much RAM and how many vCPUs will be required per pod, in order to determine the characteristics of the nodes of the cluster. The reason behind all this is that we have some budget restrictions, and we need to have an idea of the sizing before starting the corresponding development.
The second scenario is the same as the one already described but 18.65 times bigger, i.e. 833.33 million XML messages per day. This is expected to be the case within a couple of years.
So far we plan to use Spring Batch with partitioning steps. I need orientation on how to determine the ideal Spring Batch configuration, the required RAM, and the required CPU per pod, as well as the number of pods.
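As a rough starting point, the daily volumes above can be turned into average message rates and a pod count with a back-of-envelope calculation; the per-pod rate below is a placeholder that would have to come from a load test of the actual Spring Batch step:

// Back-of-envelope sizing sketch; the per-pod throughput is a placeholder, not a measured number.
public class SizingEstimate {
    public static void main(String[] args) {
        long messagesPerDay = 46_300_000L;                 // current scenario
        long futureMessagesPerDay = 833_330_000L;          // future scenario
        long secondsPerDay = 24L * 60 * 60;                // 86,400 s

        double currentRate = (double) messagesPerDay / secondsPerDay;      // ~536 msg/s on average
        double futureRate = (double) futureMessagesPerDay / secondsPerDay; // ~9,645 msg/s on average

        double assumedMsgPerSecondPerPod = 200.0;          // hypothetical; replace with a measured figure
        System.out.printf("Current: %.0f msg/s -> ~%.0f pods%n",
                currentRate, Math.ceil(currentRate / assumedMsgPerSecondPerPod));
        System.out.printf("Future:  %.0f msg/s -> ~%.0f pods%n",
                futureRate, Math.ceil(futureRate / assumedMsgPerSecondPerPod));
    }
}

Peak rates, batching, and the Spark/Sybase write throughput will move these numbers, but the same calculation gives a first estimate to size nodes against.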
I will greatly appreciate any comments from your side.
Thanks in advance.

How to share data between microservices without sync RPC (use topics as changelogs) and deal with consistency?

I learned about using Kafka topics as a changelog to avoid doing synchronous RPC, but I don't understand how we deal with consistency, since topic data is not kept forever (retention policy).
For example, I run an application with two microservices:
The User Service is used to update users' data in the system (address, first name, phone number, ...).
The Shipping Service uses users' data to create a shipping order and send it to the shipping company's system.
Each service has its own database to persist the users' data. To communicate any changes made to a user's data, the Confluent instructor proposed creating a topic and using it as a changelog: the User Service publishes the changes, and other microservices can consume them.
But what if:
User X changed his address 1 year ago
the retention policy of the changelog is 6 months
today we add BillingService to the system.
The BillingService won't know User X's address, so its view is inconsistent. Should I run a one-time "call UserService and copy its full DB" step when a new service enters the system? That seems like an ugly solution.
Trickier and more challenging:
We have a changelog with a retention policy of time T
A consumer service has been down for longer than time T
Therefore, it will potentially miss some changelog entries. How do we deal with that? We can never be confident that the service knows everything it has to know about the users.
I did some research but found nothing. I really think I don't have enough vocabulary yet to search effectively, as the problem sounds pretty common. Sorry if there is a source dedicated to this problem that I did not find!
If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to periodically (at some interval shorter than the retention period) publish a "this is the current state of this entity" record for every entity they own to the topic. For entities which don't change very much this is fairly wasteful and duplicative (but for those, a very long or infinite retention period is more viable); for entities which change a lot, it's a rounding error in terms of volume.
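A minimal sketch of such a periodic full-state publish with the plain Kafka producer API; the user-changelog topic name, the key, and the JSON payload are illustrative, not taken from the question:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserStateRepublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key the record by user id so every consumer always sees the
            // latest full state per user, regardless of when it joined.
            String userId = "user-123";
            String fullState = "{\"userId\":\"user-123\",\"address\":\"...\",\"phone\":\"...\"}";
            producer.send(new ProducerRecord<>("user-changelog", userId, fullState));
        } // close() flushes any buffered records
    }
}

In a real setup this loop would run over every entity the service owns, on a schedule shorter than the topic's retention period.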
That neatly solves the first case and eventually allows for the second to be solved. For the second, there is basically no solution, which means that you have to choose the retention period for a topic such that you can guarantee that no consumer of this topic will ever be down (or not deployed) for longer than the retention period: this typically means that a retention period shorter than, say, 7 days, should be really heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy you time for the consumer to get fixed, and if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
This is quite a common issue in replication: a node goes offline for a significant amount of time. For example, a node's hardware completely fails or is lost, and it takes weeks to order and receive a new one.
In that case, in distributed systems, we don't do failure recovery; instead, we provision a new node as a replacement. That new node is completely empty, hence it needs some initial state.
If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).
There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
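A minimal sketch of that catch-up step with the plain Kafka consumer API, assuming the service stored the changelog offset alongside its backup; the topic name, partition, offset, and applyChange helper are illustrative:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CatchUpAfterRestore {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");

        // Offset recorded alongside the backup/snapshot (hypothetical value).
        long offsetAtBackup = 1_234_567L;
        TopicPartition partition = new TopicPartition("user-changelog", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(partition));
            consumer.seek(partition, offsetAtBackup);   // replay only events newer than the backup
            while (true) {                              // normal consume loop from here on
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    applyChange(record.key(), record.value());
                }
            }
        }
    }

    private static void applyChange(String userId, String state) {
        // Upsert into the service's own database.
    }
}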
Backing up data is important. Every microservice should do its own job of saving and restoring its data. As a result, the original Kafka system won't have to keep data forever.
As a quick summary: in distributed replication these are two different problems, catching up a lagging node and provisioning a new node. And if a node has been lagging for a long time, it becomes a provisioning problem.

Split Kubernetes cluster costs between namespaces

We are running a multi-tenant Kubernetes cluster on EKS (in AWS), and I need to come up with an appropriate way of charging all the teams that use the cluster. We have the costs of the EC2 worker nodes, but I don't know how to split these costs up given metrics from Prometheus. To make it trickier, I also need to give the cost per team (or pod/namespace) for the past week and the past month.
Each team uses a different namespace but this will change soon so that each pod will have a label with the team name.
From looking around I can see that I'll need to use the container_spec_cpu_shares and container_memory_working_set_bytes metrics, but how can these two metrics be combined and used so that we get a percentage of the worker node cost?
Also, I don't know PromQL well enough to know how to get the stats for the past week and the past month from the range vector metrics.
If anyone can share a solution (if they've done this already) or maybe even point me in the right direction, I would appreciate it.
Thanks

RabbitMQ Best Practices for High Availability on Cloud

I'm planning to deploy RabbitMQ on a Kubernetes Engine cluster. I see there are two kinds of location types, i.e. 1. Region 2. Zone
Could someone help me understand what benefits I can expect from each location type? I believe a multi-zone setup could help enhance network throughput, while a multi-region setup can ensure uninterrupted service even in case of regional failure events. Is this understanding correct? I'm looking for relevant justifications to choose a location type. Please help.
I'm planning to deploy RabbitMQ on Kubernetes Engine Cluster. I see there are two kinds of location types:
Region
Zone
Could someone help me understand what benefits I can expect from each location type?
A zone (Availability Zone) is typically a datacenter.
A region is multiple zones located in the same geographical area. When deploying a "cluster" to a region, you typically have a VPC (Virtual Private Cloud) network spanning three datacenters, and you spread your components across those zones/datacenters. The idea is that you should be fault tolerant to the failure of a whole datacenter, while still having relatively low latency within your system.
While a multi-region setup can ensure uninterrupted service even in case of regional failure events. Is this understanding correct? I'm looking for relevant justifications to choose a location type.
When using multiple regions, e.g. in different parts of the world, this is typically done to be near the customers, e.g. to provide lower latency. CDN services are distributed to multiple geographical locations for the same reason. When deploying a service to multiple regions, communication between regions is typically done with asynchronous protocols, e.g. message queues, since latency may be too high for synchronous communication.

Temporarily shut down Redshift to reduce bill

Amazon says the following on Redshift billing
"Node usage hours are billed for each hour your data warehouse cluster is running in an Available state. If you no longer wish to be charged for your data warehouse cluster, you must terminate it to avoid being billed for additional node hours."
This means that if I just create a cluster, then whether I use it or not, I'll be billed 24/7 because the cluster doesn't have any state like "Suspend". Is there a way to shut down the whole Redshift cluster when not in use, so that I'll be billed only for the hours when I want to use it?
Edit: From Tomasz's reply, it sounds like if I want to shut down the cluster over the weekend, it'll be like backing up the whole database on Friday evening and restoring it on Sunday evening. This doesn't sound good. What does Amazon really mean when they say "PAY ONLY FOR THE HOURS YOU USE"?
Can you tell me how much time it will take to back up/restore a data warehouse of around 100 GB? Can I automatically associate security groups with the cluster after restoring, from Java code?
You can create a manual snapshot of a cluster when you have finished work and then remove the cluster.
You will pay for S3 storage, but that is much less than for a running Redshift cluster.
The next day, just restore the cluster from the latest snapshot. You will have to add security groups to the new cluster, probably with the Java API:
The new cluster will be associated only with the default security and
parameter groups. If the original cluster was associated with any
other security or parameter group, you will need to manually associate
those groups with the new cluster.
The easiest way to create a snapshot is from the console, but you will probably want to do it automatically using the CLI or the Java SDK.
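For illustration, a minimal sketch of that automation with the AWS SDK for Java v1; the cluster, snapshot, and security-group identifiers are placeholders:

import com.amazonaws.services.redshift.AmazonRedshift;
import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
import com.amazonaws.services.redshift.model.CreateClusterSnapshotRequest;
import com.amazonaws.services.redshift.model.DeleteClusterRequest;
import com.amazonaws.services.redshift.model.RestoreFromClusterSnapshotRequest;

public class RedshiftWeekendShutdown {
    public static void main(String[] args) {
        AmazonRedshift redshift = AmazonRedshiftClientBuilder.defaultClient();

        // Friday evening: take a manual snapshot, then delete the cluster.
        // In a real job, poll describeClusterSnapshots until the snapshot is
        // "available" before issuing the delete.
        redshift.createClusterSnapshot(new CreateClusterSnapshotRequest()
                .withClusterIdentifier("my-cluster")
                .withSnapshotIdentifier("my-cluster-friday"));
        redshift.deleteCluster(new DeleteClusterRequest()
                .withClusterIdentifier("my-cluster")
                .withSkipFinalClusterSnapshot(true));   // we already took our own snapshot

        // Sunday evening: restore from the snapshot and attach the VPC security
        // groups directly in the restore request, so no manual step is needed.
        redshift.restoreFromClusterSnapshot(new RestoreFromClusterSnapshotRequest()
                .withClusterIdentifier("my-cluster")
                .withSnapshotIdentifier("my-cluster-friday")
                .withVpcSecurityGroupIds("sg-0123456789abcdef0"));
    }
}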
Creating a snapshot of a 3-node cluster filled up to 80% took me about 5 minutes (it's so quick because snapshots are incremental). 100 GB is much less than my setup, so it should be even faster. The restore shouldn't take a long time either.
UPDATE: A lot has changed in the intervening years, in particular restore from snapshot is now quite fast. Your cluster becomes available in a few minutes and you can run queries while the restore continues in the background. Total time for complete restore of 100GB would now be measured in minutes (varies based on node type & count).
What does Amazon really mean when they say "PAY ONLY FOR THE HOURS YOU USE"?
You pay for a full hour for any partial hour used.
Can you tell me how much time will it take to backup/restore a data warehouse of size around 100GB?
Snapshots are incremental, and this is what makes them fast (as Tomasz mentioned). It's fairly quick to shut down a cluster, about half an hour. However, restoring from a snapshot is very slow; I'd estimate around 3 hours for restoring 100 GB.
If you really want to be able to take a database cluster up and down quickly, you might be better off using another analytic DB (e.g. the Greenplum or Vertica free editions) with the data stored on EBS volumes. It'd be a lot more work to manage, though; that's the tradeoff.
You can now pause and resume a Redshift cluster (from both the Console and the CLI).
Check out this link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
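For illustration, a minimal sketch of pause/resume with the AWS SDK for Java v1 (a recent SDK version that includes the pause and resume operations is assumed; the cluster identifier is a placeholder):

import com.amazonaws.services.redshift.AmazonRedshift;
import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
import com.amazonaws.services.redshift.model.PauseClusterRequest;
import com.amazonaws.services.redshift.model.ResumeClusterRequest;

public class RedshiftPauseResume {
    public static void main(String[] args) {
        AmazonRedshift redshift = AmazonRedshiftClientBuilder.defaultClient();

        // Pause at the end of the day: compute billing stops, storage billing continues.
        redshift.pauseCluster(new PauseClusterRequest().withClusterIdentifier("my-cluster"));

        // Resume when the cluster is needed again.
        redshift.resumeCluster(new ResumeClusterRequest().withClusterIdentifier("my-cluster"));
    }
}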