How does Cloud Data Fusion decide which project network to use for the dataproc resources? - google-cloud-data-fusion

I have a project with 4 VPC networks. I created a GCDF instance and had expected that the "default" network would be picked, but I see that another one was picked (the first one alphabetically). Is this the algorithm, the alphabetical order of names?
Is there a way to specify the network to be used? That would be very useful, since I would like to isolate the network where those VMs run.

Your observation is correct. The current implementation selects the network alphabetically. To use a specific network, there are a couple of options:
Create a Dataproc compute profile that uses the default network, or any other VPC network you have already created.
Set system.profile.properties.network=default as a system preference.
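If you prefer to set the preference programmatically, here is a minimal Python sketch, assuming the CDAP preferences REST endpoint (/v3/preferences) is reachable on your instance's API endpoint and that you have a valid OAuth access token (e.g. from gcloud auth print-access-token); the endpoint URL and token below are placeholders:

import requests

# Placeholders: substitute your Data Fusion instance's API endpoint and an access token
CDF_API_ENDPOINT = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"
ACCESS_TOKEN = "ya29...."

# Set the network to use for Dataproc VMs as a system preference
resp = requests.put(
    f"{CDF_API_ENDPOINT}/v3/preferences",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"system.profile.properties.network": "default"},
)
resp.raise_for_status()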

Related

Working with multiple data warehouses in dbt

I'm building an application where each of our clients needs their own data warehouse (for security, compliance, and maintainability reasons). For each client we pull in data from multiple third party integrations and then merge them into a unified view, which we use to perform analytics and report metrics for the data across those integrations. These transformations and all relevant schemas are the same for all clients. We would need this to scale to 1000s of clients.
From what I gather, dbt is designed so that each project corresponds to one warehouse. I see two options:
Use one project and create a separate environment target for each client (and maybe a single dev environment). Given that environments aren't designed for this, are there any catches to this? Will scheduling, orchestrating, or querying the outputs be painful or unscalable for some reason?
profiles.yml:
example_project:
  target: dev
  outputs:
    dev:
      type: redshift
      ...
    client_1:
      type: redshift
      ...
    client_2:
      type: redshift
      ...
    ...
Create multiple projects, plus a shared dbt package containing most of the logic. This seems very unwieldy, since it would mean maintaining a separate repo for each client, and is less developer-friendly.
profiles.yml:
client_1_project:
  target: dev
  outputs:
    client_1:
      type: redshift
      ...
client_2_project:
  target: dev
  outputs:
    client_2:
      type: redshift
      ...
Thoughts?
I think you captured both options.
If you have a single database connection, and your client data is logically separated in that connection, I would definitely pick #2 (one package, many client projects) over #1. Some reasons:
Selecting data from a different source (within a single connection) depending on the target is a bit hacky, and wouldn't scale well for 1000s of clients.
The developer experience for packages isn't so bad. You will want a developer data source, but depending on your business you could maybe get away with using one client's data (or an anonymized version of that). It will be good to keep this developer environment logically separate from any individual client's implementation, and packages allow you to do that.
I would consider generating the client projects programmatically, probably using a Python CLI to set up, dbt run, and tear down the required files for each client project (I'm assuming you're not going to use dbt Cloud and have another orchestrator or compute environment that you control). It's easy to write YAML from Python with pyyaml (each file is just a dict), and your individual projects probably only need separate profiles.yml, sources.yml, and (maybe) dbt_project.yml files. I wouldn't check these generated files for each client into source control -- just check in the script and generate the files you need with each invocation of dbt.
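A minimal sketch of that generation step, assuming pyyaml and the dbt CLI are installed; the profile name, client list, and Redshift connection details are placeholders:

import subprocess
import yaml  # pyyaml

# Placeholder client list; in practice this would come from your own configuration
clients = ["client_1", "client_2"]

for client in clients:
    # One generated profile per client; connection details would come from your secrets store
    profile = {
        "client_project": {  # hypothetical profile name matching dbt_project.yml
            "target": client,
            "outputs": {
                client: {
                    "type": "redshift",
                    "schema": client,  # assumption: one schema per client
                },
            },
        },
    }
    with open("profiles.yml", "w") as f:
        yaml.safe_dump(profile, f)
    # Run dbt against the generated profile, then move on to the next client
    subprocess.run(["dbt", "run", "--profiles-dir", "."], check=True)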
On the other hand, if your clients each have their own physical database with separate connections and credentials, and those databases are absolutely identical, you could get away with #1 (one project, many profiles). The "hardest" parts of that approach would likely be managing secrets and generating/maintaining a list of targets that you could iterate over (ideally in a parallel fashion).

AnyLogic: How do I know I have multiple networks, and how do I solve that problem?

The agent is not following the outlined path to its destination, because the destination is in a different network.
How do I know whether I have multiple networks, and how do I avoid that, so agents move along the path?
You can get the networks by calling getNetworks().
You can loop through them using:
for (INetwork currentNetwork : getNetworks()) {
    // do something with currentNetwork
}
Check my current video series on making networks a lot more powerful, including the problem of having several networks: https://www.benjamin-schumann.com/blog/2022/8/6/taking-control-of-your-network-agent-based-pathfinding
To see if you have multiple networks, you can check the Projects panel.
Here you will see all the networks you have, and what nodes are contained inside each network.
AnyLogic doesn't automatically know how to connect two networks if you want to do that, so you need to pay attention to detail when you build your networks.
If your agent moves between nodes that are not in the same network, you will get unexpected results.
When the paths and the destination node do not connect with each other, the agent does not follow the paths and instead takes the shortest route to reach its destination.
A network is the collection of paths and nodes that we create on the canvas.
If we create multiple collections of paths and nodes and do not connect one collection with the other, then they act as separate networks.

Increase ECS Fargate storage using EFS

One of my applications running on ECS (with Fargate) needs more storage; the 20 GB of ephemeral storage is not sufficient for my application, so I am planning to use EFS:
volume {
  name = "efs-test-space"

  efs_volume_configuration {
    file_system_id     = aws_efs_file_system.efs_apache.id
    root_directory     = "/"
    transit_encryption = "ENABLED"
    container_path     = "/home/user/efs/"

    authorization_config {
      access_point_id = aws_efs_access_point.efs-access-point.id
      iam             = "ENABLED"
    }
  }
}
I can see it is mounted and my application is able to access the mounted folder, but for HA and parallelism my ECS task count is 6. Since I am using one EFS file system, the same volume is shared by all tasks. The problem I am stuck on is providing a unique mounted EFS file path for each task.
I added something like /home/user/efs/{random_id}, but I want to make this part of the task lifecycle, i.e. this folder should get deleted when my task is stopped or destroyed.
So is there a way to mount EFS as a bind mount, or to delete the folder during the task destroy stage?
You can now increase your ephemeral storage size up to 200 GiB; all you need to do is set the ephemeralStorage parameter in the Fargate task definition:
"ephemeralStorage": {
"sizeInGiB": 100
}
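For example, if you register task definitions with boto3 rather than Terraform, a hedged sketch might look like this (the family, image, region, and sizes are placeholders):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # region is an assumption

ecs.register_task_definition(
    family="my-app",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="4096",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",  # placeholder image
            "essential": True,
        }
    ],
    # Raise the ephemeral storage from the default 20 GiB (maximum is 200 GiB)
    ephemeralStorage={"sizeInGiB": 100},
)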
While this could be achieved in theory, there are a lot of moving parts to it, because the life cycle of EFS (and Access Points) is decoupled from the life cycle of tasks. This means you need to create Access Points out of band and on the fly, AND these Access Points (and their data) are not automatically deleted when you tear down your tasks. The Fargate/EFS integration did not have this as a primary use case. The primary use cases were more around sharing the same data among different tasks (which is what you are observing, but it doesn't serve your use case!), in addition to providing persistence for a single task.
What you need to solve your problem easily is a new feature the Fargate team is working on right now and that will allow you to expand your local ephemeral storage as a property of the task. I can't say more about the timing but the feature is actively being developed so you may want to consider intercepting it rather than building a complex workflow to achieve the same result.

How to properly monitor ELB latency on AWS using Grafana?

I am trying to monitor Latency on an Elastic Beanstalk environment using Grafana.
I get some things to work, and some things do not provide any information.
I am using "CloudWatch" data source.
There is ELB and ApplicationELB.
The ApplicationELB namespace does not offer a Latency metric. In fact, every metric I select here results in "no data".
When I configure monitoring on AWS, I get the following graph:
I am able to query for Latency on a region using Grafana, and I do get some correlation:
As you can see, around 13:50 some requests timed out. But it is also obvious Grafana is showing additional information from other environments, which I would like to ignore.
My query currently looks like this:
Which I know is too broad, but I do not know how to refine.
I tried using "InstanceName" as a dimension, but it is not clear to me which ELB I should look for. It seems to me like ApplicationELB should be what I am looking for, but that one does not offer Latency and does not provide any data either way.
Using AvailabilityZone does not help, and that's the only other option for dimension (other than InstanceName).
I need a way to refine the query so I see the same result in AWS and Grafana.
A clarification about ApplicationELB and ELB would be great also!
Application ELB vs ELB: they are just different types of load balancers provided by AWS (https://aws.amazon.com/elasticloadbalancing/) - I'm not sure which one is used by Elastic Beanstalk.
You need to add a dimension to filter your metrics. Some metrics may need multiple dimensions for correct filtering. The available dimensions are listed in the docs. For example, LoadBalancerName is a correct dimension for the AWS/ELB namespace: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-cloudwatch-metrics.html
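To sanity-check the namespace and dimension outside Grafana, a small Python/boto3 sketch like the one below can pull the same Latency metric; the load balancer name and region are placeholders:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

# Placeholder name; use the LoadBalancerName of the classic ELB created by your environment
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "awseb-e-x-AWSEBLoa-XXXX"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)

# Print the datapoints in chronological order
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])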
I recommend using the existing published AWS dashboards (https://github.com/monitoringartist/grafana-aws-cloudwatch-dashboards - I'm the author) and then just customizing them for your needs.

Billing by tag in Google Compute Engine

Google Compute Engine allows for a daily export of a project's itemized bill to a storage bucket (.csv or .json). In the daily file I can see X-number of seconds of N1-Highmem-8 VM usage. Is there a mechanism for further identifying costs, such as per tag or instance group, when a project has many of the same resource type deployed for different functional operations?
As an example, Qty:10 N1-Highmem-8 VM's are deployed to a region in a project. In the daily bill they just display as X-seconds of N1-Highmem-8.
Functionally:
2 VM's might run a database 24x7
3 VM's might run batch analytics operation averaging 2-5 hrs each night
5 VM's might perform a batch operation which runs in sporadic 10 minute intervals through the day
A final operation writes data to a specific GCS bucket, while the other operations read/write to different buckets.
How might costs be broken out across these four operations each day?
The usage logs do not provide per-tag granularity at this time, and they can be a little tricky to work with, but here is what I recommend.
To further break down the usage logs and get better information out of them, try working like this:
Your usage logs provide the following fields:
Report Date
MeasurementId
Quantity
Unit
Resource URI
ResourceId
Location
If you look at the MeasurementId, you can choose to filter by the type of image you want to verify. For example, VmimageN1Standard_1 is used to represent an n1-standard-1 machine type.
You can then use the MeasurementId in combination with the Resource URI to find out what your usage is on a more granular (per-instance) scale. For example, the Resource URI for my test machine would be:
https://www.googleapis.com/compute/v1/projects/MY_PROJECT/zones/ZONE/instances/boyan-test-instance
Note: I've replaced "MY_PROJECT" and "ZONE" here; those would be specific to your output, along with the name of the instance.
If you look at the end of the URI, you can clearly see which instance that is for. You could then use this to look for a specific instance you're checking.
If you are better skilled with Excel or other spreadsheet/analysis software, you may be able to do even better as this is just an idea on how you could use the logs. At that point it becomes somewhat a question of creativity. I am sure you could find good ways to work with the data you gain from an export.
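As a concrete illustration of that idea, here is a hedged pandas sketch that filters the export for one machine type and sums usage per instance; the file name is a placeholder, and the column headers are assumed to match the field list above:

import pandas as pd

# Placeholder file name for a daily usage export
usage = pd.read_csv("usage_gce_20170101.csv")

# Assumption: the n1-highmem-8 MeasurementId follows the same naming pattern as
# VmimageN1Standard_1 shown above
highmem = usage[usage["MeasurementId"].str.contains("Highmem_8", case=False, na=False)]

# The instance name is the last path segment of the Resource URI
highmem = highmem.assign(instance=highmem["Resource URI"].str.rsplit("/", n=1).str[-1])

# Seconds of usage per instance for the day
print(highmem.groupby("instance")["Quantity"].sum())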
9/2017 update.
It is now possible to add user-defined labels and then track usage and billing by these labels for Compute Engine and GCS.
Additionally, by enabling the billing export to BigQuery, it is possible to create custom views or query BigQuery from a tool more friendly to finance people, such as Google Docs, Data Studio, or anything that can connect to BigQuery. Here is a great example of labels across multiple projects used to split costs into something friendlier to organizations, in this case a Data Studio report.
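As a rough sketch of what such a view could look like, the query below groups exported cost by a label key, run here through the google-cloud-bigquery client; the project, dataset, table name, and label key are placeholders, and the schema is assumed to be the standard billing export with its repeated labels field:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: substitute your billing export table and the label key you use
query = """
SELECT l.value AS label_value, SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`, UNNEST(labels) AS l
WHERE l.key = 'workload'
GROUP BY label_value
ORDER BY total_cost DESC
"""

# Print total cost per label value
for row in client.query(query).result():
    print(row.label_value, row.total_cost)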