I have the below JSON in one of my storage accounts and I am able to read it with the code below. I need help reading the records where "pod" has the value "kube-apiserver-78" or "kube-apiserver-79" and username has "system:serviceaccount:xyz" or "system:serviceaccount:poq". Can someone help me translate that filter into the code below?
df = spark.read.json('abfss://insights-logs-kube-audit@azogs.dfs.core.windows.net/resourceId=/SUBSCRIPTIONS/5IS/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=08/d=09/h=11/m=00/')
df.show()
Sample JSON file in the storage container which I read:
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1ca3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1cax3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
To query the JSON file after reading it, convert the DataFrame into a temporary view in Apache Spark and query it using Spark SQL.
To create the temporary view, use (any valid view name works):
df.createOrReplaceTempView("kube_audit")
Then query this temporary view using Spark SQL. Note that in this data "pod" sits inside the properties struct and "username" is inside the JSON string stored in properties.log, so reference the struct field directly and extract the username from the log string, for example:
SELECT * FROM kube_audit
WHERE properties.pod IN ('kube-apiserver-78', 'kube-apiserver-79')
  AND get_json_object(properties.log, '$.user.username') IN ('system:serviceaccount:xyz', 'system:serviceaccount:poq')
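Equivalently, here is a small PySpark sketch of the same filter using the DataFrame API; the schema below only extracts user.username from the embedded log string, and the column and value names are taken from the sample data above:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Only the part of the embedded audit event we need: user.username.
log_schema = StructType([
    StructField("user", StructType([StructField("username", StringType())]))
])

filtered = (
    df.withColumn("username",
                  F.from_json(F.col("properties.log"), log_schema)["user"]["username"])
      .filter(F.col("properties.pod").isin("kube-apiserver-78", "kube-apiserver-79")
              & F.col("username").isin("system:serviceaccount:xyz", "system:serviceaccount:poq"))
)
filtered.show(truncate=False)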
Reference: Query JSON Files with Azure Synapse Analytics Notebooks
I am trying to use ApiextensionsV1beta1Api to create a custom resource definition through the Kubernetes Python client.
with kubernetes.client.ApiClient(configuration) as api_client:
    self.client_custom_resource_def = kubernetes.client.ApiextensionsV1beta1Api(api_client)

spec = {"group": "kgosalia.com", "scope": "Namespaced",
        "metadata": {"name": "kgosaliaconfigs.kgosalia.com"},
        "versions": [{"name": "v1", "served": True, "storage": True}],
        "names": {"kind": "CustomResourceDefinition", "plural": "kgosaliaconfigs"}}

def create_custom_resource_definition(self, spec):
    body = kubernetes.client.V1beta1CustomResourceDefinition(spec=spec)
    try:
        api_response = self.client_custom_resource_def.create_custom_resource_definition(body)
        pprint(api_response)
    except ApiException as e:
        print("Exception when calling ApiextensionsV1Api->create_custom_resource_definition: %s\n" % e)
When I run this I am getting a 422. Can you help me find the correct format for the spec and name objects?
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"CustomResourceDefinition.apiextensions.k8s.io \"\" is invalid: metadata.name: Required value: name or generateName is required","reason":"Invalid","details":{"group":"apiextensions.k8s.io","kind":"CustomResourceDefinition","causes":[{"reason":"FieldValueRequired","message":"Required value: name or generateName is required","field":"metadata.name"}]},"code":422}
Appreciate your help, thank you!
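Based on the error message (metadata.name is required on the CustomResourceDefinition body itself, not inside spec), here is a hedged sketch of how the body could be assembled with the typed models that ship alongside ApiextensionsV1beta1Api; the kind name KgosaliaConfig is only an illustrative guess, and the group/plural are taken from the question:

import kubernetes.client

# metadata (with the CRD name) goes on the body, not inside spec.
metadata = kubernetes.client.V1ObjectMeta(name="kgosaliaconfigs.kgosalia.com")

spec = kubernetes.client.V1beta1CustomResourceDefinitionSpec(
    group="kgosalia.com",
    scope="Namespaced",
    versions=[kubernetes.client.V1beta1CustomResourceDefinitionVersion(
        name="v1", served=True, storage=True)],
    # "kind" is the kind of the custom resource itself; KgosaliaConfig is a guess.
    names=kubernetes.client.V1beta1CustomResourceDefinitionNames(
        kind="KgosaliaConfig", plural="kgosaliaconfigs"),
)

body = kubernetes.client.V1beta1CustomResourceDefinition(
    api_version="apiextensions.k8s.io/v1beta1",
    kind="CustomResourceDefinition",
    metadata=metadata,
    spec=spec,
)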
Right now, it seems impossible to assign user-assigned identities in ARM templates (and Terraform) at cluster creation. I have already tried a lot of things; updates work great after inserting the identity manually with:
az aks pod-identity add --cluster-name my-aks-cn --resource-group myrg --namespace myns --name example-pod-identity --identity-resource-id /subscriptions/......
But I want to have this done in one go, with the deployment, so I need to insert the pod user identities into the cluster automatically. I also tried running the command using DeploymentScripts, but deployment scripts are not ready to use the preview AKS extension.
My config looks like this:
{
"type": "Microsoft.ContainerService/managedClusters",
"apiVersion": "2021-02-01",
"name": "[variables('cluster_name')]",
"location": "[variables('location')]",
"dependsOn": [
"[resourceId('Microsoft.Network/virtualNetworks', variables('vnet_name'))]"
],
"properties": {
....
"podIdentityProfile": {
"allowNetworkPluginKubenet": null,
"enabled": true,
"userAssignedIdentities": [
{
"identity": {
"clientId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity'), '2018-11-30').clientId]",
"objectId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity'), '2018-11-30').principalId]",
"resourceId": "[resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity')]"
},
"name": "managed-indentity",
"namespace": "myns"
}
],
"userAssignedIdentityExceptions": null
},
....
},
"identity": {
"type": "SystemAssigned"
}
},
I'm always getting the same issue:
"statusMessage": "{\"error\":{\"code\":\"InvalidTemplateDeployment\",\"message\":\"The template deployment 'deployment_test' is not valid according to the validation procedure. The tracking id is '.....'. See inner errors for details.\",\"details\":[{\"code\":\"PodIdentityAddonUserAssignedIdentitiesNotAllowedInCreation\",\"message\":\"Provisioning of resource(s) for container service cluster-12344 in resource group myrc failed. Message: {\\n \\\"code\\\": \\\"PodIdentityAddonUserAssignedIdentitiesNotAllowedInCreation\\\",\\n \\\"message\\\": \\\"PodIdentity addon does not support assigning pod identities on creation.\\\"\\n }. Details: \"}]}}",
The Product team has shared the answer here: https://github.com/Azure/aad-pod-identity/issues/1123
which says:
This is a known limitation in the existing configuration. We will fix
this in the V2 implementation.
For others who are facing the same issue, please refer to the GitHub issue above.
Objective
My objective is to export data from a Postgres RDS instance to an S3 bucket. I just want to prove that the concept works on my VPC, so I am using dummy data.
What I have tried so far
I followed the docs here using the console and the CLI.
Created an s3 bucket (I chose to block all public access)
Created an RDS Instance with the following settings:
Created on 2 public subnets
Public accessibility: No
Security group rules for outbound: CIDR/IP - Outbound 0.0.0.0/0
Security group rules for inbound: CIDR/IP - Inbound 0.0.0.0/0
Created a policy as shown in the example:
aws iam create-policy --policy-name rds-s3-export-policy --policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "s3export",
"Action": [
"S3:PutObject"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::your-s3-bucket/*"
]
}
]
}'
Created an IAM Role like:
aws iam create-role --role-name rds-s3-export-role --assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "rds.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}'
Attached the policy to the role like:
aws iam attach-role-policy --policy-arn your-policy-arn --role-name rds-s3-export-role
Added the IAM Role to the DB like:
aws rds add-role-to-db-instance \
--db-instance-identifier my-db-instance \
--feature-name s3Export \
--role-arn your-role-arn \
--region your-region
Did all the requirements within PSQL like:
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
CREATE TABLE sample_table (bid bigint PRIMARY KEY, name varchar(80));
INSERT INTO sample_table (bid,name) VALUES (1, 'Monday'), (2,'Tuesday'), (3, 'Wednesday');
SELECT aws_commons.create_s3_uri(
'dummy-data-bucket-path',
'',
'us-west-2'
) AS s3_uri_1 \gset
What does not work
When I try to make the actual export by:
SELECT * FROM aws_s3.query_export_to_s3('SELECT * FROM sample_table', :'s3_uri_1');
I get the error:
ERROR: could not upload to Amazon S3
DETAIL: Amazon S3 client returned 'Unable to connect to endpoint'.
CONTEXT: SQL function "query_export_to_s3" statement 1
Other things I have tried:
I have tried using Access Analyzer for S3, but my bucket does not seem to appear in the list. I believe this is because the bucket itself does not have a policy attached to it.
How can I debug this issue? What am I doing wrong? I am happy to share further details if needed.
From what I see, the documentation you are following does not assume that you are running this whole setup inside a VPC.
So, to connect from within the VPC (as you have blocked all public access), you need a VPC endpoint for Amazon S3 with an endpoint policy attached.
For example, a sample policy from the documentation:
The following is an example of an S3 bucket policy that allows access to a specific bucket, my_secure_bucket, from endpoint vpce-1a2b3c4d only.
{
"Version": "2012-10-17",
"Id": "Policy1415115909152",
"Statement": [
{
"Sid": "Access-to-specific-VPCE-only",
"Principal": "*",
"Action": "s3:*",
"Effect": "Deny",
"Resource": ["arn:aws:s3:::my_secure_bucket",
"arn:aws:s3:::my_secure_bucket/*"],
"Condition": {
"StringNotEquals": {
"aws:sourceVpce": "vpce-1a2b3c4d"
}
}
}
]
}
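To give the VPC a route to S3 in the first place, a gateway VPC endpoint is typically what is missing. Here is a hedged boto3 sketch; the region matches the question, but the VPC ID and route table ID are placeholders for the ones used by your RDS subnets:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Gateway endpoint so traffic from the VPC reaches S3 without public access;
# vpc-0123456789abcdef0 and rtb-0123456789abcdef0 are placeholders.
response = ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-west-2.s3',
    RouteTableIds=['rtb-0123456789abcdef0'],
)
print(response['VpcEndpoint']['VpcEndpointId'])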
I have Kafka Connect sink code for which the JSON below is passed via a curl command to register the connector.
Please let me know if anyone has an idea how to get the task IDs of my connector. For example, in the example below we have defined tasks.max as 3, so I need to know
the names of the 3 tasks for my logs, i.e. I need to know which line of my log belongs to which task.
In the example below, I know I have 3 tasks - TestCheck-1, TestCheck-2 and TestCheck-3 - based on the Kafka Connect logs. I want to know how to get the task names so that I can print them in my Kafka Connect log lines.
{
"name": "TestCheck",
"config": {
"topics": "topic1",
"connector.class": "ApplicationSinkTask Class package",
"tasks.max": "3",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"connector.url": "jdbc connection url",
"driver.name": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"username": "myusername",
"password": "mypassword",
"table.name": "test_table",
"database.name": "test",
}
}
When I register it, I get the details below.
curl -X POST -H "Content-Type: application/json" --data @myjson.json http://service:8082/connectors
{"name":"TestCheck","config":{"topics":"topic1","connector.class":"ApplicationSinkTask Class package","tasks.max":"3","key.converter":"org.apache.kafka.connect.storage.StringConverter","value.converter":"org.apache.kafka.connect.storage.StringConverter","connector.url":"jdbc:sqlserver://datahubprod.database.windows.net:1433;","driver.name":"jdbc connection url","username":"myuser","password":"mypassword","table.name":"test_table","database.name":"test","name":"TestCheck"},"tasks":[{"connector":"TestCheck","task":0},{"connector":"TestCheck","task":1},{"connector":"TestCheck","task":2}],"type":null}
You can manage the connectors with the Kafka Connect REST API. There's a whole heap of commands, which you can find here.
The example given in the above link shows you can retrieve all tasks for a given connector using the command:
$ curl localhost:8083/connectors/local-file-sink/tasks
[
{
"id": {
"connector": "local-file-sink",
"task": 0
},
"config": {
"task.class": "org.apache.kafka.connect.file.FileStreamSinkTask",
"topics": "connect-test",
"file": "test.sink.txt"
}
}
]
You can use a language of your choice to send the request and load the JSON response into a variable/dictionary for further use, such as printing to a log. Here's a very simple example using Python which assigns the whole output to a variable.
import requests
import json
connectors = 'http://localhost:8083/connectors'
p = requests.get(connectors)
data = p.json()
If you parse the data variable into a dictionary, you can then access each element, i.e. the task id.
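For instance, pointing the same approach at the tasks endpoint of the connector from the question (name TestCheck; the localhost:8083 host and port follow the curl example above and are assumptions) might look like this:

import requests

# Ask the Connect REST API for the tasks of the TestCheck connector.
tasks = requests.get('http://localhost:8083/connectors/TestCheck/tasks').json()

for task in tasks:
    # Each entry looks like {"id": {"connector": "TestCheck", "task": 0}, ...},
    # which combines into names such as TestCheck-0 for log lines.
    print('{}-{}'.format(task['id']['connector'], task['id']['task']))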
I hope this helps!
I have tried to fetch my Redshift cluster details. I'm able to see many details about the cluster, but a few details are missing.
For example, details like Storage and Memory.
Below is the code:
redshiftClient = boto3.client('redshift', aws_access_key_id = role.credentials.access_key,
aws_secret_access_key = role.credentials.secret_key, aws_session_token = role.credentials.session_token, region_name='us-west-2')
#Getting all the clusters
clusters = redshiftClient.describe_clusters()
Can you please provide a way to get them?
Thanks.
The describe-clusters command does not return that type of information. The output of that command is:
{
"Clusters": [
{
"NodeType": "dw.hs1.xlarge",
"Endpoint": {
"Port": 5439,
"Address": "mycluster.coqoarplqhsn.us-east-1.redshift.amazonaws.com"
},
"ClusterVersion": "1.0",
"PubliclyAccessible": "true",
"MasterUsername": "adminuser",
"ClusterParameterGroups": [
{
"ParameterApplyStatus": "in-sync",
"ParameterGroupName": "default.redshift-1.0"
} ],
"ClusterSecurityGroups": [
{
"Status": "active",
"ClusterSecurityGroupName": "default"
} ],
"AllowVersionUpgrade": true,
"VpcSecurityGroups": \[],
"AvailabilityZone": "us-east-1a",
"ClusterCreateTime": "2013-01-22T21:59:29.559Z",
"PreferredMaintenanceWindow": "sat:03:30-sat:04:00",
"AutomatedSnapshotRetentionPeriod": 1,
"ClusterStatus": "available",
"ClusterIdentifier": "mycluster",
"DBName": "dev",
"NumberOfNodes": 2,
"PendingModifiedValues": {}
} ],
"ResponseMetadata": {
"RequestId": "65b71cac-64df-11e2-8f5b-e90bd6c77476"
}
}
You will need to retrieve Memory and Storage statistics from Amazon CloudWatch.
See your other question: Amazon CloudWatch is not returning Redshift metrics
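For example, here is a hedged boto3 sketch that pulls the PercentageDiskSpaceUsed metric for a cluster from CloudWatch ('mycluster' and the region are placeholders; other Redshift metrics such as CPUUtilization work the same way):

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')

# Storage usage for the cluster over the last hour, in 5-minute averages.
now = datetime.datetime.utcnow()
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Redshift',
    MetricName='PercentageDiskSpaceUsed',
    Dimensions=[{'Name': 'ClusterIdentifier', 'Value': 'mycluster'}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=['Average'],
)

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])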
If you actually want to retrieve information about a standard cluster (that is, the amount of storage and memory assigned to each node, rather than current memory and storage usage), that is not available from an API call. Instead see: Amazon Redshift Clusters