exclusions doesn't work in AWS Glue ETL job S3 connection - pyspark

According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
My S3 bucket looks like the following, and I want to exclude the test1 folder.
/mykkkkkk-test
    test1/
        testfolder/
            11.json
            22.json
    test2/
        1.json
    test3/
        2.json
    test4/
        3.json
    test5/
        4.json
I use the following code to exclude the test1 folder, but the job still processes the files under test1:
datasource0 = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://mykkkkkk-test/"],
     'exclusions': "[\"test1/**\"]",
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json",
    transformation_ctx="datasource0")
Does exclusions really work in an ETL PySpark script? I also tried the following, but none of them work:
'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",

Try using the full path for exclusion.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": ['s3://bucket/sample_data/'],
        "recurse": True,
        "exclusions": "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx="datasource0")
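Applied to the bucket from the question, that would look like the sketch below (untested; the bucket name and grouping options are carried over from the question):
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {
        "paths": ["s3://mykkkkkk-test/"],
        # fully qualified S3 path in the exclusion pattern
        "exclusions": "[\"s3://mykkkkkk-test/test1/**\"]",
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576"
    },
    format="json",
    transformation_ctx="datasource0")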

Related

Transforming and Reading json files in Azure Synapse notebook

I have the below JSON in one of my storage accounts and I am able to read it with the following code. I need help reading the records where "pod" has the value "kube-apiserver-78" or "kube-apiserver-79" and the username is "system:serviceaccount:xyz" or "system:serviceaccount:poq". Can someone help me translate that into the code below?
df = spark.read.json('abfss://insights-logs-kube-audit@azogs.dfs.core.windows.net/resourceId=/SUBSCRIPTIONS/5IS/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=08/d=09/h=11/m=00/')
df.show()
Sample Json file in Storage container Which I read:
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1ca3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1cax3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
To query the JSON file after reading it, register the DataFrame as a temporary view in Apache Spark and query it with Spark SQL.
To create the temporary view, use:
df.createOrReplaceTempView("name_of_temp_view")
Then query that temporary view using Spark SQL:
SELECT * FROM name_of_temp_view
WHERE (pod = 'kube-apiserver-78' OR pod = 'kube-apiserver-79')
  AND (username = 'system:serviceaccount:xyz' OR username = 'system:serviceaccount:poq')
Reference: Query JSON Files with Azure Synapse Analytics Notebooks
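Note that in the sample records above, pod sits inside the properties struct and username is inside the JSON string stored in properties.log, so the filter has to account for that nesting. A minimal PySpark sketch, assuming Spark infers properties as a struct and the view is named kube_audit:
df = spark.read.json('abfss://insights-logs-kube-audit@azogs.dfs.core.windows.net/.../')  # same path as in the question
df.createOrReplaceTempView("kube_audit")

result = spark.sql("""
    SELECT *
    FROM kube_audit
    WHERE properties.pod IN ('kube-apiserver-78', 'kube-apiserver-79')
      AND get_json_object(properties.log, '$.user.username')
          IN ('system:serviceaccount:xyz', 'system:serviceaccount:poq')
""")
result.show()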

Airflow operator to copy many files (directory, prefix) from Google Cloud Storage bucket to local filesystem

There is an Airflow operator GCSToLocalFilesystemOperator to copy ONE file from a GCS bucket to the local filesystem, but it supports only one file and it is not possible to copy many files for a given prefix.
There is a reverse operator, LocalFilesystemToGCSOperator, that allows copying many files from the local filesystem to the bucket; you do it simply with a star in the path ("/*").
Do you know the best way to copy files by prefix from a bucket to the local filesystem in Airflow? Am I missing something, or is it just not implemented for some reason?
The solution I have come up with so far is compressing the files before putting them in the bucket, downloading them as one file with Airflow, and unzipping locally with a BashOperator. I'm wondering if there is a better way.
I was able to successfully copy multiple files from a GCS bucket to the local filesystem (mapped) for a given prefix in Airflow using the approach below.
import datetime

from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.operators import python

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

BUCKET_NAME = 'qpalzm-bucket'
GCS_FILES = ['luffy.jpg', 'zoro.jpg']
LOCAL_PATH = '/home/airflow/gcs/data'
PREFIX = 'testfolder'

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'multi_copy_gcs_to_local',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def multi_copy(**kwargs):
        hook = GCSHook()
        for gcs_file in GCS_FILES:
            # initialize the local file name the object will be copied to
            filename = f'{LOCAL_PATH}/{gcs_file}'
            # check if PREFIX is set and build the GCS object name to be copied
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = f'{gcs_file}'
            # perform the GCS hook download
            hook.download(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
                filename=filename
            )

    # task that executes the multi_copy method
    multi_copy_op = python.PythonOperator(
        task_id='multi_gcs_to_local',
        provide_context=True,
        python_callable=multi_copy,
    )

    multi_copy_op
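If you would rather copy everything under the prefix instead of hardcoding GCS_FILES, GCSHook also exposes a list() method that returns the object names for a given prefix. A minimal sketch of such a callable (untested, reusing the constants above):
def multi_copy_by_prefix(**kwargs):
    hook = GCSHook()
    # list every object under the prefix instead of hardcoding file names
    for object_name in hook.list(bucket_name=BUCKET_NAME, prefix=PREFIX):
        # skip "directory" placeholder objects that end with a slash
        if object_name.endswith('/'):
            continue
        hook.download(
            bucket_name=BUCKET_NAME,
            object_name=object_name,
            filename=f"{LOCAL_PATH}/{object_name.split('/')[-1]}",
        )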

typeorm:migration create on New Project Does Not Recognize Entities - "No changes in database schema were found - cannot generate a migration."

I'm having trouble creating the initial migration for a nestjs-typeorm-mongo project.
I have cloned this sample project from NestJS that uses TypeORM with MongoDB. The project does work: when I run it locally after putting a "Photo" document into my local Mongo (db named "test", collection "photos"), I can call localhost:3000/photo and receive the photo documents.
Now I'm trying to create migrations with the typeorm cli using this command:
./node_modules/.bin/ts-node ./node_modules/typeorm/cli.js migration:generate -n initial
...but it's not working. I am not able to create an initial migration; even after setting "synchronize: false" in my app.module.ts file I always get the error:
No changes in database schema were found - cannot generate a migration. To create a new empty migration use "typeorm migration:create" command
when trying to generate a migration... 🤔
Other than changing synchronize to false, the only other change I made was adding an ormconfig.json file in the project root by running typeorm init --database mongodb:
{
  "type": "mongodb",
  "database": "test",
  "synchronize": true,
  "logging": false,
  "entities": [
    "src/**/*.entity.ts"
  ],
  "migrations": [
    "src/migration/**/*.ts"
  ],
  "subscribers": [
    "src/subscriber/**/*.ts"
  ],
  "cli": {
    "entitiesDir": "src",
    "migrationsDir": "src/migration",
    "subscribersDir": "src/subscriber"
  }
}
Since you are using MongoDB, you don't have tables and there is no need to create your collections ahead of time. Essentially, MongoDB schemas are created on the fly!
Under the hood, if the driver is MongoDB, the command typeorm migration:create is bypassed, so it is useless in this case. You can check PR #3304 and issue #2867 yourself.
However, there is an alternative called migrate-mongo, which provides an incremental, reversible, and version-controlled way to apply schema and data changes. It's well documented and actively developed.
migrate-mongo example
Run npm install -g migrate-mongo to install it.
Run migrate-mongo init to initialize the migrations tool. This will create a migrate-mongo-config.js configuration file and a migrations folder at the root of your project:
|_ src/
|_ migrations/
   |- 20200606204524-migration-1.js
   |- 20200608124524-migration-2.js
   |- 20200808114324-migration-3.js
|- migrate-mongo-config.js
|- package.json
|- package-lock.json
Your migrate-mongo-config.js configuration file may look like:
// In this file you can configure migrate-mongo
const env = require('./server/config')

const config = {
  mongodb: {
    // TODO Change (or review) the url to your MongoDB:
    url: env.mongo.url || "mongodb://localhost:27017",

    // TODO Change this to your database name:
    databaseName: env.mongo.dbname || "YOURDATABASENAME",

    options: {
      useNewUrlParser: true, // removes a deprecation warning when connecting
      useUnifiedTopology: true, // removes a deprecation warning when connecting
      // connectTimeoutMS: 3600000, // increase connection timeout up to 1 hour
      // socketTimeoutMS: 3600000, // increase socket timeout up to 1 hour
    }
  },

  // The migrations dir can be a relative or absolute path. Only edit this when really necessary.
  migrationsDir: "migrations",

  // The MongoDB collection where the applied changes are stored. Only edit this when really necessary.
  changelogCollectionName: "changelog"
};

module.exports = config;
Run migrate-mongo create name-of-my-script to add a new migration script. A new file will be created with a corresponding timestamp.
/*
|_ migrations/
   |- 20210108114324-name-of-my-script.js
*/
module.exports = {
  up(db) {
    return db.collection('products').updateMany({}, { $set: { quantity: 10 } })
  },

  down(db) {
    return db.collection('products').updateMany({}, { $unset: { quantity: null } })
  }
}
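Applied to the photos collection from the question, an initial migration could, for example, add an index (the field name here is just an illustration):
module.exports = {
  up(db) {
    // hypothetical: index the photo name field
    return db.collection('photos').createIndex({ name: 1 })
  },

  down(db) {
    // drop it again using the default index name
    return db.collection('photos').dropIndex('name_1')
  }
}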
The database changelog: in order to know the current database version and which migration should be applied next, there is a special collection that stores the database changelog, with information such as which migrations have been applied and when they were applied.
To run your migrations, simply run the command: migrate-mongo up
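Two other subcommands are handy while iterating:
# show which migrations have been applied and which are still pending
migrate-mongo status

# roll back the most recently applied migration (runs its down() function)
migrate-mongo down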
You can find a full example in this article MongoDB Schema Migrations in Node.js

How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.
The aws-java-sdk-glue doesn't contain the classes imported, and I can't find those libraries anywhere else. They must exist somewhere, but perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template Scala code from AWS:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // #inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.
@Frederic gave a very helpful hint to get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.
Unfortunately that version of glue-assembly.jar is already outdated and brings Spark in version 2.1.
That's fine if you're using backward-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features) you can get the appropriate jar from a Glue dev endpoint under /usr/share/aws/glue/etl/jars/glue-assembly.jar.
Provided you have a dev-endpoint named my-dev-endpoint you can copy the current jar from it:
export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`
scp -i dev-endpoint-private-key \
    glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
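Once the jar is copied locally, one way to make it visible to the sbt build is as an unmanaged dependency. A sketch, assuming you drop the jar into a lib/ directory at the project root (sbt also picks up lib/ automatically, but the explicit setting makes the intent clear):
// build.sbt: add the locally copied Glue assembly to the compile classpath
unmanagedJars in Compile += baseDirectory.value / "lib" / "glue-assembly.jar"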
Unfortunately, there are no libraries available for the Scala Glue API. I already contacted Amazon support and they are aware of this problem. However, they didn't provide any ETA for delivering the API jar.
As a workaround you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar
See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
It is now supported; see this recent release from AWS:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
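For an sbt build, the linked page lists a Maven repository that hosts the Glue ETL artifacts. A sketch of what the settings might look like; the resolver URL and the AWSGlueETL coordinates/version below are assumptions to verify against that page:
// build.sbt (sketch, verify coordinates and version against the AWS documentation above)
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "4.0.0" % "provided"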

MongoDB to BigQuery

What is the best way to export data from MongoDB hosted in mLab to Google BigQuery?
Initially, I am trying to do a one-time load from MongoDB to BigQuery, and later on I am thinking of using Pub/Sub for a real-time data flow to BigQuery.
I need help with the first one-time load from MongoDB to BigQuery.
In my opinion, the best practice is building your own extractor. That can be done with the language of your choice and you can extract to CSV or JSON.
But if you are looking for a fast way and your data is not huge and fits within one server, then I recommend using mongoexport. Let's assume you have a simple document structure such as below:
{
    "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
    "statuses" : [
        {
            "status" : "dc9e5511-466c-4146-888a-574918cc2534",
            "score" : 53.24388894
        }
    ],
    "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
    { "name":"_id", "type": "STRING" },
    { "name":"stored_at", "type": "record", "fields": [
        { "name":"date", "type": "STRING" }
    ]},
    { "name":"statuses", "type": "record", "mode": "repeated", "fields": [
        { "name":"status", "type": "STRING" },
        { "name":"score", "type": "FLOAT" }
    ]}
]
EOF
Now, the fun part starts :-) Extracting data as JSON from your MongoDB. Let's assume you have a cluster with replica set name statuses, your db is sample, and your collection is status.
mongoexport \
    --host statuses/db-01:27017,db-02:27017,db-03:27017 \
    -vv \
    --db "sample" \
    --collection "status" \
    --type "json" \
    --limit 100000 \
    --out ~/sample.json
As you can see above, I limit the output to 100k records, because I recommend you run a sample extract and load to BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a field $date which will cause an error when loading to BigQuery. To fix that we can use sed to replace it with a simple field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
Now you can compress, upload to Google Cloud Storage (GCS), and then load to BigQuery using the following commands:
# Compress for faster load
gzip sample.json

# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz

# Load to BQ
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --max_bad_records=999999 \
    --ignore_unknown_values=true \
    --encoding=UTF-8 \
    --replace \
    "YOUR_DATASET.mongodb_sample" \
    "gs://your-bucket/sample/*.json.gz" \
    "mongodb_schema.json"
If everything was okay, go back, remove --limit 100000 from the mongoexport command, and re-run the commands above to load everything instead of the 100k sample.
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extract logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but I used JavaScript to output CSV so I could load it more easily into BigQuery:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
  var recToSkip = x * size;
  db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
    var row = record._id + "," + record.stored_at.toISOString();
    record.statuses.forEach(function (l) {
      print(row + "," + l.status + "," + l.score)
    });
  });
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
    "${_MONGO_HOSTS}" \
    export-csv.js \
    | split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_

# Load all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
    --source_format=CSV \
    --max_bad_records=999999 \
    --ignore_unknown_values=true \
    --encoding=UTF-8 \
    --replace \
    "YOUR_DATASET.mongodb_sample" \
    "gs://your-bucket/sample/sample_*.csv.gz" \
    "ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I used a small trick, piping the output to split so it gets split into multiple files with the sample_ prefix. During the split it also gzips the output so you can load it more easily to GCS.
From a basic reading of MongoDB's documentation, it sounds like you can use mongoexport to dump your database as JSON. Once you've done that, refer to the BigQuery loading data topic for a description of how to create a table from JSON files after copying them to GCS.
You can read data from MongoDB and stream it to BigQuery. You can find an example in NodeJS here.
This is an extension of the linked example that prevents duplicated records (as long as they are still in the streaming buffer):
const { BigQuery } = require('@google-cloud/bigquery');
const bigqueryClient = new BigQuery();
...
const jsonData = // Array of documents from MongoDB
const inputRows = jsonData.map(row => ({
  insertId: row._id,
  json: row
}));

const insertOptions = {
  raw: true
};

await bigqueryClient
  .dataset(datasetId)
  .table(tableId)
  .insert(inputRows, insertOptions);
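For completeness, one way the jsonData array could be produced with the official MongoDB Node.js driver (a sketch; the connection string, database, and collection names are placeholders to replace with your own):
const { MongoClient } = require('mongodb');

async function fetchDocuments() {
  // placeholder: point this at your own mLab/MongoDB deployment
  const client = new MongoClient(process.env.MONGO_URI);
  await client.connect();
  try {
    // pull a batch of documents to stream into BigQuery
    return await client.db('sample').collection('status').find({}).limit(500).toArray();
  } finally {
    await client.close();
  }
}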