Column names return lowercase in AWS Glue - PySpark

I am new to AWS Glue and have created a job through a crawler that points to a source CSV file in an S3 bucket.
The CSV file contains the following columns:
userId jobTitleName firstName lastName preferredFullName employeeCode region
During job execution it throws the following error:
Key error: 'userid' does not exist. As the message suggests, this looks like a case-sensitivity issue, so, following the Glue documentation, I created a mapping for the schema:
mappingsSchema = [('userid', 'integer', 'userId', 'integer'),
                  ('jobtitlename', 'string', 'jobTitleName', 'string'),
                  ('firstname', 'string', 'firstName', 'string'),
                  ('lastname', 'string', 'lastName', 'string'),
                  ('preferredfullName', 'string', 'preferredFullname', 'string'),
                  ('employeecode', 'string', 'employeeCode', 'string'),
                  ('region', 'string', 'region', 'string')]

mapped_dynamic_frame_read = dynamic_frame_read.apply_mapping(mappings=mappingsSchema, case_sensitive=True, transformation_ctx="tfx")
# And converting to the Spark DataFrame
df = mapped_dynamic_frame_read.toDF()
I'm still getting the same error.
How can I resolve this kind of issue?

Hi @Emerson, the issue was with the mappings, in which the column names were specified incorrectly against the schema definition. It's now fixed and working fine. Thanks!
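For anyone hitting the same error, a corrected mapping would use the lowercase names the crawler catalogued as the source and the exact CSV header names as the target. A minimal sketch under that assumption:

# Corrected sketch: source names as catalogued (all lowercase),
# target names matching the CSV header exactly (note preferredfullname -> preferredFullName).
mappingsSchema = [('userid', 'integer', 'userId', 'integer'),
                  ('jobtitlename', 'string', 'jobTitleName', 'string'),
                  ('firstname', 'string', 'firstName', 'string'),
                  ('lastname', 'string', 'lastName', 'string'),
                  ('preferredfullname', 'string', 'preferredFullName', 'string'),
                  ('employeecode', 'string', 'employeeCode', 'string'),
                  ('region', 'string', 'region', 'string')]

mapped_dynamic_frame_read = dynamic_frame_read.apply_mapping(
    mappings=mappingsSchema, case_sensitive=True, transformation_ctx="tfx")
df = mapped_dynamic_frame_read.toDF()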

Related

AWS Glue CDK - create job type Spark (Glue 2.0)

I could not find any documentation on how to create a Glue job of type Spark. The approach the examples and documentation suggest creates type Python shell. Example:
glueETLJob = _glue.CfnJob(
    self,
    "glue_CDK_job",
    command=_glue.CfnJob.JobCommandProperty(
        name="glue_CDK_job",
        python_version='3',
        script_location=bucket + "/code/glue_CDK_job.py"
    ),
    role=glueRole.role_arn,
    max_retries=0,
    name="glue_CDK_job",
    timeout=30,
    glue_version="1.0"
)
This creates a Python shell job with version 1.0, but I cannot set glue_version="2.0" because that only seems to exist for type Spark.
Does anyone know how to create a Glue job with type Spark and Glue version 2.0 using CDK?
Thanks
Turns out the name in the JobCommandProperty is not an ID like the others, but the job type I was looking for. So if anybody has the same issue, it should look like:
glueETLJob = _glue.CfnJob(
    self,
    "glue_CDK_job",
    command=_glue.CfnJob.JobCommandProperty(
        name="glueetl",
        python_version='3',
        script_location=bucket + "/code/glue_CDK_job.py"
    ),
    role=glueRole.role_arn,
    max_retries=0,
    name="glue_CDK_job",
    timeout=30,
    glue_version="2.0"
)
Got the answer from: https://github.com/aws/aws-cdk/issues/4480
I got the same issue, and I realized it was caused by the name in JobCommandProperty. Changing glue_CDK_job to glueetl will work.
You can check the AWS CloudFormation documentation:
https://github.com/awsdocs/aws-cloudformation-user-guide/blob/main/doc_source/aws-resource-glue-job.md
const processFifaDataJobName = 'process-data-fifa';
const PYTHON_VERSION = "3";
const GLUE_VERSION = "1.0";
const COMMAND_NAME = "glueetl";

const glueJobProcessFifaData = new glue.CfnJob(this, processFifaDataJobName, {
    name: processFifaDataJobName,
    role: role.roleArn,
    command: {
        name: COMMAND_NAME,
        pythonVersion: PYTHON_VERSION,
        scriptLocation: 's3://' + bucketName + '/Scripts/process-data.py'
    },
    glueVersion: GLUE_VERSION
});
The above code worked for me. Although it is very similar to what you are trying to do, it created a job with type Spark.

How to load Postgres QgsVectorLayer

I have a QGIS script in which I am trying to load a vector layer that is stored in a Postgres database. When I print the result of the layer's isValid() method I get False. Here is my code:
from qgis.core import *
db_client = 'postgres'
db_host = 'localhost'
db_port = '5432'
db_name = 'database'
db_user = 'user'
db_password = 'pass123'
db_schema = 'public'
tablename = 'Geo_Layer'
geometrycol = 'geom'
tract_number_index = 3
QgsApplication.setPrefixPath('/usr/bin/qgis', True)
qgs = QgsApplication([], False)
qgs.initQgis()
geo_uri = QgsDataSourceUri()
geo_uri.setConnection(db_host, db_port, db_name, db_user, db_password)
geo_uri.setDataSource(db_schema, tablename, geometrycol, '', 'id')
geo_layer = QgsVectorLayer(geo_uri.uri(False), "Test", "postgres")
# Other configurations I have tried
# geo_layer = QgsVectorLayer(geo_uri.uri(), "Test", "postgres")
# geo_layer = QgsVectorLayer(geo_uri.uri(), "Test", "ogr")
# geo_layer = QgsVectorLayer(geo_uri.uri(False), "Test", "ogr")
print(geo_layer.isValid())
qgs.exitQgis()
I have provided the other QgsVectorLayer configurations I have tried. All print that the layer is not valid.
QGIS Version: 3.16.3-Hannover
Python Version: 3.8.5
Ubuntu Version: 20.04.02 LTS
I have checked my credentials with DBeaver and I am able to connect.
I once faced this issue when my geometry column in PostGIS contained multiple geometry types. In this case you can first query the column for its geometry types, and then construct a layer for QGIS for each geometry type:
for geom in geometry_types:
    uri.setDataSource(schema, table, column, "GeometryType(%s) = '%s'" % (column, geom))
    vlayer = QgsVectorLayer(uri.uri(), layer_name, "postgres")
    print(vlayer.isValid())
You can check for the geometry types in PostGIS using the following query:
'SELECT DISTINCT GeometryType("%s"::geometry) FROM "%s";' % (column, table)
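Putting the two pieces together, here is a minimal end-to-end sketch. It assumes psycopg2 is available to run the geometry-type query (any Postgres client would do) and reuses the connection values from the question:

import psycopg2
from qgis.core import QgsApplication, QgsDataSourceUri, QgsVectorLayer

# Connection details as in the question (placeholder values).
db_host, db_port, db_name = 'localhost', '5432', 'database'
db_user, db_password = 'user', 'pass123'
db_schema, tablename, geometrycol = 'public', 'Geo_Layer', 'geom'

QgsApplication.setPrefixPath('/usr/bin/qgis', True)
qgs = QgsApplication([], False)
qgs.initQgis()

# Ask PostGIS which geometry types the column actually stores (assumes psycopg2 is installed).
conn = psycopg2.connect(host=db_host, port=db_port, dbname=db_name,
                        user=db_user, password=db_password)
with conn.cursor() as cur:
    cur.execute('SELECT DISTINCT GeometryType("%s"::geometry) FROM "%s"."%s";'
                % (geometrycol, db_schema, tablename))
    geometry_types = [row[0] for row in cur.fetchall()]
conn.close()

# Build one layer per geometry type, filtering the source on that type.
uri = QgsDataSourceUri()
uri.setConnection(db_host, db_port, db_name, db_user, db_password)
for geom in geometry_types:
    uri.setDataSource(db_schema, tablename, geometrycol,
                      "GeometryType(%s) = '%s'" % (geometrycol, geom), 'id')
    vlayer = QgsVectorLayer(uri.uri(False), "%s_%s" % (tablename, geom), "postgres")
    print(geom, vlayer.isValid())

qgs.exitQgis()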

How to insert 'NULL' values for 'int' column types in an Aurora PostgreSQL db using the Python boto3 client

I have a CSV file (an MS SQL Server table export) and I would like to import it into an Aurora Serverless PostgreSQL database table. I did a basic preprocessing of the CSV file to replace all of the NULL values in it (i.e. '') with "NULL". The file looks like this:
CSV file:
ID,DRAW_WORKS
10000002,NULL
10000005,NULL
10000004,FLEXRIG3
10000003,FLEXRIG3
The PostgreSQL table has the following schema:
CREATE TABLE T_RIG_ACTIVITY_STATUS_DATE (
    ID varchar(20) NOT NULL,
    DRAW_WORKS_RATING int NULL
)
The code I am using to read and insert the CSV file is the following:
import boto3
import csv

rds_client = boto3.client('rds-data')
...

def batch_execute_statement(sql, sql_parameter_sets, transaction_id=None):
    parameters = {
        'secretArn': db_credentials_secrets_store_arn,
        'database': database_name,
        'resourceArn': db_cluster_arn,
        'sql': sql,
        'parameterSets': sql_parameter_sets
    }
    if transaction_id is not None:
        parameters['transactionId'] = transaction_id
    response = rds_client.batch_execute_statement(**parameters)
    return response

transaction = rds_client.begin_transaction(
    secretArn=db_credentials_secrets_store_arn,
    resourceArn=db_cluster_arn,
    database=database_name)

sql = 'INSERT INTO T_RIG_ACTIVITY_STATUS_DATE VALUES (:ID, :DRAW_WORKS);'
parameter_set = []

with open('test.csv', 'r') as file:
    reader = csv.DictReader(file, delimiter=',')
    for row in reader:
        entry = [
            {'name': 'ID', 'value': {'stringValue': row['RIG_ID']}},
            {'name': 'DRAW_WORKS', 'value': {'longValue': row['DRAW_WORKS']}}
        ]
        parameter_set.append(entry)

response = batch_execute_statement(
    sql, parameter_set, transaction['transactionId'])
However, the error that gets returned suggests that there is a type mismatch:
Invalid type for parameter parameterSets[0][5].value.longValue,
value: NULL, type: <class 'str'>, valid types: <class 'int'>"
Is there a way to configure Aurora to accept NULL values for types such as int?
Reading the boto3 documentation more carefully, I found that we can set the isNull value to True when a field is NULL. The code snippet below shows how to insert a null value into the database:
...
entry = [
    {'name': 'ID', 'value': {'stringValue': row['ID']}}
]
if row['DRAW_WORKS'] == 'NULL':
    entry.append({'name': 'DRAW_WORKS', 'value': {'isNull': True}})
else:
    entry.append({'name': 'DRAW_WORKS', 'value': {'longValue': int(row['DRAW_WORKS'])}})
parameter_set.append(entry)
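For context, here is how the fix might slot into the full CSV loop from the question. This is a sketch that assumes the CSV header matches the sample shown above (columns ID and DRAW_WORKS) and reuses sql, transaction and batch_execute_statement as defined in the question:

import csv

parameter_set = []
with open('test.csv', 'r') as file:
    reader = csv.DictReader(file, delimiter=',')
    for row in reader:
        # ID is always present as a string.
        entry = [{'name': 'ID', 'value': {'stringValue': row['ID']}}]
        # 'NULL' placeholders become SQL NULL via isNull; everything else is cast to int.
        if row['DRAW_WORKS'] == 'NULL':
            entry.append({'name': 'DRAW_WORKS', 'value': {'isNull': True}})
        else:
            entry.append({'name': 'DRAW_WORKS',
                          'value': {'longValue': int(row['DRAW_WORKS'])}})
        parameter_set.append(entry)

# sql, transaction and batch_execute_statement come from the question's code.
response = batch_execute_statement(
    sql, parameter_set, transaction['transactionId'])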

error in accessing tables from hive using apache drill

I am trying to read the data from my table abc, which is in Hive, using Drill. For that I have created a Hive storage plugin with the configuration below:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://<ip>:<port>",
    "fs.default.name": "hdfs://<ip>:<port>/",
    "hive.metastore.sasl.enabled": "false",
    "hive.server2.enable.doAs": "true",
    "hive.metastore.execute.setugi": "true"
  }
}
With this I am able to see the databases in Hive, but when I try to access any table in a particular database,
select * from hive.db.abc;
it throws the following error:
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
ERROR: From line 1, column 15 to line 1, column 18: Object 'abc' not
found within 'hive.db' SQL Query null [Error Id:
b6c56276-6255-4b5b-a600-746dbc2f3d67 on centos2.example.com:31010]
(org.apache.calcite.runtime.CalciteContextException) From line 1,
column 15 to line 1, column 18: Object 'abc' not found within
'hive.db' sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
sun.reflect.NativeConstructorAccessorImpl.newInstance():62
sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
java.lang.reflect.Constructor.newInstance():423
org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
org.apache.calcite.sql.SqlUtil.newContextException():800
org.apache.calcite.sql.SqlUtil.newContextException():788
org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4703
org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():127
org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():177
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2972
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2957
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3216
org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.SqlSelect.validate():226
org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():903
org.apache.calcite.sql.validate.SqlValidatorImpl.validate():613
org.apache.drill.exec.planner.sql.SqlConverter.validate():190
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateNode():630
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateAndConvert():202
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():174
org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():146
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():84
org.apache.drill.exec.work.foreman.Foreman.runSQL():567
org.apache.drill.exec.work.foreman.Foreman.run():264
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748 Caused By
(org.apache.calcite.sql.validate.SqlValidatorException) Object 'abc'
not found within 'hive.db'
sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
sun.reflect.NativeConstructorAccessorImpl.newInstance():62
sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
java.lang.reflect.Constructor.newInstance():423
org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
org.apache.calcite.runtime.Resources$ExInst.ex():572
org.apache.calcite.sql.SqlUtil.newContextException():800
org.apache.calcite.sql.SqlUtil.newContextException():788
org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4703
org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():127
org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():177
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2972
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2957
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3216
org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.SqlSelect.validate():226
org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():903
org.apache.calcite.sql.validate.SqlValidatorImpl.validate():613
org.apache.drill.exec.planner.sql.SqlConverter.validate():190
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateNode():630
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateAndConvert():202
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():174
org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():146
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():84
org.apache.drill.exec.work.foreman.Foreman.runSQL():567
org.apache.drill.exec.work.foreman.Foreman.run():264
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748
You should upgrade to a newer Hive version. For Drill 1.13 that is Hive 2.3.2: starting from Drill 1.13, Drill leverages version 2.3.2 of the Hive client [1].
Support for Hive 3.0 is upcoming [2].
Also, please follow the guide with the necessary Hive plugin configuration for your environment [3]. You can omit the "hive.metastore.sasl.enabled", "hive.server2.enable.doAs" and "hive.metastore.execute.setugi" properties, since you have specified their default values [4]; see the trimmed configuration sketch after the links below. For "hive.metastore.uris" and "fs.default.name" you should specify the same values as in your hive-site.xml.
[1] https://drill.apache.org/docs/hive-storage-plugin
[2] https://issues.apache.org/jira/browse/DRILL-6604
[3] https://drill.apache.org/docs/hive-storage-plugin/#hive-remote-metastore-configuration
[4] https://github.com/apache/hive/blob/rel/release-2.3.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L824
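Under that assumption, the storage plugin configuration from the question could be reduced to the non-default properties, for example:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://<ip>:<port>",
    "fs.default.name": "hdfs://<ip>:<port>/"
  }
}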

How to run the getRole command using pymongo?

I want to check whether a role exists in a MongoDB database before I create a new one. I tried to do it the following way:
result = self.client[database].command("getRole", name=app_name)
Unfortunately I get the following error:
msg = msg or "%s"
raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: no such command: 'getRole', bad cmd: '{ getRole: 1, name: "test" }'
I am referring to this database command: https://docs.mongodb.com/manual/reference/method/db.getRole/
For createRole I can execute the command: https://docs.mongodb.com/manual/reference/method/db.createRole/#db.createRole
Shell methods db.* are different from database commands.
Using the rolesInfo command you can get information for a particular role.
db.command({
    'rolesInfo': {'role': 'noremove', 'db': 'test'},
    'showPrivileges': True, 'showBuiltinRoles': True
})
The above command returns a result in this form when there is a matching role:
{'ok': 1.0,
 'roles': [{'db': 'test',
            'inheritedPrivileges': [{'actions': ['find', 'insert', 'update'],
                                     'resource': {'collection': 'test', 'db': 'test'}}],
            'inheritedRoles': [],
            'isBuiltin': False,
            'privileges': [{'actions': ['find', 'insert', 'update'],
                            'resource': {'collection': 'test', 'db': 'test'}}],
            'role': 'noremove',
            'roles': []}]}
When there is no matching role, you get this result:
{'ok': 1.0, 'roles': []}
Checking that a role exists then comes down to checking the length of the "roles" list in the returned result, as follows:
noremove_role = db.command({
    'rolesInfo': {'role': 'noremove', 'db': 'test'},
    'showPrivileges': True, 'showBuiltinRoles': True
})

if not len(noremove_role['roles']):
    # create role
    pass
Is there a better way?
Yes. In keeping with the "ask forgiveness, not permission" philosophy, create the role and handle the exception that results from trying to add an existing role.
from pymongo.errors import DuplicateKeyError
import logging

logger = logging.getLogger()

try:
    db.command(
        'createRole', 'noremove',
        privileges=[{
            'actions': ['insert', 'update', 'find'],
            'resource': {'db': 'test', 'collection': 'test'}
        }],
        roles=[])
except DuplicateKeyError:
    logger.error('Role already exists.')
    pass