Enabling Hive on Cygnus - fiware-cygnus

So far, my Orion subscriptions are appropriately creating HDFS files on my Cosmos instance. However, since my project's requirements mean I will be dealing with Wirecloud, it seems that a good approach is to perform Hive queries to retrieve the historical data.
Thus, how can I set up my Cygnus config files to automatically create and populate Hive tables?
Will the configuration below be enough?
# Hive enabling
cygnusagent.sinks.hdfs-sink.hive = true
# Hive server version, 1 or 2 (ignored if hive is false)
cygnusagent.sinks.hdfs-sink.hive.server_version = 2
# Hive FQDN/IP address of the Hive server (ignored if hive is false)
cygnusagent.sinks.hdfs-sink.hive.host = x.y.z.w
# Hive port for Hive external table provisioning (ignored if hive is false)
cygnusagent.sinks.hdfs-sink.hive.port = 10000
Is this documentation up to date (i.e., http://fiware-cygnus.readthedocs.io/en/1.2.0/cygnus-ngsi/installation_and_administration_guide/ngsi_agent_conf/)?

You're only missing:
cygnusagent.sinks.hdfs-sink.hive.db_type = default-db | namespace-db
The above parameter lets you switch between the default Hive database and a private database of your own.
Additionally, if using the FIWARE Lab global instance of Cosmos, the hdfs_password must be equal to your OAuth2 token.
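Putting it all together, a minimal sketch of the Hive-related part of the HDFS sink configuration could look like the following (the host, the db_type choice and the OAuth2 token are placeholders to adapt):
# Hive enabling
cygnusagent.sinks.hdfs-sink.hive = true
cygnusagent.sinks.hdfs-sink.hive.server_version = 2
cygnusagent.sinks.hdfs-sink.hive.host = x.y.z.w
cygnusagent.sinks.hdfs-sink.hive.port = 10000
# default-db for the default Hive database, namespace-db for a private database of your own
cygnusagent.sinks.hdfs-sink.hive.db_type = default-db
# when using the FIWARE Lab global Cosmos instance, this must be your OAuth2 token
cygnusagent.sinks.hdfs-sink.hdfs_password = <your OAuth2 token>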

Related

parquet streaming of Azure Blob storage into databricks with unity catalog

Unity Catalog has recently been set up in my Databricks account, and I am trying to stream from an Azure container containing parquet files to a service catalog, using a notebook that ran fine before.
However, I now get the following error:
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.streaming.DataStreamReader.format(java.lang.String) is not whitelisted on class class org.apache.spark.sql.streaming.DataStreamReader
when trying to run the following spark command from my Notebook:
df = (spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.useNotifications", "false") # useNotifications determines if we efficiently scan the new files or if we set up a subscription to listen to new file events
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns") # schemaEvolutionMode determines what happens when the schema changes
.option("cloudFiles.schemaLocation", schemaPath)
.load(dataPath)
)
where schemaPath and dataPath contain the paths to the parquet schema and data files.
The closest related error I have found is the following pre-Unity Catalog error, suggesting that I should disable table access control on my clusters:
https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted
All table access controls are disabled in my Admin Console.
Are there some other settings that need to be set to ensure whitelisting of reads from Azure files now that Unity Catalog is set up?
------ Edit -----
Using a Single User cluster on Databricks runtime version 11.3 beta, I get the following error instead:
com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location
followed by the path to the schema in my Azure storage location. I also get this error message when spawning new job clusters from Azure Data Factory.
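One thing worth checking (this is an assumption on my part, not a confirmed fix): with Unity Catalog, the storage paths a cluster reads from and writes to, including the schemaLocation path, generally have to be covered by an external location backed by a storage credential, and the principal running the stream needs file permissions on it. A rough sketch from a notebook, where the credential name, location name, URL and principal are placeholders:
# Hypothetical names: adls_credential, schema_location and the principal must be adapted
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS schema_location
    URL 'abfss://<container>@<storage-account>.dfs.core.windows.net/<schema-path>'
    WITH (STORAGE CREDENTIAL adls_credential)
""")
# Grant the principal that runs the stream permission to read and write files there
spark.sql("GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION schema_location TO `user@example.com`")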

PostgreSQL FDW on S3 object storage

I have a requirement for an SQL-like interface to query S3 object storage (it's not actually AWS S3, but storage speaking the S3 protocol; we have the server name, bucket name, access_key and secret_access_key available).
I need to know whether there exists any PostgreSQL extension which can leverage the foreign data wrapper (FDW) feature to run queries against the S3 data.
We don't want to copy the data (CSV files) from S3 to the PostgreSQL server. We have an external PostgreSQL server and, using some extension, we want to query the S3 data directly with that PostgreSQL engine and get the results on the PostgreSQL server.
A link to a very similar requirement is below:
https://www.cdata.com/kb/tech/amazons3-jdbc-postgresql-fdw.rst
But it uses its own CData JDBC driver. We don't want anything proprietary.
Is there anything open source available? If yes, how can we achieve that?
We are using PostgreSQL version 12.2.
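For what it's worth, whichever extension is chosen, the FDW wiring in PostgreSQL follows the same pattern. The sketch below uses a hypothetical wrapper name (s3_fdw_example) and made-up option names purely to illustrate that shape; a real extension will define its own server, user mapping and table options:
-- Hypothetical extension and option names, for illustration only
CREATE EXTENSION s3_fdw_example;
CREATE SERVER s3_server
  FOREIGN DATA WRAPPER s3_fdw_example
  OPTIONS (endpoint 'https://my-object-store.example.com');
CREATE USER MAPPING FOR CURRENT_USER
  SERVER s3_server
  OPTIONS (access_key '<access_key>', secret_access_key '<secret_access_key>');
CREATE FOREIGN TABLE s3_events (id int, payload text)
  SERVER s3_server
  OPTIONS (bucket 'my-bucket', filename 'events.csv', format 'csv');
-- The query runs on the PostgreSQL engine, reading the CSV from S3 without copying it in
SELECT count(*) FROM s3_events;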

Azure polybase external data source creation error

I am trying to create an external table in Synapse Analytics, but I am facing an error while creating the external data source.
Below is the code:
CREATE MASTER KEY ENCRYPTION BY PASSWORD='xxxxxxxxxxxx'; -- executed
CREATE DATABASE SCOPED CREDENTIAL storageCred WITH -- executed
IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'xxxxxxxxxxxxxx';
CREATE EXTERNAL DATA SOURCE adls WITH -- execution failed
( TYPE = HADOOP,
LOCATION = 'abfss://staging@devedw2021.dfs.core.windows.net',
CREDENTIAL = storageCred
)
The syntax looks right for the External Data Source, but the problem may be with the Database Scoped Credential. I spent a LOT of time on this, and the only way I could get this to work was with Account name and key:
CREATE DATABASE SCOPED CREDENTIAL CausewayAdlsCredentials
WITH
IDENTITY = '<storage_account_name>' ,
SECRET = '<storage_account_key>'
;
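With that credential in place, the external data source from the question can be recreated against it; a sketch reusing the names from this post:
CREATE EXTERNAL DATA SOURCE adls WITH
( TYPE = HADOOP,
  LOCATION = 'abfss://staging@devedw2021.dfs.core.windows.net',
  CREDENTIAL = CausewayAdlsCredentials
);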
One word of warning: beware the documentation. There are several different locations that discuss this problem, and they have conflicting messaging or refer to old versions. This one is OK, but only section C worked for me.

External Table on DELTA format files in ADLS Gen 1

We have a number of Databricks DELTA tables created on ADLS Gen1, and there are also external tables built on top of each of those tables in one of the Databricks workspaces.
Similarly, I am trying to create the same sort of external tables on the same DELTA format files, but in a different workspace.
I have read-only access via a service principal on ADLS Gen1, so I can read the DELTA files through Spark dataframes, as given below:
read_data_df = spark.read.format("delta").load('dbfs:/mnt/data/<foldername>')
I am even able to create Hive external tables, but I see the following error while reading data from the same table:
Error in SQL statement: AnalysisException: Incompatible format detected.
A transaction log for Databricks Delta was found at `dbfs:/mnt/data/<foldername>/_delta_log`,
but you are trying to read from `dbfs:/mnt/data/<foldername>` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
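For reference, the 'using DELTA' variant mentioned below is created roughly like this (a sketch; the database and table names are placeholders, and the location is the same mounted folder as above):
# Hypothetical database/table names; LOCATION points at the same mounted Delta folder
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.my_delta_table
    USING DELTA
    LOCATION 'dbfs:/mnt/data/<foldername>'
""")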
If I create the external table 'using DELTA' instead, then I see a different access error:
Caused by: org.apache.hadoop.security.AccessControlException:
OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
Does this mean that I would need full access, rather than just read-only, on the underlying file system?
Thanks
Resolved after upgrading the Databricks Runtime environment to version DBR 7.3.

EMR Spark Fails to Save Dataframe to S3

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing which I do is read a table from my Redshift cluster into a Spark Dataframe using the com.databricks.spark.redshift format and using the same IAM Role to unload the data from redshift as I did for the EMR JobFlowRole.
So far as I understand, this runs an UNLOAD command on Redshift to dump into the S3 bucket I specify. Spark then loads the newly unloaded data into a Dataframe. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
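For context, the Redshift read looks roughly like this (a sketch; the JDBC URL, table, temp bucket and role ARN are placeholders):
// Sketch only: url, dbtable, tempdir and aws_iam_role values are placeholders
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<pass>")
  .option("dbtable", "<schema.table>")
  .option("tempdir", "s3n://<bucket>/tmp/")
  .option("aws_iam_role", "<role-arn>")
  .load()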
I then run some transformations and attempt to save the dataframe in CSV format to the same S3 bucket that Redshift unloaded into.
However, when I try to do this, it throws the following error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I then used DefaultAWSCredentialsProviderChain to load the AWSAccessKeyID and AWSSecretKey and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the s3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
The EC2 instances generated temporary credentials during the EMR bootstrap phase.
When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role, which invalidated the credentials the EC2 instance had generated.
I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata), and it failed because it was trying to use out-of-date credentials.
The solution was to remove redshift authorization via aws_iam_role and replace it with the following:
import com.amazonaws.util.EC2MetadataUtils

// Fetch the live instance-profile credentials from the EC2 instance metadata service
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
// IAM_ROLE is the name of the instance profile role attached to the EMR nodes
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
It's a long story.