I would like to load a dataframe from my Azure Data Lake Storage Gen2 and write it to a dedicated SQL database (dedicated SQL pool) that I created in Synapse.
This is what I did:
df = spark.read.format("delta").load(BronzePath)
df.write.format("com.databricks.spark.sqldw").option("url", jdbcUrl).save()
And I have the following error:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.sqldw.
Doing:
df.write.mode("overwrite").saveAsTable("MyTable")
creates the table in the Spark default database (blue cross). That is not what I need. I want to have my table in the dedicated database (blue arrow):
Post more of your code, including the JDBC URL, if it's different from this guide. I don't see the code that sets the storage key in the session conf, and you seem to be using a different way to save as well.
# Otherwise, set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")

# Get some data from an Azure Synapse table.
df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .load()

# Load data from an Azure Synapse query.
df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("query", "select x, count(*) as cnt from table group by x") \
    .load()

# Apply some transformations to the data, then use the
# Data Source API to write the data back to another table in Azure Synapse.
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()
Also read "Supported save modes for batch writes" and "Write semantics" in the FAQ.
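For your specific case, a minimal sketch could look like the following. It assumes you are on a Databricks cluster where the com.databricks.spark.sqldw connector is available; the container, storage account, directory and table names are placeholders, while BronzePath and jdbcUrl come from your own code.

# Hedged sketch: read the delta data and write it to the Synapse dedicated SQL pool.
# <your-storage-account-name>, <your-container-name>, <your-directory-name> and
# <your-table-name> are placeholders; BronzePath and jdbcUrl are from your code.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")

df = spark.read.format("delta").load(BronzePath)

df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbcUrl) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()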
As described here, "Spark will create a default local Hive metastore (using Derby) for you."
So when you don't give it a path/JDBC URL, as in df.write.mode("overwrite").saveAsTable("MyTable"), it saves to the local Hive metastore rather than to your dedicated database.
I want to create an automated job that can copy an entire database to a different one; both are in AWS RDS for PostgreSQL. How can I do that?
Thanks.
You can use the RDS database snapshot create/restore feature.
Here is an example using the command line:
aws rds create-db-snapshot \
--db-instance-identifier mydbinstance \
--db-snapshot-identifier mydbsnapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier mynewdbinstance \
--db-snapshot-identifier mydbsnapshot
The same APIs, such as CreateDBSnapshot, are available for multiple languages via the AWS SDK.
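For example, here is a minimal sketch using boto3, the AWS SDK for Python; the instance and snapshot identifiers are the same placeholders used in the CLI example above.

# Hedged sketch using boto3; identifiers are placeholders from the CLI example.
import boto3

rds = boto3.client("rds")

# Take a snapshot of the source instance.
rds.create_db_snapshot(
    DBInstanceIdentifier="mydbinstance",
    DBSnapshotIdentifier="mydbsnapshot",
)

# Wait until the snapshot is available before restoring from it.
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="mydbsnapshot",
)

# Restore the snapshot into a new instance.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mynewdbinstance",
    DBSnapshotIdentifier="mydbsnapshot",
)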
I have had success in the past running a script that dumps data from one Postgres server and pipes it into another server. It was basically like this pseudo-code:
psql target-database -c "truncate foo"
pg_dump source-database --data-only --table=foo | psql target-database
The pg_dump command outputs normal SQL commands that can be piped into a receiving psql command, which then inserts the data.
To understand how this works, run pg_dump on one table and then take a look at the output. You'll need to tweak the command to get exactly what you want (e.g. using --no-owner to avoid sending ownership/access configuration).
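If you want to wrap that into an automated job, a hedged sketch in Python might look like the block below. The connection strings and the table name are placeholders, and it assumes pg_dump and psql are installed wherever the job runs.

# Hedged sketch: automate the truncate + pg_dump | psql pipe for one table.
# SOURCE, TARGET and the table name are placeholders.
import subprocess

SOURCE = "postgresql://user:password@source-host:5432/source_db"
TARGET = "postgresql://user:password@target-host:5432/target_db"

def copy_table(table: str) -> None:
    # Empty the receiving table first.
    subprocess.run(["psql", TARGET, "-c", f'TRUNCATE "{table}"'], check=True)
    # Stream the data-only dump straight into the target database.
    dump = subprocess.Popen(
        ["pg_dump", SOURCE, "--data-only", f"--table={table}"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(["psql", TARGET], stdin=dump.stdout, check=True)
    dump.stdout.close()
    dump.wait()

copy_table("foo")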
I am trying to import 5 records from a MySQL table "employees" into HDFS by creating a Hive table in a Sqoop import command. But the query has been running for a long time and I am unable to see any result.
I have started the Hadoop services and they are up and running. I can also create tables in Hive manually.
sqoop import --connect jdbc:mysql://localhost:3306/DBName \
    --username UserID --password Pwd --split-by EMPLOYEE_ID \
    -e "select EMPLOYEE_ID,FIRST_NAME from employees where EMPLOYEE_ID <=105 and $CONDITIONS" \
    --target-dir /user/hive/warehouse/mysqldb.db/employees \
    --fields-terminated-by "," \
    --hive-import \
    --create-hive-table \
    --hive-table mysqldb.employees -m 2
What is going on here?
I have a table MYSCHEMA.TEST_SNOWFLAKE_ROLE_T in Snowflake created using the role CONSOLE_USER.
MYSCHEMA has FUTURE GRANTS associated with it, which grant the following privileges to the role BATCH_USER for any table created under the schema MYSCHEMA: DELETE, INSERT, REFERENCES, SELECT, TRUNCATE, UPDATE.
The role BATCH_USER also has CREATE STAGE and USAGE privileges on the schema MYSCHEMA.
A second user belonging to the role BATCH_USER tries to insert data into the same table from a dataframe, using the following Spark SQL (Databricks), but fails with an insufficient privileges error message.
df.write.mode(op_mode) \
    .format("snowflake") \
    .options(**self.sfoptions) \
    .option("dbtable", snowflake_tbl_name) \
    .option("truncate_table", "on") \
    .save()
The following error message appears:
Py4JJavaError: An error occurred while calling o908.save.
: net.snowflake.client.jdbc.SnowflakeSQLException: SQL access control error
: Insufficient privileges to operate on table 'TEST_SNOWFLAKE_ROLE_T')
The role CONSOLE_USER has ownership rights on the table, hence the role BATCH_USER would not be able to drop the table, but adding option("truncate_table", "on") should have prevented an automatic overwrite of the table schema.
I've gone through the available Snowflake and Databricks documentation several times, but can't seem to figure out what is causing the insufficient privilege issue.
Any help is much appreciated!
I figured it out eventually.
The error occurred because the table was created by the role CONSOLE_USER, which retained ownership privileges on the table.
The Spark connector for Snowflake uses a staging table for writing the data. If the data loading operation is successful, the original target table is dropped and the staging table is renamed to the original target table’s name.
Now, in order to rename a table or swap two tables, the role used to perform the operation must have OWNERSHIP privileges on the table(s). In the situation above, the ownership was never transferred to the role BATCH_USER, hence the error.
df.write.mode(op_mode) \
    .format("snowflake") \
    .options(**self.sfoptions) \
    .option("dbtable", snowflake_tbl_name) \
    .option("truncate_table", "on") \
    .option("usestagingtable", "off") \
    .save()
The solution was to avoid using a staging table altogether, although, going by the documentation, Snowflake strongly recommends using one.
This is a good reference for troubleshooting custom privileges:
https://docs.snowflake.net/manuals/user-guide/security-access-control-overview.html#role-hierarchy-and-privilege-inheritance
1. Is the second user (in the BATCH_USER role) inheriting any privileges?
Check on this by asking the user, in their session, to see what privileges they have on the table: https://docs.snowflake.net/manuals/sql-reference/sql/show-grants.html
What are the grants listed for the BATCH_USER having access issues? Check the following:
SHOW GRANTS ON <object_name>
SHOW GRANTS OF ROLE <role_name>
SHOW FUTURE GRANTS IN SCHEMA <schema_name>
2. Was a role specified for the second BATCH_USER when they tried to write to "dbtable"?
"When a user attempts to execute an action on an object, Snowflake compares the privileges available in the user’s session against the privileges required on the object for that action. If the session has the required privileges on the object, the action is allowed."
Try: https://docs.snowflake.net/manuals/sql-reference/sql/use-role.html
3. Since you mentioned future grants were used on the objects created: FUTURE GRANTS being limited to SECURITYADMIN is discussed at https://community.snowflake.com/s/question/0D50Z00009MDCBv/can-a-role-have-rights-to-grant-future-rights
I have written an AWS Glue job where I am trying to read Snowflake tables as Spark dataframes and also trying to write a Spark dataframe into Snowflake tables. My job is failing in both scenarios with "Insufficient privileges to operate on schema".
But when I write an insert statement directly in the Snowflake CLI, I am able to insert data. So basically I have the INSERT privilege.
So why is my job failing when I try to insert data from a dataframe, or read data from a Snowflake table as a dataframe?
Below is my code to write data into the Snowflake table.
sfOptions = {
    "sfURL" : "xt30972.snowflakecomputing.com",
    "sfAccount" : "*****",
    "sfUser" : "*****",
    "sfPassword" : "****",
    "sfDatabase" : "*****",
    "sfSchema" : "******"
}

df = spark.read.format("csv").option("header", "false").option("delimiter", ',').load("<aws_s3_file_name>")

df2.write.format("net.snowflake.spark.snowflake") \
    .options(**sfOptions) \
    .option("dbtable", table_name) \
    .mode("append") \
    .save()
When you are using the Snowflake CLI, I assume that you switch to a proper role to execute SELECT or INSERT. In Spark, you need to manually switch to the role that has the SELECT/INSERT grants before operating on a table. You do this by issuing the statement below.
Utils.runQuery(sfOptions, "USE ROLE <your_role>")
This will switch the role for the duration of your Spark session.
Also, please note that Snowflake's access structure is hierarchy-based. That means you need to have USAGE privileges on the database and schema that house the table you are trying to use. Please make sure that the role you are using for SELECT or INSERT has all the right grants.
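Since the Glue job in the question is PySpark, here is a hedged sketch of the same idea in Python. The role name is a placeholder, and it reuses the sfOptions, df2 and table_name names from the question; Utils.runQuery is reached through the JVM gateway, and setting the connector's sfRole option is an alternative way to pin the role for the session.

# Hedged sketch: make the session use a role that has USAGE on the database/schema
# and INSERT on the table, then write the dataframe.
# "<your_role>" is a placeholder; sfOptions, df2 and table_name come from the question.

# Option A: set the role directly in the connector options.
sfOptions["sfRole"] = "<your_role>"

# Option B: issue USE ROLE via the connector's Utils helper (JVM call from PySpark).
spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, "USE ROLE <your_role>")

df2.write.format("net.snowflake.spark.snowflake") \
    .options(**sfOptions) \
    .option("dbtable", table_name) \
    .mode("append") \
    .save()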
I tried to import all tables into one of the directories using Sqoop, but one of the tables has no primary key. This is the command I executed:
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera/retail_db" \
    --username=retail_dba \
    --password=cloudera \
    --warehouse-dir /user/cloudera/sqoop_import/
I am getting the following error:
Error during import: No primary key could be found for table
departments_export. Please specify one with --split-by or perform a
sequential import with '-m 1'.
After reading
sqoop import without primary key in RDBMS
I understood that --split-by can only be used for a single-table import. Is there a way I can specify --split-by for the import-all-tables command? Is there a way I can use more than one mapper for a multi-table import with no primary key?
You need to use --autoreset-to-one-mapper:
Tables without a primary key will be imported with one mapper, and the others, which have a primary key, will use the default number of mappers (4, if not specified in the sqoop command).
As @JaimeCr said, you can't use --split-by with import-all-tables, but this is just a quote from the Sqoop guide in the context of the error you got:
If a table does not have a primary key defined and --split-by <col> is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 or --m 1 option, or the --autoreset-to-one-mapper option is used.
The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera/retail_db" \
--username=retail_dba \
--password=cloudera \
--autoreset-to-one-mapper \
--warehouse-dir /user/cloudera/sqoop_import/