Synapse spark select column with space - scala

I am trying to read a Synapse table which has spaces in its column names.
Reading the table works as long as I select only columns without spaces or special characters:
%%spark
val df = spark.read.synapsesql("<Pool>.<schema>.<table>").select("TYPE", "Year").limit(100)
df.show()
OUTPUT:
+------+----+
| TYPE|Year|
+------+----+
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
When I start selecting columns with spaces I get errors. I have tried many variants:
.select(col("""`Country Code`"""))
.select(col("`Country Code`"))
.select(col("""[Country Code]"""))
.select(col("Country Code"))
.select($"`Country Code`")
.select("`Country Code`")
.select("""`Country Code`""")
All of them return this error:
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'Country'.
If I omit the backticks in the select, for example:
.select("[Country Code]")
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name '[Country Code]'.
With backticks, Spark in Synapse just takes the first word as the column name.
Any experience?

The select on its own will work; adding show (or any other action such as count) will not. There does seem to be an issue with the Synapse synapsesql API. The Invalid column name 'Country' error is coming from the SQL engine, because there seems to be no way to pass square brackets back to it. Also, Parquet files do not support spaces in column names, so it's probably connected.
The workaround is simply not to use spaces in column names. Fix up the tables in a previous Synapse pipeline step if required. I'll have a look into it, but there may be no other answer.
If you want to rename existing columns in the database you can use sp_rename, e.g.
EXEC sp_rename 'dbo.countries.country Type', 'countryType', 'COLUMN';
This code has been tested on a Synapse dedicated SQL pool.
That particular API (synapsesql.read) cannot handle views, unfortunately. You would have to materialise the view first, e.g. using a CTAS in a prior Synapse Pipeline step (a rough sketch follows below). The API is useful for simple patterns (get table -> process -> put back) but is pretty limited. You can't even manage table distribution (hash, round_robin, replicate), indexing (clustered columnstore, clustered index, heap) or partitioning, but you never know, they might add to it one day. I'll be keeping an eye out during the next MS conference anyway.
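Not from the original answer, just a rough sketch of what that prior step could run: a plain CTAS against the dedicated pool, shown here wrapped in Python via pyodbc, with every name and connection detail a placeholder (the column alias also drops the space, in line with the workaround above):
import pyodbc  # assumes the Microsoft ODBC Driver for SQL Server is installed

# hypothetical connection to the dedicated SQL pool
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;"
    "Database=<Pool>;UID=<user>;PWD=<password>",
    autocommit=True)

# materialise the view as a real table that synapsesql can read
conn.cursor().execute("""
    CREATE TABLE <schema>.<table>_materialised
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS SELECT [TYPE], [Year], [Country Code] AS country_code
    FROM <schema>.<view>
""")
conn.close()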

I have created a function to run a query using JDBC. Thanks to this I was able to read from a view. I have also added sample code showing how to get the password from Key Vault using TokenLibrary.
def spark_query(db, query):
    jdbc_hostname = "<synapse_db>.sql.azuresynapse.net"
    user = "<spark_db_client>"
    password = "<strong_password>"
    # password_from_kv = TokenLibrary.getSecret("<Linked_Key_Vault_Service_Name>", "<Key_Vault_Key_Name>", "<Key_Vault_Name>")
    return spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{jdbc_hostname}:1433;databaseName={db};user={user};password={password}") \
        .option("query", query) \
        .load()
Then I created a VIEW with column names without spaces:
CREATE VIEW v_my_table
AS
SELECT [Country code] as country_code from my_table
Granted access to <spark_db_client>:
GRANT SELECT ON v_my_table to <spark_db_client>
After all this preparation I was able to read the table through the VIEW and save it to a Spark database:
query = """
SELECT country_code FROM dbo.v_my_table
"""
df = spark_query(db="<my_database>", query=query)
spark.sql("CREATE DATABASE IF NOT EXISTS spark_poc")
df.write.mode("overwrite").saveAsTable("spark_poc.my_table")
df.registerTempTable("my_table")
These are <placeholder_variables>.
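As a quick sanity check (my addition, not part of the original steps), the persisted table can then be read back from Spark:
spark.sql("SELECT country_code FROM spark_poc.my_table").show(5)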

Related

How to create a temporary table in snowflake based on pyspark dataframe

I can read the Snowflake table into a pyspark dataframe using sqlContext:
sql = """select * from table1"""
df = sqlContext.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", sql) \
    .load()
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?
Just save as usual, with the Snowflake format:
snowflake_options = {
    ...
    'sfDatabase': 'dbabc',
    'dbtable': 'tablexyz',
    ...
}
(df
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .save()
)
I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table; but persisting it across subsequent statements is something I have had a great deal of difficulty finding out how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it successfully returns a Java object indicating the temp table was created; but once you try to run any further statements against it, you'll get nasty errors because it no longer exists (each call opens its own Snowflake session, and temporary tables are dropped when their session ends). Somehow we mere mortals would need to find a way to access and persist the Snowflake session through the JVM API. That said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create "temporary" permanent/transient tables, and designing Snowflake queries that insert directly into those and then clean them up (DROP them) when you are done (a sketch of this pattern follows below).
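A rough sketch of the second option (my own illustration, not from the original answer; tmp_scratch and target_table are made-up names, and df / snowflake_options are the ones defined earlier on this page):
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

def run_query(sql):
    # same JVM utility as above; runs a statement directly in Snowflake
    return spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, sql)

# 1. create a transient scratch table up front
run_query("create transient table if not exists tmp_scratch (id int, value text)")

# 2. load the dataframe into it through the connector
df.write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", "tmp_scratch") \
    .mode("append") \
    .save()

# 3. do the Snowflake-side work, then clean up
run_query("insert into target_table select id, value from tmp_scratch")
run_query("drop table if exists tmp_scratch")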

Appending spark dataframe to hive table with different column order

I'm using pyspark with the HiveWarehouseConnector in an HDP3 cluster.
There was a change in the schema, so I updated my target table using the "alter table" command, which added the new columns at the last positions by default.
Now I'm trying to use the following code to save a spark dataframe to it, but the columns in the dataframe are in alphabetical order and I'm getting the error message below:
df = spark.read.json(df_sub_path)
hive.setDatabase('myDB')
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()
and the error message is traced to:
Caused by: java.lang.IllegalArgumentException: Hive column:
column_x cannot be found at same index: 77 in
dataframe. Found column_y. Aborting as this may lead to
loading of incorrect data.
Is there any dynamic way of appending the dataframe to the correct columns in the hive table? It is important, as I expect more columns to be added to the target table.
You can read the target table without rows to get its columns. Then, using select, you can put the dataframe's columns in the correct order and append it:
target = hive.executeQuery('select * from target_table where 1=0')  # empty result, just for the column order
test = df.select(target.columns)  # reorder the dataframe's columns to match the target table
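To finish the append (my addition, assuming hive is the HiveWarehouseSession used in the question and test is the reordered dataframe), the write is then exactly as before:
test.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode('append') \
    .option('table', 'target_table') \
    .save()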

Is there a way to describe an external/spectrum table via redshift?

In AWS Athena you can write
SHOW CREATE TABLE my_table_name;
and see a SQL-like query that describes how to build the table's schema. It works for tables whose schema are defined in AWS Glue. This is very useful for creating tables in a regular RDBMS, for loading and exploring data views.
Interacting with Athena in this way is manual, and I would like to automate the process of creating regular RDBMS tables that have the same schema as those in Redshift Spectrum.
How can I do this through a query that can be run via psql? Or is there another way to get this via the aws-cli?
Redshift Spectrum does not support SHOW CREATE TABLE syntax, but there are system tables that can deliver the same information. I have to say, it's not as convenient as the ready-to-use SQL returned by Athena, though.
The tables are
svv_external_schemas - gives you information about glue database mapping and IAM roles bound to it
svv_external_tables - gives you the location information, and also data format and serdes used
svv_external_columns - gives you the column names, types and order information.
Using that data, you could reconstruct the table's DDL.
For example, to get the list of columns and their types in CREATE TABLE format, one can do:
select distinct
listagg(columnname || ' ' || external_type, ',\n')
within group ( order by columnnum ) over ()
from svv_external_columns
where tablename = '<YOUR_TABLE_NAME>'
and schemaname = '<YOUR_SCHEMA_NAME>'
The query gives you output similar to:
col1 int,
col2 string,
...
*) I am using the listagg window function and not the aggregate function, as apparently the listagg aggregate function can only be used with user-defined tables. Bummer.
I had been doing something similar to #botchniaque's answer in the past, but recently stumbled across a solution in the AWS Labs amazon-redshift-utils code package that seems to be more reliable than my hand-spun queries:
amazon-redshift-utils: v_generate_external_tbl_ddl
If you don't have the ability to create the view backed by the DDL listed in that package, you can run it manually by removing the CREATE statement from the start of the query. Assuming you can create it as a view, usage would be:
SELECT ddl
FROM admin.v_generate_external_tbl_ddl
WHERE schemaname = '<external_schema_name>'
-- Optionally include specific table references:
-- AND tablename IN ('<table_name_1>', '<table_name_2>', ..., '<table_name_n>')
ORDER BY tablename, seq
;
They have added SHOW EXTERNAL TABLE now.
SHOW EXTERNAL TABLE external_schema.table_name [ PARTITION ]
SHOW EXTERNAL TABLE my_schema.my_table;
https://docs.aws.amazon.com/redshift/latest/dg/r_SHOW_EXTERNAL_TABLE.html

Trim/whitespace issue when loading data from a DB2 source to a PostgreSQL DB using Talend Open Studio

We are seeing an issue with table values which are populated from DB2 (source) into Postgres (target).
I have included all the job details for each component.
Based on the above approach, once the data has been populated we run the below queries in the Postgres DB:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_gssn_cd='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where cust_cntry_cd='847' ;
No records are returned; however, when we run the same queries with trim as below, they work:
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_gssn_cd)='XY03666699' ;
SELECT * FROM VMRCTTA1.VMRRCUST_SUMM where trim(cust_cntry_cd)='847' ;
Below are the ways we have tried to overcome this, but with no luck:
Used a tMap between the source and target components.
Used trim in the source component under Advanced settings.
Changed the datatype of cust_cntry_cd in the Postgres DB from char(5) to character varying, which allows values without any length restriction.
Please suggest what is missing, as we have this issue in almost all the tables where we have character/varchar columns.
We are using TOS (Talend Open Studio).
The data type is probably character(5) in DB2.
That means that the trailing spaces are part of the column and will be migrated. You have to compare with
cust_cntry_cd = '847  '  -- two trailing spaces, padded to char(5)
or cast the right argument to character(5):
cust_cntry_cd = CAST ('847' AS character(5))
Maybe you could delete all spaces in the advanced settings of the tDB2Input component.
(screenshot of the tDB2Input advanced settings not reproduced here)

Postgresql, query results to new table

Windows / .NET / ODBC
I would like to get query results into a new table in some handy way that I can then see through a data adapter, but I can't find a way to do it.
There are not many examples around at a beginner's level on this.
I don't know whether it should be temporary or not, but after seeing the results the table is no longer needed, so I can delete it 'by hand' or it can be deleted automatically.
This is what I try:
mCmd = New OdbcCommand("CREATE TEMP TABLE temp1 ON COMMIT DROP AS " & _
           "SELECT dtbl_id, name, mystr, myint, mydouble FROM " & myTable & " " & _
           "WHERE myFlag='1' ORDER BY dtbl_id", mCon)
n = mCmd.ExecuteNonQuery
This runs without error, and in 'n' I get the correct number of matched rows!
But with pgAdmin I don't see that table anywhere, no matter whether I look during the open transaction or after the transaction is closed.
Second, should I define the columns for the temp1 table first, or can they be created automatically based on the query results (that would be nice!)?
Please give a minimal example, based on the code above, of what to do to get a new table filled with the query results.
A shorter way to do the same thing your current code does is with CREATE TEMPORARY TABLE AS SELECT ... . See the entry for CREATE TABLE AS in the manual.
Temporary tables are not visible outside the session ("connection") that created them; they're intended as a temporary location for data that the session will use in later queries. (Your statement also specifies ON COMMIT DROP, so the table is dropped as soon as the transaction commits.) If you want a created table to be accessible from other sessions, don't use a TEMPORARY table.
Maybe you want UNLOGGED (9.2 or newer) for data that's generated and doesn't need to be durable, but must be visible to other sessions?
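A minimal sketch of that suggestion (my own illustration, in Python with psycopg2 purely to show the session-visibility point; connection details and table/column names are placeholders, and the same SQL works over ODBC from .NET):
import psycopg2

# session A: build a regular (UNLOGGED) table from the query results
conn_a = psycopg2.connect("dbname=<db> user=<user> password=<pw>")
cur_a = conn_a.cursor()
cur_a.execute("CREATE UNLOGGED TABLE temp1 AS "
              "SELECT dtbl_id, name, mystr, myint, mydouble FROM mytable "
              "WHERE myflag = '1' ORDER BY dtbl_id")
conn_a.commit()

# session B (e.g. pgAdmin) can now see and query it
conn_b = psycopg2.connect("dbname=<db> user=<user> password=<pw>")
cur_b = conn_b.cursor()
cur_b.execute("SELECT count(*) FROM temp1")
print(cur_b.fetchone())

# drop it 'by hand' when you are done
cur_a.execute("DROP TABLE temp1")
conn_a.commit()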
See related: Is there a way to access temporary tables of other sessions in PostgreSQL?