Spark DataFrame write to Postgres puts column names in double quotes - postgresql

I am trying to write a DataFrame into Postgres, where the column names in the DataFrame are codes in UPPERCASE, but the table in Postgres has column names in lowercase. The call to dataframe.write.jdbc fails with:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO xxxxxxxx ("USE_CASE_ID","CUSTOMER_CODE","HOLDOUT","REFERENCE_ID","TAG_FIELDS","COMMS_RUN_ID","PRIMARY_OFFER_ID") VALUES
ERROR: column "USE_CASE_ID" of relation "xxxxxxxx" does not exist
How can I make this work for any database in the future? I am not sure why Spark SQL wraps column names in double quotes.
Another problem: for unit testing I use an H2 database, which expects the column names to be in UPPERCASE, so I'll have to satisfy multiple databases.
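One common workaround (a minimal sketch, not from the original post): the JDBC writer quotes every identifier, so the DataFrame column names must match the stored names exactly; renaming the columns to the case each target database folds unquoted identifiers to is usually enough. The helper name, the JDBC URL, and the connection properties below are placeholders.
def rename_columns_for(df, to_lower=True):
    # Rename every DataFrame column to the case the target database
    # folds unquoted identifiers to (lower case for Postgres, upper case for H2).
    new_names = [c.lower() if to_lower else c.upper() for c in df.columns]
    return df.toDF(*new_names)

# Postgres stores unquoted identifiers in lower case:
rename_columns_for(df, to_lower=True).write.jdbc(
    url="jdbc:postgresql://host:5432/mydb",  # placeholder URL
    table="xxxxxxxx",
    mode="append",
    properties={"user": "...", "password": "...", "driver": "org.postgresql.Driver"},
)

# H2 in the unit tests stores unquoted identifiers in upper case:
# rename_columns_for(df, to_lower=False).write.jdbc(...)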

Related

Ora2PG REPLACE_AS_BOOLEAN property and exclude one column by replacing

I'm using ora2pg to import an Oracle db schema into a PostgreSQL db schema. I configured everything correctly and I'm able to dump the Oracle db into the PostgreSQL db.
In the schema that I'm converting I have some columns of type number(1,0) that I need to convert to boolean in the pg schema.
At first I used this configuration
REPLACE_AS_BOOLEAN NUMBER:1
so every column with this type will be converted to boolean in the pg db.
The problem is that I have one column in the Oracle schema defined as number(1,0) that has to remain numeric and keep the same type in the pg schema, so it must not be converted to boolean.
That means I changed the property in this manner
REPLACE_AS_BOOLEAN TABLE1:COLUMN1 TABLE2:COLUMN2 TABLE3:COLUMN3
I have a lot of columns that have to be converted to boolean, and the definition of this property becomes very long.
Is there a way to define the REPLACE_AS_BOOLEAN property so it replaces all columns of type number(1,0), but with an exception for one or a few of them?
I had to write the property with the list of all the table and column names.

Why is Pyspark/Snowflake insensitive to underscores in column names?

Working in Python 3.7 and pyspark 3.2.0, we're writing out a PySpark dataframe to a Snowflake table using the following statement, where mode usually equals 'append'.
df.write \
.format('snowflake') \
.option('dbtable', table_name) \
.options(**sf_configs) \
.mode(mode)\
.save()
We've made the surprising discovery that this write can be insensitive to underscores in column names: a dataframe with the column "RUN_ID" is successfully written out to a table with the column "RUNID" in Snowflake, with the columns mapped accordingly. We're curious why this is so (in particular, whether the pathway runs through a LIKE statement somewhere, or whether there's something interesting in the Snowflake table definition) and are looking for documentation of this behavior (assuming that it's a feature, not a bug).
According to the docs, the Snowflake connector defaults to mapping by column order instead of by name; see the column_mapping parameter:
The connector must map columns from the Spark data frame to the Snowflake table. This can be done based on column names (regardless of order), or based on column order (i.e. the first column in the data frame is mapped to the first column in the table, regardless of column name).
By default, the mapping is done based on order. You can override that by setting this parameter to name, which tells the connector to map columns based on column names. (The name mapping is case-insensitive.)
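If name-based mapping is what you want, a minimal sketch based on the snippet above is to add the column_mapping option described in the quoted docs; sf_configs and table_name are the same placeholders used in the question.
# Map Spark columns to Snowflake columns by name (case-insensitively)
# instead of by position.
df.write \
    .format('snowflake') \
    .options(**sf_configs) \
    .option('dbtable', table_name) \
    .option('column_mapping', 'name') \
    .mode('append') \
    .save()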

Inserting a value with multiple lines behaves differently in pyspark and a plain SQL query via JDBC (Hive)

If you run this SQL query (using JDBC, Hive server):
--create table test.testing (a string) stored as ORC;
insert into test.testing values ("line1\nline2");
I expect to get 1 record, but I get 2 records in the table.
If you run the same query using pyspark:
spark.sql("""insert into test.testing values ('pysparkline1\npysparkline2')""")
I get 1 record in the table.
How can I insert multi-line data into a table column using an "INSERT ... VALUES (...)" statement via JDBC to avoid this problem?
P.S. An "INSERT ... SELECT" query is not suitable, and I cannot change the line delimiter in the CREATE statement.

pandas read_sql converts column names to lower case - is there a workaround?

related: pandas read_sql drops dot in column names
I use pandas.read_sql to create a data frame from an SQL query against a Postgres database.
Some column aliases/names use mixed case, and I want that to propagate to the data frame.
However, pandas (or the underlying engine, SQLAlchemy as far as I know) returns only lower-case field names.
Is there a workaround?
(besides using a lookup table and fixing the values afterwards)
Postgres normalizes unquoted column names to lower case. If you have such a table:
create table foo ("Id" integer, "PointInTime" timestamp);
PostgreSQL will obey the case, but you will have to quote the column names when querying:
select "Id", "PointInTime" from foo;
A better solution is to add column aliases, e.g.:
select name as "Name", value as "Value" from parameters;
and Postgres will return properly cased column names. If the problem lies with SQLAlchemy or pandas, though, this will not suffice.
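As a small illustration of the aliasing approach (the connection string below is a placeholder, not from the original post):
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; quoting the aliases in the SQL keeps the
# mixed case, and pandas passes the column names through unchanged.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_sql('select name as "Name", value as "Value" from parameters', engine)
print(df.columns)  # Index(['Name', 'Value'], dtype='object')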

Reserved word "CLUSTER" used as Column name in Oracle 10g

There is an existing table with a column named "CLUSTER". I have to query this table to retrieve values of the column "CLUSTER". I am getting a "missing expression" error, since CLUSTER is a reserved word in Oracle. Since Oracle allowed a column named CLUSTER to be created, there should be a way to retrieve it. How can I query this column?
PS - I don't have an option to rename the column.
Thanks in advance.
Just use double quotes to refer to that column, like:
select "CLUSTER" from table;
Also, make sure you match the case in the column name.