How to write decimal type to redshift using awsglue? - pyspark

I am trying to write a variety of columns to Redshift from a dynamic frame using the DynamicFrameWriter.from_jdbc_conf method, but every DECIMAL field ends up as a column of NULLs.
The ETL pulls in from some Redshift views and eventually writes back to a Redshift table using the aforementioned method. In general this works for other data types, but adding a statement as simple as SELECT CAST(12.34 AS DECIMAL(4, 2)) AS decimal_test results in a column full of NULLs. In PySpark I can print out the column and see the decimal values immediately before they are written to the Redshift table as NULLs. When I look at the schema in Redshift, the column decimal_test has the type NUMERIC(4, 2). What could be causing this failure?
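For reference, a minimal sketch of the write path described above; the connection name, database, table, and S3 temp directory are placeholders, not values from the original job:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# The decimal value prints correctly at this point...
df = spark.sql("SELECT CAST(12.34 AS DECIMAL(4, 2)) AS decimal_test")
df.show()

dyf = DynamicFrame.fromDF(df, glue_context, "decimal_test_frame")

# ...but arrives in Redshift as NULL when written this way.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",   # placeholder connection
    connection_options={"dbtable": "public.decimal_test", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/glue/",  # placeholder temp path
)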

Related

Does copying data from Postgres to Bigquery in Cloud Fusion cause columns with Numeric Datatypes to move decimal places?

I am copying over 20,000 rows from a table in Postgres to BigQuery. In both Postgres and BigQuery the datatype of these columns is NUMERIC(13, 2) or NUMERIC(12, 8).
The copy brings over the correct values from Postgres for all columns except the NUMERIC ones: the decimal point is shifted, or the full number is not captured.
I am not sure whether the NUMERIC types need to be redefined in BigQuery or whether I need to add a wrapper.

select all columns except two in q kdb historical database

In the output I want to select all columns except two from a table in a q/kdb+ historical database.
I tried running the query below, but it does not work on an HDB:
delete colid,coltime from table where date=.z.d-1
It fails with this error:
ERROR: 'par
(trying to update a physically partitioned table)
I referred to https://code.kx.com/wiki/Cookbook/ProgrammingIdioms#How_do_I_select_all_the_columns_of_a_table_except_one.3F but it did not help.
How can I display all columns except two in a kdb+ historical database?
You are getting the 'par error because the table is partitioned.
The error is documented here:
trying to update a partitioned table
You cannot directly update or delete anything in a partitioned table (there is a separate database-maintenance script for that).
The query you used as a fix works because it first selects the data into memory (temporarily) and then deletes the columns:
delete colid,coltime from select from table where date=.z.d-1
You can try the following functional form:
c:cols[t] except `p
?[t;enlist(=;`date;2015.01.01);0b;c!c]
You could try a functional select:
?[table;enlist(=;`date;.z.d);0b;{x!x}cols[table]except`colid`coltime]
Here the last argument is a dictionary mapping output column names to the expressions that produce them, which tells the query what to extract. Instead of deleting the two columns you specified, this selects all but those two, which is more or less the same query.
To see what the functional form of a query is you can run something like:
parse"select colid,coltime from table where date=.z.d"
And it will output the arguments to the functional select.
You can read more on functional selects at code.kx.com.
Only select queries work on partitioned tables; you worked around this by first selecting the table into memory and then deleting the columns you did not want.
If you have a large number of columns and don't want to build a bulky select query, you can use a functional select:
?[table;();0b;{x!x}((cols table) except `colid`coltime)]
This shows all columns except the given subset. The columns clause expects a dictionary, hence the use of {x!x} to convert the list of column names into a dictionary. See more information here:
https://code.kx.com/q/ref/funsql/
As nyi mentioned, if you want to permanently delete columns from a historical database you can use the deleteCol function in the dbmaint tools: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md

Audit tables on Redshift

Is there a way to get statistics on a table in Redshift, like the ones we can get on a dataframe in Python using df.describe(), as follows:
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|summary|           col 1           |      col2       |col3                  |           col4     |col5           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|  count|                      26716|              869|                 26716|               26716|          26716|
|   mean|                        0.0|          49409.0|                  null|                null|           null|
| stddev|                        0.0|24096.28685088223|                  null|                null|           null|
|    min|                          0|             7745|  pqr                 |xyz                 |abcd           |
|    max|                          0|            91073|  pqr                 |xyz                 |abcd           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
I have a use case for running statistics like the above on Redshift tables on a regular basis. I can get the column names and data types for a table from PG_TABLE_DEF, and I am looking to run Redshift's built-in functions such as min(), max(), count(), avg(), etc. over the columns identified from the table. I am not sure whether this is the right approach; if there is a better one, please share your thoughts.
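One way to approach this (a sketch under assumptions, not a vetted solution) is to read the column metadata from PG_TABLE_DEF and generate one aggregate query per column. The psycopg2 driver, connection parameters, and table name below are assumptions, not details from the original question:

import psycopg2

# Placeholder connection parameters -- substitute your cluster's values.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="placeholder",
)

table = "my_table"  # hypothetical table name
with conn, conn.cursor() as cur:
    # PG_TABLE_DEF lists one row per column for tables on the search_path.
    cur.execute('SELECT "column", type FROM pg_table_def WHERE tablename = %s',
                (table,))
    for col, coltype in cur.fetchall():
        if any(t in coltype for t in ("int", "numeric", "decimal", "double", "real")):
            # Numeric columns get the full describe()-style statistics.
            stats = (f'COUNT("{col}"), AVG("{col}"), STDDEV("{col}"), '
                     f'MIN("{col}"), MAX("{col}")')
        else:
            # Non-numeric columns only get count/min/max.
            stats = f'COUNT("{col}"), MIN("{col}"), MAX("{col}")'
        cur.execute(f'SELECT {stats} FROM {table}')
        print(col, cur.fetchone())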

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~2 billion rows, ~100 varchar columns, and one int8 column (intCol). The table is relatively sparse, although some columns have values in every row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (I gave up after 60 mins).
I know that pruning the columns in the projection is usually better, but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing the values of a given column into the same blocks on storage. This is different from a row-based database like MySQL or PostgreSQL. Because the first query selects only colA, Redshift does not need to access the other columns at all, while the second query reads all ~100 columns, causing a huge amount of disk access.
To improve the performance of the second query, consider setting the table's sortkey to intCol, the column in the WHERE clause. With a sortkey, that column's data is stored in sorted order on disk, so Redshift can skip blocks whose min/max range cannot match the condition, which reduces disk access for queries filtering on that column.
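As a sketch (assuming psycopg2 and the table/column names from the question), the sortkey can be changed on an existing table with Redshift's ALTER TABLE ... ALTER SORTKEY:

import psycopg2

# Placeholder connection parameters -- substitute your cluster's values.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin",
                        password="placeholder")
with conn, conn.cursor() as cur:
    # Sort on the filter column so zone maps can skip blocks whose
    # min/max range cannot contain intCol = 111111.
    cur.execute("ALTER TABLE tableA ALTER SORTKEY (intCol)")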

postgresql: alter multiple columns

My database has several tables with columns of type money. I would like to alter all of these columns (in different tables) in a single step rather than change the type column by column, to avoid omissions.
You'll have to repeat the altering statement for every column.
You could generate those statements programmatically, looping over the catalog, as sketched below.
For the database to alter all the tables atomically, enclose all the ALTER statements in a single transaction (PostgreSQL supports transactional DDL).
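A sketch of that loop, assuming psycopg2 and numeric(16, 2) as the target type (both the DSN and the target type are assumptions; adjust to your schema):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
with conn:  # a single transaction: every ALTER commits or rolls back together
    with conn.cursor() as cur:
        # Find every money column in user schemas via the catalog.
        cur.execute("""
            SELECT table_schema, table_name, column_name
            FROM information_schema.columns
            WHERE data_type = 'money'
              AND table_schema NOT IN ('pg_catalog', 'information_schema')
        """)
        for schema, table, column in cur.fetchall():
            # USING casts the existing money values to the new numeric type.
            cur.execute(
                f'ALTER TABLE "{schema}"."{table}" '
                f'ALTER COLUMN "{column}" TYPE numeric(16, 2) '
                f'USING "{column}"::numeric'
            )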