How to load 533 columns of data into a Snowflake table?

We have a table with 533 columns, including a lot of LOB columns, that has to be moved to Snowflake. Since our source transformation system has an issue managing 533 columns in one job, we have split the columns into 2 jobs. The first job will insert 283 columns and the second job needs to update the remaining columns.
We are using one COPY command and one upsert (MERGE) command respectively for these two jobs.
COPY command
copy into "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" (283 columns) from #"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
FILE_FORMAT = (DATE_FORMAT='dd-mm-yyyy', TIMESTAMP_FORMAT='dd-mm-yyyy',TYPE=CSV, ESCAPE_UNENCLOSED_FIELD = NONE,
SKIP_HEADER=1, field_delimiter ='|', RECORD_DELIMITER = '\\n', FIELD_OPTIONALLY_ENCLOSED_BY = '"',
NULL_IF = ('')) PATTERN='' on_error = 'CONTINUE',FORCE=true;
Upsert command
MERGE INTO db.schema._table as target
USING
(SELECT t.$1
from @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
(FILE_FORMAT => 'CSV', ESCAPE_UNENCLOSED_FIELD => NONE,
SKIP_HEADER => 1, FIELD_DELIMITER => '|', RECORD_DELIMITER => '\n', FIELD_OPTIONALLY_ENCLOSED_BY => '"',
NULL_IF => ('')) t
) source ON target.document_id = source.document_id
WHEN MATCHED THEN
--update lst_updated
UPDATE SET <columns> = <values>;
I would like to know if we have any other options.

I would recommend that you run the COPY INTO for both of your split files into temp/transient tables first, and then execute a single CTAS statement using a JOIN between those 2 tables on document_id. Don't MERGE from a flat file. You could, optionally, run a MERGE from the 2nd temp table into the first (non-temp) table if you wished, but I think a straight CTAS from the 2 "half" tables might be faster for you.
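A minimal sketch of that approach, assuming the stage references from the question (written here with @) and hedging on the details: the temp table names and the col_284 placeholder are hypothetical, and only document_id and the target table name come from the original post.
-- Load each half into its own temporary table (full column lists elided)
CREATE TEMPORARY TABLE appr_part1 ( document_id VARCHAR /* , ... first 283 columns ... */ );
CREATE TEMPORARY TABLE appr_part2 ( document_id VARCHAR /* , ... remaining columns ... */ );
COPY INTO appr_part1
FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1, FIELD_DELIMITER = '|', FIELD_OPTIONALLY_ENCLOSED_BY = '"', NULL_IF = (''));
COPY INTO appr_part2
FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1, FIELD_DELIMITER = '|', FIELD_OPTIONALLY_ENCLOSED_BY = '"', NULL_IF = (''));
-- Stitch the two halves back together into the final 533-column table in one pass
CREATE OR REPLACE TABLE "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" AS
SELECT p1.*,
       p2.col_284 /* , ... the rest of the second table's columns ... */
FROM appr_part1 p1
JOIN appr_part2 p2
  ON p1.document_id = p2.document_id;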

Related

Loading data from an Oracle table using Spark JDBC is extremely slow

I am trying to read 500 million records from a table using Spark JDBC and then perform a join on those tables.
When I execute the SQL from SQL Developer it takes 25 minutes.
But when I load it using Spark JDBC it takes forever; the last time it ran for 18 hours and then I cancelled it.
I am using AWS Glue for this.
This is how I read using Spark JDBC:
df = glueContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin://abcd:1521/abcd.com") \
    .option("user", "USER_PROD") \
    .option("password", "ffg#Prod") \
    .option("numPartitions", 15) \
    .option("partitionColumn", "OUTSTANDING_ACTIONS") \
    .option("lowerBound", 0) \
    .option("upperBound", 1000) \
    .option("dbtable", "FSP.CUSTOMER_CASE") \
    .option("driver", "oracle.jdbc.OracleDriver").load()
df.createOrReplaceTempView("customer_caseOnpremView")
I have used OUTSTANDING_ACTIONS as the partitionColumn, and here is the data distribution.
Column 1 is the partition column value and column 2 is its occurrence count:
1 8988894
0 4227894
5 2264259
9 2263534
8 2262628
2 2261704
3 2260580
4 2260335
7 2259747
6 2257970
This is my join, where loading the customer_caseOnpremView table takes more than 18 hours while the other two tables take 1 minute:
ThirdQueryResuletOnprem=spark.sql("SELECT CP.CLIENT_ID,COUNT(1) NoofCases FROM customer_caseOnpremView CC JOIN groupViewOnpremView FG ON FG.ID = CC.OWNER_ID JOIN client_platformViewOnpremView CP ON CP.CLIENT_ID = SUBSTR(FG.PATH, 2, INSTR(FG.PATH, '/') + INSTR(SUBSTR(FG.PATH, 1 + INSTR(FG.PATH, '/')), '/') - 2) WHERE FG.STATUS = 'ACTIVE' AND FG.TYPE = 'CLIENT' GROUP BY CP.CLIENT_ID")
Please suggest how to make it faster.
I have tried between 10 and 40 workers.
I have also changed the executor type from standard to the biggest one (GP2), but it had no impact on the job.
As your query has a lot of filters, you don't even need to bring in the whole dataset and then apply the filters on it. You can push the query down to the database engine, which will filter the data and return only the result to the Glue job.
This can be done as explained in https://stackoverflow.com/a/54375010/4326922; below is an example for MySQL, which can be applied to Oracle too with a few changes.
query= "(select ab.id,ab.name,ab.date1,bb.tStartDate from test.test12 ab join test.test34 bb on ab.id=bb.id where ab.date1>'" + args['start_date'] + "') as testresult"
datasource0 = spark.read.format("jdbc").option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test").option("driver", "com.mysql.jdbc.Driver").option("dbtable", query).option("user", "test").option("password", "Password1234").load()
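For the join in this question, the pushed-down Oracle subquery could look roughly like the following sketch. FSP.CUSTOMER_CASE comes from the question; FSP.GROUP_INFO and FSP.CLIENT_PLATFORM are hypothetical names for the Oracle tables behind the other two views, so substitute the real ones.
-- Hypothetical pushed-down Oracle subquery: only the aggregated result crosses the JDBC connection
(SELECT CP.CLIENT_ID, COUNT(1) AS NOOFCASES
 FROM FSP.CUSTOMER_CASE CC
 JOIN FSP.GROUP_INFO FG ON FG.ID = CC.OWNER_ID
 JOIN FSP.CLIENT_PLATFORM CP
   ON CP.CLIENT_ID = SUBSTR(FG.PATH, 2, INSTR(FG.PATH, '/') + INSTR(SUBSTR(FG.PATH, 1 + INSTR(FG.PATH, '/')), '/') - 2)
 WHERE FG.STATUS = 'ACTIVE'
   AND FG.TYPE = 'CLIENT'
 GROUP BY CP.CLIENT_ID) pushed_down
The whole string, parentheses and alias included, would then be passed as the dbtable option, exactly as in the MySQL example above.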

How to create multiple temp views in Spark using multiple data frames

I have 10 data frames and I want to create multiple temp views so that I can perform SQL operations on them using the createOrReplaceTempView command in PySpark.
This is probably what you're after.
source_tables = [
'sql.production.dbo.table1',
'sql.production.dbo.table2',
'sql.production.dbo.table3',
'sql.production.dbo.table4',
'sql.production.dbo.table5',
'sql.production.dbo.table6',
'sql.production.dbo.table7',
'sql.production.dbo.table8',
'sql.production.dbo.table9',
'sql.production.dbo.table10'
]
for source_table in source_tables:
    try:
        view_name = source_table.replace('.', '_')
        # Read each source table into a data frame (assumes the tables are
        # reachable through the catalog/metastore; adjust the read to your source)
        df = spark.read.table(source_table)
        # Lowercase all column names
        df = df.toDF(*[c.lower() for c in df.columns])
        df.createOrReplaceTempView(view_name)
    except Exception as e:
        print(e)
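Once the loop has run, each source can be queried through spark.sql by its underscored view name, for example (view name derived from the first entry in the list above):
-- Spark SQL against one of the registered temp views
SELECT COUNT(*) FROM sql_production_dbo_table1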

Postgres - limit number of rows COPY FROM

Is there a way to limit the Postgres COPY FROM syntax to only the first row? There doesn't seem to be an option listed in the documentation.
I know there's that functionality in SQL Server; see the FIRSTROW and LASTROW options below:
BULK INSERT sometable
FROM 'E:\filefromabove.txt'
WITH
(
FIRSTROW = 2,
LASTROW = 4,
FIELDTERMINATOR= '|',
ROWTERMINATOR = '\n'
)
You could use the PROGRAM option to preprocess the file and read from its standard output.
To load only the first line, use:
Unix/Linux/Mac
COPY sometable from PROGRAM 'head -1 filefromabove.txt' ;
Windows
COPY sometable from PROGRAM 'set /p var= <filefromabove.txt && echo %var%' ;
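And to approximate the FIRSTROW = 2 / LASTROW = 4 behaviour from the SQL Server example (skip the header, stop after row 4), head and tail can be combined in the same way; this sketch assumes a Unix-like system and the pipe-delimited file from above:
-- lines 1-4 from head, then drop line 1 with tail, leaving rows 2 through 4
COPY sometable FROM PROGRAM 'head -4 filefromabove.txt | tail -n +2' WITH (DELIMITER '|');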

# in Caché between columns

I have a SQL query and I would like to insert a hash character between one column and the next so that I can import the result into Excel using fields delimited by #. Does anyone have an idea how to do it? The query is as follows:
SELECT FC.folha, folha->folhames,folha->folhaano, folha->folhaseq, folha->folhadesc, folha->TipoCod as Tipo_Folha,
folha->FolhaFechFormatado as Folha_Fechada, folha->DataPagamentoFormatada as Data_Pgto,
Servidor->matricula, Servidor->nome, FC.rubrica,
FC.Rubrica->Codigo, FC.Rubrica->Descricao, FC.fator, FC.TipoRubricaFormatado as TipoRubrica,
FC.ValorFormatado,FC.ParcelaAtual, FC.ParcelaTotal
FROM RHFolCalculo FC WHERE folha -> FolhaFech = 1
AND folha->folhaano = 2018
and folha->folhames = 06
and folha->TipoCod->codigo in (1,2,3,4,6,9)
You are generating delimited output from the query, so the first row should be a header row, with all following rows being the data rows. You will really only have one column, due to the concatenation. So remove the aliases from the columns and output the first row like so (using the alias names here):
SELECT 'folha#folhames#folhaano#folhaseq#folhadesc#Tipo_Folha#
Folha_Fechada#Data_Pgto#
matricula#nome#rubrica#
Codigo#Descricao#fator#TipoRubrica#
ValorFormatado#ParcelaAtual#ParcelaTotal'
UNION
SELECT FC.folha || '#' || folha->folhames || '#' || folha->folhaano . . .
The UNION will give the remaining rows. Note that some conversion may be necessary on the column data if not all of the values are strings.
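A sketch of the full pattern with just the first few columns from the query above; the remaining columns follow the same || '#' || concatenation, and any non-string columns may need an explicit conversion first:
SELECT 'folha#folhames#folhaano#folhaseq'
UNION
SELECT FC.folha
       || '#' || folha->folhames
       || '#' || folha->folhaano
       || '#' || folha->folhaseq
FROM RHFolCalculo FC
WHERE folha->FolhaFech = 1
  AND folha->folhaano = 2018
  AND folha->folhames = 06
  AND folha->TipoCod->codigo IN (1,2,3,4,6,9)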

pgloader cannot import while using TARGET COLUMNS

I am having a hard time getting pgloader to work while trying to use the TARGET COLUMNS optional arguments.
LOAD CSV
FROM INLINE
HAVING FIELDS
(
npi,
...
)
INTO postgresql://user:pass!n@pg2/nadb?tablename=tempload
(
npi
)
WITH skip header = 1,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET work_mem to '64MB'
BEFORE LOAD EXECUTE
tempload.sql;
If I don't use the target columns then it works just fine. tempload has exactly the same columns as data.csv.
Every time I run it, it hangs at this point:
2016-06-09T17:17:33.749000-05:00 DEBUG
select i.relname,
n.nspname,
indrelid::regclass,
indrelid,
indisprimary,
indisunique,
pg_get_indexdef(indexrelid),
c.conname,
pg_get_constraintdef(c.oid)
from pg_index x
join pg_class i ON i.oid = x.indexrelid
join pg_namespace n ON n.oid = i.relnamespace
left join pg_constraint c ON c.conindid = i.oid
where indrelid = 'tempload'::regclass
I'm at a total loss. Like I said, it works fine if I don't use TARGET COLUMNS, so I really don't believe it is the data.
I get the same thing with release 3.2 and the docker image.
Turns out the issue has to do with the amount of memory. I changed to SET work_mem = '512' and it started to get past that point. I guess this has to do with the fact that I have 330 columns to import.