CSV file to Redshift using Talend - PostgreSQL

Windows 8.1
Talend version: 5.6
Job design:
tFileInputDelimited >> tRedshiftOutput
I am loading 1 million rows from a CSV file into Redshift. After about 5 lakh (500,000) rows have loaded, I get these errors:
Exception in component tRedshiftOutput_1
org.postgresql.util.PSQLException: ERROR: /rds/bin/padb.1.0.867/data/exec/58/0: failed to map segment from shared object: Cannot allocate memory
Detail:
error: /rds/bin/padb.1.0.867/data/exec/58/0: failed to map segment from shared object: Cannot allocate memory
code: 1015
context: dlopen(/rds/bin/padb.1.0.867/data/exec/58/0,RTLD_LAZY)
query: 4234372
location: exec_plan.cpp:2213
process: padbmaster [pid=15630]
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2096)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1829)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:386)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:332)
at project_1.red_mysqltest_0_1.red_mysqltest.tFileInputDelimited_1Process(red_mysqltest.java:1056)
at project_1.red_mysqltest_0_1.red_mysqltest.runJobInTOS(red_mysqltest.java:1802)
at project_1.red_mysqltest_0_1.red_mysqltest.main(red_mysqltest.java:1646)
[statistics] disconnected
How can I resolve these errors?

To use the COPY command you first need to copy the CSV file(s) to S3.
To copy files to S3 you can use the tSystem component with a command like the one below:
"aws s3 cp /home/filename.csv s3://data/"
Then you can use tRedshiftRow to run the COPY command, which copies the data from S3 into the table. If you are not using S3, you can pass the path of the file directly.
"COPY tablename
FROM 's3://location/filename.csv'
credentials
'aws_access_key_id= enter_access_key;
aws_secret_access_key=enter_aws_secret_access_key'
CSV DELIMITER ';' IGNOREHEADER 1 BLANKSASNULL EMPTYASNULL MAXERROR 10;"
For more detail, see the Redshift COPY documentation.
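Outside of Talend, the same two-step approach (upload to S3, then COPY) can be sketched in Python. This is only an illustration; the Redshift host, database, user and password below are placeholders, and the bucket, table and credential placeholders are reused from the commands above:
import boto3
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Step 1: upload the CSV to S3 (equivalent to the tSystem/aws cli step above).
boto3.client("s3").upload_file("/home/filename.csv", "data", "filename.csv")

# Step 2: run COPY so Redshift loads the file in bulk instead of row by row.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="myuser", password="mypassword",  # placeholders
)
with conn, conn.cursor() as cur:
    cur.execute(
        "COPY tablename FROM 's3://data/filename.csv' "
        "credentials 'aws_access_key_id=enter_access_key;"
        "aws_secret_access_key=enter_aws_secret_access_key' "
        "CSV DELIMITER ';' IGNOREHEADER 1 BLANKSASNULL EMPTYASNULL MAXERROR 10"
    )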

Related

Synapse Spark exception handling - Can't write to log file

I have written PySpark code to hit a REST API, extract the contents (which arrive in XML format), and write them to Parquet in a data lake container.
I am trying to add logging functionality where I write out not only errors but also updates on the actions/processes we execute.
I am comparatively new to Spark and have been relying on online articles and samples. They all explain error handling and logging through "1/0" examples and save the logs in the default folder structure (not in an ADLS account/container/folder), which does not help at all. Most of the code, written in pure Python, doesn't run as-is.
Could I get some assistance with setting up the following:
Pushing errors to a log file under a designated folder sitting under a data lake storage account/container/folder hierarchy.
Catching REST-specific exceptions.
This is a sample of what I have written:
LogFilepath = "abfss://raw@.dfs.core.windows.net/Data/logging/data.log"
#LogFilepath2 = "adl://.azuredatalakestore.net/raw/Data/logging/data.log"
print(LogFilepath)
try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("An error occurred: {}\n".format(e))
I have tried both the ABFSS and ADL file paths with no luck. The log file already exists in the storage account/container/folder.
I have reproduced the above using the abfss path in the open() function, but it gave me the error below.
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://synapsedata@rakeshgen2.dfs.core.windows.net/datalogs.logs'
As per this documentation,
we can use open() on an ADLS file with a path like /synfs/{jobId}/mountpoint/{filename}.
For that, we first need to mount the ADLS.
Here I have mounted it using an ADLS linked service. You can mount either with a storage account access key or with SAS, as per your requirement.
mssparkutils.fs.mount(
    "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net",
    "/mountpoint",
    {"linkedService":"<ADLS linked service name>"}
)
Now use the code below to achieve your requirement.
from datetime import datetime
currentDateAndTime = datetime.now()
jobid=mssparkutils.env.getJobId()
LogFilepath='/synfs/'+jobid+'/synapsedata/datalogs.log'
print(LogFilepath)
try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("Time : {} - Error : {}\n".format(currentDateAndTime, e))
Here I am writing the date and time along with the error, and there is no need to create the log file first: the code above will create the file and append each error to it.
If you want to generate the logs daily, you can build date-based log file names as per your requirement, as in the sketch below.
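For example, a minimal sketch of a per-day log file name under the same mount point (the /synfs/{jobId}/synapsedata prefix is the one used above; adjust it to your own mount and container):
from datetime import datetime
from notebookutils import mssparkutils  # already available in Synapse notebooks

jobid = mssparkutils.env.getJobId()
# One log file per day, e.g. /synfs/<jobId>/synapsedata/datalogs_2023-01-31.log
daily_log_path = "/synfs/{}/synapsedata/datalogs_{}.log".format(
    jobid, datetime.now().strftime("%Y-%m-%d")
)

with open(daily_log_path, "a") as f:
    f.write("Time : {} - Job started\n".format(datetime.now()))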
My execution:
Here I have executed it twice.

PostgreSQL pg_dump creates a SQL script, but it is not really a SQL script: is there a way to get pg_dump to create a standard SQL script?

I'm running pg_dump to create a script to automate the creation of a system like this:
pg_dump --dbname=postgresql://postgres:ohdsi@127.0.0.1:5432/OHDSI -t webapi.* > webapi.sql
This creates a SQL script, but it is not really a plain SQL script, as it contains code like the sample shown below.
When this script is run as a SQL script, it fails with the error shown below.
Is there a way to get pg_dump to create a script that is standard SQL and can be executed as an ordinary SQL script?
Code sample from sql generated by pg_dump:
COPY webapi.cohort_version (asset_id, comment, description, version, asset_json, archived, created_by_id, created_date) FROM stdin;
\.
--
-- Data for Name: concept_of_interest; Type: TABLE DATA; Schema: webapi; Owner: ohdsi_admin_user
--
COPY webapi.concept_of_interest (id, concept_id, concept_of_interest_id) FROM stdin;
1 4329847 4185932
2 4329847 77670
3 192671 4247120
4 192671 201340
Error seen when running the script generated by pg_dump:
--
-- Name: penelope_laertes_uni_pivot id; Type: DEFAULT; Schema: webapi; Owner: ohdsi_admin_user
--
ALTER TABLE ONLY webapi.penelope_laertes_uni_pivot ALTER COLUMN id SET DEFAULT nextval('webapi.penelope_laertes_uni_pivot_id_seq'::regclass)
--
-- Data for Name: achilles_cache; Type: TABLE DATA; Schema: webapi; Owner: ohdsi_admin_user
--
COPY webapi.achilles_cache (id, source_id, cache_name, cache) FROM stdin
Error executing: COPY webapi.achilles_cache (id, source_id, cache_name, cache) FROM stdin
. Cause: org.postgresql.util.PSQLException: ERROR: COPY from stdin failed: COPY commands are only supported using the CopyManager API.
Where: COPY achilles_cache, line 1
Exception in thread "main" java.lang.RuntimeException: org.apache.ibatis.jdbc.RuntimeSqlException: Error executing: COPY webapi.achilles_cache (id, source_id, cache_name, cache) FROM stdin
. Cause: org.postgresql.util.PSQLException: ERROR: COPY from stdin failed: COPY commands are only supported using the CopyManager API.
Where: COPY achilles_cache, line 1
at org.yaorma.database.Database.executeSqlScript(Database.java:344)
at org.yaorma.database.Database.executeSqlScript(Database.java:332)
at org.nachc.tools.fhirtoomop.tools.build.postgres.build.A04_CreateAtlasWebApiTables.exec(A04_CreateAtlasWebApiTables.java:29)
at org.nachc.tools.fhirtoomop.tools.build.postgres.build.A04_CreateAtlasWebApiTables.main(A04_CreateAtlasWebApiTables.java:19)
Caused by: org.apache.ibatis.jdbc.RuntimeSqlException: Error executing: COPY webapi.achilles_cache (id, source_id, cache_name, cache) FROM stdin
. Cause: org.postgresql.util.PSQLException: ERROR: COPY from stdin failed: COPY commands are only supported using the CopyManager API.
Where: COPY achilles_cache, line 1
at org.apache.ibatis.jdbc.ScriptRunner.executeLineByLine(ScriptRunner.java:109)
at org.apache.ibatis.jdbc.ScriptRunner.runScript(ScriptRunner.java:71)
at org.yaorma.database.Database.executeSqlScript(Database.java:342)
... 3 more
Caused by: org.postgresql.util.PSQLException: ERROR: COPY from stdin failed: COPY commands are only supported using the CopyManager API.
Where: COPY achilles_cache, line 1
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:329)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:315)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:291)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:286)
at org.apache.ibatis.jdbc.ScriptRunner.executeStatement(ScriptRunner.java:190)
at org.apache.ibatis.jdbc.ScriptRunner.handleLine(ScriptRunner.java:165)
at org.apache.ibatis.jdbc.ScriptRunner.executeLineByLine(ScriptRunner.java:102)
... 5 more
--- EDIT ------------------------------------
The --inserts method in the accepted answer gave me exactly what I needed.
I ended up doing this:
pg_dump --inserts --dbname=postgresql://postgres:ohdsi@127.0.0.1:5432/OHDSI -t webapi.* > webapi.sql
The client tool you are using to restore the dump cannot deal with the data from the (nonstandard) COPY command being mixed into the script. You need psql to restore such a dump.
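For reference, a minimal sketch of driving that restore from code: it simply shells out to psql via Python's subprocess and reuses the connection string from the question; it assumes psql is installed and on the PATH.
import subprocess

# psql understands the COPY ... FROM stdin blocks that pg_dump emits,
# which generic JDBC script runners cannot replay.
subprocess.run(
    ["psql", "--dbname=postgresql://postgres:ohdsi@127.0.0.1:5432/OHDSI",
     "--file=webapi.sql"],
    check=True,
)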
You can use the --inserts option of pg_dump to create a dump that contains INSERT statements rather than COPY. That will be slower to restore, but will work with more client tools.
However, your wish to get a standard SQL script is hopeless. PostgreSQL extends the standard in many ways, so a database cannot be dumped with a standard SQL script. Note, for example, that indexes are not defined by the SQL standard. If you are looking to transfer a PostgreSQL dump to a different RDBMS, you will be disappointed. That is more difficult.

Error while loading a Parquet-format file into Amazon Redshift using the COPY command and a manifest file

I'm trying to load a Parquet file using a manifest file and I'm getting the error below.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0' has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
Here is my manifest file:
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using the COPY command when I specify the file name directly:
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length. You can check it by running an s3 ls command:
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (file size) should be the same as the content_length value in your manifest file.
I don't know why they require this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
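As a sketch, you could also generate the manifest programmatically with boto3 so that content_length always matches the actual object size. The bucket, prefix and manifest key below are the ones from the question; treat them as placeholders:
import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "sbredshift-east", "data/"

# One manifest entry per object, using the size reported by S3 so
# content_length can never drift out of sync with the file.
entries = [
    {
        "url": "s3://{}/{}".format(bucket, obj["Key"]),
        "mandatory": True,
        "meta": {"content_length": obj["Size"]},
    }
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
]

s3.put_object(
    Bucket=bucket,
    Key="manifest/supplier.manifest",
    Body=json.dumps({"entries": entries}, indent=2).encode("utf-8"),
)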
The only way I've gotten a Parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather from my error logs, the COPY command for Parquet (with a manifest) might first be reading the files through Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

Error while copying data or importing data from CSV into PostgreSQL

I have a CSV file named myWidechar1.csv, and the data in it is as follows:
1~"ϴAnthony"~"Grosse"~"1980-02-23"~"65000.00"
2~"❤Alica"~"Fatnowna"~"1963-11-14"~"45000.00"
3~"☎Stella"~"Rossenhain"~"1992-03-02"~"120000.00"
The COPY command in PostgreSQL is as follows:
Copy dbo.myWidechar From 'D:\temp\myWidechar1.csv' DELIMITER '~' null as 'null' encoding 'windows-1251' CSV; select 1;
But the problem is that I am getting the error below while importing the data into PostgreSQL:
ERROR: invalid byte sequence for encoding "WIN1251": 0x00
CONTEXT: COPY mywidechar, line 3
SQL state: 22021
I need the data to remain the same in PostgreSQL, as I am migrating data from a SQL Server 2000 instance.
Can someone please help me resolve this issue?

How to access Hive data using Spark

I have a table (e.g. employee) stored as a text file in Hive and I want to access it using Spark.
First I set up the SQL context object using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then I created the table:
scala>sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(
id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n'")
Then I tried to load the contents of the text file using:
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting this error:
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If I have to place the text file in the current directory where spark-shell is running, how do I do that?
Do you run Hive on Hadoop?
Try using an absolute path. If that doesn't work, try loading your file into HDFS and then give the absolute path to your file (the HDFS location).
Try the steps below:
Start spark-shell in local mode, e.g. spark-shell --master local[*]
Give the full path of the file when loading it,
e.g. file:///home/username/employee.txt
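If you are on a newer Spark (2.x or later) and use PySpark rather than the Scala shell, the same fix looks roughly like this. This is only a sketch; it assumes a Spark build with Hive support and reuses the employee.txt path from the question:
from pyspark.sql import SparkSession

# A session with Hive support so spark.sql() can reach the Hive metastore.
spark = (
    SparkSession.builder
    .appName("load-employee")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, age INT) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
)

# The explicit file:// scheme makes Spark read the local file instead of
# resolving the relative path against HDFS.
spark.sql("LOAD DATA LOCAL INPATH 'file:///home/username/employee.txt' INTO TABLE employee")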