Just as the title says, trying to move some data from Redshift to S3 via Sqoop:
sqoop-import -Dmapreduce.job.user.classpath.first=true --connect "jdbc:redshift://redshiftinstance.us-east-1.redshift.amazonaws.com:9999/stuffprd;database=ourDB;user=username;password=password;" --table ourtable -m 1 --as-avrodatafile --target-dir s3n://bucket/folder/folder1/
All drivers are in the proper folders; however, the error being thrown is:
ERROR tool.BaseSqoopTool: Got error creating database manager: java.io.IOException: No manager for connect string:
Not sure if you've already found the answer to this, but you need to add the following to your sqoop command:
--driver com.amazon.redshift.jdbc42.Driver
--connection-manager org.apache.sqoop.manager.GenericJdbcManager
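For reference, the full command would then look something like this. Note that Redshift JDBC URLs normally take the form jdbc:redshift://host:port/database, so the semicolon-separated parameters from the original command are moved to --username/--password here; host, credentials, and paths are the question's placeholders:

sqoop-import -Dmapreduce.job.user.classpath.first=true \
  --driver com.amazon.redshift.jdbc42.Driver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --connect "jdbc:redshift://redshiftinstance.us-east-1.redshift.amazonaws.com:9999/ourDB" \
  --username username --password password \
  --table ourtable -m 1 --as-avrodatafile \
  --target-dir s3n://bucket/folder/folder1/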
I can't help with the error, but I recommend you not do it this way. Sqoop will try to retrieve the table with a SELECT *, and all results will have to pass through the leader node. This will be much slower than using UNLOAD to export the data directly to S3 in parallel. You can then convert the unloaded text files to Avro using Sqoop.
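As a sketch of that approach, an UNLOAD along these lines exports the table from the compute nodes in parallel (the bucket path and IAM role are placeholders):

UNLOAD ('SELECT * FROM ourtable')
TO 's3://bucket/folder/folder1/ourtable_'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-unload-role>'
DELIMITER '|' GZIP;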
Related
I have created a visual job in AWS Glue where I extract data from Snowflake and then my target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and Postgres, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then take that CSV and upload it to Postgres.
However, when I try to get data from Snowflake and push it directly to Postgres, I get the error below:
o110.pyWriteDynamicFrame. null
So it means that you can get the data from Snowflake into a DataFrame, and the failure happens while writing the data from this DataFrame to Postgres.
You need to check the AWS Glue logs to get a better understanding of why the write to Postgres is failing.
Please check whether you have the right version of the JARs (needed by Postgres) compatible with the Scala/Spark version on the AWS Glue side.
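If the visual job keeps failing with that opaque null, it can also help to drop into a script job and do the write explicitly, so the underlying exception surfaces in the logs. A minimal sketch, assuming glueContext already exists and dyf is the DynamicFrame you read from Snowflake (host, database, and credentials are placeholders):

# Write the Snowflake DynamicFrame to Postgres over plain JDBC
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://<host>:5432/<db>",
        "user": "<user>",
        "password": "<password>",
        "dbtable": "<schema>.<table>",
    },
)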
I downloaded the PostgreSQL .dmp file from the ChEMBL database.
I want to import this into GCP Cloud SQL.
When I run it with the console and gcloud command, I get the following error:
Importing data into Cloud SQL instance...failed.
ERROR: (gcloud.sql.import.sql) [ERROR_RDBMS] exit status 1
The input is a PostgreSQL custom-format dump.
Use the pg_restore command-line client to restore this dump to a database.
Can I import custom-format dmp files without using the pg_restore command?
https://cloud.google.com/sql/docs/postgres/import-export/importing
That page describes pg_restore, but I couldn't get it to work.
For custom-format files, is it necessary to run pg_restore after uploading them to Cloud Shell?
According to the CloudSQL docs:
Only plain SQL format is supported by the Cloud SQL Admin API.
The custom format is allowed if the dump file is intended for use with pg_restore.
If you cannot use pg_restore for some reason, I would spin up a local Postgres instance (i.e., on your laptop) and use pg_restore to restore the database.
After loading into your local database, you can use pg_dump to dump to file in plaintext format, then load into CloudSQL with the console or gcloud command.
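Something like the following, assuming a local Postgres is running and the dump file is chembl.dmp (bucket and instance names are placeholders):

# Restore the custom-format dump into a local database
createdb chembl
pg_restore --no-owner -d chembl chembl.dmp

# Dump it back out in plain SQL format
pg_dump --format=plain --no-owner chembl > chembl_plain.sql

# Upload to GCS and import into Cloud SQL
gsutil cp chembl_plain.sql gs://<your-bucket>/
gcloud sql import sql <your-instance> gs://<your-bucket>/chembl_plain.sql --database=chembl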
I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR already downloaded on the master and slave nodes, or add these as arguments to a spark-submit job.
Since I want to use my existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.4.1207.jre6.jar) into it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create dataframe in Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a dataframe from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
    .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
    'database' : <db>,
    'dbtable' : '(select * from <table>) as t'}  # a subquery passed as dbtable needs an alias
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path while submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class name is org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that.
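Putting that together with the snippet from the question, the corrected call would look roughly like this (the URL and table name are the question's placeholders):

SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"

# org.postgresql.Driver is the class shipped in the PostgreSQL JDBC JAR
df = spark.read.jdbc(SQL_CONN, table="someTable",
                     properties={"driver": "org.postgresql.Driver"})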
As the title says, I want to know whether there is something similar to MySQL's trigger function. What I actually want to do is import data from IBM Netezza databases using Sqoop incremental mode. Below is the Sqoop script I'm going to use.
sqoop job --create dhjob01 -- import --connect jdbc:netezza://10.100.3.236:5480/TEST \
--username admin --password password \
--table testm \
--incremental lastmodified \
--check-column 'modifiedtime' --last-value '1995-07-18' \
--target-dir /user/dhlee/nz_sqoop_test \
-m 1
As the official Sqoop documentation says, I can gather data from RDBMSs in incremental mode by creating a Sqoop import job and executing it repeatedly.
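For example, once the job above has been created, each subsequent run is just the following, and Sqoop updates the saved --last-value in its metastore after every successful run:

sqoop job --exec dhjob01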
Anyway, the point is: I need something like a MySQL trigger, so that the modified date is updated whenever tables in Netezza are updated. And if you have any great idea for gathering the data incrementally, please tell me. Thank you.
Unfortunately there isn't anything similar to triggers available. I would recommend modifying the relevant UPDATE commands to also set a column to CURRENT_TIMESTAMP.
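For example, if the table carries a modifiedtime column like the one in your Sqoop job, every application UPDATE would be amended along these lines (the column names other than modifiedtime are placeholders):

UPDATE testm
SET some_col = 'new value',
    modifiedtime = CURRENT_TIMESTAMP
WHERE id = 42;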
In Netezza you have something even better:
- Deleted records are still possible to see: http://dwgeek.com/netezza-recover-deleted-rows.html/
- the INSERT and DELETE transaction IDs (the hidden CREATEXID/DELETEXID columns) are rising numbers (and visible on all records, as described above)
- updates are really a delete plus an insert
Can you follow me?
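As a sketch of how to look at this yourself (following the linked article; syntax from memory, so treat it as an assumption to verify):

-- Make deleted (not yet groomed) rows visible in this session
SET show_deleted_records = TRUE;

-- DELETEXID is 0 for live rows and holds the deleting
-- transaction id for deleted ones
SELECT createxid, deletexid, *
FROM testm
WHERE deletexid <> 0;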
[Screenshot] This is the screenshot I got after I inserted and deleted some rows.
It's been an hour since the Sqoop import started and it still hasn't completed.
command:
sqoop import --connect jdbc:mysql://localhost/testDb --username root -P --table student
I had been working with Sqoop imports for a week, and when I faced this earlier, I cleared out the temporary files and it worked fine. Now the issue is occurring again.
Writing jar file: /tmp/sqoop-bigdata/compile/5c******/student.jar
MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
INFO mapreduce.ImportJobBase: Beginning import of student
There are a few places Sqoop can get stuck: scheduling in YARN or MRv1, a slow source database, or a slow destination. I'd check your MR/YARN logs to see if the job progresses. Also, reach out to the Sqoop mailing lists for more help.
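For example, to see whether the MapReduce job was ever accepted and whether it is making progress (the application id is a placeholder):

# List applications currently known to YARN
yarn application -list

# Pull the logs for the stuck application
yarn logs -applicationId application_1234567890123_0001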