I'm trying to write data to Redshift using PySpark.
When I created the session, I could read from Redshift and S3, which is what I wanted.
However, I got an error when trying to write back to Redshift.
Do you know what I should change in the script?
Here is how I defined the session:
import os
from pyspark.sql import SparkSession

# `jars` is a list of local jar paths (including the Redshift JDBC driver) defined elsewhere in the script.
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .config("spark.driver.extraClassPath", ":".join(jars)) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.awsAccessKeyId", os.environ['AWS_ACCESS_KEY_ID']) \
    .config("spark.hadoop.fs.s3a.awsSecretAccessKey", os.environ['AWS_SECRET_ACCESS_KEY']) \
    .config("spark.hadoop.fs.s3a.path.style.access", True) \
    .config("com.amazonaws.services.s3.enableV4", True) \
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()
Reading from Redshift works fine:
test = spark.read.format("jdbc") \
    .option("url", "jdbc:redshift://host.redshift.amazonaws.com:5439/db") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("dbtable", "schema.table") \
    .option("user", DWH_DB_USER) \
    .option("password", DWH_DB_PASSWORD) \
    .load()
But I get an error when trying to run this:
test.write.format("jdbc") \
    .option("url", "jdbc:redshift://host.redshift.amazonaws.com:5439/db") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("dbtable", "schema.table") \
    .option("user", DWH_DB_USER) \
    .option("password", DWH_DB_PASSWORD) \
    .mode('append') \
    .save()
The error I get is:
Py4JJavaError: An error occurred while calling o412.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only;
at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
at com.amazon.
......
This seems to be a bug where the JDBC driver sets your session as read-only seemingly at random.
My workaround is to add the following at the beginning of the query:
set ReadOnly = 0;
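If prefixing the query is awkward to wire into a DataFrame write, here is a minimal sketch of the same idea at the connection level, assuming the Amazon Redshift JDBC driver honours ReadOnly as a connection property when Spark's JDBC source forwards it (an assumption on my side, not part of the original workaround):
# Hedged sketch: forward ReadOnly=false as a JDBC connection property so the
# driver does not open the session in read-only mode (assumes the driver
# accepts this property when passed through Spark's JDBC options).
test.write.format("jdbc") \
    .option("url", "jdbc:redshift://host.redshift.amazonaws.com:5439/db") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("dbtable", "schema.table") \
    .option("user", DWH_DB_USER) \
    .option("password", DWH_DB_PASSWORD) \
    .option("ReadOnly", "false") \
    .mode('append') \
    .save()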
Related
I'm trying to write a simple DataFrame in parquet format to Azure Blob Storage.
Note that the following code snippets work locally, so my guess is that the problem is related to the Azure libraries. I also tried the delta format and it works (even though it uses parquet under the hood).
Using Spark 3.1.1, Scala 2.12.10, OpenJDK 1.8.0_292.
I set up my Spark session as usual, something like:
$SPARK_HOME/bin/spark-shell \
(...cluster settings...) \
--conf spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net="${AZURE_BLOB_STORAGE_KEY}" \
--conf spark.hadoop.fs.AbstractFileSystem.wasb.impl=org.apache.hadoop.fs.azure.Wasb \
--conf spark.hadoop.fs.wasb.impl=org.apache.hadoop.fs.azure.NativeAzureFileSystem \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore \
--packages org.apache.hadoop:hadoop-azure:2.7.0,com.azure:azure-storage-blob:12.8.0,com.azure:azure-storage-common:12.8.0,com.microsoft.azure:azure-storage:2.0.0,io.delta:delta-core_2.12:0.8.0
(...other irrelevant settings...)
I tried other versions of the azure-storage-blob, azure-storage-common and azure-storage packages, all resulting in the same problem.
To reproduce the problem I create a simple DataFrame and write it to the storage:
val columns = Seq("language", "users_count")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val rdd = spark.sparkContext.parallelize(data)
val df = spark.createDataFrame(rdd).toDF(columns: _*)
df.show
// +--------+-----------+
// |language|users_count|
// +--------+-----------+
// | Java| 20000|
// | Python| 100000|
// | Scala| 3000|
// +--------+-----------+
df.write.parquet("wasb://<container>@<account>.blob.core.windows.net/<path>")
When writing in parquet format I get the com.microsoft.azure.storage.StorageException: One of the request inputs is not valid exception:
21/09/21 13:38:14 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 83) (10.244.6.3 executor 6): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2482)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:424)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1997)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:531)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:502)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:260)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:280)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
... 9 more
Caused by: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:162)
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:177)
at com.microsoft.azure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:764)
at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:399)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2449)
... 20 more
Any hints or ideas on what is causing it or what to do to make it work?
Thank you!
Two things helped me write to Azure storage using the WASB protocol:
the storage should be Data Lake Gen1 (I tried Gen2 and it failed)
you need to add the following dependency/jar:
<dependency>
    <groupId>org.codehaus.jackson</groupId>
    <artifactId>jackson-mapper-lgpl</artifactId>
    <version>1.9.13</version>
</dependency>
I had to use WASB because my version of Hadoop (2.9.2) does not support ABFS, but if you have Hadoop 2.10.1+, please use ABFS instead.
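For reference, a hedged sketch of pulling that Jackson artifact in through --packages instead of shipping the jar by hand, using the coordinates from the dependency above (the package list is abbreviated; keep whatever else you already pass):
$SPARK_HOME/bin/spark-shell \
  (...cluster settings...) \
  --packages org.apache.hadoop:hadoop-azure:2.7.0,com.microsoft.azure:azure-storage:2.0.0,io.delta:delta-core_2.12:0.8.0,org.codehaus.jackson:jackson-mapper-lgpl:1.9.13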
I want to read data from PostgreSQL using JDBC and store it in a PySpark DataFrame. When I try to preview the data with methods like df.show() or df.take(), they return an error saying caused by: java.lang.ClassNotFoundException: org.postgresql.Driver. But df.printSchema() returns the DB table's schema perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("spark://spark-master:7077")
.appName("read-postgres-jdbc")
.config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
.config("spark.executor.memory", "1g")
.getOrCreate()
)
sc = spark.sparkContext
df = (
spark.read.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://postgres/postgres")
.option("table", 'public."ASSET_DATA"')
.option("dbtable", _select_sql)
.option("user", "airflow")
.option("password", "airflow")
.load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed on JupyterLab, in a Docker container separate from the standalone Spark cluster.
You are not using the proper option.
Reading the doc, you see this:
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. That is why fetching the schema works: it is an action done on the driver side. But when you run a Spark command, that command is executed by the workers (or executors), and they also need the .jar to access Postgres.
If your Postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, you can add it to the workers using spark.jars (I know the MySQL driver does not require dependencies, for example, but I never tried Postgres). If it does need dependencies, it is better to pull the package directly from Maven using the spark.jars.packages option (see the linked doc for help).
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"
So that the jar is included for your executors as well.
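For completeness, a hedged sketch of the spark.jars.packages route, letting Maven resolve the driver for both the driver and the executors (the coordinate matches the jar version already in /opt/workspace; adjust it to your setup):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    # Resolve the JDBC driver from Maven so it reaches the driver and every executor.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)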
I am trying to extract data from Redshift and insert it into S3 using a newly created Glue table on that S3 location.
Versions:
PySpark 2.4.0
EMR emr-5.21.0
My write looks something like this:
date_filtered_df.coalesce(int(args.numpartitions)) \
.write \
.mode("overwrite") \
.format("parquet") \
.insertInto("{}.{}_stg".format(args.database, args.table))
The table is newly created just before this insert and it points to a completely empty location.
But the code is erroring with:
19/12/09 14:37:48 WARN TaskSetManager: Lost task 576.1 in stage 1.0 (TID 3125, ip-172-31-31-203.ec2.internal, executor 401): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://bucketname/trusted/databasename/dw/2019-12-08/tbalename/.hive-staging_hive_2019-12-09_12-49-15_136_2979557082816709535-1/-ext-10000/sales_dt=20051130/biz_unit_code=CS/geo_code=AMER/part-00576-84bcde9a-4fc0-402b-8aab-8d71e06f8c43.c000
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more
Any help is appreciated.
I am trying to write a DataFrame to Cassandra using PySpark, but it's throwing me an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o74.save.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 3.0 failed 4 times, most recent failure: Lost task 6.3
in stage 3.0 (TID 24, ip-172-31-11-193.us-west-2.compute.internal,
executor 1): java.lang.NoClassDefFoundError:
com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:153)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:209)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Below is my code for the write:
DataFrame.write.format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="student1", keyspace="university") \
    .save()
I have added the below-mentioned Spark Cassandra connector in spark-defaults.conf:
spark.jars.packages datastax:spark-cassandra-connector:2.4.0-s_2.11
I am able to read data from Cassandra; the issue is only with the write.
I am not an expert on Spark, but this might help:
These errors are commonly thrown when the Spark Cassandra Connector or
its dependencies are not on the runtime classpath of the Spark
Application. This is usually caused by not using the prescribed
--packages method of adding the Spark Cassandra Connector and its dependencies to the runtime classpath.
Source:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#why-cant-the-spark-job-find-spark-cassandra-connector-classes-classnotfound-exceptions-for-scc-classes
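As a hedged sketch of that prescribed --packages approach (the script name below is a placeholder, and the Maven Central coordinate is for Spark 2.4 / Scala 2.11; adjust it to your cluster): submitting this way lets Ivy resolve the connector together with its transitive dependencies, which should include the com.twitter jsr166e jar that provides LongAdder.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  your_write_job.py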
I am trying to use the Hortonworks HBase connector for Spark 2.0 to work with HBase (https://github.com/hortonworks-spark/shc/tree/v1.1.0-2.0).
Using the example provided at the above link:
val spark = SparkSession
.builder()
.appName(getClass.toString)
.getOrCreate()
def withCatalog(cat: String, spark: SparkSession): DataFrame = {
spark
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df = withCatalog(cat, spark)
df.printSchema()
df.show(20, false)
Schema:
val cat =
s"""{
|"table":{"namespace":"test", "name":"test_src_data", "tableCoder":"PrimitiveType"},
|"rowkey":"tfkod_description",
|"columns":{
|"col0":{"cf":"rowkey", "col":"tfkod_description", "type":"string"},
|"src_stream_desc":{"cf":"src_data", "col":"src_desc", "type":"string"}
|}
|}""".stripMargin
After I do spark2-submit, the job runs and prints only the schema. Later all the executors exit and it is stuck forever.
Last message in the log:
Existing executor 41 has been removed (new total is 1)
But I can successfully work with HBase in a sequential way, i.e. Put or BulkPut, just not in the RDD or DataFrame way (with any HBase connector) in Spark.
Is there anything wrong in the HBase/Spark config that keeps the Spark executors from working in parallel? Or is something missing on the worker nodes?
Error Message from Worker:
19/05/13 11:36:44 ERROR ipc.AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:642)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:166)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:769)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:766)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:766)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:920)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:889)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1222)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 25 more
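Purely as a hedged sketch, not a confirmed fix for this setup: the trace says the executors have no Kerberos TGT, and on YARN one common way to make credentials and the HBase client configuration available to the executors is to submit with a principal/keytab and ship hbase-site.xml alongside the job. The principal, keytab path and jar name below are placeholders:
spark2-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal your_user@YOUR.REALM \
  --keytab /path/to/your_user.keytab \
  --files /etc/hbase/conf/hbase-site.xml \
  your-shc-app.jar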