RDD to DataFrame conversion using toDF() is giving an error - PySpark

My code is:
```
sample1 = df_pat_jour_status_other_occurances.rdd.map(lambda x: (x.lh_pat_id, x.src_key, x.Journey_Status)).toDF()
type(sample1)
```
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1010.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1010.0 (TID 32154, LTIN214271.cts.com, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
```

Cause:
The worker (slave) and driver are using different Python versions in your case.
Fix:
Install Python 3.8 on the worker nodes.
Modify the spark/conf/spark-env.sh file and point PYSPARK_PYTHON at the Python 3.8 executable, e.g. PYSPARK_PYTHON=/usr/local/bin/python3.8.
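If editing spark-env.sh is not convenient, here is a minimal, self-contained sketch of the same idea done from the job itself. The interpreter path is an assumption (adjust it to wherever Python 3.8 lives), and a small stand-in DataFrame replaces df_pat_jour_status_other_occurances so the snippet runs on its own. What matters is setting the environment variables before the SparkSession is created, since in local mode the Python workers are launched from the driver's environment.
```
# Sketch only: the interpreter path below is an assumption, and the
# DataFrame is a small stand-in for df_pat_jour_status_other_occurances.
import os

# Pin the same interpreter for the driver and the Python workers
# before any SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.8"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.8"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df-check").getOrCreate()

# Stand-in for df_pat_jour_status_other_occurances from the question.
df_pat_jour_status_other_occurances = spark.createDataFrame(
    [(1, "k1", "ACTIVE"), (2, "k2", "CLOSED")],
    ["lh_pat_id", "src_key", "Journey_Status"],
)

# Same round trip as in the question; naming the columns in toDF()
# keeps the resulting schema readable.
sample1 = (
    df_pat_jour_status_other_occurances.rdd
    .map(lambda x: (x.lh_pat_id, x.src_key, x.Journey_Status))
    .toDF(["lh_pat_id", "src_key", "Journey_Status"])
)
sample1.show()
```
Once the driver and the workers report the same Python version, the original toDF() call should go through.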

Related

Error while connecting Snowflake to Glue using custom JDBC connector and connection?

I am trying to connect AWS Glue with Snowflake by using a JDBC custom connector and connection. However, after creating the connection and running my job, calling the toDF() method to convert the dynamic frame to a PySpark DataFrame gives the following error:
File "/tmp/Snowflake_Test_Job.py", line 22, in <module>
df1 = Snowflake_New_Connector_node1.toDF()
File "/opt/amazon/lib/python3.7/site-packages/awsglue/dynamicframe.py", line 147, in toDF
return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/lib/python3.7/site-packages/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o95.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (179.39.34.49 executor 2): org.apache.spark.util.TaskCompletionListenerException: null
Previous exception in task: Glue ETL Marketplace: JDBC driver failure, driverClassName: net.snowflake.client.jdbc.SnowflakeDriver, error message: No suitable driver found for rendered JDBC URL
The URL I am using follows the Snowflake documentation.
I am using the following JDBC jar file: snowflake-jdbc-3.13.22.jar.
Not sure what is going wrong here; if someone has faced a similar problem and found a solution, please share.
I tried creating the connection with other JAR files as well, but I get the same error. I also tried creating the connection from the Glue job code, with the same result. The ideal result would have been execution of the query on Snowflake.
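For context, here is a hedged sketch of how a dynamic frame is typically created through a Glue custom JDBC connector. The connection name, JDBC URL, and table below are placeholders rather than values from the question, so treat this as an illustration of where the driver class and URL come from, not as the asker's job.
```
# Sketch only: connection name, URL, and table are hypothetical placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# With a custom/marketplace JDBC connector, the driver class and the JDBC URL
# are supplied through connection_options; Glue renders the final URL from
# these options before handing it to the driver.
dyf = glue_ctx.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "connectionName": "snowflake-connection",  # hypothetical Glue connection
        "className": "net.snowflake.client.jdbc.SnowflakeDriver",
        "url": "jdbc:snowflake://<account>.snowflakecomputing.com/?db=MYDB&warehouse=MYWH",
        "dbTable": "MY_SCHEMA.MY_TABLE",
    },
    transformation_ctx="Snowflake_New_Connector_node1",
)

df1 = dyf.toDF()  # the call that fails in the question
df1.show()
```
A rendered URL that does not start with jdbc:snowflake:// is one common way to end up with "No suitable driver found", since DriverManager matches drivers by URL prefix.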

Getting java.lang.OutOfMemoryError while running app:connectedAndroidTest to execute Flutter integration tests

Error details below:
```
Execution optimizations have been disabled for task ':app:compressDevelopmentDebugAssets' to ensure correctness due to the following reasons:
- Gradle detected a problem with the following location: 'C:\Users\LAP\Documents\myapp\build\app\intermediates\merged_assets\developmentDebug\out'. Reason: Task ':app:compressDevelopmentDebugAssets' uses this output of task ':app:copyFlutterAssetsDevelopmentDebug' without declaring an explicit or implicit dependency. This can lead to incorrect results being produced, depending on what order the tasks are executed. Please refer to https://docs.gradle.org/7.4/userguide/validation_problems.html#implicit_dependency for more details about this problem.
> Task :app:compressDevelopmentDebugAssets FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':app:compressDevelopmentDebugAssets'.
> A failure occurred while executing com.android.build.gradle.internal.tasks.CompressAssetsWorkAction
> java.lang.OutOfMemoryError (no error message)
```
I have already added the following in my build.gradle:
```
dexOptions {
    javaMaxHeapSize "4G"
}
```
The end goal is to be able to run my integration tests on Firebase Test Lab. I am following the steps from this doc shared by Firebase:
https://github.com/flutter/flutter/tree/main/packages/integration_test#android-device-testing

Spark "Task serialization failed: java.lang.OutOfMemoryError"

I am trying to write 300 rows to Parquet files, where each row contains a BinaryType column of length ~14,000,000.
As a result I get this exception:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.OutOfMemoryError java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
```
But this works when I try to write 100 or 200 rows with the same values.
spark-submit configuration:
```
<param>--driver-memory 20G</param>
<param>--num-executors 100</param>
<param>--executor-memory 20G</param>
<param>--executor-cores 3</param>
<param>--driver-cores 3</param>
<param>--conf spark.driver.maxResultSize=10G</param>
<param>--conf spark.executor.memoryOverhead=10G</param>
<param>--conf spark.driver.memoryOverhead=10G</param>
<param>--conf spark.network.timeout=1200000</param>
<param>--conf spark.memory.offHeap.enabled=true</param>
<param>--conf spark.sql.files.maxRecordsPerFile=2</param>
```
I really can't understand how 14 MB per row can throw an OutOfMemoryError when trying to write. If I run show() on the final DataFrame, it works without errors.
Using persist(StorageLevel.MEMORY_AND_DISK_SER) on the final DataFrame also didn't help.
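For reference, here is a minimal reconstruction of the kind of write being described, under assumptions: the output path is made up, and the ~14 MB payload is generated on the executors with a UDF rather than shipped from the driver.
```
# Sketch only: the output path is hypothetical; sizes mirror the question
# (300 rows, ~14,000,000 bytes of binary data per row).
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("large-binary-write").getOrCreate()
spark.conf.set("spark.sql.files.maxRecordsPerFile", 2)  # as in the submit conf

# Generate the payload on the executors so the driver never has to
# serialize the binary values into the tasks it ships out.
make_payload = F.udf(lambda _: bytes(14_000_000), BinaryType())
df = spark.range(300).withColumn("payload", make_payload("id"))

df.write.mode("overwrite").parquet("/tmp/large_binary_parquet")
```
Since the stack trace points at task serialization on the driver, anything that embeds the 14 MB value into the task itself (for example a local collection passed to createDataFrame, or a lit() binary column) is worth ruling out.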

Spark error when running TPCDS benchmark datasets - Could not find dsdgen

I'm trying to build the TPCDS benchmark datasets by following this website:
https://xuechendi.github.io/2019/07/12/Prepare-TPCDS-For-Spark
When I run this:
```
[troberts@master1 spark-sql-perf]$ spark-shell --master yarn --deploy-mode client --jars /home/troberts/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.1-SNAPSHOT.jar -i TPCDPreparation.scala
```
I get the error below. I'm wondering if it's something to do with permissions, as the file dsdgen definitely exists at that location on each of the worker nodes (/home/troberts/spark-sql-perf/tpcds-kit/tools):
```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 0.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, worker1.mycluster.com, executor 1): java.lang.RuntimeException: Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen. Run install
at scala.sys.package$.error(package.scala:27)
```
Any ideas appreciated.
Cheers
Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen

You need to have the TPCDS kit installed first.
From the spark-sql-perf docs for the tool you've used:
Before running any query, a dataset needs to be set up by creating a Benchmark object.
Generating the TPCDS data requires dsdgen built and available on the machines.
We have a fork of dsdgen that you will need.
The fork includes changes to generate TPCDS data to stdout, so that this library can pipe them directly to Spark, without intermediate files.
Therefore, this library will not work with the vanilla TPCDS kit.
The TPCDS kit needs to be installed on all cluster executor nodes under the same path!
Please configure the TPCDS toolkit from the Databricks tpcds-kit fork.

How do I read sequence data in Scala in Spark

This is my first time attempting to read sequence-format data in Scala; it would be greatly appreciated if someone could help me with the right command.
data:
```
hdfs dfs -cat orders03132_seq/part-m-00000 | head
SEQ!org.apache.hadoop.io.LongWritableordeG�Y���&���]E�#��
```
My command:
```
sc.sequenceFile("orders03132_seq/part-m-00000", classOf[Int], classOf[String]).first
```
Error:
```
18/03/13 16:59:28 ERROR Executor: Exception in task 0.0 in stage 1.0
(TID 1) java.lang.RuntimeException: java.io.IOException: WritableName
can't load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
```
Thank you very much in advance.
You would need to read it as a Hadoop file. You can do this with something like:
```
sc.hadoopFile[K, V, SequenceFileInputFormat[K, V]]("path/to/file")
```
Refer to the SparkContext.hadoopFile documentation for details.