Spark read failed due to duplicate column - scala

I am trying to read a Parquet file from S3 in Databricks, using Scala.
Below is the simple read code:
val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)
MountName is the DBFS mount point where the data is mounted from S3.
But I am getting an error which is due to a duplicate key in the file:
SparkException: Job aborted due to stage failure: Task 0 in stage 813.0 failed 4 times, most recent failure: Lost task 0.3 in stage 813.0 (TID 79285, 10.179.245.218, executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/Alibaba_data/tstamp=2020_03_25/ts-1585154320710.parquet.gz.
Caused by: java.lang.RuntimeException: Found duplicate field(s) "subtype": [subtype, subType] in case-insensitive mode
Now I need to overcome this, maybe by making the read case sensitive, by dropping the column while reading, or by any other means you can suggest.
Suggestions please.

Try with case sensitivity enabled.
spark.sql.caseSensitive should be set to true.
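For example, the setting can be enabled on the session before re-reading the path (a minimal sketch based on the read code from the question):
// Treat "subtype" and "subType" as distinct columns instead of failing on the duplicate
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)
Once the read succeeds, the unwanted casing can also be dropped, e.g. df.drop("subType"), which covers the "dropping the column" option mentioned in the question.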

Related

Apache Zeppelin Can't Write Deltatable to Spark

I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin:
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
This yields the following output (truncated to omit repeated output):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 192.168.64.3, executor 2): java.io.FileNotFoundException: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
...
I'm unable to figure out why this is happening, as I'm not very familiar with Spark. Any tips? Thanks for your help.
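One thing worth checking (not confirmed here): file:/tmp/delta-table is a path on the local disk of each machine, so on a multi-node cluster the executors cannot see the _delta_log directory the driver wrote. A minimal sketch, assuming the cluster has a shared filesystem such as HDFS available (the path below is hypothetical):
// Write the Delta table to a location every executor can read, not the driver-local /tmp
val data = spark.range(0, 5)
data.write.format("delta").save("hdfs:///tmp/delta-table")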

Reading url via pyspark in Databricks notebook

I am unable to read the content of a URL via PySpark in Databricks notebooks (version 8.3, Spark 3.1.1). I have tried almost all the possibilities but am unable to find out the exact problem. Here is my code:
from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()
Here is the error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
I have referred to "reading data from URL using spark databricks platform" as an example. Has anyone faced a similar problem?
This is the best I've found, from the YouTube "PySpark for Everyone" playlist:
!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" >> 8028d38a.tps
As a workaround, we can read the URL into a pandas DataFrame and convert it into a PySpark DataFrame for further processing.
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)
If you want to skip the first row (if it is an invalid one), pass skiprows=1 to pd.read_csv.

ParquetDecodingException using pyspark

I saved a Parquet file, then loaded it and tried to join it with another DataFrame. I used the regular Parquet read/write methods.
Then I got the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 925 in stage 37.0 failed 1 times, most recent failure: Lost task 925.0 in stage 37.0 (TID 6376, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Can not read value at 113652 in block 0 in file file:/my_parquet_path/part-00031-b9b3442d-8459-4591-956c-9ef2299095cd-c000.snappy.parquet
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read page Page [bytes.size=1048635, valueCount=30917, uncompressedSize=1048635] in col [my_field_name, list, element] optional binary element (UTF8)
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
Any idea why this happens? How can I filter out the problematic rows?
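Not from the thread, but FAILED_TO_UNCOMPRESS(5) usually points at a corrupted Snappy page rather than bad rows, so one option is to let Spark skip files it cannot read while you isolate the bad one (shown here in Scala; note this drops every row in the corrupted file, not just the broken page):
// Skip files that fail to decode instead of aborting the whole job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Reading the file named in the stack trace on its own confirms whether it is the culprit
val suspect = spark.read.parquet("file:/my_parquet_path/part-00031-b9b3442d-8459-4591-956c-9ef2299095cd-c000.snappy.parquet")
suspect.count()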

Spark Sql to read from Hive orc partitioned table giving array out of bound exception

I have created an ORC table in Hive with partitions. The data is loaded into HDFS using Apache Pig in ORC format, and the Hive table is created on top of that. The partition columns are year, month and day. When I try to read that table using Spark SQL, I get an array out of bounds exception. Please find the code and error message below.
Code:
myTable = spark.table("testDB.employee")
myTable.count()
Error:
ERROR Executor: Exception in task 8.0 in stage 10.0 (TID 66)
java.lang.IndexOutOfBoundsException: toIndex = 47
The data types in this table are string, timestamp and double. When I try to select all the columns with a Spark SQL select statement, I get the class cast exception given below.
py4j.protocol.Py4JJavaError: An error occurred while calling o536.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 84, localhost, executor driver): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
After this I tried to cast to timestamp using the code snippet given below, but I still get the array out of bounds exception.
df2 = df.select('dt', unix_timestamp('dt', "yyyy-MM-dd HH:mm:ss").cast(TimestampType()).alias("timestamp"))
If you don't specify a partition filter, it can cause this problem. On my side, when I specify a date-between filter, it resolves this out of bounds exception.
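For example (a sketch in Scala; the partition column names are the ones mentioned above, and the filter values are placeholders):
// Restricting the scan to specific partitions avoids touching the files that trigger the exception
val myTable = spark.table("testDB.employee")
  .where("year = 2019 AND month = 1 AND day = 14")
myTable.count()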

MongoSpark E11000 error when writing to a MongoDB Replica Set

I am using a Spark 2 application which uses the following command from com.mongodb.spark.MongoSpark to write a DataFrame to a three-node MongoDB Replica Set:
// The real command is similar to this one, depending on the options
// set on the DataFrame and the DataFrameWriter object for the MongoDB configuration,
// such as the writeConcern
var df: DataFrameWriter[Row] = spark.sql(sql).write
.option("uri", theUri)
.option("database", theDatabase)
.option("collection", theCollection)
.option("replaceDocument", "false")
.mode("append")
[...]
MongoSpark.save(df)
The fact is that, although I am sure the source data (which comes from a Hive table) has a unique primary key, I get a duplicate key error while the Spark application is running:
2019-01-14 13:01:08 ERROR: Job aborted due to stage failure: Task 51 in stage 19.0 failed 8 times,
most recent failure: Lost task 51.7 in stage 19.0 (TID 762, mymachine, executor 21):
com.mongodb.MongoBulkWriteException: Bulk write operation error on server myserver.
Write errors: [BulkWriteError{index=0, code=11000,
message='E11000 duplicate key error collection:
ddbb.tmp_TABLE_190114125615 index: idx_unique dup key: { : "00120345678" }', details={ }}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:176)
at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:205)
[...]
I have tried setting the write concern to "3" or even "majority". Furthermore, the timeout has been set to 4/5 seconds, but sometimes this duplicate key error still appears.
I would like to know how to configure the load so as not to get duplicate entries when writing to the Replica Set.
Any suggestions? Thanks in advance!
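Not an answer from this thread, but one common source of this error is Spark retrying failed tasks (the log shows the task failed 8 times) and re-inserting documents that were already written. A hedged sketch that makes the write idempotent by deriving the Mongo _id from the unique key, so a retried task replaces the same document instead of inserting a new one ("primary_key" is a placeholder column name; replaceDocument is left at its default of true):
import org.apache.spark.sql.functions.col
import com.mongodb.spark.MongoSpark

// Use the business key as _id so retries overwrite the same document ("primary_key" is hypothetical)
val writer = spark.sql(sql)
  .withColumn("_id", col("primary_key"))
  .write
  .option("uri", theUri)
  .option("database", theDatabase)
  .option("collection", theCollection)
  .mode("append")

MongoSpark.save(writer)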