Reading url via pyspark in Databricks notebook - pyspark

I am unable to read the content of a URL via pySpark in Databricks Notebooks(Version 8.3, Spark 3.1.1). I have tried almost all the possibilities but unable to find out the exact problem. Here is my code.
from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()
Here is the error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
I have referred reading data from URL using spark databricks platform as an example. Did anyone face the similar problem?

This is the best i've found from youtube pyspark for everyone playlist
!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" >> 8028d38a.tps

As workaround , we can read respective location panda dataframe and covert into pyspark dataframe for further process .
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)
Screen print :
If you want to skip first row if that is invalid one ,

Related

Apache Zeppelin Can't Write Deltatable to Spark

I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin:
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
Which yields this output (truncated to omit repeat output):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 192.168.64.3, executor 2): java.io.FileNotFoundException: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
...
I'm unable to figure out why this is happening at all as I'm too unfamiliar with Spark. Any tips? Thanks for your help.

Spatial with SparkSQL/Python in Synapse Spark Pool using apache-sedona?

I would like to run spatial queries on large data sets; e.g. geopandas would be too slow.
Inspiration I found here: https://anant-sharma.medium.com/apache-sedona-geospark-using-pyspark-e60485318fbe
In Spark Pool of Synapse Analytics I prepared (via Azure Portal):
Apache Spark Pool / Settings / Packages / Requirement files:
requirement.txt:
azure-storage-file-share
geopandas
apache-sedona
Apache Spark Pool / Settings / Packages / Workspace packages:
geotools-wrapper-geotools-24.1.jar
sedona-sql-3.0_2.12-1.2.0-incubating.jar
Apache Spark Pool / Settings / Packages / Spark configuration
config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
In Pyspark Notebook
print(spark.version)
print(spark.conf.get("spark.kryo.registrator"))
print(spark.conf.get("spark.serializer"))
The output was:
3.1.2.5.0-58001107
org.apache.sedona.core.serde.SedonaKryoRegistrator
org.apache.spark.serializer.KryoSerializer
Then I tried:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
spark = SparkSession.builder.master("local[*]").appName("Sedona App").config("spark.serializer", KryoSerializer.getName).config("spark.kryo.registrator", SedonaKryoRegistrator.getName).getOrCreate()
SedonaRegistrator.registerAll(spark)
But it failed:
Py4JJavaError: An error occurred while calling o636.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
A simple check that stuff is correctly installed would probaly allow this:
%%sql
SELECT ST_Point(0,0);
Please help with getting the spatial functions registered in pyspark running in Synapse notebook!
As per the repro from my end, I'm able to successfully run the above commands without any issue.
I just installed the requirement.txt file contains apache-sedona and downloaded below two jar files:
sedona-python-adapter-3.0_2.12–1.0.0-incubating.jar
geotools-wrapper-geotools-24.0.jar
Note: config.txt file is not required.

Spark read failed due to duplicate column

I am trying to read parquet file from S3 in databricks, using scala.
below is the simple read code
val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)
MountName is the dbfs where data is mounted from S3.
But I am getting error which is due to duplicate key in file.
SparkException: Job aborted due to stage failure: Task 0 in stage 813.0 failed 4 times, most recent failure: Lost task 0.3 in stage 813.0 (TID 79285, 10.179.245.218, executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/Alibaba_data/tstamp=2020_03_25/ts-1585154320710.parquet.gz.
Caused by: java.lang.RuntimeException: Found duplicate field(s) "subtype": [subtype, subType] in case-insensitive mode
Now i need to overcome it. May be making the read case sensitive or by dropping the column while read, or by any other means if suggested.
Suggestion please.
Try with case sensitivity enabled.
spark.sql.caseSensitive should be set to true.

'Unsupported encoding: DELTA_BYTE_ARRAY' while writing parquet data to csv using pyspark

I want to convert parquet files in binary format to csv files. I am using the following commands in spark.
sqlContext.setConf("spark.sql.parquet.binaryAsString","true")
val source = sqlContext.read.parquet("path to parquet file")
source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")
This works when i start spark in HDFS server and run these commands. When I try copying the same parquet file to my local system and start pyspark and run these commands it is giving error.
I am able to set binary as string property to true and able to read parquet files in my local pyspark. But when I execute the command to write to csv, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize
native-zlib library 2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding:
DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:577)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:627)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:47)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:550)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:536)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:164)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error in local machine as the same works in hdfs? Any idea to resolve this would be of great help. Thank you.
You can try disabling the VectorizedReader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
This is not a solution but it is a workaround.
Consequences of disabling it will be https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
Problem:
Getting an exception in Spark 2.x reading parquet files where some columns are DELTA_BYTE_ARRAY encoded.
Exception:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
Solution:
If turn off the vectorized reader property, reading these files works fine.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Explanation:
These files are written with the Parquet V2 writer, as delta byte array encoding is a Parquet v2 feature. The Spark 2.x vectorized reader does not appear to support that format.
Issue already created on apache’s jira. To solve this particular work around.
Cons of using this solution.
Vectorized Query Execution could have big performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution could streamline operations by processing a batch of rows at a time. But spark 2.x doesn’t support this feature for parquet version two so we need to rely on this solution until further releases.
Adding these 2 flags helped me overcome the error.
parquet.split.files false
spark.sql.parquet.enableVectorizedReader false

apache zeppelin fails on reading csv using pyspark

I'm using Zeppelin-Sandbox 0.5.6 with Spark 1.6.1 on Amazon EMR.
I am reading csv file located on s3.
The problem is that sometimes I'm getting error reading the file. I need to restart the interpreter several times until it works. nothing in my code changes. I can't restore it, and can't tell when it's happening.
My code goes as following:
defining dependencies:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.4.0")
using spark-csv:
%pyspark
import pyspark.sql.functions as func
df = sqlc.read.format("com.databricks.spark.csv").option("header", "true").load("s3://some_location/some_csv.csv")
error msg:
Py4JJavaError: An error occurred while calling o61.load. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-172-22-2-187.ec2.internal):
java.io.InvalidClassException: com.databricks.spark.csv.CsvRelation;
local class incompatible: stream classdesc serialVersionUID =
2004612352657595167, local class serialVersionUID =
6879416841002809418
...
Caused by: java.io.InvalidClassException:
com.databricks.spark.csv.CsvRelation; local class incompatible
Once I'm reading the csv into the dataframe, the rest of the code works fine.
Any advice?
Thanks!
You need to execute spark adding the spark-csv package to it like this
$ pyspark --packages com.databricks:spark-csv_2.10:1.2.0
Now the spark-csv will be in your classpath