I have a very simple PySpark job that reads a Parquet file from a MinIO S3 bucket.
MinIO and the Jupyter Notebook are running in docker-compose.
spark = SparkSession.builder.getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "***********")
hadoop_conf.set("fs.s3a.secret.key", "***********")
hadoop_conf.set("fs.s3a.endpoint", "http://127.0.0.1:9000")
hadoop_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.multipart.size", "104857600")

measures = spark.read.parquet("s3a://measures/6200703043294113.parquet")
While running it, I get the following error:
Py4JJavaError: An error occurred while calling o153.parquet.
: java.lang.IllegalArgumentException
On the other hand, I can read the same Parquet file locally from the file system.
Am I missing something?
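For reference, the same S3A settings can also be supplied when the session is built, via the spark.hadoop.* prefix. A minimal sketch, assuming the credentials are placeholders, the hadoop-aws version matches the Hadoop build bundled with Spark, and the MinIO endpoint is reachable as the docker-compose service name "minio" (when Spark runs in a different container, 127.0.0.1 points at that container itself, not at MinIO):

from pyspark.sql import SparkSession

# Sketch only: "http://minio:9000" assumes a docker-compose service named "minio";
# the hadoop-aws version must match the Hadoop version shipped with Spark.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "***********")
    .config("spark.hadoop.fs.s3a.secret.key", "***********")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

measures = spark.read.parquet("s3a://measures/6200703043294113.parquet")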
Related
I'm trying to write a DataFrame to a BigQuery table. I have set up the SparkSession with the required parameters. However, at the moment of doing the write I get an error:
Py4JJavaError: An error occurred while calling o114.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
The code is the following:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark2 = SparkSession.builder \
    .config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
    .config("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json") \
    .getOrCreate()

spark2.conf.set("parentProject", "xyz")

data = spark2.createDataFrame(
    [
        ("AAA", 51),
        ("BBB", 23),
    ],
    ['codiPuntSuministre', 'valor']
)
spark2.conf.set("temporaryGcsBucket","bqconsumptions")
data.write.format('bigquery') \
.option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.option('table', 'consumptions.c1') \
.mode('append') \
.save()
df=spark2.read.format("bigquery").option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.load("consumptions.c1")
I don't get any error if I take the write out of the code, so the error comes when trying to write, and it may be related to the auxiliary bucket used to operate with BigQuery.
The error here suggests that Spark is not able to recognize the "gs" filesystem. You can use the configuration below to add support for it. This happens because when you write to BigQuery, the files are first loaded into a Google Cloud Storage bucket temporarily and then loaded into the BigQuery table.
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
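Putting it together with the session setup from the question, a minimal sketch of how the filesystem classes, the connector jars and the keyfile are typically wired in at session creation (the jar path, package version and keyfile are the ones from the question and stand in for your own):

from pyspark.sql import SparkSession

# Sketch only: jar path, connector version and keyfile path are placeholders.
spark2 = (
    SparkSession.builder
    .config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")
    # Register the GCS FileSystem implementations via the spark.hadoop.* prefix.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json")
    .getOrCreate()
)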
I'm trying to run a script in PySpark, using Dataproc.
The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.
The error I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat
I made sure I have all the jars, added some new jars as suggested in other similar posts. I also checked the SPARK_HOME variable.
Below you can see the code; the error appears when trying to instantiate table_data.
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
As pointed out in the example, you need to include the BigQuery connector jar when submitting the job.
Through Dataproc jobs API:
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
/path/to/your/script.py \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
or spark-submit from inside the cluster:
spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
/path/to/your/script.py
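Once the connector jar is on the classpath, the (key, JSON string) pairs returned by newAPIHadoopRDD can be parsed as in the original example. A minimal sketch, assuming the word and word_count columns of the public shakespeare sample table:

import json

# Each element of table_data is a (key, JSON string) pair; keep the parsed payload.
records = table_data.map(lambda kv: json.loads(kv[1]))

# Sketch only: 'word' and 'word_count' are columns of publicdata:samples.shakespeare.
word_counts = records.map(lambda rec: (rec['word'], int(rec['word_count'])))
print(word_counts.take(5))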
I saw this example code that overwrites a partition in Spark 2.3 really nicely:
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet"), the output is not written as parquet files but as .c000 files.
The compaction and the overwriting of the partition are working, but not the writing as parquet.
Full code here:
val sparkSession = SparkSession.builder //.master("local[2]")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("parquet.compression", "snappy")
  .enableHiveSupport() //can just comment out hive support
  .getOrCreate

sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")

val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterday's partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong

val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")

if (!dfPartition.take(1).isEmpty) {
  sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
  sparkSession.sql(s"msck repair table $tblName")
  Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
}
else {
  "echo invalid partition"
}
Here is the question where I got the suggestion to use this code: Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns, which is really nice. I can easily use it in many cases.
Using Scala 2.11, CDH 5.12, Spark 2.3.
Any suggestions?
The extension .c000 relates to the task/executor that wrote the file, not to the actual file format. The file could be Parquet and end with .c000, or .snappy, or .zip... To know the actual file format, run this command:
hdfs dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the HDFS path to your file. You will see some strange symbols, and you should see the PAR1 magic bytes right at the start if it is actually a Parquet file.
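For a programmatic check, a minimal sketch in Python that inspects the 4-byte magic number (Parquet files start and end with "PAR1"); the path is a placeholder and assumes the part file has been copied locally, e.g. with hdfs dfs -get:

# Sketch only: the path is a placeholder for a locally copied part file.
def is_parquet(path):
    with open(path, "rb") as f:
        return f.read(4) == b"PAR1"  # Parquet magic bytes

print(is_parquet("/tmp/filename.c000"))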
I am new to PySpark and am migrating my project to it. I am trying to read a CSV file from S3 and create a DataFrame out of it. The file name is assigned to the variable cfg_file, and I am using a key variable for reading from S3. I am able to do the same using pandas, but I get an AnalysisException when I read using Spark. I am using the boto library for the S3 connection.
df = spark.read.csv(StringIO.StringIO(Key(bucket,cfg_file).get_contents_as_string()), sep=',')
AnalysisException: u'Path does not exist: file:
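For context, spark.read.csv expects a path (or, in Spark 2.2+, an RDD of CSV lines), not the file contents as a string, which is why Spark complains about a non-existent local path. A minimal sketch of the two usual alternatives (the bucket name is a placeholder, and the first option assumes s3a credentials are configured as in the first question above):

# Sketch only: "my-bucket" is a placeholder bucket name.
# Option 1: let Spark read straight from S3 over s3a.
df = spark.read.csv("s3a://my-bucket/" + cfg_file, sep=",")

# Option 2: keep boto and hand Spark the downloaded lines as an RDD.
lines = Key(bucket, cfg_file).get_contents_as_string().splitlines()
df2 = spark.read.csv(spark.sparkContext.parallelize(lines), sep=",")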
I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._

spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()
But it fails at runtime with the below exception stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
Am I missing something? Why is Spark looking at my local file system and not at HDFS?
I am using Spark 2.0 on Hadoop HDFS 2.7.2 with Scala 2.11.
EDIT: Just one additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised:
https://issues.apache.org/jira/browse/SPARK-15899
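Until that is fixed, a commonly used workaround is to point spark.sql.warehouse.dir at a well-formed URI before the session is created. A minimal sketch in PySpark (the warehouse path is a placeholder; the same config key works from Scala):

from pyspark.sql import SparkSession

# Sketch only: the warehouse path is a placeholder. Passing a proper file:/// URI
# avoids the malformed "file:C:/..." path that SessionCatalog trips over on Windows.
spark = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
    .getOrCreate()
)

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs://localhost:54310/tmp/home/mycsv.csv")
)
df.show()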