I am new to PySpark. I am migrating my project to PySpark and am trying to read a CSV file from S3 and create a DataFrame out of it. The file name is assigned to the variable cfg_file, and I am using a Key object for reading from S3. I am able to do the same using pandas, but I get an AnalysisException when I read the file with Spark. I am using the boto library for the S3 connection.
df = spark.read.csv(StringIO.StringIO(Key(bucket,cfg_file).get_contents_as_string()), sep=',')
AnalysisException: u'Path does not exist: file:
I am getting the below error while trying to save the DataFrame as a CSV file in Cloudera:
scala> df.write.csv("/home/cloudera/Desktop/thakur2")
<console>:31: error: value csv is not a member of
org.apache.spark.sql.DataFrameWriter
df.write.csv("/home/cloudera/Desktop/thakur2")
The csv() method is not available on the DataFrameWriter class for Spark versions below 2.0.
If you are using a Spark version below 2.0 on Cloudera, then you should use the spark-csv library, for example as sketched below.
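A hedged sketch for Spark 1.x: launch spark-shell with the spark-csv package on the classpath (the coordinate assumes Scala 2.10 and spark-csv 1.5.0, so adjust for your build, e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.5.0), then write through the com.databricks.spark.csv data source:
scala> df.write.format("com.databricks.spark.csv").option("header", "true").save("/home/cloudera/Desktop/thakur2")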
I have an Azure Data Lake Gen1 and an Azure Data Lake Gen2 (Blob Storage with hierarchical namespace), and I am trying to create a Databricks notebook (Scala) that reads 2 files and writes a new file back into the Data Lake. In both Gen1 and Gen2 I am experiencing the same issue: the file name of the output CSV I have specified is getting saved as a directory, and inside that directory it's writing 4 files: committed, started, _SUCCESS, and part-00000-tid-
For the life of me, I can't figure out why it's doing this and not actually saving the CSV to the location.
Here's an example of the code I've written. If I do a .show() on the df_join dataframe, it outputs the correct-looking results, but the .write is not working correctly.
val df_names = spark.read.option("header", "true").csv("/mnt/datalake/raw/names.csv")
val df_addresses = spark.read.option("header", "true").csv("/mnt/datalake/raw/addresses.csv")
val df_join = df_names.join(df_addresses, df_names.col("pk") === df_addresses.col("namepk"))
df_join.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save("/mnt/datalake/reports/testoutput.csv")
If I understand your needs correctly, you just want to write the Spark DataFrame data to a single CSV file named testoutput.csv in Azure Data Lake, not to a directory named testoutput.csv containing partition files.
You cannot do this directly with Spark functions like DataFrameWriter.save, because the DataFrame writer writes data through the Hadoop FileSystem layer backed by Azure Data Lake, and that layer persists the output as a directory with your chosen name plus some partition files. Please see the HDFS documentation, such as The Hadoop FileSystem API Definition, for details.
In my experience, you can instead use the Azure Data Lake SDK for Java from within your Scala program to write the data from the DataFrame to Azure Data Lake as a single file. You can refer to these samples: https://github.com/Azure-Samples?utf8=%E2%9C%93&q=data-lake&type=&language=java.
The reason why it's creating a directory with multiple files is that each partition is saved and written to the data lake individually. To save a single output file you need to repartition your dataframe.
Let's use the dataframe API
confKey = "fs.azure.account.key.srcAcctName.blob.core.windows.net"
secretKey = "==" # your secret key
spark.conf.set(confKey, secretKey)
blobUrl = 'wasbs://MyContainerName@srcAcctName.blob.core.windows.net'
Coalesce your dataframe
(df_join.coalesce(1)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .mode("overwrite")
    .save(blobUrl + "/reports/"))
Change the file name
files = dbutils.fs.ls(blobUrl + '/reports/')
output_file = [x for x in files if x.name.startswith("part-")]
dbutils.fs.mv(output_file[0].path, "%s/reports/testoutput.csv" % (blobUrl))
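Since the question is in Scala, a rough equivalent of the rename step using the Hadoop FileSystem API could look like the sketch below. It assumes the coalesced output was written under the /mnt/datalake/reports/ mount from the question and that this mount is reachable through the Hadoop FileSystem API on the cluster.
import org.apache.hadoop.fs.{FileSystem, Path}
val reportsDir = new Path("/mnt/datalake/reports/")
val fs = reportsDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
// locate the single part file written by coalesce(1)
val partFile = fs.listStatus(reportsDir).map(_.getPath).find(_.getName.startsWith("part-")).get
// rename it to the desired csv name
fs.rename(partFile, new Path(reportsDir, "testoutput.csv"))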
Try this (it converts the Spark DataFrame to pandas first, so the joined data must fit in driver memory):
df_join.toPandas().to_csv('/dbfs/mnt/....../df.csv', sep=',', header=True, index=False)
I have a dictionary saved in .pkl format, written using the following code in Python 3.x:
import pickle as cpick
OutputDirectory="My data file path"
with open("".join([OutputDirectory, 'data.pkl']), mode='wb') as fp:
cpick.dump(data_dict, fp, protocol=cpick.HIGHEST_PROTOCOL)
I want to read this file in PySpark. Can you suggest how to do that? Currently I'm using Spark 2.0 and Python 2.7.13.
I have a table stored as a text file (e.g. employee) in Hive and I want to access it using Spark.
First I set up the SQL context object using
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then I created the table:
scala>sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(
id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n'")
Further, I was trying to load the contents of the text file using
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting the error:
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If I have to place the text file in the current directory where the spark-shell is running, how do I do that?
Do you run Hive on Hadoop?
Try using an absolute path; if that doesn't work, try loading your file into HDFS and then give the absolute path to your file (the HDFS location).
Try the below steps (see the sketch after them):
Start spark-shell in local mode, e.g. spark-shell --master local[*]
Give the full file path when loading the file,
e.g. file:///home/username/employee.txt
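A hedged sketch of the load with a full path, assuming employee.txt sits in /home/username (the HDFS path in the second variant is only illustrative):
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/home/username/employee.txt' INTO TABLE employee")
or, after copying the file to HDFS (e.g. with hdfs dfs -put employee.txt /user/username/):
scala> sqlContext.sql("LOAD DATA INPATH '/user/username/employee.txt' INTO TABLE employee")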
I am reading a SAS file from Azure Blob storage, converting it to CSV, and trying to upload the CSV back to Azure Blob storage. For small files (a few MB) I am able to do this successfully with the following Spark Scala code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._
val sqlContext = new SQLContext(sc)
val df = sqlContext.sasFile("wasbs://container@storageaccount/input.sas7bdat")
df.write.format("csv").save("wasbs://container@storageaccount/output.csv");
But for large files (GBs) it gives me an AnalysisException: wasbs://container@storageaccount/output.csv file already exists. I have tried overwrite also, but no luck. Any help would be appreciated.
Actually, you normally cannot overwrite an existing file on HDFS, even for small files in MBs.
Please try the code below to overwrite; also check your Spark version, because the method differs between Spark versions.
df.write.format("csv").mode("overwrite").save("wasbs://container#storageaccount/output.csv");
I don't know whether the overwrite-mode code above is what you had already tried, as you said.
If so, another way to do it is to first delete the existing files before doing the write operation.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("<hdfs://<namenodehost>/ or wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path> >"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
There is also a Spark user-list thread that discusses a similar issue; please see http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html.
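Putting the two pieces together, a rough sketch for the output path from the question (assuming the storage account credentials are already available to the Hadoop configuration, as in the snippet above) would be:
val filepath = "wasbs://container@storageaccount/output.csv"
// delete any previous output before writing
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(filepath), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _: Throwable => () }
// write the CSV; note that Spark still creates a directory of part files at this path
df.write.format("csv").mode("overwrite").save(filepath)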