FileNotFoundException in Azure Synapse when trying to delete a Parquet file and replace it with a dataframe created from the original file - pyspark

I am trying to delete an existing Parquet file and replace it with data from a dataframe that was read from the original Parquet file before the delete. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I have read, you can't edit a Parquet file in place, so as a workaround I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that also fails with a FileNotFoundException. In fact, any operation I try on the df_variance dataframe produces this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted and that the df_variance dataframe still holds some reference to the (now deleted) file path, or something like that. Please help. Thanks.

Spark dataframes aren't collections of rows; they use deferred (lazy) execution. Only when you call
df_variance.write
is a Spark job actually run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
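A minimal sketch of a workaround under that model, reusing full_file_path and df_variance from the question: materialize the updated data to a temporary path first (this is what actually triggers the read of the original file), and only then delete the original and move the new data into place. The _tmp suffix is illustrative, and it assumes mssparkutils.fs.mv is available in your Synapse runtime:
temp_path = full_file_path + "_tmp"   # illustrative temporary location
df_variance.write.parquet(temp_path)  # runs the job while the source still exists

mssparkutils.fs.rm(full_file_path, True)       # now it is safe to delete the original
mssparkutils.fs.mv(temp_path, full_file_path)  # move the new data into place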
You might want to move on from Parquet to Delta Lake: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake
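For illustration, once the data is stored as a Delta table, the read-then-overwrite pattern from the question works directly, because Delta writes the new files before logically removing the old ones. A minimal sketch, assuming Delta Lake support in your Synapse Spark runtime:
df = spark.read.format("delta").load(full_file_path)
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("SELECT * FROM temp_table WHERE ...")  # same filter as above
df_variance.write.format("delta").mode("overwrite").save(full_file_path)  # no manual rm needed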

Related

PySpark - Read CSV and ignore file header (not using pandas)

I have a problem that I hope you can help me with.
The text file looks like this:
Report Name :
column1,column2,column3
this is row 1,this is row 2, this is row 3
I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I try to read the csv file using spark.read.csv() it thinks that the column name is "Report Name : ", which is obviously incorrect.
I know that the Pandas csv reader has a skiprows parameter, but unfortunately I cannot read the file directly with Pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a Pandas dataframe via df.toPandas().
I'd like to be able to solve this with straight PySpark dataframes.
Surely someone else has encountered this issue! Help!
I have tried every variation of reading the file, drop, etc., but the schema has already been defined when the first dataframe was created, with one column (Report Name :).
Not sure what to do now.
Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?
import csv

# Parse each partition with the csv module, then drop the "Report Name :" line
# (it has fewer than 2 fields) and the header row itself before naming the columns.
df = sc.textFile("test.csv") \
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
    .filter(lambda row: len(row) >= 2 and row[0] != 'column1') \
    .toDF(['column1', 'column2', 'column3'])
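An alternative sketch of the same idea, assuming a Spark version whose spark.read.csv accepts an RDD of strings: strip the junk first line as plain text and let the CSV reader handle the real header.
raw = sc.textFile("test.csv")
cleaned = raw.filter(lambda line: not line.startswith("Report Name"))  # drop the junk first line
df = spark.read.csv(cleaned, header=True)  # "column1,column2,column3" becomes the header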
Microsoft got back to me with an answer that worked! When you use the pandas csv reader with a path to the source file, it requires a Blob Storage endpoint (not the ADLS Gen2 one). I only had an endpoint with dfs in the URI, not blob. After I added the Blob Storage endpoint, the pandas reader worked great! Thanks for looking at my thread.
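For what it's worth, a hedged sketch of that pandas route; the account, container, file name, and SAS token are all placeholders, and skiprows=1 is the pandas parameter that drops the "Report Name :" line:
import pandas as pd

url = "https://<account>.blob.core.windows.net/<container>/report.csv?<sas-token>"  # blob endpoint, not dfs
pdf = pd.read_csv(url, skiprows=1)  # skip the first line; the next line becomes the header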

Adding an additional column containing file name to pyspark dataframe

I am iterating through the csv files in a folder using a for loop and performing some operations on each csv (getting the count of rows for each unique id and storing all these outputs in a PySpark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?
You can get the file name as a column using the function pyspark.sql.functions.input_file_name. And if your files have the same schema and you want to apply the same processing pipeline, you don't need to loop over the files at all; you can read them together using a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
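From there the per-file counts fall out of a groupBy; a small sketch, assuming the CSVs contain a column literally named "id":
counts = df.groupBy("file_name", "id").count()  # rows per unique id, per source file
counts.show()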

Hive SaveAsTable creates a new Parquet table file for every run

I have the following Scala code that I use to write data from a json file to a table in Hive.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("App").setMaster("local")
import org.apache.spark.sql.hive._
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json(<path to json file>)
stg_comments.write.mode("append").saveAsTable(<table name>)
My json data has newline and carriage return characters in its field values, so I cannot simply insert records into Hive (Hive tables by default do not store newlines and carriage returns in data values), and hence I need to use the saveAsTable option. The issue here is that every time a json file is read and new records are appended to the existing table, a new Parquet file is created in the table directory under the Hive warehouse directory. This leads to lots of really small Parquet files in the directory. I would like the data to be appended to the existing Parquet file. Do we know how to do that? Thanks!
This is expected behavior. There is no append-to-existing-file option here. Each job has its own set of tasks, and each task has its own output file. Repartitioning before the write can reduce the number of files written, but it cannot prevent new files from being created.
If the number of files becomes a problem, you have to run a separate compaction job that reads the existing small files and merges them into larger chunks.
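A hedged sketch of such a compaction job (in PySpark rather than the question's Scala; the table names are hypothetical): read the fragmented table, shrink it to a few partitions, and write the result to a new table that can then replace the original.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Coalesce the many small files into a handful of larger ones and save them
# as a new table; drop/rename afterwards to swap it in for the original.
compacted = spark.table("comments").coalesce(4)
compacted.write.mode("overwrite").saveAsTable("comments_compacted")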

How to save data in parquet format and append entries

I am trying to follow this example to save some data in Parquet format and read it back. If I use write.parquet("filename"), then the iterating Spark job gives the error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
"org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure new data is just appended to the parquet file. Can I define primary keys on these parquet tables?
I am using Spark 1.6.2 on Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe if you use .parquet("...."), you should use .mode('append'),
not SaveMode.Append:
df.write.mode('append').parquet("....")
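For what it's worth, the AnalysisException above most likely comes from the dot in "people.parquet": saveAsTable parses it as a database qualifier. A minimal sketch of the append-to-a-path pattern (PySpark, Spark 2.x API; on 1.6 the temp-view call would be registerTempTable, and the path and column names are illustrative):
peopleDF.write.mode("append").parquet("/tmp/people_parquet")  # appends new part files to the dataset

parquetFile = spark.read.parquet("/tmp/people_parquet")
parquetFile.createOrReplaceTempView("people")
teenagers = spark.sql("SELECT * FROM people WHERE age BETWEEN 13 AND 19")
Note that Parquet itself has no primary-key concept; any deduplication has to happen in the dataframe before you write it.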

Convert csv.gz files into Parquet using Spark

I need to implement converting csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp, and I only have a week of data. The timestamp format is:
'yyyy-MM-dd hh:mm:ss'
The output that I want is, for every day, a folder (or partition) where the Parquet files for that specific date are located. So there would be 7 output folders or partitions.
I only have a faint idea of how to do this; only sc.textFile comes to mind. Is there a function in Spark that can convert to Parquet? How do I implement this for S3 and HDFS?
Thanks for your help.
If you look into the Spark DataFrame API and the spark-csv package, this will achieve the majority of what you're trying to do: reading the CSV file into a dataframe and then writing the dataframe out as Parquet will get you most of the way there.
You'll still need to parse the timestamp and use the result to partition the data.
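A hedged sketch of that approach (PySpark rather than Scala; the bucket, paths, and the timestamp column name event_time are illustrative): derive a date column from the timestamp and let partitionBy create one folder per day.
from pyspark.sql.functions import to_date, col

# Spark reads .csv.gz files transparently; an hdfs:// path works the same way.
df = spark.read.option("header", "true").csv("s3a://my-bucket/input/*.csv.gz")

# One output folder per calendar day, e.g. .../date=2016-10-01/
df.withColumn("date", to_date(col("event_time"))) \
  .write \
  .partitionBy("date") \
  .parquet("s3a://my-bucket/output/")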
Old topic, but I think it is important to answer even old topics if they have not been answered correctly.
In Spark version >= 2 the csv package is already included; before that you need to add the Databricks csv package to your job, e.g. "--packages com.databricks:spark-csv_2.10:1.5.0".
Example csv:
id,name,date
1,pete,2017-10-01 16:12
2,paul,2016-10-01 12:23
3,steve,2016-10-01 03:32
4,mary,2018-10-01 11:12
5,ann,2018-10-02 22:12
6,rudy,2018-10-03 11:11
7,mike,2018-10-04 10:10
First you need to create the Hive table so that the data written by Spark is compatible with the Hive schema (this might not be needed anymore in future versions).
create table:
create table part_parq_table (
id int,
name string
)
partitioned by (date string)
stored as parquet
After you've done that, you can easily read the csv and save the dataframe to that table. The second step overwrites the column date with a date formatted like "yyyy-MM-dd". For each value, a folder will be created containing the matching rows.
Scala spark-shell example:
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
The first two lines are Hive configurations, which are needed to create partition folders that do not exist already.
var df=spark.read.format("csv").option("header","true").load("/tmp/test.csv")
df=df.withColumn("date",substring(col("date"),0,10))
df.show(false)
df.write.format("parquet").mode("append").insertInto("part_parq_table")
After the insert is done you can directly query the table, e.g. "select * from part_parq_table".
The folders will be created in the table folder; on a default Cloudera setup that is e.g. hdfs:///users/hive/warehouse/part_parq_table.
Hope that helps.
BR
Read the csv file /user/hduser/wikipedia/pageviews-by-second-tsv:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
The following code uses Spark 2.0:
import org.apache.spark.sql.types._
var wikiPageViewsBySecondsSchema = StructType(Array(StructField("timestamp", StringType, true),StructField("site", StringType, true),StructField("requests", LongType, true) ))
var wikiPageViewsBySecondsDF = spark.read.schema(wikiPageViewsBySecondsSchema).option("header", "true").option("delimiter", "\t").csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
Convert the string timestamp to a timestamp:
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.withColumn("timestampTS", $"timestamp".cast("timestamp")).drop("timestamp")
or
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.select($"timestamp".cast("timestamp"), $"site", $"requests")
Write into a Parquet file:
wikiPageViewsBySecondsDF.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")