How to read a checkpointed DataFrame in PySpark - pyspark

If I checkpoint a DataFrame like the below. How can I read it back in?
df1 = spark.createDataFrame([('Abraham','Lincoln')], ['first_name', 'last_name'])
df1.checkpoint()
Something like....
reload = spark.read.something('checkpoints/87b411a8-19e3-402a-86a7-cfac0a4a6d14/rdd-40/*')
I see that the checkpoint file that is written to the checkpoints folder is partitioned but can't tell what the file type is.

Related

FileNotFoundException in Azure Synapse when trying to delete a Parquet file and replace it with a dataframe created from the original file

I am trying to delete an existing Parquet file and replace it with data in a dataframe that read the data in the original Parquet file before deleting it. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I am reading, you can't edit a Parquet file so as a workaround, I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that is also a FileNotFoundException error. But I am finding that any operation I try to do with the df_variance dataframe is producing this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted and that the df_variance dataframe maintains some sort of reference to the (now deleted) file path, or something like that. Please help. Thanks.
Spark dataframes aren't collections of rows. Spark dataframes use "deferred execution". Only when you call
df_variance.write
is a spark job run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
You might want to move on from parquet to delta. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake

Reading the latest s3 file in spark structure streaming

I have a spark structure streaming code, that reads the JSON file from the s3 bucket and writes it back to s3.
Input file path format:
val inputPath = s3://<path>/2022-08-26
Output file path format:
val outputPath = s3://<path>/2022-08-26
Code:
val spark = SparkSession.builder().appName("raw_data").enableHiveSupport().getOrCreate()
val df = spark.readStream.option("startingPosition","earliest").schema(LogSchema).json(inputPath)
val query = df.
writeStream.
outputMode("append").
partitionBy("day").
format("parquet").
option("path", "s3://<path>/raw_data/data/").
option("checkpointLocation", "s3://<path>/raw_data/checkpoint/").
trigger(Trigger.ProcessingTime("300 seconds")).
start()
Issue Facing:
We want to read the latest files received on the s3 bucket partition by the current day(not the old one).
Writing the file on the s3 bucket should be partitioned on the current day.
Please help me to resolve the above issue.
You can get this behaviour using hive on top of S3 where you can create hive partition by day/date. You can create a hive table where location is pointed to S3. you can then read and write based on hive table partition. In your case that partition will be currentdate.
You can refer this article
https://medium.com/analytics-vidhya/hive-using-s3-and-scala-af5524302758

Spark DataFrame is not saved in Delta format

I want to save Spark DataFrame in Delta format to S3, however, for some reason, the data is not saved. I debugged all the processing steps there was data and right before saving it, I ran count on the DataFrame which returned 24 rows. But as soon as save is called no data appears in the resulting folder. What could be the reason for it?
This is how I save the data:
df
.select(schema)
.repartition(partitionKeys.map(new ColumnName(_)): _*)
.sortWithinPartitions(sortByKeys.map(new ColumnName(_)): _*)
.write
.format("delta")
.partitionBy(partitionKeys: _*)
.mode(saveMode)
.save("s3a://etl-qa/data_feed")
There is a quick start from Databricks that explains how to read and write from and to a delta lake.
If the Dataframe you are trying to save is called df you need to execute:
df.write.format("delta").save(s3path)

spark-shell load existing hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way so it only loads one particular partition of this table.
Thanks!
There is no direct way in spark.read.format but you can use where condition
val df = spark.read.format("orc").load("mytable").where(yourparitioncolumn)
unless until you perform an action nothing is loaded, since load (pointing to your orc file location ) is just a func in DataFrameReader like below it doesnt load until actioned.
see here DataFrameReader
def load(paths: String*): DataFrame = {
...
}
In above code i.e. spark.read.... where is just where condition when you specify this, again data wont be loaded immediately :-)
when you say df.count then your parition column will be appled on data path of orc.
There is no function available in Spark API to load only partition directory, but other way around this is partiton directory is nothing but column in where clause, here you can right simple sql query with partition column in where clause which will read data only from partition directoty. See if that will works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")

Convert csv.gz files into Parquet using Spark

I need to implement converting csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp and I only have a week of dataset. The timestamp format is:
'yyyy-MM-dd hh:mm:ss'
The output that I desire is that for every day, there is a folder (or partition) where the Parquet files for that specific date is located. So there would 7 output folders or partitions.
I only have a faint idea of how to do this, only sc.textFile is on my mind. Is there a function in Spark that can convert to Parquet? How do I implement this in S3 and HDFS?
Thanks for you help.
If you look into the Spark Dataframe API, and the Spark-CSV package, this will achieve the majority of what you're trying to do - reading in the CSV file into a dataframe, then writing the dataframe out as parquet will get you most of the way there.
You'll still need to do some steps on parsing the timestamp and using the results to partition the data.
old topic but ill think it is important to answer even old topics if not answered right.
in spark version >=2 csv package is already included before that you need to import databricks csv package to your job e.g. "--packages com.databricks:spark-csv_2.10:1.5.0".
Example csv:
id,name,date
1,pete,2017-10-01 16:12
2,paul,2016-10-01 12:23
3,steve,2016-10-01 03:32
4,mary,2018-10-01 11:12
5,ann,2018-10-02 22:12
6,rudy,2018-10-03 11:11
7,mike,2018-10-04 10:10
First you need to create the hivetable so that the spark written data is compatible with the hive schema. (this might be not needed anymore in future versions)
create table:
create table part_parq_table (
id int,
name string
)
partitioned by (date string)
stored as parquet
after youve done that you can easy read the csv and save the dataframe to that table.The second step overwrites the column date with the dateformat like"yyyy-mm-dd". For each of the value a folder will be created with the specific lines in it.
SCALA Spark-Shell example:
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
First two lines are hive configurations which are needed to create a partition folder which not exists already.
var df=spark.read.format("csv").option("header","true").load("/tmp/test.csv")
df=df.withColumn("date",substring(col("date"),0,10))
df.show(false)
df.write.format("parquet").mode("append").insertInto("part_parq_table")
after the insert is done you can directly query the table like "select * from part_parq_table".
The folders will be created in the tablefolder on default cloudera e.g. hdfs:///users/hive/warehouse/part_parq_table
hope that helps
BR
Read csv file /user/hduser/wikipedia/pageviews-by-second-tsv
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
The following code uses spark2.0
import org.apache.spark.sql.types._
var wikiPageViewsBySecondsSchema = StructType(Array(StructField("timestamp", StringType, true),StructField("site", StringType, true),StructField("requests", LongType, true) ))
var wikiPageViewsBySecondsDF = spark.read.schema(wikiPageViewsBySecondsSchema).option("header", "true").option("delimiter", "\t").csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
Convert String-timestamp to timestamp
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.withColumn("timestampTS", $"timestamp".cast("timestamp")).drop("timestamp")
or
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.select($"timestamp".cast("timestamp"), $"site", $"requests")
Write into parquet file.
wikiPageViewsBySecondsTableDF.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")