PySpark saving is not working when called from inside a foreach - pyspark

I am building a pipeline that receives messages from Azure Event Hub and saves them into Databricks Delta tables.
All my tests with static data went well; see the code below:
from pyspark.sql import SparkSession

body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n'
tableLocation = "/delta/tables/myTableName"

spark = SparkSession.builder.appName("CSV converter").getOrCreate()
csvData = spark.sparkContext.parallelize(body.split('\n'))
df = spark.read \
    .option("header", True) \
    .option("delimiter", "|") \
    .option("quote", "\"") \
    .option("nullValue", "\\N") \
    .option("inferSchema", "true") \
    .option("mergeSchema", "true") \
    .csv(csvData)
df.write.format("delta").mode("append").save(tableLocation)
However, in my case each Event Hub message is a CSV string, and the messages may come from many sources, so each message must be processed separately, because each one may end up saved to a different Delta table.
When I try to execute this same code inside a foreach statement, it doesn't work. No errors show up in the logs, and I can't find any saved table.
So maybe I am doing something wrong when calling the foreach. See the code below:
def SaveData(row):
    # ... the same code as above
dfEventHubCSV.rdd.foreach(SaveData)
I tried to do this in a streaming context, but I sadly ran into the same problem.
What is it about the foreach that makes it behave differently?
Below is the full code I am running:
import pyspark.sql.types as t
from pyspark.sql import SparkSession, SQLContext

# row contains the fields Body and SdIds
# Body: CSV string
# SdIds: a string ID
def SaveData(row):
    # Each row's data is going to be added to a different table
    rowInfo = GetDestinationTableData(row['SdIds']).collect()
    table = rowInfo[0][4]
    schema = rowInfo[0][3]
    database = rowInfo[0][2]
    body = row['Body']

    tableLocation = "/delta/" + database + '/' + schema + '/' + table
    checkpointLocation = "/delta/" + database + '/' + schema + "/_checkpoints/" + table

    spark = SparkSession.builder.appName("CSV").getOrCreate()
    csvData = spark.sparkContext.parallelize(body.split('\n'))
    df = spark.read \
        .option("header", True) \
        .option("delimiter", "|") \
        .option("quote", "\"") \
        .option("nullValue", "\\N") \
        .option("inferSchema", "true") \
        .option("mergeSchema", "true") \
        .csv(csvData)
    df.write.format("delta").mode("append").save(tableLocation)

dfEventHubCSV.rdd.foreach(SaveData)

Well, in the end, as always, it was something very simple, but I didn't see this documented anywhere.
Basically, when you perform a foreach and the dataframe you want to save is built inside the loop, the worker, unlike the driver, won't automatically set up the "/dbfs/" prefix on the save path. So if you don't manually add "/dbfs/", the data is saved locally on the worker and you will never find it.
That is why my loops weren't working.
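For illustration, a minimal sketch of that fix, reusing the variables from SaveData above; the only change is the "/dbfs/" prefix on the path:
# same write as in SaveData, but with the /dbfs/ prefix so the worker
# writes to DBFS instead of its own local filesystem
tableLocation = "/dbfs/delta/" + database + '/' + schema + '/' + table
df.write.format("delta").mode("append").save(tableLocation)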

Related

Filter rows of snowflake table while reading in pyspark dataframe

I have a huge Snowflake table. I want to do some transformations on the table in PySpark. My Snowflake table has a column called 'snapshot'. I only want to read the current snapshot data into a PySpark dataframe and do the transformations on the filtered data.
So, is there a way to filter the rows while reading the Snowflake table into a Spark dataframe (I don't want to read the entire Snowflake table into memory because it is not efficient), or do I need to read the entire Snowflake table (into a Spark dataframe) and then apply a filter to get the latest snapshot, something like below?
from pyspark.sql.functions import current_timestamp

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
snowflake_database = "********"
snowflake_schema = "********"
source_table_name = "********"
snowflake_options = {
    "sfUrl": "********",
    "sfUser": "********",
    "sfPassword": "********",
    "sfDatabase": snowflake_database,
    "sfSchema": snowflake_schema,
    "sfWarehouse": "COMPUTE_WH"
}
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name) \
    .load()
df = df.where(df.snapshot == current_timestamp()).collect()
There are forms of filters (the filter or where functionality of a Spark DataFrame) that Spark doesn't pass down to the Spark Snowflake connector. That means that, in some situations, you may get more records than you expect.
The safest way is to use a SQL query directly:
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", "SELECT X,Y,Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()") \
    .load()
Of course, if you want to use the filter/where functionality of the Spark DataFrame, check the Query History in the Snowflake UI to see whether the generated query has the right filter applied.
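For reference, a minimal sketch of that DataFrame-side filter (the import is my addition; whether the predicate actually reaches Snowflake is what the Query History check above verifies):
from pyspark.sql.functions import current_timestamp

# DataFrame-side filter; confirm in Snowflake's Query History that the
# predicate was pushed down to the connector
filtered_df = df.where(df.snapshot == current_timestamp())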

In Spark, how to write header in a file, if there is no row in a dataframe?

I want to write a header to a file even if there are no rows in the dataframe. Currently, when I write an empty dataframe to a file, the file is created but it does not have a header in it.
I am writing the dataframe using these settings and this command:
Dataframe.repartition(1) \
    .write \
    .format("com.databricks.spark.csv") \
    .option("ignoreLeadingWhiteSpace", False) \
    .option("ignoreTrailingWhiteSpace", False) \
    .option("header", "true") \
    .save('/mnt/Bilal/Dataframe')
I want the header row in the file, even if there is no data row in a dataframe.
If you want to have just a header file, you can use foldLeft to create each column with white space and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala; the majority of the code should be reusable, you will just have to convert it to PySpark (a possible conversion is sketched after the Scala code below).
val path = "/user/test"
val newdf = df.columns.foldLeft(df) { (tempdf, cols) =>
  tempdf.withColumn(cols, lit(""))
}
Create a method for writing the header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  for (h <- colNames) {
    writer.write(h + ",")
  }
  writer.write("\n")
  writer.close()
}
Call it on your DF:
createHeaderFile(path, newdf.columns)
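Since the conversion to PySpark is left to the reader, here is a hedged sketch of what it could look like; the reduce over the columns mirrors the Scala foldLeft, and the header writer goes through Hadoop's FileSystem API via the SparkContext's JVM gateway (an assumption on my part, not part of the original answer):
from functools import reduce
from pyspark.sql.functions import lit

# Mirror of the Scala foldLeft: replace every column's values with empty strings
newdf = reduce(lambda tempdf, col: tempdf.withColumn(col, lit("")), df.columns, df)

def create_header_file(spark, header_file_path, col_names):
    # Write a single header line through Hadoop's FileSystem API (py4j gateway)
    sc = spark.sparkContext
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    path = jvm.org.apache.hadoop.fs.Path(header_file_path + "/yourfileName.csv")
    output = fs.create(path)
    try:
        output.write(bytearray(",".join(col_names) + "\n", "utf-8"))
    finally:
        output.close()

create_header_file(spark, "/user/test", newdf.columns)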
I had the same problem, in PySpark. When the dataframe was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.
So I created a custom method that checks whether the output is a single empty CSV. If it is, the method adds only the header.
import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # only write the header if the single CSV file is empty
        if os.path.getsize(csv_file) == 0:
            with open(csv_file, 'a') as f:
                header = ','.join(columns)
                f.write(header)
Example:
# Create a dummy Dataframe
df = spark.createDataFrame([(1,2), (1, 4), (3, 2), (1, 4)], ("a", "b"))
# Filter in order to create an empty Dataframe
filtered_df = df.filter(df['a']>10)
# Write the df without rows and no header
filtered_df.write.csv('output.csv', header='true')
# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
The same problem occurred to me. What I did was to use pandas for storing empty dataframes instead.
from os.path import join

if df.count() == 0:
    df.coalesce(1).toPandas().to_csv(join(output_folder, filename_output), index=False)
else:
    df.coalesce(1).write.format("csv").option("header", "true").mode('overwrite').save(join(output_folder, filename_output))

pyspark 2.4.x structured streaming foreachBatch not running

I am working with Spark 2.4.0 and Python 3.6. I am developing a Python program with PySpark structured streaming actions. The program runs two readStreams reading from two sockets and then makes a union of these two streaming dataframes. I tried Spark 2.4.0 and 2.4.3, but nothing changed.
Then I perform a single writeStream in order to write just one output streaming dataframe. THAT WORKS WELL.
However, since I also need to write a non-streaming dataset for all the micro-batches, I coded a foreachBatch call inside the writeStream. THAT DOESN'T WORK.
I put spark.scheduler.mode=FAIR in spark-defaults.conf. I am running through spark-submit, but even though I also tried with python3 directly, it doesn't work at all. It looks like it never executes the splitStream function referred to in the foreachBatch. I tried adding some prints in the splitStream function, without any effect.
I made many attempts, but nothing changed; I submitted via spark-submit and via python. I am working on a Spark standalone cluster.
inDF_1 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_1) \
    .option('port', port_1) \
    .option("maxFilesPerTrigger", 1) \
    .load()

inDF_2 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_2) \
    .option('port', port_2) \
    .option("maxFilesPerTrigger", 1) \
    .load() \
    .coalesce(1)
inDF = inDF_1.union(inDF_2)
#--------------------------------------------------#
# write streaming raw dataser R-01 plateMeasures #
#--------------------------------------------------#
def splitStream(df, epoch_id):
    df \
        .write \
        .format('text') \
        .outputMode('append') \
        .start(path = outDir0)
    listDF = df.collect()
    print(listDF)
    pass
stageDir = dLocation.getLocationDir('R-00')
outDir0 = dLocation.getLocationDir(outList[0])
chkDir = dLocation.getLocationDir('CK-00')
query0 = programName + '_q0'
q0 = inDF_1 \
    .writeStream \
    .foreachBatch(splitStream) \
    .format('text') \
    .outputMode('append') \
    .queryName(query0) \
    .start(path = stageDir,
           checkpointLocation = chkDir)
I am using foreachBatch because I need to write to several sinks for each input micro-batch.
Thanks a lot to everyone who tries to help me.
I have tried this on my local machine and it works for Spark > 2.4.
df.writeStream
  .foreachBatch((microBatchDF, microBatchId) => {
    microBatchDF
      .withColumnRenamed("value", "body")
      .write
      .format("console")
      .option("checkpointLocation", "checkPoint")
      .save()
  })
  .start()
  .awaitTermination()
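Since the question is in PySpark, here is a hedged sketch of the same pattern in PySpark; the paths are placeholders I introduced, not values from the original post. Inside foreachBatch the micro-batch is a regular batch DataFrame, so it is written with save() rather than start()/outputMode():
def split_stream(micro_batch_df, batch_id):
    # micro_batch_df is a plain batch DataFrame here: use the batch writer
    micro_batch_df.withColumnRenamed("value", "body") \
        .write \
        .format("text") \
        .mode("append") \
        .save("/tmp/foreach_batch_out")  # placeholder path

query = (inDF_1.writeStream
         .foreachBatch(split_stream)
         .option("checkpointLocation", "/tmp/foreach_batch_chk")  # placeholder path
         .start())
query.awaitTermination()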

Spark JDBC returning dataframe only with column names

I am trying to connect to a Hive table using Spark JDBC, with the following code:
val df = spark.read.format("jdbc").
  option("driver", "org.apache.hive.jdbc.HiveDriver").
  option("user", "hive").
  option("password", "").
  option("url", jdbcUrl).
  option("dbTable", tableName).load()
df.show()
but all I get back is an empty dataframe with modified column names, like this:
--------------|---------------|
tableName.uuid|tableName.name |
--------------|---------------|
I've tried to read the dataframe in a lot of ways, but the result is always the same.
I'm using the JDBC Hive driver, and this Hive table is located in an EMR cluster. The code also runs in the same cluster.
Any help will be really appreciated.
Thank you all.
Please set fetchsize in the options; it should work.
Dataset<Row> referenceData = sparkSession.read()
    .option("fetchsize", "100")
    .format("jdbc")
    .option("url", jdbc.getJdbcURL())
    .option("user", "")
    .option("password", "")
    .option("dbtable", hiveTableName)
    .load();

PySpark - read recursive Hive table

I have a Hive table that has multiple sub-directories in HDFS, something like:
/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...
Normally I set the following parameters before I run a Hive script:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
select * from my_db.my_table;
I'm trying to do the same using PySpark:
conf = (SparkConf().setAppName("My App")
        ...
        .set("hive.input.dir.recursive", "true")
        .set("hive.mapred.supports.subdirectories", "true")
        .set("hive.supports.subdirectories", "true")
        .set("mapred.input.dir.recursive", "true"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.sql("select * from my_db.my_table")
and end up with an error like:
java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1
What's the correct way to read a Hive table with sub-directories in Spark?
What I have found is that these values must be prefixed with spark., as in:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
Try setting them through sqlContext.sql() prior to executing the query:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")
Try setting them through the SparkSession before executing the query:
sparkSession = (SparkSession
                .builder
                .appName('USS - Unified Scheme of Sells')
                # note: don't also pass conf=SparkConf() to .config(), or the key/value pair is ignored
                .config("hive.metastore.uris", "thrift://probighhwm001:9083")
                .config("hive.input.dir.recursive", "true")
                .config("hive.mapred.supports.subdirectories", "true")
                .config("hive.supports.subdirectories", "true")
                .config("mapred.input.dir.recursive", "true")
                .enableHiveSupport()
                .getOrCreate()
                )