I have the following Scala code that I use to write data from a json file to a table in Hive.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("App").setMaster("local")
import org.apache.spark.sql.hive._
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json("<path to json file>")
stg_comments.write.mode("append").saveAsTable("<table name>")
My JSON data has newline and carriage return characters in its field values, so I cannot simply insert records into Hive (Hive tables by default do not preserve newlines and carriage returns in data values), and hence I need to use the saveAsTable option. The issue here is that every time a JSON file is read and new records are appended to the existing table, a new Parquet file is created in the table directory under the Hive warehouse directory. This leads to lots of really small Parquet files in the directory. I would like the data to be appended to the existing Parquet file. Does anyone know how to do that? Thanks!
This is expected behavior. There is no append-to-existing-file option here. Each job has its own set of tasks, and each task has its own output file. Repartitioning before the write can reduce the number of files written, but it cannot prevent new files from being created.
If the number of files becomes a problem, you have to run a separate job that reads the existing small files and merges them into larger chunks.
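For illustration, here is a minimal PySpark sketch of both points (the question uses Scala, but the same calls exist in that API); the table path and partition count below are made-up assumptions, not taken from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").enableHiveSupport().getOrCreate()

# Fewer files per append: coalesce/repartition the batch before writing, so each
# append produces one (or a handful of) part files instead of one per task, e.g.
#   stg_comments.coalesce(1).write.mode("append").saveAsTable("<table name>")

# Separate compaction job: read the accumulated small files, rewrite them as a
# few larger files in a scratch location, then swap the directories once verified.
table_path = "/apps/hive/warehouse/mydb.db/comments"   # hypothetical table location
compacted_path = table_path + "_compacted"

(spark.read.parquet(table_path)
      .repartition(4)                                  # target a handful of larger files
      .write.mode("overwrite")
      .parquet(compacted_path))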
So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save the dataframe (bucketed or partitioned) as parquet, csv, text table, etc., but I can't seem to find any straightforward method to perform a simple parallel write operation without a UDF that initializes separate boto clients for each file, which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
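One possible sketch (this is not from the post, just a hedged assumption about how it could be done): use foreachPartition so each task opens a single boto3 client and reuses it for every row in its partition, rather than one client per file. The column names match the question; everything else is illustrative.
def write_partition(rows):
    # imports inside the function so executors resolve them locally
    import boto3
    from urllib.parse import urlparse

    s3 = boto3.client("s3")                      # one client per partition, not per file
    for row in rows:
        parsed = urlparse(row["save_path"])      # e.g. s3://my-bucket/plots/foo.html
        s3.put_object(
            Bucket=parsed.netloc,
            Key=parsed.path.lstrip("/"),
            Body=row["plot_string"].encode("utf-8"),
            ContentType="text/html",
        )

df3.select("save_path", "plot_string").foreachPartition(write_partition)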
I am trying to delete an existing Parquet file and replace it with data from a dataframe that read the original Parquet file before it was deleted. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I am reading, you can't edit a Parquet file so as a workaround, I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that also produces a FileNotFoundException. In fact, any operation I try with the df_variance dataframe produces this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted and that the df_variance dataframe maintains some sort of reference to the (now deleted) file path, or something like that. Please help. Thanks.
Spark dataframes aren't collections of rows. Spark dataframes use "deferred execution". Only when you call
df_variance.write
is a Spark job run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
You might want to move on from parquet to delta. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake
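As a hedged sketch of that suggestion (assuming full_file_path has been converted to a Delta table; the WHERE clause is elided here just as in the question), the whole read-filter-delete-rewrite dance collapses into an in-place overwrite, because Delta lets you overwrite a table with data derived from that same table:
# one-time conversion of the existing parquet folder, e.g.
#   spark.sql(f"CONVERT TO DELTA parquet.`{full_file_path}`")
df = spark.read.format("delta").load(full_file_path)
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("SELECT * FROM temp_table WHERE ....")   # same filter as before
# no manual delete needed; Delta's transaction log handles replacing the data
df_variance.write.format("delta").mode("overwrite").save(full_file_path)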
I'm looping through some CSV files in a folder. I want to write these CSV files as Delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. I want to reject the entire folder and all the files it contains until the data is fixed. I'm running a lot of tests, but ultimately I have to actually write the files as Delta tables with the following loop (simplified for this question):
for f in files:
    # read csv
    df = spark.read.csv(f, header=True, schema=schema)
    # writing to already existing delta table
    df.write.format("delta").save('path/' + f)
Is there a callback mechanism so the write method is executed only if none of the dataframes raises any errors? Delta table schema enforcement is pretty rigid, which is great, but errors can pop up at any time despite all the tests I'm running before passing these files into this loop.
union is not an option because I want to handle this by date, and each file has a different schema and name.
You can use df.union() or df.unionByName() to read all of your files into a single dataframe. Then that one is either written fully or fails.
# Create empty dataframe with schema to fill up
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD, schema)

for f in files:
    # read csv
    dfNext = spark.read.csv(f, header=True, schema=schema)
    df = df.unionByName(dfNext)

df.write.format("delta").save(path)
TL;DR
Is there a way to read a Scala/Java properties file from a Databricks file system?
Or, is there a way to convert a spark data frame Rows into a set of text key/value pairs (that Scala will understand)?
Full Problem:
The properties file is not local, it's on the Databricks cluster. Attempts to read a file from "dbfs:/" or "/dbfs" fail to find the file when using the scala.io.Source library. My guess is that Scala Source has no ability to recognize the URI for the Databricks file system(?).
I'm able to read the file into a Spark DataFrame; however, attempts to populate a java.util.Properties object fail with an error that it doesn't accept the Spark DataFrame Row type. I've tried converting the data frame to an Array and a List, but I run into the same type mismatch. For example, java.util.List[org.apache.spark.sql.Row] is what I get when converting the data frame to a list. I'm guessing that means dataFrameObject.collectAsList() makes a list of Spark rows instead of a text list of key/value pairs.
Obviously I'm new to Scala... If there isn't a way to read/load my properties file directly from DBFS, is there a way to convert the Spark Rows to key/value pairs, or to a byte stream?
Cheers and thanks,
Simon
If you're using the full version of Databricks, not Community Edition, then you should be able to access files on DBFS via /dbfs/<the rest of your path without the dbfs:/ prefix>...
But if you can't access /dbfs/..., then you can still load properties as following:
load the file into Spark using the text format, which converts every line in the file into an individual row
create text from those rows - first collect all rows to the driver node, then extract the string from each row (using .getString(0) to fetch the first element of the row), and then merge all lines together using mkString
create a reader for that text
create a properties object and load data from the reader (don't forget to close the reader after use):
val path_to_file = "dbfs:/something...."
val df = spark.read.format("text").load(path_to_file)
val allText = df.collect().map(_.getString(0)).mkString("\n")
val reader = new java.io.StringReader(allText)
val props = new java.util.Properties()
props.load(reader)
reader.close()
and you can check that properties are loaded with
props.list(System.out)
I am processing more than 1,000,000 records from a JSON file. I am reading the file line by line and extracting the required key values (the JSON has a mixed structure that is not fixed, so I parse it and generate the required JSON elements). I generate a JSON string similar to the json_string variable below and push it into a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new (.1 to .20 kb) part file. Because of that, even a simple query on Hive takes more than 30 minutes. Here is sample code for my logic; it iterates multiple times as new records are inserted into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._  // needed for .toDS on the Seq below
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding the Hive properties to merge the files, but it doesn't work.
I have also tried creating a table from the existing table to combine the small part files into 256 MB files.
Please share sample code for inserting multiple records and appending records to an existing part file.
I think each of those individual inserts is creating a new part file.
You could create a dataset/dataframe from these JSON strings and then save it to the Hive table, as sketched below.
You could also merge the existing small files using the Hive DDL ALTER TABLE table_name CONCATENATE;
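A hedged PySpark sketch of the first suggestion (the question's code is Scala, but the idea is identical): accumulate a batch of JSON strings, build one dataframe for the whole batch, and do a single append per batch, so each batch adds one part file instead of one file per record. The json_strings list here is hypothetical.
# hypothetical batch of extracted JSON strings, built up while parsing the input file
json_strings = [
    '{"name":"yogesh_wagh","education":"phd"}',
    '{"name":"another_record","education":"msc"}',
]

# one dataframe for the whole batch
df = spark.read.json(spark.sparkContext.parallelize(json_strings))

# one append per batch; coalesce(1) keeps it to a single part file
df.coalesce(1).write.mode("append").format("orc").insertInto("bds_data1.newversion")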