I am trying this code in Azure Databricks:
jsonSchema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])
# read the stream from Azure Event Hubs
df = spark.readStream.format("eventhubs").options(**ehConf).schema(jsonSchema).load()
streamingCountsDF = (
    df.withWatermark("Time", "500 milliseconds")
      .groupBy(df.body, window(df.enqueuedTime, "1 hour"))
      .count()
)
# write the stream to Azure Blob storage
streamingCountsDF.writeStream.format("parquet").option("path", file_location).option("checkpointLocation", "/tmp/checkpoint").start()
Here, file_location is the Azure Blob Storage URL.
I am hitting an error in the last step:
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
How can we resolve this?
Depending on the query we use, we need to select the appropriate output mode. Choosing the wrong one results in a runtime exception like the one below:
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on
streaming DataFrames/DataSets without watermark;
Reference: You can read more about compatibility of different queries with different output modes here.
In Structured Streaming, the output of the stream processing is a DataFrame or table. The output mode of the query specifies how this unbounded result table is written to the sink (in this example, Parquet files in Blob storage).
There are three output modes:
Append - In this mode, only the rows added to the result table since the last trigger (batch) are written to the sink. This is supported for simple transformations like select, filter, etc. Since these transformations do not change rows computed in earlier batches, appending just the new rows works fine.
Complete - In this mode, the entire result table is written to the sink on every trigger. It is typically used with aggregation queries, because the aggregated results keep changing as new data arrives.
Update - In this mode, only the rows that changed since the last trigger are written to the sink.
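For the windowed count in the question, note that the Parquet file sink only supports Append mode, and Append with a streaming aggregation requires the watermark to be defined on the same event-time column that the window uses (enqueuedTime here, not Time). A minimal Scala sketch of that variant (the PySpark calls have the same names; fileLocation and the watermark delay are placeholders, not the original values):

import org.apache.spark.sql.functions.window

// df is assumed to be the Event Hubs stream from the question
val streamingCountsDF = df
  .withWatermark("enqueuedTime", "10 minutes")                  // watermark on the same column the window uses
  .groupBy(df("body"), window(df("enqueuedTime"), "1 hour"))
  .count()

streamingCountsDF.writeStream
  .format("parquet")
  .outputMode("append")                                         // the file (parquet) sink supports append only
  .option("path", fileLocation)                                 // fileLocation: the blob URL from the question
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()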
Related
A Data Fusion pipeline gives us one or more part files at the output when the sink is a GCS bucket. My question is how we can combine those part files into one and also give them a meaningful name?
The Data Fusion transformations run in Dataproc clusters, executing either Spark or MapReduce jobs. Your final output is split into many files because the jobs partition your data based on the HDFS partitions (this is the default behavior for Spark/Hadoop).
When writing a Spark script you are able to change this default behavior and produce different outputs. However, Data Fusion was built to abstract away the code layer and give you the experience of a fully managed data integrator. Using split files should not be a problem, but if you really need to merge them, I suggest the following approach:
At the top of Pipeline Studio, click Hub -> Plugins, search for the Dynamic Spark Plugin, click Deploy and then Finish (you can ignore the JAR file).
Back in your pipeline, select Spark in the sink section.
Replace your GCS plugin with the Spark plugin.
In your Spark plugin, set Compile at Deployment Time to false and replace the code with some Spark code that does what you want. The code below, for example, is hardcoded but works:
def sink(df: DataFrame): Unit = {
  // reduce the output to a single partition so only one file is written
  val newDf = df.coalesce(1)
  newDf.write.format("csv").save("gs://your/path/")
}
This function receives the data from your pipeline as a DataFrame. The coalesce call reduces the number of partitions to 1, and the last line writes the result to GCS.
Deploy your pipeline and it will be ready to run.
We've been exploring using Glue to transform some JSON data to Parquet. One scenario we tried was adding a column to the Parquet table, so partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs to aggregate the Parquet table, but the new column was not available. When using glue_context.create_dynamic_frame.from_catalog to load the dynamic frame, our new column was never in the schema.
We tried several configurations for our table crawler: a single schema for all partitions, a single schema per S3 path, and a schema per partition. We could always see the new column in the Glue table data, but it was always null when we queried it from a Glue job using PySpark. The column was present in the Parquet files when we downloaded some samples, and it was available for querying via Athena.
Why are the new columns not available to pyspark?
This turned out to be a Spark configuration issue. From the Spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
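For reference, outside of Glue those two options look roughly like this in plain Spark (Scala shown; the path is only a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-schema-example").getOrCreate()

// Option 1: per-read data source option
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("s3://some-bucket/parquet-table/")

// Option 2: global SQL option, applied to every subsequent Parquet read
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val mergedGlobally = spark.read.parquet("s3://some-bucket/parquet-table/")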
In the Glue job, we could enable schema merging in two ways:
set the option on the Spark session: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
set mergeSchema to true in additional_options when loading the dynamic frame:
source = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.
I'm performing a batch process using Spark with Scala.
Each day, I need to import a sales file into a Spark DataFrame and perform some transformations (a file with the same schema; only the date and the sales values change).
At the end of the week, I need to use all the daily transformations to perform weekly aggregations. Consequently, I need to persist the daily results so that I don't make Spark do everything at the end of the week (I want to avoid importing all the data and performing all the transformations at the end of the week).
I would also like a solution that supports incremental updates (upserts).
I went through some options like Dataframe.persist(StorageLevel.DISK_ONLY). I would like to know if there are better options, such as using Hive tables.
What are your suggestions on that ?
What are the advantages of using Hive tables over Dataframe.persist ?
Many thanks in advance.
You can save the results of your daily transformations in Parquet (or ORC) format, partitioned by day. Then you can run your weekly process on those Parquet files with a query that filters only the data for the last week. Predicate pushdown and partitioning work pretty efficiently in Spark, so only the data selected by the filter is loaded for further processing.
import org.apache.spark.sql.SaveMode

dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("day") // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option lets you incrementally add data to the Parquet files (as opposed to overwriting them with SaveMode.Overwrite).
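For the weekly side, a rough sketch of reading back only last week's partitions might look like this (the date range and the sales column are assumptions; replace the aggregation with your real one):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("weekly-aggregation").getOrCreate()

// Only the partition folders matching the filter are actually read from storage
val lastWeek = spark.read
  .parquet(parquetFilePath)
  .filter(col("day").between("2023-07-17", "2023-07-23"))  // placeholder dates for the target week

// Placeholder weekly aggregation over an assumed "sales" column
val weeklyAgg = lastWeek.groupBy(col("day")).sum("sales")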
Spark 2.1.1 (Scala API), streaming JSON files from an S3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won’t occur more than 3 days apart, so I then tried it again by adding a watermark (using .withWatermark() immediately before the dropDuplicates). However, it seems to want to wait until the 3 days are up before writing the data (i.e., since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I used to achieve this in two ways with DStreams:
One way:
Load tmp_data (containing the last 3 days of unique data; see below).
Receive batch_data and do a leftOuterJoin with tmp_data.
Filter the result of step 2 to get the new unique data.
Update tmp_data with the new unique data from step 3 and drop old data (more than 3 days old).
Save tmp_data to HDFS or wherever.
Repeat the above for every batch (a rough sketch of steps 2-4 is shown below).
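A rough sketch of steps 2-4 with pair RDDs (all names and types are assumptions, not the original code):

import org.apache.spark.rdd.RDD

// batch and tmp are keyed by event_id; the value is the event time in epoch millis
def newUniques(batch: RDD[(String, Long)], tmp: RDD[(String, Long)]): RDD[(String, Long)] = {
  batch
    .leftOuterJoin(tmp)                                     // (event_id, (batchTime, Option[tmpTime]))
    .filter { case (_, (_, existing)) => existing.isEmpty } // keep only ids not already in tmp_data
    .map { case (id, (time, _)) => (id, time) }
}

// Merge the new uniques into tmp_data and drop anything older than 3 days
def refreshTmp(tmp: RDD[(String, Long)], fresh: RDD[(String, Long)], now: Long): RDD[(String, Long)] = {
  val threeDaysMs = 3L * 24 * 60 * 60 * 1000
  tmp.union(fresh).filter { case (_, t) => now - t <= threeDaysMs }
}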
Another way:
Create a table in MySQL and put a UNIQUE INDEX on event_id.
Receive batch_data and just save event_id + event_time + whatever to MySQL.
MySQL will ignore duplicates automatically (sketched below).
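A sketch of the save step (connection details, table name, and row layout are assumptions):

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

// batch rows: (event_id, event_time, payload)
def saveBatch(batch: RDD[(String, java.sql.Timestamp, String)]): Unit = {
  batch.foreachPartition { rows =>
    val conn = DriverManager.getConnection("jdbc:mysql://host:3306/dedup", "user", "password")
    val stmt = conn.prepareStatement(
      "INSERT IGNORE INTO events (event_id, event_time, payload) VALUES (?, ?, ?)")
    try {
      rows.foreach { case (id, time, payload) =>
        stmt.setString(1, id)
        stmt.setTimestamp(2, time)
        stmt.setString(3, payload)
        stmt.executeUpdate()          // rows with a duplicate event_id are skipped by the UNIQUE index
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}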
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.
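A hedged sketch of the core of that logic (table and column names are assumptions; in the real solution this sits inside the custom Sink's addBatch):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, current_date, date_sub}

def writeDeduplicated(spark: SparkSession, batch: DataFrame): Unit = {
  // event_ids already written to the target table in the last few days
  val recentIds = spark.table("target_hive_table")
    .filter(col("event_date") >= date_sub(current_date(), 3))
    .select("event_id")

  batch
    .dropDuplicates("event_id")                      // dedupe within the micro-batch
    .join(recentIds, Seq("event_id"), "left_anti")   // drop rows whose id was already written
    .write
    .mode("append")
    .insertInto("target_hive_table")
}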
My task is basically:
Read data from Google Cloud BigQuery using Spark/Scala.
Perform some operation (like an update) on the data.
Write back the data to BigQuery
So far, I am able to read data from BigQuery using newAPIHadoopRDD(), which returns an RDD[(LongWritable, JsonObject)].
tableData.map(entry => (entry._1.toString(), entry._2.toString()))
  .take(10)
  .foreach(println)
And below is a sample record:
(341,{"id":"4","name":"Shahu","score":"100"})
I am not able to figure out what functions I should use on this RDD to meet the requirement.
Do I need to convert this RDD to a DataFrame/Dataset/JSON format? And how?
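For what it's worth, one possible sketch of the DataFrame conversion step (not necessarily the best route for writing back to BigQuery; tableData is assumed to be the RDD[(LongWritable, JsonObject)] from above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-json-to-df").getOrCreate()
import spark.implicits._

// Keep only the JSON payload and let Spark infer a schema from it
val jsonStrings = tableData.map { case (_, json) => json.toString }
val df = spark.read.json(spark.createDataset(jsonStrings))

// Based on the sample record above, df should have string columns id, name, score,
// and ordinary DataFrame operations can then be applied, e.g.:
val updated = df.withColumn("score", $"score".cast("int") + 10)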