My task is basically:
Read data from Google Cloud BigQuery using Spark/Scala.
Perform some operation (Like, Update) on the data.
Write back the data to BigQuery
So far, I have been able to read data from BigQuery using newAPIHadoopRDD(), which returns an RDD[(LongWritable, JsonObject)].
tableData.map { case (key, json) => (key.toString, json.toString) }
  .take(10)
  .foreach(println)
And below is the sample data:
(341,{"id":"4","name":"Shahu","score":"100"})
I am not able to figure out which functions I should use on this RDD to meet the requirement.
Do I need to convert this RDD to a DataFrame/Dataset/JSON format? And how?
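For reference, one way to get a DataFrame out of such JSON payloads is to turn the JsonObject values into JSON strings and let Spark infer the schema. A minimal sketch, assuming an existing SparkSession named spark and the tableData RDD from newAPIHadoopRDD() above:
import spark.implicits._

// tableData: RDD[(LongWritable, JsonObject)] read via newAPIHadoopRDD()
val jsonStrings = tableData.map { case (_, json) => json.toString }  // RDD[String]

// Let Spark infer a schema from the JSON payloads and build a DataFrame.
val df = spark.read.json(jsonStrings.toDS())

df.printSchema()
df.show(10, truncate = false)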
Related
I'm trying to achieve something similar using Spark and Scala:
Updating BigQuery data using Java
https://cloud.google.com/bigquery/docs/updating-data
I want to update existing data and also insert new data into a BigQuery table. Is there any way to use some sort of DML within Spark to do an upsert operation against BigQuery?
I found that BigQuery supports MERGE, but I'm not sure whether we can do something similar using Spark and Scala:
Google BQ - how to upsert existing data in tables?
The Spark API does not support upsert yet. The best workaround at the moment is to write the DataFrame to a temporary table, run a MERGE job, and then delete the temporary table.
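As a rough illustration of that workaround, here is a hedged Scala sketch. The write options (format "bigquery", temporaryGcsBucket) follow the spark-bigquery connector and may differ by version, and the dataset, table, and bucket names are placeholders:
import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, TableId}

// 1) Stage the DataFrame in a temporary BigQuery table.
resultDf.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-staging-bucket")  // required for the indirect write path
  .mode("overwrite")
  .save("my_dataset.scores_staging")

// 2) Run the MERGE with the BigQuery client, then drop the staging table.
val bigquery = BigQueryOptions.getDefaultInstance.getService
val mergeSql =
  """MERGE my_dataset.scores T
    |USING my_dataset.scores_staging S
    |ON T.id = S.id
    |WHEN MATCHED THEN UPDATE SET T.name = S.name, T.score = S.score
    |WHEN NOT MATCHED THEN INSERT (id, name, score) VALUES (S.id, S.name, S.score)
    |""".stripMargin

bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build())
bigquery.delete(TableId.of("my_dataset", "scores_staging"))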
I am trying this code in Azure Databricks:
from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

jsonSchema = StructType([StructField("time", TimestampType(), True), StructField("action", StringType(), True)])

# read stream from Azure Event Hubs
df = spark.readStream.format("eventhubs").options(**ehConf).schema(jsonSchema).load()

streamingCountsDF = (
    df.withWatermark("Time", "500 milliseconds")
      .groupBy(df.body, window(df.enqueuedTime, "1 hour"))
      .count()
)

# write stream to Azure Blob storage
streamingCountsDF.writeStream.format("parquet").option("path", file_location).option("checkpointLocation", "/tmp/checkpoint").start()
file_location is the Azure Blob storage URL.
I am hitting an error in the last step:
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
How can we resolve this?
Depending upon the query we use, we need to select the appropriate output mode. Choosing the wrong one results in a runtime exception like the one below.
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on
streaming DataFrames/DataSets without watermark;
Reference: you can read more about the compatibility of different queries with different output modes in the Structured Streaming programming guide.
In Structured Streaming, the output of the stream processing is a DataFrame or table. The output mode of the query signifies how this unbounded output table is written to the sink (in this example, a parquet sink).
There are three output modes:
Append - In this mode, only the records that arrived in the last trigger (batch) are written to the sink. This is supported for simple transformations like select, filter, etc. Because these transformations don't change rows computed for earlier batches, appending only the new rows works fine. For streaming aggregations, append mode additionally requires a watermark on the event-time column used in the aggregation, so Spark knows when a window is final.
Complete - In this mode, the complete result table is written to the sink every time. It is typically used with aggregation queries, since the aggregated result keeps changing as new data arrives.
Update - In this mode, only the records that changed since the last trigger are written to the sink.
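In the query above, the watermark is declared on "Time" while the window is computed on enqueuedTime, so the aggregation effectively has no watermark, which is what append mode complains about. Below is a sketch of the same pipeline in Scala with the watermark on the windowed column and an explicit output mode; the Event Hubs settings and the blob path are placeholders:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().appName("eventhub-counts").getOrCreate()
val ehConf: Map[String, String] = Map()  // placeholder: fill in the Event Hubs connection settings

val events = spark.readStream.format("eventhubs").options(ehConf).load()

// Watermark on the same event-time column used in the window.
val counts = events
  .withWatermark("enqueuedTime", "1 hour")
  .groupBy(col("body"), window(col("enqueuedTime"), "1 hour"))
  .count()

counts.writeStream
  .format("parquet")
  .outputMode("append")  // append works once the watermark covers the aggregation column
  .option("path", "wasbs://container@account.blob.core.windows.net/counts")  // placeholder blob path
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()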
Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day, and I am trying to find the best way to convert this data to a structured format using Apache Spark.
In case you have your invoices in EDIFACT format, you can read each of them as one String per invoice using RDDs. You will then have an RDD[String] which represents the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools; with it you can convert the EDIFACT strings to XML. The repo https://github.com/databricks/spark-xml shows how to use XML as an input source to create DataFrames and perform multiple queries, aggregations, etc.
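A rough end-to-end sketch of that approach is below. The edifactToXml helper is only a stand-in for the java-tools converter (check that project's README for the actual classes), and the S3 paths and row tag are assumptions:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("edifact-to-xml").getOrCreate()

// One EDIFACT invoice per file; wholeTextFiles yields (path, content) pairs.
val invoices = spark.sparkContext
  .wholeTextFiles("s3a://my-bucket/edifact/2019-06-01/")  // placeholder bucket/prefix
  .values

// Stand-in for the CenPC434/java-tools EDIFACT-to-XML conversion.
def edifactToXml(edifact: String): String = s"<Invoice><raw>$edifact</raw></Invoice>"

// Stage the converted XML, then let spark-xml build a DataFrame from it.
invoices.map(edifactToXml).saveAsTextFile("s3a://my-bucket/xml-staging/2019-06-01/")

val df = spark.read
  .format("xml")               // provided by the databricks/spark-xml package
  .option("rowTag", "Invoice") // depends on the XML the converter produces
  .load("s3a://my-bucket/xml-staging/2019-06-01/")

df.printSchema()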
I'm performing a batch process using Spark with Scala.
Each day, I need to import a sales file into a Spark DataFrame and perform some transformations (a file with the same schema; only the date and the sales values may change).
At the end of the week, I need to use all the daily transformations to perform weekly aggregations. Consequently, I need to persist the daily transformations so that Spark doesn't have to redo everything at the end of the week (I want to avoid importing all the data and re-running all the transformations then).
I would also like a solution that supports incremental updates (upserts).
I went through some options like Dataframe.persist(StorageLevel.DISK_ONLY). I would like to know if there are better options, like maybe using Hive tables?
What are your suggestions on that?
What are the advantages of using Hive tables over Dataframe.persist?
Many thanks in advance.
You can save the results of your daily transformations in Parquet (or ORC) format, partitioned by day. Then you can run your weekly process on this Parquet data with a query that filters only the data for the last week. Predicate pushdown and partition pruning work efficiently in Spark, so only the data selected by the filter is loaded for further processing.
import org.apache.spark.sql.SaveMode

dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("day")  // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option lets you incrementally add data to the Parquet dataset (versus overwriting it with SaveMode.Overwrite).
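The weekly job can then read back only the partitions it needs; a sketch, assuming the day column holds ISO dates and using hypothetical sales columns:
import org.apache.spark.sql.functions.{col, sum}

// Partition pruning: only the matching day=... directories are scanned.
val lastWeek = spark.read
  .parquet(parquetFilePath)
  .filter(col("day").between("2019-06-03", "2019-06-09"))  // placeholder week

val weeklyAgg = lastWeek
  .groupBy("product_id")  // placeholder column
  .agg(sum("sales").as("weekly_sales"))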
I am new to Scala, and I have to use Scala and Spark's SQL, MLlib, and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are:
Customer life cycle value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, just like a SQL query optimiser; this is not available in the case of RDDs.
As you are new to Scala, it is highly recommended to use the DataFrame API initially and pick up the RDD API later as the requirements demand.
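To make the difference concrete, here is the same aggregation written both ways; a sketch, assuming a DataFrame df with name and score columns:
import org.apache.spark.sql.functions.sum

// DataFrame API: SQL-like, goes through the Catalyst optimiser.
val totalsDf = df.groupBy("name").agg(sum("score").as("total"))

// RDD API: Scala-collection-like, no automatic query optimisation.
val totalsRdd = df.rdd
  .map(row => (row.getAs[String]("name"), row.getAs[Double]("score")))  // assumes score is a double
  .reduceByKey(_ + _)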
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame, and it infers data types automatically. If the file has a header, the reader can use it for the column names; otherwise you can construct the schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0, it supports the CSV data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for usage examples:
https://github.com/databricks/spark-csv/issues/367
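For example, with Spark 2.0+ the built-in reader covers both the inferred-schema and explicit-schema cases; a sketch with a hypothetical HDFS path and column names:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, TimestampType}

val spark = SparkSession.builder().appName("csv-load").getOrCreate()

// Option 1: use the header for names and let Spark infer the types (extra pass over the data).
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/transactions/*.csv")  // placeholder path

// Option 2: declare the schema explicitly (skips the inference pass).
val schema = StructType(Seq(
  StructField("customer_id", StringType),  // placeholder columns
  StructField("amount", DoubleType),
  StructField("ts", TimestampType)
))

val typed = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("hdfs:///data/transactions/*.csv")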