I want to apply a rank or partitioned row_number function to my data in Data Fusion, but I can't find any plugin to do so.
Is there any way to achieve this?
Here is what I want to implement.
Suppose I have the above data. I want to group it by AccountNumber and send the most recent record for each account to one sink and the rest to another.
So from the above data,
Sink1 is expected to have the most recent record per AccountNumber,
and Sink2 the remaining records.
I was planning to do this segregation by applying rank or row_number functionality, partitioned by AccountNumber and sorted by Record_date descending, and then sending the records with rank = 1 (or row_num = 1) to one sink and the rest to the other.
A good approach to solving your problem is to use the Spark plugin.
To add it to your Data Fusion instance, go to HUB -> Plugins -> search for Spark -> deploy the plugin. You can then find it on the Analytics tab.
To give you an example of how you could use it, I created the pipeline below:
This pipeline basically:
Reads a file from GCS.
Executes a rank function on your data.
Filters the data with rank = 1 and rank > 1 into different branches.
Saves your data in different locations.
Now let's take a deeper look at each component:
1 - GCS: this is a simple GCS source. The file used for this example has the data shown below.
2 - Spark_rank: this is a Spark plugin with the code below. The code basically creates a temporary view from your data and then applies a query to rank your rows. After that, your data flows back into the pipeline. Below you can also see the input and output data for this step. Notice that the output is duplicated because it is delivered to two branches.
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
df.createTempView("source")
df.sparkSession.sql("SELECT AccountNumber, Address, Record_date, RANK() OVER (PARTITION BY accountNumber ORDER BY record_date DESC) as rank FROM source")
}
3 - Spark2 and Spark3: like the step above, these steps use the Spark plugin to transform the data. Spark2 keeps only the data with rank = 1, using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
df.createTempView("source_0")
df.sparkSession.sql("SELECT AccountNumber, Address, Record_date FROM
source_0 WHERE rank = 1")
}
Spark3 gets the data with rank > 1 using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
df.createTempView("source_1")
df.sparkSession.sql("SELECT accountNumber, address, record_date FROM source_1 WHERE rank > 1")
}
4 - GCS2 and GCS3: finally, in this step your data gets saved into GCS again.
I have two datasets that I need to join and perform operations on, and I can't figure out how to do it.
A stipulation is that I do not have the org.apache.spark.sql.functions methods available to me, so I must use the Dataset API.
The input is two Datasets.
The first Dataset is of type Customer with fields:
customerId, forename, surname - all String
The second Dataset is of type Transaction:
customerId (String), accountId (String), amount (Long)
customerId is the join key.
The output Dataset needs to have these fields:
customerId (String), forename (String), surname (String), transactions (a list of type Transaction), transactionCount (Int), totalTransactionAmount (Double), averageTransactionAmount (Double)
I understand that I need to use groupBy, agg, and some kind of join at the end.
Can anyone help/point me in the right direction? Thanks
It is hard to work with only the information you have given, but from what I understand you don't want to use the DataFrame functions and instead want to implement everything with the Dataset API. You could do this in the following way:
Join both Datasets using joinWith; you can find an example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith
Aggregating: I would use groupByKey followed by mapGroups, something like
ds.groupByKey(x => x.customerId).mapGroups { case (key, iter) =>
  val list = iter.toList
  val totalTransactionAmount = list.map(_.amount).sum
  val averageTransactionAmount = totalTransactionAmount.toDouble / list.size
  (key, totalTransactionAmount, averageTransactionAmount)
}
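To sketch the whole flow end to end, assuming the inputs are a Dataset[Customer] and a Dataset[Transaction] exactly as described in the question (the names customers, transactions and CustomerSummary below are mine, not from the original post, and the spark implicits are assumed to be in scope):

case class Customer(customerId: String, forename: String, surname: String)
case class Transaction(customerId: String, accountId: String, amount: Long)
case class CustomerSummary(customerId: String, forename: String, surname: String,
  transactions: Seq[Transaction], transactionCount: Int,
  totalTransactionAmount: Double, averageTransactionAmount: Double)

// 1. Aggregate the transactions per customerId using only the Dataset API
val perCustomer = transactions
  .groupByKey(_.customerId)
  .mapGroups { case (id, iter) =>
    val list = iter.toList
    val total = list.map(_.amount).sum.toDouble
    (id, list, list.size, total, total / list.size)
  }

// 2. joinWith keeps both sides as typed objects; then map the pair into the output shape
val result = customers
  .joinWith(perCustomer, customers("customerId") === perCustomer("_1"))
  .map { case (c, (_, txs, count, total, avg)) =>
    CustomerSummary(c.customerId, c.forename, c.surname, txs, count, total, avg)
  }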
Hopefully these examples give you an idea of how you could solve your problem with the Dataset API and adapt them to your case.
Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields could be different in each monthly run. Based on this data in S3, we load it into a table and manually (as the number of fields can change in each run, with a few columns added or deleted) run a SQL query for a few metrics. There are more calculations/transforms on this data, but as a starter I'm presenting a simpler version of the use case.
Approach:
Considering the schema-less nature of the data (the number of fields in the S3 file can differ in each run as fields are added or deleted, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrames? The S3 file contains only the required fields for each run, so there is no issue reading the dynamic fields from S3; that is taken care of by the DataFrame. The issue is how to generate the DataFrame API/Spark SQL code to handle them.
I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL, but I don't think that avoids manually changing the Spark SQL when a new field is added in S3 in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer, month_1_count (here the dataframe directly points to S3, which has only the required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer, month_1_count, month_2_count (here the dataframe directly points to S3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could provide some direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you learn:
DataFrames store their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as in "column_search"
The agg function accepts multiple expressions in a dictionary, as explained in the PySpark documentation, which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means creating temporary dataframes just to use one piece of them, such as the column list as I do here, is not costly, even though it may seem inefficient if you're used to SQL.
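Since the question mentions exploring Spark/Scala, here is a rough Scala equivalent of the same idea; it is only a sketch and assumes the DataFrame is called original_df and that the columns to sum all contain "month" in their names:

import org.apache.spark.sql.functions.sum

// pick out the columns whose names contain "month"
val relevantColumns = original_df.columns.filter(_.contains("month"))

// build one sum(...) expression per relevant column
val aggExprs = relevantColumns.map(c => sum(c).as(s"sum_$c"))

// group by customer and apply all the sums at once
val grouped = original_df.groupBy("customer").agg(aggExprs.head, aggExprs.tail: _*)
grouped.show()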
I have a DataFrame in Spark with sample accounts, which has 5 different columns.
val sampledf = sqlContext.sql("select * from Sampledf")
I have another table in an Oracle DB, OracleTable, which has millions of records.
I want to filter the accounts present in OracleTable with respect to SampleDF:
Select * from OracleTable where column in (select column from SampleDF)
I realized that in Oracle we cannot provide more than 1000 values in an IN condition,
and the subquery approach is not working due to the huge amount of data in OracleTable.
I want to achieve the query below:
select column from OracleTable where (acctnum in (1,2,3,...1000) or acctnum in (1001,....2000) ....
Basically all the accounts from SampleDF, in batches of 1000 accounts.
Since we can't give more than 1000 at once (that's the Oracle limitation), we can give 1000 at a time.
How can I generate this kind of dynamic query? Do I need to create an Array from the DataFrame?
I just need a workaround. How can I proceed? Any suggestions would be helpful.
A broadcast join is the best option; it will broadcast the smaller dataframe across the cluster. As for reading the Oracle data taking time, that might be due to profile restrictions on the number of parallel sessions.
See the workaround below to build a dynamic IN condition.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val newsampledf = sampledf.withColumn("seq", row_number().over(Window.orderBy("yourcolumn"))).select("yourcolumn", "seq")
val cnt = newsampledf.count()
var j = 0L
while (j < cnt) {
  val batch = newsampledf.select("yourcolumn").where(col("seq") > j && col("seq") <= j + 999)
  j = j + 999
}
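To turn each batch into the dynamic query, one way (only a sketch; the names inClauses and oracleQuery are mine, and acctnum/column are the placeholders from the question) is to collect each batch on the driver, build an IN (...) group from it inside the loop, and OR the groups together afterwards:

val inClauses = scala.collection.mutable.ArrayBuffer[String]()
// inside the while loop, after computing `batch`:
val values = batch.collect().map(_.get(0)).mkString(",")
inClauses += s"acctnum IN ($values)"
// after the loop, OR all the groups into one WHERE clause
val oracleQuery = s"select column from OracleTable where ${inClauses.mkString(" or ")}"

This stays under Oracle's 1000-value limit per IN list while still covering every account from SampleDF.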
You can try to join the two tables based on the column.
Load the Oracle table as a dataframe.
Join the oracleDF with sampleDF:
val resultDF = oracleDF.join(sampleDF, Seq("column"))
Use broadcast if sampleDF is small, for better performance:
val resultDF = oracleDF.join(broadcast(sampleDF), Seq("column"))
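For the "load the Oracle table as a dataframe" step, a minimal sketch using Spark's JDBC source could look like the following; the connection settings here are placeholders, not values from the original post:

import org.apache.spark.sql.functions.broadcast

// read the Oracle table over JDBC (placeholder connection settings)
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "OracleTable")
  .option("user", "username")
  .option("password", "password")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// broadcast the small sample dataframe so the join avoids shuffling the big table
val resultDF = oracleDF.join(broadcast(sampleDF), Seq("column"))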
Hope it helps you.
We have customer data in a Hive table and sales data in another Hive table, which has data in TB's. We are trying to pull the sales data for multiple customers and save it to a file.
What we tried so far:
We tried a left outer join between the customer and sales tables, but because of the huge sales data it is not working.
val data = customer.join(sales, customer("id") === sales("customerID"), "left_outer")
So the alternative is to pull the data from the sales table based on a specific customer region list, check whether that region's data contains the customer data, and if it does, save it to another dataframe, loading the data into that same dataframe for all the regions.
My question here is whether this kind of repeated insert of data into a dataframe is supported in Spark.
If the sales dataframe is larger than the customer dataframe then you could simply switch the order of the dataframes in the join operation.
val data = sales.join(customer, customer("id") === sales("customerID"), "left_outer")
You could also add a hint for Spark to broadcast the smaller dataframe, though I believe it needs to be smaller than 2GB:
import org.apache.spark.sql.functions.broadcast
val data = sales.join(broadcast(customer), customer("id") === sales("customerID"), "left_outer")
Iteratively merging dataframes, as in your other approach, is also possible. For this purpose you can use the union method (Spark 2.0+) or unionAll (older versions), which appends one dataframe to another. In the case where you have a list of dataframes that you want to merge with each other, you can use union together with reduce:
val dataframes = Seq(df1, df2, df3)
dataframes.reduce(_ union _)
I'm having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing, stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
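For step 2 of your workflow (using the unique ids to extract all events for those machines from another DataFrame), a left semi join works well. This is just a sketch, and the name allEvents for the other DataFrame is mine, not from the question:

// keep only the rows of allEvents whose machine_id appears in machineId
val eventsForThingMachines = allEvents.join(
  machineId,
  allEvents("machine_id") === machineId("machine_id"),
  "leftsemi")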
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert it back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
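For example (just a sketch, reusing the logs DataFrame from the question and assuming the $ column syntax is already in scope as in your snippet), passing an aggregate to agg gives you a regular DataFrame back:

import org.apache.spark.sql.functions.count

val machineId = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("machine_id").as("events"))  // agg converts GroupedData back to a DataFrame
  .select("machine_id")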
A groupBy in Spark followed by an aggregation and then a select will return a data frame. For your example it should be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")