Incremental inserts using Spark / pass a DataFrame value to another Spark SQL query - PySpark

I'd like to implement incremental inserts using Spark. The approach I'm using is to get the max value of the primary key and use it as the starting point. The problem is how to convert the DataFrame result and pass it to another query.
id=sqlContext.sql("""select max(requestid) as maxr from st""")
Then I pass id to the query below:
query="""select requestid as maxr from st where requestid='{}'""".format(id)
dfjoin = sqlContext.sql(query)
dfjoin.show()
I'd like the value of id and maxr to be the same. Doing the above doesn't seem to work and doesn't return any value.
Please help. Thanks

Thanks, the following worked for me:
id=sqlContext.sql("""select max(requestid) as maxr from st""").collect()[0][0]
collect() materializes the query result and [0][0] extracts the scalar from the first row, so id is now a plain value rather than a DataFrame and can be substituted into the second query.

Related

Column names from repartition() to coalesce()

I have existing code as below in Scala/Spark, where I am supposed to replace repartition() with coalesce(). But after changing to coalesce it no longer compiles, reporting a datatype mismatch because it treats the arguments as column names.
How could I change the existing code to use coalesce (with column names), or is there no way to do it?
As I am new to Scala, any suggestion would help and be appreciated. Do let me know if you need any more details. Thanks!
val accountList = AccountList(MAPR_DB, src_accountList).filterByAccountType("GAMMA")
.fetchOnlyAccountsToProcess.df
.repartition($"Account", $"SecurityNo", $"ActivityDate")
val accountNos = broadcast(accountList.select($"AccountNo", $"Account").distinct)
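For what it's worth, Dataset.coalesce only accepts a target partition count, so there is no column-based variant of it; the coalesce that takes Columns is the unrelated SQL function in org.apache.spark.sql.functions, which returns the first non-null value per row. A minimal sketch of the distinction, using a hypothetical DataFrame in place of accountList:
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for accountList; only the column names matter here.
    val df = Seq(("A1", "S1", "2020-01-01")).toDF("Account", "SecurityNo", "ActivityDate")

    // repartition can hash-partition by column expressions (what the existing code does):
    val byColumns = df.repartition($"Account", $"SecurityNo", $"ActivityDate")

    // coalesce only reduces the number of partitions; passing $"Account" will not compile
    // because the signature is coalesce(numPartitions: Int):
    val fewerPartitions = byColumns.coalesce(10)

    fewerPartitions.show()
    spark.stop()
  }
}
In other words, if the data must stay clustered on those three columns, keep the repartition call; coalesce can only be applied afterwards to shrink the partition count without a full shuffle.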

How to do a groupBy rank and add it as a column to an existing dataframe in Spark Scala?

Currently this is what I'm doing:
val new_df = old_df.groupBy("column1").count().withColumnRenamed("count", "column1_count")
val new_df_rankings = new_df.withColumn(
  "column1_count_rank",
  dense_rank().over(Window.orderBy($"column1_count".desc))
).select("column1_count", "column1_count_rank")
But really all I'm looking to do is add a column to the original df (old_df) called "column1_count_rank" without going through all these intermediate steps and merging back.
Is there a way to do this?
Thanks and have a great day!
When you apply an aggregation, a computed result is produced and it creates a new dataframe.
Can you give some sample input and expected output?
old_df.groupBy("column1")
  .agg(count("*").alias("column1_count"))
  .withColumn("column1_count_rank", dense_rank().over(Window.orderBy($"column1_count".desc)))
  .select("column1_count", "column1_count_rank")
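If the goal is to keep every row of old_df and simply attach the count and rank columns, one possible approach (a sketch, not taken from the answer above) is to compute the counts and ranks once and join them back on column1:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, dense_rank}

object AddRankColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("add-rank-column").getOrCreate()
    import spark.implicits._

    // Hypothetical old_df; only "column1" matters for this example.
    val old_df = Seq("a", "a", "b", "c", "c", "c").toDF("column1")

    // Count per value and rank the counts (same idea as the one-liner above).
    val counts = old_df
      .groupBy("column1")
      .agg(count("*").alias("column1_count"))
      .withColumn("column1_count_rank",
        dense_rank().over(Window.orderBy($"column1_count".desc)))

    // Join the counts and ranks back so every original row keeps its columns.
    val withRank = old_df.join(counts, Seq("column1"), "left")
    withRank.show()
    spark.stop()
  }
}
Note that Window.orderBy without a partitionBy pulls all rows for the ranking into a single partition, which is fine for a modest number of distinct values but worth keeping in mind.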

Comparing columns in two data frames in Spark

I have two dataframes that contain different numbers of columns.
I need to compare three fields between them to check whether they are equal.
I tried the following approach but it's not working.
if (df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) ||
    df_table_stats("hashcount").equals(df_aud("HASH_CNT")) ||
    round(df_table_stats("hashsum"), 0).equals(round(df_aud("HASH_TTL"), 0))) {
  println("Job executed successfully")
}
df_table_stats("rec_cnt"), this returns Column rather than actual value hence condition becoming false.
Also, please explain difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both dataframes with your conditions.
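A possible sketch of that suggestion, continuing with the df_table_stats and df_aud frames from the question (the round import and the count-based check are assumptions, not part of the original answer):
import org.apache.spark.sql.functions.round

// Inner join on the three equality conditions; a non-empty result means the fields match.
val matched = df_table_stats.join(
  df_aud,
  df_table_stats("rec_cnt") === df_aud("REC_CNT") &&
    df_table_stats("hashcount") === df_aud("HASH_CNT") &&
    round(df_table_stats("hashsum"), 0) === round(df_aud("HASH_TTL"), 0),
  "inner"
)

if (matched.count() > 0) {
  println("Job executed successfully")
}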
Per my comment, the expressions you're using are simple column references; they don't actually return data. Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
//repeat for the other values you need to capture
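To finish the comparison with the extracted values, a possible continuation (assuming rec_cnt is an Int and the hash sums are Doubles; adjust the getAs types to your schema):
val tableStatsHashSum = tableStatsRow.getAs[Double]("hashsum")
val audHashSum = audRow.getAs[Double]("HASH_TTL")

// Compare plain values instead of Column objects; the rounding mirrors the original condition.
if (tableStatsRecCount == audRecCount &&
    math.round(tableStatsHashSum) == math.round(audHashSum)) {
  println("Job executed successfully")
}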
However, Spark definitely is overkill if this is all you're using it for. You could use a simple JDBC library for Scala like ScalikeJDBC to do these queries and capture the primitives in the results.

Using groupBy in Spark and getting back to a DataFrame

I'm having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
A groupBy in Spark followed by an aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")

Separate all values from Iterable, Apache Spark

I have grouped all my customers in a JavaPairRDD<Long, Iterable<ProductBean>> by their customerId (of Long type). This means every customerId has a list of ProductBean.
Now I want to save all ProductBean objects to the DB irrespective of customerId. I got all the values by using the method
JavaRDD<Iterable<ProductBean>> values = custGroupRDD.values();
Now I want to convert JavaRDD<Iterable<ProductBean>> to JavaPairRDD<Object, BSONObject> so that I can save it to Mongo. Remember, every BSONObject is made from a single ProductBean.
I am not getting any idea of how to do this in Spark, I mean which Spark transformation is used to do that job. I think this task is some kind of "separate all values from an Iterable". Please let me know how this is possible.
Any hint in Scala or Python is also OK.
You can use the flatMapValues function:
JavaPairRDD<Long, ProductBean> result = custGroupRDD.flatMapValues(v -> v);
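Since the question says Scala hints are also fine, here is a minimal sketch of the same flattening in Scala. The ProductBean case class and the toBson helper are hypothetical stand-ins (the real bean and BSON mapping depend on your code and on the Mongo/BSON classes on your classpath):
import org.apache.spark.SparkContext
import org.bson.{BSONObject, BasicBSONObject}

// Hypothetical bean standing in for the ProductBean from the question.
case class ProductBean(productId: Long, name: String)

object FlattenGroupedValues {
  // Hypothetical converter from a ProductBean to a BSONObject.
  def toBson(p: ProductBean): BSONObject = {
    val doc = new BasicBSONObject()
    doc.put("productId", java.lang.Long.valueOf(p.productId))
    doc.put("name", p.name)
    doc
  }

  def run(sc: SparkContext): Unit = {
    val custGroup = sc.parallelize(Seq(
      (1L, Iterable(ProductBean(10L, "a"), ProductBean(11L, "b"))),
      (2L, Iterable(ProductBean(20L, "c")))
    ))

    // flatMapValues "un-groups" the Iterable: one (customerId, ProductBean) pair per bean.
    val flattened = custGroup.flatMapValues(identity)

    // Drop the customerId and build (key, BSONObject) pairs for saving; the null key is a
    // placeholder, replace it with whatever key your Mongo output format expects.
    val forMongo = flattened.map { case (_, bean) => (null: Any, toBson(bean)) }
    forMongo.foreach(println)
  }
}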