Spark session application formatting in Scala

I am new to Spark. I wrote some code in Scala and executed it in spark-shell.
However, I want all of the code in a single Spark application. I tried to format it as much as possible, but I am still getting formatting errors. Can somebody debug it entirely?
Kindly ignore the comments; they are the questions I need to solve.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql
object functions {
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("aadhaar_data.csv").toDF("date", "registrar", "private_agency", "state", "district", "sub_district", "pincode", "gender", "age", "aadhaar_generated", "rejected", "mobile_number", "email_id")
df.registerTempTable("data")
//1. View/result of the top 25 rows from each individual store.
spark.sql("select * from (select *,row_number() over (partition by private_agency order by private_agency desc) as Row_Num from data) as b where b.Row_Num<=25").show(30)
//Checkpoint 2
//1. Describe the schema
df.printSchema()
//2. Find the count and names of registrars in the table.
df.select("registrar").distinct().show()
df.select("registrar").distinct().count()
//3. Find the number of states, districts in each state and sub-districts in each district.
df.select("state").distinct().count()
spark.sql("SELECT state, COUNT(district) AS district_count FROM data GROUP BY state ORDER BY COUNT(district) DESC").show()
spark.sql("SELECT district, COUNT(sub_district) AS sub_district_count FROM data GROUP BY district ORDER BY COUNT(sub_district) DESC").show()
//4. Find out the names of private agencies for each state.
spark.sql("SELECT state, private_agency FROM data GROUP BY state, private_agency ORDER BY state").show(2000, false)
//Checkpoint3
//1. Find top 3 states generating most number of Aadhaar cards?
spark.sqlContext.sql("SELECT state, SUM(aadhaar_generated) AS aadhaar_count FROM data GROUP BY state ORDER BY aadhaar_count DESC LIMIT 3").show()
//2. Find top 3 districts where enrolment numbers are maximum?
val generated = df.groupBy("district").sum("aadhaar_generated")
val rejected = df.groupBy("district").sum("rejected")
val concat = generated.withColumn("id", monotonically_increasing_id()).join(rejected.withColumn("id", monotonically_increasing_id()), Seq("id")).drop("id")
val final = concat.withColumn("Sum_Value", $"sum(aadhaar_generated)" + $"sum(rejected)")
println("top 3 districts where enrolment numbers are maximum along with the number of enrolments")
final.show(3, false)
//3. Find the no. of Aadhaar cards generated in each state?
spark.sqlContext.sql("SELECT state, SUM(aadhaar_generated) AS aadhaar_count FROM data GROUP BY state").show()
//Checkpoint 4:
//1. Find the number of unique pincodes in the data?
df.select("pincode").distinct.show()
//2. Find the number of Aadhaar registrations rejected in Uttar Pradesh and Maharashtra?
spark.sqlContext.sql("SELECT state, SUM(rejected) AS rejected_count FROM data GROUP BY state having (state='Uttar Pradesh' OR state='Maharashtra')").show()
PS: I tried to make the application in IntelliJ IDEA.

As you mentioned, you are facing an issue with groupBy and withColumn.
Find below the syntax for groupBy.
Syntax:
df1.groupBy("Grouping_Column").agg(sum(column_to_sum).alias(Some_new_column_name))
For withColumn, it is probably val concat that is throwing the issue. Follow the syntax for defining withColumn and you will get it.
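For reference, a minimal sketch of that pattern applied to the question's DataFrame, joining on district directly instead of a generated id, and avoiding final, which is a reserved word in Scala (the val names are just illustrative):
val generatedDF = df.groupBy("district").agg(sum("aadhaar_generated").alias("total_generated"))
val rejectedDF = df.groupBy("district").agg(sum("rejected").alias("total_rejected"))
// Join on the grouping column and add the combined total as a new column
val enrolments = generatedDF.join(rejectedDF, Seq("district"))
  .withColumn("Sum_Value", col("total_generated") + col("total_rejected"))
enrolments.orderBy(desc("Sum_Value")).show(3, false)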

Related

Best way to update a dataframe in Spark scala

Consider two DataFrames, data_df and update_df. These two dataframes have the same schema (key, update_time, bunch of columns).
I know two (main) ways to "update" data_df with update_df:
full outer join
I join the two dataframes (on key) and then pick the appropriate columns (according to the value of update_time).
max over partition
Union both dataframes, compute the max update_time by key and then filter only rows that equal this maximum.
Here are the questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some Open Data
Here is the join code
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df.where($"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
  .select(col("data.*"))
  .union(
    join_df.where($"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
      .select(col("maj.*")))
And here is the window code
import org.apache.spark.sql.expressions._
val byKey = Window.partitionBy($"key") // orderBy is implicit here
res_df = data_df.union(maj_df)
  .withColumn("max_version", max("update_time").over(byKey))
  .where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update dataframe has only one version per key.
PS: I'm aware of the Apache Delta solution but sadly I'm not able to use it.
Below is one way of doing it, joining only on the keys, in an effort to minimize the amount of memory used by the filters and join commands.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are similar between both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
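A left_anti join can express the same "keep only the originals whose Key is not in the update" step without collecting the keys to the driver. A minimal sketch on the same DataFrames (the val names here are just illustrative):
val unchangedDF = originalDF.join(updateDF, Seq("Key"), "left_anti")
val mergedDF = unchangedDF.unionAll(updateDF)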

DropDuplicates is not giving expected result

I am working on a use case of removing duplicate records from incoming structured data (in the form of CSV files within a folder on HDFS). To try this use case, I wrote some sample code using the file source to see if duplicates can be removed from the records in the CSVs that are copied to the folder (HDFS).
Find the code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
val userSchema = new StructType()
.add("prod_code", "string")
.add("bal", "integer")
.add("v_txn_id", "string")
.add("timestamp", "Timestamp")
val csvDF = spark.readStream.option("sep", ",")
.schema(userSchema)
.csv("/user/Temp")
csvDF.dropDuplicates("v_txn_id")
csvDF.createOrReplaceTempView("table1")
val dbDf2 = spark.sql("select prod_code, bal, v_txn_id, current_timestamp timestamp from table1")
dbDf2.writeStream.queryName("aggregates").outputMode("update").format("memory").start()
spark.sql("select * from aggregates").show();
Now, when I copy a file into the folder with duplicate records (by v_txn_id), I still see that the result sink gets all the rows from the file:
P1,1000,TXNID1
P1,2000,TXNID2
P1,3000,TXNID2
P1,4000,TXNID3
P1,5000,TXNID3
P1,6000,TXNID4
All these rows in the csv file get moved to the result "aggregates". What I am expecting is:
P1,1000,TXNID1
P1,3000,TXNID2
P1,5000,TXNID3
P1,6000,TXNID4
This is the first time I am attempting structured streaming (with state), so pardon me for the trivial question. Any suggestions would help a lot.
As per your expected output, I believe that you need to find the max of bal based on the prod_code and v_txn_id columns. To achieve your output, on your final aggregate table you can use a window function (partition by) to find the max of bal based on the prod_code and v_txn_id columns by creating a temporary column called temp_bal. Then, in the outer query, select distinct values based on the prod_code, temp_bal and v_txn_id columns.
spark.sql("select distinct prod_code,temp_bal as bal,v_txn_id from(select *,max(bal) over(partition by prod_code,v_txn_id) as temp_bal from aggregates) order by prod_code,v_txn_id").show()
EDIT 1 :
As per your requirement, please find below the script that will work according to the latest date/time for each v_txn_id.
spark.sql("select distinct a.prod_code,a.bal,a.v_txn_id from aggregates a join (select distinct v_txn_id,max(timestamp) over(partition by v_txn_id) as temp_timestamp from aggregates) b on a.v_txn_id=b.v_txn_id and a.timestamp=b.temp_timestamp order by a.v_txn_id").show()
Please let me know if you have any questions, else please mark this answer as accepted (tick icon).
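It may also be worth double-checking how dropDuplicates is called: it returns a new DataFrame rather than modifying csvDF in place, so the deduplicated result has to be kept and registered. A minimal sketch (the val name is just illustrative):
val dedupDF = csvDF.dropDuplicates("v_txn_id")
dedupDF.createOrReplaceTempView("table1")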

Spark: efficient way to search another dataframe

I have one dataframe (df) with IP addresses and their corresponding long value (ip_int), and now I want to search another dataframe (ip2Country), which contains geolocation information, to find their corresponding country name. How should I do it in Scala? My code currently doesn't work; it exceeds the memory limit.
val ip_ints=df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for (v <- ip_ints) {
  var ip_int = v.toString.toLong
  df_list += ip2Country.filter(($"network_start_integer" <= ip_int) && ($"network_last_integer" >= ip_int)).select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df=df.join(df1,Seq("ip_int"),"left")
Basically I try to iterate through every ip_int value and search them in ip2Country and merge them back with df.
Any help is much appreciated!
A simple join should do the trick for you
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
If you want to remove the null country_name rows, then you can add a filter too:
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
  .filter($"country_name".isNotNull)
I hope the answer is helpful
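As a side note, if the geolocation table (df1 in the snippets above) is small enough to fit in memory, a broadcast hint may speed up this range join considerably. A minimal sketch, assuming the same column names:
import org.apache.spark.sql.functions.broadcast
df.join(broadcast(df1), df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")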
You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
df.select("network_start_integer", "network_start_integer", "country_name").createOrReplaceTempView("ip_int_lookup")
// val spark: SparkSession
val result: DataFrame = spark.sql("select a.*, b.country_name from ip_int a, ip_int_lookup b where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int)
If you want to include null ip_int, you will need to right join df to result.
I feel puzzled here.
df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int")
Can we use
df1("network_start_integer")===df("ip_int")
here instead?

value head is not a member of org.apache.spark.sql.Row

I am executing the Twitter sample code, and I am getting the error value head is not a member of org.apache.spark.sql.Row. Can someone please explain this error in a bit more detail?
val tweets = sc.textFile(tweetInput)
println("------------Sample JSON Tweets-------")
for (tweet <- tweets.take(5)) {
println(gson.toJson(jsonParser.parse(tweet)))
}
val tweetTable = sqlContext.jsonFile(tweetInput).cache()
tweetTable.registerTempTable("tweetTable")
println("------Tweet table Schema---")
tweetTable.printSchema()
println("----Sample Tweet Text-----")
sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)
println("------Sample Lang, Name, text---")
sqlContext.sql("SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000").collect().foreach(println)
println("------Total count by languages Lang, count(*)---")
sqlContext.sql("SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25").collect.foreach(println)
println("--- Training the model and persist it")
val texts = sqlContext.sql("SELECT text from tweetTable").map(_.head.toString)
// Cache the vectors RDD since it will be used for all the KMeans iterations.
val vectors = texts.map(Utils.featurize).cache()
I think your problem is that the sql method returns a Dataset of Rows. Therefore, the _ represents a Row, and Row doesn't have a head method (which explains the error message).
To access items in a Row you can do one of the following:
// get the first element in the Row
val texts = sqlContext.sql("...").map(_.get(0))
// get the first element as an Int
val texts = sqlContext.sql("...").map(_.getInt(0))
See here for more info: https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html
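Since text is a string column, getString is probably the closest drop-in replacement for the original _.head.toString. A minimal sketch:
val texts = sqlContext.sql("SELECT text FROM tweetTable").map(_.getString(0))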

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I used a full outer join, but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
val features = df1.columns.toSet - "cust_id" // Remove "cust_id" column
val newDF = features.foldLeft(df2)(
(df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)
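One caveat with that last line: unionAll matches columns by position, and features is a Set, so the columns added by foldLeft may not come out in df1's order. It may be safer to align newDF with df1's columns first. A minimal sketch, assuming col is imported from org.apache.spark.sql.functions:
df1.unionAll(newDF.select(df1.columns.map(col): _*))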