Scala sum and SQL query sum giving different count. The sum for SQL query is showing very high has compared to scala query output. Any reason why it will be different
%scala
import org.apache.spark.sql.functions._
val df = spark.sql( "select * from view_error_log")
val grpdf = df.groupby("id","error_description","table", "log_date").agg(sum("totalRows").alias("totalRows"))
%sql
Select id,error_description,table,log_date,sum(totalRows) as TotalRows from
view_error_log group by id,error_description,table,log_date
Related
I am working on a use-case of removing duplicate records from incoming structured data (in the form of CSV files within a folder on HDFS). In order to try this use-case, I wrote some sample code using files option to see if duplicates can be removed from the records that are present in the CSVs that are copied to the folder (HDFS).
Find below the codepiece:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
val userSchema = new StructType()
.add("prod_code", "string")
.add("bal", "integer")
.add("v_txn_id", "string")
.add("timestamp", "Timestamp")
val csvDF = spark.readStream.option("sep", ",")
.schema(userSchema)
.csv("/user/Temp")
csvDF.dropDuplicates("v_txn_id")
csvDF.createOrReplaceTempView("table1")
val dbDf2 = spark.sql("select prod_code, bal, v_txn_id, current_timestamp timestamp from table1")
dbDf2.writeStream.queryName("aggregates").outputMode("update").format("memory").start()
spark.sql("select * from aggregates").show();
Now, when I copy a file in the folder with duplicate records (by v_txn_id), i still see that the result sink gets all the rows from the file:
P1,1000,TXNID1
P1,2000,TXNID2
P1,3000,TXNID2
P1,4000,TXNID3
P1,5000,TXNID3
P1,6000,TXNID4
All these rows in the csv file get moved to the result "aggregates". What I am expecting is:
P1,1000,TXNID1
P1,3000,TXNID2
P1,5000,TXNID3
P1,6000,TXNID4
This is the first time I am attempting structured streaming (with state), so pardon me for trivial question. Any suggestions would help a lot.
As per you expected output, I believe that you need to find the max of bal based on prod_code and v_txn_id column. To achieve you output, on your final aggregate table, you can use a window funtion (partition by) to find the max of bal based on prod_code and v_txn_id column by created a temporary column called temp_bal. Then in the outer query select distinct values based on prod_code, temp_bal and v_txn_id columns.
spark.sql("select distinct prod_code,temp_bal as bal,v_txn_id from(select *,max(bal) over(partition by prod_code,v_txn_id) as temp_bal from aggregates) order by prod_code,v_txn_id").show()
EDIT 1 :
As per your requirment please find the below script that will work according to the latest date/time for the v_txn_id.
spark.sql("select distinct a.prod_code,a.bal,a.v_txn_id from aggregates a join (select distinct v_txn_id,max(timestamp) over(partition by v_txn_id) as temp_timestamp from aggregates) b on a.v_txn_id=b.v_txn_id and a.timestamp=b.temp_timestamp order by a.v_txn_id").show()
Please let me know if you have any questions, else please mark this answer as accepted (tick icon).
I want to iterate across the columns of dataframe in my Spark program and calculate min and max value.
I'm new to Spark and scala and not able to iterate over the columns once I fetch it in a dataframe.
I have tried running the below code but it needs column number to be passed to it, question is how do I fetch it from dataframe and pass it dynamically and store the result in a collection.
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating on rows or records. You should be using aggregation function
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First when you do spark.read.parquet you are reading a dataframe.
Next we define the column we want to work on using the col function. The col function translate a column name to a column. You could instead use df("name") where name is the name of the column.
The agg function takes aggregation columns so min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns including min, max, avg, count and stddev
I use scala/ spark to insert data into a Hive parquet table as follows
for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
}
}
This approach takes a lot of time to complete because the select statement is being executed twice.
I'm trying to avoid selecting data twice and one way I've thought of is writing the dataframe myDf to the table directly.
This is the gist of the code I'm trying to use for the purpose
val sparkConf = new SparkConf().setAppName("myApp")
.set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
val myDf = hiveContext.sql("SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id")
val count: Int = myDf.count().toInt
if(count>0){
myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
}
}
But I get an error in the myDf.write part.
java.util.NoSuchElementException: key not found: period_id
The destination table is partitioned by period_id.
Could someone help me with this?
The spark version I'm using is 1.5.0-cdh5.5.2.
The dataframe schema and table's description differs from each other. The PERIOD_ID != period_id column name is Upper case in your DF but in UPPER case in table. Try in sql with lowercase the period_id
is there any way to group by table in sql spark which selects multiple elements
code i am using:
val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")
now doing group by like :
val sqlDF = spark.sql(
"SELECT count(customerId) FROM GETBYID group by customerId");
but when I try:
val sqlDF = spark.sql(
"SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");
Spark gives an error :
org.apache.spark.sql.AnalysisException: expression 'getbyid.userId'
is neither present in the group by, nor is it an aggregate function.
Add to group by or wrap in first() (or first_value) if you don't care
which value you get.;
is there any possible way to do that
Yes, it's possible and the error message you attached describes all the possibilities. You can either add the userId to groupBy:
val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");
or use first():
val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
And if you want to keep all the occurences of userId, you can do this :
spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")
By using collect_list.
I am using following to find the max column value.
val d = sqlContext.sql("select max(date), id from myTable group By id")
How to do the same query on DataFrame without registering temp table.
thanks,
Direct translation to DataFrame Scala API:
df.groupBy("id").agg(max("date"))
Spark 2.2.0 execution plan is identical for both OP's SQL & DF scenarios.
Full code for spark-shell:
Seq((1, "2011-1-1"), (2, "2011-1-2")).toDF("id", "date_str").withColumn("date", $"date_str".cast("date")).write.parquet("tmp")
var df = spark.read.parquet("tmp")
df.groupBy("id").agg(max("date")).explain
df.createTempView("myTable")
spark.sql("select max(date), id from myTable group By id").explain
If you would like to translate that sql to code to be used with a dataframe, you could do something like:
df.groupBy("id").max("date").show()
For max use
df.describe(Columnname).filter("summary = 'max'").collect()[0].get(1))
And for min use
df.describe(Columnname).filter("summary = 'min'").collect()[0].get(1))
if you have a dataframe with id and date column, what you can do n spark 2.0.1 is
from pyspark.sql.functions import max
mydf.groupBy('date').agg({'id':'max'}).show()
var maxValue = myTable.select("date").rdd.max()