Is there any way to do a group by on a table in Spark SQL while selecting multiple columns?
The code I am using:
val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")
Now I am doing a group by like this:
val sqlDF = spark.sql(
"SELECT count(customerId) FROM GETBYID group by customerId");
but when I try:
val sqlDF = spark.sql(
"SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");
Spark gives an error:
org.apache.spark.sql.AnalysisException: expression 'getbyid.userId'
is neither present in the group by, nor is it an aggregate function.
Add to group by or wrap in first() (or first_value) if you don't care
which value you get.;
Is there any possible way to do that?
Yes, it's possible, and the error message you attached describes all the possibilities. You can either add userId to the group by:
val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");
or use first():
val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
And if you want to keep all the occurrences of userId, you can use collect_list:
spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")
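If you prefer the DataFrame API over raw SQL, a roughly equivalent sketch (my own translation, assuming df is the dataframe read above) would be:
import org.apache.spark.sql.functions.{count, collect_list}
// one row per customerId, with its row count and every userId seen for it
val aggDF = df.groupBy("customerId")
  .agg(count("customerId").as("cnt"), collect_list("userId").as("userIds"))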
Related
I tried to join two dataframes in spark shell.
One of the dataframes has 15000 records and the other has 14000 rows.
I tried a left outer join and an inner join of these dataframes, but the result has a count of 29000 rows.
How is that happening?
The code I tried is given below.
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "left_outer").select(($"df1.*"),col("df2.BatchKey").as("B2"))
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "inner").select(($"df1.*"),col("df2.BatchKey").as("B2"))
Both of the above methods resulted in a dataframe whose count is the sum of both dataframes.
I even tried the method below, but I am still getting the same result.
finaldf.createOrReplaceTempView("df1")
cosmos.createOrReplaceTempView("df2")
val test = spark.sql("""SELECT df1.*, df2.* FROM df1 LEFT OUTER JOIN df2 ON trim(df1.BatchKey) == trim(df2.BatchKey)""")
If I try to add more conditions to the join, the count increases again.
How can I get the exact result for a left outer join?
Here the max count should be 15000.
Antony, can you try performing the join below:
val joineddf = finaldf.join(cosmos.select("BatchKey"), Seq("BatchKey"), "left_outer")
Here I'm not using any alias.
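One likely explanation for the inflated count (an assumption on my part, since I can't see your data) is duplicate BatchKey values on the right side: a left outer join produces one output row per matching pair, so a key that appears twice in cosmos doubles the corresponding rows from finaldf. A quick way to check, and to deduplicate the right side before joining:
// how many BatchKey values occur more than once in cosmos?
cosmos.groupBy("BatchKey").count().filter($"count" > 1).show()
// drop the duplicates so each finaldf row matches at most one cosmos row
val cosmosDedup = cosmos.select("BatchKey").dropDuplicates("BatchKey")
val joineddf = finaldf.join(cosmosDedup, Seq("BatchKey"), "left_outer")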
Let's say I have two tables, one for students (tbl_students) and another for exams (tbl_exams). In vanilla SQL with a relational database, I can use an outer join to find the list of students who missed a particular exam, since their student_id won't match any row in the exam table for that particular exam_id. I could also insert the result of this outer join query into another table using the SELECT INTO syntax.
With that background, can I achieve a similar result using Spark SQL and Scala, where I can populate a dataframe using the result of an outer join? Example code is below (the code is not tested and may not run as is):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
//Create schema for single column
val schema = StructType(Seq(
  StructField("student_id", StringType, true)
))
//Create empty RDD of Row
val dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val joinDF = sqlContext.createDataFrame(dataRDD, schema)
joinDF.createOrReplaceTempView("tbl_students_missed_exam");
//Populate tbl_students_missed_exam dataframe using result of outer join
sparkSession.sql(s"""
SELECT tbl_students.student_id
INTO tbl_students_missed_exam
FROM tbl_students
LEFT OUTER JOIN tbl_exams ON tbl_students.student_id = tbl_exams.exam_id;""")
Thanks in advance for your input
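For what it's worth, Spark SQL does not support the SELECT INTO syntax, but the usual workaround is to run the join as a plain SELECT and then register (or write out) the resulting dataframe yourself. A rough, untested sketch of that idea, assuming tbl_students and tbl_exams are already registered as temp views and that student_id is the join key (both assumptions on my side):
// students with no matching row in the exam table
val missedDF = sparkSession.sql("""
  SELECT s.student_id
  FROM tbl_students s
  LEFT OUTER JOIN tbl_exams e ON s.student_id = e.student_id
  WHERE e.student_id IS NULL""")
// this plays the role of SELECT INTO: the result becomes queryable as a table
missedDF.createOrReplaceTempView("tbl_students_missed_exam")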
I have a dataframe (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I used a full outer join, but it generates two cust_id columns and I need only one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}
val features = df1.columns.filter(_ != "cust_id") // every column except cust_id
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))
)
// align the column order with df1 before taking the union
df1.union(newDF.select(df1.columns.map(col): _*))
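Alternatively (an assumption on my part, and only if all the feature columns are numeric), the nulls produced by the full outer join above could simply be replaced with zeros:
// fill nulls in numeric columns with 0 after the join
val filled = result.na.fill(0)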
I am doing a join of 2 data frames and selecting all columns of the left frame, for example:
val join_df = first_df.join(second_df, first_df("id") === second_df("id"), "left_outer")
In the above I want to select first_df.*. How can I select all columns of one frame in a join?
With alias:
first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
We can also do it with a left semi join. A left semi join keeps only the rows of the left-side dataframe that have a match on the right, and returns only the left-side columns.
Here we join the two dataframes df1 and df2 on column col1.
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
Suppose you:
Want to use the DataFrame syntax.
Want to select all columns from df1 but only a couple from df2.
This is cumbersome to list out explicitly due to the number of columns in df1.
Then, you might do the following:
val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Just to add one possibility: without using an alias, I was able to do that in pyspark with
first_df.join(second_df, "id", "left_outer").select(first_df["*"])
Not sure if it applies here, but I hope it helps.
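For completeness, the same alias-free pattern should work in Scala as well (an untested sketch on my side):
// reference all columns of the left dataframe through the original variable
first_df.join(second_df, Seq("id"), "left_outer").select(first_df.col("*"))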
I am using the following to find the max column value:
val d = sqlContext.sql("select max(date), id from myTable group By id")
How can I do the same query on a DataFrame without registering a temp table?
Thanks.
Direct translation to DataFrame Scala API:
df.groupBy("id").agg(max("date"))
The Spark 2.2.0 execution plan is identical for both the OP's SQL and the DataFrame scenario.
Full code for spark-shell:
Seq((1, "2011-1-1"), (2, "2011-1-2")).toDF("id", "date_str").withColumn("date", $"date_str".cast("date")).write.parquet("tmp")
var df = spark.read.parquet("tmp")
df.groupBy("id").agg(max("date")).explain
df.createTempView("myTable")
spark.sql("select max(date), id from myTable group By id").explain
If you would like to translate that SQL into code to be used with a DataFrame, you could do something like:
df.groupBy("id").max("date").show()
(As far as I know, the groupBy(...).max(...) shortcut only works on numeric columns, so for an actual date column you would need agg(max("date")) as shown above.)
For the max, use
df.describe(columnName).filter("summary = 'max'").collect()(0).get(1)
and for the min, use
df.describe(columnName).filter("summary = 'min'").collect()(0).get(1)
Note that describe() reports its statistics as strings, so the result may need to be cast.
If you have a dataframe with id and date columns, what you can do in Spark 2.0.1 is
from pyspark.sql.functions import max
mydf.groupBy('id').agg(max('date')).show()
// assuming the date column is an actual DateType; an explicit ordering is needed for the RDD max
val maxValue = myTable.select("date").rdd
  .map(_.getAs[java.sql.Date](0))
  .max()(Ordering.by(_.getTime))