Serialize table to nested JSON using Apache Spark SQL - scala

The question is the same as this one but can this:
df.withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID")).
select("VEHICLE","ACCOUNTNO"). //only select reqired columns
groupBy("ACCOUNTNO").
agg(collect_list("VEHICLE").as("VEHICLE")). //for the same group create a list of vehicles
toJSON. //convert to json
show(false)
be rewritten with pure SQL? I mean something like that:
val sqlDF = spark.sql("SELECT VEHICLE, ACCOUNTNO as collect_list(ACCOUNTNO) FROM VEHICLES group by ACCOUNTNO)
sqlDF.show()
Does it possible?

The SQL equivalent of your dataframe example would be:
scala> val df = Seq((10003014,"MH43AJ411",20000000),
| (10003014,"MH43AJ411",20000001),
| (10003015,"MH12GZ3392",20000002)
| ).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID").withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID"))
df: org.apache.spark.sql.DataFrame = [ACCOUNTNO: int, VEHICLENUMBER: string ... 2 more fields]
scala> df.registerTempTable("vehicles")
scala> val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) as ACCOUNT_LIST FROM VEHICLES group by ACCOUNTNO").toJSON
sqlDF: org.apache.spark.sql.Dataset[String] = [value: string]
scala> sqlDF.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"ACCOUNTNO":10003014,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
|{"ACCOUNTNO":10003015,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]} |
+-----------------------------------------------------------------------------------------------------------------------------------------------+

Related

Spark join dataframe based on column of different type spark 1.6

I have 2 dataframes df1 and df2 . I am joining both df1 and df2 based on columns col1 and col2. However the datatype of col1 is string in df1 and type of col2 is int in df2. When I try to join like below,
val df3 = df1.join(df2,df1("col1") === df2("col2"),inner).select(df2("col2"))
The join does not work and return empty datatype. Will it be possible to get proper output without changing type of col2 in df2
val dDF1 = List("1", "2", "3").toDF("col1")
val dDF2 = List(1, 2).toDF("col2")
val res1DF = dDF1.join(dDF2, dDF1.col("col1") === dDF2.col("col2").cast("string"), "inner")
.select(dDF2.col("col2"))
res1DF.printSchema()
res1DF.show(false)
// root
// |-- col2: integer (nullable = false)
//
// +----+
// |col2|
// +----+
// |1 |
// |2 |
// +----+
get schema DataFrame
val sch1 = dDF1.schema
sch1: org.apache.spark.sql.types.StructType = StructType(StructField(col1,StringType,true))
// public StructField(String name,
// DataType dataType,
// boolean nullable,
// Metadata metadata)

Replacing Spark SQL with Scala API calls

I've a simple LEFT JOIN
spark.sql(
s"""
|SELECT
| a.*,
| b.company_id AS companyId
|FROM profile_views a
|LEFT JOIN companies_info b
| ON a.memberId = b.member_id
|""".stripMargin
).createOrReplaceTempView("company_views")
How do I replace this with the scala API?
Try below code.
Below code will work for temp views as well as hive tables.
val profile_views = spark
.table("profile_views")
.as("a")
val companies_info = spark
.table("companies_info")
.select($"company_id".as("companyId"),$"member_id".as("memberId"))
.as("b")
profile_views
.join(companies_info,Seq("memberId"),"left")
.createOrReplaceTempView("company_views")
If you have already data in DataFrame, You can use below code.
profile_viewsDF.as("a")
.join(
companies_infoDF.select($"company_id".as("companyId"),$"member_id".as("memberId")).as("b"),
Seq("memberId"),
"left"
)
.createOrReplaceTempView("company_views")
Update : temp views can also called using spark.table(). Please check below code.
scala> val df = Seq(("srinivas",10)).toDF("name","age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.createTempView("person")
scala> spark.table("person").show
+--------+---+
| name|age|
+--------+---+
|srinivas| 10|
+--------+---+

How to append a string column to array string column in Scala Spark without using UDF?

I have a table which has a column containing array like this -
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with UDF as follows -
def append_list = (arr: Seq[String], s: String) => {
arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without udf ? Thanks.
In Spark 2.4 or later a combination of array and concat should do the trick,
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
(2, Array("Hindi", "Bio", "Eng"), "IoT"),
(3, Array("Python", "R", "scala"), "C")).toDF("Student_ID","Subject_List","New_Subject")
df.show(false)
val final_df = df.withColumn("exploded", explode($"Subject_List")).select($"Student_ID",$"exploded")
.union(df.select($"Student_ID",$"New_Subject"))
.groupBy($"Student_ID").agg(collect_list($"exploded") as "Your_New_List").show(false)
[enter code here][1]

Apache Spark SQL query and DataFrame as reference data

I have two Spark DataFrames:
cities DataFrame with the following column:
city
-----
London
Austin
bigCities DataFrame with the following column:
name
------
London
Cairo
I need to transform DataFrame cities and add an additional Boolean column there: bigCity Value of this column must be calculated based on the following condition "cities.city IN bigCities.name"
I can do this in the following way(with a static bigCities collection):
cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ['London', 'Cairo'] THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
but I don't know how to replace the static bigCities collection ['London', 'Cairo'] with bigCities DataFrame in the query. I want to use bigCities as the reference data in the query.
Please advise how to achieve this.
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
drop("name")
You can use collect_list() on the the bigCities table. Check this out
scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
scala>
If the dataset is big, you can use collect_set which will be more efficient.
scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
scala>

Selecting several columns from spark dataframe with a list of columns as a start

Assuming that I have a list of spark columns and a spark dataframe df, what is the appropriate snippet of code in order to select a subdataframe containing only the columns in the list?
Something similar to maybe:
var needed_column: List[Column]=List[Column](new Column("a"),new Column("b"))
df(needed_columns)
I wanted to get the columns names then select them using the following line of code.
Unfortunately, the column name seems to be in write mode only.
df.select(needed_columns.head.as(String),needed_columns.tail: _*)
Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// | a| b|
// +---+---+
// | 1| x|
// | 2| y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select
df.select(needed_col_names.map(col): _*)
I understand that you want to select only those columns from a list(A)other than the dataframe columns. I have a below example, where I select the firstname and lastname using a separate list. check this out
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+
scala>