I have a dataframe with multiple records, and I want to create multiple JSON files based on a column in the dataframe. The files already exist, so I want to append to them.
val emp_seq = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val empDf = emp_seq.toDF("employee_name", "department", "state", "salary", "age", "bonus")
val msgDf = empDf.select($"department", to_json(struct($"employee_name", $"state", $"salary", $"age", $"bonus")).alias("message"))
Output
+----------+------------------------------------------------------------------------------+
|department|message |
+----------+------------------------------------------------------------------------------+
|Sales |{"employee_name":"James","state":"NY","salary":90000,"age":34,"bonus":10000} |
|Sales |{"employee_name":"Michael","state":"NY","salary":86000,"age":56,"bonus":20000}|
|Sales |{"employee_name":"Robert","state":"CA","salary":81000,"age":30,"bonus":23000} |
|Finance |{"employee_name":"Maria","state":"CA","salary":90000,"age":24,"bonus":23000} |
|Finance |{"employee_name":"Raman","state":"CA","salary":99000,"age":40,"bonus":24000} |
|Finance |{"employee_name":"Scott","state":"NY","salary":83000,"age":36,"bonus":19000} |
|Finance |{"employee_name":"Jen","state":"NY","salary":79000,"age":53,"bonus":15000} |
|Marketing |{"employee_name":"Jeff","state":"CA","salary":80000,"age":25,"bonus":18000} |
|Marketing |{"employee_name":"Kumar","state":"NY","salary":91000,"age":50,"bonus":21000} |
+----------+------------------------------------------------------------------------------+
In this case I would have three files, sales.json, finance.json and marketing.json, each containing the respective message column data.
How would I append to the existing files and write only the message part of the dataframe?
Follow the steps below.
1. Filter the dataframe by department
val salesDf = empDf.filter(empDf("department") === "Sales")
val financeDf = empDf.filter(empDf("department") === "Finance")
val marketingDf = empDf.filter(empDf("department") === "Marketing")
2. Read the existing sales, finance and marketing files as dataframes
val existingSalesDf=spark.read.json("/path/of/existing/json/sales-file")
val existingFinanceDf=spark.read.json("/path/of/existing/json/finance-file")
val existingMarketDf=spark.read.json("/path/of/existing/json/marketing-file")
3. Perform a union of the sales, finance and marketing dataframes with their corresponding existing dataframes
val appendedSalesDf= salesDf.union(existingSalesDf)
val appendedFinanceDf=financeDf.union(existingFinanceDf)
val appendedMarketDf= marketingDf.union(existingMarketDf)
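4. Select only the message fields from each appended dataframe and write them back as JSON. A minimal sketch for the sales file (the output path is a placeholder; writing back to the exact path that was just read in the same job should be avoided, so a fresh location is assumed):
// Keep only the fields that make up the message payload and write them as JSON lines.
appendedSalesDf
  .select($"employee_name", $"state", $"salary", $"age", $"bonus")
  .write
  .mode("overwrite") // writing to a new location; repeat the same pattern for finance and marketing
  .json("/path/of/new/json/sales-file")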
Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For permanent employees, some attributes are available and some are not; the same holds for contract employees.
So I am looking for an efficient way to build the dataframe from only the selected columns, rather than transforming all columns and then selecting the ones I need.
Also, please advise whether this is the best way to extract values from a JSON string by key. As the attributes in the string are dynamic, I cannot build a StructType schema for it, so I am using good old get_json_object.
(Spark 2.4.5 now; will move to Spark 3 in the future.)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm.Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have two shapes of data, one for permanent employees and another for contractors, so you can define two schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType) // Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the two schemas to load the data into dataframes.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the result dataframe for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
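To apply the same idea directly to the asker's layout, where only employeeDetailsJson holds JSON, a schema covering just the detail attributes is enough. A rough sketch, assuming inputDF is the asker's input dataframe and using the contractor attributes from the question:
// Hypothetical schema for the attributes stored inside employeeDetailsJson only.
val detailsCont = new StructType().add("Vendor", StringType).add("HourlyRate", DoubleType)
val contFromDetails = inputDF
  .withColumn("details", from_json(col("employeeDetailsJson"), detailsCont))
  .select(col("employeeKey"), col("employeeTypeId"), col("loginDate"), col("details.*"))
contFromDetails.show()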
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into a Map[String, String] using the from_json function with schema map<string,string>. Then you can explode the map column and pivot to get the keys as columns.
Example:
val df1 = df.withColumn(
"employeeDetails",
from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
col("employeeKey"),
col("employeeTypeId"),
col("loginDate"),
explode("employeeDetails")
).groupBy("employeeKey", "employeeTypeId", "loginDate")
.pivot("key")
.agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
In Spark (Scala), after the application jar is submitted to Spark, is it possible for the jar to fetch many strings from a database table, convert each string to a Catalyst Expression, convert that expression to a UDF, use the UDF to filter rows in another DataFrame, and finally union the results of each UDF?
(The expressions need some or all columns of the DataFrame, but which columns are needed is unknown at the time the jar's code is written; the schema of the DataFrame is known at development time.)
An example:
expression 1: "id == 1"
expression 2: "name == \"andy\""
DataFrame:
row 1: id = 1, name = "red", age = null
row 2: id = 2, name = "andy", age = 20
row 3: id = 3, name = "juliet", age = 21
The final result should be the first two rows.
Note: it is not acceptable to first concatenate the two expressions with an or, because I need to track which expression produced each result row.
Edited: filter for each expression and union all the results.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test1.csv")
val args = Array("id == 1", "name == \"andy\"")

// Apply each expression separately and tag matching rows with the expression's index.
val dfs: Array[DataFrame] = args.zipWithIndex.map {
  case (filter, index) => df.filter(filter).withColumn("index", lit(index))
}
val resultDF = dfs.reduce(_ union _)
resultDF.show(false)
+---+----+----+-----+
|id |name|age |index|
+---+----+----+-----+
|1 |red |null|0 |
|2 |andy|20 |1 |
+---+----+----+-----+
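If a single pass over the DataFrame is preferred, the same per-expression tagging can be done with expr(), which parses each string into a Column. A sketch along the same lines (not part of the original answer, and untested against the asker's data):
import org.apache.spark.sql.functions._
// One conditional per expression: the expression's index when the row matches, null otherwise.
val matchCols = args.zipWithIndex.map { case (e, i) => when(expr(e), lit(i)) }
val taggedDF = df
  .withColumn("index", explode(array(matchCols: _*))) // one output row per expression entry
  .where(col("index").isNotNull) // keep only the expressions that actually matched
taggedDF.show(false)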
Original: why not just pass the string straight to filter?
val df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
val condition = "id == 1 or name == \"andy\""
df.filter(condition).show(false)
+---+----+----+
|id |name|age |
+---+----+----+
|1 |red |null|
|2 |andy|20 |
+---+----+----+
Is there something I have missed?
I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
| 123| ITA|1475600500|18.0|
| 123| ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+--------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]
+--------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])

val actualResult = df
  .map(r =>
    Array(
      Record(
        r.getAs[Int]("Value"),
        Map(
          r.getAs[String]("Country") ->
            Map(
              r.getAs[String]("Timestamp") -> BigDecimal(
                r.getAs[Double]("Sum").toString
              )
            )
        )
      )
    )
  )
The Timestamp values in the actualResult dataset don't get combined into the same Record row; instead they create two separate rows.
+----------------------------------------------------+
|value |
+----------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]
[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]
+----------------------------------------------------+
Using groupBy and collect_list, and creating a combined column with struct, I was able to get a single row, as shown in the output below.
val mycsv =
"""
|Value|Country|Timestamp|Sum
| 123|ITA|1475600500|18.0
| 123|ITA|1475600516|19.0
""".stripMargin('|').lines.toList.toDS()
val df: DataFrame = spark.read.option("header", true)
.option("sep", "|")
.option("inferSchema", true)
.csv(mycsv)
df.show
val df1 = df.
groupBy("Value","Country")
.agg( collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes")).drop("Country")
val json = df1.toJSON // you can save it to a file
json.show(false)
Result (first the input dataframe, then the two rows combined into a single JSON record):
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+
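If the exact nested-map shape from the question is wanted, map_from_entries (Spark 2.4+) can build the inner and outer maps from the collected structs. A sketch under that assumption, using the same column names:
import org.apache.spark.sql.functions._
// Inner map per (Value, Country): Timestamp -> Sum.
val perCountry = df.groupBy(col("Value"), col("Country"))
  .agg(map_from_entries(collect_list(struct(col("Timestamp").cast("string"), col("Sum")))).alias("tsMap"))
// Outer map per Value: Country -> inner map.
val nested = perCountry.groupBy(col("Value").alias("value"))
  .agg(map_from_entries(collect_list(struct(col("Country"), col("tsMap")))).alias("attributes"))
nested.toJSON.show(false)
// roughly: {"value":123.0,"attributes":{"ITA":{"1475600500":18.0,"1475600516":19.0}}}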
I have a dataframe (DF1) with two columns
+-------+------+
|words |value |
+-------+------+
|ABC |1.0 |
|XYZ |2.0 |
|DEF |3.0 |
|GHI |4.0 |
+-------+------+
and another dataframe (DF2) like this
+-----------------------------+
|string |
+-----------------------------+
|ABC DEF GHI |
|XYZ ABC DEF |
+-----------------------------+
I have to replace the individual string values in DF2 with their corresponding values in DF1. For example, after the operation I should get back this dataframe:
+-----------------------------+
|stringToDouble |
+-----------------------------+
|1.0 3.0 4.0 |
|2.0 1.0 3.0 |
+-----------------------------+
I have tried multiple ways but I cannot seem to figure out the solution.
def createCorpus(conversationCorpus: Dataset[Row], dataDictionary: Dataset[Row]): Unit = {
import spark.implicits._
def getIndex(word: String): Double = {
val idxRow = dataDictionary.selectExpr("index").where('words.like(word))
val idx = idxRow.toString
if (!idx.isEmpty) idx.trim.toDouble else 1.0
}
conversationCorpus.map { // Eclipse doesn't like this map here; it throws an error
r =>
def row = {
val arr = r.getString(0).toLowerCase.split(" ")
val arrList = ArrayBuffer[Double]()
arr.map {
str =>
val index = getIndex(str)
}
Row.fromSeq(arrList.toSeq)
}
row
}
}
Combining multiple dataframes to create new columns requires a join. Looking at your two dataframes, it seems we can join on the words column of df1 and the string column of df2, but the string column needs an explode first and a re-combination later (which can be done by giving a unique id to each row before the explode). monotonically_increasing_id gives a unique id to each row in df2, and the split function turns the string column into an array for the explode. Then you can join them; the remaining steps combine the exploded rows back into the original rows with a groupBy and an aggregation.
Finally, the collected array column can be turned into the desired string column with a UDF.
Long story short, the following solution should work for you:
import org.apache.spark.sql.functions._
def arrayToString = udf((array: Seq[Double])=> array.mkString(" "))
df2.withColumn("rowId", monotonically_increasing_id())
.withColumn("string", explode(split(col("string"), " ")))
.join(df1, col("string") === col("words"))
.groupBy("rowId")
.agg(collect_list("value").as("stringToDouble"))
.select(arrayToString(col("stringToDouble")).as("stringToDouble"))
which should give you
+--------------+
|stringToDouble|
+--------------+
|1.0 3.0 4.0 |
|2.0 1.0 3.0 |
+--------------+
I use Spark SQL to load data into a val like this:
val customers = sqlContext.sql("SELECT * FROM customers")
But I have a separate txt file that contains one column, CUST_ID, and 50,00 rows, i.e.:
CUST_ID
1
2
3
I want my customers val to contain all the customers in the customers table that are not in the txt file.
Using SQL I would do this with SELECT * FROM customers WHERE cust_id NOT IN ('1','2','3').
How can I do this using Spark?
I've read the text file and I can print its rows, but I'm not sure how to combine this with my SQL query:
scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3
You can import your text file as a dataframe and do a left outer join:
val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")
customers.join(custId,'id === 'custId,"leftOuter")
.where('custId.isNull)
.drop("custId")
.show()
+---+----+--------------+
| id|name| status|
+---+----+--------------+
| 3| FGA|never received|
+---+----+--------------+
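Applied back to the asker's setup, the text file itself can be read as a dataframe and the NOT IN expressed as an anti join. A sketch assuming Spark 2.x and that the customers table has a cust_id column matching the file's CUST_ID values:
// Read the id list; the header row becomes the column name.
val custIds = spark.read.option("header", "true").csv("cust_ids.txt")
// Keep only the customers whose id does not appear in the file.
val remaining = customers.join(custIds, customers("cust_id") === custIds("CUST_ID"), "left_anti")
remaining.show()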