Tabular data from DB to MAP Data structure - scala

I am fetching a few values from the DB and want to create a nested map data structure out of them. The tabular data looks like this:
+---------+--------------+----------------+------------------+----------------+-----------------------+
| Cube_ID | Dimension_ID | Dimension_Name | Partition_Column | Display_name   | Dimension_Description |
+---------+--------------+----------------+------------------+----------------+-----------------------+
| 1       | 1            | Reporting_Date | Y                | Reporting_Date | Reporting_Date        |
| 1       | 2            | Platform       | N                | Platform       | Platform              |
| 1       | 3            | Country        | N                | Country        | Country               |
| 1       | 4            | OS_Version     | N                | OS_Version     | OS_Version            |
| 1       | 5            | Device_Version | N                | Device_Version | Device_Version        |
+---------+--------------+----------------+------------------+----------------+-----------------------+
I want to create a nested structure, something like this:
{
  CubeID = "1": {
    Dimension ID = "1": [
      {
        "Name": "Reporting_Date",
        "Partition_Column": "Y",
        "Display": "Reporting_Date"
      }
    ],
    Dimension ID = "2": [
      {
        "Name": "Platform",
        "Partition_Column": "N",
        "Display": "Platform"
      }
    ]
  },
  CubeID = "2": {
    Dimension ID = "1": [
      {
        "Name": "Reporting_Date",
        "Partition_Column": "Y",
        "Display": "Reporting_Date"
      }
    ],
    Dimension ID = "2": [
      {
        "Name": "Platform",
        "Partition_Column": "N",
        "Display": "Platform"
      }
    ]
  }
}
I have the result set from the DB using the following. I am able to populate individual columns, but I am not sure how to create a map for later computation:
while (rs.next()) {
  val Dimension_ID = rs.getInt("Dimension_ID")
  val Dimension_Name = rs.getString("Dimension_Name")
  val Partition_Column = rs.getString("Partition_Column")
  val Display_name = rs.getString("Display_name")
  val Dimension_Description = rs.getString("Dimension_Description")
}
I believe I should write a case class for this, but I am not sure how to create a case class and load values into it.
Thanks for the help. I can provide any other info needed; let me know.

Background
You can define the data classes as below:
case class Dimension(
  dimensionId: Long,
  name: String,
  partitionColumn: String,
  display: String
)

case class Record(
  cubeId: Int,
  dimension: Dimension
)

case class Data(records: List[Record])
And this is how you can construct the data:
val data =
  Data(
    List(
      Record(
        cubeId = 1,
        dimension = Dimension(
          dimensionId = 1,
          name = "Reporting_Date",
          partitionColumn = "Y",
          display = "Reporting_Date"
        )
      ),
      Record(
        cubeId = 2,
        dimension = Dimension(
          dimensionId = 1,
          name = "Platform",
          partitionColumn = "N",
          display = "Platform"
        )
      )
    )
  )
Now to your question: since you are using JDBC, you have to construct the list of records in a mutable way or use a Scala Iterator. Below is the mutable way to construct the above data classes, but you can explore more.
import scala.collection.mutable.ListBuffer

val mutableData = new ListBuffer[Record]()
while (rs.next()) {
  mutableData += Record(
    cubeId = rs.getInt("Cube_ID"),
    dimension = Dimension(
      dimensionId = rs.getInt("Dimension_ID"),
      name = rs.getString("Dimension_Name"),
      partitionColumn = rs.getString("Partition_Column"),
      display = rs.getString("Display_name")
    )
  )
}
val data = Data(records = mutableData.toList)
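If you then need the nested structure from your question, you can group the flat records by cube id and then by dimension id. A minimal sketch; the exact shape of the map is an assumption based on the structure you described:
// Group the flat records into Map[cubeId -> Map[dimensionId -> List[Dimension]]]
val nested: Map[Int, Map[Long, List[Dimension]]] =
  data.records
    .groupBy(_.cubeId)
    .map { case (cubeId, recs) =>
      cubeId -> recs
        .groupBy(_.dimension.dimensionId)
        .map { case (dimId, dimRecs) => dimId -> dimRecs.map(_.dimension) }
    }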
Also read - Any better way to convert SQL ResultSet to Scala List
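For reference, an Iterator-based variant of the loading step avoids the mutable buffer entirely. This is only a sketch, assuming the same ResultSet rs and the case classes defined above:
// Lazily pull rows from the ResultSet, then materialize the list once
val records: List[Record] =
  Iterator
    .continually(rs)
    .takeWhile(_.next())
    .map { r =>
      Record(
        cubeId = r.getInt("Cube_ID"),
        dimension = Dimension(
          dimensionId = r.getInt("Dimension_ID"),
          name = r.getString("Dimension_Name"),
          partitionColumn = r.getString("Partition_Column"),
          display = r.getString("Display_name")
        )
      )
    }
    .toList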

Related

Spark Create DF from json string and string scala

I have a JSON string and a separate string that I'd like to create a dataframe from.
val body = """{
  |  "time": "2020-07-01T17:17:15.0495314Z",
  |  "ver": "4.0",
  |  "name": "samplename",
  |  "iKey": "o:something",
  |  "random": {
  |    "stuff": {
  |      "eventFlags": 258,
  |      "num5": "DHM",
  |      "num2": "something",
  |      "flags": 415236612,
  |      "num1": "4004825",
  |      "seq": 44
  |    },
  |    "banana": {
  |      "id": "someid",
  |      "ver": "someversion",
  |      "asId": 123
  |    },
  |    "something": {
  |      "example": "somethinghere"
  |    },
  |    "apple": {
  |      "time": "2020-07-01T17:17:37.874Z",
  |      "flag": "something",
  |      "userAgent": "someUserAgent",
  |      "auth": 12,
  |      "quality": 0
  |    },
  |    "loc": {
  |      "country": "US"
  |    }
  |  },
  |  "EventEnqueuedUtcTime": "2020-07-01T17:17:59.804Z"
  |}""".stripMargin
val offset = "10"
I tried
val data = Seq(body, offset)
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
val df = data.toDF(columns:_*)
As well as
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
but for both I get this error: "value toDF is not a member of org.apache.spark.RDD[String]"
Is there a different way I can create a dataframe that will have one column with my json body data, and another column with my offset string value?
Edit: I've also tried the following:
val offset = "1000"
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
and get an error of column mismatch : "The number of columns doesn't match.
Old column names (1): value
New column names (2): body, offset"
I don't understand why my data has the column name "value".
I guess the issue is with your Seq syntax; the elements should be tuples. The code below worked for me:
val data = Seq((body, offset)) // <--- Check this line
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
data.toDF(columns:_*).printSchema()
/*
root
 |-- body: string (nullable = true)
 |-- offset: string (nullable = true)
*/
data.toDF(columns:_*).show()
/*
+--------------------+------+
|                body|offset|
+--------------------+------+
|{
    "time": "2020...|    10|
+--------------------+------+
*/
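For context, the earlier attempts fail because Seq(body, offset) is a Seq[String], so toDF produces a single column named value with two rows. With the elements as a tuple, the RDD route works as well. A minimal sketch, assuming sparkSession is your active SparkSession:
import sparkSession.implicits._

val rddDF = sparkSession.sparkContext
  .parallelize(Seq((body, offset))) // RDD[(String, String)] -> two columns
  .toDF("body", "offset")
rddDF.show(1, false)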

Spark dataframe Column content modification

I have a dataframe as shown below (output of df.show()):
+--------+--------+---------+---------+---------+
| Col11  | Col22  | Expend1 | Expend2 | Expend3 |
+--------+--------+---------+---------+---------+
| Value1 | value1 | 123     | 2264    | 56      |
| Value1 | value2 | 124     | 2255    | 23      |
+--------+--------+---------+---------+---------+
Can I transform the above data frame to the below using some SQL?
+--------+--------+-------------+--------------+------------+
| Col11  | Col22  | Expend1     | Expend2      | Expend3    |
+--------+--------+-------------+--------------+------------+
| Value1 | value1 | Expend1:123 | Expend2:2264 | Expend3:56 |
| Value1 | value2 | Expend1:124 | Expend2:2255 | Expend3:23 |
+--------+--------+-------------+--------------+------------+
You can use the idea of foldLeft here
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.sparkContext.parallelize(Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// List the columns you want to transform
val cols = List("Expend1", "Expend2", "Expend3")

val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}
newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
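Equivalently, the same transformation can be written as a single select instead of repeated withColumn calls. A minimal sketch, reusing the df and cols defined above:
// Build one list of output columns: the untouched ones plus the prefixed Expend columns
val selectCols = Seq(col("Col11"), col("Col22")) ++
  cols.map(c => concat(lit(c + ":"), col(c)).as(c))
val newDF2 = df.select(selectCols: _*)
newDF2.show()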
You can do that with a plain SQL select statement; if you want, you can use a UDF as well.
Ex -> SELECT Col11, Col22, concat('Expend1:', Expend1) AS Expend1, .... FROM table
val df = Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// Keep only the columns whose names do not start with the "Col" prefix
val cols = df.columns.filter(!_.startsWith("Col"))

val getCombineData = udf { (colName: String, colValue: String) => colName + ":" + colValue }

var in = df
for (e <- cols) {
  in = in.withColumn(e, getCombineData(lit(e), col(e)))
}
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
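For completeness, here is a minimal sketch of the pure Spark SQL variant of that select, assuming a SparkSession named spark, the df from the example above, and a temporary view name of my choosing:
// Register the dataframe as a temp view so it can be queried with SQL
df.createOrReplaceTempView("t")
spark.sql(
  """SELECT Col11, Col22,
    |       concat('Expend1:', Expend1) AS Expend1,
    |       concat('Expend2:', Expend2) AS Expend2,
    |       concat('Expend3:', Expend3) AS Expend3
    |FROM t""".stripMargin
).show()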

How to UnPivot COLUMNS into ROWS in AWS Glue / Py Spark script

I have a large nested json document for each year (say 2018, 2017), which has aggregated data by each month (Jan-Dec) and each day (1-31).
{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
This gives me a table with a column for each json leaf element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1            | 2            | 4            | 7            | 3            | 2            | 4            | 5            |
As you can see, there will be a lot of columns in the table, one set for each day of each month. I want to simplify the table by converting the columns into rows, to get the table below.
| year | month | dd | u | n |
| 2018 | Jan   | 1  | 1 | 2 |
| 2018 | Jan   | 2  | 4 | 7 |
| 2018 | Feb   | 1  | 3 | 2 |
| 2018 | Feb   | 4  | 4 | 5 |
With my search, I could not find the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from this column-based table? Can it be done in Athena?
I implemented a solution similar to the snippet below:
from pyspark.sql import Row

dataFrame = datasource0.toDF()
tableDataArray = []  # to hold the unpivoted rows as [year, month, dd, u, n]

# the flattened column names look like "2018.Jan.1.u";
# collect the distinct (year, month, day) prefixes
dayPrefixes = sorted({tuple(c.split('.')[:3]) for c in dataFrame.schema.names})

for row in dataFrame.rdd.toLocalIterator():
    for (year, month, dd) in dayPrefixes:
        u = row['.'.join((year, month, dd, 'u'))]
        n = row['.'.join((year, month, dd, 'n'))]
        tableDataArray.append([str(year), str(month), str(dd), str(u), str(n)])

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0], month=rowDataArray[1], dd=rowDataArray[2], u=rowDataArray[3], n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)

datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
In the above, newRowDF can also be created as below if data types have to be enforced:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

columns = [StructField('year', StringType(), True), StructField('month', IntegerType(), ....]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    newRowDF = spark.createDataFrame([rowDataArray], schema)  # wrap the row in a list
Here are the steps to successfully unpivot your dataset using AWS Glue with PySpark.
We need to add an additional import statement to the existing boilerplate import statements:
from pyspark.sql.functions import expr
If our data is in a DynamicFrame, we need to convert it to a Spark DataFrame, for example:
df_customer_sales = dyf_customer_sales.toDF()
Use the stack function to unpivot the dataset, based on how many columns we want to unpivot:
unpivotExpr = "stack(4, 'january', january, 'february', february, 'march', march, 'april', april) as (month, total_sales)"
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
So using an example dataset, the dataframe now contains the item_type column plus the unpivoted month and total_sales columns.
If my explanation is not clear, I made a YouTube tutorial walkthrough of the solution: https://youtu.be/Nf78KMhNc3M

Slick-pg: How to use arrayElementsText and overlap operator "?|"?

I'm trying to write the following query in Scala using Slick/slick-pg, but I don't have much experience with Slick and can't figure out how:
SELECT *
FROM attributes a
WHERE a.other_id = 10
and ARRAY(SELECT jsonb_array_elements_text(a.value->'value'))
&& array['1','30','205'];
This is a simplified version of the attributes table, where the value field is a jsonb:
class Attributes(tag: Tag) extends Table[Attribute](tag, "ship_attributes") {
  def id = column[Int]("id")
  def other_id = column[Int]("other_id")
  def value = column[Json]("value")
  def * = (id, other_id, value) <> (Attribute.tupled, Attribute.unapply)
}
Sample data:
| id | other_id | value                                |
|----|----------|--------------------------------------|
| 1  | 10       | {"type": "IdList", "value": [1, 21]} |
| 2  | 10       | {"type": "IdList", "value": [5, 30]} |
| 3  | 10       | {"type": "IdList", "value": [7, 36]} |
This is my current query:
attributes
  .filter(_.other_id === 10)
  .filter { a =>
    val innerQuery = attributes.map { _ =>
      a.+>"value".arrayElementsText
    }.to[List]
    innerQuery #& List("1", "30", "205").bind
  }
But it's complaining about the .to[List] conversion.
I've tried to create a SimpleFunction.unary[X, List[String]]("ARRAY"), but I don't know how to pass innerQuery to it (innerQuery is Query[Rep[String], String, Seq]).
Any ideas are very much appreciated.
UPDATE 1
While I can't figure this out, I changed the app to store the json field in the database as a list of strings instead of integers, so that I can use this simpler query:
attributes
  .filter(_.other_id === 10)
  .filter(_.+>"value" ?| List("1", "30", "205").bind)
| id | other_id | value                                    |
|----|----------|------------------------------------------|
| 1  | 10       | {"type": "IdList", "value": ["1", "21"]} |
| 2  | 10       | {"type": "IdList", "value": ["5", "30"]} |
| 3  | 10       | {"type": "IdList", "value": ["7", "36"]} |

Join and group by in LINQ

Here is the entity and table structure:
class Person { int PersonId; string PersonName; }
class Report { int ReportId; DateTime ReportTime; int PersonId; }
Table: Persons
+----------+------------+
| PersonId | PersonName |
+----------+------------+
| 1        | Abc        |
| 2        | Xyz        |
+----------+------------+
Table: Reports
+----------+------------------+----------+
| ReportId | ReportTime       | PersonId |
+----------+------------------+----------+
| 10       | 2017-02-27 11:12 | 1        |
| 14       | 2017-02-27 15:23 | 1        |
+----------+------------------+----------+
I want to select data as follows (PersonName from Persons, together with the latest ReportTime for that person from the Reports table):
+------------+------------------+
| PersonName | ReportTime       |
+------------+------------------+
| Abc        | 2017-02-27 15:23 |
+------------+------------------+
How can I do it with a lambda expression or a LINQ query?
Use Queryable.GroupJoin:
from p in db.Persons
join r in db.Reports on p.PersonId equals r.PersonId into g
where g.Any() // keep only persons that have at least one report
select new {
    p.PersonName,
    ReportTime = g.Max(r => r.ReportTime)
}
Lambda syntax (note that it will return a nullable ReportTime, with nulls for persons who don't have any reports):
db.Persons.GroupJoin(
db.Reports,
p => p.PersonId,
r => r.PersonId,
(p,g) => new { p.PersonName, ReportTime = g.Max(r => (DateTime?)r.ReportTime) })
Try this:
List<Person> people = new List<Person>
{
new Person {PersonId = 1, PersonName = "AB" },
new Person {PersonId = 2, PersonName = "CD" },
new Person {PersonId = 3, PersonName = "EF" },
};
List<Report> reports = new List<Report>()
{
new Report {ReportId = 1, ReportTime = DateTime.Now, PersonId = 1 },
new Report {ReportId = 2, ReportTime = DateTime.Now.AddHours(-1), PersonId = 1 },
new Report {ReportId = 3, ReportTime = DateTime.Now.AddHours(-2), PersonId = 1 },
new Report {ReportId = 4, ReportTime = DateTime.Now.AddMinutes(-3), PersonId = 2 },
new Report {ReportId = 5, ReportTime = DateTime.Now, PersonId = 2 }
};
var res = (from rep in reports
           group rep by rep.PersonId into repGrp
           join ppl in people on repGrp.Key equals ppl.PersonId
           select new
           {
               PersonName = ppl.PersonName,
               ReportDate = repGrp.Max(r => r.ReportTime)
           }).ToList();