I'm trying to write the following query in Scala using Slick/slick-pg, but I don't have much experience with Slick and can't figure out how:
SELECT *
FROM attributes a
WHERE a.other_id = 10
and ARRAY(SELECT jsonb_array_elements_text(a.value->'value'))
&& array['1','30','205'];
This is a simplified version of the attributes table, where the value field is a jsonb:
class Attributes(tag: Tag) extends Table[Attribute](tag, "ship_attributes") {
def id = column[Int]("id")
def other_id = column[Int]("other_id")
def value = column[Json]("value")
def * = (id, other_id, value) <> (Attribute.tupled, Attribute.unapply)
}
Sample data:
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": [1, 21]} |
| 2 | 10 | {"type": "IdList", "value": [5, 30]} |
| 3 | 10 | {"type": "IdList", "value": [7, 36]} |
This is my current query:
attributes
.filter(_.other_id === 10)
.filter { a =>
val innerQuery = attributes.map { _ =>
a.+>"value".arrayElementsText
}.to[List]
innerQuery #& List("1", "30", "205").bind
}
But it's complaining about the .to[List] conversion.
I've tried to create a SimpleFunction.unary[X, List[String]]("ARRAY"), but I don't know how to pass innerQuery to it (innerQuery is Query[Rep[String], String, Seq]).
Any ideas are very much appreciated.
UPDATE 1
While I can't figure this out, I changed the app to store the JSON field's list as strings instead of integers in the database, so that I can use this simpler query:
attributes
.filter(_.other_id === 10)
.filter(_.+>"value" ?| List("1", "30", "205").bind)
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": ["1", "21"]} |
| 2 | 10 | {"type": "IdList", "value": ["5", "30"]} |
| 3 | 10 | {"type": "IdList", "value": ["7", "36"]} |
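For the original integer-valued JSON, a hedged fallback (outside the lifted slick-pg API) is to run this one query through Slick's plain-SQL interpolator, keeping the ARRAY(...) && ... expression exactly as written. The sketch below returns plain tuples, because mapping the jsonb column back to Attribute depends on which Json type is in use; the helper name and the CSV-parameter trick are illustrative assumptions, not slick-pg API:
import slick.jdbc.PostgresProfile.api._

// Sketch only: bind the ids as one comma-separated text parameter and rebuild a text[] in SQL.
def overlappingAttributes(otherId: Int, ids: List[String]): DBIO[Seq[(Int, Int, String)]] = {
  val idCsv = ids.mkString(",")
  sql"""
    SELECT id, other_id, value::text
    FROM attributes a
    WHERE a.other_id = $otherId
      AND ARRAY(SELECT jsonb_array_elements_text(a.value -> 'value'))
          && string_to_array($idCsv, ',')
  """.as[(Int, Int, String)]
}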
Apologies if my title isn't clear; I'll explain the question further here.
What I would like to do is to have multiple inserts based on a JSON array that I (backend) will be receiving from the frontend. The JSON object has the following data:
//Sample JSON
{
  // Some other data here to insert
  ...
  "quests": [
    {
      "player_id": [1, 2, 3],
      "task_id": [11, 12]
    },
    {
      "player_id": [4, 5, 6],
      "task_id": [13, 14, 15]
    }
  ]
}
Based on this JSON, this is my expected output upon being inserted in Table quests and processed by the backend:
//quests table (Output)
----------------------------
id | player_id | task_id |
----------------------------
1 | 1 | 11 |
2 | 1 | 12 |
3 | 2 | 11 |
4 | 2 | 12 |
5 | 3 | 11 |
6 | 3 | 12 |
7 | 4 | 13 |
8 | 4 | 14 |
9 | 4 | 15 |
10| 5 | 13 |
11| 5 | 14 |
12| 5 | 15 |
13| 6 | 13 |
14| 6 | 14 |
15| 6 | 15 |
// Not sure if useful info, but I will be using the player_id as a join later on.
-- My current progress --
What I currently have (and have tried) is to do multiple inserts by iterating over each JSON object.
//The previous JSON response I accept:
{
  "quests": [
    {
      "player_id": 1,
      "task_id": 11
    },
    {
      "player_id": 1,
      "task_id": 12
    },
    {
      "player_id": 6,
      "task_id": 15
    }
  ]
}
// My current backend code
db.tx(async t => {
  const q1 // some queries
  ....
  const q3 = await t.none(
    `INSERT INTO quests (
       player_id, task_id)
     SELECT player_id, task_id FROM
     json_to_recordset($1::json)
     AS x(player_id int, task_id int)`, [
      JSON.stringify(quests)
    ]);
  return t.batch([q1, q2, q3]);
}).then(data => {
  // Success
}).catch(error => {
  // Fail
});
It works, but I don't think it's good to have such a long request body, which is why I'm wondering whether it's possible to iterate over the arrays inside each object instead.
If more information is needed, I'll edit this post again.
Thank you in advance!
I have a JSON string and another string that I'd like to create a DataFrame from.
val body = """{
| "time": "2020-07-01T17:17:15.0495314Z",
| "ver": "4.0",
| "name": "samplename",
| "iKey": "o:something",
| "random": {
| "stuff": {
| "eventFlags": 258,
| "num5": "DHM",
| "num2": "something",
| "flags": 415236612,
| "num1": "4004825",
| "seq": 44
| },
| "banana": {
| "id": "someid",
| "ver": "someversion",
| "asId": 123
| },
| "something": {
| "example": "somethinghere"
| },
| "apple": {
| "time": "2020-07-01T17:17:37.874Z",
| "flag": "something",
| "userAgent": "someUserAgent",
| "auth": 12,
| "quality": 0
| },
| "loc": {
| "country": "US"
| }
| },
| "EventEnqueuedUtcTime": "2020-07-01T17:17:59.804Z"
|}
|""".stripMargin
val offset = "10"
I tried
val data = Seq(body, offset)
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
val df = data.toDF(columns:_*)
As well as
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
but for both I get this error: "value toDF is not a member of org.apache.spark.RDD[String]"
Is there a different way I can create a dataframe that will have one column with my json body data, and another column with my offset string value?
Edit: I've also tried the following:
val offset = "1000"
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
and get an error of column mismatch : "The number of columns doesn't match.
Old column names (1): value
New column names (2): body, offset"
I don't understand why my data has the column name "value".
I guess the issue is with your Seq syntax: the elements should be tuples. As written, Seq(body, offset) is a Seq[String], which becomes a single-column DataFrame whose default column name is value (one row per string), so renaming it to two columns fails. The code below worked for me:
val data = Seq((body, offset)) // <--- Check this line
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
data.toDF(columns:_*).printSchema()
/*
 root
  |-- body: string (nullable = true)
  |-- offset: string (nullable = true)
*/
data.toDF(columns:_*).show()
/*
 +--------------------+------+
 |                body|offset|
 +--------------------+------+
 |{
   "time": "2020...|    10|
 +--------------------+------+
*/
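The RDD-based attempt from the question can be fixed the same way: parallelize a single tuple rather than two separate strings. A minimal sketch, assuming the same sparkSession and implicits import as above:
val rdd = sparkSession.sparkContext.parallelize(Seq((body, offset)))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)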
I have 2 DataFrames that I would like to join, and I want to filter the data so that rows where OrgTypeToExclude matches are excluded with respect to each TransactionId.
In a single phrase, TransactionId is my join condition and OrgTypeToExclude is the exclude condition. Sharing a simple example here:
import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")
df.printSchema()
json.printSchema()
df.join(json,$"code"<=> $"TransactionId".cast("string") && !expr("array_contains(OrgTypeToExclude, Alp)") ,"inner" ).show()
-- Expected output
id Code Alp
4 "USAL" "C"
2 "USMD" "B"
3 "USGA" "C"
Thanks,
Manoj.
Transactions is an array type, and you are accessing TransactionId and OrgTypeToExclude on it, so you will get arrays of values.
Instead, just explode the root-level Transactions array and extract the struct values, i.e. OrgTypeToExclude and TransactionId; the next steps will then be easy.
Please check the code below.
scala> val jsonstr ="""{
|
| "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
| "Transactions": [
| {
| "TransactionId": "USAL",
| "OrgTypeToExclude": ["A","B"]
| },
| {
| "TransactionId": "USMD",
| "OrgTypeToExclude": ["E"]
| },
| {
| "TransactionId": "USGA",
| "OrgTypeToExclude": []
| }
| ]
| }"""
jsonstr: String =
{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}
scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]
scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]
scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1 |USAL|A |
|4 |USAL|C |
|2 |USMD|B |
|5 |USMD|E |
|3 |USGA|C |
+---+----+---+
scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B] |USAL |
|[E] |USMD |
|[] |USGA |
+----------------+-------------+
scala> df.join(json,(df("code") === json("TransactionId") && !array_contains(json("OrgTypeToExclude"),df("Alp"))),"inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4 |USAL|C |
|2 |USMD|B |
|3 |USGA|C |
+---+----+---+
scala>
First, it looks like you overlooked the fact that Transactions is also an array, which we can use explode to deal with:
val json = spark.read.json(Seq(jsonstr).toDS)
.select(explode($"Transactions").as("t")) // deal with Transactions array first
.select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, array_contains wants a value rather than a column as its second argument. I'm not aware of a version that supports referencing a column, so we'll make a udf:
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
We can then modify the join condition like so:
df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
And the expected result:
scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
| 4|USAL| C| USAL| [A, B]|
| 2|USMD| B| USMD| [E]|
| 3|USGA| C| USGA| []|
+---+----+---+-------------+----------------+
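If you want exactly the columns shown in the expected output of the question, project the left-hand columns after the join:
df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner")
  .select("id", "code", "Alp")
  .show()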
I am fetching a few values from the DB and want to create a nested map data structure out of them. The tabular data looks like this:
+---------+--------------+----------------+------------------+----------------+-----------------------+
| Cube_ID | Dimension_ID | Dimension_Name | Partition_Column | Display_name | Dimension_Description |
+---------+--------------+----------------+------------------+----------------+-----------------------+
| 1 | 1 | Reporting_Date | Y | Reporting_Date | Reporting_Date |
| 1 | 2 | Platform | N | Platform | Platform |
| 1 | 3 | Country | N | Country | Country |
| 1 | 4 | OS_Version | N | OS_Version | OS_Version |
| 1 | 5 | Device_Version | N | Device_Version | Device_Version |
+---------+--------------+----------------+------------------+----------------+-----------------------+
I want to create a nested structure something like this
{
CubeID = "1": {
Dimension ID = "1": [
{
"Name": "Reporting_Date",
"Partition_Column": "Y"
"Display": "Reporting_Date"
}
]
Dimension ID = "2": [
{
"Name": "Platform",
"Column": "N"
"Display": "Platform"
}
]
},
CubeID = "2": {
Dimension ID = "1": [
{
"Name": "Reporting_Date",
"Partition_Column": "Y"
"Display": "Reporting_Date"
}
]
Dimension ID = "2": [
{
"Name": "Platform",
"Column": "N"
"Display": "Platform"
}
]
}
}
I have the result set from the DB using the following. I am able to populate the individual columns, but I'm not sure how to create a map for later computation:
while (rs.next()) {
val Dimension_ID = rs.getInt("Dimension_ID")
val Dimension_Name = rs.getString("Dimension_Name")
val Partition_Column = rs.getString("Partition_Column")
val Display_name = rs.getString("Display_name")
val Dimension_Description = rs.getString("Dimension_Description")
}
I believe I should write a case class for this, but I am not sure how to create the case class and load the values into it.
Thanks for the help. I can provide any other info needed; let me know.
You can define the data classes as below:
case class Dimension(
dimensionId: Long,
name: String,
partitionColumn: String,
display: String
)
case class Record(
cubeId: Int,
dimension: Dimension
)
case class Data(records: List[Record])
And this is how you can construct the data:
val data =
Data(
List(
Record(
cubeId = 1,
dimension = Dimension(
dimensionId = 1,
name = "Reporting_Date",
partitionColumn = "Y",
display = "Reporting_Date"
)
),
Record(
cubeId = 2,
dimension = Dimension(
dimensionId = 1,
name = "Platform",
partitionColumn = "N",
display = "Platform"
)
)
)
)
Now to your question: since you are using JDBC, you have to construct the list of records in a mutable way or use a Scala Iterator. Below I write the mutable way to construct the above data classes (a sketch of the Iterator approach follows it), but you can explore more.
import scala.collection.mutable.ListBuffer
var mutableData = new ListBuffer[Record]()
while (rs.next()) {
mutableData += Record(
cubeId = rs.getInt("Cube_ID"),
dimension = Dimension(
dimensionId = rs.getInt("Dimension_ID"),
name = rs.getString("Dimension_Name"),
partitionColumn = rs.getString("Partition_Column"),
display = rs.getString("Dimension_Description")
)
)
}
val data = Data(records = mutableData.toList)
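If you prefer to avoid the mutable ListBuffer, the Iterator approach mentioned above could look roughly like this (a sketch over the same rs and case classes):
val records: List[Record] =
  Iterator
    .continually(rs)
    .takeWhile(_.next())            // advance the cursor until the result set is exhausted
    .map { r =>
      Record(
        cubeId = r.getInt("Cube_ID"),
        dimension = Dimension(
          dimensionId = r.getLong("Dimension_ID"),
          name = r.getString("Dimension_Name"),
          partitionColumn = r.getString("Partition_Column"),
          display = r.getString("Dimension_Description")
        )
      )
    }
    .toList

val data = Data(records = records)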
Also read - Any better way to convert SQL ResultSet to Scala List
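Since the question asks for a nested map keyed by cube and dimension, the flat records can then be grouped into that shape, for example:
// cubeId -> (dimensionId -> Dimension)
val nested: Map[Int, Map[Long, Dimension]] =
  data.records
    .groupBy(_.cubeId)
    .map { case (cubeId, recs) =>
      cubeId -> recs.map(r => r.dimension.dimensionId -> r.dimension).toMap
    }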
I have a large nested JSON document for each year (say 2018, 2017), which has aggregated data for each month (Jan-Dec) and each day (1-31).
{
"2018" : {
"Jan": {
"1": {
"u": 1,
"n": 2
},
"2": {
"u": 4,
"n": 7
}
},
"Feb": {
"1": {
"u": 3,
"n": 2
},
"4": {
"u": 4,
"n": 5
}
}
}
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
which gives me a table with a column for each JSON element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, there will be a lot of columns in the table, one set for each day of each month. I want to simplify the table by converting the columns into rows, to get the table below:
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
From my search, I could not find the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from the column-based table? Can it be done in Athena?
I implemented a solution similar to the snippet below:
dataFrame = datasource0.toDF()
tableDataArray = [] ## to hold rows
rowArrayCount = 0
for row in dataFrame.rdd.toLocalIterator():
for colName in dataFrame.schema.names:
value = row[colName]
keyArray = colName.split('.')
rowDataArray = []
rowDataArray.insert(0,str(id))
rowDataArray.insert(1,str(keyArray[0]))
rowDataArray.insert(2,str(keyArray[1]))
rowDataArray.insert(3,str(keyArray[2]))
rowDataArray.insert(4,str(keyArray[3]))
tableDataArray.insert(rowArrayCount,rowDataArray)
rowArrayCount += 1
unpivotDF = None
for rowDataArray in tableDataArray:
newRowDF = sc.parallelize([Row(year=rowDataArray[0],month=rowDataArray[1],dd=rowDataArray[2],u=rowDataArray[3],n=rowDataArray[4])]).toDF()
if unpivotDF is None:
unpivotDF = newRowDF
else :
unpivotDF = unpivotDF.union(newRowDF)
datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
In the above, newRowDF can also be created as below if data types have to be enforced:
columns = [StructField('year',StringType(), True),StructField('month', IntegerType(), ....]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
newRowDF = spark.createDataFrame(rowDataArray, schema)
Here are the steps to successfully unpivot your dataset using AWS Glue with PySpark.
We need to add an additional import statement to the existing boilerplate import statements:
from pyspark.sql.functions import expr
If our data is in a DynamicFrame, we need to convert it to a Spark DataFrame, for example:
df_customer_sales = dyf_customer_sales.toDF()
Use the stack method to unpivot our dataset based on how many columns we want to unpivot:
unpivotExpr = "stack(4, 'january', january, 'febuary', febuary, 'march', march, 'april', april) as (month, total_sales)"
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
So, using an example dataset, our DataFrame now has the item_type column plus the unpivoted month and total_sales columns.
If my explanation is not clear, I made a YouTube tutorial walkthrough of the solution: https://youtu.be/Nf78KMhNc3M