How to iterate over a JSON object in a transaction for multiple inserts into a database using pg-promise

Apologies if my title isn't clear; I'll explain the question further here.
What I would like to do is to have multiple inserts based on a JSON array that I (backend) will be receiving from the frontend. The JSON object has the following data:
//Sample JSON
{
    // Some other data here to insert
    ...
    "quests": [
        {
            "player_id": [1, 2, 3],
            "task_id": [11, 12]
        },
        {
            "player_id": [4, 5, 6],
            "task_id": [13, 14, 15]
        }
    ]
}
Based on this JSON, this is my expected output after the backend processes it and inserts into the quests table:
//quests table (Output)
----------------------------
 id | player_id | task_id |
----------------------------
  1 |         1 |      11 |
  2 |         1 |      12 |
  3 |         2 |      11 |
  4 |         2 |      12 |
  5 |         3 |      11 |
  6 |         3 |      12 |
  7 |         4 |      13 |
  8 |         4 |      14 |
  9 |         4 |      15 |
 10 |         5 |      13 |
 11 |         5 |      14 |
 12 |         5 |      15 |
 13 |         6 |      13 |
 14 |         6 |      14 |
 15 |         6 |      15 |
// Not sure if it's useful info, but I will be using player_id as a join key later on.
-- My current progress --
What I currently have (and have tried) is doing multiple inserts by iterating over each JSON object.
//The previous JSON shape I accepted:
{
    "quests": [
        {
            "player_id": 1,
            "task_id": 11
        },
        {
            "player_id": 1,
            "task_id": 12
        },
        {
            "player_id": 6,
            "task_id": 15
        }
    ]
}
// My current backend code
db.tx(async t => {
    const q1 // some queries
    ....
    const q3 = await t.none(
        `INSERT INTO quests(player_id, task_id)
         SELECT player_id, task_id
         FROM json_to_recordset($1::json)
         AS x(player_id int, task_id int)`,
        [JSON.stringify(quests)]
    );
    return t.batch([q1, q2, q3]);
}).then(data => {
    // Success
}).catch(error => {
    // Fail
});
It works, but I think it's not good to have such a long request body, which is why I'm wondering if it's possible to run the iteration over the arrays inside the object instead.
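One possibility is to let Postgres do the iteration itself. As a hedged sketch (assuming the quests array is passed as $1 via JSON.stringify, and PostgreSQL 9.4+), each object's two arrays can be expanded and cross-joined in SQL:
INSERT INTO quests (player_id, task_id)
SELECT p.value::int, t.value::int
FROM json_array_elements($1::json) AS q(quest)              -- one row per quest object
CROSS JOIN json_array_elements_text(q.quest -> 'player_id') AS p  -- expand player_id array
CROSS JOIN json_array_elements_text(q.quest -> 'task_id')   AS t; -- expand task_id array
Each quest object then contributes one row per (player_id, task_id) pair, which matches the expected table above.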
If more information is needed, I'll edit this post again.
Thank you in advance!

Related

PostgreSQL - Fetch only a specified limit of related rows but with a number indicating the total count

Basically I have 2 models like so:
Comments:
+----+--------+----------+-----------------+
| id | userId | parentId | text |
+----+--------+----------+-----------------+
| 2 | 5 | 1 | Beautiful photo |
+----+--------+----------+-----------------+
| 3 | 2 | 2 | Thanks Jeff. |
+----+--------+----------+-----------------+
| 4 | 7 | 2 | Thank you, Jeff.|
+----+--------+----------+-----------------+
This table is designed to handle threads. Each parentId is a comment itself.
And CommentLikes:
+----+--------+-----------+
| id | userId | commentId |
+----+--------+-----------+
| 1 | 2 | 2 |
+----+--------+-----------+
| 2 | 7 | 2 |
+----+--------+-----------+
| 3 | 7 | 3 |
+----+--------+-----------+
What I'm trying to achieve is an SQL query that will perform the following (given the parameter parentId):
Get a limit of 10 replies that belong to parentId. With each reply, I need a count indicating the total number of replies to that reply and another count indicating the total number of likes given to that reply.
Sample input #1: /replies/1
Expected output:
[{
id: 2,
userId: 5,
parentId: 1,
text: 'Beautiful photo',
likeCount: 2,
replyCount: 2
}]
Sample input #2: /replies/2
Expected output:
[
{
id: 3,
userId: 2,
parentId: 2,
text: 'Thanks Jeff.',
replyCount: 0,
likeCount: 1
},
{
id: 4,
userId: 7,
parentId: 2,
text: 'Thank you, Jeff.',
replyCount: 0,
likeCount: 0
}
]
I'm trying to use Sequelize for my case, but it seems to only over-complicate things, so any raw SQL query will do.
Thank you in advance.
What about something like this?
SELECT *,
       (SELECT COUNT(*) FROM comment_likes WHERE comments.id = comment_likes."commentId") AS likecount,
       (SELECT COUNT(*) FROM comments AS c WHERE c."parentId" = comments.id) AS replycount
FROM comments
WHERE comments."parentId" = 2
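The question also asks for at most 10 replies at a time; a hedged variant of the same query with that limit applied (assuming replies should be returned oldest first):
SELECT *,
       (SELECT COUNT(*) FROM comment_likes WHERE comments.id = comment_likes."commentId") AS likecount,
       (SELECT COUNT(*) FROM comments AS c WHERE c."parentId" = comments.id) AS replycount
FROM comments
WHERE comments."parentId" = 2
ORDER BY comments.id
LIMIT 10;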

Replicating MongoDB $bucket with conditional sum in Postgres

I have a database with hundreds of thousands of rows with this schema:
+----+----------+---------+
| id | duration | type |
+----+----------+---------+
| 1 | 41 | cycling |
+----+----------+---------+
| 2 | 15 | walking |
+----+----------+---------+
| 3 | 6 | walking |
+----+----------+---------+
| 4 | 26 | running |
+----+----------+---------+
| 5 | 30 | cycling |
+----+----------+---------+
| 6 | 13 | running |
+----+----------+---------+
| 7 | 10 | running |
+----+----------+---------+
I was previously using a MongoDB aggregation to do this and get a distribution of activities by type and total count:
{
$bucket: {
groupBy: '$duration',
boundaries: [0, 16, 31, 61, 91, 121],
default: 121,
output: {
total: { $sum: 1 },
walking: {
$sum: { $cond: [{ $eq: ['$type', 'walking'] }, 1, 0] },
},
running: {
$sum: { $cond: [{ $eq: ['$type', 'running'] }, 1, 0] },
},
cycling: {
$sum: { $cond: [{ $eq: ['$type', 'cycling'] }, 1, 0] },
},
},
},
}
I have just transitioned to using Postgres and can't figure out how to do the conditional sums there. What would the query be to get a result table like this?
+---------------+---------+---------+---------+-------+
| duration_band | walking | running | cycling | total |
+---------------+---------+---------+---------+-------+
| 0-15 | 41 | 21 | 12 | 74 |
+---------------+---------+---------+---------+-------+
| 15-30 | 15 | 1 | 44 | 60 |
+---------------+---------+---------+---------+-------+
| 30-60 | 6 | 56 | 7 | 69 |
+---------------+---------+---------+---------+-------+
| 60-90 | 26 | 89 | 32 | 150 |
+---------------+---------+---------+---------+-------+
| 90-120 | 30 | 0 | 6 | 36 |
+---------------+---------+---------+---------+-------+
| 120+ | 13 | 90 | 0 | 103 |
+---------------+---------+---------+---------+-------+
| Total | 131 | 257 | 101 | 492 |
+---------------+---------+---------+---------+-------+
SQL is very good at retrieving data, making calculations on it, and delivering it, so getting the values you want is an easy task. It is not so good at formatting results; that task is typically left to the presentation layer. That said, it can be done, and in a single query. The difficulty is the pivot process - transforming rows into columns. But first, some setup. You should put the duration bands in their own table (if they are not already), with the addition of an identifier that allows multiple criteria sets (more on that later). I will proceed that way.
create table bands( name text, period int4range, title text );
insert into bands(name, period, title)
values ('Standard', '[ 0, 15)'::int4range , '0 - 15')
, ('Standard', '[ 15, 30)'::int4range , '15 - 30')
, ('Standard', '[ 30, 60)'::int4range , '30 - 60')
, ('Standard', '[ 60, 90)'::int4range , '60 - 90')
, ('Standard', '[ 90,120)'::int4range , '90 - 120')
, ('Standard', '[120,)'::int4range , '120+');
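A quick sanity check of how these int4range values behave (a hedged aside):
SELECT 14 <@ '[0,15)'::int4range;  -- true: 14 falls inside the first band
SELECT 15 <@ '[0,15)'::int4range;  -- false: the upper bound is exclusive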
This sets up your current criteria. The name column is the previously mentioned identifier, and the title column becomes the duration band in the output. The interesting column is period, defined as an integer range - in this case a [closed,open) range that includes the first number but not the second; yes, the brackets have meaning. That definition becomes the heart of the resulting query, which builds as follows:
1. Retrieve the desired interval set ([0,15) ...) and append a "totals" entry to it.
2. Define the list of activities (cycling, ...).
3. Combine these sets to pair each interval with each activity. This produces the activity intervals that become the matrix generated when pivoted.
4. Combine the "test" table values into the above list, calculating the total time for each activity within each interval. This is the workhorse of the query; it does ALL of the calculations. The result now contains the total activity for each cell of the matrix, but still in row orientation.
5. With the results calculated, pivot them from row orientation to column orientation.
6. Finally, compress the pivoted results into a single row for each interval and set the final interval ordering.
And the result is:
with buckets ( period , title, ord) as
( select period , title, row_number() over (order by lower(b.period)) ord ---- 1
from bands b
where name = 'Standard'
union all
select '[0,)','Total',count(*) + 1
from bands b
where name = 'Standard'
)
, activities (activity) as ( values ('running'),('walking'),('cycling'), ('Total')) ---- 2
, activity_buckets (period, title, ord, activity) as
(select * from buckets cross join activities) ---- 3
select s2.title "Duration Band" ---- 6
, max(cycling) "Cycling"
, max(running) "Running"
, max(walking) "Walking"
, max(Total) "Total "
from ( select s1.title, s1.ord
, case when s1.activity = 'cycling' then duration else null end cycling ---- 5
, case when s1.activity = 'running' then duration else null end running
, case when s1.activity = 'walking' then duration else null end walking
, case when s1.activity = 'Total' then duration else null end total
from ( select ab.ord, ab.title, ab.activity
, sum(coalesce(t.duration,0)) duration ---- 4
from activity_buckets ab
left join test t
on ( (t.type = ab.activity or ab.activity = 'Total')
and t.duration <@ ab.period --** determines which time interval(s) the value belongs to
)
group by ab.ord, ab.title, ab.activity
) s1
) s2
group by s2.ord,s2.title
order by s2.ord;
See demo. It contains each of the major steps along the way. Additionally, it shows how creating a table for the intervals can be put to use: since I dislike long queries, I generally hide them behind a SQL function and then just use the function. The demo contains this as well.
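For comparison, if the bands are fixed, plain conditional aggregation can reproduce the Mongo $cond sums directly. A minimal sketch, assuming the same test table and PostgreSQL 9.5+ (for the array form of width_bucket), and counting rows as the original $bucket stage did:
SELECT CASE width_bucket(duration, ARRAY[16, 31, 61, 91, 121])
         WHEN 0 THEN '0-15'
         WHEN 1 THEN '15-30'
         WHEN 2 THEN '30-60'
         WHEN 3 THEN '60-90'
         WHEN 4 THEN '90-120'
         ELSE '120+'
       END AS duration_band,
       count(*) FILTER (WHERE type = 'walking') AS walking,   -- conditional sum per type
       count(*) FILTER (WHERE type = 'running') AS running,
       count(*) FILTER (WHERE type = 'cycling') AS cycling,
       count(*) AS total
FROM test
GROUP BY 1
ORDER BY min(duration);
This trades the flexibility of the bands table for brevity; a grand-total row would need GROUP BY ROLLUP or a UNION ALL on top.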

How to UnPivot COLUMNS into ROWS in AWS Glue / Py Spark script

I have a large nested json document for each year (say 2018, 2017), which has aggregated data by each month (Jan-Dec) and each day (1-31).
{
"2018" : {
"Jan": {
"1": {
"u": 1,
"n": 2
},
"2": {
"u": 4,
"n": 7
}
},
"Feb": {
"1": {
"u": 3,
"n": 2
},
"4": {
"u": 4,
"n": 5
}
}
}
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
Which gives me table with columns of each json element as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, there will be a lot of columns in the table - one for each day of each month. I want to simplify the table by converting those columns into rows, to get the table below.
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
In my search I could not find the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from a column-based table? Can it be done in Athena?
I implemented a solution similar to the snippet below:
dataFrame = datasource0.toDF()
tableDataArray = [] ## to hold one output row per (year, month, dd)
valuesByDay = {} ## (year, month, dd) -> {'u': ..., 'n': ...}
for row in dataFrame.rdd.toLocalIterator():
    for colName in dataFrame.schema.names:
        value = row[colName]
        keyArray = colName.split('.') ## e.g. '2018.Jan.1.u' -> ['2018', 'Jan', '1', 'u']
        dayKey = (keyArray[0], keyArray[1], keyArray[2])
        valuesByDay.setdefault(dayKey, {})[keyArray[3]] = value
for (year, month, dd), metrics in valuesByDay.items():
    tableDataArray.append([year, month, dd, metrics.get('u'), metrics.get('n')])
unpivotDF = None
for rowDataArray in tableDataArray:
newRowDF = sc.parallelize([Row(year=rowDataArray[0],month=rowDataArray[1],dd=rowDataArray[2],u=rowDataArray[3],n=rowDataArray[4])]).toDF()
if unpivotDF is None:
unpivotDF = newRowDF
else :
unpivotDF = unpivotDF.union(newRowDF)
datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
In the above, newRowDF can also be created as below if data types have to be enforced:
columns = [StructField('year',StringType(), True),StructField('month', IntegerType(), ....]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
newRowDF = spark.createDataFrame([rowDataArray], schema)
Here are the steps to successfully unpivot your dataset using AWS Glue with PySpark.
We need to add an additional import statement to the existing boilerplate import statements:
from pyspark.sql.functions import expr
If our data is in a DynamicFrame, we need to convert it to a Spark DataFrame first, for example:
df_customer_sales = dyf_customer_sales.toDF()
Use the stack method to unpivot our dataset, based on how many columns we want to unpivot:
unpivotExpr = "stack(4, 'january', january, 'february', february, 'march', march, 'april', april) as (month, total_sales)"
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
Using an example dataset, our dataframe now holds one (month, total_sales) row per original month column, alongside item_type.
If my explanation is not clear, I made a YouTube tutorial walkthrough of the solution: https://youtu.be/Nf78KMhNc3M
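Applying the same stack idea back to the original year/month/day columns could look like the sketch below (Spark SQL; "flattened" is a hypothetical view registered over the relationalized output, and the dotted column names need backticks):
-- two rows out per SELECT: one per original day column group
SELECT stack(2,
             '2018', 'Jan', '1', `2018.Jan.1.u`, `2018.Jan.1.n`,
             '2018', 'Jan', '2', `2018.Jan.2.u`, `2018.Jan.2.n`)
       AS (year, month, dd, u, n)
FROM flattened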

Slick-pg: How to use arrayElementsText and overlap operator "?|"?

I'm trying to write the following query in scala using slick/slick-pg, but I don't have much experience with slick and can't figure out how:
SELECT *
FROM attributes a
WHERE a.other_id = 10
and ARRAY(SELECT jsonb_array_elements_text(a.value->'value'))
&& array['1','30','205'];
This is a simplified version of the attributes table, where the value field is a jsonb:
class Attributes(tag: Tag) extends Table[Attribute](tag, "ship_attributes") {
def id = column[Int]("id")
def other_id = column[Int]("other_id")
def value = column[Json]("value")
def * = (id, other_id, value) <> (Attribute.tupled, Attribute.unapply)
}
Sample data:
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": [1, 21]} |
| 2 | 10 | {"type": "IdList", "value": [5, 30]} |
| 3 | 10 | {"type": "IdList", "value": [7, 36]} |
This is my current query:
attributes
.filter(_.other_id === 10)
.filter { a =>
val innerQuery = attributes.map { _ =>
a.+>"value".arrayElementsText
}.to[List]
innerQuery #& List("1", "30", "205").bind
}
But it's complaining about the .to[List] conversion.
I've tried to create a SimpleFunction.unary[X, List[String]]("ARRAY"), but I don't know how to pass innerQuery to it (innerQuery is Query[Rep[String], String, Seq]).
Any ideas are very much appreciated.
UPDATE 1
While I can't figure this out, I changed the app to save the json field in the database as a list of strings instead of integers, to be able to do this simpler query:
attributes
.filter(_.other_id === 10)
.filter(_.+>"value" ?| List("1", "30", "205").bind)
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": ["1", "21"]} |
| 2 | 10 | {"type": "IdList", "value": ["5", "30"]} |
| 3 | 10 | {"type": "IdList", "value": ["7", "36"]} |
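For reference, the plain SQL this workaround maps to - a hedged sketch, assuming the value column is jsonb (a json column would need a cast) - uses ?|, which tests whether any of the given strings appear as top-level elements of the array:
SELECT *
FROM attributes a
WHERE a.other_id = 10
  AND a.value -> 'value' ?| ARRAY['1', '30', '205'];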

How to create DataFrame from fixed-length text file given field lengths?

I am reading a fixed-position file. The final result of the file is stored in a string, and I would like to convert that string into a DataFrame for further processing. Kindly help me with this. Below is my code:
Input data:
+---------+----------------------+
|PRGREFNBR|value                 |
+---------+----------------------+
|01       |11 apple     TRUE 0.56|
|02       |12 pear      FALSE1.34|
|03       |13 raspberry TRUE 2.43|
|04       |14 plum      TRUE 1.31|
|05       |15 cherry    TRUE 1.4 |
+---------+----------------------+
data position: "3,10,5,4"
expected result with default header in data frame:
+-----+-----+----------+-----+-----+
|SeqNo|col_0|     col_1|col_2|col_3|
+-----+-----+----------+-----+-----+
|   01|  11 |apple     |TRUE | 0.56|
|   02|  12 |pear      |FALSE| 1.34|
|   03|  13 |raspberry |TRUE | 2.43|
|   04|  14 |plum      |TRUE | 1.31|
|   05|  15 |cherry    |TRUE | 1.4 |
+-----+-----+----------+-----+-----+
Given the fixed-position file (say input.txt):
11 apple     TRUE 0.56
12 pear      FALSE1.34
13 raspberry TRUE 2.43
14 plum      TRUE 1.31
15 cherry    TRUE 1.4
and the length of every field in the input file as (say lengths):
3,10,5,4
you could create a DataFrame as follows:
// Read the text file as is
// and filter out empty lines
val lines = spark.read.textFile("input.txt").filter(!_.isEmpty)
// define a helper function to do the split per fixed lengths
// Home exercise: should be part of a case class that describes the schema
def parseLinePerFixedLengths(line: String, lengths: Seq[Int]): Seq[String] = {
lengths.indices.foldLeft((line, Array.empty[String])) { case ((rem, fields), idx) =>
val len = lengths(idx)
val fld = rem.take(len)
(rem.drop(len), fields :+ fld)
}._2
}
// Split the lines using parseLinePerFixedLengths method
val lengths = Seq(3,10,5,4)
val fields = lines.
map(parseLinePerFixedLengths(_, lengths)).
withColumnRenamed("value", "fields") // <-- it'd be unnecessary if a case class were used
scala> fields.show(truncate = false)
+------------------------------+
|fields |
+------------------------------+
|[11 , apple     , TRUE , 0.56]|
|[12 , pear      , FALSE, 1.34]|
|[13 , raspberry , TRUE , 2.43]|
|[14 , plum      , TRUE , 1.31]|
|[15 , cherry    , TRUE , 1.4 ]|
+------------------------------+
That's what you may have had already, so let's unroll/destructure the nested sequence of fields into columns.
val answer = lengths.indices.foldLeft(fields) { case (result, idx) =>
result.withColumn(s"col_$idx", $"fields".getItem(idx))
}
// drop the unnecessary/interim column
scala> answer.drop("fields").show
+-----+----------+-----+-----+
|col_0|     col_1|col_2|col_3|
+-----+----------+-----+-----+
|  11 |apple     |TRUE | 0.56|
|  12 |pear      |FALSE| 1.34|
|  13 |raspberry |TRUE | 2.43|
|  14 |plum      |TRUE | 1.31|
|  15 |cherry    |TRUE | 1.4 |
+-----+----------+-----+-----+
Done!