Merge two lines into one using ID in Power BI - mongodb

I'm using Power BI to visualize data saved in a MongoDB database.
My records look like this:
{
  "_id": 0,
  "code_zone": "ABCD",
  "type_zone": "Beautiful",
  "all_coordinates": [{
    "type": "Feature",
    "geometry": {
      "type": "Point",
      "one_coordinates": [10.11, 40.44]
    },
    "properties": {
      "limite_vertical_min": "L0",
      "limite_vertical_max": "L100"
    }
  }]
}
When I import the data into Power BI, it splits my records into 3 "tables":
my_collection
my_collection.all_coordinates
my_collection.all_coordinates.one_coordinates
Because I didn't know how to fix this, I selected these 3 tables and linked them using the id.
Currently, I can visualize this:
_id | code_zone | index_all_coordinates | index_one_coordinate | value
----------------------------------------------------------------------
id0 | ABCD | 1 | 0 | 10.11
----------------------------------------------------------------------
id0 | ABCD | 1 | 1 | 40.44
I'm expecting to get this:
_id | code_zone | index_all_coordinates | value_x | value_y
------------------------------------------------------------
id0 | ABCD | 1 | 10.11 | 40.44
------------------------------------------------------------
Is this the right approach, or do I have to refactor my data before importing it into Power BI?
How can I merge these two lines into one in Power BI?

To get from the first table to the second, you can pivot on the index_one_coordinate column (Pivot Column in the Power Query Editor, with value as the values column) and then rename the resulting 0 and 1 columns to value_x and value_y.
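If it helps to sanity-check that reshape outside Power BI first, here is a minimal pandas sketch of the same transformation, using a DataFrame that simply mirrors the sample rows above:
import pandas as pd

# the flattened rows as Power BI sees them (sample values from the question)
df = pd.DataFrame({
    "_id": ["id0", "id0"],
    "code_zone": ["ABCD", "ABCD"],
    "index_all_coordinates": [1, 1],
    "index_one_coordinate": [0, 1],
    "value": [10.11, 40.44],
})

# pivot the index_one_coordinate values (0/1) into columns, then rename them
wide = (df.pivot_table(index=["_id", "code_zone", "index_all_coordinates"],
                       columns="index_one_coordinate", values="value")
          .rename(columns={0: "value_x", 1: "value_y"})
          .reset_index())
print(wide)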

Related

Replicating MongoDB $bucket with conditional sum in Postgres

I have a database with hundreds of thousands of rows with this schema:
+----+----------+---------+
| id | duration | type |
+----+----------+---------+
| 1 | 41 | cycling |
+----+----------+---------+
| 2 | 15 | walking |
+----+----------+---------+
| 3 | 6 | walking |
+----+----------+---------+
| 4 | 26 | running |
+----+----------+---------+
| 5 | 30 | cycling |
+----+----------+---------+
| 6 | 13 | running |
+----+----------+---------+
| 7 | 10 | running |
+----+----------+---------+
I was previously using a MongoDB aggregation to do this and get a distribution of activities by type and total count:
{
  $bucket: {
    groupBy: '$duration',
    boundaries: [0, 16, 31, 61, 91, 121],
    default: 121,
    output: {
      total: { $sum: 1 },
      walking: {
        $sum: { $cond: [{ $eq: ['$type', 'walking'] }, 1, 0] },
      },
      running: {
        $sum: { $cond: [{ $eq: ['$type', 'running'] }, 1, 0] },
      },
      cycling: {
        $sum: { $cond: [{ $eq: ['$type', 'cycling'] }, 1, 0] },
      },
    },
  },
}
I have just transitioned to using Postgres and can't figure out how to do the conditional sums there. What would the query be to get a result table like this?
+---------------+---------+---------+---------+-------+
| duration_band | walking | running | cycling | total |
+---------------+---------+---------+---------+-------+
| 0-15 | 41 | 21 | 12 | 74 |
+---------------+---------+---------+---------+-------+
| 15-30 | 15 | 1 | 44 | 60 |
+---------------+---------+---------+---------+-------+
| 30-60 | 6 | 56 | 7 | 69 |
+---------------+---------+---------+---------+-------+
| 60-90 | 26 | 89 | 32 | 150 |
+---------------+---------+---------+---------+-------+
| 90-120 | 30 | 0 | 6 | 36 |
+---------------+---------+---------+---------+-------+
| 120+ | 13 | 90 | 0 | 103 |
+---------------+---------+---------+---------+-------+
| Total | 131 | 257 | 101 | 492 |
+---------------+---------+---------+---------+-------+
SQL is very good at retrieving data, making calculations on it, and delivering it, so getting the values you want is an easy task. It is not so good at formatting results; that is why that task is typically left to the presentation layer. That said, it does not mean it cannot be done - it can, and in a single query. The difficulty is the pivot process - transforming rows into columns. But first some setup. You should put the duration-band data in its own table (if it is not already), with the addition of an identifier, which then allows multiple criteria sets (more on that later). I will proceed that way.
create table bands( name text, period int4range, title text );
insert into bands(name, period, title)
values ('Standard', '[ 0, 15)'::int4range , '0 - 15')
, ('Standard', '[ 15, 30)'::int4range , '15 - 30')
, ('Standard', '[ 30, 60)'::int4range , '30 - 60')
, ('Standard', '[ 60, 90)'::int4range , '60 - 90')
, ('Standard', '[ 90,120)'::int4range , '90 - 120')
, ('Standard', '[120,)'::int4range , '120+');
This sets up your current criteria. The name column is the identifier mentioned earlier, and the title column becomes the duration band in the output. The interesting column is period, defined as an integer range - in this case a [closed,open) range that includes the first number but not the second (yes, the brackets have meaning). That definition becomes the heart of the resulting query. The query builds as follows:
1. Retrieve the desired interval set ([0,15) ...) and append a "Totals" entry to it.
2. Define the list of activities (cycling, ...).
3. Combine these sets to create a list pairing each interval with each activity. This gives the activity intervals which become the matrix generated when pivoted.
4. Combine the "test" table values into the above list, calculating the total time for each activity within each interval. This is the workhorse of the query; it does ALL of the calculations. The result now contains the intervals plus the total activity for each cell in the matrix, but it still exists in row orientation.
5. With the results calculated, pivot them from row orientation to column orientation.
6. Finally, compress the pivoted results into a single row for each interval and set the final interval ordering.
And the result is:
with buckets (period, title, ord) as
     ( select period, title, row_number() over (order by lower(b.period)) ord        ---- 1
         from bands b
        where name = 'Standard'
       union all
       select '[0,)', 'Total', count(*) + 1
         from bands b
        where name = 'Standard'
     )
   , activities (activity) as ( values ('running'), ('walking'), ('cycling'), ('Total') )   ---- 2
   , activity_buckets (period, title, ord, activity) as
     ( select * from buckets cross join activities )                                  ---- 3
select s2.title  "Duration Band"                                                      ---- 6
     , max(cycling) "Cycling"
     , max(running) "Running"
     , max(walking) "Walking"
     , max(total)   "Total"
  from ( select s1.title, s1.ord
              , case when s1.activity = 'cycling' then duration else null end cycling      ---- 5
              , case when s1.activity = 'running' then duration else null end running
              , case when s1.activity = 'walking' then duration else null end walking
              , case when s1.activity = 'Total'   then duration else null end total
           from ( select ab.ord, ab.title, ab.activity
                       , sum(coalesce(t.duration, 0)) duration                        ---- 4
                    from activity_buckets ab
                    left join test t
                      on (   (t.type = ab.activity or ab.activity = 'Total')
                         and t.duration <@ ab.period  --** determines which time interval(s) the value belongs to
                         )
                   group by ab.ord, ab.title, ab.activity
                ) s1
       ) s2
 group by s2.ord, s2.title
 order by s2.ord;
See the demo. It contains each of the major steps along the way, and additionally shows how creating a table for the intervals can be put to use. Since I dislike long queries, I generally hide them behind a SQL function and then just use the function; the demo contains this as well.

How to UnPivot COLUMNS into ROWS in AWS Glue / Py Spark script

I have a large nested JSON document for each year (say 2018, 2017), which has data aggregated by month (Jan-Dec) and day (1-31).
{
  "2018": {
    "Jan": {
      "1": {
        "u": 1,
        "n": 2
      },
      "2": {
        "u": 4,
        "n": 7
      }
    },
    "Feb": {
      "1": {
        "u": 3,
        "n": 2
      },
      "4": {
        "u": 4,
        "n": 5
      }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
This gives me a table with a column for each JSON element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, there will be a lot of columns in the table, one per day of each month. I want to simplify the table by converting the columns into rows, as in the table below.
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
My searching did not turn up the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from the column-based table? Can it be done in Athena?
I implemented a solution similar to the snippet below:
dataFrame = datasource0.toDF()
tableDataArray = []  ## to hold rows
rowArrayCount = 0
for row in dataFrame.rdd.toLocalIterator():
    for colName in dataFrame.schema.names:
        if not colName.endswith('.u'):
            continue                              # handle each day once, via its '.u' column
        keyArray = colName.split('.')             # e.g. ['2018', 'Jan', '1', 'u']
        u_value = row[colName]
        n_value = row[colName[:-2] + '.n']        # the sibling '.n' column for the same day
        rowDataArray = []
        rowDataArray.insert(0, str(keyArray[0]))  # year
        rowDataArray.insert(1, str(keyArray[1]))  # month
        rowDataArray.insert(2, str(keyArray[2]))  # day
        rowDataArray.insert(3, str(u_value))      # u
        rowDataArray.insert(4, str(n_value))      # n
        tableDataArray.insert(rowArrayCount, rowDataArray)
        rowArrayCount += 1
from pyspark.sql import Row

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0], month=rowDataArray[1], dd=rowDataArray[2], u=rowDataArray[3], n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)
datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
In the above, newRowDF can also be created as below if data types have to be enforced:
columns = [StructField('year', StringType(), True), StructField('month', IntegerType(), ....]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    newRowDF = spark.createDataFrame([rowDataArray], schema)
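As a side note, once tableDataArray has been collected as above, the row-by-row union can be avoided by building the DataFrame in a single call; a minimal sketch, assuming the five columns are kept as strings:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('year', StringType(), True),
    StructField('month', StringType(), True),
    StructField('dd', StringType(), True),
    StructField('u', StringType(), True),
    StructField('n', StringType(), True),
])
# one createDataFrame call over all collected rows instead of a union per row
unpivotDF = spark.createDataFrame(tableDataArray, schema)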
Here are the steps to successfully unpivot your dataset using AWS Glue with PySpark.
We need to add an additional import statement to the existing boilerplate imports:
from pyspark.sql.functions import expr
If our data is in a DynamicFrame, we need to convert it to a Spark DataFrame first, for example:
df_customer_sales = dyf_customer_sales.toDF()
Use the stack method to unpivot our dataset, based on how many columns we want to unpivot:
unpivotExpr = "stack(4, 'january', january, 'february', february, 'march', march, 'april', april) as (month, total_sales)"
unPivotDF = df_customer_sales.select('item_type', expr(unpivotExpr))
So, using an example dataset, our DataFrame now contains the item_type, month, and total_sales columns.
If my explanation is not clear, I made a YouTube tutorial walkthrough of the solution: https://youtu.be/Nf78KMhNc3M

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use Cubes to be able to display data with pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
    {
        "name": "sales",
        "dimensions": ["product", "store"],
        "joins": [
            {"master": "product_id", "detail": "product.id"},
            {"master": "store_id", "detail": "store.id"}
        ]
    }
],
"dimensions": [
    { "name": "product", "attributes": ["code", "name"] },
    { "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store', 'Product'],
                           order=[("Product.name", "asc"), ("Store.name", "desc"), ("total_products_sale", "desc")])
I didn't get what I wanted. I got this instead:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products that were not sold in a store do not show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
     , COALESCE(s1, 0) AS s1  -- 1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (
       SELECT product_id, store_id
            , sum(amount) AS sum_amount  -- 3. SUM total of product spent in store
       FROM   sales
       GROUP  BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id'
 , 'VALUES (1),(2),(3),(4),(5)'  -- desired store_id's
   ) AS ct (product_id int, product text  -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC;  -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table

Optimizing MongoDB indexing multiple field with multiple query

I am new to database indexing. My application has the following "find" and "update" queries, searched by single and multiple fields
                      | reference | timestamp | phone | username | key | address
update                |     x     |           |       |          |     |
findOne               |           |     x     |   x   |          |     |
find/limit:16         |           |     x     |   x   |    x     |     |
find/limit:11         |           |     x     |       |          |  x  |    x
find/limit:1/sort:-1  |           |     x     |   x   |          |  x  |    x
find                  |           |     x     |       |          |     |
1)update({"reference":"f0d3dba-278de4a-79a6cb-1284a5a85cde"}, ……….
2)findOne({"timestamp":"1466595571", "phone":"9112345678900"})
3)find({"timestamp":"1466595571", "phone":"9112345678900", "username":"a0001a"}).limit(16)
4)find({"timestamp":"1466595571", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).limit(11)
5)find({"timestamp":"1466595571", "phone":"9112345678900", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).sort({"_id":-1}).limit(1)
6)find({"timestamp":"1466595571"})
I am creating these indexes:
db.coll.createIndex( { "reference": 1 } ) //for 1st, 6th query
db.coll.createIndex( { "timestamp": 1, "phone": 1, "username": 1 } ) //for 2nd, 3rd query
db.coll.createIndex( { "timestamp": 1, "key": 1, "address": 1, phone: 1 } ) //for 4th, 5th query
Is this the correct way?
Please help me
Thank you
I think what you have done looks fine. One way to check if your query is using an index, which index is being used, and whether the index is effective is to use the explain() function alongside your find().
For example:
db.coll.find({"timestamp":"1466595571"}).explain()
will return a JSON document which details what index (if any) was used. In addition to this, you can specify that explain return "executionStats",
e.g.
db.coll.find({"timestamp":"1466595571"}).explain("executionStats")
This will tell you how many index keys were examined to find the result set as well as the execution time and other useful metrics.
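The same check can be run from application code; a minimal PyMongo sketch, assuming a local MongoDB and a hypothetical database/collection named mydb.coll:
from pymongo import MongoClient

coll = MongoClient().mydb.coll          # hypothetical database and collection names
plan = coll.find({"timestamp": "1466595571"}).explain()

# winningPlan shows whether an IXSCAN (index scan) or a COLLSCAN was chosen
print(plan["queryPlanner"]["winningPlan"])
# executionStats (when present) reports keys/docs examined and execution time
print(plan.get("executionStats", {}).get("totalKeysExamined"))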

Mongo - find with multiple

Given that I have this data in my Mongo collection:
product_id | original_id | text
1 | "A00149" | "1280 x 1024"
1 | "A00373" | "Black"
2 | "A00149" | "1280 x 1024"
2 | "A00373" | "White"
3 | "A00149" | "1980 x 1200"
3 | "A00373" | "Black"
(I have added quotes around the values by hand - these are not in the real collection.)
With the following query, I'm getting 0 results, though I was expecting 1:
product_id = 1 should match the query.
Can somebody explain to me what I'm doing wrong?
In SQL, the WHERE clause would look like this:
WHERE
    (original_id = "A00149" AND text = "1280 x 1024")
    AND
    (original_id = "A00373" AND text = "Black")
And the Mongo query:
db.Filter.find({
    "find": true,
    "query": {
        "$and": [
            {
                "original_id": "A00149",
                "text": "1280 x 1024"
            },
            {
                "original_id": "A00373",
                "text": "Black"
            }
        ]
    },
    "fields": {
        "product_id": 1
    }
});
If your collection is called 'Filter' and you want a query to return the document with product_id = 1, then it's simple:
db.Filter.find({"product_id" : 1})
Maybe I misunderstood your question, though?
Edit:
try:
db.Filter.find(
    { $and: [
        { "original_id": "A00149", "text": "1280 x 1024" },
        { "original_id": "A00373", "text": "Black" }
    ]},
    { "product_id": 1 }
)
see http://docs.mongodb.org/manual/reference/operator/query/and/#op._S_and
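For reference, the same filter and projection can be expressed through PyMongo; a minimal sketch, assuming a hypothetical database named mydb holding the Filter collection:
from pymongo import MongoClient

coll = MongoClient().mydb.Filter        # hypothetical database name
cursor = coll.find(
    {"$and": [
        {"original_id": "A00149", "text": "1280 x 1024"},
        {"original_id": "A00373", "text": "Black"},
    ]},
    {"product_id": 1},                  # projection: return only product_id (and _id)
)
for doc in cursor:
    print(doc)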