HiveQL - Create chains of group IDs

Let's say I have a table with 3 columns each containing a group ID like so:
I want the group ID column to be populated with an ID that links rows together based on the other 3 groups.
So looking at group 1 which I've populated manually,
rows 2 and 3 are contained in this group because they have the same A group
rows 2, 7 and 10 are contained in this group because they have the same C group
rows 6 and 7 are contained in this group because they have the same B group
All those rows are in the same group despite there being no direct link between some rows (e.g. row 2 and 6)
I typically have a simple solution for this in SQL where I simply do the following loop:
CLUSTER:
UPDATE A
SET A.GroupID = B.GroupID
FROM #TEMP A
JOIN #TEMP B ON B.GroupA = A.GroupA
OR B.GroupB = A.GroupB
OR B.GroupC = A.GroupC
WHERE A.GroupID>B.GroupID
IF @@ROWCOUNT > 0 GOTO CLUSTER
but obviously I can't do that in Hive, as you can't loop.
I searched online and found a similar question on Stack Overflow, but unfortunately the solution is a link to another question which has since been deleted (and that person only uses 2 group columns, whilst I have 3):
SQL Server : chain grouping of columns
Any help would be highly appreciated.
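For reference, the loop above amounts to finding connected components: two rows belong together if they share a value in any of the three group columns, directly or through a chain of other rows. Below is a minimal Python sketch of that idea using union-find, run outside Hive. The function name assign_group_ids and the sample rows are hypothetical (the original table is not shown here), and whether a script like this fits a given Hive pipeline is an assumption.

```python
def assign_group_ids(rows):
    """rows: list of (group_a, group_b, group_c) tuples.
    Returns one group ID per row; rows linked through any shared
    column value (directly or via a chain) get the same ID."""
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller row index as root, mirroring the
            # "lowest GroupID wins" rule of the UPDATE loop.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (column index, value) -> first row holding it
    for idx, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is None:
                continue  # a missing value links nothing
            key = (col, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Number the components 1, 2, 3, ... in order of first appearance.
    ids = {}
    result = []
    for idx in range(len(rows)):
        root = find(idx)
        ids.setdefault(root, len(ids) + 1)
        result.append(ids[root])
    return result


# Hypothetical sample data, not the table from the question.
sample_rows = [
    ("a1", "b1", "c1"),
    ("a1", "b2", "c2"),  # shares A with row 0
    ("a2", "b3", "c2"),  # shares C with row 1
    ("a3", "b4", "c3"),
    ("a4", "b3", "c4"),  # shares B with row 2
    ("a5", "b4", "c5"),  # shares B with row 3
]
group_ids = assign_group_ids(sample_rows)
print(group_ids)  # [1, 1, 1, 2, 1, 2]
```

Rows 0 and 4 end up in the same group even though they share no value directly, which is the chaining behaviour described above.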

Related

PowerBI: Appending tables with calculated columns

I have loaded 4 tables into Power BI: 1) Exchange Rates 2) Ledger Tree 3) Actuals results 4) Forecast.
Tables 3 and 4 had different columns or information (e.g. table 3 showed values by ledger in USD and table 4 in local currencies; table 3 does not show the ledger currency while table 4 does), so I had to add several calculated columns to both (including looking up values from tables 1 and 2) to make them look alike.
I would now like to stack tables 3 and 4 on top of one another, but I am not sure how to do it.
Any help is much appreciated
In Power Query, use Append Queries to union tables.
When selecting this, you can append to the existing query or create a new table from the appended tables. If you do append, you can select the old tables that are no longer used and turn off 'Enable Load' so the reference tables aren't loaded into Power Pivot.
In DAX, you can union tables using the UNION function.

SQL: How to prevent double summing

I'm not exactly sure what the term is for this but, when you have a many-to-many relationship when joining 2 tables and you want to sum up one of the variables, I believe that you can sum the same values over and over again.
What I want to accomplish is to prevent this from happening. How do I make sure that my sum function is returning the correct number?
I'm using PostgreSQL
Example:
Table 1                 Table 2
SampleID  DummyName     SampleID  DummyItem
1         John          1         5
1         John          1         4
2         Doe           1         5
3         Jake          2         3
3         Jake          2         3
                        3         2
If I join these two tables ON SampleID, and I want to sum the DummyItem for each DummyName, how can I do this without double summing?
The solution is to first aggregate and then do the join:
select t1.sampleid, t1.dummyname, t.total_items
from table_1 t1
join (
select t2.sampleid, sum(dummyitem) as total_items
from table_2 t2
group by t2.sampleid
) t ON t.sampleid = t1.sampleid;
The real question, however, is: why are there duplicates in table_1?
I would take a step back and try to assess the database design. Specifically, what rules allow such duplicate data?
To address your specific issue given your data, here's one option: create a temp table that contains unique rows from Table 1, then join the temp table with Table 2 to get the sums I think you are expecting.
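To make the difference concrete, here is a small sketch that loads the sample data from the question and compares a naive join-then-sum with an aggregate-first query over de-duplicated names. It uses Python's built-in sqlite3 module purely so the example runs anywhere; the question is about PostgreSQL, and the DISTINCT subquery stands in for the temp table suggested above, so treat it as an illustration rather than the exact solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (sampleid INTEGER, dummyname TEXT);
    CREATE TABLE table_2 (sampleid INTEGER, dummyitem INTEGER);
    INSERT INTO table_1 VALUES
        (1, 'John'), (1, 'John'), (2, 'Doe'), (3, 'Jake'), (3, 'Jake');
    INSERT INTO table_2 VALUES
        (1, 5), (1, 4), (1, 5), (2, 3), (2, 3), (3, 2);
""")

# Naive: join first, then sum. Duplicate rows in table_1 multiply the totals.
naive = conn.execute("""
    SELECT t1.dummyname, SUM(t2.dummyitem)
    FROM table_1 t1
    JOIN table_2 t2 ON t2.sampleid = t1.sampleid
    GROUP BY t1.sampleid, t1.dummyname
    ORDER BY t1.sampleid
""").fetchall()

# Aggregate table_2 first and join it to the distinct rows of table_1.
fixed = conn.execute("""
    SELECT t1.dummyname, t.total_items
    FROM (SELECT DISTINCT sampleid, dummyname FROM table_1) t1
    JOIN (SELECT sampleid, SUM(dummyitem) AS total_items
          FROM table_2
          GROUP BY sampleid) t
      ON t.sampleid = t1.sampleid
    ORDER BY t1.sampleid
""").fetchall()

print(naive)  # [('John', 28), ('Doe', 6), ('Jake', 4)]
print(fixed)  # [('John', 14), ('Doe', 6), ('Jake', 2)]
```

John and Jake appear twice in Table 1, so the naive query doubles their totals.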

Creating a view with a column not in the base table

I have a table with 4 columns. I am being asked to create a view that performs a calculation and then puts the results in a column not in the table.
Here it is: Create a view called v_count that shows the number of students working on each assignment. The view should have columns for the assignment number and the count.
The underlying table does not have a count column.
You need the COUNT function and a GROUP BY clause. Suppose your table has a student ID and an assignment ID:
sId AsnId
1 1
1 2
2 1
2 5
2 8
3 2
3 4
Then the following query gives you the number of students working on each assignment:
SELECT asnId [Assignment], COUNT(sid) [Students]
FROM Assignment
GROUP BY asnid
Now you can use this query to create your view. But do read the docs on COUNT and GROUP BY.
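As a rough end-to-end illustration, the sketch below wraps that query in a view named v_count and reads it back. It runs against SQLite through Python's sqlite3 module, which is an assumption made only for runnability (the question does not say which database is in use), and the table name Assignment and the column aliases are taken from or modelled on the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Assignment (sId INTEGER, AsnId INTEGER);
    INSERT INTO Assignment VALUES
        (1, 1), (1, 2), (2, 1), (2, 5), (2, 8), (3, 2), (3, 4);

    -- The count column exists only in the view, not in the base table.
    CREATE VIEW v_count AS
        SELECT AsnId AS assignment, COUNT(sId) AS students
        FROM Assignment
        GROUP BY AsnId;
""")

view_rows = conn.execute(
    "SELECT assignment, students FROM v_count ORDER BY assignment"
).fetchall()
print(view_rows)  # [(1, 2), (2, 2), (4, 1), (5, 1), (8, 1)]
```

If the same student could be listed twice for one assignment, COUNT(DISTINCT sId) would be the safer choice.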

query gives two of the same results

I have the following SQL query, but there is a problem:
When I execute it, I get the same serial number twice from the "sn" column in the "products" table.
SELECT specifications.productname,
products.sn, specifications.year,
lendings.lending_date
FROM products
INNER JOIN lendings ON products.id = lendings.product_id
INNER JOIN specifications ON products.sn LIKE CONCAT(\'%\', specifications.sn, \'%\') OR products.type LIKE CONCAT(\'%\', specifications.type, \'%\')
WHERE lendings.user_id = ?
EDIT:
lendings table:
user_id product_id
1 1
1 2
2 3
Specifications table:
productname year type sn
name1 2012 1 1234
name2 2011 2 4321
name3 2010 3 3241
products table:
id sn
1 AAAAAAAA1234
2 BBBBBBBB4321
3 CCCCCCCC3241
EDIT2:
SELECT products.id,
specifications.productname,
products.sn,
specifications.year,
lendings.lending_date
FROM products
INNER JOIN lendings ON products.id = lendings.product_id
INNER JOIN specifications ON products2.sn LIKE CONCAT(specifications.sn, \'%\') OR products.type = specifications.type
WHERE lendings.user_id = ?
Then one of your JOIN ... ON conditions is too loose:
for instance, two lendings records may point to the same product.
Usually, that means you don't have all the necessary join columns present in one of your joins and you are getting a cartesian product. In database terms, this means you are joining to a table and expecting to match a single row, but multiple rows match the criteria, so you are actually joining to more than one row. When this happens, you will get the same row multiple times (the products row in your example) in your result.
It would have been better if you posted some test data so this scenario could be confirmed, but since you didn't, I would recommend checking each of your joins to make sure you are not getting multiple rows back for the given products row.
One part of your query I find particularly suspect is this join:
INNER JOIN specifications ON products.sn LIKE CONCAT(\'%\', specifications.sn, \'%\') OR products.type LIKE CONCAT(\'%\', specifications.type, \'%\')
You're joining using a LIKE operator, which seems to have a high chance of getting multiple rows.
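One way to check a suspect join like this is to count how many specifications rows each product matches; any count above 1 produces duplicates. The sketch below does that in SQLite via Python's sqlite3 module, using the sample tables from the question. Two things are assumptions: the products.type values are hypothetical (the question does not list that column), and SQLite's || operator stands in for CONCAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specifications (productname TEXT, year INTEGER, type TEXT, sn TEXT);
    CREATE TABLE products (id INTEGER, sn TEXT, type TEXT);
    INSERT INTO specifications VALUES
        ('name1', 2012, '1', '1234'),
        ('name2', 2011, '2', '4321'),
        ('name3', 2010, '3', '3241');
    -- The type values below are hypothetical; product 3 is given type '1'
    -- on purpose so that the OR branch matches a second specification.
    INSERT INTO products VALUES
        (1, 'AAAAAAAA1234', '1'),
        (2, 'BBBBBBBB4321', '2'),
        (3, 'CCCCCCCC3241', '1');
""")

# Count matching specifications rows per product for the OR-based join.
match_counts = conn.execute("""
    SELECT p.id, COUNT(*) AS matches
    FROM products p
    JOIN specifications s
      ON p.sn LIKE '%' || s.sn || '%'
      OR p.type LIKE '%' || s.type || '%'
    GROUP BY p.id
    ORDER BY p.id
""").fetchall()
print(match_counts)  # [(1, 1), (2, 1), (3, 2)]
```

Product 3 matches one specification by serial number and another by type, so it would appear twice in the joined result.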

Duplicate values returned with joins

I was wondering if there is a way, using a T-SQL join statement (or any other available option), to only display certain values. I will try to explain exactly what I mean.
My database has tables called Job, consign, dechead and decitem. Job, consign and dechead will only ever have one line per record, but decitem can have multiple records, all tied to the dechead with a foreign key. I am writing a query that pulls various values from each table. This is fine with all the tables except decitem. From dechead I need to pull an invoice value, and from decitem I need to grab the net weights. When the results are returned, if dechead has multiple child decitem records, it displays all values from both tables. What I need it to do is display the dechead values only once and then all the decitem values.
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦--¦------¦20.00¦2
3 ¦--¦------¦25.00¦3
Line 1 displays values from dechead and the first line/join from decitem. Lines 2 and 3 just display values from decitem. If I then export the query to, say, Excel, I do not have duplicate values in the first two fields of lines 2 and 3.
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦123¦£2000¦20.00¦2
3 ¦123¦£2000¦25.00¦3
Thanks in advance.
Check out 'group by' for your RDBMS http://msdn.microsoft.com/en-US/library/ms177673%28v=SQL.90%29.aspx
This is a task best left for the application, but if you must do it in SQL, try this:
SELECT
CASE
WHEN RowVal=1 THEN dt.col1
ELSE NULL
END as Col1
,CASE
WHEN RowVal=1 THEN dt.col2
ELSE NULL
END as Col2
,dt.Col3
,dt.Col4
FROM (SELECT
col1, col2, col3, col4
,ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col1,Col4) AS RowVal
FROM ...rest of your big query here...
) dt
ORDER BY dt.col1,dt.Col4
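A runnable version of the same idea, for illustration: the sketch below feeds the three sample lines from the question through a ROW_NUMBER() query and blanks the header columns on every row after the first. It uses SQLite (3.25 or later, for window functions) through Python's sqlite3 module rather than SQL Server, and the table name joined and the output aliases shown_col1 / shown_col2 are hypothetical stand-ins for "the rest of your big query".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the result of the big join: col1/col2 come from dechead,
    -- col3/col4 from decitem.
    CREATE TABLE joined (col1 INTEGER, col2 TEXT, col3 REAL, col4 INTEGER);
    INSERT INTO joined VALUES
        (123, '£2000', 15.0, 1),
        (123, '£2000', 20.0, 2),
        (123, '£2000', 25.0, 3);
""")

blanked = conn.execute("""
    SELECT CASE WHEN RowVal = 1 THEN dt.col1 END AS shown_col1,
           CASE WHEN RowVal = 1 THEN dt.col2 END AS shown_col2,
           dt.col3,
           dt.col4
    FROM (SELECT col1, col2, col3, col4,
                 ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col4) AS RowVal
          FROM joined) dt
    ORDER BY dt.col1, dt.col4
""").fetchall()

for row in blanked:
    print(row)
```

Only the first row of each col1 partition keeps its header values; the remaining rows carry NULL (None in Python) in those two columns.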