How to get multiple columns with a single GROUP BY in Postgres?

I have a table that looks like below
Table "public.test_systems"
Column | Type | Modifiers
-----------------------------+------------------------+-----------
rid | integer | not null
r_osname | character varying(255) |
r_health | integer |
r_patch | bigint |
r_loc | character varying(255) |
Each row in the table represents a system. If I want to find out how many systems there are per OS name, I run a query like the one below
select r_osname, count(*) as total_systems from test_systems group by r_osname;
So I get a result like below
r_osname | total_systems
-----------------------------------------------+--------------
Ubuntu 18.04.4 LTS | 18
Windows 10 Pro | 2
CentOS Linux | 1
Windows Server 2019 | 3
Mac OS X - High Sierra | 2
Now I want to run the same query but return multiple columns. In other words, I want to get multiple columns with a single GROUP BY, but Postgres forces me to list the additional columns in the GROUP BY too.
I tried DISTINCT ON in my query, like below
select distinct on (r_osname) test_systems.* from test_systems order by r_osname;
I got the same number of rows (partial success), but I can't get the count(*) as an additional column.
The final result could look something like below (when including additional columns like r_health and r_loc)
r_osname | r_health | r_loc | total_systems
-----------------------------------------------+-----------------------------------+--------------------+--------------
Ubuntu 18.04.4 LTS | 1012 | NYC | 18
Windows 10 Pro | 1121 | LON | 2
CentOS Linux | 1255 | DEL | 1
Windows Server 2019 | 1451 | HYD | 3
Mac OS X - High Sierra | 1120 | LA | 2
How do I get the expected result?

You need a window function to make this work:
SELECT DISTINCT
       r_osname, r_health, r_loc,
       count(*) OVER (PARTITION BY r_osname, r_health, r_loc) AS total_systems
FROM test_systems;
Depending on which combination of column values you want in the result, you can play with the DISTINCT ON (...) clause. Without any DISTINCT clause you will get as many rows as there are in the table (26 in your example). If you want only one row per OS, use DISTINCT ON (r_osname). Which row is returned then depends on the ORDER BY clause; if none is given, the first row of each set of rows sharing the same r_osname is returned, but there is no way to predict which row that will be.
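For example, a minimal sketch that combines both ideas to reproduce the expected output (it assumes rid is an acceptable tie-breaker for choosing the representative row per OS):
-- one row per OS (the one with the lowest rid), plus the per-OS count
SELECT DISTINCT ON (r_osname)
       r_osname, r_health, r_loc,
       count(*) OVER (PARTITION BY r_osname) AS total_systems
FROM test_systems
ORDER BY r_osname, rid;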

Related

Insert the last characters of a value into a Postgres table while inserting that same value

CONTEXT:
I'm currently building a custom import script in Python with psycopg2 that inserts values from a CSV file into a Postgres database. The CSV, however, provides one value that needs refining.
PROBLEM: In the example below you can see what I want:
I want the last 5 digits of the 15-digit value in their own column.
mytestdb=# select * from testtable;
uid | first_name | last_name | age | 15-digit | last_5_digits
-----+------------+-----------+-----+----------------------------+-----------------
1 | John | Doe | 42 | 99999999912345 | 12345
I know I could accomplish this by first inserting the supplied values (first_name, last_name, age and 15-digit) and then filling the last_5_digits field with RIGHT("15-digit", 5) in an UPDATE statement.
However, I would prefer to do this during the initial INSERT of the row, which would considerably reduce the number of transactions on the database.
Could anyone help me get this done?
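A minimal sketch of one way to do this (assuming the table and column names from the example; since "15-digit" starts with a digit it must be double-quoted as an identifier, and it is assumed to be stored as text):
-- compute the derived value inside the INSERT itself
INSERT INTO testtable (first_name, last_name, age, "15-digit", last_5_digits)
VALUES ('John', 'Doe', 42, '99999999912345', RIGHT('99999999912345', 5));
-- on PostgreSQL 12+, a generated column fills it automatically on every insert:
-- last_5_digits text GENERATED ALWAYS AS (RIGHT("15-digit", 5)) STORED
With psycopg2 you can also use a named placeholder (e.g. %(digits)s) twice in the statement, so the value is still supplied only once from Python.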

Know which tables are affected by a connection

I want to know if there is a way to retrieve which tables are affected by the requests made from a connection, in PostgreSQL 9.5 or higher.
The purpose is to have the information in a form that lets me know which tables were affected, in which order and in what way.
More precisely, something like this would suffice:
id | datetime | id_conn | id_query | table | action
---+----------+---------+----------+---------+-------
1 | ... | 2256 | 125 | user | select
2 | ... | 2256 | 125 | order | select
3 | ... | 2256 | 125 | product | select
(this would be the result of a SELECT query over user JOIN order JOIN product).
I know I can retrieve id_conn through "pg_stat_activity", and I can see whether a query is currently running, but I can't find a "history" of past queries.
The final purpose is to debug the database when inconsistent data is inserted into a table (due to a missing constraint). Knowing which connection did the insert will lead me to the faulty script (I already have the script name linked to the connection id).
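A hedged sketch of one workaround (not part of the question): PostgreSQL keeps no built-in per-table query history, but statement logging records every statement together with the backend that ran it, and the affected tables can then be read from the logged SQL:
ALTER SYSTEM SET log_statement = 'all';        -- log every statement
ALTER SYSTEM SET log_line_prefix = '%m [%p] '; -- timestamp and backend PID on each line
SELECT pg_reload_conf();                       -- apply without a restart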

What is the column limit for Spark Data Frames?

Our team is having a lot of issues with the Spark API, particularly with large-schema tables. We currently have a program written in Scala that uses the Apache Spark API to create two Hive tables from raw files. One particularly large raw data file, containing around 4,700 columns and 200,000 rows, is giving us trouble.
Every week we get a new file with the updates, inserts and deletes that happened in the last week. Our program creates two tables: a master table and a history table. The master table is the most up-to-date version of the data, while the history table records every insert and update that happened to the table and what changed. For example, if we have the following schema where A and B are the primary keys:
Week 1 Week 2
|-----|-----|-----| |-----|-----|-----|
| A | B | C | | A | B | C |
|-----|-----|-----| |-----|-----|-----|
| 1 | 2 | 3 | | 1 | 2 | 4 |
|-----|-----|-----| |-----|-----|-----|
Then the master table will now be
|-----|-----|-----|
| A | B | C |
|-----|-----|-----|
| 1 | 2 | 4 |
|-----|-----|-----|
And The history table will be
|-----|-----|-------------------|----------------|-------------|-------------|
| A | B | changed_column | change_type | old_value | new_value |
|-----|-----|-------------------|----------------|-------------|-------------|
| 1 | 2 | C | Update | 3 | 4 |
|-----|-----|-------------------|----------------|-------------|-------------|
This process works flawlessly for tables with shorter schemas. We have a table with 300 columns but over 100,000,000 rows, and this code still runs as expected. For the larger-schema table, the process above runs for around 15 hours and then crashes with the following error:
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.generic.Growable$class.loop$1(Growable.scala:52)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
Here is a code example that takes around 4 hours to run for this larger table, but runs in 20 seconds for other tables:
// broadcast join on the shared key columns, then repartition and cache the result
var dataframe_result = dataframe1.join(broadcast(dataframe2), Seq(listOfUniqueIds:_*)).repartition(100).cache()
We have tried all of the following with no success:
Using broadcast hash joins (dataframe2 is smaller, dataframe1 is huge)
Repartitioning on different numbers, as well as not repartitioning at all
Caching the result of the dataframe (we originally did not do this).
What is causing this error and how can we fix it? The only difference with this problem table is its sheer number of columns. Is there an upper limit to how many columns Spark can handle?
Note: We are running this code on a very large MapR cluster, and we have tried giving the job 500 GB of RAM, but it is still failing.

MySQL Select if field is unique or null

Sorry, I can't find an example anywhere, mainly because I can't think of any other way to explain it that doesn't include DISTINCT or UNIQUE (which I've found to be misleading terms in SQL).
I need to select unique values AND null values from one table.
FLAVOURS:
id | name | flavour
--------------------------
1 | mark | chocolate
2 | cindy | chocolate
3 | rick |
4 | dave |
5 | jenn | vanilla
6 | sammy | strawberry
7 | cindy | chocolate
8 | rick |
9 | dave |
10 | jenn | caramel
11 | sammy | strawberry
I want the kids who have a unique flavour (vanilla, caramel) and the kids who don't have any flavour.
I don't want the kids with duplicate flavours (chocolate, strawberry).
My searches for help always return an answer for how to GROUP BY, UNIQUE and DISTINCT for chocolate and strawberry. That's not what I want. I don't want any repeated terms in a field - I want everything else.
What is the proper MySQL select statement for this?
Thanks!
You can use HAVING to select just some of the groups, so to select the groups where there is only one flavor, you use:
SELECT * from my_table GROUP BY flavour HAVING COUNT(*) = 1
If you then want to select those users that have NULL entries, you use
SELECT * FROM my_table WHERE flavour IS NULL
and if you combine them, you get all entries that either have a unique flavor, or NULL.
SELECT * from my_table GROUP BY flavour HAVING COUNT(*) = 1 AND flavour IS NOT NULL
UNION
SELECT * FROM my_table WHERE flavour IS NULL
I added the "flavour IS NOT NULL" just to ensure that a flavour that is NULL is not picked if it's the single one, which would generate a duplicate.
I don't have a database to hand, but you should be able to use a query along the lines of:
SELECT name FROM FLAVOURS WHERE flavour IN ( SELECT flavour FROM FLAVOURS GROUP BY flavour HAVING COUNT(flavour) = 1 ) OR flavour IS NULL;
I apologise if this isn't quite right, but hopefully it's a good start.
You need a self-join that looks for duplicates, and then you need to veto those duplicates by looking for cases where there was no match (that's the WHERE t2.flavor IS NULL). Then you're doing something completely different: looking for NULLs in the original table, with the second line of the WHERE clause (OR t1.flavor IS NULL).
SELECT DISTINCT t1.name, t1.flavor
FROM tablename t1
LEFT JOIN tablename t2
ON t2.flavor = t1.flavor AND t2.ID <> t1.ID
WHERE t2.flavor IS NULL
OR t1.flavor IS NULL
I hope this helps.

Select distinct rows from MongoDB

How do you select distinct records in MongoDB? This is pretty basic database functionality, I believe, but I can't seem to find it anywhere else.
Suppose I have a table as follows
--------------------------
| Name | Age |
--------------------------
|John | 12 |
|Ben | 14 |
|Robert | 14 |
|Ron | 12 |
--------------------------
I would like to run something like SELECT DISTINCT age FROM names WHERE 1;
db.names.distinct('age')
Looks like there is a SQL mapping chart that I overlooked earlier.
Now is a good time to say that using a distinct selection isn't the best way to go about querying things. Either cache the list in another collection or keep your data set small.