Is there a way to use order by to order similar fields together? - postgresql

Is there a way to make similar values equivalent in an order by?
Say the data is:
name | number
John. | 9
John | 1
John. | 2
Smith | 4
John | 3
I'd want to order by name and then number so that the output looks like the table below, but ORDER BY name, number will put all John entries ahead of the John. entries.
name | number
John | 1
John. | 2
John | 3
John. | 9
Smith | 4

There's Fuzzy String Matching and beyond that, Pattern Matching.
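For instance, a minimal sketch using the pg_trgm extension (assuming it can be installed in your database) shows how close the two variants from the question score:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('John', 'John.');  -- expected to be 1 or very close, since pg_trgm ignores punctuation by default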

You need some more advanced treatment of the name field, then.
This topic will help you strip non-alphabetic characters from your string before ordering:
How to strip all non-alphabetic characters from string in SQL Server?
But the fact that you need such a complex function makes me question how your database was designed in the first place: if "John" and "John." are the same person, they should have the same name. So if the "." is important, that means you need another field to store the information it represents.

Use a regex replace function to strip out all special characters in your data, replacing them with a space. Then wrap that in a TRIM function to remove the leading and trailing spaces:
TRIM(CASE
         WHEN name LIKE '%.%'
           OR name LIKE '%\_%'  -- escape the underscore, since a bare _ is a LIKE wildcard
           OR name ~ '\d'       -- regex match: name contains a digit
         THEN REGEXP_REPLACE(name, '(_|\.|\d)', ' ', 'g')  -- 'g' replaces every occurrence
         ELSE name
     END) AS name_processed
The bit in brackets means: replace an underscore or (|) a period or a digit with whatever comes after the comma, which here is a space. The 'g' flag makes the replacement apply to every occurrence rather than just the first, and the ELSE branch keeps names with no special characters instead of turning them into NULL.
Now you can order by name_processed and number as well
ORDER BY name_processed, number
But you can always keep the original name in a SELECT afterwards if you write the processing as a subquery first through WITH. Let me know if you want to do this. Basically, the syntax would be:
WITH processed_names AS (
    SELECT
        name,
        TRIM(CASE
                 WHEN name LIKE '%.%'
                   OR name LIKE '%\_%'
                   OR name ~ '\d'   -- regex match: name contains a digit
                 THEN REGEXP_REPLACE(name, '(_|\.|\d)', ' ', 'g')
                 ELSE name
             END) AS name_processed,
        number
    FROM names
)
SELECT
    name,
    number
FROM processed_names
ORDER BY name_processed, number;
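If you only need the cleaned-up value for sorting, a shorter variant (a sketch, under the same assumption that the table is named names) is to compute the expression directly in the ORDER BY:
SELECT name, number
FROM names
ORDER BY TRIM(REGEXP_REPLACE(name, '(_|\.|\d)', ' ', 'g')), number;  -- strip periods, underscores and digits just for sorting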


How to filter rows based on a condition and if the condition isn't met, grab another row in Talend?

It was hard to think of a title for this question, so hopefully that did make sense.
I will explain further. I have a flow of data from an Excel file and each row has one of two words in the last column. It will either contain "Open" or "Current".
So let's say I have an input that looks like this:
NAME | SSN | TYPE
John | 12345| Current
Katy | 99999| Current
Sam | 33333| Current
John | 12345| Open
Cody | 55555| Open
And the goal is to grab each person only once. Each person's SSN is their unique id. I want to grab the Open row if both Open and Current exist for that person. If only Current exists, then grab that.
So the final output should look like this:
NAME | SSN | TYPE
Katy | 99999| Current
Sam | 33333| Current
John | 12345| Open
Cody | 55555| Open
NOTE: As you can see, the first entry for John has been removed since he had an Open row.
I have attempted this already but it is sloppy and I figure there must be a better way. Here is an image of what I have done:
Talend flow
Here's how you can do it:
First sort the data by Name, and by Type descending (this is important so that, for each person, the Open record is on top); then in the tMap, filter it like this:
Numeric.sequence(row2.name, 1, 1) == 1
This only lets a record through if it is the first time we're seeing this name.
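For comparison only, the same "sort, then keep the first row per person" idea can be sketched in SQL, assuming the rows were loaded into a hypothetical table people(name, ssn, type):
SELECT DISTINCT ON (ssn) name, ssn, type
FROM people
ORDER BY ssn, type DESC;  -- 'Open' sorts after 'Current', so DESC puts it first and DISTINCT ON keeps it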

Atomic values / divisibility to reach 1NF

After reading about normalization I am unsure of how to interpret the 1NF requirements.
According to Wikipedia, something is in first normal form if the "domain of each attribute contains only atomic (indivisible) values".
My question is: who decides what is indivisible or not?
You may divide a date datatype into year, month, day, second, nanoseconds. You may as well divide an address into exact latitude coordinates. When can you really be sure that you have reached 1NF?
Would this table be considered 1NF?
fullName | fullAddress
---------------------------
Joe Zowesson | 87th Victoria Street London EC96 1MB, 14584
Mason Hamburg | 47th Jeremy Street London EC26 1MB, 13584
Dedrik Terry | 27th Burger Street London EC16 1MB, 17584
My interpretation here is that the value Joe Zowesson is indivisible with regard to the column fullName, and that zip code, street number and street name are all atomic in relation to the column fullAddress.
I am almost certain that I am in the wrong, but I can not yet understand why.
The question is in regard to an upcoming exam, where I will need to "prove" which normal form something is currently in. This is something I find very hard, depending on how you interpret the word atomic.
You have misunderstood the concept of 1NF basically. By atomic value, it is meant that when you have a column for Name, you should not store any other values alongside it. In other words, the column intended for the Name should not store ID, Address or anything else together with Name, so that when you query the column Name you get only Name, and not name with Id or Address. And Name can be in any form you want whether it be First name + Last name or First name + Last name + Middle name + Previous name.
The decision of whether you need separate columns for the related data should be made during design. Let's suppose you have table Student:
StudentId | FullName | Address | AverageGrade
---------------------------------------------
1 | John Done | New York, US | 3.4
2 | Robert Bored | New York, US | 0
3 | Student LName | Dallas, US | 1
4 | Another LName | Munich, Germany | 2
In this case, it means that you do not write queries that need the First name and Last name separately; you need the whole name at once, for example:
SELECT FullName
FROM Student
WHERE StudentId = 1;
John Done
And when you need First name, Last name separately, you decompose them into several columns, for example:
StudentId | FirstName | LastName | Address | AverageGrade
---------------------------------------------------------
1 | John | Done | New York, US | 3.4
2 | Robert | Bored | New York, US | 0
3 | Student | LName | Dallas, US | 1
4 | Another | LName | Munich, Germany | 2
And your queries might look like this:
SELECT LastName, AverageGrade
FROM Student
WHERE AverageGrade >= 1 AND FirstName != 'John';
The result will be:
| LastName | AverageGrade |
---------------------------
| LName | 1 |
| LName | 2 |
Or something like this maybe:
UPDATE Student
SET AverageGrade = 4
WHERE LastName = 'LName' AND FirstName != 'Student'
Basically, the decision depends on how you manipulate the data and in which form you need it.
To sum it up: whether the relation is in 1NF or not depends on what values you're trying to store in the table. As mentioned above, one column should store only one kind of value, e.g. ID, Address, Name, etc. And the decision of how your columns' values will look depends on the design and how you NEED TO STORE the data. If you do not need to query firstname, middlename, lastname, secondname separately, then you can just save all of them in one column FullName and it will still be in 1NF. But if you need them separately, you can store them in separate columns, and again it will still be in 1NF, though it might violate other rules.
Here are some tutorials you might find useful: https://www.studytonight.com/dbms/first-normal-form.php
Let the application, and how it will be used, guide you as to what data should be split further into additional fields (or not).
For example;
If, in your application, you are constantly splitting first name from last name so that you can say "Hi Joe" on correspondence, you should split fullName into two fields. Conversely, if you had two fields firstName and lastName, and were always concatenating them so that you could correctly address an envelope, it would make more sense to have those two fields stored in a single column in your table.
In practice, it is not uncommon for a database to show some de-normalization with the above example, given how common both scenarios are, but the risk is that the fields get out of sync if someone updates first name (for example) but doesn't update fullName.
Consider things like how you will force your users to follow a certain pattern if you decide to go with a single column fullName. How would you prevent "Smith, Joe" if your application needed "Joe Smith"?
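One crude guard at the database level is a CHECK constraint that rejects commas; a hedged sketch with hypothetical table and constraint names:
CREATE TABLE correspondence_contact (
    contact_id serial PRIMARY KEY,
    full_name  text NOT NULL,
    CONSTRAINT full_name_has_no_comma CHECK (position(',' IN full_name) = 0)  -- rejects values like 'Smith, Joe'
);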
Dates are another good example and again, whether you split the parts into separate columns depends on how they will be used.
A datetime field which indicates when a row was inserted probably doesn't need to be split out, but if you had many queries which were only interested in the year (for example), it might make sense to split it out.
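For instance, a year-only query against an unsplit timestamp column might look like this sketch (hypothetical orders table and created_at column):
SELECT count(*)
FROM orders
WHERE EXTRACT(YEAR FROM created_at) = 2023;  -- pulls the year out at query time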
This only scratches the surface, which is why this answer is more about how to think about the underlying problem. Yes, normalizing your database is important for all kinds of reasons, but how far you go with it depends on how your data will be used at the end of the day.

postgres ANY() with BETWEEN Condition

In case someone is wondering, I am recycling a different question I answered myself, because I realized that my problem has a different root cause than I thought:
My question actually seems pretty simple, but I cannot find a way.
How do I query Postgres to find whether any element of an array is between two values?
The documentation states that a BETWEEN b AND c is equivalent to a >= b AND a <= c.
This however does not work on arrays, as
ANY({1,101}) BETWEEN 10 AND 20 has to be false,
while
ANY({1,101}) >= 10 AND ANY({1,101}) <= 20 has to be true,
{1,101} meaning an array containing the two elements 1 and 101.
How can I solve this problem without resorting to workarounds?
regards,
BillDoor
EDIT, for clarity:
The scenario I have is: I am querying an XML document via xpath(), but for this problem a column containing an array of type int[] does the job.
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Bob
I want all names where there is a number between 20 and 30, so I want Bob.
The query
SELECT name FROM table WHERE 20 < ANY(numbers) AND 30 > ANY(numbers)
will return Alice and Bob, showing that Alice has numbers greater than 20 as well as other numbers less than 30.
A BETWEEN syntax is not allowed in this case, even though BETWEEN only gets mapped to >= 20 AND <= 30 internally anyway.
Quoting the documentation on how the BETWEEN operator maps to comparisons:
There is no difference between the two respective forms apart from the
CPU cycles required to rewrite the first one into the second one
internally.
PS.:
Just to avoid adding a new question for this: how can I solve
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Alicia
SELECT id FROM table WHERE ANY(name) LIKE 'Alic%'
result: 1, 2
I can only find examples of matching one value against multiple patterns, but not matching one pattern against a set of values. Besides, the shown syntax is invalid: ANY has to be the second operand, but the second operand of LIKE has to be the pattern.
exists (select * from (select unnest(array[1,101]) x ) q1 where x between 10 and 20 )
You can create a function based on this query.
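Applied to the table from the question (assuming it is named t, with columns id int, numbers int[], name text), the same pattern might look like:
SELECT name
FROM t
WHERE EXISTS (SELECT * FROM (SELECT unnest(numbers) AS x) q1 WHERE x BETWEEN 20 AND 30);
-- expected to return only Bob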
second approach:
select int4range(10,20,'[]') @> any(array[1, 101])
For timestamps and dates it's like:
select tsrange('2015-01-01'::timestamp,'2015-05-01'::timestamp,'[]') @> any(array['2015-05-01', '2015-05-02']::timestamp[])
for more info read: range operators
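The second approach applied to the same assumed table t might look like:
SELECT name
FROM t
WHERE int4range(20, 30, '[]') @> ANY(numbers);
-- expected to return only Bob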

Counting multiple values from one column in Tableau

I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
So that there could be a count of Bob (2), Sue (2), and Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the contains() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = contains(Persons, "Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you can write and union several queries, with each one returning the project and the nth element.
Ideally, you'd reshape your data to look like this either in the database or with the above mentioned union technique. Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob
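If the data happens to live in PostgreSQL, the union technique mentioned above might be sketched like this (hypothetical table and column names; it assumes at most three tags per row, separated by '; '):
SELECT project, person
FROM (
    SELECT project, split_part(persons, '; ', 1) AS person FROM projects
    UNION ALL
    SELECT project, split_part(persons, '; ', 2) FROM projects
    UNION ALL
    SELECT project, split_part(persons, '; ', 3) FROM projects
) unioned
WHERE person <> '';  -- drop the empty slots produced by rows with fewer than three tags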

Space character within symbol literals

I need to query a database which contains names of companies. I have a list of around 50 names for which I have to get the data, but I am unable to write a query using the in keyword, as there are spaces in the names which are not being recognized. For example:
select from sales where name in (`Coca Cola, `Pepsi)
This is giving me an error as 'Cola' is not being recognized. Is there a way to write such a query?
The spaces inside the names confuse the interpreter when they are written as symbol literals. Applying `$ to a list of strings casts them to symbols.
q)t:([] a:1 2 3; name:`$("coca cola";"pepsi";"milk"))
q)select from t where name in `$("coca cola";"pepsi")
a name
-----------
1 coca cola
2 pepsi
You may also want to be careful of casing: either use consistently lower or upper case, or else you may get unexpectedly empty results:
q)select from t where name in `$("Coca Cola";"Pepsi")
a name
------
q)select from t where upper[name] in upper `$("Coca Cola";"Pepsi")
a name
-----------
1 coca cola
2 pepsi
You need to do something like the following:
select from sales where name in `$("Coca Cola";"Pepsi")