I want each repeated title to appear only once, with its authors combined as "name, name", instead of the title being repeated on a separate row for every author.
#CODE
create or replace view Master_view as
select books.title, books.isbn as book_id, authors.name, authors.e_mail as author_id
from book_author join books on book_author.book_id = books.id
join authors on book_author.author_id = authors.id;
#OUTPUT
title | book_id | name | author_id
-----------------------------------+---------+---------------------------+--------------------------------
The Pragmatic Programmer | 999999 | Andrew Hunt | andyhunt#pragprogrammers.com
The Pragmatic Programmer | 999999 | Dave Thomas | davethomas#pragprogrammers.com
Pragmatic Thinking and Learning | 999998 | Andrew Hunt | andyhunt#pragprogrammers.com
Pragmatic Unit Testing | 999997 | Andrew Hunt | andyhunt#pragprogrammers.com
Pragmatic Unit Testing | 999997 | Dave Thomas | davethomas#pragprogrammers.com
Agile Web Development with Rails | 999996 | Dave Thomas | davethomas#pragprogrammers.com
Agile Web Development with Rails | 999996 | Sam Ruby | samruby#pragprogrammers.com
Agile Web Development with Rails | 999996 | David Heinemeier Hansson | dhh#railsrules.co
(8 rows)
create or replace view Master_view as
select books.title, books.isbn as book_id, array_agg(authors.name) as name, array_agg(authors.e_mail) as author_id
from book_author join books on book_author.book_id = books.id
join authors on book_author.author_id = authors.id group by (books.title, books.isbn);
This will give you the expected results. You have to group the rows by the title and ISBN of the book, so that array_agg can collect the authors of each book into a single row.
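To make the grouping step concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. SQLite is only a stand-in for PostgreSQL here: it has no array_agg, so the sketch uses group_concat, which plays roughly the role that string_agg(authors.name, ', ') would play in PostgreSQL. The table contents are a cut-down copy of the rows shown in the question, and the e_mail column is left out.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, isbn TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_author (book_id INTEGER, author_id INTEGER);
    INSERT INTO books VALUES
        (1, 'The Pragmatic Programmer', '999999'),
        (2, 'Pragmatic Thinking and Learning', '999998');
    INSERT INTO authors VALUES (1, 'Andrew Hunt'), (2, 'Dave Thomas');
    INSERT INTO book_author VALUES (1, 1), (1, 2), (2, 1);
""")

# One output row per book; the author names of each group are joined with ', '.
rows = conn.execute("""
    SELECT books.title, books.isbn AS book_id,
           group_concat(authors.name, ', ') AS names
    FROM book_author
    JOIN books ON book_author.book_id = books.id
    JOIN authors ON book_author.author_id = authors.id
    GROUP BY books.title, books.isbn
""").fetchall()

# The order of names inside a group is not guaranteed, so sort before comparing.
authors_by_title = {title: sorted(names.split(", ")) for title, _, names in rows}
for title, names in authors_by_title.items():
    print(title, "|", ", ".join(names))
```

Each title now appears once, with all of its authors on the same row.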
I am trying to come up with a PySpark SQL query that returns the row of the review DataFrame whose text column contains the largest number of words.
I would like to return both the full text and the number of words. This question concerns the reviews in the Yelp dataset. Here is what I have so far, but apparently it is not (fully) correct:
query = """
SELECT text,LENGTH(text) - LENGTH(REPLACE(text,' ', '')) + 1 as count
FROM review
GROUP BY text
ORDER BY count DESC
"""
spark.sql(query).show()
Here is an example of a few rows from the dataframe:
[Row(business_id='ujmEBvifdJM6h6RLv4wQIg', cool=0, date='2013-05-07 04:34:36', funny=1, review_id='Q1sbwvVQXV2734tPgoKj4Q', stars=1.0, text='Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', useful=6, user_id='hG7b0MtEbXx5QzbzE6C_VA'),
Row(business_id='NZnhc2sEQy3RmzKTZnqtwQ', cool=0, date='2017-01-14 21:30:33', funny=0, review_id='GJXCdrto3ASJOqKeVWPi6Q', stars=5.0, text="I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!", useful=0, user_id='yXQM5uF2jS6es16SJzNHfg'),
Row(business_id='WTqjgwHlXbSFevF32_DJVw', cool=0, date='2016-11-09 20:09:03', funny=0, review_id='2TzJjDVDEuAW6MR5Vuc1ug', stars=5.0, text="I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!", useful=3, user_id='n6-Gk65cPZL6Uz8qRm3NYw')]
And expected output if this was the review with the most words:
I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!
And then something like Word count = xxxx
Edit: Here is the example output for the first review using this code:
query = """
SELECT text, size(split(text, ' ')) AS word_count
FROM review
ORDER BY word_count DESC
"""
spark.sql(query).show(20, False)
Review returned with highest number of words:
Got a date with de$tiny?
** A ROMANTIC MOMENT WITH **
** THE BEST VIEW IN TOWN**
------------------------------------------------
/ **CN TOWER'S** \
/ **REVOLVING RESTAURANT** \
\ /
\ ----------------------------------------------- /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
/ \
===========
o o~
/|~ ~|\
/\ / \ uhm, maybe not. the view may be great but a $30 to
$40 bleh $teak ain't necessarily gonna get you some
action later. Cheaper to get takeout from Harvey's and
eat and the beach! |4329 |
You can replace the UDF you had with native SQL logic: split the string into an array of words and take the size of that array.
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Example
data = [("This is a sentence.",), ("This sentence has 5 words.", )]
review = spark.createDataFrame(data, ("text", ))
review.createOrReplaceTempView("review")
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Output
+--------------------------+----------+
|text |word_count|
+--------------------------+----------+
|This sentence has 5 words.|5 |
|This is a sentence. |4 |
+--------------------------+----------+
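The ASCII-art review in the question is reported with 4329 words, most likely because splitting on a single space yields an empty string for every extra space in a run of spaces. The sketch below shows the effect in plain Python, whose str.split(' ') treats consecutive separators the same way; it is an illustration of the counting logic, not Spark code, and the helper names are hypothetical.

```python
import re

def count_single_space(text):
    # Mirrors splitting on ' ': every space is a separator, so runs of
    # spaces produce empty strings that are counted as "words".
    return len(text.split(" "))

def count_whitespace_runs(text):
    # Counts maximal runs of non-whitespace characters instead.
    return len(re.findall(r"\S+", text))

sample = "eat   at the  beach!"
print(count_single_space(sample))     # inflated by the repeated spaces
print(count_whitespace_runs(sample))  # number of visible words
```

In Spark SQL, the analogous change would be to split on a whitespace pattern rather than on a single space and to ignore empty elements; that variant is not tested here.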
I have a table like this:
| id | cars | owner |
|----|--------------------------|----------------|
| 1 | {tesla, bmw, mercedes} | Chris Houghton |
| 2 | {toyota, bmw, fiat} | Matt Quinn |
Is there a way to extract the DISTINCT values from the cars array column and store them in a new table without duplicate values?
I want this table
| brands |
|--------|
| tesla |
| bmw |
|mercedes|
| toyota |
| fiat |
I believe you are looking for this kind of statement.
SELECT
DISTINCT
table_array.array_unnest
FROM (
SELECT
UNNEST(cars)
FROM
<table>
) AS table_array(array_unnest)
see demo
This indeed works, but how can I store them in, for example, a column "brand" of a table Manufactures?
INSERT INTO
Manufactures
(brand)
SELECT
DISTINCT
table_array.array_unnest
FROM (
SELECT
UNNEST(cars)
FROM
<table>
) AS table_array(array_unnest)
see demo
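For readers who want to see what UNNEST followed by DISTINCT computes, here is a small Python sketch of the same logic: flatten every array into individual values, then keep each value once. The function name and the list-of-lists input are assumptions made for illustration. Unlike SELECT DISTINCT, which guarantees no particular order, this sketch keeps the order of first appearance.

```python
def distinct_brands(car_arrays):
    """Flatten the per-row arrays and keep each brand once (first-seen order)."""
    seen = set()
    brands = []
    for cars in car_arrays:      # one array per table row
        for brand in cars:       # UNNEST: one output value per array element
            if brand not in seen:  # DISTINCT: skip values already emitted
                seen.add(brand)
                brands.append(brand)
    return brands

rows = [["tesla", "bmw", "mercedes"], ["toyota", "bmw", "fiat"]]
brands = distinct_brands(rows)
print(brands)
```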
I'm using Postgresql. Let's say I have 3 tables:
Classes
id | name
1 | Biology
2 | Math
Students
id | name
1 | John
2 | Jane
Student_Classes
id | student_id | class_id | registration_token
1 | 1 | 1 | abc
2 | 1 | 2 | def
3 | 2 | 1 | zxc
I want to obtain a result set like this:
Results
student_name | biology | math
John | abc | def
Jane | zxc | NULL
I can get this result set with this query:
SELECT
    Students.name AS student_name,
    biology.registration_token AS biology,
    math.registration_token AS math
FROM
    Students
LEFT JOIN (
    SELECT student_id, registration_token FROM Student_Classes WHERE class_id = (
        SELECT id FROM Classes WHERE name = 'Biology'
    )
) AS biology
ON Students.id = biology.student_id
LEFT JOIN (
    SELECT student_id, registration_token FROM Student_Classes WHERE class_id = (
        SELECT id FROM Classes WHERE name = 'Math'
    )
) AS math
ON Students.id = math.student_id
Is there a way to get this same result set without having a join statement for each class? With this solution, if I want to add a class, I need to add another join statement.
You can do this with the crosstab function from PostgreSQL's tablefunc extension, but presentation requirements like this are often better handled outside of SQL.
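As one possible way of handling the pivot outside of SQL, the sketch below fetches the data with a single join that does not depend on the number of classes and then builds the student-by-class table in Python. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the function name pivot_registrations is hypothetical.

```python
import sqlite3

def pivot_registrations(conn):
    """Return {student_name: {class_name: registration_token or None}}."""
    class_names = [row[0] for row in conn.execute("SELECT name FROM Classes ORDER BY id")]
    rows = conn.execute("""
        SELECT s.name, c.name, sc.registration_token
        FROM Students s
        LEFT JOIN Student_Classes sc ON sc.student_id = s.id
        LEFT JOIN Classes c ON c.id = sc.class_id
        ORDER BY s.id, c.id
    """).fetchall()
    table = {}
    for student, class_name, token in rows:
        # Start every student with one empty cell per class.
        cells = table.setdefault(student, {name: None for name in class_names})
        if class_name is not None:
            cells[class_name] = token
    return table

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Classes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Student_Classes (id INTEGER PRIMARY KEY, student_id INTEGER,
                                  class_id INTEGER, registration_token TEXT);
    INSERT INTO Classes VALUES (1, 'Biology'), (2, 'Math');
    INSERT INTO Students VALUES (1, 'John'), (2, 'Jane');
    INSERT INTO Student_Classes VALUES (1, 1, 1, 'abc'), (2, 1, 2, 'def'), (3, 2, 1, 'zxc');
""")

result = pivot_registrations(conn)
for student, cells in result.items():
    print(student, cells)
```

Adding a class then only requires a new row in Classes; neither the query nor the pivot code changes.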
I need to join two tables based on names. The problem is that a name may be slightly misspelled in one of the tables. I have remedied this problem in the past using fuzzy merging in Stata and Python, where names are matched based on how similar they are, but I am wondering if this is possible in PostgreSQL.
For example, my data may look something like this:
Table A:
first_name_a | last_name_a | id_a
----------------------------------
William | Hartnell | 1
Matt | Smithe | 2
Paul | McGann | 3
David | Tennant | 4
Colin | Baker | 5
Table B:
first_name_b | last_name_b | id_b
----------------------------------
Matt | Smith | a
Peter | Davison | b
Dave | Tennant | c
Colin | Baker | d
Will | Hartnel | e
And in the end, I hope my results would look something like:
first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b
----------------------------------------------------------------------
William | Hartnell | 1 | Will | Hartnel | e
Matt | Smithe | 2 | Matt | Smith | a
Paul | McGann | 3 | | |
David | Tennant | 4 | Dave | Tennant | c
Colin | Baker | 5 | Colin | Baker | d
| | | Peter | Davison | b
My Sonic Screwdriver gives me some pseudo-code like this:
SELECT a.*, b.* FROM A a
JOIN B b
WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
The query you mention is close. In valid PostgreSQL, the join condition goes in an ON clause and uses a comparison operator:
SELECT a.*, b.* FROM A a
JOIN B b
ON LEVENSHTEIN(a.first_name_a, b.first_name_b) < 1
AND LEVENSHTEIN(a.last_name_a, b.last_name_b) < 1;
Then just bump up the 'fuzziness': a threshold of '< 1' only accepts identical strings, so replace 1 with the level of fuzziness that you require.
See http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html for reference info on LEVENSHTEIN.
Done up as an SQLFiddle. Play with the thresholds/look at some of the other mapping functions mentioned in matching fuzzy strings.
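To get a feel for which threshold the sample data needs, here is a small pure-Python sketch of the Levenshtein distance applied to the names from the question. It is an illustrative reimplementation, not the code PostgreSQL runs, and fuzzy_pairs is a hypothetical helper that mimics an inner join on both name columns.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_pairs(table_a, table_b, max_distance):
    """Pairs whose first AND last names are within max_distance edits."""
    return [(a, b) for a in table_a for b in table_b
            if levenshtein(a[0], b[0]) <= max_distance
            and levenshtein(a[1], b[1]) <= max_distance]

table_a = [("William", "Hartnell"), ("Matt", "Smithe"), ("Paul", "McGann"),
           ("David", "Tennant"), ("Colin", "Baker")]
table_b = [("Matt", "Smith"), ("Peter", "Davison"), ("Dave", "Tennant"),
           ("Colin", "Baker"), ("Will", "Hartnel")]

for limit in (1, 3):
    print(limit, fuzzy_pairs(table_a, table_b, limit))
```

With a limit of 1, only the Smithe/Smith and Baker rows pair up; William/Will is three edits apart and David/Dave two, so the desired result in the question needs a threshold of at least 3 on the first name.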
I have two tables, CompanyAddresses & MyCompanyAddresses. (Names changed to protect the guilty).
CompanyAddresses holds a list of default addresses for companies. These records are immutable. The user can change the details of a company address, but those changes are stored in MyCompanyAddresses.
How can I produce a single list of addresses from both tables, excluding records from CompanyAddresses where a corresponding record exists in MyCompanyAddresses?
Sample Data
CompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
1 | Guid1 | APL | Apple | 1 Infinite Loop | Cupertino | 95014 | 11/1/2012
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
MyCompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
Desired Results
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
I've tried various permutations of MS SQL's UNION, EXCEPT & INTERSECT to no avail. Also, I don't believe JOINs are the answer either, but I'll happily be proven wrong.
The database design can be changed, but it would be preferable if it stayed the same.
Use a LEFT JOIN in combination with COALESCE. If the JOIN finds a match, the COALESCE will select values from the overridden row. If no match is found, the original values are returned.
SELECT COALESCE(mca.DatabaseId, ca.DatabaseId) AS DatabaseId,
       COALESCE(mca.Id, ca.Id) AS Id,
       ca.Code,
       COALESCE(mca.Name, ca.Name) AS Name,
       COALESCE(mca.Street, ca.Street) AS Street,
       COALESCE(mca.City, ca.City) AS City,
       COALESCE(mca.Zip, ca.Zip) AS Zip,
       COALESCE(mca.MaintDate, ca.MaintDate) AS MaintDate
FROM CompanyAddresses ca
LEFT JOIN MyCompanyAddresses mca
    ON ca.Code = mca.Code;
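The LEFT JOIN plus COALESCE approach can be tried out quickly with the sample data from the question. The sketch below uses an in-memory SQLite database from Python as a stand-in for MS SQL (both support LEFT JOIN and COALESCE), stores the dates as plain text, and selects only a few of the columns to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CompanyAddresses (DatabaseId INTEGER, Id TEXT, Code TEXT, Name TEXT,
                                   Street TEXT, City TEXT, Zip TEXT, MaintDate TEXT);
    CREATE TABLE MyCompanyAddresses (DatabaseId INTEGER, Id TEXT, Code TEXT, Name TEXT,
                                     Street TEXT, City TEXT, Zip TEXT, MaintDate TEXT);
    INSERT INTO CompanyAddresses VALUES
        (1, 'Guid1', 'APL', 'Apple', '1 Infinite Loop', 'Cupertino', '95014', '11/1/2012'),
        (2, 'Guid2', 'MS', 'Microsoft', 'One Microsoft Way', 'Redmond', '98052', '11/1/2012');
    INSERT INTO MyCompanyAddresses VALUES
        (5, 'Guid3', 'APL', 'Apple', 'Updated Address', 'Cupertino', '95014', '11/6/2012');
""")

# For each default address, prefer the user's override when one exists.
rows = conn.execute("""
    SELECT COALESCE(mca.DatabaseId, ca.DatabaseId) AS DatabaseId,
           COALESCE(mca.Id, ca.Id) AS Id,
           ca.Code,
           COALESCE(mca.Street, ca.Street) AS Street
    FROM CompanyAddresses ca
    LEFT JOIN MyCompanyAddresses mca ON ca.Code = mca.Code
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
```

Two caveats follow from how COALESCE works: if an override row leaves a column NULL, the default value shows through for that column, and rows that exist only in MyCompanyAddresses are not returned because the query starts from CompanyAddresses.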