Query with start and end dates - T-SQL

Can somebody help me with a query? I have the table dbo.table and have written this query:
select Name, Service, number, Dates, Country, Price
from dbo.table
where Country is not null and Service in ('Incoming', 'Outgoing') and years = 2014 and months = 3
I am getting the results:
User1 Incoming 1111 03.07.2014 Belarus 5,5
User1 Incoming 1111 03.09.2014 Belarus 1,5
User1 Incoming 1111 03.10.2014 Belarus 1,5
User1 Outgoing 1111 03.10.2014 Belarus 2
User1 Outgoing 1111 03.11.2014 Belarus 3
User1 Outgoing 1111 03.11.2014 Belarus 4
User1 Incoming 1111 03.07.2014 France 4,3
User1 Incoming 1111 03.07.2014 France 2,7
User1 Incoming 1111 03.08.2014 France 1
User1 Outgoing 1111 03.15.2014 France 2
User1 Outgoing 1111 03.15.2014 France 3
User1 Outgoing 1111 03.15.2014 France 6
What query should I use to get this result:
User1 Incoming 03.07.2014 03.10.2014 Belarus 8,5
User1 Outgoing 03.10.2014 03.11.2014 Belarus 9
User1 Incoming 03.07.2014 03.08.2014 France 8
User1 Outgoing 03.15.2014 03.15.2014 France 11

You can aggregate per Name, Service, and Country, taking the MIN and MAX of the dates and the SUM of the prices:

select Name, Service, MIN(Dates) as MinDate, MAX(Dates) as MaxDate, Country, SUM(Price) as Price
from dbo.table
where Country is not null and Service in ('Incoming', 'Outgoing') and years = 2014 and months = 3
GROUP BY Name, Service, Country

Related

Get distinct values in PySpark and place duplicate values in another column

Input Table:
prod  acct  acctno  newcinsfx
John  A01   1       89
John  A01   2       90
John  A01   2       92
Mary  A02   1       92
Mary  A02   3       81
Desired output table:
prod  acct  newcinsfx1  newcinsfx2
John  A01   89
John  A01   90          92
Mary  A02   92
Mary  A02   81
I tried to do it with the distinct function:
distinct_df = df.select('prod', 'acctno').distinct()
distinct_df.show()
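
Not part of the original post, but one possible direction: a minimal PySpark sketch, assuming the goal is one output row per (prod, acct, acctno) group with the duplicate newcinsfx values spread across two columns. Sample data and column names are taken from the example; array_sort is used only to make the column order deterministic.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("John", "A01", 1, 89), ("John", "A01", 2, 90), ("John", "A01", 2, 92),
     ("Mary", "A02", 1, 92), ("Mary", "A02", 3, 81)],
    ["prod", "acct", "acctno", "newcinsfx"],
)

# Collect the newcinsfx values of each (prod, acct, acctno) group,
# then spread the collected list over two columns.
result = (
    df.groupBy("prod", "acct", "acctno")
      .agg(F.array_sort(F.collect_list("newcinsfx")).alias("vals"))
      .select(
          "prod",
          "acct",
          F.col("vals").getItem(0).alias("newcinsfx1"),
          F.col("vals").getItem(1).alias("newcinsfx2"),  # null when there is no duplicate
      )
)
result.show()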

Remove duplicates in Spark with a 90 percent column match

I need to compare rows in a Spark dataframe and remove a row if 90 percent of its columns match another row (if there are 10 columns and 9 match). How can I do this?
Name Country City Married Salary
Tony India Delhi Yes 30000
Carol USA Chicago Yes 35000
Shuaib France Paris No 25000
Dimitris Spain Madrid No 28000
Richard Italy Milan Yes 32000
Adam Portugal Lisbon Yes 36000
Tony India Delhi Yes 22000 <--
Carol USA Chicago Yes 21000 <--
Shuaib France Paris No 20000 <--
The marked rows have to be removed, since 4 out of 5 of their column values match already existing rows. How can I do this with a PySpark DataFrame? TIA
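
This is not from the original question, but one way to approach it: a minimal PySpark sketch that pairs every row with every earlier row and counts exactly matching columns, under the assumption that "match" means exact equality per column and that a quadratic self-join is acceptable for the data size. The threshold of 4 matching columns out of 5 follows the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

cols = ["Name", "Country", "City", "Married", "Salary"]
df = spark.createDataFrame(
    [("Tony", "India", "Delhi", "Yes", 30000),
     ("Carol", "USA", "Chicago", "Yes", 35000),
     ("Shuaib", "France", "Paris", "No", 25000),
     ("Tony", "India", "Delhi", "Yes", 22000),
     ("Carol", "USA", "Chicago", "Yes", 21000),
     ("Shuaib", "France", "Paris", "No", 20000)],
    cols,
).withColumn("row_id", F.monotonically_increasing_id())

# Pair each row with every earlier row (by id) and count how many columns agree.
pairs = df.alias("l").join(df.alias("r"), F.col("l.row_id") > F.col("r.row_id"))
n_match = sum(F.when(F.col(f"l.{c}") == F.col(f"r.{c}"), 1).otherwise(0) for c in cols)

# A row counts as a near-duplicate when it matches an earlier row on at least 4 of the 5 columns.
dupe_ids = (
    pairs.withColumn("n_match", n_match)
         .where(F.col("n_match") >= len(cols) - 1)
         .select(F.col("l.row_id").alias("row_id"))
         .distinct()
)

# Keep only rows that are not flagged as near-duplicates of an earlier row.
df.join(dupe_ids, "row_id", "left_anti").drop("row_id").show()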

Divide the dataframe into 3 buckets based on the sum of one column

I have a dataframe
c1 c2
user1 5
user2 3
user3 3
user4 1
I want to divide the dataframe into 3 equal groups based on the total sum of c2.
Total sum of c2 = 12, so 12 / 3 = 4 per group.
In this case user1 has value 5 (> 4), so it forms the 1st group; user2 and user3 (total 6 > 4) form the 2nd group; and all remaining rows go into the 3rd group.
So my expected dataframe is:
c1 c2 rank
user1 5 1
user2 3 2
user3 3 2
user4 1 3
I have tried with Window functions and a custom window function, but no luck so far.
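
This sketch is not from the original post. Because each bucket closes only once its own running total reaches the target, the split is inherently sequential, which is why plain window functions are awkward here. Assuming the data is small enough to collect to the driver and that rows are taken in descending c2 order as in the example, one possibility is:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("user1", 5), ("user2", 3), ("user3", 3), ("user4", 1)], ["c1", "c2"]
)

n_groups = 3
target = df.agg(F.sum("c2")).first()[0] / n_groups  # 12 / 3 = 4

# Assign buckets greedily on the driver: start a new bucket once the
# running total of the current bucket has reached the target.
assignments, bucket, running = [], 1, 0
for row in df.orderBy(F.desc("c2")).collect():
    if running >= target and bucket < n_groups:
        bucket += 1
        running = 0
    running += row["c2"]
    assignments.append((row["c1"], bucket))

ranks = spark.createDataFrame(assignments, ["c1", "rank"])
df.join(ranks, "c1").orderBy("rank").show()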

How to find matching keywords between two Hadoop tables with Spark?

I have two tables in HDFS. One table (Table-1) has some keywords, as you can see below. The other table (Table-2) has a text column. Every row in Table-2 could contain more than one keyword from Table-1. I need to find all the Table-1 keywords that match the text column in Table-2 and output the keyword list for every row of Table-2.
Example :
Table-1:
ID | Name | Age | City | Gender
---------------------------------
111 | Micheal | 19 | NY | male
222 | George | 23 | CA | male
333 | Linda | 22 | LA | female
Table-2:
Text_Description
------------------------------------------------------------------------
1-Linda and my cousin left the house.
2-Michael who is 19 year old, and George are going to rock concert in CA.
3-Shopping card is ready at the NY for male persons.
Output:
1- Linda
2- Micheal, 19, George, CA
3- NY, male
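
Not part of the original question, but a possible direction: a minimal PySpark sketch that unpivots the Table-1 cells into a single keyword column, cross joins the keywords with the Table-2 texts, and keeps plain substring matches. Note the assumptions: matching is exact substring matching (so the Micheal/Michael spelling difference in the example would not match), and the keyword table is small enough for a cross join.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

table1 = spark.createDataFrame(
    [(111, "Micheal", 19, "NY", "male"),
     (222, "George", 23, "CA", "male"),
     (333, "Linda", 22, "LA", "female")],
    ["ID", "Name", "Age", "City", "Gender"],
)
table2 = spark.createDataFrame(
    [("Linda and my cousin left the house.",),
     ("Michael who is 19 year old, and George are going to rock concert in CA.",),
     ("Shopping card is ready at the NY for male persons.",)],
    ["Text_Description"],
)

# Unpivot every keyword cell of Table-1 into one "keyword" column.
keyword_cols = ["Name", "Age", "City", "Gender"]
keywords = table1.select(
    F.explode(F.array(*[F.col(c).cast("string") for c in keyword_cols])).alias("keyword")
).distinct()

# Pair every text with every keyword, keep the keywords the text contains,
# then collect the matches per text row.
matches = (
    table2.crossJoin(keywords)
          .where(F.col("Text_Description").contains(F.col("keyword")))
          .groupBy("Text_Description")
          .agg(F.collect_list("keyword").alias("matched_keywords"))
)
matches.show(truncate=False)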

Select from table removing similar rows - PostgreSQL

There is a table with document revisions and authors. Looks like this:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 2 2016-01-01 03:40 Bill
123 3 2016-01-01 03:50 Bill
123 4 2016-01-01 04:10 Bill
123 5 2016-01-01 08:40 Alice
123 6 2016-01-01 08:41 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
942 15 2016-01-01 11:17 Bill
I need to find the moments when a document was handed over to another editor, i.e. only the first row of every consecutive editing series.
Like so:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 5 2016-01-01 08:40 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
If I use DISTINCT ON (doc_id, editor) it re-sorts the table and I only see one row per document and editor, which is incorrect.
Of course I could dump everything and filter with shell tools like awk | sort | uniq, but that is not good for big tables.
Window functions like first_value() do not help much either, because partitioning by doc_id, editor mixes the separate editing series together.
How to do better?
Thank you.
You can use lag() to get the previous value, and then a simple comparison:
select t.*
from (select t.*,
             lag(editor) over (partition by doc_id order by rev_date) as prev_editor
      from t
     ) t
where prev_editor is null or prev_editor <> editor;