Multilevel sorting of column data (string datatype) in Spark SQL - Scala

For example, let's consider three columns, namely empid, year of joining, and designation, in an employee table. Initially the table is sorted by joining date. If two people have joined on the same date, then the higher-priority designation must appear on top, with the lesser ones below.
How do I assign a priority to the designation for the already sorted data in a Spark SQL DataFrame?
For example, if a CEO and a project manager joined a company on the same date, then the CEO's details must appear above the project manager's when the DataFrame is viewed.

Let's say you have a DataFrame with 3 columns - empid, joiningYear, designation. Then you can do something like this to sort on multiple columns:
import org.apache.spark.sql.functions._
val sorted = df.sort(asc("joiningYear"), asc("designation"))
In this case, data will first be sorted on joiningYear, and for people with the same joiningYear it will then be sorted by designation.

Considering there are three columns, namely empid, year_of_joining, and designation, in the employee table: you can sort or orderBy both columns, but how will you order by designation if it holds values such as "CEO" and "projectManager"? Using designation within sort() will sort it in alphabetical order. So you must have increasing numbers denoting the designations based on seniority, and then simply use the code below.
import org.apache.spark.sql.functions._
val sortedEmp = df.sort(asc("year_of_joining"), desc("designation"))
Since you want higher-priority designations on top and the lesser ones below, you should use desc for designation.
Since there won't be many designations, you can assign increasing numbers to them based on seniority.
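As a minimal sketch of that idea (the sample rows and the seniority mapping below are hypothetical, not from the question), you could derive a numeric rank column with when/otherwise and sort on it:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("designation-sort").getOrCreate()
import spark.implicits._

// Hypothetical sample data using the question's column names.
val df = Seq(
  (1, 2015, "projectManager"),
  (2, 2015, "CEO"),
  (3, 2016, "developer")
).toDF("empid", "year_of_joining", "designation")

// Hypothetical seniority ranking: a higher number means a more senior designation.
val designationRank = when(col("designation") === "CEO", 3)
  .when(col("designation") === "projectManager", 2)
  .otherwise(1)

val sortedEmp = df
  .withColumn("designation_rank", designationRank)
  .sort(asc("year_of_joining"), desc("designation_rank"))

sortedEmp.show()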

Related

PySpark: filter a dataframe based on another dataframe containing distinct values

I have a df1, based on distinct values, containing two columns, date and value. There is a df2 that has multiple columns but also contains the date and value columns. For each distinct value from df1, I want to filter df2 such that the records before the date from df1 are dropped. It would be rather easy for a single distinct value; I could use something like filtering by value and then gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so if I use a loop it is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies but nothing has worked yet.
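One common way to avoid the per-value loop is a single join. The question is asked for PySpark, but the idea is the same across Spark APIs; here is a minimal sketch in Scala, assuming hypothetical column names date and value in both dataframes:
import org.apache.spark.sql.functions._

// Join df2 to the (value, cutoff date) pairs from df1, then keep only the rows
// whose date is on or after the cutoff for that value -- one join instead of
// looping over 500+ distinct values.
val cutoffs = df1.select(col("value"), col("date").as("cutoff_date"))

val filtered = df2
  .join(cutoffs, Seq("value"))
  .filter(col("date") >= col("cutoff_date"))
  .drop("cutoff_date")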

SQLite query not working when trying to run the same query in PostgreSQL

I have a database of users' purchases/sales of stocks and need to retrieve the data by summing all the shares for a specific user.
Ex: If I choose user_id = 7, it should return two rows, one for TESLA and one for Apple, with the sums of the shares.
Database:
SQL queries I've tried include:
SELECT name,symbol,name,price,total, SUM(shares)
FROM symbol
WHERE user_id=7
GROUP BY name,symbol,name,price,total
Returns three rows, but it should just be two.
Your issue is your grouping.
Consider the uniqueness of what you are grouping by: that is how many rows will be returned.
With your WHERE clause and grouping by name, symbol, name, price, total there are three unique rows.
Remove the price and total columns from the grouping and you will get your desired two rows, or include them in your query as summed columns.
eg.
SELECT name, symbol, SUM(shares)
FROM symbol
WHERE user_id = 7
GROUP BY name, symbol

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip, src_port, dst_ip, dst_port, and I also want to have the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, with their count and the latest received_time in a new column as last_seen.
I know how to use .withColumn, and I also think that .map() could be used here.
Since I'm relatively new to this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port, and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates, for each combination of the groupBy columns, the count of received timestamps as well as the maximum timestamp for that group.
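The question also asks for one row per distinct src_ip, with its count and the latest timestamp in a column named last_seen. Under the same column-name assumptions, a sketch (not the answerer's original code) would group on src_ip alone:
import org.apache.spark.sql.functions._

// One row per distinct src_ip, with the number of records seen for it and the
// most recent received_time exposed as last_seen, per the question's wording.
val perSrcIp = df
  .groupBy(col("src_ip"))
  .agg(
    count(lit(1)).as("count"),
    max(col("received_time")).as("last_seen")
  )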

SSRS Grouping Summary - with Max not working

This is the data that comes back from the database
Data Sample for one season (the report returns values for two):
What you can see are groupings by Season, Theater, then Performance number, and lastly we have the revenue and ticket columns.
The SSRS report has three levels of groupings: pkg (another ID that groups the rows below), venue -- the venue column, and perf_desc -- the description column linked to the perf_no.
Looks like this --
What I need to do is take the revenue column (a unique value) for each Performance and return it in a separate column -- so I use this formula:
sum(Max(Fields!perf_tix.Value, "perf_desc"))
This works great; it gives me the total unique value for each performance and sums them up at the pkg level.
The catch is when I need to pull the data out by season.
I created a separate column that looks like this:
It's yellow because it's invisible and is referenced elsewhere. The expression says: if the Season value is equal to the Parameter (the passed season value), then pull the sum of each of the tix values and sum them up. This also works great on the lower line - the line where the grouping exists for pkg - light blue in my case.
=iif(Fields!season.Value = Parameters!season.Value, Sum(Max(Fields!perf_tix.Value, "perf_desc")), 0)
However, the line above - the parent/header line - gives me the sum of both seasons' values, basically adding it all up. This is not what I want. Also, why is it doing this? The season value is not equal to the passed parameter for the second season, so why is it being added to the grouped value?
How do I fix this?
Since your aggregate function is inside your IIF function, only the first record in your dataset is being evaluated. If the first one matches the parameter, all records would be included.
This might work:
=Sum(Max(IIF(Fields!season.Value = Parameters!season.Value, Fields!perf_tix.Value, 0), "perf_desc"))
It might be better if your report was also grouping on the Venue; otherwise your count may include all values.

How to copy/move a column from one dataset to another dataset by matching variable

I have Dataset1, which has 25 columns, including a single column for individual ID. I have Dataset2 with 15 columns, which also has a column for individual ID. I am trying to move or copy one of the columns from Dataset1 into Dataset2, matched by individual, but the same individuals are not necessarily in both datasets. Is there an easy way to do this? I've tried playing around with dplyr, but I'm really new to R and haven't had any luck. I don't want to completely merge the datasets; I just want to add one column of data to the second dataset without losing information on individual ID.
Thanks!
An example:
library(dplyr)
library(nycflights13) # for the data used in this example
head(flights)
head(airlines)
left_join(airlines, flights %>% select(carrier, flight))
Your use case:
library(dplyr)
left_join(Dataset2, Dataset1 %>% select(`individual ID`, name_col_to_copy), by = "individual ID")