Calculate time between first and last observation, separated by subgroup, in Tableau - tableau-api

I have a dataset that has From/To values and an Id. Each Id has multiple From-To rows in sequence, and I want to calculate the time difference between the first observation and the last observation for each Id.
In R, keeping the first and last row per Id looks like this:
tmp2 <- tmp %>% group_by(Id) %>% slice(c(1,n())) %>% ungroup()
Any help would be highly appreciated.
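As a language-neutral illustration of the calculation being asked for, here is a minimal Python sketch. The sample rows and the names `rows`, `parse` and `span_per_id` are made up for illustration; only the Id/From/To idea comes from the question. It takes the earliest From and the latest To per Id and subtracts them.

```python
from datetime import datetime

# Hypothetical sample rows: (Id, From, To). Row order does not matter here,
# because the earliest From and latest To are found with min/max.
rows = [
    ("A", "2021-01-01 08:00", "2021-01-01 09:00"),
    ("A", "2021-01-01 09:30", "2021-01-01 11:00"),
    ("B", "2021-01-02 10:00", "2021-01-02 10:15"),
    ("B", "2021-01-02 12:00", "2021-01-02 12:45"),
    ("B", "2021-01-02 13:00", "2021-01-02 14:00"),
]


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM' string (assumed format) into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def span_per_id(rows):
    """Return {Id: latest To minus earliest From} as timedelta values."""
    bounds = {}
    for id_, start, end in rows:
        start, end = parse(start), parse(end)
        first, last = bounds.get(id_, (start, end))
        bounds[id_] = (min(first, start), max(last, end))
    return {id_: last - first for id_, (first, last) in bounds.items()}


spans = span_per_id(rows)
for id_, delta in spans.items():
    print(id_, delta)
```

In Tableau itself, one possible form of the same idea is a FIXED level-of-detail calculation such as DATEDIFF('minute', {FIXED [Id] : MIN([From])}, {FIXED [Id] : MAX([To])}), assuming the fields are named From and To and hold date-time values.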

Related

Creating a pivot table in PySpark

New to PySpark and would like to make a table that counts the unique pairs of values from two columns and shows the average of another column over all rows with those pairs of values. My code so far is:
df1 = df.withColumn('trip_rate', df.total_amount / df.trip_distance)
df1.groupBy('PULocationID', 'DOLocationID').count().orderBy('count', ascending=False).show()
I want to add the average of the trip rate for each unique pair as a column. Can you help me please?
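The aggregation being asked for (a row count plus the mean trip rate per pickup/drop-off pair) can be sketched in plain Python as below. The sample data and the name `pair_stats` are hypothetical, and skipping zero-distance rows is an assumption about how undefined rates should be handled. In PySpark, the usual way to get several aggregates at once is `groupBy(...).agg(...)` with `count` and `avg` from `pyspark.sql.functions`.

```python
from collections import defaultdict

# Hypothetical trips: (PULocationID, DOLocationID, total_amount, trip_distance)
trips = [
    (1, 2, 10.0, 2.0),
    (1, 2, 12.0, 4.0),
    (3, 4, 9.0, 3.0),
]


def pair_stats(trips):
    """Return (pair, count, mean trip_rate) tuples, most frequent pair first."""
    totals = defaultdict(lambda: [0, 0.0])  # pair -> [count, sum of trip_rate]
    for pu, do, amount, distance in trips:
        if distance == 0:
            continue  # assumption: rows with an undefined rate are skipped
        entry = totals[(pu, do)]
        entry[0] += 1
        entry[1] += amount / distance
    return sorted(
        ((pair, n, rate_sum / n) for pair, (n, rate_sum) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )


stats = pair_stats(trips)
for pair, n, mean_rate in stats:
    print(pair, n, mean_rate)
```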

Power BI - Timeseries compare two different start dates

I want to compare how different campaigns are progressing based on number of days into the campaign rather than by date (see day1, day2, etc... on the x-axis below).
Here is my DAX code, but I can't get it to work. Any help would be much appreciated...
Normalised Campaign Metrics =
VAR DateReached = CALCULATE(MIN(Days[Day]),db[PAYMENT_DATE]<> BLANK(), KEEPFILTERS(db[PRODUCT_CODE SWITCH]))
VAR MaxDate = CALCULATE(MAX(db[PAYMENT_DATE]),KEEPFILTERS(db[PRODUCT_CODE SWITCH]))
VAR DayNo = SELECTEDVALUE(Days[Day])
RETURN CALCULATE(count(db[PAYMENT_DATE]),
    FILTER(ALL(db[PAYMENT_DATE]),
        DateReached+DayNo && DateReached+DayNo<=MaxDate))
Many thanks!
I would recommend solving this through manipulating your actual data rather than a complex DAX measure. If you are familiar with star schema modelling, I would solve this problem by adding a new column to your fact table that calculates how many days from the start date the payment occurred and then connect this column to a new "Days Passed" dimension that is simply a list of numbers from 1 to however many days you need. Then, you can use this new dimension as the source data for your x axis and use a standard payment amount measure for your y axis.
I recommend creating a dimension table to serve as the relative basis for the comparison, connected through an inactive relationship. Here is a video about it:
https://youtu.be/knXFVf2ipro
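To make the "days passed" column idea concrete, here is a small Python sketch of the data preparation step. The campaign names, dates and the function `payments_by_day` are hypothetical, and treating the first payment date as the campaign start is an assumption; a real model might take the start date from a campaign table instead.

```python
from datetime import date

# Hypothetical fact rows: (campaign, payment_date)
payments = [
    ("spring", date(2021, 3, 1)),
    ("spring", date(2021, 3, 3)),
    ("spring", date(2021, 3, 3)),
    ("autumn", date(2021, 9, 10)),
    ("autumn", date(2021, 9, 11)),
]


def payments_by_day(payments):
    """Count payments per (campaign, days_passed), where day 1 is the start."""
    # Assumption: a campaign starts on the date of its first payment.
    starts = {}
    for campaign, day in payments:
        starts[campaign] = min(starts.get(campaign, day), day)

    counts = {}
    for campaign, day in payments:
        days_passed = (day - starts[campaign]).days + 1
        key = (campaign, days_passed)
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = payments_by_day(payments)
for (campaign, day_no), n in sorted(counts.items()):
    print(campaign, day_no, n)
```

With the day number stored per row, campaigns line up on a shared x-axis regardless of their calendar dates.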

na.approx and na.locf not behaving properly

I'm trying to calculate imputed values for a time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong. I can't figure out the problem; I've tried everything I could think of.
Our rules are:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
If values are missing in the middle of a time series, linear interpolation is used.
# load libraries for data manipulation (dplyr, tidyr) and imputation (zoo)
library(dplyr)
library(tidyr)
library(zoo)
# expand table to show NAs
output_table_imp = expand(output_table, transport_mode, year, country_code)
output_table_imp = full_join(output_table_imp, output_table)
# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>% # linear interpolation
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) %>% # missing values at the start of a time series (copy first non-NA value)
  ungroup()
My data frame consists of a couple of columns: transport_mode, country_code, year, fatalities. I'm not sure how I can share my data here? It's a large table with 3600 observations ...
These are the original numbers:
And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.
The data frame looks like this:
Your code looks overly complicated. I don't know the zoo details, but I'm pretty sure you could get it to work that way too.
With the imputeTS package you can pass in your whole data.frame (it assumes each column is a separate time series) and the package performs imputation for each of these series.
(Unfortunately your question includes no data, but I guess this would be your output_table_imp data.frame after expansion.)
Just like this:
library("imputeTS")
na_interpolation(output_table_imp, option = "linear")
We also don't have to change anything for the NA treatment at the beginning and at the end, since your requirements are the default behaviour of the na_interpolation function.
These were your requirements:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
Here is a toy example:
# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
# Perform linear interpolation
na_interpolation(test, option = "linear")
#Results
> 1 1 1 2 3 4 5 6 7 8 8 8
So, as you can see, this works perfectly fine.
It also works with a data.frame (as I said, each column is interpreted as a time series):
# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)
na_interpolation(df, option = "linear")
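For readers who want to see the three rules spelled out independently of any R package, here is a minimal Python sketch. The function names `impute` and `impute_by_group` and the sample rows are hypothetical. It assumes evenly spaced observations (interpolation is done on position, not on the year value), and it sorts each group by year first, because the result depends on row order within a group.

```python
def impute(values):
    """Fill None gaps: linear interpolation inside, nearest known value at edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # no known value to anchor on; leave as-is
    out = list(values)
    first, last = known[0], known[-1]
    for i in range(first):                      # leading gap: first known value
        out[i] = values[first]
    for i in range(last + 1, len(values)):      # trailing gap: last known value
        out[i] = values[last]
    for left, right in zip(known, known[1:]):   # interior gaps: straight line
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def impute_by_group(rows):
    """rows: (group_key, year, value). Returns {group_key: [(year, imputed)]}."""
    groups = {}
    for key, year, value in rows:
        groups.setdefault(key, []).append((year, value))
    result = {}
    for key, series in groups.items():
        series.sort(key=lambda item: item[0])   # order by year within the group
        years = [year for year, _ in series]
        result[key] = list(zip(years, impute([v for _, v in series])))
    return result


test = [None, None, 1, 2, 3, None, None, 6, 7, 8, None, None]
print(impute(test))
```

Grouping before imputing matters for a long-format table like the one in the question: without it, a gap at the end of one country's series would be filled using the next country's values.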

Combine Confidence Intervals and Odds Ratios + Adding Stars for P-Values in Gtsummary

Two simple questions here. First, is there a way to combine the confidence intervals and the odds ratios into a single column with gtsummary for the tbl_regression/tbl_uvregression functions? Second, is it possible to include stars to indicate the significance level for the p-values (i.e. *<.05, **<.01, ***<.005) so I do not have to include a p-value column whenever I make a table? Thanks
As of gtsummary v1.4.0, this is possible with the add_significance_stars() function.
library(gtsummary)
#> #Uighur
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  lm(marker ~ age + grade, trial) %>%
  tbl_regression() %>%
  add_significance_stars(
    pattern = "{estimate} ({conf.low}, {conf.high}){stars}",
    hide_se = TRUE
  ) %>%
  modify_header(estimate ~ "**Beta (95% CI)**")
Created on 2021-04-14 by the reprex package (v2.0.0)
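To show what the combined "estimate (low, high) plus stars" cell amounts to, here is a small Python sketch of the formatting logic. It is not how gtsummary implements it; the function names are hypothetical, and the thresholds are the ones given in the question (.05, .01, .005), which may differ from the package defaults.

```python
def stars(p_value, thresholds=(0.05, 0.01, 0.005)):
    """Return one star per threshold that the p-value is strictly below."""
    return "*" * sum(p_value < t for t in thresholds)


def format_estimate(estimate, conf_low, conf_high, p_value, digits=2):
    """Build a single cell such as '1.23 (0.50, 2.00)*'."""
    return (
        f"{estimate:.{digits}f} "
        f"({conf_low:.{digits}f}, {conf_high:.{digits}f})"
        f"{stars(p_value)}"
    )


print(format_estimate(1.234, 0.5, 2.0, 0.02))
print(format_estimate(0.75, 0.6, 0.9, 0.004))
print(format_estimate(1.1, 0.8, 1.5, 0.2))
```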

Kdb qstudio line plot

I have a table of the form
Timestamp, Symbol, Vol
and I would like to plot the aggregate daily volume per symbol in a line chart
select sum(Vol) by `date$Timestamp from Trades
gives me the plot for the daily volume. How can I get a line per symbol?
select sum(Vol) by `date$Timestamp, Symbol from Trades
gives me two lines: one for Vol and one constant line for the max of the symbols (symbols are int values).
And as a side question... How can I tell the plot to exclude missing dates in the time series or at least have values of 0 for those dates?
If you want to plot multiple lines, each line needs to be a separate column of your output table. That means you need to pivot your results table: https://code.kx.com/q/cookbook/pivoting-tables/
For example, something like this:
{P:exec distinct sym from x;exec P#(sym!size) by minute:minute from x}select sum size by sym,time.minute from lseTradeRT where sym in `AHT.L`BARC.L`BP.L`VOD.L
but in your case replace the time.minute with `date$Timestamp. You should also filter on only a handful of syms or else the graph is unmanageable.
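The shape of the pivot (one row per day, one column per symbol) can be illustrated outside q with the Python sketch below. The sample rows and the name `pivot_daily` are hypothetical. It also covers the side question by giving every calendar day between the first and last date a row, with 0 where a symbol has no volume.

```python
from datetime import date, timedelta

# Hypothetical aggregated rows: (day, symbol, total volume)
daily = [
    (date(2021, 4, 1), 1, 100),
    (date(2021, 4, 1), 2, 50),
    (date(2021, 4, 3), 1, 70),
]


def pivot_daily(daily):
    """Return (symbols, {day: {symbol: volume}}) with zero-filled gaps."""
    symbols = sorted({sym for _, sym, _ in daily})
    days = [day for day, _, _ in daily]
    table = {}
    day = min(days)
    while day <= max(days):
        table[day] = {sym: 0 for sym in symbols}  # start every cell at 0
        day += timedelta(days=1)
    for day, sym, vol in daily:
        table[day][sym] += vol
    return symbols, table


symbols, table = pivot_daily(daily)
for day, row in table.items():
    print(day, [row[sym] for sym in symbols])
```

Each symbol column then maps to one line in the chart, and the zero-filled days keep the x-axis continuous.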