na.approx and na.locf not behaving properly - imputation

I'm trying to calculate imputed values for a time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong. I can't figure out the problem, and I've tried everything I could think of.
Our rules are:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
If values are missing in the middle of a time series, linear interpolation is used.
# load libraries for imputation (zoo) and data manipulation (dplyr, tidyr)
library(zoo)
library(dplyr)
library(tidyr)
# expand table to show NAs
output_table_imp <- expand(output_table, transport_mode, year, country_code)
output_table_imp <- full_join(output_table_imp, output_table)
# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>% # linear interpolation
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) # missing values at the start of a time series (copy first non-NA value)
My data frame consists of a couple of columns: transport_mode, country_code, year, fatalities. I'm not sure how I can share my data here? It's a large table with 3600 observations ...
These are the original numbers:
And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.
The data frame looks like this:

Your code looks somewhat overly complicated. I don't know the zoo details, but I'm pretty sure you could get it to work with zoo as well.
With the imputeTS package you could just pass your whole data.frame (it assumes each column is a separate time series) and the package performs imputation for each of these series.
(Unfortunately your question includes no data, but I guess this would be your output_table_imp data.frame after expansion.)
Just like this:
library("imputeTS")
na_interpolation(output_table_imp, option = "linear")
We also don't have to change anything for NA treatment at the beginning and at the end, since your requirements are the default in the na_interpolation function.
These were your requirements:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
Here is a toy example:
# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
# Perform linear interpolation
na_interpolation(test, option = "linear")
#Results
> 1 1 1 2 3 4 5 6 7 8 8 8
As you can see, this works perfectly fine.
It also works with a data.frame (as I said, each column is interpreted as a time series):
# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)
na_interpolation(df, option = "linear")
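The three rules from the question (interpolate in the middle, copy the first known value backwards, copy the last known value forwards) can also be sketched independently of any package. The following is a minimal Python illustration, not the zoo or imputeTS implementation; the function names `impute` and `impute_by_group` and the row layout are assumptions made for this example. It applies the rules separately per (transport_mode, country_code) group and sorts each group by year first, since interpolation depends on row order.

```python
from collections import defaultdict


def impute(series):
    """Fill None values: linear interpolation inside, nearest known value at the edges."""
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    if not known:
        return out  # no known value to anchor on, leave the series unchanged
    first, last = known[0], known[-1]
    for i in range(first):  # leading gap: copy first known value
        out[i] = series[first]
    for i in range(last + 1, len(series)):  # trailing gap: copy last known value
        out[i] = series[last]
    for left, right in zip(known, known[1:]):  # gaps between two known values
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out


def impute_by_group(rows):
    """rows: (transport_mode, country_code, year, fatalities or None) tuples (hypothetical layout)."""
    groups = defaultdict(list)
    for mode, country, year, value in rows:
        groups[(mode, country)].append((year, value))
    result = {}
    for key, pairs in groups.items():
        pairs.sort(key=lambda p: p[0])  # order by year before imputing
        years = [year for year, _ in pairs]
        values = impute([value for _, value in pairs])
        result[key] = list(zip(years, values))
    return result


test = [None, None, 1, 2, 3, None, None, 6, 7, 8, None, None]
print(impute(test))
```

Run on the toy series from the answer, this reproduces the same filled values (1 1 1 2 3 4 5 6 7 8 8 8), with the interpolated entries as floats.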

Related

Kdb qstudio line plot

I have a table of the form
Timestamp, Symbol, Vol
and I would like to plot the aggregate daily volume per symbol in a line chart
select sum(Vol) by `date$Timestamp from Trades
gives me the plot for the daily volume. How can I get a line per symbol?
select sum(Vol) by `date$Timestamp, Symbol from Trades
Gives me two lines, one for Vol and one constant line for the max in symbols( symbols are int values)
And as a side question... How can I tell the plot to exclude missing dates in the time series or at least have values of 0 for those dates?
If you want to plot multiple lines, then each line to draw needs to be a separate column of your output table. That means you'd need to pivot your results table: https://code.kx.com/q/cookbook/pivoting-tables/
For example, something like this:
{P:exec distinct sym from x;exec P#(sym!size) by minute:minute from x}select sum size by sym,time.minute from lseTradeRT where sym in `AHT.L`BARC.L`BP.L`VOD.L
but in your case replace the time.minute with `date$Timestamp. You should also filter on only a handful of syms or else the graph is unmanageable.
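To make the shape of the pivoted table concrete, here is a small Python sketch of the same idea outside of q. It is only an illustration of the pivot, not kdb+ or qStudio code: the function name `pivot_daily_volume` and the sample trades are made up for this example. It sums the volume per day and symbol, emits one column per symbol, and writes 0 for days without trades, which also covers the side question about missing dates.

```python
from collections import defaultdict
from datetime import date, timedelta


def pivot_daily_volume(trades):
    """trades: iterable of (day, symbol, vol). Returns (symbols, rows), one column per symbol."""
    totals = defaultdict(int)
    for day, symbol, vol in trades:
        totals[(day, symbol)] += vol  # aggregate daily volume per symbol
    if not totals:
        return [], []
    symbols = sorted({symbol for _, symbol in totals})
    days = sorted({day for day, _ in totals})
    rows = []
    day = days[0]
    while day <= days[-1]:  # walk every calendar day so gaps show up as zeros
        rows.append((day, [totals.get((day, s), 0) for s in symbols]))
        day += timedelta(days=1)
    return symbols, rows


# Hypothetical trades; symbols are ints, as in the question
trades = [
    (date(2020, 1, 1), 1, 100),
    (date(2020, 1, 1), 2, 50),
    (date(2020, 1, 1), 1, 25),
    (date(2020, 1, 3), 2, 70),
]
symbols, rows = pivot_daily_volume(trades)
for day, volumes in rows:
    print(day, dict(zip(symbols, volumes)))
```

Each entry of `rows` corresponds to one x position in the chart, and each symbol column to one line.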

How can I speed up MATLAB nested for loop (timestamp correction)?

I have large data files with 10Hz resolution which have been split up into half hour files. Each half hour file should contain 18000 rows. However, this is not usually the case and there are gaps in the time stamps due to datalogging errors.
In order to be able to process this data further, I need uniform data files with exactly 18000 rows each. I have written a script in MATLAB which solves this problem by using a generated full timestamp for each half hour period. Where there are gaps in the original timestamp, I fill the rows with NaNs (except the timestamp column).
Using a simple example, here is the code:
Fs=1:10; % uniform timestamp
Fs=Fs'; % column vector
Ts=[1 3 7 2 1; 3 4 3 3 2;6 5 2 1 3]; % Ts is the original data with the timestamp in the first column
Corrected=zeros(length(Fs),6); % Corrected is the data after applying the uniform timestamp
for ii=1:length(Fs)
    for jj=1:length(Ts(:,1))
        if Fs(ii)==Ts(jj,1)
            Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
            break
        else
            Corrected(ii,:)=[Fs(ii), NaN*[ii,1:4]];
            continue
        end
    end
end
The code works well enough but when I apply it to the 10Hz data, it is extremely slow. Any ideas on how I can improve this code?
Note that in the actual code I compare date strings:
if sum(Fs_str(ii,:)==Ts_str(jj,:))==23
Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
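One common way to avoid the nested loop is to index the logged rows by timestamp once and then look up each expected timestamp directly, which takes one pass over each list instead of comparing every pair. The sketch below shows the idea in Python as an illustration only; the function name `align_to_grid` is made up, and unlike the MATLAB example it does not repeat the timestamp in the output row. In MATLAB a similar lookup can be built with ismember.

```python
import math


def align_to_grid(grid, records):
    """grid: every expected timestamp; records: rows whose first item is the timestamp."""
    width = len(records[0]) - 1 if records else 0
    by_stamp = {}
    for row in records:  # single pass over the logged data
        by_stamp.setdefault(row[0], row[1:])  # keep the first row for a duplicate timestamp
    gap = [math.nan] * width  # filler for timestamps missing from the log
    return [[stamp, *by_stamp.get(stamp, gap)] for stamp in grid]


Fs = list(range(1, 11))  # uniform timestamp
Ts = [[1, 3, 7, 2, 1], [3, 4, 3, 3, 2], [6, 5, 2, 1, 3]]  # logged data, timestamp first
for row in align_to_grid(Fs, Ts):
    print(row)
```

With date strings as timestamps the same approach works unchanged, because strings can be used as dictionary keys.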

How to draw time based graphs using ios-charts

I'm trying to draw a temperature graph using ios-charts where the x-axis data would be set from a server timestamp but the labels would be readable text.
For instance, the graph's x-axis labels would start at Monday 00:00 and end at Tuesday 12pm, but the LineChartDataSet would be a collection of temperatures (y-axis) and timestamps (x-axis).
To display the timestamp I have a custom valueFormatter set as follows (which works great):
lineChartView.xAxis.valueFormatter = timestampXAxisFormatter() //converts timestamp to Date string
My question: The LineChartDataSet seems to be index based, which is causing some trouble. If I have 4 data points such as (9am, 10), (9:15am, 11), (12pm, 15), (1pm, 16), the 4 points are placed in the chart at regular intervals (I was expecting 2 points to be on the left side of the graph and the last 2 points on the right side). Is there a way to have a data set that is based on the x value instead of the index?
I saw ChartData has an init that takes an array of NSObjects but then it converts it to Strings...Thanks in advance for any suggestions you may have!
There is no good way to solve it; as you figured out, the x axis is index based.
You have two options:
1. Insert many x values between each pair of real x values. For example, between 9:00 and 9:15 you manually insert 9:01, 9:02, ..., 9:14, but don't add any entry at these values. ios-charts will skip an x value if no entry is found and go on to the next. This works fine if you don't have a large number of values to insert. I tried ~1000 values and the performance was acceptable.
2. Create your own chart using two y axes, one as the x axis and one as the y axis, so that the distances to the 0 point are calculated by value. However, this requires you to understand the ios-charts logic deeply. If you succeed, you are more than welcome to file a PR.
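The first option boils down to computing, for each real data point, its slot on a regular one-minute grid. The sketch below shows only that bookkeeping, in Python for illustration; it does not use the ios-charts API, and the function name `minute_axis` and the sample date are assumptions for this example. The resulting labels would become the x values and each (index, value) pair one chart entry.

```python
from datetime import datetime, timedelta


def minute_axis(points):
    """points: list of (datetime, value), assumed sorted by time.

    Returns (labels, entries): one label per minute and (index, value) pairs.
    """
    start, end = points[0][0], points[-1][0]
    slots = int((end - start).total_seconds() // 60) + 1
    labels = [(start + timedelta(minutes=i)).strftime("%H:%M") for i in range(slots)]
    entries = [(int((t - start).total_seconds() // 60), v) for t, v in points]
    return labels, entries


# The four points from the question, on an arbitrary example day
points = [
    (datetime(2016, 1, 4, 9, 0), 10),
    (datetime(2016, 1, 4, 9, 15), 11),
    (datetime(2016, 1, 4, 12, 0), 15),
    (datetime(2016, 1, 4, 13, 0), 16),
]
labels, entries = minute_axis(points)
print(len(labels), entries)
```

For these points the grid has 241 slots and the entries land at indices 0, 15, 180 and 240, so the first two points sit close together on the left and the last two on the right.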

Tableau 8.2 - how to get max and min from % difference values on table?

I'm facing problem in getting the max % and min % from a table containing % difference values.
Year   A       B       C       D       Max %   Max Type   Min %   Min Type
2012
2013   4.30%   4.42%   4.34%   4.38%   4.42%   B          4.30%   A
The table above shows the % difference in sales from previous year. Thus 2012 shows no % (because there's no 2011). I used table calculation to compute the % difference, i.e. "Percent Difference From", compute using "Table (Down)" and "Previous".
The last four columns are what I'm having trouble doing. I want to get the max % and min % and also the corresponding types. I'm not trying to add the four columns to the existing table, but to get the correct results, as my ultimate goal is to display that results on the dashboard, i.e. on my dashboard, I want to display the highest % and its corresponding type; similarly the lowest % and its corresponding type. For example: on my dashboard, I want to display:
Highest % and type: 4.42% B
Lowest % and type: 4.30% A
So, I need to have the correct formulas to get the max % and min % and their types. These are what I did:
I tried to use WINDOW_MAX and WINDOW_MIN to display the max % and min % on the table but got wrong results.
1) I first took the formula for calculating the % difference from the "Customize" button in the "Edit Table Calculation" window of SUM([Sales]): (ZN(SUM([Sales])) - LOOKUP(ZN(SUM([Sales])), -1)) / ABS(LOOKUP(ZN(SUM([Sales])), -1))
Then I created a calculated field from the above formula. I named the calculated field "Percent-Diff".
2) I created another calculated field (named "Max % Difference") using the formula: WINDOW_MAX([Percent-Diff]). But it shows strange results. See image below. I don't know why it gives me 2.78% and 2.91% for 2012 and 2013 respectively. It should be 0% and 4.42% for 2012 and 2013 respectively. Something is not correct.
If it is just SUM([Sales]) instead of % difference, then I get the correct result of showing the max sales using the formula WINDOW_MAX(SUM([Sales])).
3) Also I don't know how to get the corresponding type. I tried using the formula: IF [Max % Difference] = [Percent-Diff] THEN ATTR([Product Type]). But it returns:
NULL
B
I'm not sure if the formula is correct. The result looks correct (i.e. "B" is correct), except that it also shows a NULL value and I don't know why. Is it because I didn't include the ELSE part in my IF formula? But why is the NULL value shown as the first value? I want the formula to return just one value, "B". So, how do I show only "B"?
I've posted the problem twice in the Tableau forum, but so far nobody has answered. I believe that my formulas are incorrect. So, if anyone here can correct the formulas to get the max % and min % from the % difference values and also to get the corresponding type, it'd be very much appreciated. Thanks a million!
It's hard to tell without knowing what your database looks like (you didn't present it explicitly, so I can only infer from the clues in your post). But I could reproduce something like what you described using the Sample - Coffee Chain database, and it worked out well: I calculated the year-over-year sales increase by product and then the window_max of that.
What you're probably missing is the partitioning. I suggest avoiding using Table or Pane to create the partitions in more complex situations (as it will work only in that specific arrangement of fields), but rather use the dimensions to partition it.
So, your [Percent-Diff] field should be computed using [Date], and your [Max % Difference] should be computed using [Product Type]. IMPORTANT: for [Max % Difference], when you go to Edit Table Calculation, you'll have to choose the Compute Using setting for [Percent-Diff] as well (you can choose it at the top of the window).
Your formula to find which type is the max (or min) is also correct (and should only respect the partitions). Nevertheless, it is very hard to have the exact output you're expecting.
What I would do is create 2 worksheets (and later combine them in a dashboard).
The first would be what you already have (each product's [Percent-Diff]).
For the second one, I would change your formula (3) to just [Max % Difference] = [Percent-Diff] and use it as a filter (keeping only True). I would drag both Date and Product to the sheet (you choose whether you want them on columns, rows, or just detail) so I can use them to partition the table, and drag [Max % Difference] to be visualized.
That way you'll only see the product that is the max, and how much is that max.
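The logic that the two calculated fields are meant to express can be written out in a few lines, which may help when checking whether the partitioning in Tableau gives the numbers you expect. This is a plain Python sketch, not Tableau syntax: the function names `percent_diff` and `extremes` are made up for illustration, and the percentages are the 2013 row from the question's table.

```python
def percent_diff(previous, current):
    """(current - previous) / |previous|; None when there is no usable previous value."""
    if previous is None or previous == 0:
        return None
    return (current - previous) / abs(previous)


def extremes(diffs_by_type):
    """diffs_by_type: {product type: percent difference or None} for one year.

    Returns ((max_type, max_value), (min_type, min_value)), or None if nothing is known.
    """
    known = {k: v for k, v in diffs_by_type.items() if v is not None}
    if not known:
        return None  # e.g. the first year, where no previous year exists
    hi = max(known, key=known.get)
    lo = min(known, key=known.get)
    return (hi, known[hi]), (lo, known[lo])


diffs_2013 = {"A": 0.0430, "B": 0.0442, "C": 0.0434, "D": 0.0438}
(max_type, max_value), (min_type, min_value) = extremes(diffs_2013)
print(f"Highest % and type: {max_value:.2%} {max_type}")
print(f"Lowest % and type: {min_value:.2%} {min_type}")
```

The key point is that the maximum is taken across product types within one year, which is the partitioning that [Max % Difference] needs.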
Hope it helps

MATLAB Loop Programming

I've been stuck on a MATLAB coding problem where I needed to create market weights for many stocks from a large data file with multiple days and portfolios.
I received help from an expert the other day, who used nested loops. It worked, but I don't understand what he has done in the final line. I was wondering if anyone could shed some light on it and explain the last line of code.
xp = x; % x holds the market values
dates = unique(x(:,1)); % finds the unique dates in the data set (dates are in column 1)
for i = 1:size(dates,1) % loops over the unique dates
    for j = 5:size(xp,2) % loops over columns 5 to the last
        xp(xp(:,1)==dates(i),j) = xp(xp(:,1)==dates(i),j)./sum(xp(xp(:,1)==dates(i),j)); % <-- this is the line I don't understand
    end
end
Any comments are much appreciated!
To understand the code, you have to understand the colon operator, logical indexing and the difference between / and ./. If any of these is unclear, please look it up in the documentation.
The following code does the same, but is easier to read because I separated each step into a single line:
xp=x; % work on a copy of the data
dates=unique(x(:,1));
% iterates over all dates. size(dates,1) returns the number of dates
for i=1:size(dates,1)
    % iterates over the fifth to last column, which contain the data that will be normalised
    for j=5:size(xp,2)
        % mdate is a logical vector, which is used to select the rows with the currently processed date
        mdate=(xp(:,1)==dates(i));
        % s is the sum of all values in column j that have the same date
        s=sum(xp(mdate,j));
        % divide all values in column j with the same date by s, so that they sum to 1
        xp(mdate,j)=xp(mdate,j)./s;
    end
end
With this code, I suggest using the debugger and stepping through it line by line.
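For readers who find the logical indexing hard to follow, the same normalisation can be written with explicit grouping. The following Python sketch is only an illustration of what the loop computes, not a translation of the expert's code; the function name `normalise_by_date` and the sample rows are made up, and columns are 0-based here, so MATLAB's column 5 is index 4.

```python
from collections import defaultdict


def normalise_by_date(rows, first_value_col=4):
    """rows: list of lists with the date in column 0 and values from first_value_col onward.

    Each value is divided by the sum of its column over all rows sharing the same date.
    Assumes every such sum is non-zero.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for row in rows:  # first pass: column sums per date
        for j in range(first_value_col, len(row)):
            sums[row[0]][j] += row[j]
    return [  # second pass: divide each value by its date's column sum
        row[:first_value_col]
        + [row[j] / sums[row[0]][j] for j in range(first_value_col, len(row))]
        for row in rows
    ]


# Hypothetical data: date, three unused columns, then two market-value columns
rows = [
    [1, 0, 0, 0, 2.0, 1.0],
    [1, 0, 0, 0, 6.0, 3.0],
    [2, 0, 0, 0, 5.0, 4.0],
]
for row in normalise_by_date(rows):
    print(row)
```

After normalisation the values of each column add up to 1 within every date, which is what makes them usable as market weights.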