Matlab average number of customers during a single day - matlab

I'm having problems creating a graph of the average number of people inside a 24h shopping complex. I have two columns of data on a spreadsheet of the times a customer comes in (intime) and when he leaves (outtime). The data spans a couple of years and is in datetime format (dd-mm-yyyy hh:mm:ss).
I want to make a graph of the data with time of day as x-axis, and average number of people as y-axis. So the graph would display the average number of people inside during the day.
Problems arise because the place is open 24h and the timespan of data is years. Also customer intime & outtime might be on different days.
Example:
intime 2.1.2017 21:50
outtime 3.1.2017 8:31
Any idea how to display the data easily using Matlab?
Been on this for multiple hours without any progress...

Seems like you need to decide what defines a customer being in the shop during the day, is 1 min enough? is there a minimum time length under which you don't want to count it as a visit?
In the former case you shouldn't be concerned with the hours at all, and just count it as 1 entry if the entry and exit are in the same day or as 2 different entries if not.
It's been a couple of years since I coded actively in matlab and I don't have a handy IDE but if you add the code you got so far, I can fix it for you.

I think you need to start by just plotting the raw count of people in the complex at the given times. Once that is visualized it may help you determine how you want to define "average people per day" and how to go about calculating it. Does that mean average at a given time or total "ins" per day? Ex. 100 people enter the complex in a day ... but on average there are only 5 in the complex at a given time. Which stat is more important? Maybe you want both.
Here is an example of how to get the raw plot of # of people at any given time. I simulated your in & out time with random numbers.
inTime = cumsum(rand(100,1)); %They show up randomly
outTime = inTime + rand(100,1) + 0.25; % Stay for 0.25 to 1.25 hrs
inCount = ones(size(inTime)); %Add one for each entry
outCount = ones(size(outTime))*-1; %Subtract one for each exit.
allTime = [inTime; outTime]; %Stick them together.
allCount = [inCount; outCount];
[allTime, idx] = sort(allTime);%Sort the timestamps
allCount = allCount(idx); %Sort counts by the timestamps
allCount = cumsum(allCount); %total at any given time.
plot(allTime,allCount);%total at any given time.
Note that the x-values are not uniformly spaced.
IF you decide are more interested in total customers per day then you could just find the intTimes with in a given time range (each day) & probably just ignore the outTimes all together.

Related

Why are my values multiplying when I apply Month/Year to my values?

When I apply Month/Year to Cases or Deaths from my data, the values explode. For Cases it goes from approximately 48 million to over 1 billion, and for Deaths it goes from about 700 thousand to over 22 million. However, when I try the same thing with Initial Claims or the Stringency Index, my values remain correct. I'm trying to find the month over month percentage change by the way. And I'm using the Date column. I only select 2020 and 2021 in the filter for Year.
What I'm asking about is Sheet 21.
Link to workbook: https://public.tableau.com/app/profile/nilajah.rivers/viz/CoronaVirusProject_16323687296770/Sheet21
Your problem is that the data points are daily cumulative deaths. If you change the date aggregation to anything other than days, Tableau will default to summing the numbers for all the days in the month. This will give the wrong result, obviously.
If you want to show the correct total deaths or cases regardless of the time aggregation (months, days, weeks etc.) then you could use the New Case or New Death numbers plus a running sum table calculation. This will always give the correct total for the time period.
Table calculations will also allow automatic calculation of the period to period % change from the same data fields.
This is a common problem when working with datasets that offer pre-calculated aggregations. Tableau doesn't need that as it can dynamically calculate the aggregation of a field over any given time period but it is easy to forget which field has pre-aggregated data and which has raw data. Pre-aggregated fields assume a particular time period and can't be used for different time periods without disentangling that assumption (which is unnecessary if you also have the raw data (in this case daily new deaths/cases).

Creating a calculated field with ratio of number of complaints to population in Tableau?

Count of complaints per State by Year
I have the count of complaints per state by year in the first picture. I have another dataset which has the population of each state by year. How do I account for population and present only the ratios of count of complaints to population?
Snapshot of first dataset
Snapshot of second dataset
My thought was to create a calculated field to create a ratio but I'm having trouble with adding up the number of complaint counts within a certain year and then dividing by population year. How do I write the formula that only counts complaints within 2011, 2012, etc and dividing it by that population year?
Let me know if there's an easier way to do it as well, thanks for your time.
Edit 1:
Second dataset Pivoted
Population & Complaint Count
I've pivoted my second dataset and now I'm trying to graph both the counts of population and complaints. The population count across the years increases but the count of complaints stay exactly the same; its the sum of all complaints for that particular state for all the years.
Also, when I graph population count with 'Date Received' from the first dataset, I get the total population count across all the years instead of that particular year, like so:
Population per year
How do I properly 'blend' in the two date variables so that it works with both population count and complaint counts in both datasets?
Edits 2:
Blended Year
I changed the [Years] datatype in data source 2 into a date to match the date type of [Date Received] in data source 1. I also took only the 'year' parts because it would only count things on 1/1 of each year if I used [Years] in data source 2.
Now the graphs look similar except when I'm using [Years] instead of [Date Received], all the values are about several thousand off. I tried adding another relationship except this time for month again and then it only counted values for that month.
How do I account for the discrepancy and make [Years] work just like [Date Received] ?
Reshape data source 2, following these instructions.
Then you'll be able to blend the 2 data sources on State and Month and Year.

Can I calculate a moving sum on a field in InfluxDB?

I'm trying to understand if it's possible to calculate a 1 month sum of revenue data in one of my measurements. For each day, I would like the sum of the previous 30 days.
Is this possible in InfluxDB or through Grafana's query interface?
A moving average is a moving sum, divided by the number of samples. So if you want a moving sum of the past 30 values:
select 30*moving_average(field_name, 30) from measurement
Edited to add:
As Peter Halicky points out in the comments, this is is not the past 30 days. It's the past 30 data points.
If you will always have data for every single day, it's not an issue.
If you're missing a day's data, you'll still get a 30-sample average, but it'll stretch over 31 days instead of 30.
If you don't actually care about the calendar, but want to know the past 30 days of activity, this is not a problem.
If it is a problem, there are a few work-arounds. One that's probably trickier than it sounds: ensure that there is always an entry for each day.
A more robust way is to have the reporting app do this in two steps. Something like this (haven't worked out all the details, but you get the idea):
find the number of data points in the past 30 days, using a query like select count(field_name) from measurement where time > now() - 30d.
Use this number (call it n) to form the query: select n*moving_average(field_name, n) from measurement where time > now - 30d.
Yes, definitely it's possible.
Just set this part of your query like this:
SELECT sum("value") FROM "YOUR_TAG_NAME"
WHERE $timeFilter GROUP BY time(30d) fill(null)
Just make sure that your dashboard time included Last 30 days (at least).

Find and Rank Time Series MATLAB

I know there must be a simple way that I can learn to do this but I cannot imagine how to start. I am tasked with finding a top 10 matching daily wind power time series in a 30-day plus/minus window from the first day in the time series (Jan 1st) matching a single daily wind power time series and it is out of my level of experience in MATLAB. I have successfully done this matching a single time series of the current year with the exact calendar days from previous years, but I need a more robust searching method to find the best correlated time series in a +/- window of time. For example, I'm comparing a 120 day time series (without leap years) with 25 previous years during the same 120-day period (Jan-Apr). The end result will show me the top 10 time series with the years and Julian day or cumulative day listed and a correlation or RMSE value associated with it. My data looks like this arranged in a 365 (days) X 25 (years) array and I thank you very much for your help!
1182573 470528 1638232 2105034 1070466 478257 1096999
879997 715531 1111498 1004556 1894202 1372178 1707984
636173 937769 2119436 742710 1625931 1275567 1228515
967360 1103082 2218855 1643898 1822868 554769 1325642

significant differences between means

Considering the picture below
each values X could be identified by the indeces X_g_s_d_h
g = group g=[1:5]
s = subject number (variable for each g)
d = day number (variable for each s)
h = hour h=[1:24]
so X_1_3_4_12 means that the value X is referred to the
12th hour
of 4th day
of 3rd subject
of group 1
First I calculate the mean (hour by hour) over all the days of each subject. Doing that the index d disappear and each subject is represented by a vector containing 24 values.
X_g_s_h will be the mean over the days of a subject.
Then I calculate the mean (subject by subject) of all the subjects belonging to the same group resulting in X_g_h. Each group is represented by 1 vector of 24 values
Then I calculate the mean over the hours for each group resulting in X_g. Each group now is represented by 1 single value
I would like to see if the means X_g are significantly different between the groups.
Can you tell me what is the proper way?
ps
The number of subjects per group is different and it is also different the number of days for each subject. I have more than 2 groups
Thanks
Ok so I am posting an answer to summarize some of the problems you may have.
Same subjects in both groups
Not averaging:
1-First if we assume that you have only one measure that is repeated every hour for a certain amount of days, that is independent on which day you pick and each hour, then you can reshape your matrix into one column for each subject, per group and perform a ttest with repetitive measures.
2-If you cannot assume that your measure is independent on the hour, but is in day (lets say the concentration of a drug after administration that completely vanish before your next day measure), then you can make a ttest with repetitive measures for each hour (N hours), having a total of N tests.
3-If you cannot assume that your measure is independent on the day, but is in hour (lets say a measure for menstrual cycle, which we will assume stable at each day but vary between days), then you can make a ttest with repetitive measures for each day (M days), having a total of M tests.
4-If you cannot assume that your measure is independent on the day and hour, then you can make a ttest with repetitive measures for each day and hour, having a total of NXM tests.
Averaging:
In the cases where you cannot assume independence you can average the dependent variables, therefore removing the variance but also lowering you statistical power and interpretation.
In case 2, you can average the hours to have a mean concentration and perform a ttest with repetitive measures, therefore having only 1 test. Here you lost the information how it changed from hour 1 to N, and just tested whether the mean concentration between groups within the tested hours is different.
In case 3, you can average both hour and day, and test if for example the mean estrogen is higher in one group than in another, therefore having only 1 test. Again you lost information how it changed between the different days.
In case 4, you can average both hour and day, therefore having only 1 test. Again you lost information how it changed between the different hours and days.
NOT same subjects in both groups
Paired tests are not possible. Follow the same ideology but perform an unpaired test.
You need to perform a statistical test for the null hypothesis H0 that the data in different groups comes from independent random samples from distributions with equal means. It's better to avoid sequential 'mean' operation, but just to regroup data on g. If you assume normality and independence of observations (as pointed out by #ASantosRibeiro below), that you can perform ttest (http://www.mathworks.nl/help/stats/ttest2.html)
clear all;
X = randn(6,5,4,3); %dummy data in g_s_d_h format
Y = reshape(X,5*4*3,6); %reshape data per group
h = zeros(6,6);
for i = 1 : 6
for j = 1 : 6
h(i,j)=ttest2(Y(:,i),Y(:,j));
end
end
If you want to take into account the different weights of the observations, you need to calculate t-value yourself (e.g., see here http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_ttest_a0000000126.htm)