Reading and analysing a .txt file in MATLAB

I'm new to MATLAB and am having some trouble with this example.
The Colorado River Drainage Basin covers parts of seven western states. A series of dams has been constructed on the Colorado River and its tributaries to store runoff water and to generate low-cost hydroelectric power. The ability to regulate the flow of water has made the growth of agriculture and population in these arid desert states possible. Even during periods of extended drought, a steady, reliable source of water and electricity has been available to the basin states. Lake Powell is one of these reservoirs; lake_powell.txt contains data on the water level in the reservoir for the eight years 2000 to 2007.
a) Use nested for loops to read one water level value at a time into the lake_powell matrix.
lake_powell(month,year) = fscanf(fileID, '%f', 1);
Print the lake_powell matrix with title and year column headings.
b) Use mean to determine the average elevation of the water level for each year and the overall average for the eight-year period over which the data were collected.
c) Use find and length to determine how many months of each year exceed the overall average for the eight-year period.
d) Create a report that lists the month (number) and the year for each of the months that exceed the overall average. For example, June is month 6. Use find.
e) Determine and print the average elevation of the water for each month for the eight-year period. Use mean.
f) Plot the water level values in lake_powell using
date=2000:1/12:2008-1/12;
plot(date,lake_powell(:))
xlabel('Year')
ylabel('Water level, ft')

It sounds like you should be using textscan instead of fscanf.
textscan reads in a delimited file line by line, where each line has a consistent format.
Read the documentation for textscan and you should have your solution.
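For example, a minimal sketch, assuming lake_powell.txt has one header line of year labels followed by 12 rows of monthly values in 8 whitespace-delimited columns (adjust the format string and the 'HeaderLines' count to the actual file layout):

fileID = fopen('lake_powell.txt', 'r');
C = textscan(fileID, '%f %f %f %f %f %f %f %f', 'HeaderLines', 1);  % one cell per year column
fclose(fileID);
lake_powell = [C{:}];   % 12-by-8 matrix: rows = months, columns = years 2000-2007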

Related

PERMANOVA - small and unequal sample sizes

I am comparing fish communities at 2 sites (upstream vs downstream) with data collected in two seasons (wet and dry) over several years (2017-2022), with data from the first pair of wet and dry seasons representing the period before a treatment and subsequent data representing periods after the treatment. During each season I sampled each site four times, and recorded abundance of each fish species from each site. I did not conduct sampling during the last dry season due to resource constraints. The community compositions of the two sites over different seasons and periods are visualised in the NMDS biplots.
I am trying to do further analysis using PERMANOVA to look for any spatial-temporal changes in the fish communities, mainly whether the two groups are becoming more similar in the years following the treatment. As there are samples with no fish recorded, I have to remove those samples from the dataset, which means I have only three instead of four replicates in some of the site x season x year groups.
My question is, does it still make sense to use PERMANOVA if I have unequal sample size among groups, given the number of replicates from each are small (3-4)? I am planning to run the test separately for wet and dry seasons, but that means I still need to do two-way (site x year) PERMANOVAs for each of the seasons.
I learned from some online discussions that unequal sample sizes can be a problem for two- (or more-) way PERMANOVAs, and that the problem is more pronounced when the sample sizes are small. I would be grateful for any comments or insight on this. Thanks a tonne!

Timeplot to show the sum of model units

I am working on a system dynamics model in AnyLogic whose time units are days. The model tracks the daily demand for water over 10,950 days (30 years). One of the model's outputs is a timeplot that keeps track of this demand, but I don't want it to plot the daily demand. Instead, I want the timeplot to show demand per year (i.e. each year's 365 daily values summed, for each of the 30 years). I am having a bit of trouble finding a way to do this and would appreciate any help. Thank you!
I assume your problem is twofold.
How to get the time plot to display 30 years of data
How to sum the annual demand for 30 years
Here is a simple example that I believe answers your question.
In this simple model, there is a daily event that simulates the daily demand and adds it to a variable called annualDemand.
There is another event that runs yearly, takes the annualDemand, saves it to a dataset, and resets the annual demand accumulator to 0.
In your time plot, you simply display the dataset, which at the end of the run will contain only 30 entries, one for each year.
By following the same principles:
save the annual demand in an accumulator variable,
use a yearly event to add the annual demand to a dataset and reset the accumulator,
use a time plot to display the dataset,
you should be able to get what you need.
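As a rough sketch, the action code of the two events could look like the following (annualDemand, dailyDemand and the dataset name annualDemandDS are assumed names; replace them with your own model elements):

// Daily event action: accumulate today's demand in the running total
annualDemand += dailyDemand;

// Yearly event action: record the year's total and reset the accumulator
annualDemandDS.add(time(), annualDemand);   // x = current model time, y = annual total
annualDemand = 0;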

Cyclic transformation of dates

I would like to use the day of the year in a machine learning model. As the day of the year is not continuous (day 365 of 2019 is followed by day 1 of 2020), I am thinking of performing a cyclic (sine or cosine) transformation, following this link.
However, within a year the values of the new transformed variable are not unique; for example, there are two values of 0.5 in the same year, see the figures below.
I need to be able to use the day of the year in model training and also in prediction. A sine value of 0.5 can correspond to either 31.01.2019 or 31.05.2019, so using the value 0.5 can be confusing for the model.
Is it possible to make the model differentiate between the two occurrences of 0.5 within the same year?
I am modelling the distribution of a species using Maxent software. The species data are continuous, with daily records over 20 years. I need the model to capture the signal of the day or the season, without using either of them explicitly as a categorical variable.
Thanks
EDIT1
Based on furcifer's answer below: I find the incremental modelling approach not useful for my application. It solves the issue of a consistent difference between subsequent days, e.g. 30.12.2018, 31.12.2018, and 01.01.2019, but it is no different from counting the number of days from a certain reference day (weight = 1). Having much higher values on the same date in 2019 than in 2014 does not make ecological sense. I hope that interannual changes will be captured by the daily environmental conditions used (the explanatory variables). The reason I need to use the day in the model is to capture the seasonal trend in the distribution of a migratory species, without the explicit use of month or season as a categorical variable. To predict suitable habitats for today, the prediction needs to depend not only on today's environmental conditions but also on the day of the year.
This is a common problem, but I'm not sure if there is a perfect solution. One thing I would note is that there are two things that you might want to model with your date variable:
Seasonal effects
Season-independent trends and autocorrelation
For seasonal effects, the cyclic transformation is sometimes used for linear models, but I don't see the sense for ML models - with enough data, you would expect a nice connection at the edges, so what's the problem? I think the posts you link to are a distraction, or at least they do not properly explain why and when a cyclic transformation is useful. I would just use dYear to model the seasonal effect.
However, the discontinuity might be a problem for modelling trends / autocorrelation / variation in the time series that is not seasonal, or common between years. For that reason, I would add an absolute date to the model, so use
y = dYear + dAbsolute + otherPredictors
A well-tuned ML model should be able to do the rest, with the usual caveats, and if you have enough data.
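For illustration, a minimal pandas sketch of how the two predictors could be constructed (the column name date_time and the reference date are assumptions; adapt them to your data):

import pandas as pd

df = pd.DataFrame({'date_time': pd.date_range('2000-01-01', '2019-12-31', freq='D')})

# Seasonal predictor: day of the year (1-366)
df['dYear'] = df['date_time'].dt.dayofyear

# Trend predictor: days elapsed since an arbitrary reference date
df['dAbsolute'] = (df['date_time'] - pd.Timestamp('2000-01-01')).dt.days

# Both columns are then used alongside the other predictors,
# i.e. y = dYear + dAbsolute + otherPredictors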
This may not be the right choice depending on your needs; there are two options that come to my mind.
Incremental modeling
In this case, the dates are modelled in a linear fashion, so, say, 12 Dec 2018 < 12 Dec 2019.
For this you just need some form of transformation function that converts dates to numeric values.
As there are many dates that need to be converted to a numeric representation, the first thing to make sure is that the output preserves the date order, as Lukas mentioned. The easiest way to do this is by weighting each unit (weight_year > weight_month > weight_day).
def date2num(date_time):
    # Expects 'dd-mm-yyyy'. Each weight must exceed the maximum value of the
    # lower-order units (day < 100, month*100 < 10000) so that the numeric
    # order matches the date order.
    d, m, y = date_time.split('-')
    return int(y) * 10000 + int(m) * 100 + int(d)
Now, it's important to normalize the numeric values.
import numpy as np

date_features = []
for d in list(df['date_time']):
    date_features.append(date2num(d))

date_features = np.array(date_features)
# Min-max normalisation to the [0, 1] range
date_features_normalized = (date_features - np.min(date_features)) / (np.max(date_features) - np.min(date_features))
Using the day, month, and year as separate features. Instead of considering the date as a whole, we split it up. The motivation is that there may be a relationship between the output and a specific day, month, etc.; for example, the output may suddenly increase in the summer season (specific months) or on weekends (specific days).
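A minimal pandas sketch of that segregation (again assuming a date_time column in dd-mm-yyyy format, as in the snippet above):

import pandas as pd

dates = pd.to_datetime(df['date_time'], format='%d-%m-%Y')

df['day'] = dates.dt.day            # 1-31
df['month'] = dates.dt.month        # 1-12
df['year'] = dates.dt.year
df['weekday'] = dates.dt.dayofweek  # 0 = Monday; useful for weekend effects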

Interesting results from LSTM RNN: lagged results for train and validation data

As an introduction to RNNs/LSTMs (stateless), I'm training a model with sequences of 200 days of previous data (X), including things like daily price change, daily volume change, etc., and for the labels/Y I have the % price change from the current price to the price 4 months later. Basically I want to estimate the market direction, not to be 100% accurate. But I'm getting some odd results...
When I then test my model with the training data, I notice the output from the model is a perfect fit when compared to the actual data; it just lags by exactly 4 months:
When I shift the data by 4 months, you can see it's a perfect fit.
I can obviously understand why the training data would be a very close fit as it has seen it all during training - but why the 4 months lag?
It does the same thing with the validation data (note the area I highlighted with the red box for future reference):
Time-shifted:
It's not as close-fitting as the training data, as you'd expect, but it's still too close for my liking - I just don't think it can be this accurate (see the little blip in the red rectangle as an example). I think the model is acting as a naive predictor; I just can't work out how or why it could be doing that.
To generate this output from the validation data, I input a sequence of 200 timesteps, but there's nothing in the data sequence that says what the % price change will be in 4 months - it's entirely disconnected, so how is it so accurate? The 4-month lag is obviously another indicator that something's not right here; I don't know how to explain it, but I suspect the two are linked.
I'll try to explain the observation based on some general underlying concepts:
If you don't provide a time-lagged X input dataset (lagged by t-k, where k is the number of time steps), then you are essentially feeding the LSTM today's closing price to predict today's closing price during training. The model will overfit and behave exactly as if the answer were already known (data leakage).
If Y is the predicted percentage change (i.e. X * (1 + Y) = the price 4 months in the future), then the predicted present value is really just the future price discounted by Y, so the predicted series will show a 4-month shift.
Okay, I realised my error; the way I was using the model to generate the forecast line was naive. For every date in the graph above, I was getting an output from the model and then applying the forecast % change to the actual price for that date - that gives the predicted price in 4 months' time.
Given that the markets usually only move within a margin of plus or minus 0-3% over a 4-month period, my forecasts were always going to closely mirror the current price, just with a 4-month lag.
At every date the predicted output was being re-based, so the model line could never deviate far from the actual one; it was essentially the same series, within a margin of plus or minus 0-3%.
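A small numpy sketch of that re-basing effect (the synthetic price series and the roughly-85-trading-day horizon are made-up illustration values):

import numpy as np

np.random.seed(0)
horizon = 85                                                   # ~4 months of trading days (assumed)
price = 100 * np.cumprod(1 + np.random.normal(0, 0.01, 600))   # synthetic price series

# Pretend the model predicts small % changes (here: the true ones plus noise)
true_change = price[horizon:] / price[:-horizon] - 1
pred_change = true_change + np.random.normal(0, 0.01, true_change.size)

# Re-basing: apply each predicted % change to the *current* actual price
forecast = price[:-horizon] * (1 + pred_change)

# Because pred_change is small, forecast stays close to price[:-horizon]:
# plotted at t + horizon it looks like the actual series lagged by 4 months.
print(np.corrcoef(forecast, price[:-horizon])[0, 1])           # close to 1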
Really, the graph isn't important, and it doesn't reflect the way I'll use the output anyway, so I'm going to ditch trying to get a visual representation, and concentrate on trying to find different metrics that lower the validation loss.

Dymola / Modelica - District heating

I am trying to validate a district heating model I built using Dymola.
In this case, I am trying to find the mass flow over a one-year period. I have two models running, both with the same loads and with pipes with the same characteristics, as in this picture:
pipes
Both models are as follows:
models
My results make sense at least with regard to the time of year when the flow should be higher: I am getting very high values during January, February and March, and then again towards the end of the year.
However, those peaks are VERY different: the first model in the picture gives me peaks of almost 400 kg/s, whereas the second one reaches only about 70 kg/s.
Can anyone suggest a way to validate the model? I have the heat loads for the year, hour by hour (this is the input I am giving to Dymola), and I know that the minimum water temperature is 70 °C and the maximum is 85 °C.
But I am really struggling to validate my model. Any suggestions?