Training LIBSVM with multivariate data in MATLAB

My general question is: how does LIBSVM perform multivariate regression?
In detail, I have data for a certain number of links (for example, 3 links). Each link has 3 predictor variables which, when used in a model, give an output Y. I have data collected on these links at some interval.
| LinkId | var1 | var2 | var3 | var4 (output) |
|--------|------|------|------|---------------|
| 1      | 10   | 12.1 | 2.2  | 3   |
| 2      | 11   | 11.2 | 2.3  | 3.1 |
| 3      | 12   | 12.4 | 4.1  | 1   |
| 1      | 13   | 11.8 | 2.2  | 4   |
| 2      | 14   | 12.7 | 2.3  | 2   |
| 3      | 15   | 10.7 | 4.1  | 6   |
| 1      | 16   | 8.6  | 2.2  | 6.6 |
| 2      | 17   | 14.2 | 2.3  | 4   |
| 3      | 18   | 9.8  | 4.1  | 5   |
I need to predict the output for the input (2, 19, 10.2, 2.3).
How can I do that using the above data for training in MATLAB with LIBSVM? Can I pass the whole dataset to svmtrain to create one model, or do I need to train each link separately and use the corresponding model for prediction? Does it make any difference?
NOTE: notice that rows with the same LinkId have the same var3 value.

This is not really a MATLAB or LIBSVM question but rather a generic SVM-related one.
My general question is: how does LIBSVM perform multivariate regression?
LIBSVM is just a library which, among other things, implements the Support Vector Regression (SVR) model for regression tasks. In short, in the linear case SVR tries to find a hyperplane such that your data points lie within some margin around it (which is in a sense dual to the classical SVM, which tries to separate the data with as large a margin as possible).
In the non-linear case the kernel trick is used (in the same fashion as in SVM), so it still looks for a hyperplane, but in a feature space induced by the particular kernel, which results in a non-linear regression in the input space.
A quite nice introduction to SVR can be found here:
http://alex.smola.org/papers/2003/SmoSch03b.pdf
How can I do that using the above data for training in MATLAB with LIBSVM? Can I pass the whole dataset to svmtrain to create one model, or do I need to train each link separately and use the corresponding model for prediction? Does it make any difference? NOTE: notice that rows with the same LinkId have the same var3 value.
You could train an SVR (as it is a regression problem) on the whole dataset, but:
- it seems that var3 and LinkId encode the same variable (1 -> 2.2, 2 -> 2.3, 3 -> 4.1); if that is the case you should remove the LinkId column,
- are the values of var1 unique ascending integers? If so, this is probably also a useless feature (it does not seem to carry any information; it looks like a record id),
- you should preprocess your data before applying SVM so that, e.g., each column contains values from the [0, 1] interval; otherwise some features may become more important than others just because of their scale.
Now, if you were to create a separate model for each link and follow the clues above, you would end up with one input variable (var2) and one output variable (var4), so I would not recommend such a step. In general it seems that you have a very limited feature set; it would be valuable to gather more informative features.
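For concreteness, here is a minimal, untuned sketch of training an epsilon-SVR on the whole dataset with the LIBSVM MATLAB interface, following the advice above (drop LinkId, scale each feature to [0, 1]). The hyperparameters (-c, -g, -p) are placeholders and should be chosen by cross-validation; make sure the LIBSVM mex files (svmtrain/svmpredict) shadow any toolbox functions of the same name.

% Training data from the table above: [LinkId var1 var2 var3 var4]
data = [1 10 12.1 2.2 3.0
        2 11 11.2 2.3 3.1
        3 12 12.4 4.1 1.0
        1 13 11.8 2.2 4.0
        2 14 12.7 2.3 2.0
        3 15 10.7 4.1 6.0
        1 16  8.6 2.2 6.6
        2 17 14.2 2.3 4.0
        3 18  9.8 4.1 5.0];
X = data(:, 2:4);                                        % var1..var3 (LinkId dropped)
y = data(:, 5);                                          % var4, the output
mn = min(X); mx = max(X);                                % per-feature scaling to [0, 1]
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn);
model = svmtrain(y, Xs, '-s 3 -t 2 -c 1 -g 1 -p 0.1');   % -s 3: epsilon-SVR, -t 2: RBF kernel
% Predict for the query point (LinkId 2, var1 19, var2 10.2, var3 2.3)
xnew = ([19 10.2 2.3] - mn) ./ (mx - mn);
ypred = svmpredict(0, xnew, model);                      % 0 is a dummy label; ypred is the estimate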

Related

Simple cumulative increase in Prometheus

I have an application that increments a Prometheus counter when it receives a particular HTTP request. The application runs in Kubernetes, has multiple instances and redeploys multiple times a day. Using the query http_requests_total{method="POST",path="/resource/aaa",statusClass="2XX"} produces a graph displaying cumulative request counts per instance, as expected.
I would like to create a Grafana graph that shows the cumulative frequency of requests received over the last 7 days.
My first thought was to use increase(...[7d]) in order to account for any metrics starting outside of the 7-day window (like in the image shown) and then sum those values.
I've come to the realisation that sum(increase(http_requests_total{method="POST",path="/resource/aaa",statusClass="2XX"}[7d])) does in fact give the correct answer for points in time. However, the resulting graph isn't quite what was asked for, because the component increase(...) values rise and fall over the week.
How would I go about creating a graph that shows the cumulative sum of the increase in these metrics over the past 7 days? For example, given the following simplified data
| Day | # Requests |
|-----|------------|
| 1 | 10 |
| 2 | 5 |
| 3 | 15 |
| 4 | 10 |
| 5 | 20 |
| 6 | 5 |
| 7 | 5 |
| 8 | 10 |
If I were to view a graph from day 2 to day 8, I would like it to render a line as follows:
| Day | Cumulative Requests |
|-----|---------------------|
| d0 | 0 |
| d1 | 5 |
| d2 | 20 |
| d3 | 30 |
| d4 | 50 |
| d5 | 55 |
| d6 | 60 |
| d7 | 70 |
Where d0 represents the initial value in the graph
Thanks
Prometheus doesn't provide functionality for returning the cumulative increase over multiple time series on a selected time range.
If you still need this functionality, then try VictoriaMetrics, a Prometheus-like monitoring solution I work on. It allows calculating the cumulative increase over multiple counters. For example, the following MetricsQL query returns the cumulative increase over all the time series with the http_requests_total name on the time range selected in Grafana:
running_sum(sum(increase(http_requests_total)))
How does it work?
1. It calculates the increase for each time series with the http_requests_total name. Note that increase() in the query above doesn't contain a lookbehind window in square brackets: VictoriaMetrics automatically sets the lookbehind window to the step value, which is passed by Grafana to the /api/v1/query_range endpoint (the step is the interval between points on the graph).
2. It sums the per-series increases returned at step 1 with the sum() function, individually for each point on the graph.
3. It calculates the cumulative increase over the per-step increases returned at step 2 with the running_sum function.
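For intuition, here is a toy MATLAB sketch (purely illustrative, outside of Prometheus/VictoriaMetrics) of the arithmetic this produces on the simplified data from the question: per-step increases over the visible range are accumulated with a running sum.

daily = [10 5 15 10 20 5 5 10];   % requests per day, days 1..8, from the question
window = daily(2:8);              % the part visible when viewing day 2 to day 8
cumulative = cumsum(window)       % -> 5 20 30 50 55 60 70, i.e. d1..d7 in the expected table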
If I understood the idea of your question correctly, I think I managed to create such a graph with a query like this:
sum(max_over_time(counterName{someLabel="desiredlabelValue"}[7d]))
(The original answer illustrated the result with a screenshot; the cumulative series is the blue one.)
The reason the future part of the graph decreases is both that the future processing obviously hasn't happened yet and that processing older than 7 days slides out of the moving 7-day inspection window.

Date use in linear regression and conversion of date to numbers using spark mllib

I want to use a date in linear regression, so I have to convert it to a number: set the lowest date to 0 and increase the number according to the date difference.
Then I can use the date field in linear regression using Scala and Spark MLlib.
I have dataframe ready with some fields including date.
For example,
| date | id |
| 01-01-2017 | 12 |
| 01-02-2016 | 13 |
| 05-05-2016 | 22 |
For string columns I have implemented one-hot encoding. But for a date, how can I set the first date to 0 and then increase the number according to the difference?
Thanks.
This depends purely on the model you want to create. For very basic trend modeling you can just cast your date to a Unix timestamp:
import org.apache.spark.sql.functions._
val parsed = df.withColumn("date", unix_timestamp($"date", "dd-MM-yyyy"))
No additional processing should be necessary, but you can of course shift it to start at 0, or rescale it to a more convenient range.
More advanced modeling would involve extracting components such as month or day of week. These should in general be treated as categorical variables and one-hot encoded.
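As a rough sketch of that second approach (assuming Spark 2.3+ for dayofweek; column names are illustrative), the date components can be derived with built-in functions before one-hot encoding:

import org.apache.spark.sql.functions._

// Parse the string column once, then derive categorical date components.
val withDate = df.withColumn("ts", to_timestamp($"date", "dd-MM-yyyy"))
val featurized = withDate
  .withColumn("month", month($"ts"))          // 1..12, treat as categorical
  .withColumn("dayofweek", dayofweek($"ts"))  // 1..7, treat as categorical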

Plot a curve in grafana

I have a 2-D time series for which I take 1-minute snapshots that I put in my influxdb.
To give a concrete example, consider a yield curve : this is a curve giving the interest rate by maturity date and looks like this:
| maturity | 1 YEAR | 2 YEARS | 2 YEARS | 3 YEARS | 4 YEARS | 5 YEARS |
|----------|--------|---------|---------|---------|---------|---------|
| interest | 0.5    | 0.75    | 0.83    | 0.99    | 1.01    | 1.05    |
My application takes snapshots of the curve and stores them in influxdb.
Now I want to plot these snapshots in grafana. So at one particular time stamp I want to plot the curve (X axis will be my maturities, and Y axis the corresponding interest rates for each maturity).
Can this be done in Grafana?
To the best of my knowledge, this is not currently possible with Grafana. One of your axes must always be time.

Using Landsat 7 to go from NDVI to Emissivity

I am using Landsat 7 to calculate land surface derived temperature.
I understand the main concepts behind the conversion; however, I am confused about how to factor emissivity into my model.
I am using the model builder for my calculations and have created several modules that use the instrument's Gain, Bias Offset, Landsat K1, and Landsat K2 correction variables.
I have also converted the DN values to radiance.
Now, I need to factor in the last and probably the most confusing (for me) part: Emissivity.
I would like to calculate Emissivity using the NDVI.
I have a model procedure built to calculate the NDVI layer: (band4 - band3) / (band4 + band3).
I have also calculated Pv, the fraction of vegetation, given by: Pv = ((NDVI - NDVI_min) / (NDVI_max - NDVI_min))^2.
Now, by using the Vegetation Cover Method, all I need is Ev and Eg.
I do not understand how to find these values to calculate the Total Emissivity value per cell.
Does anyone have any idea on how I can incorporate the Emissivity into my formulation?
I am slightly confused about how to derive this value...
I believe Emissivity is frequently included as part of the dataset. Alternatively, emissivity databases do exist (such as the ASTER database here: https://lpdaac.usgs.gov/about/news_archive/aster_global_emissivity_database_ged_product_release, and others usually maintained by academic departments.)
Values of Ev = 0.99 and Eg = 0.97 are used, and the method of selection discussed, on p. 436 here: ftp://atmosfera.cl/pub/elias/Paula/2004_Sobrino_RSE.pdf (J.A. Sobrino et al. Land surface temperature retrieval from LANDSAT TM 5, Remote Sensing of Environment 90, 2004, p. 434–440).
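As a rough MATLAB sketch of the vegetation cover method with those values (Ev = 0.99, Eg = 0.97), ignoring the small cavity/roughness correction term that some papers add, and with purely illustrative NDVI thresholds:

NDVI     = 0.45;                 % example pixel value, (band4 - band3)/(band4 + band3)
NDVI_min = 0.2;  NDVI_max = 0.5; % bare-soil / full-vegetation thresholds (assumed, scene-dependent)
Pv = ((NDVI - NDVI_min) / (NDVI_max - NDVI_min))^2;   % fraction of vegetation
Ev = 0.99; Eg = 0.97;                                 % vegetation and ground emissivities
E  = Ev * Pv + Eg * (1 - Pv);                         % total emissivity for the cell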
Another approach is taken here: http://fromgistors.blogspot.com/2014/01/estimation-of-land-surface-temperature.html
Estimation of Land Surface Temperature
There are several studies about the calculation of land surface temperature. For instance, using NDVI for the estimation of land surface emissivity (Sobrino, et al., 2004), or using a land cover classification for the definition of the land surface emissivity of each class (Weng, et al. 2004).
For instance, the emissivity (e) values of various land cover types are provided in the following table (from Mallick, et al. 2012).
Soil: 0.928
Grass: 0.982
Asphalt: 0.942
Concrete: 0.937
Therefore, the land surface temperature can be calculated as (Weng, et al. 2004):
T = TB / [1 + (λ * TB / ρ) * ln(e)]
where:
λ = wavelength of emitted radiance
ρ = h * c / σ (1.438 * 10^-2 m K)
h = Planck's constant (6.626 * 10^-34 J s)
σ = Boltzmann constant (1.38 * 10^-23 J/K)
c = velocity of light (2.998 * 10^8 m/s)
The values of λ for the thermal bands of the Landsat satellites are listed in the following table:
| Satellite | Band | Center wavelength (µm) |
|-----------|------|------------------------|
| Landsat 4, 5, and 7 | 6 | 11.45 |
| Landsat 8 | 10 | 10.8 |
| Landsat 8 | 11 | 12 |
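Putting the pieces together, here is a minimal MATLAB sketch of the temperature formula above (TB is the at-sensor brightness temperature in Kelvin and e the emissivity from the previous step; the scalar example values are assumptions, and the expression works element-wise on full rasters as well):

TB = 295.3;                   % example brightness temperature, Kelvin (assumed)
e  = 0.97;                    % example emissivity (assumed)
lambda = 11.45e-6;            % Landsat 7 band 6 center wavelength, metres
h   = 6.626e-34;              % Planck's constant, J s
c   = 2.998e8;                % speed of light, m/s
sig = 1.38e-23;               % Boltzmann constant, J/K
rho = h * c / sig;            % ~1.438e-2 m K
T = TB ./ (1 + (lambda .* TB ./ rho) .* log(e));   % land surface temperature, Kelvin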
For further reading on emissivity selection, see section 2.3, Emissivity Retrieval, here: https://books.google.com/books?id=XN4uAYlexnsC&lpg=PA51&ots=YQrmDa2S1G&dq=vegetation%20and%20bare%20soil%20emissivity&pg=PA50#v=onepage&q&f=false

Bootstrap weighted data - Matlab

I have a simple dataset with values and absolute frequencies, like the table below:
| value | freq |
|-------|------|
| 1     | 10   |
| 3     | 20   |
| 4     | 10   |
| 3     | 10   |
And now I'd like to calculate the frequency table, like:
| value | %   |
|-------|-----|
| 1     | 1/5 |
| 3     | 3/5 |
| 4     | 1/5 |
As a last step, I'd like to compute the bootstrap CI with MATLAB. I have a lot of rows in the dataset.
I've calculated the frequency table via the grpstats command in MATLAB, but I don't know how I can use it in the bootstrp function.
Any help or suggestions would be really appreciated.
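One possible sketch (an assumption about the intent here, not a definitive solution): expand the values by their frequencies, compute the relative frequencies, and then bootstrap a statistic of interest (here the mean) with bootci from the Statistics Toolbox:

vals = [1; 3; 4; 3];
freq = [10; 20; 10; 10];
% Relative frequency per unique value (reproduces the 1/5, 3/5, 1/5 table above)
[u, ~, g] = unique(vals);
rel = accumarray(g, freq) / sum(freq);
% Expand to a raw sample and bootstrap a 95% CI for the mean
sample = repelem(vals, freq);
ci = bootci(1000, @mean, sample);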