I have a (maybe simple) question about Matlab data import. I want to import a huge dataset (~1GB) which has a comma separated format like this:
08:05, 12, 33, 124, 13, 08:06, 22, 84, 12, 35, ..
Every 5th value is a timestamp. I want to import it with a fixed numbers of colums (5 colums), but there is no delimiter for the end of row. It should look like this in the end:
08:05 12 33 124 13
08:06 22 14 1 35
08:07 22 124 12 34
08:08 22 12 12 0
I thought about replacing every 5th comma by a subroutine, but it's too time consuming. Do you know a better solution? I'm hoping for a nice build-in function.
You can use fscanf and C-type format strings to accomplish this. For example:
fid=fopen('filename.txt');
A=reshape(fscanf(fid,'%d:%d, %d, %d, %d, %d, '),6,[])';
fclose(fid);
This stores your answer in a matrix A which will contain
A =
8 5 12 33 124 13
8 6 22 84 12 35
If you want to format this into a string or output file as you listed, you could use:
fprintf('%02d:%02d %-3d %-3d %-3d %-3d\n',A')
Related
Predicting next month values using linear regression.
I am using 6 month based historical values to predict future values.
I use vaccinated count on dependent variable and use months for independent variable and converted it to integer starts on 1.
Example.
Historical Data:
Month dependent variable independent variable
Jun 15 1
Jul 14 2
Aug 18 3
Sep 19 4
Oct 20 5
Nov 22 6
Is that correct?
Dependent Variable = Vaccinated Count
Independent Variable = Month converted to number start from 1
Expecting to give me some ideas if my data is correct
See picture below.
Python simple linear regression:
Hardcover
Date
2000-04-01 139
2000-04-02 128
2000-04-03 172
2000-04-04 139
2000-04-05 191
df['Time'] = np.arange(len(df.index))
Hardcover Time
Date
2000-04-01 139 0
2000-04-02 128 1
2000-04-03 172 2
2000-04-04 139 3
2000-04-05 191 4
fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');
I've made a calendar sheet and would like to fill it using an Arrayformula or some kind of Lookup.
The problem is, the code in each cell is different, do I need it all to be the same code or is it possible to do an Arrayformula that does a different formula for each line?
I spent ages getting the calendar code working but would now like to simplify the code and I'm not sure what my next step should be:
https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit?usp=sharing
Any advice would be much appreciated.
I've added a new sheet in your file called 'Aresvik'.
The green cells have new formula.
Cell B3 can be =date(B1,1,1)
Then each successive month can be =eomonth(B3,0)+1, =eomonth(J3,0)+1 etc.
The date formula in cell B5 is:
=arrayformula(iferror(vlookup(sequence(7,7,1),{array_constrain(sequence(40,1),day(eomonth(B3,0))+weekday(B3,3),1),query({flatten(split(rept(",",day(eomonth(B3,0))-1),",",0,0));sequence(day(eomonth(B3,0)),1,1)},"offset "&day(eomonth(B3,0))-weekday(B3,3)&" ",0)},2,false),))
It can be copied to each other cell below Mo, so B5 will change to J5, R5, Z5 etc.
Notes
The concept revolves around using the SEQUENCE function to generate a grid of numbers, 6 rows, 7 columns:
sequence(6,7)
which looks like this:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31 32 33 34 35
36 37 38 39 40 41 42
Then using these numbers in a VLOOKUP to get a corresponding date for the calendar. If the first of the month falls on a Thursday (April 2021), the vlookup range needs 3 gaps at the top of the list of dates. player0 has a more elegant solution than my original query using offset, so I've incorporated it below. Cell Z3 is the date 1/4/2021:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}},
2,false)
,))
The first column in the vlookup range is:
sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0)
which is an array of numbers from 0, corresponding with the number of days in the month plus the number of gaps before the 1st day.
The second column in the vlookup range is:
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}},
It is an array of 2 columns in this format: {x;y}, where y sits below x because of the ;.
These are the gaps: iferror(sequence(weekday(Z3,2),1)/0,), followed by the date numbers: sequence(day(eomonth(Z3,0)),1,Z3)
(Example below is April 2021):
0
1
2
3
4
5
6 44317
7 44318
8 44319
9 44320
10 44321
11 44322
12 44323
13 44324
14 44325
15 44326
16 44327
17 44328
18 44329
19 44330
20 44331
21 44332
22 44333
23 44334
24 44335
25 44336
26 44337
27 44338
28 44339
29 44340
30 44341
31 44342
32 44343
33 44344
34 44345
35 44346
36 44347
The vlookup takes each number in the initial sequence (6x7 layout), and brings back the corresponding date from col2 in the range, based on a match in col1.
When the first day of the month is a Monday, iferror(sequence(weekday(BB1,2),1)/0,) generates a gap in col2 of the vlookup range. This is why col1 in the vlookup range has to start with 0.
I've updated the sheet at https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit#gid=68642071
Values on the calendar are dates so the formatting has to be d.
If you want numbers, then use:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1)}},
2,false)
,))
shorter solution:
=INDEX(IFNA(VLOOKUP(SEQUENCE(6, 7), {SEQUENCE(DAY(EOMONTH(B3, ))+WEEKDAY(B3, 2), 1, ),
{IFERROR(ROW(INDIRECT("1:"&WEEKDAY(B3, 2)))/0); SEQUENCE(DAY(EOMONTH(B3, )), 1, B3)}}, 2, )))
I have a function that takes as input some of the values in a table and returns a tuple if you will - three separate return values, which I want to transpose into the output of a query. Here's a simplified example of what I want to achieve:
multiplier:{(x*2;x*3;x*3)};
select twoX:multiplier[price][0]; threeX:multiplier[price][1]; fourX:multiplier[price][2] from data;
The above basically works (I think I've got the syntax right for the simplified example - if not then hopefully my intention is clear), but is inefficient because I'm calling the function three times and throwing away most of the output each time. I want to rewrite the query to only call the function once, and I'm struggling.
Update
I think I missed a crucial piece of information in my explanation of the problem which affects the outcome - I need to get other data in the query alongside the output of my function. Here's a hopefully more realistic example:
multiplier:{(x*2;x*3;x*4)};
select average:avg price, total:sum price, twoX:multiplier[sum price][0]; threeX:multiplier[sum price][1]; fourX:multiplier[sum price][2] by category from data;
I'll have a go at adapting your answers to fit this requirement anyway, and apologies for missing this bit of information. The real function if a proprietary and fairly complex algorithm and the real query has about 30 output columns, hence the attempt at simplifying the example :)
If you're just looking for the results themselves you can extract (exec) as lists, create dictionary and then flip the dictionary into a table:
q)exec flip`twoX`threeX`fourX!multiplier[price] from ([]price:til 10)
twoX threeX fourX
-----------------
0 0 0
2 3 4
4 6 8
6 9 12
8 12 16
10 15 20
12 18 24
14 21 28
16 24 32
18 27 36
If you need other columns from the original table too then its trickier but you could join the tables sideways using ,'
q)t:([]price:til 10)
q)t,'exec flip`twoX`threeX`fourX!multiplier[price] from t
An apply # can also achieve what you want. Here data is just a table with 10 random prices. # is then used to apply the multiplier function to the price column while also assigning a column name to each of the three resulting lists:
q)data:([] price:10?100)
q)multiplier:{(x*2;x*3;x*3)}
q)#[data;`twoX`threeX`fourX;:;multiplier data`price]
price twoX threeX fourX
-----------------------
80 160 240 240
24 48 72 72
41 82 123 123
0 0 0 0
81 162 243 243
10 20 30 30
36 72 108 108
36 72 108 108
16 32 48 48
17 34 51 51
Want to convert the alphabet to numerical values and transform it back to alphabets using some mathematical techniques like fast Fourier transform in MATLAB.
Example:
The following is the text saved in "text2figure.txt" file
Hi how r u am fine take care of your health
thank u very much
am 2.0
Reading it in MATLAB:
data=fopen('text2figure.txt','r')
d=fscanf(data,'%s')
temp = fileread( 'text2figure.txt' )
temp = regexprep( temp, ' {6}', ' NaN' )
c=cellstr(temp(:))'
Now I wish to convert cell array with spaces to numerical values/integers:
coding = 'abcdefghijklmnñopqrstuvwxyz .,;'
str = temp %// example text
[~, result] = ismember(str, coding)
y=result
result =
Columns 1 through 18
0 9 28 8 16 24 28 19 28 22 28 1 13 28 6 9 14 5
Columns 19 through 36
28 21 1 11 5 28 3 1 19 5 28 16 6 28 26 16 22 19
Columns 37 through 54
28 8 5 1 12 21 8 28 0 0 21 8 1 14 11 28 22 28
Columns 55 through 71
23 5 19 26 28 13 22 3 8 0 0 1 13 28 0 29 0
Now I wish to convert the numerical values back to alphabets:
Hi how r u am fine take care of your health
thank u very much
am 2.0
How to write a MATLAB code to return the numerical values in the variable result to alphabets?
Most of the code in the question doesn't have any useful effects. These three lines are the ones that lead to result:
str = fileread('test2figure.txt');
coding = 'abcdefghijklmnñopqrstuvwxyz .,;';
[~, result] = ismember(str, coding);
ismember returns, in the second output argument, the indices into coding for each element of str. Thus, result are indices that we can use to index into coding:
out = coding(result);
However, this does not work because some elements of str do not occur in coding, and for those elements ismember returns 0, which is not a valid index. We can replace the zeros with a new character:
coding = ['*',coding];
out = coding(result+1);
Basically, we're shifting each code by one, adding a new code for 1.
One of the characters we're missing here is the newline character. Thus the three lines have become one line. You can add a code for the newline character by adding it to the coding table:
str = fileread('test2figure.txt');
coding = ['abcdefghijklmnñopqrstuvwxyz .,;',char(10)]; % char(10) is the newline character
[~, result] = ismember(str, coding);
coding = ['*',coding];
out = coding(result+1);
All of this is easier to achieve just using the ASCII code table:
str = fileread('test2figure.txt');
result = double(str);
out = char(result);
I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First off, take a look at the distribution of your data. You can see that the majority of your data has double digits. The outliers are those with single digits, or those that are way larger than double digits. Mind you, this is totally subjective so someone else may tell you that the single digits are part of your data too. Also, the missing data are those numbers that are spaces in between the commas. Let's write some MATLAB code and change these to NaN (or not-a-number), because if you try copying and pasting this code directly into MATLAB, it will give you a syntax error because if you are explicitly defining numbers this way, you have to be sure all of them are there.
To do this, use regexprep so that any parts of this string that have a comma, space, then another comma, put a NaN in between. To do this, we need to put this statement as a string first. We then use eval to convert this string to an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
For the next question, we simply extract those values that are not missing, calculate the mean and median of what is not missing, and fill in those NaN values with the mean and median. For the bin boundaries, this is the same thing as using the values to the left (or right... depends on your definition, but let's use left) of the missing value and fill those in. As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
The mean and median for the data of the non-missing values is:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values looks like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.