Subset a data.frame based on a date-time column

I have the following data.frame with about 18 million records:
Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
1 M 28 69 85 2010-02-16 12:42:32 85 2010-02-16 12:45:37 3.1
2 M 30 11 85 2010-02-16 12:53:29 26 2010-02-16 13:22:23 28.9
3 M 37 43 85 2010-02-16 13:21:46 13 2010-02-16 13:49:47 28.0
4 M 37 826 22 2010-02-16 14:06:40 85 2010-02-16 14:23:13 16.6
5 M 19 662 27 2010-02-16 15:31:15 74 2010-02-16 16:29:17 58.0
6 F 25 8 85 2010-02-16 16:31:53 20 2010-02-16 16:49:26 17.6
17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
I want to write a function that subsets the trips from January 2015. The input is "201501" and the result should be:
Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7

As suggested in this answer, you could convert your dataset into an xts object and then use its date-based subsetting:
xtsdf <- xts::xts(df, order.by = df$DateTimeDepa)
xtsdf["201501"]
Which gives (note that xts stores its data as a matrix, so every column is coerced to character):
# Gender Age Bici DepartingSta DateTimeDepa ArrivingSta
#2015-01-30 23:58:33 "F" "26" "2760" "121" "2015-01-30 23:58:33" "106"
#2015-01-30 23:58:50 "M" "22" "4077" " 71" "2015-01-30 23:58:50" "190"
#2015-01-30 23:58:55 "M" "32" " 699" "154" "2015-01-30 23:58:55" "165"
#2015-01-30 23:59:20 "F" "26" "4044" " 64" "2015-01-30 23:59:20" " 50"
#2015-01-30 23:59:23 "M" "26" "3114" " 26" "2015-01-30 23:59:23" "127"
#2015-01-30 23:59:55 "M" "25" "4115" "165" "2015-01-30 23:59:55" " 73"
# DateTimeArri TravelTime
#2015-01-30 23:58:33 "2015-01-31 00:22:08" "23.6"
#2015-01-30 23:58:50 "2015-01-31 00:13:24" "14.6"
#2015-01-30 23:58:55 "2015-01-31 00:02:25" " 3.5"
#2015-01-30 23:59:20 "2015-01-31 00:05:38" " 6.3"
#2015-01-30 23:59:23 "2015-01-31 00:12:29" "13.1"
#2015-01-30 23:59:55 "2015-01-31 00:12:39" "12.7"

Here's how you can solve this using base R format(), vectorized string comparison, and subset():
df <- data.frame(
  Gender=c('M','M','M','M','M','F','F','M','M','F','M','M'),
  Age=c(28L,30L,37L,37L,19L,25L,26L,22L,32L,26L,26L,25L),
  Bici=c(69L,11L,43L,826L,662L,8L,2760L,4077L,699L,4044L,3114L,4115L),
  DepartingSta=c(85L,85L,85L,22L,27L,85L,121L,71L,154L,64L,26L,165L),
  DateTimeDepa=as.POSIXct(c('2010-02-16 12:42:32','2010-02-16 12:53:29','2010-02-16 13:21:46','2010-02-16 14:06:40','2010-02-16 15:31:15','2010-02-16 16:31:53','2015-01-30 23:58:33','2015-01-30 23:58:50','2015-01-30 23:58:55','2015-01-30 23:59:20','2015-01-30 23:59:23','2015-01-30 23:59:55')),
  ArrivingSta=c(85L,26L,13L,85L,74L,20L,106L,190L,165L,50L,127L,73L),
  DateTimeArri=as.POSIXct(c('2010-02-16 12:45:37','2010-02-16 13:22:23','2010-02-16 13:49:47','2010-02-16 14:23:13','2010-02-16 16:29:17','2010-02-16 16:49:26','2015-01-31 00:22:08','2015-01-31 00:13:24','2015-01-31 00:02:25','2015-01-31 00:05:38','2015-01-31 00:12:29','2015-01-31 00:12:39')),
  TravelTime=c(3.1,28.9,28,16.6,58,17.6,23.6,14.6,3.5,6.3,13.1,12.7),
  row.names=c('1','2','3','4','5','6','17919307','17919308','17919309','17919310','17919311','17919312'),
  stringsAsFactors=FALSE
);
ym <- '201501';
df;
## Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
## 1 M 28 69 85 2010-02-16 12:42:32 85 2010-02-16 12:45:37 3.1
## 2 M 30 11 85 2010-02-16 12:53:29 26 2010-02-16 13:22:23 28.9
## 3 M 37 43 85 2010-02-16 13:21:46 13 2010-02-16 13:49:47 28.0
## 4 M 37 826 22 2010-02-16 14:06:40 85 2010-02-16 14:23:13 16.6
## 5 M 19 662 27 2010-02-16 15:31:15 74 2010-02-16 16:29:17 58.0
## 6 F 25 8 85 2010-02-16 16:31:53 20 2010-02-16 16:49:26 17.6
## 17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
## 17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
## 17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
## 17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
## 17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
## 17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
ym;
## [1] "201501"
subset(df,format(DateTimeDepa,'%Y%m')==ym);
## Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
## 17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
## 17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
## 17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
## 17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
## 17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
## 17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
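For comparison outside R, here is a minimal Python sketch of the same idea, using only the standard library. It shows two variants: comparing a formatted year-month string (as in the answer above), and testing against a half-open date range, which avoids formatting every timestamp and may matter with around 18 million rows. The function names and the three-row sample are illustrative assumptions, not part of the original answers.

```python
from datetime import datetime

def subset_by_format(rows, ym):
    """Keep rows whose departure time formats to the given 'YYYYMM' string."""
    return [r for r in rows if r["DateTimeDepa"].strftime("%Y%m") == ym]

def subset_by_range(rows, ym):
    """Same result, using start <= t < first instant of the next month."""
    year, month = int(ym[:4]), int(ym[4:])
    start = datetime(year, month, 1)
    # month // 12 is 1 only for December, which rolls over to the next year
    end = datetime(year + month // 12, month % 12 + 1, 1)
    return [r for r in rows if start <= r["DateTimeDepa"] < end]

# Hypothetical sample: three of the departure times from the question
rows = [
    {"id": 6, "DateTimeDepa": datetime(2010, 2, 16, 16, 31, 53)},
    {"id": 17919307, "DateTimeDepa": datetime(2015, 1, 30, 23, 58, 33)},
    {"id": 17919312, "DateTimeDepa": datetime(2015, 1, 30, 23, 59, 55)},
]

jan_2015 = subset_by_format(rows, "201501")
print([r["id"] for r in jan_2015])
```

Both functions should select the same rows; the range version is only a suggestion for large inputs, not something the answers above measured.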

range of number to subset slices

I would like to split a vector into a number of 'slices' (in Matlab), but I find myself in a brain freeze and can't come up with a good way (e.g. a one-liner) to do it:
a=1:119;
slices=[47 24 1 47];
result={1:47,48:71,...};
The result doesn't need to be stored in a cell array.
Thanks
This is what mat2cell does:
>> a=1:119;
>> slices=[47 24 1 47];
>> result = mat2cell(a, 1, slices) % 1 is # of rows in result
result =
{
[1,1] =
Columns 1 through 15:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Columns 16 through 30:
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Columns 31 through 45:
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
Columns 46 and 47:
46 47
[1,2] =
Columns 1 through 15:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
Columns 16 through 24:
63 64 65 66 67 68 69 70 71
[1,3] = 72
[1,4] =
Columns 1 through 13:
73 74 75 76 77 78 79 80 81 82 83 84 85
Columns 14 through 26:
86 87 88 89 90 91 92 93 94 95 96 97 98
Columns 27 through 39:
99 100 101 102 103 104 105 106 107 108 109 110 111
Columns 40 through 47:
112 113 114 115 116 117 118 119
}
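The slicing that mat2cell performs here amounts to cutting the vector at the cumulative sums of the slice sizes. A minimal Python sketch of that idea follows; the helper name split_by_sizes is an assumption made for illustration and is not a MATLAB or Octave function.

```python
from itertools import accumulate

def split_by_sizes(values, sizes):
    """Cut `values` into consecutive chunks with the given lengths."""
    if sum(sizes) != len(values):
        raise ValueError("slice sizes must add up to the length of the input")
    ends = list(accumulate(sizes))   # 47, 71, 72, 119 for the sizes below
    starts = [0] + ends[:-1]         # 0, 47, 71, 72
    return [values[s:e] for s, e in zip(starts, ends)]

a = list(range(1, 120))              # 1..119, as in the question
slices = [47, 24, 1, 47]
result = split_by_sizes(a, slices)

# First and last element of each chunk
print([(chunk[0], chunk[-1]) for chunk in result])
```

Like mat2cell, the sketch refuses sizes that do not add up to the length of the input.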

Matlab Spearman Correlation PVAL = 0?

I am computing Spearman's correlation between two data sets of 300 objects each. These are my variables and commands:
a = [1:300]
b = [1 2 5 11 9 7 24 10 31 23 3 40 6 17 14 20 16 12 33 46 70 37 87 43 98 26 59 58 77 100 35 42 78 80 243 36 33327 4 83 160 163 198 86 94 406 111 28 29 55 113 239 295 110 196 177 32679 229 342 305 300 254 96 210 514 167 172 232 190 117 32081 25 158 19333 241 82 149 159 66 178 24487 68 30 1016 725 266 391 638 348 320 681 242 319 228 381 408 442 202 369 471 821 191 426 8 270 211 2266 619 576 441 680 3431 1167 723 74 318 556 640 395 1059 579 614 212 325 437 323 687 373 599 26637 985 54 84 802 724 154 417 240 1120 818 2309 462 109 104 509 494 427 57 2475 549 396 419 123 580 79 225 1132 351 76 16859 596 862 315 470 992 257 120 409 751 832 285 1534 714 1665 1376 2129 678 416 721 209 31971 183 356 1346 1015 1003 188 1076 1634 608 1056 338 308 145 418 625 1313 121 2484 996 783 329 1185 697 157 1100 175 622 235 456 277 166 2700 1439 461 653 433 540 1191 234 774 1894 1004 741 1062 948 48 99 405 797 237 1104 2286 22620 1429 30672 1808 169 458 22 1115 10660 872 474 1063 88 1727 1017 1107 1398 1519 703 1092 1027 272 263 1152 1770 1099 507 385 2118 19356 1778 2458 410 2110 7522 17166 4065 15136 13294 10876 17174 2434 9898 5663 13594 10506 11552 15635 9322 3223 8949 12388 13216 13851 13852 6696 12177 4700 17199 2067 11110 15486 5664 6593 4701 527 8616 268]
[RHO,PVAL] = corr(b',a','Type', 'Spearman')
RHO =
0.6954
PVAL =
0
Out of the 5 comparisons I made with other data sets of 300 objects, only 1 returned significant P-values. Is there an explanation for this?
I tried a different data set and got a value that was not significant (PVAL > 0.05). I also displayed the answer in a long (15 digits) and exponential form and got 0.00000000000000e+000 using:
format longEng
I also checked with another statistics program, which reported the p-value as < 0.0001. So the p-value is not literally zero; it is just really, really small, too small for MATLAB to report as anything other than 0.
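To get a feel for how small such a p-value can be, here is a rough Python sketch. It converts the reported RHO and n = 300 into the usual t statistic and then uses a normal approximation for the tail. This is an assumption made for illustration only: it is not the algorithm MATLAB's corr uses, and the resulting number indicates an order of magnitude at best. It also shows how computing 1 - cdf can round to exactly 0 in double precision even though the tail itself is positive.

```python
import math

rho, n = 0.6954, 300   # values reported in the question

# t statistic commonly used for testing a correlation coefficient
t = rho * math.sqrt((n - 2) / (1 - rho ** 2))

# Normal approximation (an assumption, not MATLAB's method).
# Naive two-sided p-value: 2 * (1 - Phi(t)); Phi(t) rounds to 1.0 here.
phi = 0.5 * math.erfc(-t / math.sqrt(2))
naive_p = 2 * (1 - phi)

# Same quantity via the complementary function, which keeps the tiny tail
stable_p = math.erfc(t / math.sqrt(2))

print(t, naive_p, stable_p)
```

Under these assumptions the naive form prints 0.0 while the complementary form prints a tiny positive number, which is consistent with a reported PVAL of 0 meaning "too small to represent" rather than "exactly zero".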

MRT function [1] "error code = 0"

When I build my MRT, I get two errors:
[1] "error code = 0"
and
Error in indval.default(Ynode, clustering = clustnode, numitr = 1000) : All species must occur in at least one plot
Does anyone have an idea why? I checked, and all my species have an abundance > 0.
MRTtest=mvpart(vegetation~ Placette+ Tourb + Transect + Largcanal + Annouvert + Elevation + Profnappe + Litiere+ Solnu+ Deblign+ Densiometre+ EpaissMO+ Vonpostvingt+ Vonpostsoixante+ Pyrovingt+ Pyrosoixante+ Sommesurfterr,tot,margin=0.08,cp=0,xv="pick",xval=10,xvmult=150,which=4,pca=F)
X-Val rep : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
106 107 108 109 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131 132 133
134 135 136 137 138 139 140 141 142 143 144 145 146 147
148 149 150
Minimum tree sizes
tabmins
2 3 4 6
2 125 5 18
MRTtest1=MRT(MRTtest,percent=10,species=colnames(vegetation))
summary(MRTtest1)
Portion (%) of deviance explained by species for every particular node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- Node 1 ---
Complexity(R2) 14.87422
Sommesurfterr>=6.024 Sommesurfterr< 6.024
~ Discriminant species :
THOnmtot THOmtot
% of expl. deviance 17.61057298 38.419650816
Mean on the left 0.37621604 0.430818462
Mean on the right 0.08877576 0.006259911
[1] "error code = 0"
~ INDVAL species for this node: : left is 1, right is 2
cluster indicator_value probability
THOmtot 1 0.9597 0.001
THOnmtot 1 0.7878 0.001
LEG 1 0.5802 0.031
LIB 1 0.5078 0.010
MELnmtot 1 0.4710 0.047
EPNnmtot 1 0.4404 0.026
Sum of probabilities = 87.497
Sum of Indicator Values = 30.02
Sum of Significant Indicator Values = 12.67
Number of Significant Indicators = 29
Significant Indicator Distribution
1 2
8 21
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- Node 2 ---
Complexity(R2) 7.920283
Densiometre< 19.88 Densiometre>=19.88
~ Discriminant species :
TRA THOmtot
% of expl. deviance 10.54536819 27.8848051
Mean on the left 0.02754503 0.5158733
Mean on the right 0.20823621 0.2220475
Error in indval.default(Ynode, clustering = clustnode, numitr = 1000)
: All species must occur in at least one plot
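One possible explanation, offered only as an assumption since the thread has no confirmed answer: the error message shows indval being called on Ynode, that is, on the plots belonging to a single node. A species can have a total abundance > 0 over the whole data set and still be absent from every plot in one node. The Python sketch below illustrates that check with made-up plots and hypothetical species names (sp_a, sp_b); it does not use the mvpart or MVPARTwrap code.

```python
def absent_species(plots, species):
    """Return the species whose abundance is zero in every given plot."""
    return [s for s in species if all(p.get(s, 0) == 0 for p in plots)]

# Hypothetical abundances: one dict per plot
plots = [
    {"sp_a": 0.4, "sp_b": 0.0},
    {"sp_a": 0.5, "sp_b": 0.0},
    {"sp_a": 0.0, "sp_b": 0.2},
]
species = ["sp_a", "sp_b"]

# Over all plots, every species occurs at least once
print(absent_species(plots, species))

# Within a node that holds only the first two plots, sp_b never occurs
node_plots = plots[:2]
print(absent_species(node_plots, species))
```

If this is what happens, checking abundances over the full table would not reveal the problem; the check would have to be repeated on the rows of each node.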

missing glyphs when displaying vector field in paraview?

When visualizing the file below (paraview 4.2.0 64 bit linux), not all glyphs are rendered. For instance, the glyph at [7,0,0] (corresponding to pointID 7) is missing.
Can anyone confirm that behaviour?
file.vti
<?xml version="1.0"?>
<VTKFile type="ImageData" version="0.1" byte_order="LittleEndian" compressor="vtkZLibDataCompressor">
<ImageData WholeExtent="0 7 0 15 0 0" Origin="0 0 0" Spacing="1 1 1">
<Piece Extent="0 7 0 15 0 0">
<PointData>
<DataArray type="Float64" Name="thedata" NumberOfComponents="3" format="ascii" RangeMin="1.7320508076" RangeMax="221.70250337">
1 -1 1 2 -2 2
3 -3 3 4 -4 4
5 -5 5 6 -6 6
7 -7 7 8 -8 8
9 -9 9 10 -10 10
11 -11 11 12 -12 12
13 -13 13 14 -14 14
15 -15 15 16 -16 16
17 -17 17 18 -18 18
19 -19 19 20 -20 20
21 -21 21 22 -22 22
23 -23 23 24 -24 24
25 -25 25 26 -26 26
27 -27 27 28 -28 28
29 -29 29 30 -30 30
31 -31 31 32 -32 32
33 -33 33 34 -34 34
35 -35 35 36 -36 36
37 -37 37 38 -38 38
39 -39 39 40 -40 40
41 -41 41 42 -42 42
43 -43 43 44 -44 44
45 -45 45 46 -46 46
47 -47 47 48 -48 48
49 -49 49 50 -50 50
51 -51 51 52 -52 52
53 -53 53 54 -54 54
55 -55 55 56 -56 56
57 -57 57 58 -58 58
59 -59 59 60 -60 60
61 -61 61 62 -62 62
63 -63 63 64 -64 64
65 -65 65 66 -66 66
67 -67 67 68 -68 68
69 -69 69 70 -70 70
71 -71 71 72 -72 72
73 -73 73 74 -74 74
75 -75 75 76 -76 76
77 -77 77 78 -78 78
79 -79 79 80 -80 80
81 -81 81 82 -82 82
83 -83 83 84 -84 84
85 -85 85 86 -86 86
87 -87 87 88 -88 88
89 -89 89 90 -90 90
91 -91 91 92 -92 92
93 -93 93 94 -94 94
95 -95 95 96 -96 96
97 -97 97 98 -98 98
99 -99 99 100 -100 100
101 -101 101 102 -102 102
103 -103 103 104 -104 104
105 -105 105 106 -106 106
107 -107 107 108 -108 108
109 -109 109 110 -110 110
111 -111 111 112 -112 112
113 -113 113 114 -114 114
115 -115 115 116 -116 116
117 -117 117 118 -118 118
119 -119 119 120 -120 120
121 -121 121 122 -122 122
123 -123 123 124 -124 124
125 -125 125 126 -126 126
127 -127 127 128 -128 128
</DataArray>
</PointData>
<CellData>
</CellData>
</Piece>
</ImageData>
</VTKFile>
Try changing the Glyph Mode to All Points. In 4.2, the default is to use a sampling mechanism to attempt to get uniformly distributed glyphs.

Comparing area between two matrices with multiple points of intersection (MATLAB)

I have two matrices that contain the points for the top boundaries of two jigsaw puzzles. I am trying to calculate the area contained between these two matrices (of unequal rows). They have multiple points of intersection. The picture below gives a better idea of what I'm trying to accomplish. The output should be a numerical value (of the total area highlighted in blue). If this is too difficult to achieve, is there a better way to compare matrices to see which ones "fit" the best?
[Figure: un-highlighted picture of the two boundaries]
[Figure: the area (in blue) that I am trying to calculate as a numerical value]
The matrix values are below if it helps:
Matrix 1:
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 2
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 2
22 2
23 2
24 2
25 2
26 2
27 2
28 2
29 2
30 2
31 2
32 2
33 2
34 2
35 2
36 2
37 2
38 2
39 2
40 2
41 2
42 2
43 2
44 2
45 2
46 2
47 2
48 2
49 2
50 2
51 2
52 2
53 2
54 2
55 2
56 2
57 2
58 2
59 2
60 2
61 2
62 2
63 2
64 2
65 2
66 2
67 2
68 2
69 2
70 2
71 2
72 3
73 3
74 3
75 4
76 5
77 6
78 7
79 8
79 9
80 10
80 11
80 12
80 13
80 14
80 15
79 16
79 17
79 18
78 19
78 20
78 21
77 22
77 23
77 24
76 25
76 26
76 27
75 28
75 29
75 30
74 31
74 32
74 33
73 34
73 35
73 36
73 37
73 38
72 39
72 40
72 41
72 42
72 43
72 44
73 45
73 46
73 47
74 48
75 49
76 50
77 51
78 52
79 53
80 53
81 54
82 54
83 55
84 55
85 56
86 56
87 57
88 57
89 57
90 58
91 58
92 58
93 58
94 59
95 59
96 59
97 59
98 59
99 59
100 59
101 59
102 59
103 59
104 59
105 59
106 59
107 59
108 59
109 59
110 59
111 59
112 58
113 58
114 58
115 58
116 57
117 57
118 57
119 56
120 56
121 56
122 55
123 55
124 54
125 53
126 53
127 52
128 51
129 51
130 50
131 49
132 48
132 47
133 46
133 45
133 44
133 43
133 42
133 41
133 40
133 39
133 38
132 37
132 36
132 35
131 34
131 33
131 32
131 31
130 30
130 29
130 28
130 27
129 26
129 25
128 24
128 23
127 22
127 21
127 20
127 19
126 18
126 17
126 16
126 15
126 14
126 13
126 12
126 11
126 10
126 9
127 8
128 7
129 6
130 5
131 4
132 3
133 3
134 2
135 2
136 2
137 2
138 2
139 2
140 2
141 2
142 2
143 2
144 2
145 2
146 2
147 2
148 2
149 2
150 2
151 2
152 2
153 2
154 2
155 2
156 2
157 2
158 2
159 2
160 2
161 2
162 2
163 2
164 2
165 2
166 2
167 2
168 1
169 1
170 1
171 1
172 1
Matrix 2:
173 3
172 3
171 3
170 2
169 2
168 2
167 2
166 2
165 2
164 2
163 2
162 2
161 2
160 2
159 2
158 2
157 2
156 2
155 2
154 2
153 2
152 2
151 2
150 2
149 2
148 2
147 2
146 2
145 2
144 2
143 2
142 2
141 2
140 2
139 2
138 2
137 2
136 2
135 3
134 3
133 3
132 3
131 4
130 4
129 4
128 5
127 6
127 7
127 8
126 9
127 10
127 11
127 12
127 13
127 14
126 15
127 16
127 17
127 18
127 19
127 20
127 21
128 22
128 23
128 24
128 25
129 26
129 27
129 28
130 29
130 30
130 31
131 32
131 33
131 34
132 35
132 36
132 37
132 38
133 39
133 40
133 41
133 42
133 43
132 44
132 45
132 46
131 47
130 48
129 49
128 50
127 51
126 52
125 53
124 54
123 54
122 55
121 55
120 55
119 56
118 56
117 57
116 58
115 58
114 59
113 59
112 59
111 59
110 60
109 60
108 60
107 60
106 60
105 60
104 60
103 60
102 60
101 60
100 60
99 60
98 60
97 60
96 60
95 59
94 59
93 59
92 59
91 59
90 58
89 58
88 57
87 57
86 56
85 56
84 55
83 55
82 54
81 54
80 53
79 52
78 51
77 50
76 49
75 48
74 47
73 46
73 45
73 44
73 43
73 42
73 41
73 40
73 39
73 38
73 37
73 36
74 35
74 34
74 33
75 32
75 31
75 30
76 29
76 28
76 27
77 26
77 25
77 24
78 23
78 22
78 21
79 20
79 19
80 18
80 17
80 16
81 15
81 14
81 13
81 12
81 11
81 10
80 10
79 9
79 8
78 7
77 6
76 5
75 4
74 4
73 3
72 3
71 2
70 2
69 2
68 1
67 2
66 2
65 2
64 2
63 2
62 2
61 2
60 2
59 2
58 2
57 2
56 2
55 2
54 2
53 2
52 2
51 2
50 2
49 2
48 2
47 2
46 2
45 2
44 2
43 2
42 2
41 2
40 2
39 2
38 2
37 2
36 2
35 2
34 2
33 2
32 2
31 2
30 2
29 2
28 2
27 2
26 2
25 2
24 2
23 2
22 2
21 2
20 2
19 2
18 2
17 2
16 2
15 2
14 2
13 2
12 2
11 2
10 2
9 2
8 2
7 2
6 2
5 2
4 2
3 2
2 2
1 3
Mark everything above one curve and mark everything below the other curve. Then you can get the area between the curves by finding where there are two marks.
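The marking idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions: both curves are single-valued, sampled at the same integer x positions, and have non-negative integer heights. The posted boundaries double back in x, so they would need a true 2-D fill or resampling before this applies. The function name and the toy curves are hypothetical.

```python
def area_between(curve_a, curve_b):
    """Count unit cells lying above the lower curve and below the upper one."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must be sampled at the same x positions")
    height = max(max(curve_a), max(curve_b))
    area = 0
    for ya, yb in zip(curve_a, curve_b):
        lower, upper = min(ya, yb), max(ya, yb)
        for y in range(height):
            above_lower = y >= lower   # first mark
            below_upper = y < upper    # second mark
            if above_lower and below_upper:
                area += 1              # cell carries both marks
    return area

# Toy curves that cross more than once
curve_a = [1, 2, 3, 2, 1]
curve_b = [2, 2, 1, 3, 1]
print(area_between(curve_a, curve_b))
```

Taking the lower and upper curve per column makes the count insensitive to which curve is on top, so multiple intersections do not cancel each other out.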
You should take a look at this question: "Similarity measures between curves?"