Subset a data.frame based on a date-time column
I have the following data.frame with about 18 million records:
Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
1 M 28 69 85 2010-02-16 12:42:32 85 2010-02-16 12:45:37 3.1
2 M 30 11 85 2010-02-16 12:53:29 26 2010-02-16 13:22:23 28.9
3 M 37 43 85 2010-02-16 13:21:46 13 2010-02-16 13:49:47 28.0
4 M 37 826 22 2010-02-16 14:06:40 85 2010-02-16 14:23:13 16.6
5 M 19 662 27 2010-02-16 15:31:15 74 2010-02-16 16:29:17 58.0
6 F 25 8 85 2010-02-16 16:31:53 20 2010-02-16 16:49:26 17.6
17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
I want to write a function to subset the trips from January 2015. The input is "201501" and the result should be:
Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
As suggested in this answer, you could convert your dataset into an xts object and then use its time-based subsetting:
xtsdf <- xts::xts(df, order.by = df$DateTimeDepa)
xtsdf["201501"]
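Which gives (note that every column comes back as character, because xts stores its data in a matrix and a matrix can hold only one type):
# Gender Age Bici DepartingSta DateTimeDepa ArrivingSta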
#2015-01-30 23:58:33 "F" "26" "2760" "121" "2015-01-30 23:58:33" "106"
#2015-01-30 23:58:50 "M" "22" "4077" " 71" "2015-01-30 23:58:50" "190"
#2015-01-30 23:58:55 "M" "32" " 699" "154" "2015-01-30 23:58:55" "165"
#2015-01-30 23:59:20 "F" "26" "4044" " 64" "2015-01-30 23:59:20" " 50"
#2015-01-30 23:59:23 "M" "26" "3114" " 26" "2015-01-30 23:59:23" "127"
#2015-01-30 23:59:55 "M" "25" "4115" "165" "2015-01-30 23:59:55" " 73"
# DateTimeArri TravelTime
#2015-01-30 23:58:33 "2015-01-31 00:22:08" "23.6"
#2015-01-30 23:58:50 "2015-01-31 00:13:24" "14.6"
#2015-01-30 23:58:55 "2015-01-31 00:02:25" " 3.5"
#2015-01-30 23:59:20 "2015-01-31 00:05:38" " 6.3"
#2015-01-30 23:59:23 "2015-01-31 00:12:29" "13.1"
#2015-01-30 23:59:55 "2015-01-31 00:12:39" "12.7"
Here's how you can solve this using base R format(), vectorized string comparison, and subset():
df <- data.frame(
  Gender=c('M','M','M','M','M','F','F','M','M','F','M','M'),
  Age=c(28L,30L,37L,37L,19L,25L,26L,22L,32L,26L,26L,25L),
  Bici=c(69L,11L,43L,826L,662L,8L,2760L,4077L,699L,4044L,3114L,4115L),
  DepartingSta=c(85L,85L,85L,22L,27L,85L,121L,71L,154L,64L,26L,165L),
  DateTimeDepa=as.POSIXct(c('2010-02-16 12:42:32','2010-02-16 12:53:29','2010-02-16 13:21:46','2010-02-16 14:06:40','2010-02-16 15:31:15','2010-02-16 16:31:53',
                            '2015-01-30 23:58:33','2015-01-30 23:58:50','2015-01-30 23:58:55','2015-01-30 23:59:20','2015-01-30 23:59:23','2015-01-30 23:59:55')),
  ArrivingSta=c(85L,26L,13L,85L,74L,20L,106L,190L,165L,50L,127L,73L),
  DateTimeArri=as.POSIXct(c('2010-02-16 12:45:37','2010-02-16 13:22:23','2010-02-16 13:49:47','2010-02-16 14:23:13','2010-02-16 16:29:17','2010-02-16 16:49:26',
                            '2015-01-31 00:22:08','2015-01-31 00:13:24','2015-01-31 00:02:25','2015-01-31 00:05:38','2015-01-31 00:12:29','2015-01-31 00:12:39')),
  TravelTime=c(3.1,28.9,28,16.6,58,17.6,23.6,14.6,3.5,6.3,13.1,12.7),
  row.names=c('1','2','3','4','5','6','17919307','17919308','17919309','17919310','17919311','17919312'),
  stringsAsFactors=FALSE
);
ym <- '201501';
df;
## Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
## 1 M 28 69 85 2010-02-16 12:42:32 85 2010-02-16 12:45:37 3.1
## 2 M 30 11 85 2010-02-16 12:53:29 26 2010-02-16 13:22:23 28.9
## 3 M 37 43 85 2010-02-16 13:21:46 13 2010-02-16 13:49:47 28.0
## 4 M 37 826 22 2010-02-16 14:06:40 85 2010-02-16 14:23:13 16.6
## 5 M 19 662 27 2010-02-16 15:31:15 74 2010-02-16 16:29:17 58.0
## 6 F 25 8 85 2010-02-16 16:31:53 20 2010-02-16 16:49:26 17.6
## 17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
## 17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
## 17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
## 17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
## 17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
## 17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
ym;
## [1] "201501"
subset(df,format(DateTimeDepa,'%Y%m')==ym);
## Gender Age Bici DepartingSta DateTimeDepa ArrivingSta DateTimeArri TravelTime
## 17919307 F 26 2760 121 2015-01-30 23:58:33 106 2015-01-31 00:22:08 23.6
## 17919308 M 22 4077 71 2015-01-30 23:58:50 190 2015-01-31 00:13:24 14.6
## 17919309 M 32 699 154 2015-01-30 23:58:55 165 2015-01-31 00:02:25 3.5
## 17919310 F 26 4044 64 2015-01-30 23:59:20 50 2015-01-31 00:05:38 6.3
## 17919311 M 26 3114 26 2015-01-30 23:59:23 127 2015-01-31 00:12:29 13.1
## 17919312 M 25 4115 165 2015-01-30 23:59:55 73 2015-01-31 00:12:39 12.7
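Since the question asks for a function, the one-liner above could be wrapped as follows. This is only a sketch: the helper names subset_month and subset_month_range, their arguments, and the small trips data frame are made up for illustration. The second variant compares against the month boundaries instead of formatting every timestamp, which may be cheaper on 18 million rows; it assumes the column's time zone matches the tz argument.

```r
# Hypothetical helper (name and arguments are assumptions, not from the thread):
# keep the rows whose date-time column falls in the month given as "YYYYMM".
subset_month <- function(data, ym, col = "DateTimeDepa") {
  stopifnot(is.character(ym), length(ym) == 1L, nchar(ym) == 6L)
  keep <- format(data[[col]], "%Y%m") == ym
  data[which(keep), , drop = FALSE]  # which() drops rows with NA timestamps
}

# Hypothetical variant: compare against [start of month, start of next month)
# instead of building one string per row. Assumes `tz` matches the column's zone.
subset_month_range <- function(data, ym, col = "DateTimeDepa", tz = "UTC") {
  start <- as.POSIXct(paste0(ym, "01"), format = "%Y%m%d", tz = tz)
  end <- seq(start, by = "month", length.out = 2L)[2L]
  x <- data[[col]]
  data[which(x >= start & x < end), , drop = FALSE]
}

# Tiny made-up sample, just to exercise both helpers.
trips <- data.frame(
  Bici = c(69L, 2760L, 4077L),
  DateTimeDepa = as.POSIXct(
    c("2010-02-16 12:42:32", "2015-01-30 23:58:33", "2015-01-30 23:58:50"),
    tz = "UTC"
  )
)

jan_a <- subset_month(trips, "201501")
jan_b <- subset_month_range(trips, "201501")
print(jan_a)
```

Both calls should return the same two January 2015 rows for this sample. With real data, check which time zone DateTimeDepa was parsed in before relying on the range version.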
Related
range of number to subset slices
I would like to reshape a vector into a number of 'slices' (in Matlab) but find myself in a brain freeze and can't come up with a good way (e.g. a one-liner) to do it: a=1:119; slices=[47 24 1 47]; result={1:47,48:71,...}; The result doesn't need to be stored in a cell array. Thanks
This is what mat2cell does:
>> a=1:119;
>> slices=[47 24 1 47];
>> result = mat2cell(a, 1, slices) % 1 is # of rows in result
result =
{
  [1,1] =
    Columns 1 through 15:
      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Columns 16 through 30:
      16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    Columns 31 through 45:
      31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
    Columns 46 and 47:
      46 47
  [1,2] =
    Columns 1 through 15:
      48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
    Columns 16 through 24:
      63 64 65 66 67 68 69 70 71
  [1,3] = 72
  [1,4] =
    Columns 1 through 13:
      73 74 75 76 77 78 79 80 81 82 83 84 85
    Columns 14 through 26:
      86 87 88 89 90 91 92 93 94 95 96 97 98
    Columns 27 through 39:
      99 100 101 102 103 104 105 106 107 108 109 110 111
    Columns 40 through 47:
      112 113 114 115 116 117 118 119
}
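For readers coming from the R thread above, roughly the same slicing can be sketched in R. This is an analogue for comparison, not MATLAB code and not part of the original answer: each element is labelled with the index of its slice, and the vector is then split on that label.

```r
a <- 1:119
slices <- c(47, 24, 1, 47)

# The slice lengths must account for every element of the vector.
stopifnot(sum(slices) == length(a))

# Label each element with the index of the slice it belongs to, then split.
group <- rep(seq_along(slices), times = slices)
result <- split(a, group)

print(lengths(result))
```

The result is a list of four integer vectors, playing the role of the cell array returned by mat2cell.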
Matlab Spearman Correlation PVAL = 0?
I am conducting Spearman's Correlation with two data sets with 300 objects. These are my variables and commands: a = [1:300] b = [1 2 5 11 9 7 24 10 31 23 3 40 6 17 14 20 16 12 33 46 70 37 87 43 98 26 59 58 77 100 35 42 78 80 243 36 33327 4 83 160 163 198 86 94 406 111 28 29 55 113 239 295 110 196 177 32679 229 342 305 300 254 96 210 514 167 172 232 190 117 32081 25 158 19333 241 82 149 159 66 178 24487 68 30 1016 725 266 391 638 348 320 681 242 319 228 381 408 442 202 369 471 821 191 426 8 270 211 2266 619 576 441 680 3431 1167 723 74 318 556 640 395 1059 579 614 212 325 437 323 687 373 599 26637 985 54 84 802 724 154 417 240 1120 818 2309 462 109 104 509 494 427 57 2475 549 396 419 123 580 79 225 1132 351 76 16859 596 862 315 470 992 257 120 409 751 832 285 1534 714 1665 1376 2129 678 416 721 209 31971 183 356 1346 1015 1003 188 1076 1634 608 1056 338 308 145 418 625 1313 121 2484 996 783 329 1185 697 157 1100 175 622 235 456 277 166 2700 1439 461 653 433 540 1191 234 774 1894 1004 741 1062 948 48 99 405 797 237 1104 2286 22620 1429 30672 1808 169 458 22 1115 10660 872 474 1063 88 1727 1017 1107 1398 1519 703 1092 1027 272 263 1152 1770 1099 507 385 2118 19356 1778 2458 410 2110 7522 17166 4065 15136 13294 10876 17174 2434 9898 5663 13594 10506 11552 15635 9322 3223 8949 12388 13216 13851 13852 6696 12177 4700 17199 2067 11110 15486 5664 6593 4701 527 8616 268] [RHO,PVAL] = corr(b',a','Type', 'Spearman') RHO = 0.6954 PVAL = 0 Out of the 5 comparisons I made with other data sets of 300 objects, only 1 returned significant P-values. Is there an explanation for this?
I tried a different data set and got a value that was not significant (PVAL > 0.05). I also displayed the result in long (15-digit) exponential form using format longEng and still got 0.00000000000000e+000. Another statistics program reported the same p-value as < 0.0001. So the p-value is not literally zero; it is just really, really small, small enough that MATLAB reports it as 0.
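One way a tiny p-value can come out as exactly 0 is floating-point rounding when it is computed as 1 minus a cumulative probability. The R sketch below illustrates this with the reported RHO and sample size. It assumes the large-sample t approximation for Spearman's rho, which may not be what MATLAB's corr uses internally, so treat it as an illustration of the rounding effect rather than a reproduction of MATLAB's result.

```r
rho <- 0.6954  # reported Spearman correlation
n <- 300       # number of observations

# Large-sample t approximation (an assumption; see the note above).
t_stat <- rho * sqrt((n - 2) / (1 - rho^2))

# Naive form: the CDF is so close to 1 that it rounds to 1, so 1 - CDF is 0.
p_naive <- 2 * (1 - pt(t_stat, df = n - 2))

# Upper-tail form: computes the tail directly, so the tiny value survives.
p_tail <- 2 * pt(t_stat, df = n - 2, lower.tail = FALSE)

print(c(naive = p_naive, tail = p_tail))
```

The naive form prints 0, while the upper-tail form prints a very small positive number, which is consistent with the other program reporting p < 0.0001.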
MRT function [1] "error code = 0"
When I make my MRT, I get two errors: [1] "error code = 0" and Error in indval.default(Ynode, clustering = clustnode, numitr = 1000) : All species must occur in at least one plot. Does anyone have an idea why? I checked, and all my species have an abundance > 0...
MRTtest=mvpart(vegetation~ Placette+ Tourb + Transect + Largcanal + Annouvert + Elevation + Profnappe + Litiere+ Solnu+ Deblign+ Densiometre+ EpaissMO+ Vonpostvingt+ Vonpostsoixante+ Pyrovingt+ Pyrosoixante+ Sommesurfterr,tot,margin=0.08,cp=0,xv="pick",xval=10,xvmult=150,which=4,pca=F)
X-Val rep : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Minimum tree sizes
tabmins
  2   3   4   6
  2 125   5  18
MRTtest1=MRT(MRTtest,percent=10,species=colnames(vegetation))
summary(MRTtest1)
Portion (%) of deviance explained by species for every particular node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- Node 1 ---
Complexity(R2) 14.87422
Sommesurfterr>=6.024 Sommesurfterr< 6.024
~ Discriminant species :
                       THOnmtot      THOmtot
% of expl. deviance 17.61057298 38.419650816
Mean on the left     0.37621604  0.430818462
Mean on the right    0.08877576  0.006259911
[1] "error code = 0"
~ INDVAL species for this node: : left is 1, right is 2
         cluster indicator_value probability
THOmtot        1          0.9597       0.001
THOnmtot       1          0.7878       0.001
LEG            1          0.5802       0.031
LIB            1          0.5078       0.010
MELnmtot       1          0.4710       0.047
EPNnmtot       1          0.4404       0.026
Sum of probabilities = 87.497
Sum of Indicator Values = 30.02
Sum of Significant Indicator Values = 12.67
Number of Significant Indicators = 29
Significant Indicator Distribution
 1  2
 8 21
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- Node 2 ---
Complexity(R2) 7.920283
Densiometre< 19.88 Densiometre>=19.88
~ Discriminant species :
                            TRA    THOmtot
% of expl. deviance 10.54536819 27.8848051
Mean on the left     0.02754503  0.5158733
Mean on the right    0.20823621  0.2220475
Error in indval.default(Ynode, clustering = clustnode, numitr = 1000) : All species must occur in at least one plot
missing glyphs when displaying vector field in paraview?
When visualizing the file below (paraview 4.2.0 64 bit linux), not all glyphs are rendered. For instance, the glyph at [7,0,0] (corresponding to pointID 7) is missing. Can anyone confirm that behaviour? file.vti <?xml version="1.0"?> <VTKFile type="ImageData" version="0.1" byte_order="LittleEndian" compressor="vtkZLibDataCompressor"> <ImageData WholeExtent="0 7 0 15 0 0" Origin="0 0 0" Spacing="1 1 1"> <Piece Extent="0 7 0 15 0 0"> <PointData> <DataArray type="Float64" Name="thedata" NumberOfComponents="3" format="ascii" RangeMin="1.7320508076" RangeMax="221.70250337"> 1 -1 1 2 -2 2 3 -3 3 4 -4 4 5 -5 5 6 -6 6 7 -7 7 8 -8 8 9 -9 9 10 -10 10 11 -11 11 12 -12 12 13 -13 13 14 -14 14 15 -15 15 16 -16 16 17 -17 17 18 -18 18 19 -19 19 20 -20 20 21 -21 21 22 -22 22 23 -23 23 24 -24 24 25 -25 25 26 -26 26 27 -27 27 28 -28 28 29 -29 29 30 -30 30 31 -31 31 32 -32 32 33 -33 33 34 -34 34 35 -35 35 36 -36 36 37 -37 37 38 -38 38 39 -39 39 40 -40 40 41 -41 41 42 -42 42 43 -43 43 44 -44 44 45 -45 45 46 -46 46 47 -47 47 48 -48 48 49 -49 49 50 -50 50 51 -51 51 52 -52 52 53 -53 53 54 -54 54 55 -55 55 56 -56 56 57 -57 57 58 -58 58 59 -59 59 60 -60 60 61 -61 61 62 -62 62 63 -63 63 64 -64 64 65 -65 65 66 -66 66 67 -67 67 68 -68 68 69 -69 69 70 -70 70 71 -71 71 72 -72 72 73 -73 73 74 -74 74 75 -75 75 76 -76 76 77 -77 77 78 -78 78 79 -79 79 80 -80 80 81 -81 81 82 -82 82 83 -83 83 84 -84 84 85 -85 85 86 -86 86 87 -87 87 88 -88 88 89 -89 89 90 -90 90 91 -91 91 92 -92 92 93 -93 93 94 -94 94 95 -95 95 96 -96 96 97 -97 97 98 -98 98 99 -99 99 100 -100 100 101 -101 101 102 -102 102 103 -103 103 104 -104 104 105 -105 105 106 -106 106 107 -107 107 108 -108 108 109 -109 109 110 -110 110 111 -111 111 112 -112 112 113 -113 113 114 -114 114 115 -115 115 116 -116 116 117 -117 117 118 -118 118 119 -119 119 120 -120 120 121 -121 121 122 -122 122 123 -123 123 124 -124 124 125 -125 125 126 -126 126 127 -127 127 128 -128 128 </DataArray> </PointData> <CellData> </CellData> </Piece> </ImageData> </VTKFile>
Try changing the Glyph Mode to All Points. In 4.2, the default is to use a sampling mechanism to attempt to get uniformly distributed glyphs.
Comparing area between two matrices with multiple points of intersection (MATLAB)
I have two matrices that contain the points for the top boundaries of two jigsaw puzzles. I am trying to calculate the area contained between these two matrices (of unequal rows). They have multiple points of intersection. The picture below gives a better idea of what I'm trying to accomplish. The output should be a numerical value (of the total area highlighted in blue). If this is too difficult to achieve, is there a better way to compare matrices to see which ones "fit" the best? Un-Highlighted Picture Area(in blue) that I am trying to calculate (numerical value) The matrix values are below if it helps: Matrix 1: 1 1 2 2 3 2 4 2 5 2 6 3 7 3 8 3 9 3 10 3 11 3 12 2 13 2 14 2 15 2 16 2 17 2 18 2 19 2 20 2 21 2 22 2 23 2 24 2 25 2 26 2 27 2 28 2 29 2 30 2 31 2 32 2 33 2 34 2 35 2 36 2 37 2 38 2 39 2 40 2 41 2 42 2 43 2 44 2 45 2 46 2 47 2 48 2 49 2 50 2 51 2 52 2 53 2 54 2 55 2 56 2 57 2 58 2 59 2 60 2 61 2 62 2 63 2 64 2 65 2 66 2 67 2 68 2 69 2 70 2 71 2 72 3 73 3 74 3 75 4 76 5 77 6 78 7 79 8 79 9 80 10 80 11 80 12 80 13 80 14 80 15 79 16 79 17 79 18 78 19 78 20 78 21 77 22 77 23 77 24 76 25 76 26 76 27 75 28 75 29 75 30 74 31 74 32 74 33 73 34 73 35 73 36 73 37 73 38 72 39 72 40 72 41 72 42 72 43 72 44 73 45 73 46 73 47 74 48 75 49 76 50 77 51 78 52 79 53 80 53 81 54 82 54 83 55 84 55 85 56 86 56 87 57 88 57 89 57 90 58 91 58 92 58 93 58 94 59 95 59 96 59 97 59 98 59 99 59 100 59 101 59 102 59 103 59 104 59 105 59 106 59 107 59 108 59 109 59 110 59 111 59 112 58 113 58 114 58 115 58 116 57 117 57 118 57 119 56 120 56 121 56 122 55 123 55 124 54 125 53 126 53 127 52 128 51 129 51 130 50 131 49 132 48 132 47 133 46 133 45 133 44 133 43 133 42 133 41 133 40 133 39 133 38 132 37 132 36 132 35 131 34 131 33 131 32 131 31 130 30 130 29 130 28 130 27 129 26 129 25 128 24 128 23 127 22 127 21 127 20 127 19 126 18 126 17 126 16 126 15 126 14 126 13 126 12 126 11 126 10 126 9 127 8 128 7 129 6 130 5 131 4 132 3 133 3 134 2 135 2 136 2 137 2 138 2 139 2 140 2 141 2 142 2 
143 2 144 2 145 2 146 2 147 2 148 2 149 2 150 2 151 2 152 2 153 2 154 2 155 2 156 2 157 2 158 2 159 2 160 2 161 2 162 2 163 2 164 2 165 2 166 2 167 2 168 1 169 1 170 1 171 1 172 1 Matrix 2: 173 3 172 3 171 3 170 2 169 2 168 2 167 2 166 2 165 2 164 2 163 2 162 2 161 2 160 2 159 2 158 2 157 2 156 2 155 2 154 2 153 2 152 2 151 2 150 2 149 2 148 2 147 2 146 2 145 2 144 2 143 2 142 2 141 2 140 2 139 2 138 2 137 2 136 2 135 3 134 3 133 3 132 3 131 4 130 4 129 4 128 5 127 6 127 7 127 8 126 9 127 10 127 11 127 12 127 13 127 14 126 15 127 16 127 17 127 18 127 19 127 20 127 21 128 22 128 23 128 24 128 25 129 26 129 27 129 28 130 29 130 30 130 31 131 32 131 33 131 34 132 35 132 36 132 37 132 38 133 39 133 40 133 41 133 42 133 43 132 44 132 45 132 46 131 47 130 48 129 49 128 50 127 51 126 52 125 53 124 54 123 54 122 55 121 55 120 55 119 56 118 56 117 57 116 58 115 58 114 59 113 59 112 59 111 59 110 60 109 60 108 60 107 60 106 60 105 60 104 60 103 60 102 60 101 60 100 60 99 60 98 60 97 60 96 60 95 59 94 59 93 59 92 59 91 59 90 58 89 58 88 57 87 57 86 56 85 56 84 55 83 55 82 54 81 54 80 53 79 52 78 51 77 50 76 49 75 48 74 47 73 46 73 45 73 44 73 43 73 42 73 41 73 40 73 39 73 38 73 37 73 36 74 35 74 34 74 33 75 32 75 31 75 30 76 29 76 28 76 27 77 26 77 25 77 24 78 23 78 22 78 21 79 20 79 19 80 18 80 17 80 16 81 15 81 14 81 13 81 12 81 11 81 10 80 10 79 9 79 8 78 7 77 6 76 5 75 4 74 4 73 3 72 3 71 2 70 2 69 2 68 1 67 2 66 2 65 2 64 2 63 2 62 2 61 2 60 2 59 2 58 2 57 2 56 2 55 2 54 2 53 2 52 2 51 2 50 2 49 2 48 2 47 2 46 2 45 2 44 2 43 2 42 2 41 2 40 2 39 2 38 2 37 2 36 2 35 2 34 2 33 2 32 2 31 2 30 2 29 2 28 2 27 2 26 2 25 2 24 2 23 2 22 2 21 2 20 2 19 2 18 2 17 2 16 2 15 2 14 2 13 2 12 2 11 2 10 2 9 2 8 2 7 2 6 2 5 2 4 2 3 2 2 2 1 3
Mark everything above one curve and everything below the other curve. The area between the curves is then the region that carries both marks. You should also take a look at the question "Similarity measures between curves?".