Failure Message: java.io.IOException: Failed to write 0 bytes in 'gs://<my_bucket_name>/events/1645565167200/.temp-beam-0a2f3ce9-1df3-45a5-a7f0-1cb1' - google-cloud-storage

I am using Spotify's Flink on Kubernetes operator to run a simple Apache Beam pipeline that reads from Kafka and writes to GCS in Parquet format.
I am new to Apache Beam and Flink and still trying to figure out the concepts.
The pipeline has a window of 2 minutes.
I use withCreateTime(Duration.standardMinutes(0)) to use the Kafka record create time, and withNaming(FileNaming.getNaming()) to create directories in yyyy-MM-dd-HH format so that each hourly window's files are written to the corresponding directory.
I also use withTempDirectory() to place temporary files.
When I run the pipeline against a large backlog of millions of records in Kafka, I get the following exception after the first batch of files is written.
ERROR org.apache.beam.sdk.io.FileBasedSink$Writer - Beginning write to gs://my_bucket_name/events/1645565167200/.temp-beam-0a2f3ce9-1df3-45a5-a7f0-149a57848a85/cb125f19-61c1-46c1-bc24-89239206c75b failed, closing channel.
java.io.IOException: Failed to write 0 bytes in 'gs://my_bucket_name/events/1645565167200/.temp-beam-0a2f3ce9-1df3-45a5-a7f0-149a57848a85/cb125f19-61c1-46c1-bc24-89239206c75b'
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:187) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[?:1.8.0_312]
at java.nio.channels.Channels.writeFully(Channels.java:101) ~[?:1.8.0_312]
at java.nio.channels.Channels.access$000(Channels.java:61) ~[?:1.8.0_312]
at java.nio.channels.Channels$1.write(Channels.java:174) ~[?:1.8.0_312]
at org.apache.beam.sdk.io.parquet.ParquetIO$Sink$BeamOutputStream.write(ParquetIO.java:1303) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.parquet.ParquetIO$Sink$BeamOutputStream.write(ParquetIO.java:1298) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.parquet.hadoop.ParquetFileWriter.start(ParquetFileWriter.java:394) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:293) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:641) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.parquet.ParquetIO$Sink.open(ParquetIO.java:1233) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.FileIO$Write$ViaFileBasedSink$1$1.prepareWrite(FileIO.java:1377) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.FileBasedSink$Writer.open(FileBasedSink.java:961) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:921) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source) ~[?:?]
at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:188) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:645) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:50) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.beam.runners.flink.FlinkStreamingTransformTranslators$ToGroupByKeyResult.flatMap(FlinkStreamingTransformTranslators.java:1416) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.beam.runners.flink.FlinkStreamingTransformTranslators$ToGroupByKeyResult.flatMap(FlinkStreamingTransformTranslators.java:1388) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:47) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28) ~[flink-dist_2.12-1.13.3.jar:1.13.3]
at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:1171) ~[blob_p-ed6ce4c16e52d488153c17d014c718f2ca9e97c5-bfc37bda64724db1bd6699a32347c2d6:?]
Please advise.

Related

How to apply k-means without normalization and k-means with normalization on my dataset?

I want to compute the clustering accuracy of k-means both with and without normalization, and compare the two results.
My dataset looks like this:
age chol
63 233
37 250
41 204
56 236
57 354
57 192
56 294
44 263
52 199
57 168
54 239
48 275
49 266
64 211
58 283
50 219
58 340
66 226
43 247
69 239
59 234
44 233
42 226
61 243
40 199
71 302
59 212
51 175
65 417
53 197
41 198
65 177
44 219
54 273
51 213
46 177
54 304
54 232
65 269
65 360
51 308
48 245
45 208
53 264
39 321
52 325
44 235
47 257
53 216
53 234
How can I write the code for this in MATLAB and plot the result?
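The question asks for MATLAB, but the comparison itself is language-agnostic. Below is a minimal, self-contained Python sketch of the same idea: a plain Lloyd's-algorithm k-means run once on the raw two-column data and once after min-max scaling (the scaling choice is an assumption; the question does not say which normalization is meant). Only the first rows of the posted dataset are used.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(col) for col in zip(*members))
    return labels

def minmax_normalize(points):
    """Scale each feature to [0, 1] so both columns carry equal weight."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))
            for p in points]

# first (age, chol) rows of the posted dataset
data = [(63, 233), (37, 250), (41, 204), (56, 236), (57, 354), (57, 192)]
raw_labels = kmeans(data, k=2)
norm_labels = kmeans(minmax_normalize(data), k=2)
print(raw_labels, norm_labels)
```

Without scaling, the chol column (range ~150-400) dominates the distance computation over age (range ~37-71); after min-max scaling both columns contribute comparably, which is exactly the difference the accuracy comparison should expose.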

RMarkdown: Creating two side-by-side heatmaps with full figure borders using the pheatmap package

I am writing my first report in RMarkdown and struggling with specific figure alignments.
I have some data that I am manipulating into a format friendly for the package pheatmap such that it produces heatmap HTML output. The code that produces one of these looks like:
cleaned_mayo<- cleaned_mayo[which(cleaned_mayo$Source=="MayoBrainBank_Dickson"),]
# Segregate data
ad<- cleaned_mayo[which(cleaned_mayo$Diagnosis== "AD"),-c(1:13)]
control<- cleaned_mayo[which(cleaned_mayo$Diagnosis== "Control"),-c(1:13)]
# Average data across patients and assign diagnoses
ad<- as.data.frame(t(apply(ad,2, mean)))
control<- as.data.frame(t(apply(control,2, mean)))
ad$Diagnosis<- "AD"
control$Diagnosis<- "Control"
# Combine
avg_heat<- rbind(ad, control)
# Rearrange columns
avg_heat<- avg_heat[,c(32, 1:31)]
# Mean shift all expression values
avg_heat[,2:32]<- apply(avg_heat[,2:32], 2, function(x){x-mean(x)})
#################################
# CREATE HEAT MAP
#################################
# Plot average heat map
pheatmap(t(avg_heat[,2:32]), cluster_col= F, labels_col= c("AD", "Control"),gaps_col = c(1), labels_row = colnames(avg_heat)[2:32],
main= "Mayo Differential Expression for Genes of Interest: Averaged Across \n Patients within a Diagnosis",
show_colnames = T)
Where the numeric columns of cleaned_mayo look like:
C1QA C1QC C1QB LAPTM5 CTSS FCER1G PLEK CSF1R CD74 LY86 AIF1 FGD2 TREM2 PTK2B LYN UNC93B1 CTSC NCKAP1L TMEM119 ALOX5AP LCP1
1924_TCX 1101 1392 1687 1380 380 279 198 1889 6286 127 252 771 338 5795 409 494 337 352 476 170 441
1926_TCX 881 770 950 1064 239 130 132 1241 3188 76 137 434 212 5634 327 419 292 217 464 124 373
1935_TCX 3636 4106 5196 5206 1226 583 476 5588 27650 384 1139 1086 756 14219 1269 869 868 1378 1270 428 1216
1925_TCX 3050 4392 5357 3585 788 472 350 4662 11811 340 865 1051 468 13446 638 420 1047 850 756 616 1008
1963_TCX 3169 2874 4182 2737 828 551 208 2560 10103 204 719 585 499 9158 546 335 598 593 606 418 707
7098_TCX 1354 1803 2369 2134 634 354 245 1829 8322 227 593 371 411 10637 504 294 750 458 367 490 779
ITGAM LPCAT2 LGALS9 GRN MAN2B1 TYROBP CD37 LAIR1 CTSZ CYTH4
1924_TCX 376 649 699 1605 618 392 328 628 1774 484
1926_TCX 225 381 473 1444 597 242 290 321 1110 303
1935_TCX 737 1887 998 2563 856 949 713 1060 2670 569
1925_TCX 634 1323 575 1661 594 562 421 1197 1796 595
1963_TCX 508 696 429 1030 355 556 365 585 1591 360
7098_TCX 418 1011 318 1574 354 353 179 471 1471 321
All of this code is wrapped around the following header in the RMarkdown environment: {r heatmaps, echo=FALSE, results="asis", message=FALSE}.
What I would like to achieve is the two heatmaps side-by-side with black boxes around each individual heat map (i.e. containing the title and legend of the heatmap as well).
If anyone could tell me how to do this, or even just one of the two, it would be greatly appreciated.
Thanks!

Trying to plot a CSV file

I'm trying to plot a CSV file, and this is what it looks like:
Date Ebola: Case counts and deaths from the World Health Organization and WHO situation reports
3/22/2014 49
3/24/2014 86
3/25/2014 86
3/26/2014 86
3/27/2014 103
3/28/2014 112
3/29/2014 112
3/31/2014 122
4/1/2014 127
4/4/2014 143
4/7/2014 151
4/9/2014 158
4/11/2014 159
4/14/2014 168
4/16/2014 197
4/17/2014 203
4/20/2014 208
4/23/2014 218
4/26/2014 224
5/1/2014 226
5/3/2014 231
5/5/2014 235
5/7/2014 236
5/10/2014 233
5/12/2014 248
5/23/2014 258
5/27/2014 281
5/28/2014 291
6/1/2014 328
6/3/2014 344
6/10/2014 351
6/16/2014 398
6/18/2014 390
6/20/2014 390
6/30/2014 413
7/2/2014 412
7/6/2014 408
7/8/2014 409
7/12/2014 406
7/14/2014 411
7/17/2014 410
7/20/2014 415
7/23/2014 427
7/27/2014 460
7/30/2014 472
I imported it into my MATLAB workspace. Now I want to plot this data using MATLAB, but how do I do this? The variables I have for each column are Date and EbolaCaseCountsAndDeathsFromTheWorldHealthOrganizationAndWHOsit (sorry I don't know how to make the latter variable shorter).
I tried doing plot(Date, EbolaCa[...]) but it gives me an error. What is the right way to do it?
You must use both datenum() and datetick() to actually show dates on the x-axis. I was able to create your table snippet as follows:
T={'3/22/2014' 49
'3/24/2014' 86
'3/25/2014' 86
'3/26/2014' 86
'3/27/2014' 103
'3/28/2014' 112
'3/29/2014' 112
'3/31/2014' 122
'4/1/2014' 127
'4/4/2014' 143
'4/7/2014' 151
'4/9/2014' 158
'4/11/2014' 159
'4/14/2014' 168
'4/16/2014' 197
'4/17/2014' 203
'4/20/2014' 208
'4/23/2014' 218
'4/26/2014' 224
'5/1/2014' 226
'5/3/2014' 231
'5/5/2014' 235
'5/7/2014' 236
'5/10/2014' 233
'5/12/2014' 248
'5/23/2014' 258
'5/27/2014' 281
'5/28/2014' 291
'6/1/2014' 328
'6/3/2014' 344
'6/10/2014' 351
'6/16/2014' 398
'6/18/2014' 390
'6/20/2014' 390
'6/30/2014' 413
'7/2/2014' 412
'7/6/2014' 408
'7/8/2014' 409
'7/12/2014' 406
'7/14/2014' 411
'7/17/2014' 410
'7/20/2014' 415
'7/23/2014' 427
'7/27/2014' 460
'7/30/2014' 472};
T=cell2table(T);
T.Properties.VariableNames={'Date','Ebola'};
where the first column is composed of strings and the second column is composed of numbers. To generate the plot() you might want to do something like
figure(1);
plot(datenum(T.Date,'m/dd/yyyy'),T.Ebola);
datetick('x','dd/mmm/yyyy'); grid on;
which shows the case counts with readable dates along the x-axis.
However, feel free to adjust datenum() and datetick() format(s) as you wish.

Saving (in a matrix) the elapsed time and number of iterations for a large number of cases

I have a program that outputs the number of iterations and a test value, given inputs A1,A2,A3,A4.
I want to run through 5 values each of A1, A2, A3, A4, thus making 625 runs. In the process, I want to save the time elapsed for each run, the number of iterations, and test value in 3 separate matrices.
I have tried using 4 nested for loops, and made progress, but need some help on indexing the elements of the matrices. The iterator variables in the for loops don't match the indexing variables...
The code for the 4 nested loops is below:
m = logspace(-4,4,5);
n = logspace(0,8,5);
eltime = zeros(5,length(m)*length(m)*length(m));
for A1 = m
for A2 = m
for A3 = m
for A4 = n
tic
SmallMAX(A1,A2,A3,A4)
toc;
for i=1:numel(eltime)
for j = 1:length(n)
eltime(j,i) = toc;
end
end
end
end
end
end
The code for the main program is excerpted below:
function [k,test] = SmallMAX(A1,A2,A3,A4)
...
end
Thanks for any help.
In your case, the easiest way is to use A1, A2, A3 and A4 as counters instead of the actual values. This way you can use them to index the entries of eltime. We can then easily calculate the index in the second dimension with sub2ind and use A4 to index the first dimension of eltime. We need to adjust the arguments in SmallMAX as well.
Here is the code of the proposed method:
m = logspace(-4,4,5);
n = logspace(0,8,5);
eltime = zeros(length(n),length(m)*length(m)*length(m));
res_k = zeros(length(n),length(m)*length(m)*length(m)); % or zeros(size(eltime));
res_test = zeros(length(n),length(m)*length(m)*length(m)); % or zeros(size(eltime));
for A1 = 1:length(m)
for A2 = 1:length(m)
for A3 = 1:length(m)
for A4 = 1:length(n)
ind = sub2ind([length(m),length(m),length(m)],A3,A2,A1);
tic
[k,test] = SmallMAX(m(A1),m(A2),m(A3),n(A4));
eltime(A4,ind) = toc;
res_k(A4,ind) = k;
res_test(A4,ind) = test;
end
end
end
end
This is the order of the addressed entries of eltime:
eltime_order =
Columns 1 through 18
1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86
2 7 12 17 22 27 32 37 42 47 52 57 62 67 72 77 82 87
3 8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88
4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 89
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
Columns 19 through 36
91 96 101 106 111 116 121 126 131 136 141 146 151 156 161 166 171 176
92 97 102 107 112 117 122 127 132 137 142 147 152 157 162 167 172 177
93 98 103 108 113 118 123 128 133 138 143 148 153 158 163 168 173 178
94 99 104 109 114 119 124 129 134 139 144 149 154 159 164 169 174 179
95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180
Columns 37 through 54
181 186 191 196 201 206 211 216 221 226 231 236 241 246 251 256 261 266
182 187 192 197 202 207 212 217 222 227 232 237 242 247 252 257 262 267
183 188 193 198 203 208 213 218 223 228 233 238 243 248 253 258 263 268
184 189 194 199 204 209 214 219 224 229 234 239 244 249 254 259 264 269
185 190 195 200 205 210 215 220 225 230 235 240 245 250 255 260 265 270
Columns 55 through 72
271 276 281 286 291 296 301 306 311 316 321 326 331 336 341 346 351 356
272 277 282 287 292 297 302 307 312 317 322 327 332 337 342 347 352 357
273 278 283 288 293 298 303 308 313 318 323 328 333 338 343 348 353 358
274 279 284 289 294 299 304 309 314 319 324 329 334 339 344 349 354 359
275 280 285 290 295 300 305 310 315 320 325 330 335 340 345 350 355 360
Columns 73 through 90
361 366 371 376 381 386 391 396 401 406 411 416 421 426 431 436 441 446
362 367 372 377 382 387 392 397 402 407 412 417 422 427 432 437 442 447
363 368 373 378 383 388 393 398 403 408 413 418 423 428 433 438 443 448
364 369 374 379 384 389 394 399 404 409 414 419 424 429 434 439 444 449
365 370 375 380 385 390 395 400 405 410 415 420 425 430 435 440 445 450
Columns 91 through 108
451 456 461 466 471 476 481 486 491 496 501 506 511 516 521 526 531 536
452 457 462 467 472 477 482 487 492 497 502 507 512 517 522 527 532 537
453 458 463 468 473 478 483 488 493 498 503 508 513 518 523 528 533 538
454 459 464 469 474 479 484 489 494 499 504 509 514 519 524 529 534 539
455 460 465 470 475 480 485 490 495 500 505 510 515 520 525 530 535 540
Columns 109 through 125
541 546 551 556 561 566 571 576 581 586 591 596 601 606 611 616 621
542 547 552 557 562 567 572 577 582 587 592 597 602 607 612 617 622
543 548 553 558 563 568 573 578 583 588 593 598 603 608 613 618 623
544 549 554 559 564 569 574 579 584 589 594 599 604 609 614 619 624
545 550 555 560 565 570 575 580 585 590 595 600 605 610 615 620 625
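The sub2ind arithmetic behind that ordering can be sanity-checked outside MATLAB. This Python sketch (illustrative names, 1-based counters to mirror the MATLAB loops) reproduces the column-major mapping and confirms that the nesting above walks the columns of eltime in order 1, 2, 3, ...:

```python
SIZE = 5  # length(m) in the MATLAB code

def column_index(a1, a2, a3, size=SIZE):
    """1-based equivalent of sub2ind([size size size], a3, a2, a1):
    the first subscript (a3) varies fastest, column-major style."""
    return a3 + (a2 - 1) * size + (a1 - 1) * size ** 2

# enumerate the counters in the same nesting order as the MATLAB loops
# (A1 outermost, A3 innermost of the m-loops)
visited = [column_index(a1, a2, a3)
           for a1 in range(1, SIZE + 1)
           for a2 in range(1, SIZE + 1)
           for a3 in range(1, SIZE + 1)]
print(visited[:6])  # [1, 2, 3, 4, 5, 6]
```

Because a3 is both the innermost loop and the fastest-varying subscript, consecutive iterations land in consecutive columns, which is why the order matrix above simply counts 1 to 625 down each column.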

Working with .PGM files?

I need to do some number crunching in .PGM image files.
I'll use MATLAB for that.
Now some files (apparently "P2" type) are plain-text and everything is straightforward because they look like this
P2
256 256
255
203 197 197 186 181 181 182 170 165 161 167 171 169 175 163 154 146
138 146 156 166 161 162 164 166 167 177 175 169 167 171 163 153 161
159 159 145 183 181 148 149 151 149 143 175 172 162 156 168 159 159
...
But some files (apparently "P5" type) are like this
P5
256 256
255
*all kinds of random symbols here*
...
Wikipedia here says that the difference is that the latter uses binary encoding. How should I deal with it? I doubt I can import binary data into MATLAB...
Have you tried reading the images with imread? It understands PGM directly, in both the ASCII (P2) and binary (P5) variants:
I = imread(pgmFilename);
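If you want to see why P5 looks like "random symbols", note that the header of both variants is plain ASCII; only the raster differs (ASCII numbers vs. one raw byte per sample). Here is a minimal Python sketch of a parser for both, with the caveats noted in the comments (no comment-line handling, 8-bit samples only):

```python
def read_pgm(data: bytes):
    """Minimal PGM parser for 'P2' (ASCII) and 'P5' (binary) data
    without '#' comment lines; returns (width, height, pixel list)."""
    if data[:2] == b"P2":
        # everything is ASCII: just tokenize the whole file
        tokens = data.split()
        w, h = int(tokens[1]), int(tokens[2])
        pixels = [int(t) for t in tokens[4:4 + w * h]]
    elif data[:2] == b"P5":
        # header is still ASCII; the raster starts after the single
        # whitespace byte that follows maxval
        parts = data.split(None, 4)  # magic, width, height, maxval, raster
        w, h = int(parts[1]), int(parts[2])
        maxval = int(parts[3])
        assert maxval < 256, "2-byte samples not handled in this sketch"
        # caveat: a raster whose first byte happens to be whitespace would
        # confuse this split; a robust reader locates the raster offset itself
        pixels = list(parts[4][:w * h])  # one raw byte per sample
    else:
        raise ValueError("not a P2/P5 PGM")
    return w, h, pixels

# the same 2x2 image in both encodings
ascii_img = b"P2\n2 2\n255\n203 197\n197 186\n"
binary_img = b"P5\n2 2\n255\n" + bytes([203, 197, 197, 186])
print(read_pgm(ascii_img))   # (2, 2, [203, 197, 197, 186])
print(read_pgm(binary_img))  # (2, 2, [203, 197, 197, 186])
```

So the binary data is perfectly importable; P5 simply stores each grey value as one byte instead of a decimal string, which is what the "random symbols" are.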