Why are zero byte files written to GCS when running a pipeline? - google-cloud-storage

Our job/pipeline is writing the results of a ParDo transformation back out to GCS, i.e. using TextIO.Write.to("gs://...").
We've noticed that when the job/pipeline completes, it leaves numerous 0 byte files in the output bucket.
The input to the pipeline consists of multiple files from GCS, so I'm assuming the results are sharded, which is fine.
But why do we get empty files?

It is likely that these empty shards are the result of an intermediate pipeline step that turned out to be somewhat sparse, so some pre-partitioned shards had no records in them.
E.g. if there was a GroupByKey right before the TextIO.Write and, say, the keyspace was sharded into ranges [00, 01), [01, 02), ..., [fe, ff) (255 shards total), but all actual keys emitted from the input of this GroupByKey were in the range [34, 81) and [a3, b5), then 255 output files will be produced, but most of them will turn out empty. (this is a hypothetical partitioning scheme, just to give you the idea)
The rest of my answer will be in the form of Q&A.
Why produce empty files at all? If there's nothing to output, don't create the file!
It's true that it would be technically possible to avoid producing them, e.g. by opening each shard's file lazily, only when its first element is written. AFAIK we normally don't do this because empty output files are usually not an issue, and it is easier to understand an empty file than the absence of a file: it would be pretty confusing if, say, only the first of 50 shards turned out non-empty and you had only a single output file named 00000-of-00050: you'd wonder what happened to the 49 other ones.
But why not add a post-processing step to delete the empty files? In principle we could add such a step, deleting the empty outputs and renaming the rest (to be consistent with the xxxxx-of-yyyyy filepattern), if empty outputs became a big issue.
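If you want to clean them up yourself in the meantime, here is a minimal sketch of the deletion part (it does not do the renaming), assuming the google-cloud-storage Java client and made-up bucket/prefix names:

import com.google.cloud.storage.{Storage, StorageOptions}
import scala.jdk.CollectionConverters._   // Scala 2.13; use scala.collection.JavaConverters._ on 2.12

val storage: Storage = StorageOptions.getDefaultInstance.getService
// List everything under the (hypothetical) output prefix and delete the zero-byte objects.
val blobs = storage.list("my-bucket", Storage.BlobListOption.prefix("output/results-")).iterateAll().asScala
blobs.filter(_.getSize.longValue == 0L).foreach(_.delete())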
Does the existence of empty shards signal a problem in my pipeline?
A lot of empty shards might mean that the system-chosen sharding was suboptimal/uneven and we should have split the computation into fewer, more uniform shards. If this is a problem for you, could you give more details about your pipeline's output? E.g., your screenshot shows that the non-empty outputs are also pretty small: do they contain just a handful of records? (If so, it may be difficult to achieve uniform sharding without knowing the data in advance.)
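If uneven, system-chosen sharding does turn out to be the issue, one knob you can turn is to pin the number of output shards yourself. A minimal sketch, assuming the Dataflow Java SDK's TextIO called from Scala, with made-up paths and shard count:

import com.google.cloud.dataflow.sdk.Pipeline
import com.google.cloud.dataflow.sdk.io.TextIO
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory

val pipeline = Pipeline.create(PipelineOptionsFactory.create())
val lines = pipeline.apply(TextIO.Read.from("gs://my-bucket/input/*"))   // hypothetical input pattern
// ... your ParDo / GroupByKey steps would go here, producing `results` ...
val results = lines   // placeholder so the sketch stands on its own
results.apply(
  TextIO.Write.to("gs://my-bucket/output/results")   // hypothetical output prefix
    .withNumShards(10))                              // force exactly 10 shards instead of the system-chosen count
pipeline.run()

Fewer, explicitly chosen shards usually means fewer (or no) empty files, at the cost of less write parallelism.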
But the shards of my original input are not empty; doesn't the sharding of the output mirror the sharding of the input? If your pipeline has GroupByKey (or derived) operations, there will be intermediate steps where the number of shards in the input and output differ: e.g. an operation may consume 30 shards of input but produce 50 shards of output, or vice versa. Different numbers of shards in input and output are also possible in some other cases not involving GroupByKey.
TL;DR If your overall output is correct, it's not a bug, but tell us if it is a problem for you :)

Related

Is sorting necessary for merging BAM files using BamTools?

I have a pair of Illumina paired-end read files (say, A_1.fastq.gz and A_2.fastq.gz) produced from a single bacterial isolate for variant calling. First of all, I used FLASH to merge overlapping reads because of the read length (100 bp), insert size (about 230 bp) and its standard deviation (about 50 bp). FLASH produced three read files: two for non-overlapping paired-end reads and one for merged reads (single-end). Then I aligned them against a common reference genome using Bowtie, which generated two BAM files (one for the paired-end reads and the other for the single-end reads).
To gain higher coverage and read depth for variant calling, I would like to merge both BAM files into a single one. I plan to use BamTools for this task, as it is dedicated to handling BAM files. However, I am not sure whether it is necessary to sort the input BAM files prior to calling the "bamtools merge" command. This is not covered in the software tutorial or elsewhere. I would appreciate it if you could help.
Well, it is a merge so, by definition, the input has to be sorted. Otherwise it won't be a merge.
Merging is the action of joining two or more sorted lists while keeping the ordering. The good thing about a merge is that you don't have to do an extra sort when your inputs are already sorted.
If the inputs are not sorted, then you can simply concatenate them and sort the final result, or sort the inputs and merge the intermediate results.
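To make the idea concrete, here is a tiny, generic two-way merge of already-sorted sequences (illustrative only, not BAM-aware):

// Merge two already-sorted lists into one sorted list without re-sorting.
def merge[A](xs: List[A], ys: List[A])(implicit ord: Ordering[A]): List[A] = (xs, ys) match {
  case (Nil, _) => ys
  case (_, Nil) => xs
  case (x :: xt, y :: yt) =>
    if (ord.lteq(x, y)) x :: merge(xt, ys) else y :: merge(xs, yt)
}

println(merge(List(1, 4, 9), List(2, 3, 10)))   // List(1, 2, 3, 4, 9, 10)

The single pass only works because both inputs are already in order; that is exactly what a merge of sorted BAM files relies on.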
BTW, it is quite probable that if you feed unsorted BAM files to the merge command, it will complain about it.

How to distribute processing to find waldos in csv using spark scala in a clustered environment?

I have a spark cluster of 1 master, 3 workers. I have a simple, but gigantic CSV file like this:
FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...
I want to get all the records where FirstName is "Waldo". Normally, on one machine in local mode, if I have the data in a local collection I can call ".parallelize()" on it to get an RDD; then, assuming the RDD variable is "mydata", I can do:
mydata.foreach(x => if (x.split(',')(0).trim == "Waldo") println(x))
From my understanding, using the above method, every Spark worker would have to perform that iteration over the entire gigantic CSV to get the result, instead of each machine processing a third of the file (correct me if I am wrong).
Problem is, if I have 3 machines, how do I make it so that:
1. The CSV file is broken up into 3 different "sets", one per worker, so that each worker is working with a much smaller file (1/3rd of the original).
2. Each worker processes its piece and finds all the "FirstName=Waldo" records.
3. The resulting list of "Waldo" records is reported back to me in a way that actually takes advantage of the cluster.
Mmm, lots of points to make here. First, if you are using HDFS, your file is already partitioned: it is a distributed file system. You probably even have the data replicated 3 times, as that is the default (depends on your config though).
Second, Spark will indeed make use of this partitioning when you tell it to load data, and will process chunks locally. Shuffling data around is only required when you want to, for instance, re-partition your data by some criteria, like keys in a key/value pair, etc.
Third, Spark is indeed great for doing batch processing and some data mining if you don't want to structure a database or don't have predefined access patterns. In short, it fits what you seem to need. You don't even need to write and compile code, since you can run a Spark shell and try things out with a few lines. I do recommend you look at the docs, since you don't seem to have a clear grasp of the platform yet.
Fourth, I don't have an IDE or anything here, but the code you need should be something like this (sort of pseudocode, but should be VERY close):
sc
.textFile("my_hdfs_path")             // one partition per HDFS block, read where the data lives
.keyBy(_.split(',')(0).trim)          // key each line by the FirstName column (the file is comma-separated)
.filter(_._1 == "Waldo")              // keep only the Waldo records
.map(_._2)                            // drop the key, keep the original line
.saveAsTextFile("my_hdfs_out")        // write the matches back to HDFS, one part file per partition
If the result is not too big, you can also use collect to bring all the results to the driver instead of saving them to a file, but after that you are back on a single machine. Hope it helps!
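For instance, a hedged variant of the same job (same made-up paths) that collects the matches at the driver:

// Only safe when the number of matching records fits comfortably in driver memory.
val waldos = sc.textFile("my_hdfs_path")
  .filter(_.split(',')(0).trim == "Waldo")   // keep rows whose FirstName column is "Waldo"
  .collect()                                 // pull the matches back to the driver
waldos.foreach(println)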

Neo4j Import tool - Console output meaning

What does the console output of the Neo4j import tool mean?
Example lines:
[INPUT--------------PROPERTIES(2)======|WRITER: W:71.] 3M
[INPUT------|PREPARE(|RELATIO||] 49M
[Relationship --> Relationship + counts-------]282M
When I try to import a large dataset through this tool, it seems that at 248M, importing is hanging in the ‘calculate dense nodes’ step. What exactly does 'calculating dense nodes' do?
The import stages are:
Nodes
Prepare node index
Calculate dense nodes
Node --> Relationship Sparse
Relationship --> Relationship Sparse
Node counts
Relationship counts
As for interpreting the statistics, I guess @mattias-persson wrote that section of the Neo4j manual. Copying it here, for the record:
10.1.2.5. Output and statistics
While an import is running through its different stages, some statistics and figures are printed in the console. The general interpretation of that output is to look at the horizontal line, which is divided up into sections, each section representing one type of work going on in parallel with the other sections. The wider a section is, the more time is spent there relative to the other sections, the widest being the bottleneck, also marked with *. If a section has a double line, instead of just a single line, it means that multiple threads are executing the work in that section. To the far right a number is displayed telling how many entities (nodes or relationships) have been processed by that stage.
As an example:
[*>:20,25 MB/s-----------|PREPARE(3)==========|RELATIONSHIP(2)===========] 16M
Would be interpreted as:
> data being read, and perhaps parsed, at 20,25 MB/s, data that is being passed on to …
PREPARE preparing the data for …
RELATIONSHIP creating actual relationship records and …
v writing the relationships to the store. This step is not visible in this example, because it is so cheap compared to the other sections.
Observing the section sizes can give hints about where performance can be improved. In the example above, the bottleneck is the data read section (marked with >), which might indicate that the disk is being slow, or is poorly handling simultaneous read and write operations (since the last section often revolves around writing to disk).
After some offline discussion, it seems that most, if not all, of the "missing" nodes were due to one line in the CSV file having a property value that started with a quotation mark (") but had no closing quote. This resulted in the parser reading until the next quote, i.e. across new-lines, thinking that it was still reading that same property value for that node.
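For example, with made-up data like this (the second data line opens a quote that is never closed):

personId,name,note
1,Alice,"ok"
2,Bob,"this note never ends
3,Carol,"ok"
4,Dave,"ok"

Everything from "this note never ends" up to the next quotation mark on a later line is treated as a single property value, so the line breaks and the start of the following node lines get swallowed into it, and those nodes go "missing".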
It would be great to have some sort of detection for such missing quotes, but that's not straightforward, given that node/relationship values may legitimately span multiple lines.

Sharding key, chunkSize and pre-splitting

I have set up a sharded cluster on a single machine, following the steps mentioned here:
http://www.mongodb.org/display/DOCS/A+Sample+Configuration+Session
But I don't understand the '--chunkSize' option:
$ ./mongos --configdb localhost:20000 --chunkSize 1 > /tmp/mongos.log &
With N shards, each shard is supposed to hold 1/N of the documents, dividing the shard key's range into N almost equal parts, right? That would fix the chunk size / shard size automatically. Which chunk, then, is the above command dealing with?
Also, there's provision to split a collection manually at a specific value of the key and then migrate a chunk to any other shard you want. This can be done manually and is even handled automatically by a 'balancer'. Doesn't that clash with the sharding settings and confuse the config servers, or are they informed immediately about any such movement?
Thanks for any help.
You might be confusing a few things. The --chunkSize parameter sets the chunk size used when splitting. Have a look at the document with _id "chunksize" in the "settings" collection of the "config" database to see the current value, if set. The --chunkSize option will only set this value (or make changes to the system) if there is no value set already; otherwise it will be ignored.
The chunk size is the size in megabytes that the system tries to keep chunks under. This is enforced in two places: 1) when writes pass through the mongos instances and 2) prior to moving chunks to another shard during balancing. As such it does not follow from a "data size / shard count" formula. Your example of 1 MB per chunk is almost always a bad idea.
You can indeed split and move chunks manually, and although that might result in a less than ideal chunk distribution, it will never confuse or break the config metadata and the balancer. The reason is relatively straightforward: the balancer uses the same commands and follows the same code paths. From MongoDB's perspective there is no significant difference between a balancer process splitting and moving chunks and you doing it.
There are a few valid use cases for manually splitting and moving chunks though. For example, you might want to do it to prepare a cluster for a very high peak load from a cold start: so-called pre-splitting. Typically you will write a script to do this, or reuse splits from a performance test that already worked well. You may also watch for hot chunks and split/move those chunks to distribute load more evenly, based on the "load" monitored from your application.
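A minimal sketch of what a manual pre-split plus move could look like, assuming the MongoDB Scala driver and made-up namespace, shard-key values and shard name (the shell helpers sh.splitAt() and sh.moveChunk() wrap the same "split" and "moveChunk" admin commands used here):

import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration._

val client = MongoClient("mongodb://localhost:27017")   // connect to a mongos, not to a shard directly
val admin  = client.getDatabase("admin")

// Pre-split the (hypothetical) collection mydb.people at shard-key value 50000.
val split = admin.runCommand(Document("split" -> "mydb.people", "middle" -> Document("userId" -> 50000)))
Await.result(split.toFuture(), 30.seconds)

// Move the chunk containing userId 50000 to a specific shard (the shard name is made up).
val move = admin.runCommand(Document(
  "moveChunk" -> "mydb.people",
  "find" -> Document("userId" -> 50000),
  "to" -> "shard0001"))
Await.result(move.toFuture(), 5.minutes)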
Hope that helps.
Great, thanks! I think I get it now. Correct me if I'm wrong: I was thinking that if there are N servers, then the first 1/Nth part of the collection (= chunk 1) would go to shard 1, the second 1/Nth (= chunk 2) to shard 2, and so on. When you said that there's no such "formula", I searched a bit more and found these links: "MongoDB sharding, how does it rebalance when adding new nodes?" and "How to define sharding range for each shard in Mongo?". From the definition of "chunk" in the documentation, I think it is to be thought of as merely a unit of data migration. When we shard a collection among N servers, the total number of chunks is not necessarily N, and they need not be of equal size either. The maximum size of one chunk is either already set as a default (usually 64 MB) in the settings collection of the config database, or can be set manually by specifying a value using the --chunkSize parameter as shown in the above code. Depending on the values of the shard key, one shard may have more chunks than another, but MongoDB uses a balancer process that tries to distribute these chunks evenly among the shards. By even distribution, I mean it tends to split chunks and migrate them to other shards if they grow bigger than their limit or if one particular shard is getting heavily loaded. This can be done manually as well, by following the same set of commands that the balancer process uses.

Merge of key-value stores

Is there some merge strategy or program which is aware of key-value stores, in the sense that the sequence of the lines does not matter*? For a real example, jEdit does not keep the order of options, so there are hundreds of lines which are shuffled around. It would be nice to diff/merge these without having to sort the file first, for example to see how values are changed and keys are added/removed by configuration modifications while the program is running.
* I know it matters for some file types, like shell scripts where you can have references to other keys. These of course should be merged normally.
If the stores are unsorted, then comparing them will cost O(n*m) time. If you first sort them, you can do it in O(n log n + m log m) for the sorts plus O(n + m) for the comparison, so if the stores are reasonably large then sorting first is way faster.
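A minimal sketch of that "normalize first, then compare" idea, here loading both files into maps (hashing rather than an explicit sort); the file names and the key=value format are assumptions:

import scala.io.Source

// Parse a properties-style file into a key -> value map, ignoring blanks and comments.
def load(path: String): Map[String, String] =
  Source.fromFile(path).getLines()
    .map(_.trim)
    .filter(l => l.nonEmpty && !l.startsWith("#") && l.contains("="))
    .map { l => val Array(k, v) = l.split("=", 2); k.trim -> v.trim }
    .toMap

val before = load("jedit-before.properties")
val after  = load("jedit-after.properties")

// Report added, removed and changed keys regardless of the line order in the files.
(after.keySet -- before.keySet).toList.sorted.foreach(k => println(s"added:   $k=${after(k)}"))
(before.keySet -- after.keySet).toList.sorted.foreach(k => println(s"removed: $k=${before(k)}"))
(before.keySet & after.keySet).toList.sorted.filter(k => before(k) != after(k))
  .foreach(k => println(s"changed: $k: ${before(k)} -> ${after(k)}"))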