Merging code chunks in knitr output - knitr

I sometimes find myself in a situation where I end up with two or more adjacent code chunks in an Rmarkdown document. Typically this happens when the code in these chunks forms a logical unit but I want to process them differently in some way. For example, a sequence of commands may contain a function call that takes particularly long to run and I want to cache it separately to avoid re-computing results as much as possible. This leads to separate code blocks in the output directly following each other.
My question is whether it is possible to automatically merge these adjacent chunks into a single output block. I can't find a knitr option to do this but a knitr hook or pandoc filter may be able to do this. Any advice on a good approach to solve this issue is appreciated.

Related

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean, several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an Unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right (2) it requires some hacky synchronizing since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?
The TextIO should be doing the right thing from Beam's prospective, which is reading in the text file as fast as possible and emitting events to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. Hopefully this isn't too much explanation: The DirectRunner is a test runner for small jobs and so it buffers intermediate steps in memory rather then to disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or Google Cloud Dataflow runner which will both write intermediate stages to disk when needed.
In general, splitting csv files with quoted newlines is hard as it may require arbitrary look-back to determine whether a given newline is or is not in a quoted segment. If you can arrange such that the CSV has no quoted newlines, TextIO.read() works well. Otherwise
If you're using BeamPython, consider the dataframe operation apache_beam.dataframe.io.read_csv which will handle quotation correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice. With the exception of a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
You can use the logic in Text to Cloud Spanner for handling new lines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific files containing the logic to read CSV with newlines are in ReadFileShardFn and SplitIntoRangesFn.

How to do parallel pipeline?

I have built a scala application in Spark v.1.6.0 that actually combines various functionalities. I have code for scanning a dataframe for certain entries, I have code that performs certain computation on a dataframe, I have code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to get this more flexible, having a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package, however, it does not support parallel execution if I understand correctly (in the example the two ParquetReader could read and process the data at the same time)
there is apparently the Luigi project that might do exactly this (however, it says on the page that Luigi is for long-running workflows, whereas I just need short-running workflows; Luigi might be overkill?)?
What would you suggest for building work/dataflows in Spark?
I would suggest to use Spark's MLlib pipeline functionality, what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, in a way that is probably smarter than you can.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.

indexing of large text files line by line for fast access

I have a very large text file around 43GB which I use to process them to generate another files in different forms. and i don't want to setup any databases or any indexing search engines
the data is in the .ttl format
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
target is generating all combinations from all triples who share same subject:
for example for the subject Q1000 :
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
the problem:
the Dummy code to start with is iterating with complexity O(n^2) where n is the number of lines of the 45GB text file ,needless to say that it would take years to do so.
what i thought of to optimize :
loading a HashMap [String,IntArray] for indexing lines of appearance each key and using any library to access the file by line number for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
drawbacks is that the index could be relatively large as well , considering that we will have another index for the access with specific line number , plus the overloaded i didnt try the performance of the
making a text file for each key like Q1000.txt for all triples contains subject Q1000 and iterating over them one by one and making combinations
drawbacks : this seems the fastest one and least memory consuming but certainly creating around 10 million files and accessing them will be a problem , is there and alternative for that ?
i'm using scala scripts for the task
Take the 43GB file in chunks that fit comfortably in memory and sort on the subject. Write the chunks separately.
Run a merge sort on the chunks (sorted by subject). It's really easy: you have as input iterators over two files, and you write out whichever input is less, then read from that one again (if there's any left).
Now you just need to make one pass through the sorted data to gather the groups of subjects.
Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
A possible solution would be to use some existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize your computation on multiple machines, the main advantage is that it handles the management of splitting and merging for you.
There is an interesting library Apache Crunch with Scala API. I haven't used it myself, but it looks it could solve your problem well. Your lines would be split according to their subjects and then

Perl read file vs traverse array Performance

I need to test lines in a file against multiple values
What are the difference in terms of time between opening a file and reading line by line each time vs opening the file once placing it in an array and traversing the array each time?
To expand upon what #mpacpec said in his comment, file IO is always slower than memory read/writes. But there's more to the story. "Test lines in a file against multiple values" can be interpreted in a lot of ways, so without knowing more about what exactly you are trying to do, then no one can tell you anything more specifically. So the answer is, "It depends". It depends on the file size, what you're testing and how often, and how you're testing.
However, pragmatically speaking, based upon my understanding of what you've said, you'll have to read the whole file one way or another, and you'll have to test every line, one way or another. Do what's easiest to write/read/understand, and see if that's fast enough. If it isn't, you have a much more useful baseline from which to ask the question. Personally, I'd start with a linewise read and test loop and work from there, simply because I think that'd be easier and faster to write correctly.
Make it work, then make it fast :)
Provided in the former case you can do all the tests you need on each line (rather than re-reading file each time), then the two approaches should be roughly the same speed and I/O, CPU efficiency (ignoring second-order effects such as whether the disk IO gets distracted by other processes more easily). However, the latter case - reading whole file - may hit memory limits for large files, which may cause it to lose performance dramatically or even fail.
The main cost of processing the file line by line is loss of flexibility - for instance if you need to cross-reference the lines, it would not be easy (whilst if they are all in memory, the code to do that would be simpler and faster).

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partially through a large job (~36%) when Hadoop ran into some files with issues and for some reason it seemed to crash the entire job. If the job had completed successfully, what would have been printed to standard out would be a line for each file as it was completed along with some stats from my program that's processing each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there anyway to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures system.out data here :
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs dont usually use standard out for debugging, rather - the convetion is to use counters.
For each of your documents, you can summarize document characteristics : like length, number of normal ascii chars, number of new lines.
Then, you can have 2 counters: a counter for "good" files, and a counter for "bad" files.
It probably be pretty easy to note that the bad files have something in common [no data, too much data, or maybe some non printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with system.out statements is that the jobs running on various machines can't integrate their data. Counters get around this problem - they are easily integrated into a clear and accurate picture of the overall job.
Of course, the problem with counters is the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningfull way.
WORST CASE SCENARIO : YOU REALLY NEED TEXT DEBUGGING, and you dont want it in a temp file
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs : it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text --- i find, quite often that much of my debugging print statements are, when cut down to their raw information content, are basically just counters...