How to transform Kafka logs stored on the file system into CSV - apache-kafka

I have some logs generated by Kafka that are currently stored on my computer in .log format, looking like this:
I would like to convert those files into CSV records, with the message and the timestamp.
I know the question might seem too vague or unclear, sorry, but I'm really just looking for a starting point to achieve this.
Note: this is linked to the isoblue project and datasets here

Those files are encrypted.
Isn't it easier to just write a consumer for those topics and write out a CSV file?
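A minimal sketch of that consumer idea, assuming the messages are string-serialized and using a hypothetical broker address and topic name; it replays the topic from the beginning and writes one timestamp,message row per record:

import java.io.PrintWriter;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TopicToCsv {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "csv-export");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // replay from the start
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             PrintWriter out = new PrintWriter("messages.csv", "UTF-8")) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // hypothetical topic
            out.println("timestamp,message");
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    break;  // naive stop condition for a one-off export
                }
                for (ConsumerRecord<String, String> record : records) {
                    // Quote the value so commas inside messages stay in one CSV field
                    out.printf("%d,\"%s\"%n", record.timestamp(), record.value().replace("\"", "\"\""));
                }
            }
        }
    }
}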

You're looking for the DumpLogSegments command. However, this will not output CSV, so you'd have to parse its output yourself.
https://cwiki.apache.org/confluence/display/KAFKA/System+Tools#SystemTools-DumpLogSegment
Dump Log Segment
This can print the messages directly from the log files or just verify
that the indexes are correct for the logs
bin/kafka-run-class.sh kafka.tools.DumpLogSegments
required argument "[files]"
Option Description
------ -----------
--deep-iteration if set, uses deep instead of shallow iteration
--files <file1, file2, ...> REQUIRED: The comma separated list of data and index log files to be dumped
--max-message-size <Integer: size> Size of largest message. (default: 5242880)
--print-data-log if set, printing the messages content when dumping data logs
--verify-index-only if set, just verify the index log without printing its content
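If you go the DumpLogSegments route, you still have to turn its text output into CSV yourself. Below is a rough, hypothetical post-processor; the exact field names in the dump output (e.g. CreateTime: and payload:) vary by Kafka version, so check a sample of the output and adjust the pattern accordingly:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical converter: reads DumpLogSegments --print-data-log output from args[0]
// and writes a timestamp,message CSV to args[1].
public class DumpToCsv {
    // Adjust this pattern to whatever your Kafka version actually prints per record.
    private static final Pattern LINE =
            Pattern.compile("CreateTime:\\s*(\\d+).*payload:\\s*(.*)");

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            out.println("timestamp,message");
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (m.find()) {
                    // Quote the payload so commas inside messages stay in one CSV field
                    out.printf("%s,\"%s\"%n", m.group(1), m.group(2).replace("\"", "\"\""));
                }
            }
        }
    }
}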

Related

Why the output data are different between 2 sinks from 1 same PCollection

I have an Apache Beam application running with the Spark runner on a YARN cluster; it reads multiple inputs, does transforms, and produces 2 outputs, one in Parquet and the other in text file format.
In my transforms, one of the steps is to generate a UUID and assign it to one attribute of my POJO, which gives me a PCollection. From this PCollection I applied transforms to convert my POJO to String and GenericRecord, and applied TextIO and ParquetIO to save to my storage.
Just now I observed a strange issue: in the output files, the UUID attribute is different between the Parquet data and the text data for the same record!
I expect that since they come from one and the same PCollection and are just output in different formats, the data must be the same, right?
The issue happens only with a large input volume; in my unit test case it gives me the same value in both formats.
I assume some kind of recalculation happens when sinking to different IOs, but I can't confirm it. Can anyone help explain?
Thanks
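For illustration, here is a stripped-down sketch of the branching being described, using plain strings and two text sinks instead of Parquet and text; every name in it is made up. The point is that both sinks hang off the same logical PCollection, and a runner may re-execute a non-deterministic step like the UUID assignment independently for each branch:

import java.util.Arrays;
import java.util.UUID;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class TwoSinksSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Non-deterministic step: every execution of this DoFn produces a fresh UUID.
        PCollection<String> withIds = p
                .apply(Create.of(Arrays.asList("a", "b", "c")))
                .apply("AssignUuid", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void process(@Element String value, OutputReceiver<String> out) {
                        out.output(UUID.randomUUID() + "," + value);
                    }
                }));

        // Two sinks branch off the same logical PCollection. A runner is free to
        // re-run the AssignUuid step for each branch, so the UUIDs can differ
        // between the two outputs unless the step is made deterministic or the
        // collection is materialized first.
        withIds.apply("SinkA", TextIO.write().to("out/sink-a").withSuffix(".txt"));
        withIds.apply("SinkB", TextIO.write().to("out/sink-b").withSuffix(".txt"));

        p.run().waitUntilFinish();
    }
}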

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean, several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an Unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right (2) it requires some hacky synchronizing since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?
TextIO should be doing the right thing from Beam's perspective, which is reading in the text file as fast as possible and emitting events to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. Hopefully this isn't too much explanation: the DirectRunner is a test runner for small jobs, so it buffers intermediate steps in memory rather than to disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or Google Cloud Dataflow runner, which will both write intermediate stages to disk when needed.
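For reference, a minimal sketch of how the runner is typically chosen from the command line; the pipeline itself is elided, and the flag values assume the matching runner artifact is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
    public static void main(String[] args) {
        // e.g. --runner=FlinkRunner or --runner=DataflowRunner on the command line;
        // with no flag, Beam falls back to the DirectRunner.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run().waitUntilFinish();
    }
}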
In general, splitting CSV files with quoted newlines is hard, as it may require arbitrary look-back to determine whether a given newline is or is not in a quoted segment. If you can arrange things such that the CSV has no quoted newlines, TextIO.read() works well. Otherwise:
If you're using BeamPython, consider the dataframe operation apache_beam.dataframe.io.read_csv which will handle quotation correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice. With the exception of a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
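A rough Java sketch of that second approach, using FileIO.match for a single file pattern (FileIO.matchAll works the same way on a PCollection of patterns) and commons-csv to stream records; the file pattern and downstream use are placeholders:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class StreamingCsvRead {

    // Opens each matched file and emits rows one at a time instead of loading the file whole.
    static class ReadCsvRowsFn extends DoFn<FileIO.ReadableFile, CSVRecord> {
        @ProcessElement
        public void process(@Element FileIO.ReadableFile file, OutputReceiver<CSVRecord> out)
                throws IOException {
            try (Reader reader = new InputStreamReader(
                    Channels.newInputStream(file.open()), StandardCharsets.UTF_8)) {
                // commons-csv parses lazily, so quoted newlines are handled correctly
                // without reading the whole file into memory.
                for (CSVRecord record : CSVFormat.DEFAULT.parse(reader)) {
                    out.output(record);
                }
            }
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<CSVRecord> rows = p
                .apply(FileIO.match().filepattern("/path/to/*.csv"))   // hypothetical pattern
                .apply(FileIO.readMatches())
                .apply(ParDo.of(new ReadCsvRowsFn()))
                .setCoder(SerializableCoder.of(CSVRecord.class));
        // ... downstream transforms on `rows` ...
        p.run().waitUntilFinish();
    }
}

As noted above, each individual file is still read by a single worker, so parallelism comes from having many files rather than from splitting one file.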
You can use the logic in Text to Cloud Spanner for handling new lines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific files containing the logic to read CSV with newlines are in ReadFileShardFn and SplitIntoRangesFn.

Spark: partition .txt.gz files and convert to parquet

I need to convert all text files in a folder that are gzipped to parquet. I wonder if I need to gunzip them first or not.
Also, I'd like to partition each file in 100 parts.
This is what I have so far:
sc.textFile("s3://bucket.com/files/*.gz").repartition(100).toDF()
.write.parquet("s3://bucket.com/parquet/")
Is this correct? Am I missing something?
Thanks.
You don't need to uncompress files individually. The only problem with reading gzip files directly is that your reads won't be parallelized. That means, irrespective of the size of the file, you will only get one partition per file because gzip is not a splittable compression codec.
You might face problems if individual files are greater than a certain size (2GB?) because there's an upper limit to Spark's partition size.
Other than that your code looks functionally alright.
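For comparison, a minimal Java sketch of the same flow with the DataFrame API; the S3 paths are just the ones from the question, and the snippet assumes a plain SparkSession (spark.read().text() decompresses .gz transparently, and the repartition spreads the single-partition-per-gzip-file data out before the Parquet write):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GzToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("gz-to-parquet").getOrCreate();

        // .gz files are decompressed on the fly, but each one becomes a single partition
        // because gzip is not splittable.
        Dataset<Row> lines = spark.read().text("s3://bucket.com/files/*.gz");

        // repartition(100) redistributes the rows so the Parquet write is parallelized.
        lines.repartition(100).write().parquet("s3://bucket.com/parquet/");

        spark.stop();
    }
}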

How to add record numbers to TextIO file sources in Apache Beam or Dataflow

I am using Dataflow (and now Beam) to process legacy text files to replicate the transformations of an existing ETL tool. The current process adds a record number (the record number for each row within each file) and the filename. The reason they want to keep this additional info is so that they can tell which file and record offset the source data came from.
I want to get to a point where I have a PCollection which contains File record number and filename as additional fields in the value or part of the key.
I've seen a different article where the filename can be populated into the resulting PCollection; however, I do not have a solution for adding the record numbers per row. Currently the only way I can do it is to pre-process the files before I start the Dataflow process (which is a shame, since I would want Dataflow/Beam to do it all).
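There is no accepted answer here, but the shape being described can be sketched with FileIO plus a DoFn that reads each matched file sequentially, so the per-file counter stays consistent; all names and the file pattern below are hypothetical:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class NumberedLines {

    // Reads each matched file in order, so a per-file line counter is meaningful.
    static class NumberLinesFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
        @ProcessElement
        public void process(@Element FileIO.ReadableFile file, OutputReceiver<KV<String, String>> out)
                throws IOException {
            String fileName = file.getMetadata().resourceId().toString();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    Channels.newInputStream(file.open()), StandardCharsets.UTF_8))) {
                long recordNumber = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    recordNumber++;
                    // Key: "<filename>:<record number>", value: the raw line.
                    out.output(KV.of(fileName + ":" + recordNumber, line));
                }
            }
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<KV<String, String>> numbered = p
                .apply(FileIO.match().filepattern("/path/to/legacy/*.txt"))  // hypothetical pattern
                .apply(FileIO.readMatches())
                .apply(ParDo.of(new NumberLinesFn()));
        // ... downstream transforms ...
        p.run().waitUntilFinish();
    }
}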

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partially through a large job (~36%) when Hadoop ran into some files with issues, and for some reason it seemed to crash the entire job. If the job had completed successfully, what would have been printed to standard out would be a line for each file as it was completed, along with some stats from my program that's processing each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures system.out data here :
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics: like length, number of normal ASCII chars, number of newlines.
Then, you can have 2 counters: a counter for "good" files, and a counter for "bad" files.
It will probably be pretty easy to notice that the bad files have something in common [no data, too much data, or maybe some non-printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with system.out statements is that the jobs running on various machines can't integrate their data. Counters get around this problem - they are easily integrated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningful way.
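As a concrete illustration, this is roughly what the good/bad counters look like in a Java MapReduce mapper (for a streaming job you would instead emit reporter:counter:group,counter,amount lines to stderr); the mapper and its processFile method are hypothetical:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input line names one file to process.
public class FileProcessingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String fileName = value.toString().trim();
        try {
            processFile(fileName);  // hypothetical per-file work
            // Counters are aggregated across all tasks by the framework,
            // unlike stdout, which is scattered across per-task log files.
            context.getCounter("files", "good").increment(1);
        } catch (Exception e) {
            context.getCounter("files", "bad").increment(1);
        }
    }

    private void processFile(String fileName) {
        // ... per-file processing goes here ...
    }
}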
WORST CASE SCENARIO: YOU REALLY NEED TEXT DEBUGGING, and you don't want it in a temp file.
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
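A hypothetical sketch of that MultipleOutputs idea in a Java reducer; the named output "debug" must first be registered on the Job with MultipleOutputs.addNamedOutput:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer that writes debug records to a side output named "debug".
public class DebuggingReducer extends Reducer<Text, Text, Text, NullWritable> {

    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Normal job output goes to the part-r-0000* files as usual.
        context.write(key, NullWritable.get());
        for (Text value : values) {
            // Ancillary debug records land in files named debug-r-0000*.
            mos.write("debug", new Text(key + " -> " + value), NullWritable.get());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();  // required, otherwise the side files may end up empty
    }
}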
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs: it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text. I find, quite often, that much of my debugging print statements are, when cut down to their raw information content, basically just counters...