Sorting file contents idiomatically with spring-batch - spring-batch

I have a CSV file with a number of fields. What is an idiomatic way to read the file, sort it on a subset of its fields, and then write another CSV as output?
Should I even attempt to do this in spring-batch? I understand that *nix-based OSes have the sort utility to do this, but I'd like to contain all my work within spring batch if possible.
The Batch Processing Strategies section of the documentation seems to suggest that there might be standard utility steps to accomplish this:
In addition to the main building blocks, each application may use one or more of standard utility steps, such as:
Sort: A program that reads an input file and produces an output file where records have been re-sequenced according to a sort key field in the records. Sorts are usually performed by standard system utilities.
But I am not able to locate this. Any pointers most welcome!
Thanks very much!

Unless you really need to do it inside Spring Batch, I would suggest you do it with OS-based commands.
That said, your point is valid: adding intermediate Steps to your Jobs to sort, filter, or even clean data is a mainstream pattern in batch processing and ETL jobs.
Hope this helps.

I found out that there is a SystemCommandTasklet that is meant to run OS commands. This can be used to do things like sorting, finding unique items, etc.
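For illustration, here is a minimal sketch of wiring it into a step, assuming the Spring Batch 4.x API (single-string setCommand and StepBuilderFactory; on Spring Batch 5 the builder takes a JobRepository and setCommand is varargs). The file names and sort keys are placeholders:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SortStepConfiguration {

    // Tasklet that shells out to the OS sort utility.
    // The file names and sort keys below are placeholders for your own.
    @Bean
    public SystemCommandTasklet sortTasklet() {
        SystemCommandTasklet tasklet = new SystemCommandTasklet();
        // Sort on the first two comma-separated fields and write sorted.csv.
        tasklet.setCommand("sort -t, -k1,1 -k2,2 -o sorted.csv input.csv");
        tasklet.setTimeout(60_000); // a timeout is mandatory; fail the step after 60s
        return tasklet;
    }

    @Bean
    public Step sortStep(StepBuilderFactory steps) {
        return steps.get("sortStep")
                .tasklet(sortTasklet())
                .build();
    }
}
```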

Related

Compare tables to ensure non-regression in PostgreSQL

Here is my issue: I often need to compare the same PostgreSQL tables (or views that depend on them) before and after some ETL code refactoring, to check that my development work introduces no regressions.
Let's say I have some ETL code I want to refactor, which regularly loads data into a table. Currently, once my modifications are done, I first download the data from PostgreSQL as a .csv file, then empty the table, fill it again using my refactored code, and download the data again. Then I compare the two .csv files, for instance with Python in a Jupyter Notebook.
That does not seem like the way to go at all. Notably, it assumes I am the only one using that table during the operation, among so many other issues that I can't list them all here.
Is there a better way to go?
It sounds to me like you have the correct approach. There's no magic to the CSV export operation: whatever tool you use runs a query and formats its resultset into the file. Any other before-and-after comparison operation would have to run the same query.
If you're doing this sort of regression test on an active database, it's probably wise to put some sort of distinctive tag on your test records, maybe prepend ETLTEST- to your customer names, so it's ETLTEST-John Bull. Then you can make your queries handle only your test records. And make sure you do something reliable for ORDER BY.
Jupyter seems a complex way to diff your CSV files. Most operating systems have lightweight, fast diff tools.

Create data as a load on JMeter

I'm working with Kubernetes on Microsoft Azure with real data. Now I need to generate sample data in JMeter and then use it as a workload to stress the CPU of the Tea-Store microservices on Kubernetes. Any hint or source about how to do that, and which types of files work with JMeter?
If you want a specific answer, you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config allows reading CSV files into JMeter Variables so that each virtual user, on each iteration, reads the next line from the CSV file
__CSVRead() function does more or less the same, however it can be declared/used at runtime, so you can have a dynamic filename/path and you decide when to proceed to the next column/row
JDBC Request sampler allows reading test data from the database or creating test data in the database
__StringFromFile() function reads the next line from a file each time it is called
__FileToString() function reads the whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - number of current thread
__time() and __timeShift() - current timestamp in various formats plus possibility to generate dates in future or past
__Random() - generate a random number
__RandomString() - generate a random string out of provided characters
__UUID() - generate unique GUID-like structure
__groovy() - for everything else, it executes arbitrary Groovy code and returns the result
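For illustration only, here is roughly how a few of these functions might appear inside a sampler's parameters or request body (the file names, ranges, and formats below are made-up placeholders):

```
${__CSVRead(users.csv,0)}                      column 0 of the current line of users.csv
${__StringFromFile(payloads.txt)}              next line of payloads.txt on each call
${__threadNum}                                 number of the current thread
${__time(yyyy-MM-dd)}                          current date in the given format
${__Random(1,100)}                             random integer between 1 and 100
${__RandomString(8,abcdef123456)}              8 random characters drawn from the given set
${__UUID()}                                    random UUID
${__groovy(new Date().format("yyyy-MM-dd"))}   arbitrary Groovy expression
```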
In addition to Dmitri's great answer, I would like to add a few cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially at
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range
of workload scenarios (e.g. concurrency levels, heap sizes, message
sizes, etc.). Running the tests manually for each of these scenarios
is time-consuming and likely to cause errors. Therefore it is
important to automate the performance tests prior to executing them.
We automate our performance tests using a shell script:
start_performance_test.sh.
This script may give you an idea for something similar. Overall, the article also introduces JMeter usage with some examples.

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an Unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right (2) it requires some hacky synchronizing since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?
TextIO should be doing the right thing from Beam's perspective, which is reading in the text file as fast as possible and emitting events to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. Hopefully this isn't too much explanation: the DirectRunner is a test runner for small jobs, so it buffers intermediate steps in memory rather than on disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or Google Cloud Dataflow runner, which will both write intermediate stages to disk when needed.
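As a minimal sketch (assuming the Beam Java SDK and that the chosen runner's artifact is on the classpath), the runner is usually selected through pipeline options rather than code changes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelectionExample {
  public static void main(String[] args) {
    // Pass e.g. --runner=FlinkRunner or --runner=DataflowRunner on the command line;
    // with no flag the DirectRunner is used, which buffers intermediate data in memory.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    // ... build the same transforms as before ...
    pipeline.run().waitUntilFinish();
  }
}
```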
In general, splitting CSV files with quoted newlines is hard, as it may require arbitrary look-back to determine whether a given newline is or is not in a quoted segment. If you can arrange things such that the CSV has no quoted newlines, TextIO.read() works well. Otherwise:
If you're using BeamPython, consider the dataframe operation apache_beam.dataframe.io.read_csv which will handle quotation correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice. With the exception of a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
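For instance, a rough sketch of that approach in the Beam Java SDK with commons-csv (the file pattern and column names below are made-up placeholders) could look like this:

```java
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvReadPipeline {

  // Opens each matched file as a stream and emits one element per CSV record.
  // Quoted newlines are handled by commons-csv because the whole file is parsed
  // by one worker; records are emitted as they are parsed, not buffered first.
  static class ParseCsvFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file,
                               OutputReceiver<KV<String, String>> out) throws Exception {
      try (Reader reader = new InputStreamReader(
               Channels.newInputStream(file.open()), StandardCharsets.UTF_8);
           CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
        for (CSVRecord record : parser) {
          // Hypothetical columns: key on "id", keep "name".
          out.output(KV.of(record.get("id"), record.get("name")));
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        .apply(FileIO.match().filepattern("gs://my-bucket/input/*.csv")) // placeholder path
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new ParseCsvFn()));
    pipeline.run().waitUntilFinish();
  }
}
```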
You can use the logic in Text to Cloud Spanner for handling new lines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific classes containing the logic to read CSVs with newlines are ReadFileShardFn and SplitIntoRangesFn.

DATASTAGE capabilities

I'm a Linux programmer.
I used to write code in order to get things done: Java, Perl, PHP, C.
I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data and does it line by line.
I want to know whether DataStage can work on files that are not table/CSV-like. Can it load data into data structures and run functions on them, or is it limited to working on one line at a time?
Thank you for any information that you can give on the capabilities of DataStage.
IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to applying only single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language - BASIC - that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (either using the DSExecute function, Before Job property, After Job property or the Command stage).
Please check the IBM Information Center for comprehensive documentation on BASIC programming.
You could also check the DSXchange forums for DataStage specific topics.
Yes it can. As Razvan said, you can join, aggregate, and split. It can use loops and external scripts, and it can also handle XML.
My advice is that if you have large quantities of data to work on, then DataStage is your friend; if the data you're going to load is not very big, it's going to be easier to use Java, C, or any other programming language that you know.
You can use all kinds of functions and conversions to manipulate the data. Mainly, DataStage is used for its ease of use when you are handling humongous amounts of data from a data mart / data warehouse.
The main process of DataStage is ETL - Extraction, Transformation, Loading.
Where a programmer might use 100 lines of code to connect to some database, here you can do it with one click.
Anything can be done here, even C or C++ coding in a routine activity.
If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as are produced by COBOL, the answer is yes.
All using in-built functionality (e.g. Hierarchical Data stage, Complex Flat File stage). Review the DataStage palette to find other examples.

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partially through a large job (~36%) when Hadoop ran into some files with issues, and for some reason this seemed to crash the entire job. If the job had completed successfully, what would have been printed to standard out would be a line for each file as it was completed, along with some stats from my program that's processing each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures System.out data here:
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics: length, number of normal ASCII chars, number of newlines, and so on.
Then, you can have 2 counters: a counter for "good" files, and a counter for "bad" files.
It will probably be pretty easy to notice that the bad files have something in common [no data, too much data, or maybe some non-printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with system.out statements is that the jobs running on various machines can't integrate their data. Counters get around this problem - they are easily integrated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningful way.
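In Java MapReduce, the idea looks roughly like the sketch below (the counter group and names are made up; in a streaming job the equivalent is writing a line like reporter:counter:Files,Bad,1 to standard error):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line names one file to process; count successes and failures
// with counters instead of printing to standard out.
public class FileProcessingMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text fileName, Context context)
      throws IOException, InterruptedException {
    try {
      // ... process the file named by fileName ...
      context.getCounter("Files", "Good").increment(1);
    } catch (Exception e) {
      // Counters are aggregated across all tasks, so the job summary shows
      // exactly how many files failed, no matter which node processed them.
      context.getCounter("Files", "Bad").increment(1);
    }
  }
}
```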
WORST CASE SCENARIO: YOU REALLY NEED TEXT DEBUGGING, and you don't want it in a temp file.
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
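A rough sketch of what that could look like in a Java job follows (the named output "debug" and the values written are placeholders; a streaming job would need a different mechanism):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DebugOutputMapper extends Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(LongWritable offset, Text fileName, Context context)
      throws IOException, InterruptedException {
    // ... normal processing of the file named on this input line ...

    // Write a free-form debug record to the "debug" named output instead of stdout.
    mos.write("debug", fileName, new Text("processed OK"));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }

  // The named output has to be registered in the driver, e.g.:
  // MultipleOutputs.addNamedOutput(job, "debug", TextOutputFormat.class, Text.class, Text.class);
}
```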
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs: it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text. I find, quite often, that many of my debugging print statements are, when cut down to their raw information content, basically just counters...