Hi, I am trying to create an application that reads a binary file and then, depending on the data within it, builds a sequence of steps.
I have tried using FlatFileItemReader, but I understand that in order to read a binary file you have to use SimpleBinaryBufferedReaderFactory.
Can someone please help me with how to read the binary data?
I have an AS400 with an IBM DB2 database, and I need to create a Format Description File (FDF) for each table in the database. I can create the FDF file using the IBM export tool, but it will only create one file at a time, which will take several days to complete. I have not found a way to create the files systematically using a tool or query. Is this possible, or should this be done using scripting?
First of all, to correct a misunderstanding...
A Format Description File has nothing at all to do with the format of a Db2 table. It actually describes the format of the data in a stream file that you are uploading into the Db2 table. Sure, you can turn on an option during the download from Db2 to create the FDF file, but it still describes the data in the stream file you've just downloaded into. You can use the resulting FDF file to upload a modified version of the downloaded data, or as the starting point for creating an FDF file that matches the actual data you want to upload.
Which explains why there's no built-in way to create an appropriate FDF file for every table on the system.
I question why you think you actually need to generate an FDF file for every table.
As I recall, the format of the FDF (or its newer variant, FDFX) is pretty simple; it shouldn't be all that difficult to generate if you really wanted to. But I don't have one handy at the moment, and my Google-fu has failed me.
How to read the first line and store header data in Apache Beam Python?
Check out this example. See how the UsCovidDataCsvReader parses the input.
The basic ideas, sketched in code below, are:
read the header and build a schema from it
read the file with skip_header_lines=1
parse the input with the schema to build a PCollection
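A minimal sketch of those three steps, assuming a local comma-separated file at a hypothetical path input.csv and treating the "schema" as just the ordered header names (the linked example builds a richer schema):

```python
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical input path; any CSV-style file with a header row works the same way.
INPUT_PATH = "input.csv"


def read_header(path):
    """Read only the first line and return the column names as the 'schema'."""
    # A plain open() is enough for a local file; for GCS/HDFS paths,
    # apache_beam.io.filesystems.FileSystems.open could be used instead.
    with open(path, "r", newline="") as f:
        return next(csv.reader(f))


def run():
    # 1. Read the header and build a schema from it (here, the ordered column names).
    columns = read_header(INPUT_PATH)

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            # 2. Read the file, skipping the header line.
            | "ReadData" >> beam.io.ReadFromText(INPUT_PATH, skip_header_lines=1)
            # 3. Parse each line against the schema to build a PCollection of dicts.
            | "Parse" >> beam.Map(lambda line: dict(zip(columns, next(csv.reader([line])))))
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```

Each element of the resulting PCollection is a dict keyed by the header names, which downstream transforms can then use instead of positional indexing.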
I would like to use Data Factory to regularly download 500,000 JSON files from a web API and store them in a blob storage container. Then I need to parse the JSON files to extract some values from each file and store these values, together with an ID (part of the filename), in a database. I can do this using a ForEach activity that runs a custom activity for each file, but this is very slow, so I would prefer some batch activity that could run the same parsing code on each file. Is there some way to do this?
If your source JSON files have the same schema, you can leverage the Copy Activity, which can parse those files in a single run. But if possible, I would suggest splitting those files into different subfolders (e.g. 1,000 files per folder), so that each copy run takes less time and is easier to manage.
Refer to this doc for more details: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
I know parquet files store metadata, but is it possible to add custom metadata to a parquet file, using Scala (preferably) with Spark?
The idea is that I store many similarly structured parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the parquet file). However, I'd like to access this information without the overhead of actually reading the parquet, and possibly even remove this redundant column from the parquet.
I really don't want to put this info in a filename, so my best option now is just to read the first row of each parquet file and use the source column as the String value.
It works, but I was just wondering if there is a better way.
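For what it's worth, a minimal sketch of that workaround in Python with pyarrow (the question asks for Scala/Spark, so treat this only as an illustration of the idea; the path and the presence of a "source" column are assumptions). It reads just the source column of the first row group rather than the whole file:

```python
import pyarrow.parquet as pq

# Hypothetical path; the file is assumed to contain a "source" column
# as described above.
path = "some_table.parquet"

pf = pq.ParquetFile(path)

# Read only the "source" column of the first row group instead of the
# whole file, then take the first value.
first_rows = pf.read_row_group(0, columns=["source"])
source = first_rows.column("source")[0].as_py()

print(source)
```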
I am trying to understand how to correctly load a large file into a database. I understand how to get the file from the database and stream it back without using too many resources by using a DataReader to read into a buffer and then writing the buffer to the OutputStream.
When it comes to storing the file, all of the examples I could find read the entire file into a byte array and then supply it to a data parameter.
Is there a way to store the file in a database without having to read the entire file into memory first?
I am using ASP.NET and SQL Server.
If you can use .NET 4.5, there is new support for streaming. Also, see Using ADO's new Async methods, which gives some complementary examples.