Is there a way to replay a gzip-compressed log file in kdb+ without uncompressing it first?

Streaming execute, -11!, does not work on named pipes, so the obvious approach of redirecting the gzip -cd output to a named pipe and passing that to -11! is not an option.

-11! accepts a compressed file and streams it so long as it was compressed with -19! (using 2 as the compression algorithm parameter, which is gzip).
The only difference between a normal gzipped file and a kdb compressed one is a few bytes at the beginning of the file.
EDIT (see comment): thanks, this isn't true - the bytes are also different at the end of the file.
So a possible solution is to prepend your gzipped files (if they weren't produced by -19!) with the appropriate byte array first.

For anyone using kdb+ v3.4 or later, streaming execution for named pipes was introduced via the function .Q.fps.
Here is a simple example of .Q.fps in action. First create the named pipe and write to it from the command line (the echo is backgrounded because a write to a fifo blocks until a reader attaches):
mkfifo test.pipe
echo "aa" > test.pipe &
Next in a q session:
q).Q.fps[0N!]`:test.pipe
,"aa"
Here 0N!, which prints its argument and returns it, is used as the callback to display the contents read from the pipe.

Related

Is it possible to read files with uncommon extensions (not like .txt or .csv) in Apache Beam using Python SDK? For example, file with .set extension

Is it possible to read files with uncommon extensions (not like .txt or .csv or .json) in Apache Beam using the Python SDK? For example, I want to read a local file with the .set extension (a special file format containing an EEG recording). I couldn't find any information about how to implement this on the official page.
If I understand correctly, beam.Create creates a PCollection from an iterable, but what if my data is not iterable (like the data in a .set file)? How do I read it?
If you have only one file to process, you can pass a list to beam.Create that contains your set file.
p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
Regarding reading the set file, even if it's not supported officially by Beam I/O connectors, you can create your own connector in Python:
class ReadSetContent(beam.DoFn):
    def process(self, file):
        # your .set file path will be passed here, so you can read it
        # yield the content of the set file and it will be processed by the next transformation
So the start of your pipeline would look like this:
(p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
| 'Reading content' >> beam.ParDo(ReadSetContent())
| 'Next transformation' >> ...)
You can use the Beam fileio library to read arbitrary files. Any custom processing should be done by subsequent ParDo transforms.
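For completeness, here is a minimal, runnable sketch of that fileio-based approach. The glob pattern and the "parse" step are placeholders; a real pipeline would replace the size calculation with whatever library actually decodes the EEG structure from the raw bytes.
import apache_beam as beam
from apache_beam.io import fileio

def summarise_set_file(readable_file):
    # readable_file is a fileio.ReadableFile: it exposes the matched path via
    # .metadata.path and the raw bytes via .read().
    raw_bytes = readable_file.read()
    # Placeholder "parsing": return the path and the file size. Real code would
    # decode the .set contents from raw_bytes here.
    return readable_file.metadata.path, len(raw_bytes)

with beam.Pipeline() as p:
    (p
     | 'Match .set files' >> fileio.MatchFiles('./data/*.set')   # placeholder pattern
     | 'Read matched files' >> fileio.ReadMatches()
     | 'Summarise content' >> beam.Map(summarise_set_file)
     | 'Print results' >> beam.Map(print))
The same shape works with the beam.Create(['your_file.set']) | beam.ParDo(ReadSetContent()) start shown above; fileio just saves you from hand-rolling the file matching and opening.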

dumpcap, save to text file and line separated

I'm trying to build a solution where dumpcap saves to text file in the format:
timestamp_as_detailed_as_possible, HEX-raw-packet
My goal is to have this continuously streaming each single data packet to the file, separated by newline.
Two questions:
Is it possible for dumpcap to take care of fragmented packets, so I'm guaranteed each line contains 1 single full packet?
Is it OK to have another thread afterwards running and reading lines from the same file, do something with the data and then delete the line when processed - without this interfering with dumpcap?
Is it OK to have another thread afterwards running and reading lines from the same file, do something with the data and then delete the line when processed - without this interfering with dumpcap?
No. But this is the wrong approach anyway: a pipe is what you should use here, with dumpcap writing to the pipe and the analyzing process reading from it:
dumpcap -w - | analyzer
Is it possible for dumpcap to take care of fragmented packets, so I'm guaranteed each line contains 1 single full packet?
No, and it is also unclear what exactly you expect here. Usually there is no fragmentation at the IP level at all, since TCP tries to keep the packet size no larger than the MTU anyway. And TCP should be treated as a byte stream only, i.e. don't expect anything you send to end up in a single packet, or that multiple sends will actually result in multiple packets.
I'm trying to build a solution where dumpcap saves to text file
Dumpcap doesn't save to text files, it saves to binary pcap or pcapng files.
You might want to consider using tcpdump instead, although you'd have to pipe it to a separate program/script to massage its output into the format you want.
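To make the pipe idea concrete, here is a rough sketch (not the answerer's code) of what the analyzer side could look like in Python, assuming the capture tool writes classic pcap rather than pcapng to stdout, e.g. tcpdump -w - or dumpcap with its -P option where available. The script name analyzer.py and the interface name are made up.
import struct
import sys

def read_exact(stream, n):
    # Read exactly n bytes, or return None once the stream is exhausted.
    buf = stream.read(n)
    return buf if len(buf) == n else None

def main():
    stdin = sys.stdin.buffer
    header = read_exact(stdin, 24)                 # pcap global header
    if header is None:
        sys.exit('empty input')
    magic = struct.unpack('<I', header[:4])[0]
    formats = {
        0xa1b2c3d4: ('<', 1000),                   # little-endian, microsecond stamps
        0xa1b23c4d: ('<', 1),                      # little-endian, nanosecond stamps
        0xd4c3b2a1: ('>', 1000),                   # big-endian, microsecond stamps
        0x4d3cb2a1: ('>', 1),                      # big-endian, nanosecond stamps
    }
    if magic not in formats:
        sys.exit('not a classic pcap stream (pcapng output?)')
    endian, to_nanos = formats[magic]

    while True:
        record = read_exact(stdin, 16)             # per-packet record header
        if record is None:
            break
        ts_sec, ts_frac, incl_len, _orig_len = struct.unpack(endian + 'IIII', record)
        packet = read_exact(stdin, incl_len)
        if packet is None:
            break
        # timestamp_as_detailed_as_possible, HEX-raw-packet
        print('%d.%09d, %s' % (ts_sec, ts_frac * to_nanos, packet.hex()))

if __name__ == '__main__':
    main()
Example invocation: dumpcap -P -i eth0 -w - | python3 analyzer.py. Note that this prints one line per captured frame, which, per the answer above, is not the same thing as one line per application-level message.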

how to append to a file using scala/breeze library

I wish to write to a file the output row of result matrix (produced in iterations) so that I can support checkpointing.
I figured we can use the csvwrite command to write the entire matrix to a file but how can I append to a file?
I am looking for something like below:
breeze.linalg.csvwrite(new File("small.txt"),myMatrix(currRow,::).t.asDenseMatrix)
However the above command overwrites the file each time the command is executed.
There's nothing built-in to Breeze. (Contributions welcome!)
You can use breeze.io.CSVWriter.write directly if you would like.

is partial gz decompression possible?

For working with images that are stored as .gz files (my image processing software can read .gz files for shorter/smaller disk time/space) I need to check the header of each file.
The header is just a small struct of a fixed size at the start of each image, and for images that are not compressed, checking it is very fast. For reading the compressed images, I have no choice but to decompress the whole file and then check this header, which of course slows down my program.
Would it be possible to read the first segment of a .gz file (say a couple of K), decompress this segment and read the original contents? My understanding of gz is that after some bookkeeping at the start, the compressed data is stored sequentially -- is that correct?
so instead of
1. open big file F
2. decompress big file F
3. read 500-byte header
4. re-compress big file F
do
1. open big file F
2. read first 5 K from F as stream A
3. decompress A as stream B
4. read 500-byte header from B
I am using libz.so but solutions in other languages are appreciated!
You can use gzip -cd file.gz | dd ibs=1024 count=10 to uncompress just the first 10 KiB, for example.
gzip -cd decompresses to the standard output.
Pipe (|) this into the dd utility.
The dd utility copies its standard input to its standard output.
The dd option ibs=1024 sets the input block size to 1024 bytes instead of the default 512.
And count=10 copies only 10 input blocks, thus halting the gzip decompression.
You'll want to do gzip -cd file.gz | dd count=1, using the standard 512-byte block size, and just ignore the extra 12 bytes.
A comment highlights that you can instead use gzip -cd file.gz | head -c $((1024*10)), or in this specific case gzip -cd file.gz | head -c 512. The comment's suggestion that the original dd command relies on gzip decompressing in 1024-byte chunks doesn't seem to be true; for example, dd ibs=2 count=10 decompresses only the first 20 bytes.
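Since the question welcomes solutions in other languages, the same partial read can be done programmatically. A small Python sketch (the file name is a placeholder): gzip.open streams the file, so read(500) only decompresses enough of the stream to return the first 500 plaintext bytes instead of inflating the whole file.
import gzip

# Read only the header; the rest of the compressed stream is never inflated.
with gzip.open('image.gz', 'rb') as f:        # placeholder file name
    header_bytes = f.read(500)

print(len(header_bytes))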
Yes, it is possible.
But don't reinvent the wheel: the HDF5 format supports different compression algorithms (gzip among them) and lets you address individual pieces of the data. It is compatible with Linux and Windows, there are wrappers for many languages, and it also supports reading and decompressing in parallel, which is very useful if you use high compression rates.
(The original answer included a chart comparing read speeds for different compression algorithms, accessed from Python through PyTables.)
A Deflate stream can have multiple blocks back to back. But you can always decompress just the number of bytes you want, even if it's part of a larger block. The zlib function gzread takes a length arg, and there are various other ways to decompress a specific amount of plaintext bytes, regardless of how long the full stream is. See the zlib manual for a list of functions and how to use them.
It's not clear if you want to modify just the headers. (You mention recompressing the whole file, but option B doesn't recompress anything). If so, write headers in a separate Deflate block so you can replace that block without recompressing the rest of the image. Use Z_FULL_FLUSH when you call the zlib deflate function to write the headers. You probably don't need to record the compressed length of the headers anywhere; I think it can be computed when reading them to figure out which bytes to replace.
If you aren't modifying anything, recompressing the whole file makes no sense. You can seek and restart decompression from the start after finding headers you like...
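If you want to stay closer to the libz-style, length-limited inflate this answer describes rather than going through a high-level wrapper, a hedged Python sketch using zlib directly might look like the following; the file name, the 500-byte header size and the 4 KiB chunk size are assumptions.
import zlib

def read_plain_prefix(path, want=500, chunk_size=4096):
    # Inflate only until `want` plaintext bytes are available, then stop.
    # wbits=16+MAX_WBITS tells zlib to expect a gzip wrapper, like
    # inflateInit2(&strm, 16+MAX_WBITS) in C.
    inflater = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = b''
    with open(path, 'rb') as f:
        while len(out) < want:
            chunk = f.read(chunk_size)
            if not chunk:
                break                      # stream ended before `want` bytes
            out += inflater.decompress(chunk, want - len(out))
    return out

header = read_plain_prefix('image.gz')     # placeholder file name
print(len(header))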

SAS- Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My files' names are ordered, so I have for example: o_equities_20080528.tas.zip, o_equities_20080529.tas.zip, o_equities_20080530.tas.zip, ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file, or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro that processes one file at a time using the above techniques, adding this information while you import each dataset.