How to encode multiple videos in parallel (Debian)

I'd like to encode some video files either to MP4 or to x264 format on Debian Linux.
It is very important that I can encode multiple files in parallel.
For example, I want to encode two videos in parallel on a dual-core machine and put the other videos in a queue. When a video is finished, I want the free core to encode the next video in the queue. Also, even if this works with x264, I don't know about MP4.
What is the best approach here?
x264 supports parallel encoding, but I don't know whether that means parallel encoding of multiple files or parallel encoding of different versions of a single video.
Is there a way I can assign one encoding process to core 1 and another to core 2?
Sincerely,
wolfen

Do you really need to encode multiple videos in parallel (are they racing?), or do you just want to avoid leaving extra processor cores idle?
In either case, FFmpeg should work for your needs.
By default, FFmpeg will use all available CPU cores for any processing, allowing faster processing of a single video. However, you can also explicitly specify the number of cores to use via the -threads parameter; e.g., ffmpeg -i input.mov -threads 1 output.mov will use only one core.
It doesn't have any built-in queueing, though; you'll still have to code that part yourself.
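
Since the queueing has to live outside FFmpeg, here is a minimal sketch of one way to build it, written in Scala purely for illustration (a shell loop or xargs -P 2 would do the same job). The input file names and the libx264 settings are hypothetical, and -threads 1 only limits each encode to a single thread; if you want hard CPU affinity, you could additionally prefix each command with taskset -c 0 or taskset -c 1.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.sys.process._

object EncodeQueue {
  def main(args: Array[String]): Unit = {
    // Hypothetical input files; replace with your own list.
    val inputs = Seq("a.avi", "b.avi", "c.avi", "d.avi")

    // A pool of two workers means at most two ffmpeg processes at a time;
    // the remaining files simply wait in the pool's queue.
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val jobs = inputs.map { in =>
      Future {
        val out = in.stripSuffix(".avi") + ".mp4"
        // -threads 1 limits each encode to one thread, so the two
        // running encodes do not fight each other for CPU.
        Seq("ffmpeg", "-i", in, "-c:v", "libx264", "-threads", "1", out).!
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)
    pool.shutdown()
  }
}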

How to append data to a compressed file?

I am getting a lot of data from a websocket stream and I want to store it on disk. The amount of data received is ~300 MB per hour, and I want to store this data long term (months, years).
In .NET there is a way to read from and write to zipped files using compression streams. Is there a way to write directly to a compressed file in Swift?
This is a macOS (OS X) question.
Edit:
Stream compression might be a solution here, but I am not used to working with unsafe pointers and don't even know whether it can be used to write to a compressed file... I have been stuck on this for a few hours now. A code sample or directions on how to approach it would help. A CocoaPods wrapper for stream compression would be even better.
gzlog does what you're looking for. It is written in C and uses the zlib library. zlib is available on macOS, and you can link to C code from Swift.

How to read/write to raw device with PowerShell?

I have to read and write data (up to 512 bytes) to/from raw disks (the first sector on the disk, and the first sector of each partition).
I'd like to use PowerShell for that, but I have failed to find any reference on accessing raw disks and raw partitions.
What is the way (or ways) to do that?
You can do in PowerShell most of the things you can do in .NET (with C# or another language). You will find the way to do it in C# in CCS LABS C#: Low Level Disk Access, but I'm really not sure it's a good idea to do that from a scripting language.

Netflow sample data sets

Does anyone know of an open NetFlow data set? I want to use it to run a little experiment and analyse some of the flows. I have looked around, but there is nothing. Alternatively, is there a good method to capture NetFlow data without actually having a Cisco router?
Thanks!
Your best/quickest option is to generate NetFlow through a software exporter that uses live capture (see, for instance, nProbe: http://www.ntop.org/products/netflow/nprobe/ or FlowTraq's free exporter: http://www.flowtraq.com/corporate/product/flow-exporter/).
Both of these software exporters can also generate NetFlow from PCAP files. This can be convenient if you either have PCAP files or download PCAP datasets, which are much more readily available than NetFlow datasets.

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have a somewhat unique problem that looks similar to the problem here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps and picking out specific packets to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My main requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes.
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get it into the MATLAB HDF5 format, or whether any of these would make it harder than the others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to make sure of (whatever solution you pick):
Do the machine(s) capturing the data have enough cores? It's best to taskset a tickerplant, for example, to a core that nothing else will contend with.
Similarly with disk: use an SSD, and be sure there is no contention on the bus.
Separate the workload: you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware KDB+ could do the job. However, given that you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting to disk on the fly.
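
Purely to illustrate that last idea, here is a minimal sketch (in Scala, rather than your C++ capture code) of a simple writer that appends fixed-size timestamped records to a flat file, which a separate job could later convert to HDF5/MAT. The record layout (nanosecond timestamp, metric id, double value) and the 4 MB buffer size are assumptions.

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Append fixed-size records (nanosecond timestamp, metric id, value) to a
// flat file; a separate job can convert closed files to HDF5/MAT later.
final class MetricWriter(path: String) {
  private val channel = FileChannel.open(
    Paths.get(path),
    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)
  private val buf = ByteBuffer.allocate(4 * 1024 * 1024) // 4 MB write buffer

  def write(timestampNs: Long, metricId: Int, value: Double): Unit = {
    if (buf.remaining() < 20) flush() // 8 + 4 + 8 bytes per record
    buf.putLong(timestampNs).putInt(metricId).putDouble(value)
  }

  def flush(): Unit = {
    buf.flip()
    while (buf.hasRemaining) channel.write(buf)
    buf.clear()
  }

  def close(): Unit = { flush(); channel.close() }
}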

How to efficiently process 300+ files concurrently in Scala

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing for differences but processing those files in the same sequential order. Let's say I have to look at the i-th byte in every file at the same time, then move on to the (i+1)-th byte.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
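A minimal sketch of that read-once-then-compare-in-series idea in Scala (the file names are hypothetical):

import java.nio.file.{Files, Paths}
import java.util.Arrays

// Read the reference file once, then compare the others against it one by one.
val files = Seq("f1.bin", "f2.bin", "f3.bin") // hypothetical file names
val reference = Files.readAllBytes(Paths.get(files.head))

val allEqual = files.tail.forall { name =>
  Arrays.equals(reference, Files.readAllBytes(Paths.get(name)))
}
println(s"all files identical to ${files.head}: $allEqual")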
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit in your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 files at the same time -- perhaps through Futures -- and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB chunks using Futures.
Go back to step 1, unless you are finished with the files.
Get the results back from the processing Futures.
In step 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all of the files in step 1, you use some of the time spent reading them to do useful CPU work. You may experiment with lowering the number of bytes read in step 1 as well.
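A rough Scala sketch of this loop, with hypothetical file names and the processing step left as a stub; it overlaps the read of one batch with the processing of the previous one:

import java.io.{BufferedInputStream, FileInputStream}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val chunkSize = 512 * 1024
val paths = Seq("f1.bin", "f2.bin", "f3.bin") // hypothetical file names
val streams = paths.map(p => new BufferedInputStream(new FileInputStream(p)))

// Keep reading until the chunk is full or the file ends, so all files stay
// aligned; returns an empty array at end of file.
def readChunk(in: BufferedInputStream): Array[Byte] = {
  val buf = new Array[Byte](chunkSize)
  var off = 0
  var n = 0
  while (n != -1 && off < chunkSize) {
    n = in.read(buf, off, chunkSize - off)
    if (n > 0) off += n
  }
  if (off == 0) Array.empty[Byte] else buf.take(off)
}

// Stand-in for the real work: walk the i-th byte of every chunk in lockstep.
def processChunks(chunks: Seq[Array[Byte]]): Unit = ()

var pending: Future[Unit] = Future.successful(())
var done = false
while (!done) {
  val chunks = streams.map(readChunk)       // step 1: read 512 KB from every file
  if (chunks.forall(_.isEmpty)) done = true
  else {
    Await.result(pending, Duration.Inf)     // step 4: wait for the previous batch
    pending = Future(processChunks(chunks)) // step 2: process while the next read runs
  }                                         // step 3: loop back to reading
}
Await.result(pending, Duration.Inf)
streams.foreach(_.close())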
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course, you may want to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA-1.
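A minimal sketch of the same idea in Scala, using the JDK's java.security.MessageDigest (the file names are hypothetical): hash each file and group by digest, so files with the same SHA-1 are almost certainly identical.

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Hash the whole file and render the digest as lowercase hex.
def sha1Hex(path: String): String = {
  val digest = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(Paths.get(path)))
  digest.map(b => f"${b & 0xff}%02x").mkString
}

val files = Seq("f1.bin", "f2.bin", "f3.bin") // hypothetical file names
val groups = files.groupBy(sha1Hex)           // same digest => (almost certainly) same content
groups.foreach { case (hash, names) => println(s"$hash -> ${names.mkString(", ")}") }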
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
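A sketch of that ByteBuffer approach in Scala, comparing two files chunk by chunk through FileChannel (the 512 KB chunk size is arbitrary; the fill loop keeps the two buffers aligned even when read() returns less than a full chunk):

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Fill a buffer from a channel, looping because read() may return fewer
// bytes than requested; leaves the buffer flipped and ready to compare.
def fill(ch: FileChannel, buf: ByteBuffer): Int = {
  buf.clear()
  var total = 0
  var r = 0
  while (r != -1 && buf.hasRemaining) {
    r = ch.read(buf)
    if (r > 0) total += r
  }
  buf.flip()
  total
}

// Compare two files chunk by chunk using NIO channels and ByteBuffers.
def sameContents(a: String, b: String, chunkSize: Int = 512 * 1024): Boolean = {
  val ca = FileChannel.open(Paths.get(a), StandardOpenOption.READ)
  val cb = FileChannel.open(Paths.get(b), StandardOpenOption.READ)
  try {
    if (ca.size() != cb.size()) return false
    val ba = ByteBuffer.allocate(chunkSize)
    val bb = ByteBuffer.allocate(chunkSize)
    var read = 1
    while (read > 0) {
      read = fill(ca, ba)
      fill(cb, bb)
      if (ba != bb) return false // ByteBuffer.equals compares the remaining bytes
    }
    true
  } finally { ca.close(); cb.close() }
}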