dd speed issue using ibs/obs - perl

I have a loop where I use dd to copy a stream to a disk. I use a large block size (via 'bs') throughout the process for speed. However, on one specific line I have to use 'ibs' and 'obs', because my 'seek' location is not a multiple of the 'bs' I use elsewhere.
My question is: Is there a way using dd or any other program/Perl module to write out a blocksize different from the one used to 'seek'?
dd if=/dev/ram1 of=/dev/sdb1 seek=2469396480 ibs=1048576 obs=1 count=1
As you can see above, while the raw data is read in a 1M block I have to write it out in 1 byte segments because I need to seek to a specific location based on a byte granularity. This makes the write 1/100th as fast.
Is there a workaround? Or is there a way to do this in Perl without using dd?
Thanks,
Nick

This problem is inherent in dd, because 'seek' is counted in output blocks. If your desired seek location has no factor of suitable magnitude (big enough for good performance but small enough to use as a buffer size), then you're stuck. This happens, for example, when your desired seek location is a large prime.
In this specific case, as Mark Mann pointed out, you do have good options: 2469396480 is 2355 blocks of size 1048576, or 1024 blocks of size 2411520, etc... But that's not a generic answer.
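The arithmetic above can be checked with a short sketch. It is written in Python rather than Perl, and the function name and the 1 MiB cap are illustrative assumptions, not anything dd provides. It picks the largest power-of-two block size that divides the byte offset, and the seek count that goes with it:

```python
def block_size_for_offset(offset, max_block=1024 * 1024):
    """Return (bs, seek) so that bs * seek == offset.

    bs is the largest power of two dividing offset, capped at max_block
    (max_block is assumed to be a power of two itself).
    """
    if offset == 0:
        return max_block, 0
    # offset & -offset isolates the lowest set bit, i.e. the largest
    # power of two that divides offset.
    bs = min(offset & -offset, max_block)
    return bs, offset // bs


print(block_size_for_offset(2469396480))  # the offset from the question
print(block_size_for_offset(1000003))     # an odd offset only divides by 1
```

For the offset in the question this gives bs=1048576 and seek=2355, matching the figures above. For an odd offset the best power-of-two block size is 1, which is exactly the slow case.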
To do this generically, you'll want to use something other than dd. Fortunately, dd's task is really simple and all you need is the following (in pseudocode... I haven't done much Perl in a while)
if = open("/dev/ram1", "r")
of = open("/dev/sdb1", "r+")
seek(of, 2469396480)
loop until you have copied the amount of data you want {
    chunk = read(if, min(chunksize, remaining_bytes_to_copy))
    write(of, chunk)
}
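A runnable version of that pseudocode might look like the following. This is a minimal sketch in Python rather than Perl; the function name, parameters, and default chunk size are assumptions made for illustration. The important details are that the output is opened for update ("r+b") so it is not truncated, and that the seek is done in bytes, independently of the chunk size:

```python
def copy_at_offset(src_path, dst_path, offset, length, chunk_size=1024 * 1024):
    """Copy up to `length` bytes from src_path into dst_path at byte `offset`.

    Returns the number of bytes actually copied (less than `length` if the
    source runs out first).
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        dst.seek(offset)  # byte-granular seek, unrelated to chunk_size
        while copied < length:
            chunk = src.read(min(chunk_size, length - copied))
            if not chunk:  # end of source reached early
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied
```

With the values from the question, the call would be along the lines of copy_at_offset("/dev/ram1", "/dev/sdb1", 2469396480, 1048576). Writing to a raw device like that normally needs root and should be tried on scratch files first.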
It looks like the source of your copy is a ramdisk of some sort. If you really want screaming performance, you might try another method besides reading chunks into a buffer and writing the buffer out to the output file. For example you can mmap() the source file and write() directly from the mapped address. The OS may (or may not) optimize away one of the RAM-to-RAM copy operations. Note that such methods will be less portable and less likely to be available in high level languages like Perl.

Related

compression algorithm that allows random reads/writes in a file?

Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers, please assume the file in question is 100GB, and that sometimes I'll want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes in the middle.
Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details. But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
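For anyone who does want to handle it manually, the block-based idea from the question can be sketched as follows. This is an illustration only, using Python's zlib rather than LZS or LZJB, and it is not how any particular file system does it. The block size and function names are assumptions. Each fixed-size slice of the data is compressed on its own, so a read only has to decompress the blocks that the requested range touches, including a range that crosses a block boundary:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (an arbitrary choice)


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each fixed-size slice independently; return the list of blocks."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_range(blocks, offset, length, block_size=BLOCK_SIZE):
    """Return `length` uncompressed bytes starting at uncompressed `offset`."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    # Decompress only the blocks the range overlaps, then slice the range out.
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = offset - first * block_size
    return chunk[start:start + length]
```

Because the uncompressed block size is fixed, the block number is just offset // block_size. In a real file you would also need a table of where each compressed block starts, for which the Python list stands in here. A random write would mean decompressing, modifying, and recompressing the affected block, whose compressed size may change, and that is the hard part compressed file systems take care of.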

Difference between machine language, binary code and a binary file

I'm studying programming and in many sources I see the concepts: "machine language", "binary code" and "binary file". The distinction between these three is unclear to me, because according to my understanding machine language means the raw language that a computer can understand i.e. sequences of 0s and 1s.
Now if machine language is a sequence of 0s and 1s and binary code is also a sequence of 0s and 1s then does machine language = binary code?
What about binary file? What really is a binary file? To me the word "binary file" means a file, which consists of binary code. So for example, if my file was:
010010101010010
010010100110100
010101100111010
010101010101011
010101010100101
010101010010111
Would this be a binary file? If I google binary file and see Wikipedia I see this example picture of binary file which confuses me (it's not in binary?....)
Where is my confusion happening? Am I mixing file encoding here or what? If I were to ask one to SHOW me what is machine language, binary code and binary file, what would they be? =) I guess the distinction is too abstract to me.
Thnx for any help! =)
UPDATE:
In Python for example, there is one phrase in a file I/O tutorial, which I don't understand: Opens a file for reading only in binary format. What does reading a file in binary format mean?
Machine code is ultimately just binary: numbers in base 2, where every digit is either 1 or 0. But machine code can also be written in hexadecimal (hex), a number system with base 16. Binary and hex are closely related: it's easy to convert from binary to hex and back again. And because hex is much more readable than binary, it's often what gets shown. For instance, the picture in your question uses hex numbers!
Let's say you have the binary sequence 1001111000001010. It can easily be converted to hex by grouping it into blocks of four bits each.
1001 1110 0000 1010 => 9 14 0 10 which in hex becomes: 9E0A.
One can agree that 9E0A is much more readable than the binary - and hex is what you see in the image.
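The grouping step can be written out as a few lines of Python (the function name is made up for this illustration):

```python
def binary_to_hex(bits):
    """Convert a string of 0s and 1s to hex by grouping four bits at a time."""
    # Pad on the left so the length is a multiple of four.
    padded = bits.zfill(-(-len(bits) // 4) * 4)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    # Each 4-bit group is a value from 0 to 15, i.e. exactly one hex digit.
    return "".join("0123456789ABCDEF"[int(group, 2)] for group in groups)


print(binary_to_hex("1001111000001010"))  # prints 9E0A
```

Each group of four bits maps to exactly one hex digit, which is why the conversion never needs to look at more than four bits at a time.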
I'm honestly surprised to not see the information I was looking for, looking back though, I guess the title of this thread isn't fully appropriate to the question the OP was asking.
You guys all say "Machine Code is a bunch of numbers".
Sure, the "CODE" is a bunch of numbers, but what people are wondering (I'm guessing) is "what actually is happening physically?"
I'm quite a novice when it comes to programming, but I understand enough to feel confident in 'roughly' answering this question.
Machine code, to the actual circuitry, isn't numbers or values.
Machine code is a bunch of voltage gates that are either open or closed, and depending on what they're connected to, a certain light will flicker at a certain time etc.
I'm guessing that the "machine code" dictates the pathway and timing for specific electrical signals that will travel to reach their overall destination.
So for 010101, 3 voltage gates are closed (The 0's), 3 are open (The 1's)
I know I'm close to the right answer here, but I also know it's much more sophisticated - because I can imagine that which I don't know.
010101 would be easy instructions for a simple circuit, but what I can't begin to fathom is how a complex computer processes all of the information.
So I guess let's break it down?
The "x-bit" in "x-bit processor" tells you how many bits the processor can process at once.
A bit is either 1 or 0, "On" or "Off", "Open" or "Closed"
so 32-bit processors process "10101010 10101010 10101010 10101010" - this many bits at once.
A processor is an "integrated circuit", which is like a compact circuit board, containing resistors/capacitors/transistors and some memory. I'm not sure if processors have resistors but I know you'll usually find a ton of them located around the actual processor on the circuit board
Anyways, a transistor is a switch so if it receives a 1, it sends current in one direction, or if it receives a 0, it'll send current in a different direction... (or something like that)
So I imagine that as machine code goes... the segment of code the processor receives changes the voltage channels in such a way that it sends a signal to another part of the computer (why do you think processors have so many pins?), probably another integrated circuit more specialized to a specific task.
That integrated circuit then receives a chunk of code, let's say 2 to 4 bits 01 or 1100 or something, which further defines where the final destination of the signal will end up, which might be straight back to the processor, or possibly to some output device.
Machine code is a very efficient way of taking a circuit and connecting it to a lightbulb, and then taking that lightbulb out of the circuit and switching the circuit over to a different lightbulb
Memory in a computer is highly necessary because otherwise to get your computer to do anything, you would need to type out everything (in machine code). Instead, all of the 1's and 0's are stored inside some storage device, either a spinning hard disk with a magnetic head pin that 'reads' 1's or 0's based on the charge of the disk, or a flash memory device that uses a series of transistors, where sending a voltage through elicits 1's and 0's (I'm not fully aware how flash memory works)
Fortunately, someone took the time to think up a different base number system for programming (hex), and a way to compile those numbers (translate them) back into binary. And then all software programs have branched out from there.
Each key on the keyboard creates a specific signal in binary that translates to a bunch of switches being turned on or off using certain voltages, so that a current could be run through the specific individual pixels on your screen that create "1" or "0" or "F", or all the characters of this post.
So I wonder, how does a program 'program', or 'make' the computer 'do' something... Rather, how does a compiler compile a program of a code different from binary?
It's hard to think about now because I'm extremely tired (so I won't try) but also because EVERYTHING you do on a computer is because of some program.
There are actively running programs (processes) in task manager. These keep your computer screen looking the way you've become accustomed, and also allow for the screen to be manipulated as if to say the pictures on the screen were real-life objects. (They aren't, they're just pictures, even your mouse cursor)
(Ok I'm done. enough editing and elongating my thoughts, it's time for bed)
Also, what I don't really get is how 0's are 'read' by the computer.
It seems that a '0' must not be a 'lack of voltage', rather, it must be some other type of signal
Where perhaps something like 1 volt = 1, and 0.5 volts = 0. Some distinguishable difference between currents in a circuit that would still send a signal, but could be the difference between opening and closing a specific circuit.
If I'm close to right about any of this, serious props to the computer engineers of the world; the level of sophistication is mouthwatering. I hope to know everything about technology someday. For now I'm just trying to get through Arduino.
Lastly... something I've wondered about... would it even be possible to program today's computers without the use of another computer?
Machine language is a low-level programming language that generally consists entirely of numbers. Because they are just numbers, they can be viewed in binary, octal, decimal, hexadecimal, or any other way. Dave4723 gave a more thorough explanation in his answer.
Binary code isn't a very well-defined technical term, but it could mean any information represented by a sequence of 1s and 0s, or it could mean code in a machine language, or it could mean something else depending on context.
Technically, all files are stored in binary, we just don't usually look at the binary when we view a file. However, the term binary file is usually used to refer to any non-text file; e.g. an .exe, a .png, etc.
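The Python phrase from the question's update ("opens a file for reading only in binary format") fits in here. A small sketch, with made-up function names: opening with "rb" hands you the raw bytes exactly as stored, while opening with "r" decodes those same bytes into text using a character encoding.

```python
def read_both_ways(path, encoding="utf-8"):
    """Read the same file once as raw bytes and once as decoded text."""
    with open(path, "rb") as f:  # binary mode: no decoding at all
        raw = f.read()
    with open(path, "r", encoding=encoding) as f:  # text mode: bytes -> str
        text = f.read()
    return raw, text


def as_bits(raw):
    """Show each byte as eight 0/1 digits, the way the question imagined a file."""
    return " ".join(f"{byte:08b}" for byte in raw)


print(as_bits(b"A"))  # prints 01000001
```

So the file from the question containing the characters 0 and 1 would be a text file. On disk, each of those characters is itself stored as a byte (00110000 for "0" and 00110001 for "1").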
You have to understand the basic principles of how a computer works, and this will clear things up for you. I therefore recommend reading up on things like the von Neumann architecture.
Basically, a very simple computer has only one memory, like an array, which holds both the instructions for your processor and the data, and everything in it is a binary number.
Your program starts at a certain place in your memory and reads the first number...
So here comes the twist: these numbers can be instructions or data.
Your processor reads these numbers and interprets them as instructions.
Example: the start address is 0.
At address 0 is an instruction like "read value from address 120 into the ALU (math unit)"
then it steps to address 1
"read value from address 121 into the ALU"
then it steps to address 2
"subtract the numbers in the ALU"
then it steps to address 3
"if the ALU value is smaller than zero, go to address 10"
it is not smaller than zero, so it steps to address 4
"go to address 20"
You can see that this is a basic if(a < b).
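The walk-through above can be turned into a toy simulator. This is only a sketch of a made-up machine: the opcode numbers and the "opcode * 1000 + address" encoding are invented for the illustration and do not correspond to any real processor. What it does show is the key point that the program and its data sit in the same memory as plain numbers:

```python
# Opcodes for a made-up toy machine (illustrative, not a real instruction set).
LOAD_A, LOAD_B, SUB, JUMP_IF_NEG, JUMP, HALT = 1, 2, 3, 4, 5, 6


def encode(opcode, operand=0):
    """Pack an instruction into one plain number, like any other memory cell."""
    return opcode * 1000 + operand


def run(a, b):
    """Run the five-instruction program above; return the address where it halts."""
    memory = [0] * 128
    memory[0] = encode(LOAD_A, 120)      # read value from address 120
    memory[1] = encode(LOAD_B, 121)      # read value from address 121
    memory[2] = encode(SUB)              # subtract: a - b
    memory[3] = encode(JUMP_IF_NEG, 10)  # if result < 0, go to address 10
    memory[4] = encode(JUMP, 20)         # otherwise go to address 20
    memory[10] = encode(HALT)            # the "a < b" branch ends here
    memory[20] = encode(HALT)            # the "else" branch ends here
    memory[120], memory[121] = a, b      # data lives in the same memory

    pc = 0  # program counter: address of the next instruction
    reg_a = reg_b = alu = 0
    while True:
        opcode, operand = divmod(memory[pc], 1000)
        if opcode == LOAD_A:
            reg_a = memory[operand]
        elif opcode == LOAD_B:
            reg_b = memory[operand]
        elif opcode == SUB:
            alu = reg_a - reg_b
        elif opcode == JUMP_IF_NEG:
            if alu < 0:
                pc = operand
                continue
        elif opcode == JUMP:
            pc = operand
            continue
        elif opcode == HALT:
            return pc
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc}")
        pc += 1


print(run(3, 5))  # a < b, so the program ends at address 10
print(run(5, 3))  # a >= b, so the program ends at address 20
```

Nothing in memory marks a cell as "instruction" or "data". The number at address 0 is only an instruction because the program counter happens to point at it.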
You can write these instructions as numbers and they can be run by your processor, but because nobody wants to do this work (that was what they did with punchcards in the 60s), assembler was invented...
It looks like this:
add 10, 11, 20 // load vars from addresses 10 and 11; run addition and store into address 20
In Conclusion:
Assembler (processor instructions) can be called binary because it's stored as plain numbers.
But everything else can be a binary file, too.
In reality, a simple .exe file is both... If you have variables in there like a = 10 and b = 20, these values can be stored somewhere between the if clauses and for loops... It depends on the compiler where it puts them.
But if you have a complex 3D model, it can be stored in a separate file with no executable code in it...
I hope it helps to clear things up a little.

Tokenize the mmaped string using strtok

I have mmapped a huge file containing 4k * 4k floats. Since it is a text file, I need to mmap it as a char string. Now I need to parse the floats and write them into a 2D array. If I tokenize it using strtok, it will not let me, since the mmapped string is not modifiable. If I copy the string into a std::string and then tokenize using the getline function, it works, but I feel I will lose the performance gained from mmap. How do I optimally solve this problem?
You can try some different solutions, but you will have to benchmark to find out which one is the best for you. It's not always clear that mmap()ing a file and processing the memory-mapped pages directly is the best solution. Especially if you make a single sequential pass through the file, a loop that read()s pieces at a time into a buffer can be faster, even if you use madvise() together with mmap(). Again, benchmark if you want to know what is fastest for you.
Some solutions you might try:
1. mmap() with PROT_READ | PROT_WRITE and MAP_PRIVATE and then use your existing strtok() code. This will allow strtok() to write the NUL bytes it wants to write, without having those changes be reflected in the file. If you choose this solution, you should probably call madvise(MADV_DONTNEED) on the parts of the file you have already processed, else memory usage will grow linearly.
2. Implement your own variant of strtok() that returns the length of the matched token instead of a NUL-terminated string. It's not difficult, using memchr(). This way you don't need to modify the memory. You might then need to pass the resulting tokens to functions that take a string and a length instead of a NUL-terminated string. There aren't many such functions in the C library, but even so you might be able to get away with calling functions like strtod() if the tokens are guaranteed to end in some non-digit delimiter. Or you can copy them into a small stack-allocated buffer (they're floats, they can't be that long, right?)
3. Use a read()-and-process loop instead of mmap().
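The second suggestion, a tokenizer that reports token boundaries instead of writing NUL bytes, can be sketched in Python to show the logic. This is an illustration of the idea, not a port of strtok(): the function names are made up, and a C version would use memchr() or a pointer scan rather than a byte-by-byte Python loop. The mapping is opened read-only and never modified:

```python
import mmap


def next_token(buf, pos, delims=b" \t\r\n"):
    """Return (start, end) of the next token at or after pos, or None at the end.

    Nothing is written to buf; the token is described by its bounds only.
    """
    n = len(buf)
    while pos < n and buf[pos] in delims:      # skip leading delimiters
        pos += 1
    if pos >= n:
        return None
    end = pos
    while end < n and buf[end] not in delims:  # scan to the end of the token
        end += 1
    return pos, end


def parse_floats(path):
    """Parse every whitespace-separated float in a read-only mapping of path."""
    values = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = 0
            while True:
                bounds = next_token(mm, pos)
                if bounds is None:
                    break
                start, end = bounds
                # Slicing copies just this token, like the small stack buffer
                # mentioned above.
                values.append(float(mm[start:end].decode("ascii")))
                pos = end
    return values
```

For a 4k * 4k grid, the flat list of values can then be reshaped into rows of 4096. As the answer says, whether this beats a plain read() loop is something only a benchmark on your data will tell you.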

Is there a way to know how much memory space a file would take?

Is there a way to know beforehand how much memory space a file would take?
For example, let's say I have a file size of 1G bytes. How would this file size translate to memory size?
I'll take your example from the comment and elaborate on what might happen to a text file when it is loaded into memory: Some time ago, "text" usually meant ASCII (as the least common denominator, at least). And lots of software, written in a language like C, would represent such ASCII strings as a char* type. This resulted in a more-or-less exact match in memory requirements: every byte in the input file would occupy one byte when loaded into RAM.
But that has changed in recent years with the rise of Unicode. The same text file, loaded by a simple Java program (and using Java's String type, which is very likely), would take up twice the amount of RAM. This is because the Java String type represents each character internally using UTF-16 (at least 16 bits per character), whereas ASCII uses only one byte per character.
What I'm trying to say here is: there is no easy answer to your question. It always depends on who reads the data and what they're going to do with it.
One thing is true quite often: by "loading", the data does not become smaller.
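The effect of the character encoding is easy to see by encoding the same text several ways and counting bytes. This is a small Python sketch with a made-up function name. It counts the encoded payload only and ignores any per-object overhead a language runtime adds on top:

```python
def encoded_sizes(text):
    """Return the number of bytes the same text needs in several encodings."""
    encodings = ("ascii", "utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes("binary file"))
# prints {'ascii': 11, 'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}
```

The same 11 characters take 11 bytes as ASCII and 22 bytes as UTF-16, which is the doubling described above.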
If you read the whole file into memory at once, you'll need at least as much free memory as the size of the file. Much of the time people don't actually need to do so; they just don't know another way. For an explanation of the problem and alternatives, see:
http://www.effectiveperlprogramming.com/2010/01/memory-map-files-instead-of-slurping-them/
You can check yourself by writing a little test script with Memory::Usage.
From its documentation's synopsis:
use Memory::Usage;
my $mu = Memory::Usage->new();
# Record amount of memory used by current process
$mu->record('starting work');
# Do the thing you want to measure
$object->something_memory_intensive();
# Record amount in use afterwards
$mu->record('after something_memory_intensive()');
# Spit out a report
$mu->dump();
Then you'll know how much your build of Perl, given whatever character encoding you intend to use, and whatever method of dealing with the file you intend to implement, will consume in memory.
If you can avoid loading the whole file at once, and instead just iterate over it line by line or record by record, the memory concern goes away. So it would help to know what you actually are trying to accomplish. You may have an XY problem.
perldoc -f stat
stat Returns a 13-element list giving the status info for a file,
either the file opened via FILEHANDLE or DIRHANDLE, or named by
EXPR. If EXPR is omitted, it stats $_. Returns the empty list
if "stat" fails. Typically used as follows:
($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
$atime,$mtime,$ctime,$blksize,$blocks)
= stat($filename);
Note the $size return value. It is the size of the file in bytes. If you are going to slurp the entire file into memory you will need at least $size bytes. Then again, you might need a whole lot more (or even a whole lot less), depending on what you do to the contents of the file.
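The same comparison, on-disk size versus what the loaded contents occupy, can be sketched in Python, where os.stat() plays the role of Perl's stat. The function name and dictionary keys are made up for the illustration, and sys.getsizeof() reports sizes for CPython objects, so the exact numbers are implementation-specific and say nothing about Perl's own overhead:

```python
import os
import sys


def size_report(path, encoding="utf-8"):
    """Compare a file's size on disk with the size of its contents in memory."""
    size_on_disk = os.stat(path).st_size  # same figure as Perl's $size
    with open(path, "rb") as f:
        data = f.read()                   # slurp the whole file
    text = data.decode(encoding)          # a second, decoded copy
    return {
        "file_bytes": size_on_disk,
        "bytes_object": sys.getsizeof(data),
        "decoded_str": sys.getsizeof(text),
    }
```

Holding both the raw bytes and the decoded string at once, as this function briefly does, is one ordinary way to end up with roughly twice the file size in memory.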

Which tool should I use for finding out my memory allocation in Perl?

I've slurped in a big file using File::Slurp but given the size of the file I can see that I must have it in memory twice or perhaps it's getting inflated by being turned into 16 bit unicode. How can I best diagnose that sort of a problem in Perl?
The file I pulled in is 800mb in size and my perl process that's analysing that data has roughly 1.6gb allocated at runtime.
I realise that I may be wrong about my reason for the problem but I'm not sure the most efficient way to prove/disprove my theory.
Update:
I have eliminated dodgy character encoding from the list of suspects. It looks like I'm copying the variable at some point; I just can't figure out where.
Update 2:
I have now done some more investigation and discovered that it's actually just getting the data from File::Slurp that's causing the problem. I had a look through the documentation and discovered that I can get it to return a scalar_ref, i.e.
my $data = read_file($file, binmode => ':raw', scalar_ref => 1);
Then I don't get the inflation of my memory. Which makes some sense and is the most logical thing to do when getting the data in my situation.
The information about looking at what variables exist etc. has been generally helpful though, thanks.
Maybe Devel::DumpSizes and/or Devel::Size can help out? I think the former would be more useful in your case.
Devel::DumpSizes - Dump the name and size in bytes (in increasing order) of variables that are available at a given point in a script.
Devel::Size - Perl extension for finding the memory usage of Perl variables
Here are some generic resources on memory issues in Perl:
http://perl.active-venture.com/pod/perldebguts-perlmemory.html
Perl memory usage profiling and leak detection?
How can I find memory leaks in long-running Perl program?
As far as your own suggestion, the simplest way to disprove would be to write a simple Perl program that:
1. Creates a big (100M) file of plain text, probably by just outputting the same string in a loop into a file, or for binary files by running the dd command via a system() call
2. Reads the file in using standard Perl open()/@a=<>;
3. Measures memory consumption.
Then repeat #2-#3 for your 800M file.
That will tell you if the issue is File::Slurp, some weird logic in your program, or some specific content in the file (e.g. non-ASCII, although I'd be surprised if that ends up being the reason)