Currently, in order to save an IP address, I convert it to a number and store it in the collection. Basically I am doing this for logging purposes, which means I want to store the information as fast as possible and in the smallest amount of space.
I will rarely be using it for querying.
My thoughts are that:
Storing it as a string is surely inefficient.
Storing it as four separate numbers will be slower and will take more space.
Nonetheless, I think this is an adequate method, but is there a better one for my purpose?
Definitely save IP addresses as numbers, if you don't mind the extra bit of work that it takes, especially if you need to do queries on the addresses and you have large tables/collections.
Here's why:
Storage
An IPv4 address is 4 bytes if stored as unsigned integer.
An IPv4 address varies between 9 and 18 bytes when written out as a string in dotted octet form. (Let's assume the average is 14 bytes.)
That is 7-15 bytes for the characters, plus 2-3 bytes of overhead if you're using a variable-length string type, which varies based on the database you're using. If you use a fixed-length string representation, then you need a 15-character fixed-width field.
Disk storage is cheap, so that's not a factor in most use cases. Memory, however, is not as cheap, and if you have a large table/collection and you want to do fast queries, then you need an index. The 2-3x storage penalty of string encoding drastically reduces the number of records you can index while still keeping the index resident in memory.
An IPv6 address is 16 bytes if stored as an unsigned integer. (Likely as multiple 4 or 8 byte integers, depending on your platform.)
An IPv6 address ranges from 6 bytes to 42 bytes when encoded as a string in abbreviated hex notation.
On the low end, the loopback address (::1) is 3 bytes plus the variable-length string overhead. On the high end, an address like 2002:4559:1FE2:1FE2:4559:1FE2:4559:1FE2 uses 39 bytes plus the variable-length string overhead.
Unlike with IPv4, it's not safe to assume the average IPv6 string length will be the mean of 6 and 42, because the number of addresses with a significant run of consecutive zeroes is a very small fraction of the overall IPv6 address space. Only some special addresses, like loopback and autoconf addresses, are likely to be compressible in this way.
Again, this is a storage penalty of >2x for string encoding versus integer encoding.
Network Math
Do you think routers store IP addresses as strings? Of course they don't.
If you need to do network math on IP addresses, the string representation is a hassle. E.g. if you want to write a query that searches for all addresses on a specific subnet ("return all records with an IP address in 10.7.200.104/27"), you can easily do this by masking an integer address with an integer subnet mask. (Mongo doesn't support this particular query, but most RDBMSs do.) If you store addresses as strings, then your query will need to convert each row to an integer and then mask it, which is several orders of magnitude slower. (Bitwise masking of an IPv4 address can be done in a few CPU cycles using 2 registers. Converting a string to an integer requires looping over the string.)
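As a minimal sketch of what that integer math looks like (plain JavaScript; ip2int and inSubnet are illustrative helper names, not part of any particular database):
function ip2int(ip) {
    // ">>> 0" forces an unsigned 32-bit result in JavaScript.
    return ip.split(".").reduce(function (acc, octet) {
        return (acc << 8) + parseInt(octet, 10);
    }, 0) >>> 0;
}
var mask = 0xFFFFFFE0;  // a /27 netmask: 27 one-bits followed by 5 zero-bits
var network = (ip2int("10.7.200.104") & mask) >>> 0;
function inSubnet(ip) {
    // Two bitwise operations per address -- no string parsing at query time.
    return ((ip2int(ip) & mask) >>> 0) === network;
}
inSubnet("10.7.200.110");  // true  (inside 10.7.200.96 - 10.7.200.127)
inSubnet("10.7.200.200");  // false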
Similarly, range queries ("return all records between 192.168.1.50 and 192.168.50.100") with integer addresses will be able to use indexes, whereas range queries on string addresses will not.
The Bottom Line
It takes a little more work, but not much (there are a million aton() and ntoa() functions out there). If you're building something serious and solid, and you want to future-proof it against future requirements and the possibility of a large dataset, you should store IP addresses as integers, not strings.
If you're doing something quick and dirty and don't mind the possibility of remodeling in the future, then use strings.
For the OP's purpose, if you are optimizing for speed and space and you don't think you will query it often, then why use a database at all? Just print IP addresses to a file. That would be faster and more storage efficient than storing them in a database (with its associated API and storage overhead).
An IPv4 address is four bytes, so you can store it in a 32-bit integer (BSON type 16).
See http://docs.mongodb.org/manual/reference/bson-types
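For example, in the mongo shell you could wrap the converted value in NumberInt() so it is stored as BSON type 16 rather than the shell's default double (the collection name and the precomputed value here are just for illustration):
> // 168282216 == the integer form of "10.7.200.104"
> db.log.insert({ts: new Date(), ip: NumberInt(168282216)})
One caveat: addresses above 127.255.255.255 don't fit in a signed 32-bit integer, so they would wrap to negative values; equality lookups still work, but range queries won't order correctly, so NumberLong may be safer for the general case.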
An efficient way is to save an IP address as an int. If you want to tag an IP with a CIDR filter, here's a demo:
> db.getCollection('iptag').insert({tags: ['office'], hostmin: 2886991873, hostmax: 2887057406, cidr: '172.20.0.0/16'})
> db.getCollection('iptag').insert({tags: ['server'], hostmin: 173867009, hostmax: 173932542, cidr: '10.93.0.0/16'})
> db.getCollection('iptag').insert({tags: ['server'], hostmin: 173932545, hostmax: 173998078, cidr: '10.94.0.0/16'})
Create an index on tags.
> db.getCollection('iptag').ensureIndex({tags: 1})
Filter an IP by CIDR range; ip2int('10.94.25.32') == 173938976.
> db.getCollection('iptag').find({hostmin: {$lte: 173938976}, hostmax: {$gte: 173938976}})
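For the range lookup above to avoid a full collection scan, you would presumably also want an index on the range fields, e.g. a compound one (a suggestion, not part of the original demo):
> db.getCollection('iptag').ensureIndex({hostmin: 1, hostmax: 1})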
The simplest way for IPv4 is to convert it to an int using the interesting maths provided here.
I use the following function (JS) to convert an address before matching against the db:
ipv4Number: function (ip) {
    // Split "a.b.c.d" into its octets and combine them into a single
    // number: a*256^3 + b*256^2 + c*256 + d.
    var iparray = ip.split(".");
    var ipnumber = parseInt(iparray[3], 10) +
        parseInt(iparray[2], 10) * 256 +
        parseInt(iparray[1], 10) * Math.pow(256, 2) +
        parseInt(iparray[0], 10) * Math.pow(256, 3);
    // Malformed input yields NaN, and NaN > 0 is false, so we fall
    // through and return 0.
    if (ipnumber > 0) return ipnumber;
    return 0;
}
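Usage looks like this (assuming the function lives on some object, as the property syntax above suggests; myUtils is a hypothetical name):
myUtils.ipv4Number("192.168.1.1")  // 3232235777
myUtils.ipv4Number("not an ip")    // 0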
Related
I'm sending a data buffer over the network using sockets. The only place I see endianness conversion being used is for the port number of the sender/receiver. I can understand that.
The thing I can't understand is that I'm sending a const char* (using send()/sendto(), depending on the transfer protocol). As far as I understand, you transfer bytes over the network in big-endian, and my machine is little-endian.
What is the trick here that lets me avoid the ntoh()/hton() functions when sending that const char*?
Background: The concept of big-endian vs little-endian only applies to multi-byte integers (and floating point values). For historical reasons, different CPUs may represent the same numeric binary value in memory via different byte-orderings, so the designers of the Berkeley sockets API created ntoh*() and hton*() to translate from the CPU's native byte-ordering (whatever that may be) to a "network-standard" format (they chose big-endian) and back, so that binary numbers can be transferred from one machine type to another without being misinterpreted by the recipient's CPU.
Crucially, this splintering of representations happens only above the byte-level; i.e. the ordering of bytes within an N-byte word may be different on different CPUs, but the ordering of bits within each byte is always the same.
A character string (as pointed to by a const char *) refers to a series of individual 8-bit bytes, and the bytes' ordering is interpreted the same way by all CPU types, so there is no need to convert their format to maintain data-compatibility across CPUs. (Note that there are different formats that can be used to represent character strings, e.g. ASCII vs UTF-8 vs UTF-16, but that is a decision made by the software rather than a characteristic of the CPU hardware the software is running on; as such, the Berkeley sockets API doesn't try to address that issue.)
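To make the distinction concrete, here is a small JavaScript sketch (using DataView; purely illustrative, since the question is about C sockets) of why a multi-byte integer has two possible layouts while a byte string has only one:
var view = new DataView(new ArrayBuffer(4));
// The 32-bit value 0x0A07C868 (the address 10.7.200.104 as an integer):
view.setUint32(0, 0x0A07C868, false);  // big-endian bytes:    0A 07 C8 68
view.setUint32(0, 0x0A07C868, true);   // little-endian bytes: 68 C8 07 0A
// A byte string has no such ambiguity: "hi" is the byte 0x68 followed by
// 0x69 on every machine, so there is nothing for ntoh()/hton() to fix.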
Need a suggestion on which datatype would give better performance if we set one of them as the primary key in DB2: BIGINT or DECIMAL(13,0)?
I suspect DECIMAL(13,0) will have issues once the key grows very large, but I wanted a better answer/understanding of this.
Thanks.
Decimal does not have issues. The only thing is that DB2 has to do more operations to process the data once it is read: DB2 reads the value and then has to locate the decimal part (the precision), even if it is 0.
On the other hand, DB2 can read a BIGINT without any further processing; the number is in the bufferpool as-is.
If most of your values will really use all 13 positions, DECIMAL may be better, because you are not storing unused bytes (although decimals do carry extra bytes for the precision). Used this way, decimal optimizes storage, which translates into less I/O and better performance. However, it also depends on the other columns of your table; you have to test which option gives you better performance.
When using compression, more CPU cycles are needed to recover the information. You have to test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on the platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). Can't speak to other deployments, unfortunately. (The sketch after this list shows where the 7-versus-13 figure comes from.)
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
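As a rough sketch of that size difference (standard BCD arithmetic, not tied to any particular DB2 deployment):
// Packed decimal: two digits per byte, plus a half-byte for the sign.
function packedBytes(digits) { return Math.ceil((digits + 1) / 2); }
// Zoned decimal: one byte per digit (the sign shares the last byte's zone).
function zonedBytes(digits) { return digits; }
packedBytes(13);  // 7  -- the "7 bytes" case above
zonedBytes(13);   // 13 -- the "13 bytes" case above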
Note that, really, ids are just a sequence of bits - the fact that an id happens to be an integer (usually) is irrelevant. It's best to consider them random data: sequential assignment is an optimization detail, there are often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
There is a rule I once heard that when assigning storage sizes to char and varchar, instead of using the regular 4, 8, 16, 32 progression, you actually want to use 3, 7, 15, 31. Apparently it has something to do with optimizing the space in which the string is stored.
Does anyone know if there is any validity to this statement, or is there a better way of assigning sizes to char and varchar in PostgreSQL? Also, is this rule just for PostgreSQL, or is it something to keep in mind in all SQL databases?
You're mis-remembering something that applies at a much lower level.
Strings in the "C" language are terminated by a zero-byte. So: "hello" would traditionally take six bytes. Of course, that was back when everyone assumed a single character would fit neatly into a single byte. Not the case any more.
The other (main) way to store strings is to have a length stored at the front, and then have the characters following. As it happens that is what PostgreSQL does, and I believe it even has an optimisation so the length doesn't take up so much space with short strings.
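As an illustration of the two layouts (a JavaScript sketch of the raw bytes, not PostgreSQL's actual on-disk format):
// "C"-style: the bytes of "hello" followed by a terminating zero-byte.
var cString = [0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x00];        // 6 bytes
// Length-prefixed: a length header first, then the bytes themselves.
var lengthPrefixed = [0x05, 0x68, 0x65, 0x6C, 0x6C, 0x6F]; // also 6 bytes
// With a length prefix, finding the length is O(1); with a terminator
// you must scan for the zero-byte.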
There are also separate issues where memory access is cheaper/easier at 2/4/8 byte boundaries (depending on the age of the machine) and memory allocation can be more efficient in powers of 2 (1024, 2048, 4096 bytes).
For PostgreSQL (or any of the major scripting languages / Java) just worry about representing your data accurately. About 99% of the time fiddly low-level optimisation is irrelevant. Actually, even if you are writing in "C", don't worry about it there until you need to.
I am just beginning to learn the concept of Direct mapped and Set Associative Caches.
I have some very elementary doubts. Here goes.
Supposing addresses are 32 bits long and I have a 32 KB cache with a 64-byte block size and 512 frames, how much data is actually stored inside a "block"? If I have an instruction which loads a value from a memory location, and that value is a 16-bit integer, does one of the 64-byte blocks now store only a 16-bit (2-byte) integer value? What of the other 62 bytes within the block? If I now have another load instruction which also loads a 16-bit integer value, this value goes into another block of another frame depending on the load address (and if the address maps to the same frame as the previous instruction, then the previous value is evicted, and the block again stores only 2 bytes out of 64). Correct?
Please forgive me if this seems like a very stupid doubt; it's just that I want to get my concepts right.
I typed up this email for someone to explain caches, but I think you might find it useful as well.
You have 32-bit addresses that can refer to bytes in RAM.
You want to be able to cache the data that you access, to use them later.
Let's say you want a 1-MiB (2^20 bytes) cache.
What do you do?
You have 2 restrictions you need to meet:
Caching should be as uniform as possible across all addresses. i.e. you don't want to bias toward any particular kind of address.
How do you do this? Use remainder! With mod, you can evenly distribute any integer over whatever range you want.
You want to minimize bookkeeping costs. That means e.g. if you're caching in blocks of 1 byte, you don't want to store 4 bytes of metadata just to keep track of where 1 byte belongs.
How do you do that? You store blocks that are bigger than just 1 byte.
Let's say you choose 16-byte (2^4-byte) blocks. That means you can cache 2^20 / 2^4 = 2^16 = 65,536 blocks of data.
You now have a few options:
You can design the cache so that data from any memory block could be stored in any of the cache blocks. This would be called a fully-associative cache.
The benefit is that it's the "fairest" kind of cache: all blocks are treated completely equally.
The tradeoff is speed: To find where to put the memory block, you have to search every cache block for a free space. This is really slow.
You can design the cache so that data from any memory block could only be stored in a single cache block. This would be called a direct-mapped cache.
The benefit is that it's the fastest kind of cache: you do only 1 check to see if the item is in the cache or not.
The tradeoff is that, now, if you happen to have a bad memory access pattern, you can have 2 blocks kicking each other out successively, with unused blocks still remaining in the cache.
You can do a mixture of both: map each memory block to any one of a small set of cache blocks. This is what real processors do -- they have N-way set associative caches.
Direct-mapped cache:
Now you have 65,536 blocks of data, each block being of 16 bytes.
You store it as 65,536 "rows" inside your cache, with each "row" consisting of the data itself, along with the metadata (regarding where the block belongs, whether it's valid, whether it's been written to, etc.).
Question:
How does each block in memory get mapped to each block in the cache?
Answer:
Well, you're using a direct-mapped cache, using mod. That means addresses 0 to 15 will be mapped to block 0 in the cache; 16 to 31 get mapped to block 1, etc... and it wraps around as you reach the 1-MiB mark.
So, given memory address M, how do you find the row number N? Easy: N = (M mod 2^20) / 2^4, using integer division.
But that only tells you where to store the data, not how to retrieve it. Once you've stored it, and try to access it again, you have to know which 1-MiB portion of memory was stored here, right?
So that's one piece of metadata: the tag bits. If it's in row N, all you need to know is what the quotient was during the division by 2^20. Which, for a 32-bit address, is 12 bits big (since the remainder takes 20 bits).
So your tag becomes 12 bits long -- specifically, the topmost 12 bits of any memory address.
And you already knew that the lowermost 4 bits are used for the offset within a block (since memory is byte-addressed, and a block is 16 bytes).
That leaves 16 bits for the "index" bits of a memory address, which can be used to find which row the address belongs to. (It's just a division + remainder operation, but in binary.)
You also need other bits: e.g. you need to know whether a block is in fact valid or not, because when the CPU is turned on, it contains invalid data. So you add 1 bit of metadata: the Valid bit.
There are other bits you'll learn about, used for optimization, synchronization, etc., but these are the basic ones. :)
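To make the arithmetic concrete, here is a sketch (JavaScript, with the field widths hard-coded for the 1-MiB direct-mapped cache described above: 16-byte blocks, 2^16 rows) that splits a 32-bit address into its three fields:
function decompose(addr) {
    var offset = addr & 0xF;            // low 4 bits: byte within the 16-byte block
    var index = (addr >>> 4) & 0xFFFF;  // next 16 bits: which of the 65,536 rows
    var tag = addr >>> 20;              // top 12 bits: which 1-MiB region of memory
    return {tag: tag, index: index, offset: offset};
}
// Two addresses that land in the same row but carry different tags,
// so they would evict each other in a direct-mapped cache:
decompose(0x00012340);  // {tag: 0x000, index: 0x1234, offset: 0x0}
decompose(0xABC12340);  // {tag: 0xABC, index: 0x1234, offset: 0x0}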
I'm assuming you know the basics of tag, index, and offset, but here's a short explanation as I learned it in my computer architecture class. Data is replaced in 64-byte blocks, so every time a new block is put into the cache it replaces all 64 bytes, regardless of whether you only need one byte. That's why, when addressing the cache, there is an offset that specifies the byte you want to get from the block. Take your example: if only a 16-bit integer is being loaded, the cache will find the block by the index, check the tag to make sure it's the right data, and then get the bytes according to the offset. Now if you load another 16-bit value, let's say with the same index but a different tag, it will replace the whole 64-byte block with the new block and get the info from the specified offset. (This assumes a direct-mapped cache.)
I hope this helps! If you need more info or this is still fuzzy let me know, I know a couple of good sites that do a good job of teaching this.