I tried to find this information in the official Docker docs, but had no success.
Which pieces of information does Docker take into account when calculating the hash of each commit/layer?
It's pretty obvious that the line in the Dockerfile is part of the hash and, of course, the parent commit hash. But is anything else taken into account when calculating this hash?
Concrete use case: let's suppose I have two devs on different machines, at different points in time (and, because of that, with different Docker daemons and different caches), running $ docker build ... against the same Dockerfile. The FROM ... directive will give them the same starting point, but will each subsequent operation produce the same hash? Is it deterministic?
Thanks @thaJeztah. The answer is in https://gist.github.com/aaronlehmann/b42a2eaf633fc949f93b#id-definitions-and-calculations
layer.DiffID: ID for an individual layer
Calculation: DiffID = SHA256hex(uncompressed layer tar data)
layer.ChainID: ID for a layer and its parents. This ID uniquely identifies a filesystem composed of a set of layers.
Calculation:
For bottom layer: ChainID(layer0) = DiffID(layer0)
For other layers: ChainID(layerN) = SHA256hex(ChainID(layerN-1) + " " + DiffID(layerN))
image.ID: ID for an image. Since the image configuration references the layers the image uses, this ID incorporates the filesystem data and the rest of the image configuration.
Calculation: SHA256hex(imageConfigJSON)
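For concreteness, here is a minimal Python sketch of the calculations quoted above; the file names are placeholders, and the real inputs are the uncompressed layer tarballs and the image configuration JSON:

import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# DiffID = SHA256hex(uncompressed layer tar data); "layer.tar" is a placeholder path
with open("layer.tar", "rb") as f:
    diff_id = sha256_hex(f.read())

# ChainID(layer0) = DiffID(layer0)
# ChainID(layerN) = SHA256hex(ChainID(layerN-1) + " " + DiffID(layerN))
def chain_ids(diff_ids):
    chain = [diff_ids[0]]
    for d in diff_ids[1:]:
        chain.append(sha256_hex((chain[-1] + " " + d).encode()))
    return chain

# image.ID = SHA256hex(imageConfigJSON); the config embeds the layer DiffIDs,
# so identical layers plus an identical config yield the same image ID
with open("image_config.json", "rb") as f:
    image_id = sha256_hex(f.read())

In other words, the IDs are purely content-addressed: identical layer bytes and an identical image configuration give identical IDs, while any difference in either changes them.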
My question is that I have a problem where I need to update data that has been denormalized (because it's in NoSQL), since a single update to one piece of data has to be propagated to all the other redundant copies.
For example, consider an e-commerce database where there is one table, "Products", which contains all the details about a product, let's say name, imageName, LogoImage.
Now in this case the LogoImage of various "Products" entries can be the same, and when I need to update the LogoImage I have to update every field that contains the given LogoImage, which seems like a very poor solution.
So is there any better way to do that?
P.S.: If we separate logos and Products into two different tables, then when I need to get 1000 products at a time I have to fetch the related logos by implementing a client-level join of sorts, which is also not a good solution.
You're suggesting using the database as your CDN and storing the binary image in it? That's not a great approach, in my opinion. You should be storing that image in an actual CDN like Amazon CloudFront, or a simple one like Amazon S3, or on your own webserver as a file. Whichever you choose, the point is that you should be referring to it by URI. In Aerospike you would store the metadata about that image, not the image itself.
Next, you can have two sets - prod for products and prodimg for product images. The various products store a list of IDs referring to the product image set. The product image set has metadata about each image as a separate record { uri, name, title, width, length, ... } . If anything changes about this image, you just update the one record with the metadata for that image in prodimg. No need to change anything about the products.
And you don't really need JOIN functionality in this case. Your application can get the prod record first, and use the bin (images) that has all the IDs of the images for the product (each referring to a key of a record in prodimg). You can then issue either a few get operations (reads) or a single batch-read for all of them if there are many. The latencies for Aerospike are such that this will return faster and scale better than an equivalent JOIN in an RDBMS. A batch-read is a multi-node, multi-core, multi-threaded operation. A cluster of 3 multi-core nodes has plenty of parallel computing power.
Again, if you "need 1000 products at a time" use batch-read. In the Java client that's an AerospikeClient.get() with a list of Key objects. In the Python client that's an aerospike.Client.get_many. Every Aerospike client has batch-read functionality.
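A rough sketch of that read pattern with the Python client; the namespace, set and bin names here are assumptions, not anything from your schema:

import aerospike

# connect to the cluster (host/port are placeholders)
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# 1. read the product record; assume its "images" bin holds a list of image IDs
_, _, product = client.get(("test", "prod", "product-123"))

# 2. batch-read all referenced image metadata records from the prodimg set
image_keys = [("test", "prodimg", image_id) for image_id in product["images"]]
records = client.get_many(image_keys)

# each record carries the image metadata: uri, name, title, width, length, ...
for key, meta, bins in records:
    if bins is not None:  # bins is None if a referenced image record is missing
        print(bins["uri"], bins.get("title"))

Updating a logo is then a single write to its prodimg record, and every product that references it picks up the change on the next read.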
Does anybody know if there is a way I can extract only the crossroad nodes contained in a .pbf file? Is this information (whether a node is a crossroad or not) included in this file format?
Another option to solve your issue would be to use the new Atlas project.
As part of loading .osm.pbf files into in-memory Atlas files, it takes care of doing way sectioning on roads:
Load your pbf file into an Atlas. You will then have an Atlas object that you could save to a file and re-use.
Use the Atlas APIs to access all the intersections
In the end, each Atlas Node that is connected to more than 4 Edges on a two-way road, or more than 2 Edges on a one-way road, would be a candidate, if I understand your question correctly.
I'm not aware of a ready-made solution for this task, but it should still be relatively easy to do.
For parsing the .pbf file, I recommend using an existing library like Osmosis or Osmium. That way, you only need to implement the actual semantics of your use case.
The nodes themselves don't have any special attributes that mark them as crossroads. So instead, you will have to look at the ways containing the nodes.
Some considerations when implementing this:
You need to check the way's tags to find out whether it's a road. The most relevant key for that is highway. The details depend on your specific use case – for example, you need to decide whether footways, forestry tracks, driveways, ... should count.
What matters is the number of connecting way segments at a node, not the number of ways. For example, a node that is part of two ways may be a crossroads (if at least one of the ways continues beyond that node), or may not (if both ways start/end at that node).
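To make that concrete, here is a rough sketch using the pyosmium library; the highway filter and the threshold of three segments are assumptions you would tune for your use case:

import osmium
from collections import defaultdict

# Count, for every node, how many road-way segments touch it.
class SegmentCounter(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.segments = defaultdict(int)

    def way(self, w):
        if w.tags.get("highway") is None:  # crude "is this a road?" filter
            return
        refs = [n.ref for n in w.nodes]
        for i, ref in enumerate(refs):
            # an interior node joins two segments of this way, an endpoint only one
            self.segments[ref] += 1 if i in (0, len(refs) - 1) else 2

counter = SegmentCounter()
counter.apply_file("map.osm.pbf")

# nodes where three or more segments meet are crossroad candidates
crossroads = [ref for ref, n in counter.segments.items() if n >= 3]
print(len(crossroads), "candidate crossroad nodes")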
I'm looking for a good tool to import map.osm into Postgres and then create routes that will be displayed by GeoServer. I need a route with some text information about its vertices (e.g. city, address, address number, and so on...)
I found this:
osm2pgrouting - Import OSM data into pgRouting Database
osm2postgis - Import OSM data to PostGIS
osm2po - tool to convert OSM data into a routable format
osm4routing - OpenStreetMap data parser to turn them into a nodes-edges adapted for routing applications
I do not have much experience with GIS, so which tool is best for me? I tried osm2pgrouting, but the resulting tables do not contain data about the vertices (only lat. and alt.). Thanks for answers.
UPDATE App Info:
I will have a web and an Android client where the user enters text values for the start and end node, and then, via GeoServer, gets a WMS with the vertices of the entered route, for example.
My result could be some edges and nodes like this:
sequence_num, edge_distance, and information about the edge vertices like osm_id, some text value, lat, alt, etc...
I think you have a lot of work to do before you get to a complete solution, but here are some pointers. I suggest you break down your project into smaller chunks and ask specific questions on any bits you might get stuck on.
First, you need to import your data. Then you'll need some pre-processing / cleaning. Then you need your routing queries and, finally, a way to use the outputs (with this last part determining to some extent the previous steps).
Import OSM data
As I described in an answer to your previous question here, you can use OGR2OGR to import OSM data to Postgis. You can use other programs, as you mention above, but I guess you'll get much the same results. I think the difference between the OGR2OGR tables and the osm2postgis ones is that some of the columns in the latter appear in the other_tags column. However, the data is still there, you just need slightly different queries.
Preparing data
I'm assuming you'll use pgrouting for the routing, but whatever you use, you'll need a network suitable for routing (in short, the edges have a start and end node, and the end nodes must connect with other start nodes). Pgrouting has tools to create what you need and validate it. E.g. you create integer columns source and target and the function pgr_createtopology will populate the columns for you.
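As an illustration, a hedged sketch of that step from Python with psycopg2; the table name ways, the geometry column the_geom and the id column gid are assumptions that depend on how you imported the data:

import psycopg2

conn = psycopg2.connect("dbname=routing user=postgres")  # placeholder connection string
cur = conn.cursor()

# add the integer columns pgRouting expects, then let it populate them
cur.execute("ALTER TABLE ways ADD COLUMN IF NOT EXISTS source integer;")
cur.execute("ALTER TABLE ways ADD COLUMN IF NOT EXISTS target integer;")
cur.execute("SELECT pgr_createTopology('ways', 0.00001, 'the_geom', 'gid');")

# optional sanity check of the resulting graph
cur.execute("SELECT pgr_analyzeGraph('ways', 0.00001, 'the_geom', 'gid');")
conn.commit()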
OGR2OGR gives you tables "lines", "points", "multipolygons", "multilinestrings". I suggest you read up on OSM to understand exactly what is in these tables, but, roughly speaking, the lines contain your roads and the multipolygons contain, amongst other things, buildings with e.g. addresses. The addresses are in a hstore column called "other_tags".
The lines do not contain addresses! (although they do contain street names). So, if you want to do address-to-address routing you need to do some preparation. You can skip this if you can live with the street names.
Create your network (e.g. if you're routing for cars, you'll want to throw out pedestrian routes and so on)
Extract the desired addresses (including coordinates)
Either snap the addresses to the nearest node, or otherwise relate the address to the nearest node
Pgrouting will return the edges in your route, so you need the above to relate back to your addresses.
Routing
Your app is going to send to your server (in an as-yet unspecified way) a pair of addresses or coordinates and you need postgis to return the route. With pgrouting, that's quite easy and there are plenty of examples out there, for example here. You will need to write queries that join the output to your address table to give you the desired output.
pgrouting creates a vertices table. You can get the nearest vertex with the following query:
select id from vertices_pgr
order by the_geom <-> st_setsrid(st_point(lon,lat),4326)
limit 1
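Once you have the two vertex ids, the route query itself might look roughly like this (again assuming a ways table with gid, source, target, a length column usable as cost, and a name column; adjust to whatever your import produced):

import psycopg2

conn = psycopg2.connect("dbname=routing user=postgres")  # placeholder connection string
cur = conn.cursor()

start_vid, end_vid = 1234, 5678  # placeholder ids from the nearest-vertex query above
cur.execute("""
    SELECT d.seq, d.edge, d.cost, w.name
    FROM pgr_dijkstra(
           'SELECT gid AS id, source, target, length AS cost FROM ways',
           %s, %s, directed := false) AS d
    JOIN ways AS w ON w.gid = d.edge
    ORDER BY d.seq;
""", (start_vid, end_vid))

for seq, edge, cost, name in cur.fetchall():
    print(seq, edge, cost, name)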
Using the output
Using WMS from GeoServer is unlikely to be a good choice - you won't have the information on individual edges without a lot of messing about. You might consider GeoJSON, which can be read by e.g. OpenLayers or Leaflet, or which you can manipulate in JavaScript. Postgres has lots of useful functions for working with JSON and GeoJSON.
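For example, a sketch of turning a set of route edges into a GeoJSON FeatureCollection directly in Postgres; the ways columns used here (gid, name, the_geom) are assumptions:

import json
import psycopg2

conn = psycopg2.connect("dbname=routing user=postgres")  # placeholder connection string
cur = conn.cursor()

# build one GeoJSON Feature per edge of a route (the gid list is a placeholder)
cur.execute("""
    SELECT json_build_object(
             'type', 'Feature',
             'geometry', ST_AsGeoJSON(the_geom)::json,
             'properties', json_build_object('gid', gid, 'name', name))
    FROM ways
    WHERE gid = ANY(%s);
""", ([101, 102, 103],))

features = [row[0] for row in cur.fetchall()]
print(json.dumps({"type": "FeatureCollection", "features": features}))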
Conclusion
That's quite a lot of work and probably new stuff if you have little GIS knowledge, and it, er, basically recreates what you'd get from Graphhopper! Are you sure that's not a better way to go?
If you do decide to go this (or similar) route break things down into manageable chunks! First, figure out exactly what you're trying to achieve, then work backwards from there. If you do decide to use OSM / pgrouting, then play with the data and pgrouting first so you understand how it works before trying address matching etc.
The tools you listed are only for producing data, but I think you actually need a routing engine.
Try Graphhopper: https://graphhopper.com
Using the Web API (more likely what you need), you don't need to import the data into your database. This is the easiest solution. You will not have control over the input OpenStreetMap data, but this is fine if you don't have special requirements.
Importing the data and implementing/integrating a routing engine directly in your application would be much more complicated.
I've read the article http://n00tc0d3r.blogspot.com/ about the idea of consistent hashing, but I'm confused about how the method works on multiple machines.
The basic process is:
Insert
Hash an input long URL into a single integer;
Locate a server on the ring and store the key -> longUrl mapping on that server;
Compute the shortened URL using base conversion (from base 10 to base 62) and return it to the user. (How does this step work? On a single machine there is an auto-incremented id from which to compute the shortened URL, but what value is used to compute the shortened URL on multiple machines? There is no auto-incremented id.)
Retrieve
Convert the shortened URL back to the key using base conversion (from base 62 to base 10);
Locate the server containing that key and return the longUrl. (And how can we locate the server containing the key?)
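Here is my rough understanding of the ring lookup as a sketch (the server names and the hash function are just illustrative):

import bisect
import hashlib

def h(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class Ring:
    # minimal consistent-hash ring: each server owns the arc up to its own hash
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def server_for(self, key: int) -> str:
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, key) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
key = h("https://example.com/some/long/url")  # step 1: hash the long URL
print(ring.server_for(key))                   # step 2: locate the server on the ring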
I don't see any clear answer on that page for how the author intended it. I think this is basically an exercise for the reader. Here are some ideas:
Implement it as described, with hash-table style collision resolution. That is, when creating the URL, if it already matches something, deal with that in some way. Rehashing or arithmetic transformation (eg, add 1) are both possibilities. This means, naively, a theoretical worst case of having to hit a server n times trying to find an available key.
There are a lot of ways to take that basic idea and smarten it, e.g., just search for another available key on the same server by rehashing iteratively until you find one that's on that server.
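A tiny sketch of that probing idea; is_taken stands in for whatever lookup your storage layer provides and is purely hypothetical:

import hashlib

def find_free_key(long_url: str, is_taken, max_attempts: int = 10) -> int:
    # hash the URL, then rehash with a salt until we land on an unused key
    key = int.from_bytes(hashlib.sha256(long_url.encode()).digest()[:6], "big")
    attempt = 0
    while is_taken(key):
        attempt += 1
        if attempt > max_attempts:
            raise RuntimeError("no free key found")
        salted = f"{long_url}#{attempt}".encode()
        key = int.from_bytes(hashlib.sha256(salted).digest()[:6], "big")
    return key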
Allow servers to talk to each other, and coordinate on the autoincrement id.
This is probably not a great solution, but it might work well in some situations: give each server (or set of servers) separate namespace, eg, the first 16 bits selects a server. On creation, randomly choose one. Then you just need to figure out how you want that namespace to map. The namespaces only really matter for who is allowed to create what IDs, so if you want to add nodes or rebalance later, it is no big deal.
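A toy sketch of that namespace idea (the 16/48-bit split and the alphabet are illustrative, not from the article): each server prepends its own id to a local counter and base-62 encodes the result, so decoding a short code immediately tells you which server owns it:

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def encode62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode62(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

COUNTER_BITS = 48  # upper 16 bits pick the server, the lower 48 are a local counter

def make_code(server_id: int, local_counter: int) -> str:
    return encode62((server_id << COUNTER_BITS) | local_counter)

def server_for(short_code: str) -> int:
    return decode62(short_code) >> COUNTER_BITS

code = make_code(3, 1000)     # server 3 issues its 1000th short URL
assert server_for(code) == 3  # any node can later tell which server owns it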
Let me know if you want more elaboration. I think there's a lot of ways that this one could go. It is annoying that the author didn't elaborate on this point; my experience with these sorts of algorithms is that collision resolution and similar problems tend to be at the very heart of a practical implementation of a distributed system.
I need to take binaries (images and PDFs) from one environment to another.
These binaries are referenced in a main document, mostly an HTML doc, by Title and Version No.
The problem is that we have versioning, so an HTML doc might read img src=(Logo1 + Version 2). The Title is good for me, but the Version is system-generated for the host system's use.
I need to take the HTML doc to another system. I can of course insert the associated Logo, but I don't want to just insert the image (or PDF) if it is already available in the destination system. Can I use a combination of Title + MD5 checksum to check whether the destination system already has the same content, possibly with a different Version No? I think the chances of a collision are minimal with this approach? We have MD5s stored in our document manager system.
The chances for collisions depend on the number of documents you have to store, but should be sufficiently low.
But this assumes nobody actually tries to create collisions. MD5 is considered broken, so if somebody could benefit from causing collisions on your end, he/she might be able to pull that off.
Therefore I'd recommend a more secure hash function. It shouldn't make much of a difference for your effort which one you use.
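For example, a sketch of hashing a file with SHA-256 in Python; switching the algorithm later is a one-word change:

import hashlib

def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    # hash the file in chunks so large PDFs/images don't have to fit in memory
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(file_digest("Logo1.png"))  # placeholder file name; compare against the stored digest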
See also this question and answer: What is the clash rate for md5?