192 days to build Europe tiles - openmaptiles

Hi and thanks for all the good work on OpenMapTiles.
I'm trying to build tiles for Europe, North America, maybe the whole world.
I'm using the ./quickstart script and it reports that it will take 30 days to build the tiles for North America and 192 days for Europe.
This is running on a c5d.18xlarge EC2 instance (70 CPUs, 180 GB RAM, SSD disks).
Am I missing something?
I'm currently trying to use a database outside of Docker (on localhost) to see if I can speed things up... but how are you doing this?

I'm using this:
https://github.com/mapbox/mbutil/blob/5e1ac74fdf7b0f85cfbbc245481e1d6b4d0f440d/patch
This is one of my scripts: it merges everything into /tmp and checks whether a tilelive-copy is still in progress on each file before touching it.
for i in *.mbtiles; do
  [ -f "$i" ] || break
  # skip the merge target itself
  if [[ $i != *"final.mbtiles"* ]]; then
    # only merge files that tilelive-copy is no longer writing to
    if ! lsof -c /tilelive-copy/ "$i" > /dev/null 2>&1; then
      out=$(/usr/local/bin/merge_mbtiles.sh "$i" /tmp/final.mbtiles)
      rc=$?
      echo "$out"
      (( rc )) && echo "merge failed $i" && exit 1
      echo "merge successful"
    fi
  fi
done
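For reference, a minimal sketch of what a merge script like merge_mbtiles.sh might do, assuming the plain mbtiles layout where tiles is a real table with a unique index on (zoom_level, tile_column, tile_row); deduplicated files, where tiles is a view over map and images, need the per-table SQL in the mbutil patch script linked above:
#!/bin/bash
# usage: merge_mbtiles.sh <source.mbtiles> <target.mbtiles>
SRC="$1"
DEST="$2"
sqlite3 "$DEST" <<SQL
ATTACH DATABASE '$SRC' AS src;
-- copy every tile, letting tiles from the source replace existing ones
INSERT OR REPLACE INTO tiles (zoom_level, tile_column, tile_row, tile_data)
  SELECT zoom_level, tile_column, tile_row, tile_data FROM src.tiles;
DETACH DATABASE src;
SQL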

I'm using openmaptiles too, and the speed slowed down extremely after the last updates (I still have to figure out which changes caused that).
The quickstart script is nice for trying things out and getting a feel for the process; in the end I started writing scripts to split and parallelise the work. Right now we are processing the whole world at zoom 0-14 (fast) and most of Europe at zoom 14-18 (that takes weeks).
Try the following:
* tune postgres (the defaults are bad for large databases); see the sketch below for starting values
* try to split the areas and parallelise the work.
You can see that the rendering process with tilelive-copy does not really use all cores. The whole process isn't very effective at using resources. After a couple of tries I figured out that starting multiple workers in parallel is faster (in the end) than supersizing your server with faster or more CPU cores.
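For the postgres tuning, a hedged starting point in postgresql.conf for a machine with that much RAM might look like the following; these are illustrative values only, so adjust them to your hardware, and note that synchronous_commit = off and autovacuum = off trade safety for import speed:
# postgresql.conf - illustrative values for a large-RAM import machine
shared_buffers = 32GB
effective_cache_size = 120GB
work_mem = 256MB
maintenance_work_mem = 4GB
max_wal_size = 10GB
checkpoint_completion_target = 0.9
synchronous_commit = off   # faster imports; possible data loss on crash
autovacuum = off           # only during the initial import, re-enable afterwards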
see also:
https://github.com/openmaptiles/openmaptiles/issues/462
https://github.com/mapbox/tilelive/issues/181

Related

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working, however there are ~35k files but only ~5k have any data in them.
Is there any way to copy only the files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (it defaults to 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
It may be advisable to set BOTO_CONFIG specifically for this copy: (a) to be intentional; (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming if there are any failures.
Recommend: try this on a small subset and confirm it works to your satisfaction.
While it's not possible to do this with gsutil alone, you can do it by filtering the file list and passing it to the cp command with the -I flag. On a Linux Compute Engine instance you can do this with the du and awk commands:
du -b * | awk '{if ($1 > 1000) print $2}' | gsutil -m cp -I gs://bucket2
The command gets the size in bytes of each file in the current directory on your instance with du -b * and copies only the files larger than 1000 bytes to bucket2; you can change that value to adjust it to your needs.
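An alternative sketch (not from the original answer) that lets find do the size filtering directly, assuming GNU find and that the files sit at the top level of the current directory:
# list regular files larger than 1000 bytes and feed them to gsutil via -I
find . -maxdepth 1 -type f -size +1000c | gsutil -m cp -I gs://bucket2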

Helm and Kubernetes: Is there a barrier equivalent for jobs?

Given 3 jobs (A,B,C) in 3 Helm charts, is it possible to run A and B jobs in parallel, then start job C as soon as both of them are finished? Think of a barrier, in which a bunch of stuff needs to be finished before moving on.
Even if I put the A and B charts as sub-charts of the C chart, all 3 are started in parallel.
I already have a workaround for this: an external check that waits for jobs A and B to finish, then starts C. Still, I would prefer a Helm-based solution, if one exists.
Kubernetes isn't a batch job framework/scheduler and does not fit your advanced batch framework requirements.
My recommendation would be to use a real batch framework like Luigi, which also supports scheduling Luigi jobs on Kubernetes.
Look here for an example of how to do this.
Indeed, Kubernetes seems to be quite basic when it comes to scheduling jobs.
We'll move to Luigi at some point in the future for advanced job scenarios.
For now, I wrote this small awk/bash-based workaround. Perhaps it could help others in a similar situation.
while true; do
  sleep 60
  # "done" means the SUCCESSFUL column contains only the header and the value 1
  is_done=$(kubectl get jobs 2>&1 | awk '{print $3}' | uniq | \
    awk 'BEGIN{ok=0; n=0}
         {n++;
          if (NR==1 && $1=="SUCCESSFUL") ok++;
          if (NR==2 && $1==1) ok++}
         END{if (n>0 && ok==n) print "1"; else print "0"}')
  if [ "${is_done}" = "1" ]; then
    echo "Finished all jobs"
    break
  else
    echo "One or more jobs are still running, checking again in 1 minute"
  fi
done
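As an aside (not part of the original workaround): if your kubectl is recent enough (1.11+), kubectl wait can express the same barrier without a polling loop. A sketch, assuming the two jobs are named job-a and job-b:
# block until both jobs report the Complete condition (or the timeout expires)
kubectl wait --for=condition=complete job/job-a job/job-b --timeout=30m
# ...then install the chart that runs job C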

OpenStack Swift client for the fastest synchronisation of a huge number of files?

I have a folder with a lot of files (~50k, 3 GB). I need to sync this folder recursively to a container in OpenStack Swift-like storage.
I have tried the Cyberduck CLI (duck), but it crashes on the huge file list during the prepare step.
I am trying to use the supload utility, but it is very slow :(
Can somebody recommend a better approach (ideally a better CLI) for this situation?
You should use the official python-swiftclient package and simply:
# load your openstack credentials
source openrc.sh
cd path_to_directory_you_want_to_sync
# upload all the files recursively keeping good paths
swift upload --changed your_container *
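If the upload itself is the slow part, swift upload can also push more objects in parallel; a small sketch, assuming a reasonably recent python-swiftclient (the default is 10 object threads, if I recall correctly):
swift upload --changed --object-threads 20 your_container *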
Swift does not support rsync-like synchronization, but I use this little script to delete from the container the files you deleted locally and to upload new files without asking Swift to compare every file:
#!/bin/bash
# usage: ./swift_sync.sh <container> <directory_to_sync>
cd "$2" || exit 1
# compare the local file list with the container listing
diff <(find * -type f -print | sort) <(swift list "$1" | sort) | while read -r x; do
  if [[ $x == \>* ]]; then
    # only in the container: delete it
    echo "Need to delete ${x:2}"
    swift delete "$1" "${x:2}"
  elif [[ $x == \<* ]]; then
    # only local: upload it
    echo "Need to upload ${x:2}"
    swift upload "$1" "${x:2}"
  fi
done
cd -
Use it with:
./swift_sync.sh your_container directory_to_sync
Try http://rclone.org/docs/
It has "sync" operation and "Bandwidth limit".
If this not solve your speed problem, the root cause is not on the clients.
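A minimal sketch of what that could look like, assuming rclone has already been configured with a Swift remote named swift (rclone config walks you through the Keystone credentials):
# mirror the local directory into the container; raise --transfers if the network allows it
rclone sync /path/to/directory_to_sync swift:your_container --transfers 16 --checkers 16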

Limit to number of files to cp in parallel

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer for one of them is complete, it says that it'll be copying the last file.
The result of this is that one file only starts to copy once one of the others has finished, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (referred to in jterrace's answer below); instead, I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
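If you prefer a persistent setting over the -o switch, the same values can be set in the boto config file itself; a sketch, assuming the usual [GSUtil] section (gsutil version -l prints the config path actually in use):
[GSUtil]
# 1 process x 25 threads starts all 25 uploads at once
parallel_process_count = 1
parallel_thread_count = 25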

memory exhausted: for large files using diff

I am trying to create a patch from two large folders (~7 GB).
Here is how I'm doing it:
$ diff -Naurbw . ../other-folder > file.patch
But, perhaps due to the file sizes, the patch is not getting created and diff gives an error:
diff: memory exhausted
I tried making more than 15 GB of space available, but the issue persists. Could someone help me out with the flags I should use?
Recently I came across this too when I needed to diff two large files (>5 GB each).
I tried to use diff with different options, but even --speed-large-files had no effect. Other methods, like splitting the files into smaller ones, using xdelta, or sorting the files as per this suggestion, didn't help either. I even got my hands on a very powerful VM (>72 GB RAM), but still got this memory exhausted error.
I finally got it to work by adding the following parameter to sysctl.conf (sudo vim /etc/sysctl.conf):
vm.overcommit_memory=1
vm.overcommit_memory has three values (0,1,2) and sets the kernel virtual memory accounting mode. From the proc(5) man page:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
To make sure that the parameter is indeed applied you can run
sudo sysctl -p
Don't forget to change this parameter back when you finish!
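If you'd rather not persist the change in sysctl.conf at all, the same setting can be applied temporarily and reverted afterwards (it also resets on reboot):
# allow overcommit just for this diff run, then restore the heuristic default
sudo sysctl -w vm.overcommit_memory=1
diff -Naurbw . ../other-folder > file.patch
sudo sysctl -w vm.overcommit_memory=0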
bsdiff is slow and requires a lot of memory; xdelta creates large deltas for large files.
Try HDiffPatch for large files: https://github.com/sisong/HDiffPatch
supports diffs between large binary files or directories;
runs on Windows, macOS, Linux, Android;
diff & patch both support running with limited memory;
Usage example:
Creating a patch: hdiffz -s-256 [-c-lzma2] old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Try sdiff. It comes preinstalled in some Linux distributions.
sdiff a.txt b.txt --output=c.txt
will compare the files side by side and write the merged result to c.txt.
This worked perfectly for me.