How to optimize building data into Nominatim? - openstreetmap

I am building a data file into the Nominatim database. Its size is currently over 5 GB. Is there a way for me to optimize the data build? Normally my build time is about 5-6 hours.
The command I execute is :
./utils/setup.php --osm-file ../../planet.osm --all 2>&1 | tee setup.log
I want to remove some unnecessary address-related data, such as the table of deleted nodes, ...

Related

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working, however there are ~35k files but only ~5k have any data in them.
Is there anyway to only copy files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (it defaults to 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
May be advisable to set BOTO_CONFIG specifically for this copy (a) to be intentional; (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming if there are any failures.
Recommend: try this on a small subset and confirm it works to your satisfaction.
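A minimal sketch of that approach, assuming a throwaway boto config (the path and the 5 KiB value are examples only; resumable_threshold lives in the [GSUtil] section of the boto config):
# hypothetical one-off boto config just for this copy
cat > /tmp/boto_resumable.cfg <<'EOF'
[GSUtil]
resumable_threshold = 5120
EOF
# point gsutil at it for this invocation only
BOTO_CONFIG=/tmp/boto_resumable.cfg gsutil -m cp -r ./my_dir gs://my-bucket/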
While it's not possible to do this with gsutil alone, you can do it by filtering the file names yourself and using the -I flag on the cp command to process them. If you're using a Linux Compute Engine instance, you can do it with the du and awk commands:
du * | awk '{if ($1 > 1000) print $2 }' | gsutil -m cp -I gs://bucket2
The command gets the sizes of the files inside the current directory on your Compute Engine instance with du * and only copies the files whose size exceeds 1000 of du's blocks (1 KiB each by default on GNU systems, i.e. roughly 1 MB) to bucket2; you can change that value to adjust it to your needs.
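If you want to filter on exact byte counts instead of du's default blocks, a variation of the same idea (GNU du only; the 5000-byte threshold is just an example):
du -b * | awk '{ if ($1 > 5000) print $2 }' | gsutil -m cp -I gs://bucket2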

Gatling: upload simulation.log to S3 bucket from simulation

I would like to upload the simulation.log file of a scenario to an S3 bucket once my simulation is finished.
I was thinking of adding the upload in the after block of my simulation.
I didn't find any example that does that. I only found scripts external to the simulation taking care of it.
Is there any reason why I shouldn't do it?
If not, how can I get the absolute path of the simulation.log?
Unfortunately, the 'after' block executes before the report is generated.
I would advise writing a script that runs after the test.
Go to the directory with the latest report:
cd target/gatling && cd "$(ls -td -- */ | head -n 1 | cut -d'/' -f1)"
Use the AWS CLI to upload to S3:
aws s3 cp simulation.log s3://my-bucket/
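Putting both steps into one post-test script, a sketch might look like this (the bucket name is a placeholder and the target/gatling path assumes the default Gatling output location):
#!/bin/sh
set -e
# change into the most recently modified report directory
cd target/gatling && cd "$(ls -td -- */ | head -n 1 | cut -d'/' -f1)"
# upload the log; requires AWS credentials to be configured
aws s3 cp simulation.log s3://my-bucket/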

Estimate/Print csv COPY status to postgresql table

I want to get an idea of how long it will take to copy a CSV to a PostgreSQL table. Is there a way to print the rows copied in a reasonable fashion, or is there another way to somehow display the progress of the copy?
Perhaps there is a verbose setting or I should use --echo or -qecho
I am using:
psql -U postgres -d nyc_data -h localhost -c "\COPY rides FROM nyc_data_rides.csv CSV"
In Postgres 14, it's now possible to query the status of an active COPY via the internal pg_stat_progress_copy view.
e.g. to watch progress in terms of both bytes and lines processed:
select * from pg_stat_progress_copy \watch 1
Refs:
https://www.postgresql.org/docs/14/progress-reporting.html#COPY-PROGRESS-REPORTING
https://www.depesz.com/2021/01/12/waiting-for-postgresql-14-report-progress-of-copy-commands/
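To get a rough percentage from that view while your \COPY runs, you could query it from a second session, e.g. (note that bytes_total may be 0 when the server doesn't know the input size, such as for COPY FROM STDIN, which is what psql's \COPY uses under the hood):
psql -U postgres -d nyc_data -h localhost -c \
  "SELECT relid::regclass AS table_name, tuples_processed, bytes_processed, bytes_total,
          round(100.0 * bytes_processed / nullif(bytes_total, 0), 1) AS pct
   FROM pg_stat_progress_copy;"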
There is no such thing unfortunately.
One idea would be to divide the input into chunks of 1000 or 10000 lines, which you then import one after the other. That wouldn't slow processing considerably, and you can quickly get an estimate of how long the whole import is going to take.
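A rough sketch of that idea with standard tools (file name, table, and chunk size are placeholders; this assumes the CSV has no header row):
# split the CSV into 10000-line chunks
split -l 10000 nyc_data_rides.csv chunk_
# import chunk by chunk; timing the first few gives an estimate for the rest
for f in chunk_*; do
  echo "importing $f"
  psql -U postgres -d nyc_data -h localhost -c "\COPY rides FROM $f CSV"
done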
Use the pv tool:
pv /tmp/some_table.csv | sudo -u postgres psql -d some_db -c "copy some_table from stdin delimiter ',' null '';"
and as a result, it will show
1.42GiB 0:11:42 [2.06MiB/s] [===================================================================================================================================================================>] 100%
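Applied to the command from the question, that might look like this (COPY ... FROM STDIN is used so the data flows through the pipe where pv can measure it; your role needs INSERT privileges on the table):
pv nyc_data_rides.csv | psql -U postgres -d nyc_data -h localhost -c "COPY rides FROM STDIN CSV"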
As Laurenz Albe said, there's no way to measure how much time remains to conclude the entire process. But one thing that I did today to get a good approximation was:
Start the System Monitor in my Linux
In this application there's a counter showing how much data has been uploaded since the application was started
Using the size of the file I was uploading, I made a good prediction of how much data was left to send to the server.

Limit to number of files to cp in parallel

I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints out to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer for one of them is complete, it says that it'll be copying the last file.
The result of this is that there is one file that only starts to copy when one of the others finishes copying, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (as per jterrace's answer above), so instead I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
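If you prefer not to edit .boto, the same two settings can also be overridden per invocation with -o; a sketch for the 25-file case, assuming these options live in the GSUtil section of the config (the answer above used the Boto: prefix instead):
gsutil -m \
  -o "GSUtil:parallel_process_count=1" \
  -o "GSUtil:parallel_thread_count=25" \
  cp -z json -R dir_with_4g_chunks gs://my_bucket/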

How to include MySQL database schema on GitHub?

Stackoverflow and MySQL-via-command-line n00b here, please be gentle! I've been looking around for answers to my question but could only find topics dealing with GitHubbing MySQL dumps (as in: data dumps) for collaboration or MySQL "version control" via GitHub, neither of which tells me what I want to know:
How does one include MySQL database schemas/information on tables with PHP projects on GitHub?
I want to share a PHP project on GitHub which relies on the existence of a MySQL database with certain tables. If someone wanted to copy/make use of this project, they would need to have these particular tables in place to make the script work (all tables but one are empty in the beginning and only get filled by the user over time, via the script; the non-empty table holds three values from the start). How does one go about this, what is common practice?
Would I just get a (complete) dump file of my own db/tables, then delete all the data parts (except for that one non-empty table), set all autoincrements to zero and then upload that .sql file to GitHub along with the rest of the project?
OR
Is it best/better practice to write a (PHP) script with which the (maybe not-so-experienced) user can create these tables without having to use mysqldump/command line magic?
If solution #1 is the way to go, would I include further instructions on how to use such a .sql file?
Sorry if my questions sound silly, but as I said above, I myself am new to using the command line for MySQL-related things and had only ever used phpMyAdmin until yesterday (when I created my very first dump file with mysqldump - yay!).
Common practice is to include an install script that creates the necessary tables, so solution #2 would be the way to go.
[edit] That script could ofc just replay a dump. ;)
You might also be interested in migrations: How to automate migration (schema and data) for PHP/MySQL application
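Whichever option you pick, a schema-only dump is easy to produce for the install script (or the user) to replay; a sketch with mysqldump, where the database and table names are placeholders and the second command appends the data of the one table that must start non-empty:
# structure of all tables, no data
mysqldump -u your_user -p --no-data your_database > install/schema.sql
# data only, for the single pre-filled table
mysqldump -u your_user -p --no-create-info your_database your_seed_table >> install/schema.sql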
If you also want to track database schema changes, you can use git hooks.
In the directory [your_project_dir]/.git/hooks, add or edit the pre-commit script:
#!/bin/sh -e
set -o errexit
# -- you can omit next line if not using version table
version=`git log --tags --no-walk --pretty="format:%d" | sed 1q | sed 's/[()]//g' | sed s/,[^,]*$// | sed 's ...... '`
BASEDIR=$(dirname "$0")
# -- set directory where the schema dump is placed
dumpfile=`realpath "$BASEDIR/../../install/database.sql"`
echo "Dumping database to file: $dumpfile"
# -- dump database schema
mysqldump -u[user] -p[password] --port=[port] [database-name] --protocol=TCP --no-data=true --skip-opt --skip-comments --routines | \
sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' > "$dumpfile"
# -- dump versions table and update core version according to last git tag
mysqldump -u[user] -p[password] --port=[port] [database-name] [versions-table-name] --protocol=TCP --no-data=false --skip-opt --skip-comments --no-create-info | \
sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' | \
sed -e "/INSERT INTO \`versions\` VALUES ('core'/c\\INSERT INTO \`versions\` VALUES ('core','$version');" >> "$dumpfile"
git add "$dumpfile"
# --- Finished
exit 0
Change [user], [password], [port], [database-name], [versions-table-name]
This script is executed automatically by git on each commit. When committing a tag, the new version is saved to the table dump under the tag name. If there are no database changes, nothing is committed. Make sure the script is executable :)
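For example, from the project root:
chmod +x .git/hooks/pre-commit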
Your install script can take the SQL queries from this dump, and developers can easily track database changes.