How to load PostgreSQL data into GeoMesa (with the Cassandra datastore)?

I tried to load PostgreSQL data into GeoMesa (with a Cassandra datastore) using the JDBC converter.
Loading from a shapefile works fine, so the Cassandra and GeoMesa setup is okay.
Next I tried to load data from PostgreSQL.
Command:
echo "SELECT year, geom, grondgebruik, crop_code, crop_name, fieldid, global_id, area, perimeter, geohash FROM v_gewaspercelen2018" | bin/geomesa-cassandra ingest -c catalog -P cassandraserver:9042 -k agrodatacube -f parcel -C geomesa.converters.parcel -u -p
The converter definition file geomesa.converters.parcel looks like this:
geomesa.converters.parcel = {
  type       = "jdbc"
  connection = "jdbc:postgresql://postgresserver:5432/agrodatacube"
  id-field   = "toString($5)"
  fields = [
    { name = "fieldid",      transform = "$5" }
    { name = "global_id",    transform = "$6" }
    { name = "year",         transform = "$0" }
    { name = "area",         transform = "$7" }
    { name = "perimeter",    transform = "$8" }
    { name = "grondgebruik", transform = "$2" }
    { name = "crop_code",    transform = "$3" }
    { name = "crop_name",    transform = "$4" }
    { name = "geohash",      transform = "$9" }
    { name = "geom",         transform = "$1" }
  ]
}
The GeoMesa output is:
INFO Schema 'parcel' exists
INFO Running ingestion in local mode
INFO Ingesting from stdin with 1 thread
[ ] 0% complete 0 ingested 0 failed in 00:00:01
ERROR Fatal error running local ingest worker on <stdin>
[ ] 0% complete 0 ingested 0 failed in 00:00:01
INFO Local ingestion complete in 00:00:01
INFO Ingested 0 features with no failures for file: <stdin>
WARN Some files caused errors, ingest counts may not be accurate
Does someone have a clue what is wrong here?

You can check the logs folder for more detailed errors. However, just at first glance, the JDBC converter follows standard result-set numbering, meaning the first field is $1 (not $0). In addition, you may need to transform your geometry with a transform function, e.g. geometry($2).

Thanks Emilio, both suggestions made sense!
I made the converter field numbering start at 1.
Inside the converter definition file I changed
{ name = "geom", transform = "$2" }
into
{ name = "geom", transform = "geometry($2)" }
The SQL SELECT command should be:
SELECT year, ST_AsText(geom), .... FROM v_gewaspercelen2018
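Putting both changes together, the converter definition ends up looking roughly like this (a sketch only: the field indexes simply shift by one to match the 1-based result-set numbering, the id-field moves with fieldid to $6, and geom uses the geometry() transform on the WKT produced by ST_AsText):
geomesa.converters.parcel = {
  type       = "jdbc"
  connection = "jdbc:postgresql://postgresserver:5432/agrodatacube"
  id-field   = "toString($6)"   # fieldid, shifted from $5 to $6
  fields = [
    { name = "fieldid",      transform = "$6" }
    { name = "global_id",    transform = "$7" }
    { name = "year",         transform = "$1" }
    { name = "area",         transform = "$8" }
    { name = "perimeter",    transform = "$9" }
    { name = "grondgebruik", transform = "$3" }
    { name = "crop_code",    transform = "$4" }
    { name = "crop_name",    transform = "$5" }
    { name = "geohash",      transform = "$10" }
    { name = "geom",         transform = "geometry($2)" }  # WKT from ST_AsText(geom)
  ]
}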
By the way, the username and password are part of the connection string (which is inside the file geomesa.converters.parcel):
connection =
"jdbc:postgresql://postgresserver:5432/agrodatacube?user=username&password=password"
So the -u and -p flags do not appear in the final command:
echo "SELECT year, ST_AsText(geom), grondgebruik, crop_code,
crop_name, fieldid, global_id, area, perimeter, geohash FROM
v_gewaspercelen2018" | bin/geomesa-cassandra ingest -c catalog -P
cassandraserver:9042 -k agrodatacube -f parcel -C
geomesa.converters.parcel
With these changes it works.
Thanks again!
Hugo

Related

Obtaining Azure VM scale set private IP addresses through Azure REST API

Can you get a list of Azure VM scale set instance private IP addresses through the Azure REST API?
It seems that Microsoft does not publish the VMSS IP configuration objects under the normal methods for retrieving a list of "ipConfigurations".
Here are some relevant API doc pages:
https://learn.microsoft.com/en-us/rest/api/compute/virtualmachinescalesets/listall
https://learn.microsoft.com/en-us/rest/api/compute/virtualmachinescalesets/get
https://learn.microsoft.com/en-us/rest/api/compute/virtualmachines/listall
In particular, this one only gives you the IP configuration of VMs, not VMSSes:
https://learn.microsoft.com/en-us/rest/api/virtualnetwork/networkinterfaces/listall
Here's how to get a list of private IP addresses for VMs and VMSS instances through Ruby:
require 'openssl'
require 'azure_mgmt_network'
require 'azure_mgmt_compute'
require 'awesome_print'

options = {
  tenant_id: '<tenant_id>',
  client_id: '<client_id>',
  client_secret: '<client_secret>',
  subscription_id: '<subscription_id>'
}

# Map each network interface to [interface name, first private IP address].
def net_interface_to_ip_mapping(client)
  network_interfaces = client.network_interfaces.list_all
  pairs = network_interfaces.collect { |ni| [ni.id.split('/').last, ni.ip_configurations.collect { |ip| ip.private_ipaddress }.flatten.compact[0]] }
  [network_interfaces, pairs]
end

# Build two lookups: IP configuration id -> attached VM name, and
# IP configuration id -> private IP address.
def net_interface_to_vm(ni)
  interface_vm_set = ni.collect { |prof| [prof.id, prof.virtual_machine, prof.ip_configurations.collect(&:id)] }
  ipconf_to_host = interface_vm_set.collect { |x| [x[2][0], x[1]&.id&.split('/')&.last] }.to_h
  conf_ip_map = ni.collect(&:ip_configurations).flatten.compact.collect { |ipconf| [ipconf&.id, ipconf&.private_ipaddress] }.to_h
  [ipconf_to_host, conf_ip_map]
end

puts "*** Network Interfaces"
puts

client = Azure::Network::Profiles::Latest::Mgmt::Client.new(options)
ni, pairs = net_interface_to_ip_mapping(client)
pairs.to_h.each do |name, ip|
  puts " #{name}: #{ip}"
end

puts
puts "*** Virtual Machines"
puts

ipconf_to_host, conf_ip_map = net_interface_to_vm(ni)
ipconf_to_host.each do |ipconf, host|
  ni_name = ipconf.split('/')[-3]
  puts " #{host || '# ' + ni_name} - #{conf_ip_map[ipconf]}"
end

puts
puts "*** Virtual Machine Scale Sets"
puts

# VMSS instance IP configurations don't show up in network_interfaces.list_all,
# so walk the virtual networks, expand subnets/ipConfigurations and filter on the
# '/virtualMachineScaleSets/' segment of the IP configuration id.
vns = client.virtual_networks.list_all
vns.each do |vn|
  resource_group = vn.id.split('/')[4]
  puts
  vn_details = client.virtual_networks.get(resource_group, vn.name, expand: 'subnets/ipConfigurations')
  ip_configs = vn_details&.subnets&.collect { |subnet| subnet&.ip_configurations&.collect { |ip| [ip&.id, ip&.name, ip&.private_ipaddress] } }.compact
  vmss_ipconf = ip_configs.collect { |subnet| subnet.select { |ipconf| ipconf[0].include?('/virtualMachineScaleSets/') } }
  vmss_ipconf.each do |subnet|
    subnet.each do |ipconf|
      vmss_name = ipconf[0].split('/')[8]
      vmss_instance = ipconf[0].split('/')[10]
      puts "#{vmss_name} ##{vmss_instance} - #{ipconf[2]}"
    end
  end
end
Looking at the Azure CLI, there is az vmss nic list, which returns all network interfaces in a virtual machine scale set. Looking at the results, each entry contains:
{
  "dnsSettings": {
    ...
  },
  "ipConfigurations": [
    {
      "privateIpAddress": "..."
    }
  ]
}
You can use the --query syntax to get all the private IPs:
az vmss nic list -g <resource_group> --vmss-name <vmss_name> --query "[].{ip:ipConfigurations[0].privateIpAddress}" -o tsv
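If you would rather do the same lookup programmatically, here is a rough Python sketch; the azure-identity / azure-mgmt-network packages and the list_virtual_machine_scale_set_network_interfaces call are assumptions based on the current Azure SDK for Python, not something taken from the answers above:
# Sketch only: pip install azure-identity azure-mgmt-network, and replace the
# placeholder subscription / resource group / scale set names with real values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), '<subscription_id>')

# Mirrors `az vmss nic list`: every NIC attached to the scale set.
nics = client.network_interfaces.list_virtual_machine_scale_set_network_interfaces(
    '<resource_group>', '<vmss_name>')

for nic in nics:
    for ip_config in nic.ip_configurations or []:
        # Same field the --query expression above extracts.
        print(nic.name, ip_config.private_ip_address)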
You can get VM hostnames that will resolve to IPs thanks to Azure DNS:
$ curl -H "Authorization: Bearer $JWT_TOKEN" -sf https://management.azure.com/subscriptions/${subscription_id}/resourceGroups/${resource_group}/providers/Microsoft.Compute/virtualMachineScaleSets/${scale_set}/virtualMachines?api-version=2018-10-01 | jq '.value[].properties.osProfile.computerName'
"influx-meta000000"
"influx-meta000001"
$ getent hosts influx-meta000001
10.120.10.7 influx-meta000001.l55qt5nuiezudgvyxzyvtbihmf.gx.internal.cloudapp.net

Mongodb Query to CSV dump (mlab hosted mongodb)

I am querying an already populated mlab MongoDB database, and I want to store the resulting documents in a single CSV file.
EDIT: output format of CSV file I hope to get:
uniqueid status date
191b117fcf5c 0 2017-03-01 15:26:28.217000
191b117fcf5c 1 2017-03-01 18:26:28.217000
The MongoDB document format is:
{
  "_id": {
    "$oid": "58b6bcc00bd666355805a3ee"
  },
  "sensordata": {
    "operation": "chgstatus",
    "user": {
      "status": "1",
      "uniqueid": "191b117fcf5c"
    }
  },
  "created_date": {
    "date": "2017-03-01 17:51:17.216000"
  }
}
Database name: mparking_sensor
Collection name: demo
The Python code to query it is as follows:
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 01 18:55:18 2017
@author: Being_Rohit
"""
import pymongo

uri = 'mongodb://#####:*****@ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database().demo
print db
results = db.find()
f = open("mytest.csv", "w")
for record in results:
    query1 = (record["sensordata"]["user"], record["created_date"])
    print query1
print "done"
client.close()
EDIT: output format of query1 I am getting is:
({u'status': u'0', u'uniqueid': u'191b117fcf5c'}, {u'date': u'2017-03-01 17:51:08.263000'})
Does someone know the correct way to dump this data to a .csv file (with pandas or any other means), or some other approach that would support further prediction-based analysis (such as linear regression) in the future?
Mongoexport will do the job for you. It can, uniquely among native MongoDB tools, export in CSV format, limited to a specific set of fields.
Your mongoexport command would be something like this:
mongoexport.exe \
--db mparking_sensor \
--collection demo \
--type=csv \
--fields sensordata.user.uniqueid,sensordata.user.status,created_date
That will export something like the following:
sensordata.user.uniqueid,sensordata.user.status,created_date
191b117fcf5c,0,2017-03-01T15:26:28.217000Z
191b117fcf5c,1,2017-03-01T18:26:28.217000Z
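Alternatively, since the question mentions pandas "or any other means", here is a minimal sketch that stays in Python and writes the three columns from the expected output using the standard csv module; it assumes the same pymongo connection and the field names from the sample document above (the file name mytest.csv is taken from the question):
# Sketch only: reuses the (redacted) connection and collection from the question.
import csv
import pymongo

uri = 'mongodb://<user>:<password>@ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database()

with open('mytest.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['uniqueid', 'status', 'date'])  # header row
    for record in db.demo.find():
        user = record['sensordata']['user']
        writer.writerow([user['uniqueid'], user['status'], record['created_date']['date']])

client.close()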
I was trying to export a collection to CSV using mlab's 'export collection' feature, but they make it harder than it needs to be. So I used https://studio3t.com and connected using the standard MongoDB URI.

MongoDB to BigQuery

What is the best way to export data from MongoDB hosted in mlab to Google BigQuery?
Initially I am trying to do a one-time load from MongoDB to BigQuery, and later on I am thinking of using Pub/Sub for a real-time data flow to BigQuery.
I need help with the first one-time load from MongoDB to BigQuery.
In my opinion, the best practice is building your own extractor. That can be done in the language of your choice, and you can extract to CSV or JSON.
But if you are looking for a fast way and your data is not huge and can fit on one server, then I recommend using mongoexport. Let's assume you have a simple document structure such as below:
{
  "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
  "statuses" : [
    {
      "status" : "dc9e5511-466c-4146-888a-574918cc2534",
      "score" : 53.24388894
    }
  ],
  "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
  { "name":"_id", "type": "STRING" },
  { "name":"stored_at", "type": "record", "fields": [
    { "name":"date", "type": "STRING" }
  ]},
  { "name":"statuses", "type": "record", "mode": "repeated", "fields": [
    { "name":"status", "type": "STRING" },
    { "name":"score", "type": "FLOAT" }
  ]}
]
EOF
Now the fun part starts :-) extracting the data as JSON from your MongoDB. Let's assume you have a cluster with the replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records, because I recommend you run a sample and load it to BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a field $date which will cause an error when loading into BigQuery. To fix that, we can use sed to replace it with a simple field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
Now you can compress it, upload it to Google Cloud Storage (GCS), and then load it into BigQuery using the following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything was okay, then go back, remove --limit 100000 from the mongoexport command, and re-run the above commands to load everything instead of the 100k sample.
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extract logic in JavaScript and execute it against your data, then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so I can load it more easily into BigQuery:
# Export logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
  var recToSkip = x * size;
  db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
    var row = record._id + "," + record.stored_at.toISOString();
    record.statuses.forEach(function (l) {
      print(row + "," + l.status + "," + l.score)
    });
  });
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Move all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I used a small trick: piping the output to split so it is broken into multiple files with the sample_ prefix. During the split it also gzips the output, so it is easier to load into GCS.
From a basic reading of MongoDB's documentation, it sounds like you can use mongoexport to dump your database as JSON. Once you've done that, refer to the BigQuery loading data topic for a description of how to create a table from JSON files after copying them to GCS.
You can read data from MongoDB and stream it to BigQuery. You can find an example in NodeJS here.
This is an extension of the linked example that prevents duplicated records (as long as they are still in the streaming buffer):
const { BigQuery } = require('@google-cloud/bigquery');
const bigqueryClient = new BigQuery();
...
const jsonData = // Array of documents from MongoDB

const inputRows = jsonData.map(row => ({
  insertId: row._id,
  json: row
}));

const insertOptions = {
  raw: true
};

await bigqueryClient
  .dataset(datasetId)
  .table(tableId)
  .insert(inputRows, insertOptions);

Logstash MongoDB connection issue

I am unable to push data to MongoDB using Logstash.
My config file looks like this:
input {
  file {
    type => "log"
    path => "d:\logs\*.txt"
  }
}
output {
  mongodb {
    database => "abhi1"
    collection => "plain"
    uri => "mongodb://127.0.0.1:27017"
  }
}
The command used for executing the configuration file is logstash -f ./conf/demo.conf
ERROR:
[2015-09-08T16:26:04.883000 #4528] DEBUG -- : MONGODB | COMMAND | namespace=admin.$cmd selector={:ismaster=>1} flags=[] limit=-1 skip=0 project=nil | runtime: 46.9999ms
Hoping to get a workaround soon. Thanks.

NixOS Error on Declarative User Create

New to NixOS, and trying out a basic setup, including adding a new user. I'm sure this is a simple fix; I just need to know what setting to put in. Pastebin details are here.
These are the partial contents of /etc/nixos/configuration.nix. I created my NixOS from this stock Vagrant box: https://github.com/zimbatm/nixbox.
{
  ...
  users = {
    extraGroups = [ { name = "vagrant"; } { name = "twashing"; } { name = "vboxsf"; } ];
    extraUsers = [
      {
        description = "Vagrant User";
        name = "vagrant";
        ...
      }
      {
        description = "Main User";
        name = "user1";
        group = "user1";
        extraGroups = [ "users" "vboxsf" "wheel" "networkmanager" ];
        home = "/home/user1";
        createHome = true;
        useDefaultShell = true;
      }
    ];
  };
  ...
}
And these are the errors when calling nixos-rebuild switch to rebuild the environment. user1 doesn't seem to get added properly, and I certainly can't su to it after the command is run. How do I declaratively create users and set their groups?
$ sudo nixos-rebuild switch
building Nix...
building the system configuration...
updating GRUB 2 menu...
stopping the following units: local-fs.target, network-interfaces.target, remote-fs.target
activating the configuration...
setting up /etc...
useradd: group 'networkmanager' does not exist
chpasswd: line 1: user 'user1' does not exist
chpasswd: error detected, changes ignored
id: user1: no such user
id: user1: no such user
/nix/store/1r443r7imrzl4mgc9rw1fmi9nz76j3bx-nixos-14.04.393.6593a98/activate: line 77: test: 0: unary operator expected
chown: missing operand after ‘/home/user1’
Try 'chown --help' for more information.
/nix/store/1r443r7imrzl4mgc9rw1fmi9nz76j3bx-nixos-14.04.393.6593a98/activate: line 78: test: 0: unary operator expected
chgrp: missing operand after ‘/home/user1’
Try 'chgrp --help' for more information.
starting the following units: default.target, getty.target, ip-up.target, local-fs.target, multi-user.target, network-interfaces.target, network.target, paths.target, remote-fs.target, slices.target, sockets.target, swap.target, timers.target
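Reading the errors above, useradd aborts because the group networkmanager does not exist, and every later step (chpasswd, chown, chgrp) fails as a consequence. As a hedged sketch only (not a confirmed fix), one direction in the same style as the configuration above is to declare the referenced groups before using them, or to drop "networkmanager" from the user's extraGroups:
users = {
  # Sketch: declare the groups that user1 references so useradd can find them.
  # (Alternatively, remove "networkmanager" from user1's extraGroups.)
  extraGroups = [
    { name = "vagrant"; }
    { name = "twashing"; }
    { name = "vboxsf"; }
    { name = "user1"; }          # group used via `group = "user1"`
    { name = "networkmanager"; } # group listed in user1's extraGroups
  ];
  ...
};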