Copying GeoJSON data from S3 to Redshift - amazon-redshift

I have spatial data in GeoJSON format that I want to copy from S3 into Redshift. Can you please help me create the table and copy the data into it? I would like to know the COPY command.

Redshift's COPY command currently supports ingestion of geometries from (hexadecimal) WKB/EWKB format only. We currently do not support ingestion from GeoJSON. https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html
Alternatively, you can ingest the data in WKT format as a VARCHAR(MAX) column and then convert it to GEOMETRY using the ST_GeomFromText() function. With this method, the WKT description of a geometry is limited to the 64 KB maximum VARCHAR size.
More info: https://docs.aws.amazon.com/redshift/latest/dg/spatial-limitations.html
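For example, here is a minimal sketch of the WKT route, assuming the GeoJSON geometries have already been converted to WKT outside Redshift; the table, column, S3 path, and IAM role names below are placeholders:
-- Stage the WKT as text first; values are limited to 64 KB each
CREATE TABLE spatial_stage (
    id       BIGINT,
    geom_wkt VARCHAR(MAX)
);

-- Pipe-delimited load avoids clashing with the commas inside WKT
COPY spatial_stage
FROM 's3://my-bucket/spatial-wkt/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '|';

-- Convert the staged WKT into a GEOMETRY column
CREATE TABLE spatial_data AS
SELECT id, ST_GeomFromText(geom_wkt) AS geom
FROM spatial_stage;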

Related

Data mismatch in AWS DMS Bulk load vs CDC

I have a PostGIS database (source) which I migrated with DMS to an S3 bucket (target) as Parquet files.
There is a column named point of data type geometry(Point, 4326) in the source, which is converted to a string after DMS and looks like this in the target:
- In bulk load: "point": "<Point srsName=\"EPSG:4326\"><coordinates>72.836903300000003,19.0823766</coordinates></Point>"
- In CDC: "point": "0101000020E610000051D77F4262855EC09591C4DCFFB54240"
I am able to get the coordinates back from the full-load string by parsing it, but during CDC the value is a hexadecimal string that I do not know how to decode back into coordinates.
Welcome to SO.
In PostgreSQL, in order to display geometries in a format other than WKB, you have to state it explicitly in your query. In your case, ST_AsGML:
SELECT ST_AsGML('0101000020E610000051D77F4262855EC09591C4DCFFB54240');
st_asgml
-------------------------------------------------------------------------------------------------------
<gml:Point srsName="EPSG:4326"><gml:coordinates>-122.0841223,37.4218708</gml:coordinates></gml:Point>
(1 row)
Or something like this if you wish to omit the namespace (as your example suggests):
SELECT ST_AsGML('0101000020E610000051D77F4262855EC09591C4DCFFB54240', 15, 0, '', '');
st_asgml
---------------------------------------------------------------------------------------
<Point srsName="EPSG:4326"><coordinates>-122.0841223,37.4218708</coordinates></Point>
(1 row)
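If what you need are the plain numeric coordinates rather than GML, here is a small sketch using ST_X/ST_Y on the same sample value (this works for point geometries):
SELECT ST_X(geom) AS lon, ST_Y(geom) AS lat
FROM (SELECT '0101000020E610000051D77F4262855EC09591C4DCFFB54240'::geometry AS geom) AS g;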
See also: Converting geometries in PostGIS

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a csv file to an Azure SQL db.
I have added a data flow where I have ensured the schema matches the SQL table. I have a specific field which contains numbers with leading zeros. The data type in the source projection is set to string. The field is mapped to the SQL sink, showing as string data type. The field in SQL has the nvarchar(50) data type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data is shown correctly in the data preview; however, for some reason it loses its formatting during the insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and was able to load the data as expected. Please see the repro details below.
Source file (CSV file):
Sink table (SQL table):
ADF:
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are strings.
Source data preview:
Connect the sink to the Azure SQL database to load the data into the destination table.
Data in Azure SQL database table.
Note: You can also add a derived column before the sink to convert the value to a string, since the sink data type is a string.
Thank you very much for your response.
As per your post, the data flow appears to be working correctly. I have finally discovered an issue with my transformation: I have an Azure Batch service which runs a Python script that does a basic transformation and saves the output to a CSV file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue: my existing Python script converted a 'float' column directly to string type. Upon conversion it retained one decimal place, and since all of my numbers are integers, they ended up with a trailing .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')

Can't use Data Explorer as a sink in Data Flow

I'm trying to do a Data Flow using ADL1 as the source and Data Explorer as the sink; I can create the source but when I select Dataset for Sink Type the only available options in the Dataset pulldown are my ADL1 Datasets. If I use Data Copy instead I can choose Data Explorer as a sink but this won't work as Data Copy won't allow null values into Data Explorer number data types. Any insight on how to fix this?
I figured out a workaround. First I use Data Copy to load the CSV file into a staging table where all columns are strings. Then I use Data Copy from the staging table to the production table with a KQL query that converts the strings to their destination data types.

How to flatten a Parquet Array datatype when using IBM Cloud SQL Query

I have to push parquet file data which I am reading from IBM Cloud SQL Query to Db2 on Cloud.
My parquet file has data in array format, and I want to push that to DB2 on Cloud too.
Is there any way to push that array data of parquet file to Db2 on Cloud?
Have you checked out this advice in the documentation?
https://cloud.ibm.com/docs/services/sql-query?topic=sql-query-overview#limitations
If a JSON, ORC, or Parquet object contains a nested or arrayed structure, a query with CSV output using a wildcard (for example, SELECT * from cos://...) returns an error such as "Invalid CSV data type used: struct." Use one of the following workarounds:
For a nested structure, use the FLATTEN table transformation function. Alternatively, you can specify the fully nested column names instead of the wildcard, for example, SELECT address.city, address.street, ... from cos://....
For an array, use the Spark SQL explode() function, for example, select explode(contact_names) from cos://....
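As a rough sketch of the array workaround (reusing the contact_names column from the docs example; the contact_id column and the COS path are hypothetical), the explode() call can be combined with ordinary columns before writing the result out for Db2:
SELECT contact_id, explode(contact_names) AS contact_name
FROM cos://us-geo/mybucket/contacts.parquet STORED AS PARQUET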

osm2pgsql data converting: lost columns

I've converted OSM data from *.bz2 format into a PostgreSQL database using osm2pgsql. But after the conversion I don't see columns such as lanes and maxspeed in the table planet_osm_roads.
Can someone explain where these columns are? Thanks.
Add the option -k when using osm2pgsql
osm2pgsql -d geodatabase -k planet.osm.bz2
-k|--hstore Add tags without column to an additional hstore (key/value) column to postgresql tables
Explanation: osm2pgsql normally imports the data into a static database schema. Tags without a corresponding column are ignored. By adding the option -k or --hstore, osm2pgsql will add a new hstore column tags to each table and store there all tags that have no dedicated column.
Depending on your needs, you can use -j instead, which makes osm2pgsql save ALL tags in the tags column, that is, including the tags that also have a database column.
-j|--hstore-all Add all tags to an additional hstore (key/value) column in postgresql tables
After the import, to extract all maxspeed tags from the database, you can use a query like this (for example):
SELECT osm_id, name, tags -> 'maxspeed' FROM planet_osm_roads;
where tags is the hstore column and -> is a hstore operator.
See the PostgreSQL documentation for more info about the hstore type and its operators: http://www.postgresql.org/docs/9.3/static/hstore.html
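A small additional sketch, assuming the default planet_osm_roads table and the hstore tags column from the import above: the hstore ? operator lets you keep only the rows that actually have a maxspeed tag:
SELECT osm_id, name, tags -> 'maxspeed' AS maxspeed
FROM planet_osm_roads
WHERE tags ? 'maxspeed';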
This would be better as a comment, but I don't have enough reputation to post one: instead of using .bz2, I strongly recommend using .pbf, the "Protocolbuffer Binary Format", because: "It is about half of the size of a gzipped planet and about 30% smaller than a bzipped planet. It is also about 5x faster to write than a gzipped planet and 6x faster to read than a gzipped planet. The format was designed to support future extensibility and flexibility." More info: http://wiki.openstreetmap.org/wiki/PBF_Format