Find the corrupted doc in DynamoDB - nosql

I am using dynamoose to scan a table, but one item (or more) seems to be corrupted. The scan fails with this error: Expected _modifiedAt to be of type number, instead found type object
The schema expects a Number, but somewhere in the table there is a document where that attribute is an object.
How do I find it? We have thousands of items, so I guess a simple search won't cut it.
Thanks

So, since I needed to do this as part of a schema migration anyway, I decided to simply export all the data as CSV (this tool is only needed if you have over 100 keys; below that you can export directly within DynamoDB) and then write a custom migration script which handles all the error handling etc.
Hope this might help someone ;)
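For anyone who would rather locate the offending items directly, here is a rough sketch using Python and boto3 (rather than dynamoose); the table name is made up, and it assumes _modifiedAt is supposed to be stored as a DynamoDB number:

```python
import boto3

# Low-level client so we see the raw DynamoDB attribute types instead of
# letting a schema (e.g. dynamoose) reject the item first.
client = boto3.client("dynamodb")

paginator = client.get_paginator("scan")
for page in paginator.paginate(TableName="my-table"):  # hypothetical table name
    for item in page["Items"]:
        attr = item.get("_modifiedAt")
        # A healthy value looks like {"N": "1617181920"}; anything else
        # (e.g. {"M": {...}} for a map/object) is a corrupted document.
        if attr is not None and "N" not in attr:
            print("corrupted item:", item)
```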

Related

Save simple information for a database within Postgres

I have a multi-tenant application which will use the silo model to save data (each tenant will get its own database).
Because tenant names could be redundant, my databases are named with GUIDs: MyApp_[GUID].
Now I want to save simple but necessary information for each database, such as the tenant name and 3 to 5 other values.
Is there a simple way to write and read this data?
The only way I can think of is to create a special table for this with only 1 row, but that seems a bit wasteful.
If you're looking for a simpler solution than a table per database (and having to deal with the awkward constraint that it must have exactly one row), you could:
- use a custom configuration parameter. You can change it with ALTER DATABASE. The downside is that you can only store strings, and that the setting might be overridden per session.
- use a COMMENT on the database. The downside is that you can only store a single string per database; the advantage is that it is automatically shown in many listings of databases, such as psql's \l+ command.
- add your own columns to the pg_database system table. You should not mess with that table, so it's a spectacularly bad idea even if you knew what you were doing, but in a relational model it's the closest to what you were asking for, so I mention it for completeness.
I don't really advocate any of these solutions; although they do what you were asking for, there's probably a better solution to your actual problem. It might be as simple as a table of databases, possibly with a foreign key to pg_database, in an extra database shared by all tenants.
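For reference, a minimal psycopg2 sketch of the first two options; the connection string, database name, setting name and values are made up, and the custom parameter needs a dotted name so Postgres treats it as a custom setting:

```python
import psycopg2

# hypothetical connection string and database name
conn = psycopg2.connect("dbname=myapp_2f6ad1e0 user=admin")
conn.autocommit = True  # run the DDL statements immediately
cur = conn.cursor()

# Option 1: a custom configuration parameter (strings only).
# Database-level settings are applied at connection time, so a *new* session
# will see the value; in the current session current_setting() may return NULL.
cur.execute("ALTER DATABASE myapp_2f6ad1e0 SET myapp.tenant_name = %s", ("Acme Corp",))
cur.execute("SELECT current_setting('myapp.tenant_name', true)")
print(cur.fetchone()[0])

# Option 2: a single COMMENT per database, visible in psql's \l+ listing.
cur.execute("COMMENT ON DATABASE myapp_2f6ad1e0 IS %s", ("Tenant: Acme Corp",))
cur.execute("""
    SELECT shobj_description(oid, 'pg_database')
    FROM pg_database
    WHERE datname = current_database()
""")
print(cur.fetchone()[0])
```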

Is there a way to insert data into an SQL table using Spark JDBC without inserting duplicates and without losing existing data?

I'm trying to write a Spark DataFrame into a PostgreSQL table using df.write.jdbc.
The problem is that I want to make sure not to lose the existing data already inside the table (using SaveMode.Append), while also avoiding inserting duplicates of data that was already loaded.
So, if I use SaveMode.Overwrite:
- the table gets dropped, losing all previous data.
If I use SaveMode.Append:
- the table doesn't get dropped, but the duplicate records get inserted;
- if I use this mode together with a primary key already defined in the DB (which would provide the unique constraint), it returns an error.
Is there some kind of option to solve this?
Thanks
What I did was to filter out existing records: that means an additional read to get the existing ids and a filter operation on the data to append, but it does the job for me.
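For anyone wanting a concrete version of that filter-then-append approach, here is a rough PySpark sketch; the connection details, table name and the "id" key column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:postgresql://dbhost:5432/mydb"  # hypothetical connection details
props = {"user": "writer", "password": "secret", "driver": "org.postgresql.Driver"}

# df is the DataFrame you want to persist (as in the question); dummy data here.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "payload"])

# Extra read: fetch only the keys that already exist in the target table.
existing_ids = spark.read.jdbc(url, "target_table", properties=props).select("id")

# Keep only rows whose id is not already present, then append them.
new_rows = df.join(existing_ids, on="id", how="left_anti")
new_rows.write.jdbc(url, "target_table", mode="append", properties=props)
```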
I think there's a more complex solution in this post:
https://medium.com/@thomaspt748/how-to-upsert-data-into-a-relational-database-using-apache-spark-part-1-python-version-b43b9761bbf2
Maybe this is late, but I just ran into this.

Manifold/PostGIS data manipulation and export

I'm currently working on a GIS database project using Manifold Ultimate.
I am able to import data from PostGIS via the database console, and edit the data as a table object within Manifold.
How do I 'commit' these changes back to PostGIS?
I am required to submit the exported database. What format is expected for a PostGIS export and how is the exporting done?
@mdsumner is correct. Linking the PostGIS data is the way to go.
If you have exported the complete table and edited records, it's not simple to replace the data present in PostGIS with a new export. The export will fail until you have deleted all the tables together with the indexes, triggers and sequences whose names are derived from the name of the exported drawing (with inconsistent handling of lower case). It is not enough to drop the table alone.
Note that with Manifold's linked storage model you have no client-side buffer of edited, added or deleted records that is written back when a transaction is committed. Every edit of every single column is written to PostGIS immediately.
Concerning your second question: that depends on the target system. Manifold exports GEOMETRY type geometries. Other PostGIS clients may only digest a single type (point, line or polygon). You can edit the type in "geometry_columns.type" as long as you have added only one type of object to the drawing.
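For that geometry_columns edit, a minimal sketch run from Python; it assumes an older PostGIS release where geometry_columns is a plain table (in PostGIS 2+ it is a view and the type comes from the column definition instead), and the connection details and drawing/table name are invented:

```python
import psycopg2

# hypothetical connection and drawing/table name
conn = psycopg2.connect("dbname=gisdb user=gis")
cur = conn.cursor()

# Only safe if the drawing really contains a single geometry type.
cur.execute(
    "UPDATE geometry_columns SET type = %s WHERE f_table_name = %s",
    ("POLYGON", "my_drawing"),
)
conn.commit()
```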
I think that if you imported the data it is no longer linked to the DB, and you would need to export it and replace what is in the DB. If you link the data instead, the edits you make are committed "live", as the data is not a copy but remains stored by the DB.
I'm not that familiar with this, but that's what the Database Console topic in help describes.

Suggest a Postgres tool to find the difference in both the schema and the data

Dear all,
Can anyone suggest a Postgres tool for Linux that can find the difference between two given databases?
I tried apgdiff 2.3, but it only gives the difference in terms of schema, not data,
and I need both!
Thanks in advance!
Comparing data is not easy, especially if your database is huge. I created a Python program that can dump a PostgreSQL data schema to a file that can then be compared with a third-party diff program: http://code.activestate.com/recipes/576557-dump-postgresql-db-schema-to-text/?in=user-186902
I think this program can be extended to dump all table data into separate CSV files, similar to those produced by the PostgreSQL COPY command. Remember to add the same ORDER BY to the SELECT ... queries. I have created a tool that reads SELECT statements from a file and saves the results in separate files. This way I can control which tables and fields I want to compare (not all fields can be used in ORDER BY, and not all are important to me). Such a configuration can easily be created using the "dump schema" utility.
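A minimal sketch of that idea (the file names, the list of SELECT statements and the connection details are all made up): read one SELECT per line and dump each result set to its own CSV file, so two databases can be compared with an ordinary diff tool.

```python
import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb user=report")  # hypothetical connection
cur = conn.cursor()

# selects.txt holds one "SELECT ... ORDER BY ..." statement per line; a stable
# ORDER BY is what makes the resulting CSV files diff-able between databases.
with open("selects.txt") as f:
    queries = [line.strip() for line in f if line.strip()]

for i, query in enumerate(queries):
    cur.execute(query)
    with open("table_%03d.csv" % i, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.fetchall())
```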
Check out DBSolo. It does both object and data compares and can create a sync script based on the results. It's free to try and $99 to buy. My guess is the 99 bucks will be money well spent to avoid trying to come up with your own software to do this.
Data Compare
http://www.dbsolo.com/help/datacomp.html
Object Compare
http://www.dbsolo.com/help/compare.html
apgdiff https://www.apgdiff.com/
It's an open-source solution. I have used it before for checking differences between dumps. Quite useful.
[EDIT]
It diffs the schema only.

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I'm using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I've attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don't really understand the UNION ALL at the end when I'm trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn't seem to work; I get a validation error:
The external columns for complent.... are out of sync with the data source columns... external column "Param_2" needs to be removed from the external columns.
(This error is repeated for the first two parameters as well. I never came across this when using the SQL connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I'm using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success; it may give you what you are looking for.
http://msdn.microsoft.com/en-us/library/ms141715.aspx
Regarding the "external columns out of sync" error: SSIS is case-sensitive. I have encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want; I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way (the set-based step is sketched below):
- Data source from the production DB to a temp table in the archive DB
- Run a set-based query between the temp table and the archive table
- Truncate the temp table
Note that the temp table is not actually a temp table, but a copy of the archive table's schema used to temporarily store data in.
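Here is a rough sketch of that set-based step, run from Python with psycopg2; the table and column names (staging_fact, archive_fact, business_key, amount) are invented for illustration, and on newer Postgres an INSERT ... ON CONFLICT could replace the update-then-insert pair:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl")  # hypothetical connection
cur = conn.cursor()

# 1) Update rows that already exist in the archive (fact) table.
cur.execute("""
    UPDATE archive_fact a
       SET amount = s.amount, updated_at = now()
      FROM staging_fact s
     WHERE a.business_key = s.business_key
""")

# 2) Insert rows that are not in the archive table yet.
cur.execute("""
    INSERT INTO archive_fact (business_key, amount, updated_at)
    SELECT s.business_key, s.amount, now()
      FROM staging_fact s
      LEFT JOIN archive_fact a ON a.business_key = s.business_key
     WHERE a.business_key IS NULL
""")

# 3) Empty the staging table for the next load.
cur.execute("TRUNCATE staging_fact")
conn.commit()
```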
Took a while, but I got there in the end.
In reply to "SSIS is by no means an enterprise class ETL product yet": what enterprise ETL solution would you suggest?