Fastest way to import CSV into Postgres? - postgresql

I am using Postgres and I want to import some data from a CSV into my database. However the data is "relational" so I can't just do a row by row import.
For example I have a Category table. A row can include multiple categories in the format Comedy;Crime;Drama so I need to find the correct category from Category so I can create a relation. Edit: the Category table is already pre-populated with unique values.
What is a fast way to do this? I expect to parse 60-80GB, but possibly more in the future, so I want something fast.
I tried to do this quickly with Node where I would stream the file, find and create the relations for each line. The pool could not handle this so I have to read, pause, process, resume, repeat.
I'm using a quad-core i7, so I also feel like I could easily speed up this process, since Node is single-threaded and only uses one core. How should I approach this?
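The lookup step described in the question can be sketched as follows. This is only an illustration, assuming the pre-populated Category table has already been read into an in-memory dict of name to id. The function name `category_links`, the `row_id` argument and the sample ids are hypothetical. It shows how one `Comedy;Crime;Drama` field expands into join-table rows that could then be bulk loaded in one pass.

```python
def category_links(row_id, raw_categories, category_ids):
    """Expand a 'Comedy;Crime;Drama' field into (row_id, category_id) pairs.

    category_ids is a dict preloaded from the already-populated Category table.
    """
    links = []
    for name in raw_categories.split(";"):
        name = name.strip()
        if not name:
            continue  # skip empty fragments such as a trailing ';'
        # A KeyError here means the CSV names a category missing from the table.
        links.append((row_id, category_ids[name]))
    return links


# Hypothetical ids standing in for the real Category table contents.
category_ids = {"Comedy": 1, "Crime": 2, "Drama": 3}
print(category_links(42, "Comedy;Crime;Drama", category_ids))
```

Resolving names against a dict avoids one database query per category per CSV line, which is likely what overwhelmed the connection pool.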

Related

How to replicate a Postgres DB with only a sample of the data

I'm attempting to mock a database for testing purposes. What I'd like to do is given a connection to an existing Postgres DB, retrieve the schema, limit the data pulled to 1000 rows from each table, and persist both of these components as a file which can later be imported into a local database.
pg_dump doesn't seem to fulfill my requirements, as there's no way to tell it to retrieve only a limited number of rows from each table; it's all or nothing.
COPY/\copy commands can help fill this gap; however, there doesn't seem to be a way to copy data from multiple tables into a single file. I'd rather avoid having to create a single file per table. Is there a way to work around this?

When you create a free form of Microstrategy, is it possible to do an automatic mapping?

When you finish the freeform query in Microstrategy, the next step is to map the columns.
Is there any way to do it automatically? Or at least to generate the list of columns with their names?
Thanks!!!!
Sadly, this isn't possible. You will have to map all columns manually.
While this functionality isn't possible with freeform reporting specifically, Microstrategy Data Import lets you create Data Import Cubes. These cubes can be configured as live connections, meaning they execute against the selected data source every time they are used, and are not your typical snapshot cube. Data Imports from a database can be sourced from a database query. This effectively allows you to write your own SQL, with the end result being a report for which you did not have to specify columns manually.

An alternative design to insert/update of talend

I have a requirement in Talend wherein I have to update/insert rows from the source table into the destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results in the destination table.
I had designed this using the 'insert or update' action in tmap and tmysqloutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and update separately. In order to do this, I wanted to hash the source rows, as the number of rows would usually be small.
So, my question: I will hash the input rows, but when I join them with the destination rows in tmap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've seen jobs speed up by several orders of magnitude using only those components.
It is still a good idea to use an indexed hashed field in both the source and the target, which is done the same way when loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database could be reflected in your database immediately. Remember though that you might need to think about a buffer table ("staging") where you could store your changes so that you are able to ingest fast, process later. In this table only the changed rows and the change type (create, update, delete) would be present for you to process. This decouples loading and processing which can be helpful if there will be a problem with loading or processing later on.
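The "ingest fast, process later" idea above can be sketched in a few lines. This is a simplified, in-memory illustration rather than a Talend or trigger implementation: the staging table is modeled as a list of `(change_type, key, row)` tuples and the target table as a dict keyed by primary key. The name `apply_changes` and the tuple layout are assumptions made for the example.

```python
def apply_changes(target, staged_changes):
    """Replay staged change records, in order, against the target.

    target: dict mapping primary key -> row
    staged_changes: iterable of (change_type, key, row) tuples, where
    change_type is 'create', 'update' or 'delete'.
    """
    for change_type, key, row in staged_changes:
        if change_type in ("create", "update"):
            target[key] = row  # insert a new row or overwrite the old one
        elif change_type == "delete":
            target.pop(key, None)  # ignore deletes for rows already gone
        else:
            raise ValueError(f"unknown change type: {change_type}")
    return target


target = {1: {"name": "old"}}
staged = [
    ("update", 1, {"name": "new"}),
    ("create", 2, {"name": "added"}),
    ("delete", 1, None),
]
print(apply_changes(target, staged))
```

Because only changed rows are staged, the processing step touches far fewer rows than a full comparison of source and destination.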
Yes, I believe you should use the hash component for the destination table as well,
because then your processing (lookup) will be very fast, as it happens in memory.
If you don't, loading the lookup may take more time.
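The in-memory lookup described above can be illustrated outside Talend. The sketch below is not Talend code: it assumes rows are dicts sharing a key column named `id`, and the function name `split_insert_update` is hypothetical. It hashes the destination rows once, then classifies each source row as an insert or an update with a constant-time lookup.

```python
def split_insert_update(source_rows, destination_rows, key="id"):
    """Separate source rows into new rows and changed rows."""
    # Build the in-memory hash once: key value -> destination row.
    dest_by_key = {row[key]: row for row in destination_rows}
    inserts, updates = [], []
    for row in source_rows:
        existing = dest_by_key.get(row[key])
        if existing is None:
            inserts.append(row)  # key not present in the destination
        elif existing != row:
            updates.append(row)  # key present but some column differs
        # identical rows need no work at all
    return inserts, updates


source = [{"id": 1, "v": "a"}, {"id": 2, "v": "changed"}, {"id": 3, "v": "new"}]
destination = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
print(split_insert_update(source, destination))
```

The memory cost is proportional to the destination table, so this only pays off when that table fits comfortably in RAM.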

Filemaker Pro Advanced - Scripting import from ODBC with variable target table

I have several tables I'm importing from ODBC using the import script step. Currently, I have an import script for each and every table. This is becoming unwieldy as I now have nearly 200 different tables.
I know I can calculate the SQL statement to say something like "Select * from " & $TableName. However, I can't figure out how to set the target table without specifying it in the script. Please, tell me I'm being dense and there is a good way to do this!
Thanks in advance for your assistance,
Nicole Willson
Integrated Research
Unfortunately, the target table of an import has to be hard coded in FileMaker up through version 12 if you're using the Import Records script step. I can think of a workaround to this, but it's rather convoluted and if you're importing a large number of records, would probably significantly increase the time to import them.
The workaround would be to not use the Import Records script step, but to script the creation of records and the population of data into fields yourself.
First of all, the success of this would depend on how you're using ODBC. As far as I can think, it would only work if you're using ODBC to create shadow tables within FileMaker so that FileMaker can access the ODBC database via other script steps. I'm not an expert with the other ODBC facilities of FileMaker, so I don't know if this workaround would be helpful in other cases.
So, if you have a shadow table into the remote ODBC database, then you can use a script something like the following. The basic idea is to have two sets of layouts, one for the shadow tables that information is coming from and another for the FileMaker tables that the information needs to go to. Loop through this list, pulling information from the shadow table into variables (or something like the dictionary library I wrote which you can find at https://github.com/chivalry/filemaker-dictionary). Then go to the layout linked to the target table, create a record and populate the fields.
This isn't a novice technique, however. In addition to using variables and loops, you're also going to have to use FileMaker's design functions to determine the source and destination of each field and Set Field By Name to put the data in the right place. But as far as I can tell, it's the only way to dynamically target tables for importing data.

Most efficient way of bulk loading unnormalized dataset into PostgreSQL?

I have loaded a huge CSV dataset (Eclipse's Filtered Usage Data) using PostgreSQL's COPY, and it's taking a huge amount of space because it's not normalized: three of the TEXT columns would be much more efficiently refactored into separate tables, referenced from the main table with foreign key columns.
My question is: is it faster to refactor the database after loading all the data, or to create the intended tables with all the constraints, and then load the data? The former involves repeatedly scanning a huge table (close to 10^9 rows), while the latter would involve doing multiple queries per CSV row (e.g. has this action type been seen before? If not, add it to the actions table, get its ID, create a row in the main table with the correct action ID, etc.).
Right now each refactoring step is taking roughly a day or so, and the initial loading also takes about the same time.
In my experience, you want to get all the data you care about into a staging table in the database and go from there. After that, do as much set-based logic as you can, most likely via stored procedures. When you load into the staging table, don't have any indexes on it; create the indexes after the data is loaded.
Check this link out for some tips http://www.postgresql.org/docs/9.0/interactive/populate.html
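The staging-table approach above can be sketched end to end. To keep the example runnable without a server, it uses Python's built-in `sqlite3` module as a stand-in; against PostgreSQL the staging table would be filled with COPY instead, and the SQL would need minor dialect changes. All table and column names (`staging`, `actions`, `usage_events`, `user_id`, `action`) and the function name `load_and_normalize` are hypothetical.

```python
import sqlite3


def load_and_normalize(rows):
    """Load raw rows into a staging table, then normalize with set-based SQL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Staging table: raw rows as loaded, no indexes or constraints.
        CREATE TABLE staging (user_id INTEGER, action TEXT);
        CREATE TABLE actions (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE usage_events (
            user_id INTEGER,
            action_id INTEGER REFERENCES actions(id)
        );
    """)
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # One set-based pass fills the lookup table with each distinct value...
    conn.execute("INSERT INTO actions (name) SELECT DISTINCT action FROM staging")
    # ...and one join rewrites the text column as a foreign key.
    conn.execute("""
        INSERT INTO usage_events (user_id, action_id)
        SELECT s.user_id, a.id
        FROM staging s JOIN actions a ON a.name = s.action
    """)
    # Indexes are created only after the data is in place.
    conn.execute("CREATE INDEX usage_events_action_idx ON usage_events (action_id)")
    conn.execute("DROP TABLE staging")
    conn.commit()
    return conn


conn = load_and_normalize([(1, "opened"), (2, "closed"), (3, "opened")])
print(conn.execute("SELECT COUNT(*) FROM actions").fetchone()[0])
```

Each normalized column costs two scans of the staging data, with no per-row round trips of the "has this action type been seen before?" kind described in the question.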