Exporting a single large CSV from MySQL Workbench to the client machine without viewing it in the GUI? - mysql-workbench

After going through similar questions on Stack Overflow, I am unable to find a method to export a large CSV file from a query made in MySQL Workbench (v5.2).
The query returns about 4 million rows with 8 columns (roughly 300 MB when exported as a CSV file).
Currently I load all the rows (I have to see them in the GUI) and use the export option, which makes my machine crash most of the time.
My constraints are:
I am not looking for a solution via bash terminal.
I need to export it to the client machine and not the database server.
Is this a drawback of MySQL Workbench?
How do I export all the rows into a single file without viewing them in the GUI?
There is a similar question I found, but the answers don't meet the constraints I have:
"Exporting query results in MySQL Workbench beyond 1000 records"
Thanks.

In order to export to CSV you first have to load all that data, which is a lot to have in a GUI; many controls are simply not made to carry that much data. So your best bet is to avoid the GUI as much as possible.
One way could be to run your query with output to a text window (see the Query menu). This is not CSV, but it should at least work. You can then try to copy the text into a spreadsheet and convert it to CSV.
If that is too much work, try limiting your rows into ranges, say 1 million each, using the LIMIT clause in your query. Lower the size until you reach one that MySQL Workbench can handle. You will get n CSV files that you have to concatenate later. A small script or (depending on your OS) a system tool should be able to strip the headers and concatenate the files into one, as sketched below.
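For the concatenation step, a minimal Python sketch could look like the following; it assumes the chunks were saved with zero-padded names like chunk_01.csv, chunk_02.csv, and the file names are placeholders, not anything MySQL Workbench produces.

    # Minimal sketch: merge chunked CSV exports into one file, keeping the
    # header row from the first chunk only. File names are placeholders.
    import glob

    chunks = sorted(glob.glob("chunk_*.csv"))   # zero-padded names keep the sort order correct

    with open("full_export.csv", "w", encoding="utf-8") as out:
        for i, path in enumerate(chunks):
            with open(path, encoding="utf-8") as part:
                header = part.readline()
                if i == 0:
                    out.write(header)           # keep only the first header
                out.writelines(part)            # copy the remaining data rows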

Related

Talend Open Studio Big Data - Iterate and load multiple files in DB

I am new to Talend and need guidance on the scenario below:
We have a set of 10 JSON files with different structures/schemas that need to be loaded into 10 different tables in a Redshift database.
Is there a way we can write a generic script/job that iterates through each file and loads it into the database?
For e.g.:
File Name: abc_< date >.json
Table Name: t_abc
File Name: xyz< date >.json
Table Name: t_xyz
and so on..
Thanks in advance
With the Talend Enterprise version you can benefit from dynamic schemas. However, in my experience JSON files are usually somewhat nested structures, so you'd have to figure out how to flatten them first; once that's done it becomes a 1:1 load. With Open Studio this will not work, because the dynamic schema feature is missing.
Basically what you could do is write some Java code that transforms your JSON into CSV. Then use either psql from the command line, or, if your Talend contains a new enough PostgreSQL JDBC driver, invoke the client-side \COPY from it to load the data. If your file and the database table column order match, it should work without needing to specify how many columns you have, so it's dynamic, but the data never "flows" through Talend.
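As a rough illustration of that approach (sketched in Python rather than Java purely for brevity), something like the following flattens one level of nesting and hands the resulting CSV to psql's client-side \copy, as the answer suggests. The file name, table name and connection settings are placeholders, and it assumes newline-delimited JSON where every record has the same fields.

    # Hedged sketch of the flatten-then-\copy idea; paths, table name and
    # connection settings are placeholders.
    import csv
    import json
    import subprocess

    def flatten(obj, prefix=""):
        """Flatten a nested JSON object into a flat dict with dotted keys."""
        flat = {}
        for key, value in obj.items():
            name = prefix + key
            if isinstance(value, dict):
                flat.update(flatten(value, name + "."))
            else:
                flat[name] = value
        return flat

    with open("abc_20190101.json", encoding="utf-8") as src:   # placeholder file name
        rows = [flatten(json.loads(line)) for line in src if line.strip()]

    with open("abc.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Client-side load through psql, per the answer above.
    subprocess.run(
        ["psql", "-h", "myhost", "-d", "mydb",
         "-c", r"\copy t_abc FROM 'abc.csv' WITH CSV HEADER"],
        check=True,
    )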
A less elegant but theoretically possible solution: if Redshift supports JSON functions (Postgres does), you can create a staging table with two columns, filename and content. Once the whole content is in this staging table, an INSERT ... SELECT statement can transform the JSON into tabular format and insert it into the final table.
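A hedged sketch of what that INSERT ... SELECT could look like, using Redshift's json_extract_path_text; the staging table, target table, column names and JSON paths are all invented for illustration.

    # Sketch only: table and column names are invented. json_extract_path_text
    # is available in Redshift (PostgreSQL has an equivalent that expects a
    # json-typed argument, so a cast may be needed there).
    import psycopg2

    sql = """
        INSERT INTO t_abc (customer_id, amount)
        SELECT json_extract_path_text(content, 'customer_id'),
               json_extract_path_text(content, 'amount')
        FROM json_staging
        WHERE filename LIKE 'abc_%'
    """

    with psycopg2.connect("dbname=mydb host=myhost user=me password=secret") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)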
However, with your toolset you probably have no other choice than to load these files with one job per file, and I'd suggest one dedicated job for each file. Each would look for its own files and be triggered/scheduled individually, or be part of a bigger job where you scan the folders and trigger the right job for the right file.

Using postgres to replace csv files (pandas to load data)

I have been saving files as .csv for over a year now and connecting those files to Tableau Desktop for visualization for some end-users (who use Tableau Reader to view the data).
I think I have settled on migrating to PostgreSQL, and I will be using the pandas to_sql method to fill it up.
I get 9 different files each day and I process each of them (I currently consolidate them into monthly files in .csv.bz2 format) by adding columns, calculations, replacing information, etc.
I create two massive CSV files out of those processed files using pd.concat and pd.merge, and Tableau is connected to these. The files are literally overwritten every day when new data is added, which is time consuming.
Is it okay to still do my file joins and concatenation with pandas and export the output data to postgres? This will be my first time using a real database and I am more comfortable with pandas compared to learning SQL syntax and creating views or tables. I just want to avoid overwriting the same csv files over and over (and some other csv problems I run into).
Don't worry too much about normalization. A properly normalized database will usually be more efficient and easier to handle than a non-normalized one. On the other hand, if you dump non-normalized CSV data into a database, your import functions will be a lot more complicated if you do a proper normalization. I would recommend taking one step at a time: start by just loading the processed CSV files into Postgres. I am pretty sure all processing after that will be a lot easier and quicker than doing it with CSV files (just make sure you set up the right indexes). When you get used to working with the database, you can start to do more processing there.
Just remember, one thing a database is really good at is picking out the subset of data you want to work on. Try as much as possible to avoid pulling huge amounts of data out of the database when you only intend to work on a subset of it.
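As a starting point, a minimal sketch of that first step might look like the following; the connection string, table name, file name and column name are placeholders, and it assumes SQLAlchemy plus a PostgreSQL driver such as psycopg2 are installed.

    # Minimal sketch, with placeholder connection details, table and columns.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/reports")

    # Load one processed file into the database instead of rewriting a huge CSV.
    daily = pd.read_csv("processed_2019-06.csv.bz2")
    daily.to_sql("daily_data", engine, if_exists="append", index=False)

    # Later, pull back only the subset you need rather than the whole table.
    june = pd.read_sql("SELECT * FROM daily_data WHERE report_date >= '2019-06-01'", engine)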

Tableau TDE or connect to files directly?

I have a personal license for Tableau. I am using it to connect to .csv and .xlsx files currently but am running into some issues.
1) The .csv files are massive (10+ gig)
2) The Excel files are starting to reach the 1 million row limit
3) I need to add certain columns to the .csv files sometimes (like unique ID and a few formulas) which means that I need to open sections of them in Excel, modify what I need to, then save a new file
Would it be better to create an extract for each of these files and then connect the Tableau workbook to the extract instead of the file? Currently I am connected directly to the files, extract data from there, and refresh every day.
I don't know about others, but I'm using exactly that guideline. I'll have some workbooks that simply serve to extract data from some data source (be it SQL, xlsx, csv, mdb, or any other), and all analysis will be performed in other workbooks, which connect only to TDEs.
The advantages are:
1) Whenever you need to update a data source, you only need to update it once (and replace the TDE file) and all your workbooks will be up to date. If you connect to the same data source and extract to different TDE files, you'll have to refresh all of those different TDE files (and worry about whether the extract in that specific workbook has been updated). And even if you extract to the same TDE (which doesn't make much sense), it can be confusing (am I connected to the TDE or to the file? Did the extract I made in the other workbook update this one too? Well, yes it did, but it can be confusing).
2) You don't have to worry about replacing a datasource, especially when it's a csv, xlsx or mdb file. You can keep many different versions of those files, and choose which one is the best one. For instance, I'll have table_v1.mdb, table_v2.mdb, ..., and a single table_v1.tde, which will be the extract of one of those mdb files. And I still have the previous versions in case I need them.
3) When you have a SQL connection, or anything that is not a file (csv, xlsx, mdb), extracts are very handy for basically the same reasons above, with (at least) one upside. You don't need to connect to a server every time you want to perform an analysis. That means you can do everything offline, and the person using Tableau doesn't need to have access to the SQL table (or any other source).
One good practice is always keeping a back-up when updating a tde (because, well, shit happens)
10 gig csv, wow. Yes, you should absolutely use a data extract, that would be much quicker. For that much data you could look at other connections such as MS Access or a SQL instance.
If your data have that many rows, I would try to set up a small MySQL instance on your local machine and keep the data there instead. You would be able to connect Tableau directly to the MySQL instance and would be able to easily edit the source data.
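If you go that way, a small sketch of what the initial load could look like is below, chunked so a 10+ GB CSV never has to fit in memory; the database, table, file name and credentials are placeholders, and it assumes pandas with SQLAlchemy and a MySQL driver such as pymysql.

    # Sketch of a chunked CSV load into a local MySQL instance; names and
    # credentials are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost:3306/tableau_data")

    for chunk in pd.read_csv("big_source.csv", chunksize=500_000):
        chunk.to_sql("source_data", engine, if_exists="append", index=False)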

Can one upload a csv file to postgres via pgadmin, without specifying column names beforehand?

I'm trying to use the pgAdmin III Import tool and want to upload a .csv file. I don't know the column names or the number of columns beforehand, and would like to have them populated on the fly. I do know that the number of columns is consistent across rows.
In the sense of having a table dynamically created for you from the CSV, no, not with PgAdmin-III or psql.
You'll want to write a quick script for that with your preferred scripting language + its PostgreSQL driver interface, or use an ETL tool like CloverETL, Pentaho Kettle, or Talend Studio.
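As an illustration of the "quick script" option, a Python/psycopg2 sketch follows; it assumes the first CSV row holds the column names, creates every column as text, and uses placeholder file, table and connection names.

    # Sketch only: creates a table whose columns come from the CSV header
    # (all typed as text), then COPYs the data in. Names are placeholders.
    import csv
    import psycopg2

    TABLE = "imported_csv"

    with open("data.csv", newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))

    columns = ", ".join('"{}" text'.format(name.strip()) for name in header)

    with psycopg2.connect("dbname=mydb user=me") as conn:
        with conn.cursor() as cur:
            cur.execute('CREATE TABLE "{}" ({})'.format(TABLE, columns))
            with open("data.csv", newline="", encoding="utf-8") as f:
                cur.copy_expert('COPY "{}" FROM STDIN WITH (FORMAT csv, HEADER)'.format(TABLE), f)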

Extract Active Directory into SQL database using VBScript

I have written a VBScript to extract data from Active Directory into a record set. I'm now wondering what the most efficient way is to transfer the data into a SQL database.
I'm torn between:
Writing it to an Excel file and then firing an SSIS package to import it, or...
Within the VBScript, iterating through the dataset in memory and submitting 3000+ INSERT commands to the SQL database
Would the latter option result in 3000+ round trips communicating with the database and therefore be the slower of the two options?
Sending an insert row by row is always the slowest option. This is what is known as Row by Agonizing Row or RBAR. You should avoid that if possible and take advantage of set based operations.
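To make the contrast concrete (illustrative only; the original is VBScript/SSIS, and the table, columns and connection string here are invented), compare a per-row insert with a single batched call:

    # Illustration of the RBAR point, not the poster's actual code.
    import pyodbc

    INSERT = "INSERT INTO ad_users (sam_account, display_name) VALUES (?, ?)"

    def insert_rbar(cur, rows):
        # Slow: one round trip to the server per row.
        for sam_account, display_name in rows:
            cur.execute(INSERT, sam_account, display_name)

    def insert_batched(cur, rows):
        # Better: the driver sends the whole batch in one go.
        cur.fast_executemany = True
        cur.executemany(INSERT, rows)

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=staging;Trusted_Connection=yes"
    )
    rows = [("jdoe", "John Doe"), ("asmith", "Anna Smith")]  # 3000+ rows in practice
    insert_batched(conn.cursor(), rows)
    conn.commit()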
Your other option, writing to an intermediate file, is a good one; I agree with #Remou in the comments that you should probably pick CSV rather than Excel if you go that route.
I would propose a third option. You already have the design in VB contained in your VBScript. You should be able to convert this easily to a Script Component in SSIS. Create an SSIS package, add a Data Flow task, add a Script Component (as a data source {example here}) to the flow, write your fields out to the output buffer, and then add a SQL destination, saving yourself the step of writing to an intermediate file. This is also more secure, as your AD data never sits on disk in plaintext during the process.
You don't mention how often this will run or if you have to run it within a certain time window, so it isn't clear that performance is even an issue here. "Slow" doesn't mean anything by itself: a process that runs for 30 minutes can be perfectly acceptable if the time window is one hour.
Just write the simplest, most maintainable code you can to get the job done and go from there. If it runs in an acceptable amount of time then you're done. If it doesn't, then at least you have a clean, functioning solution that you can profile and optimize.
If you already have it in a dataset and you're on SQL Server 2008+, create a user-defined table type and send the whole dataset in as an atomic unit.
And if you go the SSIS route, I have a post covering Active Directory as an SSIS Data Source.