Working with huge csv files in Tableau - tableau-api

I have a large csv file (1,000 rows x 70,000 columns) that I want to create as a union of 2 smaller csv files (since these csv files will be updated in the future). In Tableau, working with such a large csv file results in very long processing times and sometimes causes Tableau to stop responding. I would like to know what the better ways of dealing with such large csv files are, e.g. splitting the data, converting the csv to another file type, connecting to a server, etc. Please let me know.

The first thing you should ensure is that you are accessing the file locally and not over a network. Sometimes the difference is minor, but in some cases it can cause a major slowdown in Tableau reading the file.
Beyond that, your file is pretty wide and should be normalized somewhat, so that you get more rows and fewer columns. Tableau will most likely read it in faster because it has fewer columns to analyze (data types, etc.).
If you don't know how to normalize the CSV file, you can use a tool like: http://www.convertcsv.com/pivot-csv.htm
Once you have the file normalized and connected in Tableau, you may want to extract it inside of Tableau for improved performance and file compression.

The problem isn't the size of the csv file: it is the structure. Almost anything trying to digest a csv will expect lots of rows but not many columns. Usually the columns define the types of data (e.g. customer number, transaction value, transaction count, date...) and the rows define instances of the data (all the values for an individual transaction).
Tableau can happily cope with hundreds (maybe even thousands) of columns and millions of rows (I've happily ingested 25-million-row CSVs).
Very wide tables usually emerge because you have a "pivoted" analysis, with one set of data categories along the columns and another along the rows. For effective analysis you need to undo the pivoting (or derive the data from its source unpivoted). Cycle through the complete table (you can even do this in Excel VBA, despite the number of columns, by reading the CSV directly line by line rather than opening the file) and convert the first row (which is probably column headings) into a new column, so that each new row contains one combination of original row label and column header plus the data value from the corresponding cell of the CSV file. The new table will be 3 columns wide but will hold all the data from the CSV (assuming the CSV was structured the way I assumed). If I've misunderstood the structure of the file, you have a much bigger problem than I thought!
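If you would rather script that unpivot than use Excel VBA or the web tool linked above, a minimal Python sketch of the same line-by-line approach could look like this. The file names are placeholders, and it assumes the first column holds the row labels and the first row holds the column headings:

```python
import csv

# Unpivot a wide CSV into a 3-column "long" CSV, streaming line by line
# so the wide file is never held in memory. File names are placeholders.
with open("wide.csv", newline="") as src, open("long.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    headers = next(reader)  # first row: row-label heading + the column headings
    writer.writerow(["row_label", "column_header", "value"])
    for row in reader:
        row_label, values = row[0], row[1:]
        for column_header, value in zip(headers[1:], values):
            writer.writerow([row_label, column_header, value])
```

The resulting long.csv (row label, column header, value) is the normalized shape Tableau handles well, and it can then be extracted as described above.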

Related

Apache Druid Appending Segment without dropping or summing it

I have three JSON files with the same timestamp but different values that I want to upload to Druid. I want to upload them separately with the same segment granularity. However, it drops the existing segment and uploads the new one.
I don't want to use appendToExisting: true because it sums the values of identical rows. That is a situation I don't want to happen (I may be adding the same file again in the future).
Is there a way to add new data to a specific segment without dropping or summing it?

Inputting data row by row from a large data set into a hypertable (Postgres)

I have a csv file with 5 columns of data and 5000+ rows. My task is to input the data ONE by ONE into a hypertable which I already created.
My question is: the COPY function copies the entire file into the hypertable. I could just sit and use INSERT statements to input the data one by one - however, this is very painful and very time consuming.
I'm not sure how the conditional loops work; the documentation available is a little hazy. I have experience in C and Python if that helps.
Any guidance is much appreciated!
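Since the question mentions Python experience, here is a minimal sketch of a row-by-row insert loop using the psycopg2 driver. The connection string, table name, and column names are placeholders, not taken from the question:

```python
import csv

import psycopg2  # assumed driver; any Postgres client library works the same way

# Placeholder connection details and table/column names.
conn = psycopg2.connect("dbname=mydb user=myuser")
cur = conn.cursor()

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if the file has one
    for row in reader:
        # One INSERT per CSV row; values are passed as parameters rather than
        # interpolated into the SQL string.
        cur.execute(
            "INSERT INTO my_hypertable (time, col2, col3, col4, col5) "
            "VALUES (%s, %s, %s, %s, %s)",
            row,
        )

conn.commit()
cur.close()
conn.close()
```

Committing once at the end keeps the loop reasonably quick; committing after every row would make 5,000+ inserts much slower.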

Bulk load causing schema drift in files (ADF pipeline or mapping data flows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like this: yyyy/mm/dd (e.g. 2021/08/19). At the dd level I have 24 files (one for each hour).
The columns in each day's files are pretty wide, anything up to 250 columns. The set of columns has increased and decreased over time, hence there is schema drift when trying to load into SQL using mapping data flows, which made the files larger.
Around 200 of those columns are ones I require, and I know what they are; I even have them in a schema template. The rest are legacy or unwanted.
I'd like to retain the original files in blob as they are, but load files with those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping data flows to pick up all files in blob, so I don't have to iterate with a foreach and spin up multiple clusters.
I'm not even sure how to handle this, or whether it should be a copy activity or a mapping data flow.
Both have their benefits, but I think I can only use a mapping data flow if I need to transform parts of these files in depth.
Should I be combining the months or even years into a single file and then read from that file, so I can exclude the additional columns from the ones I want to take into SQL Server?
Ideally this is a bulk load that needs some refinement when it lands.
Thanks in advance.
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the data flow Select transformation to choose the columns that you wish to pick from each file schema.
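If you ever need to do the same column pruning outside ADF (for example in a quick validation script), a minimal pyarrow sketch of the idea - read only a known column list from each file and ignore any drifted extras - might look like this. The paths and column names are placeholders:

```python
import glob

import pyarrow.parquet as pq

# Hypothetical subset of the ~200 columns from the schema template.
wanted = ["event_time", "device_id", "metric_a"]  # ...plus the rest of the template

for path in glob.glob("2021/**/*.parquet", recursive=True):
    present = pq.read_schema(path).names
    # Request only the wanted columns that actually exist in this file, so
    # files with more or fewer columns (schema drift) are handled the same way.
    table = pq.read_table(path, columns=[c for c in wanted if c in present])
    # ...hand `table` off to whatever loads it into SQL
```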

Grouping rows in .csv by data ID in MATLAB

I have a large .csv file which contains sensor data, which I would like to process in MATLAB. I cannot include the original data, but it is similar to this (with over 2000 rows):
There are other columns, but I am not interested in them.
What I would like to do is group all the data for the different sources into tables, so I would have a table for all the data from 1:12:aa, another for 5:C6:19, and so on.
I tried using the groupsummary function, but that just seems to give me a count of how many times each source ID is in the file. I will also be trying a loop to filter the data, but I imagine there is a faster way to do it in MATLAB, and I am hoping someone can point me in the right direction.

Manipulating large csv files with Matlab

I am trying to work with a large set of numerical data stored in a csv file. It is so big that I cannot store it in a single variable, as MATLAB does not have enough memory.
I was wondering if there is some way to manipulate large csv files in MATLAB as if they were variables, i.e. I want to sort the data, delete some rows, find the column and row of certain values, etc.
If that is not possible, what programming language do you recommend to do that, considering that the data is stored in a matrix form?
You can import the csv file into a database. E.g. sqlite - https://sqlite.org/cvstrac/wiki?p=ImportingFiles
Use one of the sqlite toolboxes for MATLAB, e.g. http://go-sqlite.osuv.de/doc/
You should be able to select single rows and columns through the SQL language and import them into MATLAB, or use SQL functions (for sorting -> ORDER BY, etc.).
Another option is to access csv files directly as if they were a SQL database, using q. See https://github.com/harelba/q
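To make the database approach concrete, here is a minimal sketch of the same workflow using Python's built-in sqlite3 module (file, table, and column names are placeholders; the number of columns must match your CSV). The MATLAB sqlite toolboxes linked above expose equivalent import and query calls:

```python
import csv
import sqlite3

# Placeholder database, table, and column names.
conn = sqlite3.connect("sensor.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS data (col1 REAL, col2 REAL, col3 REAL)")

# Load the CSV once; after this the file no longer has to fit in memory at all.
with open("big.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cur.executemany("INSERT INTO data VALUES (?, ?, ?)", reader)
conn.commit()

# Query slices instead of loading everything: sorted rows, filters, single columns...
rows = cur.execute(
    "SELECT col1, col2 FROM data WHERE col3 > 10 ORDER BY col1 LIMIT 100"
).fetchall()
conn.close()
```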