TSQL VARCHAR compression

Is there a Huffman or zip compression DLL written for T-SQL? I have searched and can't seem to find one. I want to store compressed data in one field and use a calculated field to display the uncompressed data.

There is no built-in function in T-SQL, but you can write your own DLL in C# or VB.NET as a SQLCLR assembly and register it with your SQL Server instance. You can then call the function from T-SQL like any other.
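For illustration, here is a minimal sketch of what such a SQLCLR function pair might look like in C#, using .NET's GZipStream rather than Huffman coding (the class and method names are invented):

using System.Data.SqlTypes;
using System.IO;
using System.IO.Compression;
using Microsoft.SqlServer.Server;

public static class CompressionFunctions
{
    // Gzip-compress a VARBINARY value.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlBytes GzipCompress(SqlBytes input)
    {
        if (input.IsNull) return SqlBytes.Null;
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(input.Value, 0, input.Value.Length);
            return new SqlBytes(output.ToArray());
        }
    }

    // Decompress; this is what the calculated field would call.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlBytes GzipDecompress(SqlBytes input)
    {
        if (input.IsNull) return SqlBytes.Null;
        using (var source = new MemoryStream(input.Value))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return new SqlBytes(output.ToArray());
        }
    }
}

After compiling, you would register the assembly with CREATE ASSEMBLY and expose the methods with CREATE FUNCTION; the computed column can then call dbo.GzipDecompress over the compressed field (depending on your SQL Server version it may need to be marked PERSISTED).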

Related

How to filter Wikidata dump for a language?

I've downloaded the Wikidata truthy dump in RDF format (.nt.bz2 file). I want to limit the language of the dump to English only and generate this new filtered dump as a new .nt file.
I've tried using parallel grep to filter lines containing the '@en' language tag, but it consumes a lot of processing time.
Is there some much faster way to generate filtered dumps? Something like using Spark?
It may be a bit late for you, but in the meantime a tool has been created for building custom dumps: https://tools.wmflabs.org/wdumps/
With this tool you can define a language filter online and then download an .nt file containing only the relevant triples.
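If you do want to roll your own with Spark instead, a minimal PySpark sketch along the lines you describe might look like this (file and output paths are illustrative; it simply keeps lines that carry an English language tag):

# Filter an N-Triples dump down to lines with English-tagged literals.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-en-filter").getOrCreate()
lines = spark.sparkContext.textFile("latest-truthy.nt")  # Spark can also read the .bz2 directly
english = lines.filter(lambda line: '"@en' in line)      # crude, but runs in parallel
english.saveAsTextFile("truthy-en")

Note that saveAsTextFile produces a directory of part files, which you would concatenate to get a single .nt file.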

Manipulating large csv files with Matlab

I am trying to work with a large set of numerical data stored in a CSV file. It is so big that I cannot store it in a single variable, as MATLAB does not have enough memory.
I was wondering if there is some way to manipulate large CSV files in MATLAB as if they were ordinary variables, i.e. I want to sort the data, delete some rows, find the row and column of certain values, etc.
If that is not possible, what programming language do you recommend for this, considering that the data is stored in matrix form?
You can import the CSV file into a database, e.g. SQLite: https://sqlite.org/cvstrac/wiki?p=ImportingFiles
Then take one of the SQLite toolboxes for MATLAB, e.g. http://go-sqlite.osuv.de/doc/
You should be able to select individual rows and columns via the SQL language and import them into MATLAB, or let SQLite do the work itself (for sorting, ORDER BY, etc.).
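For example, the import step in the sqlite3 shell could look like this (file, table and column names are made up):

-- in the sqlite3 shell: load the CSV into a table
.mode csv
.import bigdata.csv bigdata

-- then pull out only the slice you need, already sorted, instead of loading everything
SELECT col1, col2 FROM bigdata ORDER BY col3 LIMIT 1000;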
Another option is to query CSV files directly, as if they were a SQL database, with q. See https://github.com/harelba/q
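If I read q's documentation correctly, querying the CSV in place would look something like this (column names again made up):

q -d , -H "SELECT col1, col2 FROM ./bigdata.csv WHERE col3 > 5 ORDER BY col1"

Here -d sets the delimiter and -H tells q to take column names from the header row.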

FORTRAN: Best way to store large amount of data which is readable in MATLAB

I am working on developing an application in Fortran where I have points defining quadrilateral panels on the surface of an object. I am calculating various parameters on these quadrilateral panels for a number of frequencies.
The output file should look like:
FREQUENCY,PANEL_NUMBER,X1,Y1,Z1,X2,Y2,Z2,X3,Y3,Z3,X4,Y4,Z4,AREA,PRESSURE,....
0.01,1,....
0.01,2,....
0.01,3,....
...
0.01,2000,....
0.02,1,....
0.02,2,....
...
0.02,2000,...
...
I am expecting a maximum of 300,000 rows with 30 columns. Data types are composed of integer, real and complex numbers. I want to store this file and later read the file in MATLAB to create a 3D geometry which I will color based on pressure at each panel.
The problem is, as you can see from the file structure, there is a lot of data. I am currently writing this as a CSV file, and the size is about 26GB.
I do not want to use a database to handle this. Could anyone suggest which file format I should use to write this data from Fortran?
Thanks for your help,
Amitava
Store the data in the native format of the computer rather than in a human-readable file in which the numbers have been converted to base 10 and then to characters. This produces the smallest file and the fastest processing. On the Fortran open statement, use form='unformatted', access='stream'. The first makes the file unformatted; the second tells Fortran not to include its usual record-length information, which is Fortran specific. That omission makes the file more portable to other languages. Someone else can help better with how to read the file in MATLAB; I found this on the web: http://www.mathworks.com/help/matlab/import_export/importing-binary-data-with-low-level-i-o.html
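For illustration, a minimal writer along those lines might look like this (the 30-reals-per-record layout and file name are assumptions, not the poster's actual code):

! Sketch: stream one flat record of 4-byte reals per panel.
program write_panels
   implicit none
   integer :: u, ipanel
   real :: freq, row(28)

   open(newunit=u, file='panels.bin', form='unformatted', access='stream')
   do ipanel = 1, 2000
      freq = 0.01
      row = 0.0                    ! fill with the actual panel data here
      write(u) freq, real(ipanel), row
   end do
   close(u)
end program write_panels

On the MATLAB side, something like fid = fopen('panels.bin'); A = fread(fid, [30, Inf], 'single'); should then recover a 30-row matrix with one column per panel record, assuming everything was written as 4-byte reals.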
UPDATE: This approach has several assumptions. It might not work easily if you wish to move the file between different types of computers. Your question implies that you want many rows of identical layout, which maps naturally onto a file with that number of identical records. It seems that you want to read the entire file, in which case a sequential file is appropriate. If you wish to read records in random order, a Fortran direct-access file might be useful. With the simplicity of identical records, using a native file format is easy. If you want self-documentation or portability across computers (different numeric representations), a file format such as HDF or FITS would be useful.
I second @steabert's mention of NetCDF, and there's also HDF5 (on which the NetCDF-4 format is built). However, it does depend on what you mean by "data types": these formats are best used with regular/rigid data layouts, and NetCDF's support for Fortran derived types can be painful at times.
Possible advantages for large data sets include transparent compression, data checksumming, and arguably more natural random access (no need to compute seek positions from array indices) compared with Fortran stream access. That's on top of the usual benefits of a self-documenting and portable file format.
MATLAB has built-in support for reading these files, and recent versions also support the OPeNDAP framework, so you wouldn't even need the file to be on the same machine.
Of course, there are disadvantages: extra software, extra skills development (especially for HDF5), and increased code complexity on the Fortran side.
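To make that concrete, a hedged sketch using the netcdf-fortran 90 API might look like the following (dimension sizes and variable names are invented, and per-call error checking is omitted for brevity):

! Sketch: one compressed 2-D variable in a NetCDF-4 file.
program write_nc
   use netcdf
   implicit none
   integer :: ncid, dim_panel, dim_freq, varid, status
   real :: pressure(2000, 150)

   pressure = 0.0
   status = nf90_create('panels.nc', NF90_NETCDF4, ncid)
   status = nf90_def_dim(ncid, 'panel', 2000, dim_panel)
   status = nf90_def_dim(ncid, 'frequency', 150, dim_freq)
   status = nf90_def_var(ncid, 'pressure', NF90_FLOAT, (/ dim_panel, dim_freq /), varid)
   status = nf90_def_var_deflate(ncid, varid, shuffle=1, deflate=1, deflate_level=4)  ! transparent compression
   status = nf90_enddef(ncid)
   status = nf90_put_var(ncid, varid, pressure)
   status = nf90_close(ncid)
end program write_nc

MATLAB can then read the variable directly with ncread('panels.nc', 'pressure').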

How to handle large columnar data files in Octave WITH headers?

I have a .dat file that is space/tab delimited, with one line of headers. There are about 60 columns of data in this file; other files have different numbers of columns.
How can I read in the headers (as a vector, perhaps?) such that I can index into the appropriate column of the data-matrix without having to count my way manually to the correct column?
I seem to recall Matlab being able to create cell-arrays with headers as indexes. Is anything like that remotely possible in Octave?
So far, all I can get is the actual data according to this:
a = dlmread('Core.dat'," ",r0=1,c0=0);
Any and all help is much appreciated! Thanks!
I've been looking for an easy way to do this using just the standard packages too, but there doesn't seem to be one.
However, it does look like the dataframe package might let you do this sort of thing.
It does seem like something simpler should be built into the language, though.
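In the meantime, a workable stop-gap with just core Octave is to read the header line yourself and build a name-to-column map (file and column names here are made up):

% Read the header row, then index columns by name.
fid = fopen('Core.dat', 'r');
headers = regexp(strtrim(fgetl(fid)), '\s+', 'split');  % cell array of column names
fclose(fid);

data = dlmread('Core.dat', ' ', 1, 0);                  % numeric block, header skipped

col = containers.Map(headers, 1:numel(headers));
pressure = data(:, col('Pressure'));                    % look up a column by name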

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of the people who see this. I've been doing some Sawtooth survey programming at work, and I need to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about it. I only know very basic Perl, and the server I use does not have PHP, so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated than that. It's not a single file, and the data occasionally gets moved to a different file with a different format. Those added complexities are why I'm asking this question.
Doesn't Sawtooth export to CSV format? There are many Perl parsers for CSV files; just about every language has a CSV parser or two (or twelve). MS Excel can open them directly, and they're still plain text, so you can look at them in any text editor.
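For instance, with the CPAN module Text::CSV the parsing loop is only a few lines (the file name and the column pulled out are illustrative):

#!/usr/bin/perl
# Sketch: stream a CSV export and print one column per record.
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'export.csv' or die "export.csv: $!";
my $header = $csv->getline($fh);     # first row: column names
while (my $row = $csv->getline($fh)) {
    print "$row->[0]\n";             # e.g. the first column of each record
}
close $fh;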
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions, or by taking substrings of each line one at a time, assuming you know the widths of the fields. Your question is too general to give much better advice, sorry.
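As a sketch of the substring approach, Perl's unpack handles fixed-width records in one call (the field names and widths here are invented):

#!/usr/bin/perl
# Sketch: split each fixed-width line into named fields with unpack.
use strict;
use warnings;

open my $fh, '<', 'survey.dat' or die "survey.dat: $!";
while (my $line = <$fh>) {
    chomp $line;
    # e.g. 8-char respondent id, 3-char question code, 5-char answer
    my ($id, $question, $answer) = unpack 'A8 A3 A5', $line;
    print "$id\t$question\t$answer\n";
}
close $fh;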
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly, though, the easiest thing to do if you're posting statistical data on a website is to get the data into SPSS or SAS or another statistics-package format and post those files for download directly. Then you don't have to worry about it.