Cleaning a non-uniform address in excel - data-cleaning

If I have non-uniform data in one column like so
How could I separate it into its separate components using a macro? Is this a much more difficult problem than it looks? This is just a tiny sample; the real data may be hundreds of thousands of rows. The values could be really messed up or misspelled, and could contain strange characters such as carriage returns.

Related

Loading a CSV File taking forever

I am trying to load a 4.0 GB CSV file into Matlab. I have 40 GB of RAM. However, the table does not seem to finish loading. (Activity Monitor showed a fast increase in RAM use up to 38.64 GB, which then stopped; the CPU was still heavily in use.)
According to Apple's force-quit menu, Matlab has not gotten stuck. (I'd guess the absence of a "Matlab is not responding" message signals that.)
1st Question: Why does it even take up that much RAM? I've read that memory use can roughly double during import. Can I do something about this?
2nd Question: Can I speed this up, perhaps by splitting the CSV somehow?
3rd Question: Can I speed up my computer? It is taking forever, while using only 30% of CPU capacity... Why does it not use more? The vents are not crazy loud, so I guess "it's chilling".
Edit: It went up to 72.80 and is now decreasing...
Edit: Now back down at 55.something
There are a few concepts you should be aware of with Matlab.
Strings are stored as UINT16 (sort of, I can never get this right). Importantly what this means is that every character requires 2 bytes. If you stored the entire file as a long string it would take up 8 GB.
Values, whether they are arrays or scalars, are stored with headers. This means that storing a string (technically a character array; strings proper, the ones with double quotes instead of single quotes, may be different) requires a header of roughly 104 bytes, so something like 'test' takes roughly 108 bytes. If you can store an array of numbers, the 104-byte overhead is paid once and is minimal. If you have a cell array of scalars, however, each scalar takes up 112 bytes (assuming the scalar is an 8-byte double). This might be a bit confusing, but the upshot is that if you're not careful when reading a CSV file, your memory requirements can explode.
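A quick way to see that per-value overhead (the exact byte counts vary by MATLAB version) is to compare a plain numeric array with a cell array holding the same numbers:
v = rand(1, 1000);      % plain double array: about 8 bytes per value
c = num2cell(v);        % same numbers, one cell per scalar
whos v c                % c is far larger, because every cell carries its own array header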
So what can you do? Tables store columns as arrays where possible, so you can try readtable, although I think the underlying implementation might not be memory efficient.
For large files Matlab suggests using the datastore function. It will fix your memory problem although it may be a bit slow.
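A minimal sketch of chunked reading with datastore (the file name and chunk size here are assumptions):
ds = datastore('data.csv');      % creates a TabularTextDatastore for the CSV
ds.ReadSize = 50000;             % rows per chunk; tune this to your memory budget
while hasdata(ds)
    chunk = read(ds);            % one table of up to ReadSize rows
    % ... process the chunk or accumulate summaries here ...
end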
The other option is to read the entire file into memory and to do your own custom processing. For example, assuming you don't have anything escaped (i.e. commas that are not actually delimiters), you can find all relevant delimiters by using:
%Find comma or newline
I = regexp(temp,',|\n')
Here's an example of extracting various columns. As indicated above, this has a large overhead for strings (character arrays) but is efficient for numbers.
%Fake data as an example, 3 columns with middle one numeric
temp = sprintf('asdf,1234,temp\nfred,324,chip\ncheese,12,you are always right');
I = regexp(temp,',|\n');
starts = [0 I];
ends = [I length(temp)+1];
n_columns = 3;
%extract column 2
c2 = arrayfun(@(x,y) str2double(temp(x+1:y-1)),starts(2:n_columns:end),ends(2:n_columns:end));
%extract column 1
c1 = arrayfun(@(x,y) temp(x+1:y-1),...
starts(1:n_columns:end),ends(1:n_columns:end),'un',0);
Depending on your use case this may work or it may not. To read the file into memory you can use fileread
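For example (the file name is an assumption):
temp = fileread('data.csv');   % the whole file as one character array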
In answer to question (2): it is quite straightforward to split the CSV up, assuming there are more rows than columns...
bigfile = csvread(filename);
bigLen = length(bigfile);            % number of rows (given more rows than columns)
half = uint64(bigLen/2);             % split point
csvwrite('first.csv', bigfile(1:half, :));
csvwrite('second.csv', bigfile(half+1:bigLen, :));
Or even doing this with SEVERAL files; it may not make it faster overall, but it would allow you to observe the process as each file is read.
I think that MatLab itself has a limit on how much input it is allowed to take in. I'm fairly sure you can set that in the preferences if you have a high enough version.
Check this out: http://www.mathworks.com/help/matlab/matlab_env/set-workspace-and-variable-preferences.html

matlab cell array size much larger than actual data

I recently discovered a strange behaviour of MATLAB cell arrays that was not happening before.
If I create a cell array with
a=cell(1,4)
its size is 32 bytes.
If then I put something inside, e.g.
a{2}='abcd'
its size becomes 144 bytes. But if I remove this content by putting
a{2}=[]
the size becomes 132 bytes and so on. What is the problem?
Simply put, the Matlab cell array needs some internal data structures to keep track of what is stored within.
As it seems, Matlab allocates memory as needed, and thus extends the storage needed by the cell array as you insert data.
Removing the data doesn't mean that Matlab can return the now-unused memory to the OS or its internal memory pool. That might either be impossible with the internal storage structure, or unwise with respect to performance: cell arrays from which data is removed are (across typical use cases of cell arrays) structures that get updated often, so "prematurely" returning memory just to acquire it back a few instructions later would be pretty CPU-intensive.
As a general note: Matlab has pretty terrible storage approaches for nearly everything but matrices and sparse matrices (vectors of course being special cases of matrices). That's because it's not Matlab's job to be e.g. a string parser etc.
If memory becomes a problem, it might be worth implementing the math core of your problem in Matlab, doing the rest in another, more generally usable programming language, and interfacing your Matlab code with that. I haven't tried that myself, but Mathworks has a Matlab engine for Python, and I'd take writing Python for things like storing arbitrary data over using Matlab any day; with that engine, you can call Matlab to do your dirty math work, and use Python to do your everyday scripting/programming work.
Notice that my bottom line here is that Matlab has great Math routines and impressive documentation, but if you want to actually develop software, using a general purpose tool/language is much more likely to be satisfying quickly.
I'd even go as far as saying that it's probably worth your time to learn Python, just to be able to circumvent having to deal with things that Matlab wasn't designed for (and cell arrays are a prime example of something that is really complicated in Matlab and extremely easy in Python).
You use
a{2}=[]
to 'kill' the data in that field. In reality, this doesn't empty the cell itself: you leave a non-empty cell entry that holds an empty double array. (Thanks to Matlab for representing empty cells as empty doubles...)
but if you use (no curly braces, but parentheses):
a(2) = cell(1,1)
then the cell array size is back to "empty" = 32 bytes.
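You can watch this with whos; the exact byte counts depend on your MATLAB version, but the pattern matches the figures quoted above:
a = cell(1,4);
whos a              % the empty cell array (32 bytes in the question's version)
a{2} = 'abcd';
whos a              % grows: cell 2 now holds a char array with its own header
a{2} = [];
whos a              % still larger than the empty array: cell 2 holds an empty double
a(2) = cell(1,1);
whos a              % back to the original empty-cell-array size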

Plotting data sets of different lengths from a struct - avoid padding

I hope this is a simple enough question, but I am a beginner and haven't managed it on my own after several sessions.
I have a 1x29 struct of financial market data with 8 fields: stock_market.
the first field is the date
As you can see, not all the components of the stock market (i.e. the stocks) have the same-length time series (some companies are newer than others or have since left the market, etc.).
I would like to know in general how best to manipulate such an uneven struct of time series data. One specific example: I'd like to write a function that can extract the 'Date' and 'AdjClose' (adjusted closing price) from the struct for one individual stock and plot them on a graph, with the dates shown/labelled on the x-axis and the prices on the y-axis. An extension is to plot several AdjClose time series against one x-axis together, even though the lengths of their respective Date vectors are different. I don't want to pad these series out so that I have lines running along zero for a portion of the graph.
Other more complicated functions shouldn't be difficult once I could manage this task.
I haven't shared any code, as it is simply plot(stock_market(1).Date, stock_market(1).AdjClose) and mostly red with errors. I have read a few discussions about padding but, like I mentioned above, I don't want to pad with zeros or any other number for the diagrams. For other functions I may have to pad, but if anyone has a different solution it'd be great to hear it :-)
Thanks in advance.
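A minimal sketch of the kind of plot described above, assuming each stock_market(k) has Date and AdjClose vectors of matching length (the stock indices chosen here are made up):
figure; hold on;
for k = [1 5 12]                                   % whichever stocks you want to compare
    plot(stock_market(k).Date, stock_market(k).AdjClose, ...
         'DisplayName', sprintf('stock %d', k));   % each series keeps its own length, no padding
end
hold off;
legend show;
xlabel('Date'); ylabel('Adjusted close');
If Date is stored as datenums rather than datetimes, call datetick('x') after plotting to get readable date labels.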

Need a method to store a lot of data in Matlab

I've asked this before, but I feel I wasn't clear enough so I'll try again.
I am running a network simulation, and I have several hundred output files. Each file holds the simulation's test results for different parameters.
There are 5 different parameters and 16 different tests for each simulation. I need a method to store all this information (and again, there's a lot of it) in Matlab with the purpose of plotting graphs using a script. Suppose the script input is parameter_1 and test_2; then I get a graph where parameter_1 is the X axis and test_2 is the Y axis.
My problem is that I'm not very familiar with Matlab, and I need to be pointed in the right direction so it doesn't take me forever (I'm short on time).
How do I store this information in Matlab? I was thinking of two options:
Each output file is imported separately to a different variable (matrix)
All output files are merged into one output file and imported together. In the resulting matrix each line is a different output file, and each column is a different test. Problem is, I don't know how to store the simulation parameters.
Edit: maybe I can use a dataset?
So, I would appreciate any suggestion on how to store the information, and on which functions might help me fetch only the data I need.
If you're still looking to give matlab a try with this problem, you can iterate through all the files and import them one by one. You can create a list of the contents of a folder with the function
ls(name)
and you can import data like this:
A = importdata(filename)
If your data is in txt files, you should consider this Prev Q
A good strategy to avoid cluttering your workspace is to import them all into a single cell array. So if you have a cell array called VAR, then VAR{1}{1} could be where you put your test results and VAR{1}{2} could be where you put the simulation parameters of the first file. I think that is simpler than making a data structure. Just make sure you uniformly place the information in the same indexes of the arrays. You could also organize VAR by row vs. column as parameter vs. test.
This is more along the lines of your first suggestion
Each output file is imported separately to a different variable
(matrix)
Your second suggestion seems unnecessary since you can just iterate through your files.
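A minimal sketch of that loop (the folder name and .txt extension are assumptions, and dir is used instead of ls because it returns a struct that is easy to iterate over):
files = dir('results/*.txt');
VAR = cell(1, numel(files));
for k = 1:numel(files)
    data = importdata(fullfile('results', files(k).name));
    VAR{k} = data;                 % one slot per output file
end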
You can use the command save to store your data.
It is very convenient, and can store as much data as your hard disk can bear.
The documentation is there:
http://www.mathworks.fr/help/techdoc/ref/save.html
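For example, using the VAR cell array from the answer above (the variable name is just for illustration; -v7.3 is the HDF5-based format that allows variables over 2 GB):
save('simulation_results.mat', 'VAR', '-v7.3');
% later, in another session:
S = load('simulation_results.mat');    % S.VAR holds the cell array again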
Describe the format of your text files. If they have a systematic format, you can use dlmread or similar commands in Matlab to read a text file into a matrix. From there, you can plot easily. If you try to do it in Excel, it will be much slower than reading from a text file, so if speed is an issue for you, I suggest that you don't go with Excel.
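A hedged sketch, assuming each file is purely numeric and comma-delimited (the file name and column meanings are assumptions):
M = dlmread('run_01.txt', ',');   % rows = samples, columns = tests/parameters
plot(M(:,1), M(:,2));             % e.g. parameter_1 against test_2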

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But I want to identify near duplicates, so I also looked at hashing algorithms like LSH and MinHash that are used in nearest-neighbour problems and near-duplicate detection.
I am kind of stuck. Could anyone give me pointers on how to proceed, and papers/implementations that I could look at?
First, normalize the text: convert it to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, and remove leading and trailing white space; for speed I would perform all these operations in one pass over the text. Next, take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64-bit integers) in a table; if it exists, the text is an exact duplicate, and if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
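A hedged Matlab sketch of that normalization and exact-duplicate key (the sample string is made up, and MD5 is computed through the JVM that ships with Matlab, so this assumes Matlab is running with Java enabled):
raw = sprintf('Some streamed TEXT,\nwith  odd   spacing & characters!');
s = lower(raw);
s = regexprep(s, '[^a-z]', ' ');   % replace every non-letter with a space
s = regexprep(s, '\s+', ' ');      % compress runs of whitespace to one space
s = strtrim(s);                    % drop leading/trailing whitespace
md = java.security.MessageDigest.getInstance('MD5');
md.update(uint8(s));
digestBytes = typecast(md.digest(), 'uint8');   % 16 raw bytes
key = typecast(digestBytes, 'uint64');          % the two 64-bit integers to look up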
To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and the blog post by Greg Linden. Suppose the routine Sigs() does that for a given string; that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64-bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity join problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared against, as does the simhash method.
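A deliberately simplified stand-in for Sigs(), sketched in Matlab: a bottom-k sketch over word 3-grams, which is not the SpotSigs selection rule, with Java's 32-bit String.hashCode standing in for a proper 64-bit hash (the sample text and k are made up):
s = 'this is the normalized text of one streamed document';
k = 4;                                        % signature size
words = strsplit(s, ' ');
n = max(numel(words) - 2, 1);
h = zeros(1, n);
for i = 1:n
    shingle = strjoin(words(i:min(i+2, end)), ' ');
    h(i) = double(java.lang.String(shingle).hashCode());  % hash of one 3-gram
end
h = unique(h);                                % sorted ascending, duplicates removed
sigs = h(1:min(k, numel(h)));                 % keep the k smallest hashes as the signature
Two documents whose signatures share enough values become near-duplicate candidates for a full comparison.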
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes