Removing duplicate lines from a large dataset - hash

Let's assume that I have a very large dataset that cannot fit into memory; there are millions of records in the dataset and I want to remove duplicate rows (while keeping just one row from each set of duplicates).
What's the most efficient approach in terms of space and time complexity?
What I have thought of:
1. Using a Bloom filter. I am not sure how it's implemented, but I guess the side effect is having false positives; in that case, how can we find out whether a record is REALLY a duplicate or not?
2. Using hash values. In this case, if we have a small number of duplicates, the number of unique hash values would be large and again we may have a problem with memory.

Your solution 2 (using hash values) doesn't have to cause a memory problem. You just have to partition the hash space into slices that fit into memory. More precisely:
Consider a hash table storing the set of records, where each record is represented only by its index in the table. Say, for example, that such a hash table would be 4GB. Then you split your hash space into k = 4 slices. Depending on the last two bits of the hash value, each record goes into one of the slices. So the algorithm would go roughly as follows:
let k = 2^M
for i from 0 to k-1:
    t = new table
    for each record r on the disk:
        h = hashvalue(r)
        if (the last M bits of h == i) {
            insert r into t with respect to hash value h >> M
        }
    search t for duplicates and remove them
    delete t from memory
The drawback is that you have to hash each record k times. The advantage is that it can trivially be distributed.
Here is a prototype in Python:
# Fake huge database on the disk
records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"]*100

M = 2
mask = 2**M - 1  # selects the last M bits of a hash value

class HashLink(object):
    def __init__(self, idx):
        self._idx = idx
        self._hash = hash(records[idx])  # file access

    def __hash__(self):
        # the last M bits were already used to pick the slice,
        # so hash on the remaining bits
        return self._hash >> M

    # HashLinks are equal if they link to equal records
    def __eq__(self, other):
        return records[self._idx] == records[other._idx]  # file access

    def __repr__(self):
        return str(records[self._idx])

to_be_deleted = list()
for i in range(2**M):
    t = set()
    for idx, rec in enumerate(records):
        h = hash(rec)
        if (h & mask) == i:
            if HashLink(idx) in t:
                to_be_deleted.append(idx)
            else:
                t.add(HashLink(idx))
The result is:
>>> [records[idx] for idx in range(len(records)) if idx not in to_be_deleted]
['askdjlsd', 'kalsjdld', 'alkjdslad']

Since you need to delete duplicate items without sorting or indexing, you may end up scanning the entire dataset for every delete, which is unbearably costly in terms of performance. Given that, you may think of some external sorting for this, or a database.
If you don't care about the ordering of the output dataset: create 'n' files, each storing a subset of the input dataset according to the hash of the record or the record's key. Compute the hash, take it modulo 'n', and that gives you the output file the record should go to. Since every output file is now small, your delete operation will be very fast; for the output files you could use plain files, or sqlite/Berkeley DB. I would recommend sqlite/bdb, though. To avoid scanning on every write to an output file, you could put a front-end Bloom filter in front of each output file. Bloom filters aren't that difficult, and lots of libraries are available. Calculating 'n' depends on your main memory, I would say; go with a pessimistic, large value for 'n'. Once the work is done, concatenate all the output files into a single one. A minimal sketch of the partitioning idea is shown below.
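Here is a minimal sketch of that partition-then-deduplicate idea in Python, assuming plain text files with one record per line; it uses an in-memory set per bucket instead of sqlite/bdb, omits the Bloom-filter front end, and the file names and bucket count are made up for illustration:

import hashlib
import os

N_BUCKETS = 64  # pessimistic guess; pick it so that each bucket fits in memory

def bucket_of(record):
    # Stable hash of the record, reduced modulo the number of buckets.
    digest = hashlib.sha1(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_BUCKETS

def deduplicate(input_path, output_path, workdir="buckets"):
    os.makedirs(workdir, exist_ok=True)
    bucket_paths = [os.path.join(workdir, "bucket_%d.txt" % i) for i in range(N_BUCKETS)]

    # Pass 1: scatter every record into the bucket file chosen by its hash.
    bucket_files = [open(p, "w", encoding="utf-8") for p in bucket_paths]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line in src:
                record = line.rstrip("\n")
                bucket_files[bucket_of(record)].write(record + "\n")
    finally:
        for f in bucket_files:
            f.close()

    # Pass 2: every bucket is small, so deduplicate it with an in-memory set
    # and append the survivors to the final output.
    with open(output_path, "w", encoding="utf-8") as out:
        for path in bucket_paths:
            seen = set()
            with open(path, encoding="utf-8") as bucket:
                for line in bucket:
                    record = line.rstrip("\n")
                    if record not in seen:
                        seen.add(record)
                        out.write(record + "\n")

Because duplicates always hash to the same bucket, checking within each bucket is sufficient; the price is one extra pass over the data, which is usually far cheaper than repeated full scans.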


Split a Matlab table in several tables dynamically

I am working in MATLAB and I have not yet found a way to split a table T into different tables {T1,T2,T3,...} dynamically. What I mean by dynamic is that it must be done based on some conditions on the table T that are not known a priori. For now, I do it in a non-dynamic way with the following code (I hard-code the number of tables I want to have).
%% Separate data of table T in tables T1,T2,T3
starting_index = 1;
T1 = T(1:counter_simulations(1),:);
starting_index = counter_simulations(1)+1;
T2 = T(starting_index:starting_index+counter_simulations(2)-1,:);
starting_index = starting_index + counter_simulations(2);
T3 = T(starting_index:starting_index+counter_simulations(3)-1,:);
Any ideas on how to do it dynamically? I would like to do something like this:
for (i=1:number_of_tables_to_create)
    T{i} = ...
end
EDIT: the variable counter_simulations is an array containing the number of rows I want to extract for each table. Example: counter_simulations(1)=200 means that the first table will be T1 = T(1:200,:). If counter_simulations(2)=300, the second table will be T2 = T(201:500,:), and so on.
I hope I was clear.
Should I use cell arrays instead of tables maybe?
Thanks!
For the example you give, where counter_simulations contains a list of the number of rows to take from T in each of the output tables, MATLAB's mat2cell function actually implements this behaviour directly:
T = mat2cell(T,counter_simulations);
While you haven't specified the contents of counter_simulations, it's clear that if sum(counter_simulations) > height(T) the example would fail. If sum(counter_simulations) < height(T) (and so your desired output doesn't contain the last row(s) of T) then you would need to add a final element to counter_simulations and then discard the resulting output table:
counter_simulations(end+1) = height(T) - sum(counter_simulations);
T = mat2cell(T,counter_simulations);
T(end) = [];
Whether this solution applies to all cases of "some conditions of the table T that are not known a priori" that you ask for in the question depends on the range of conditions you actually mean; for a broad enough interpretation there will not be a general solution, but you might be able to narrow it down if mat2cell performs too specific a job for your actual problem.

How can I assign a unique value to represent a list of unique integers

I need to compare two lists of unique integers by assigning each list a unique value to represent its integers. What method/algorithm can I apply for this that's not too computationally intensive and produces a relatively short id/hash for the set?
Both lists:
have a unique set of integers ranging from 1 to 1000
are ordered
For example:
l1 = [1,2,3,4...55,57...999]
l2 = [1,2,3,4...54,56...999]
l1 is missing 56 while l2 is missing 55.
All I need to know in this case is that the lists are not identical so I can update l2.
Updated after comment
See below for an explanation of why you can't use a hash code to assign "each list a unique value to represent its integers."
However, a hash code can still be useful. Assume you create a hash code for each list. You'll want to make sure that you sort the items in the lists before computing the hash code, because order definitely matters in hash code computations. That won't necessarily generate a unique hash code for each list, but if the hash codes for two lists aren't identical, the lists are definitely different. If the hash codes are identical, the lists might be identical. The code, then, looks like this:
bool AreListsIdentical(list1, list2)
{
    if (list1.hashCode != list2.hashCode)
    {
        // hash codes are different, so lists are definitely not identical
        return false;
    }
    // hash codes are equal. Lists might be identical.
    if (list1.Count != list2.Count)
    {
        // lists have different numbers of items. Definitely not identical.
        return false;
    }
    // have to compare individual items
    for (int i = 0; i < list1.Count; ++i)
    {
        if (list1[i] != list2[i])
        {
            return false;
        }
    }
    return true;
}
Previous answer
You have multiple lists, each of which contains unique numbers in the range 1 to 1,000. You don't say how large each list is, but for illustration purposes I'll say that each list contains 10 numbers.
You also don't say whether order matters in the list. Is the list [1,7,99,206] the same as [99,7,206,1]? I'll show you the calculations either way.
The number of permutations (order matters) of 1,000 items taken 10 at a time is 9.56E+29. The number of combinations (order doesn't matter) is 2.63E+23. Those are huge numbers.
You say you want a "relatively short id." We can express a 64-bit value easily in a 12-character string, so let's assume that you want to create a 64-bit hash code. There are about 1.84E+19 possible 64-bit values.
There are roughly fifty billion times more possible permutations than possible hash codes, and more than ten thousand times more combinations than hash codes.
Applying the Pigeonhole principle, you have n things that you want to put in m boxes. Since n > m, at least one box will contain more than one item. You can't possibly have a unique 64-bit value for each list.
(In truth, assuming a good hash function, every hash code will represent approximately the same number of different lists.)
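If you want to reproduce those counts, here is a quick sanity check in Python (3.8+ for math.perm/math.comb), assuming lists of 10 items as in the illustration above:

import math

n_permutations = math.perm(1000, 10)  # ordered lists: about 9.56e29
n_combinations = math.comb(1000, 10)  # unordered sets: about 2.63e23
n_hash_codes = 2 ** 64                # distinct 64-bit values: about 1.84e19

print("%.2e permutations" % n_permutations)
print("%.2e combinations" % n_combinations)
print("%.2e hash codes" % n_hash_codes)
print("%.1e permutations per hash code" % (n_permutations / n_hash_codes))
print("%.1e combinations per hash code" % (n_combinations / n_hash_codes))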

Scala Iterator/Looping Technique - Large Collections

I have really large tab delimited files (10GB-70GB) and need to do some read, data manipulation, and write to a separate file. The files can range from 100 to 10K columns with 2 million to 5 million rows.
The first x columns are static which are required for reference. Sample file format:
#ProductName Brand Customer1 Customer2 Customer3
Corolla Toyota Y N Y
Accord Honda Y Y N
Civic Honda 0 1 1
I need to use the first 2 columns to get a product id then generate an output file similar to:
ProductID1 Customer1 Y
ProductID1 Customer2 N
ProductID1 Customer3 Y
ProductID2 Customer1 Y
ProductID2 Customer2 Y
ProductID2 Customer3 N
ProductID3 Customer1 N
ProductID3 Customer2 Y
ProductID3 Customer3 Y
Current sample code:
val fileNameAbsPath = filePath + fileName
val outputFile = new PrintWriter(filePath+outputFileName)
var customerList = Array[String]()
for(line <- scala.io.Source.fromFile(fileNameAbsPath).getLines()) {
  if(line.startsWith("#")) {
    customerList = line.split("\t")
  }
  else {
    val cols = line.split("\t")
    val productid = getProductID(cols(0), cols(1))
    for (i <- (2 until cols.length)) {
      val rowOutput = productid + "\t" + customerList(i) + "\t" + parser(cols(i))
      outputFile.println(rowOutput)
      outputFile.flush()
    }
  }
}
outputFile.close()
One of tests that I ran took about 12 hours to read a file (70GB) that has 3 million rows and 2500 columns. The final output file generated 250GB with about 800+ million rows.
My question is: is there anything in Scala other than what I'm already doing that can offer quicker performance?
Ok, some ideas ...
As mentioned in the comments, you don't want to flush after every line. So, yeah, get rid of it.
Also, make sure you don't construct the PrintWriter with auto-flushing enabled: the two-argument constructors take a boolean autoFlush parameter, and you want it to be false (the single-argument constructor you are currently using already defaults to no automatic flushing).
You don't need to create a BufferedWriter explicitly; PrintWriter is already buffered by default. The default buffer size is 8K. You might want to play around with it, but it will probably not make any difference because, last I checked, the underlying FileOutputStream ignores all that and flushes kilobyte-sized chunks either way.
Get rid of gluing rows together in a variable, and just write each field straight to the output.
If you do not care about the order in which lines appear in the output, you can trivially parallelize the processing (if you do care about the order, you still can, just a little bit less trivially), and write several files at once. That would help tremendously, if you place your output chunks on different disks and/or if you have multiple cores to run this code. You'd need to rewrite your code in (real) scala to make it thread safe, but that should be easy enough.
Compress the data as it is being written, using GZIPOutputStream for example. That not only reduces the physical amount of data actually hitting the disks, but also allows for a much larger buffer.
Check out what your parser thingy is doing. You didn't show the implementation, but something tells me it is likely not free.
split can get prohibitively expensive on huge strings. People often forget that its parameter is actually a regular expression. You are probably better off writing a custom iterator or just using good old StringTokenizer to parse the fields out as you go, rather than splitting up-front. At the very least, it'll save you one extra scan per line.
Finally, last but by no means least: consider using Spark and HDFS. This kind of problem is the very area where those tools really excel.

Creating the optimum index for my database

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, the data in this table is a number of entries for each fileID, and each fileID has lineNum from 0 to x, with lineNum never repeating for each fileID (but it does repeat for different fileID's). Then data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that at this time, this and a search not constrained by data1 are the only searches that I run on this table, so I'm not too concerned about this index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?
You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, as the docs note, indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized.
Since your second search query filters without the data1 column, keeping just the second column lineNum should be sufficient (you mention data1 is effectively random anyway), and in the rare case of repeats, the table fetch ensures correctness. What this means is that the index would be about one-third smaller, which is a big win (think: an index small enough to fit in memory, index-only scans, etc.).
Either index can be used. Which is faster will depend on many things, like how many rows are in the table, how many lineNums there are per fileID, how selective the data1 < ? clause is, what your hardware is, what your config settings are, which version of PostgreSQL you are using, what physical order the table rows lie in, etc.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, then you will have at hand the test case needed to come to a real conclusion.
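As for the last part of the question: you do not have to change the query for it to use the index; the planner considers the available indexes automatically. To see which one it actually picks on your data, run EXPLAIN ANALYZE on the real query. A minimal sketch with psycopg2, where the connection string and parameter values are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string
cur = conn.cursor()

# EXPLAIN ANALYZE runs the query and reports which index (if any) was used.
cur.execute(
    "EXPLAIN ANALYZE "
    "SELECT lineNum, data1 FROM rawData "
    "WHERE fileID = %s AND data1 < %s ORDER BY lineNum",
    (42, 0.5),
)
for (plan_line,) in cur.fetchall():
    print(plan_line)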

MongoDB custom and unique IDs

I'm using MongoDB, and I would like to generate unique, cryptic IDs for blog posts (they will be used in RESTful URLs), such as s52ruf6wst or xR2ru286zjI.
What do you think is the best and most scalable way to generate these IDs?
I was thinking of the following architecture:
a periodic (daily?) batch running to generate a lot of random and unique IDs and insert them into a dedicated MongoDB collection with InsertIfNotPresent
and each time I want to generate a new blog post, I take an ID from this collection and mark it as "taken" with an UpdateIfCurrent atomic operation
WDYT?
This is exactly why the developers of MongoDB constructed their ObjectID's (the _id) the way they did ... to scale across nodes, etc.
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big endian, unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order.
Here's the schema:
bytes 0-3: time, bytes 4-6: machine, bytes 7-8: pid, bytes 9-11: inc
Traditional databases often use monotonically increasing sequence numbers for primary keys. In MongoDB, the preferred approach is to use Object IDs instead. Object IDs are more synergistic with sharding and distribution.
http://www.mongodb.org/display/DOCS/Object+IDs
So I'd say just use the ObjectID's
They are not that bad when converted to a string (these were inserted right after each other) ...
For example:
4d128b6ea794fc13a8000001
4d128e88a794fc13a8000002
They look at first glance to be "guessable" but they really aren't that easy to guess ...
4d128 b6e a794fc13a8000001
4d128 e88 a794fc13a8000002
And for a blog, I don't think it's that big of a deal ... we use it in production all over the place.
What about using UUIDs?
http://www.famkruithof.net/uuid/uuidgen as an example.
Make a web service that returns a globally-unique ID so that you can have many webservers participate and know you won't hit any duplicates?
What if your daily batch didn't allocate enough items? Do you run it again midday?
I would implement the web-service client as a queue that can be watched by a local process and refilled as needed (when the server is less busy), keeping enough items in the queue that it doesn't need to run during peak usage. Makes sense?
This is an old question, but for anyone who might be searching for another solution:
One way is to use a simple and fast substitution cipher. (The code below is based on someone else's code; I forgot where I took it from, so I cannot give proper credit.)
class Array
  def shuffle_with_seed!(seed)
    prng = (seed.nil?) ? Random.new() : Random.new(seed)
    size = self.size
    while size > 1
      # random index
      a = prng.rand(size)
      # last index
      b = size - 1
      # switch last element with random element
      self[a], self[b] = self[b], self[a]
      # reduce size and do it again
      size = b
    end
    self
  end

  def shuffle_with_seed(seed)
    self.dup.shuffle_with_seed!(seed)
  end
end

class SubstitutionCipher
  def initialize(seed)
    normal = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a + [' ']
    shuffled = normal.shuffle_with_seed(seed)
    @map = normal.zip(shuffled).inject(:encrypt => {}, :decrypt => {}) do |hash, (a, b)|
      hash[:encrypt][a] = b
      hash[:decrypt][b] = a
      hash
    end
  end

  def encrypt(str)
    str.split(//).map { |char| @map[:encrypt][char] || char }.join
  end

  def decrypt(str)
    str.split(//).map { |char| @map[:decrypt][char] || char }.join
  end
end
You use it like this:
MY_SECRET_SEED = 3429824
cipher = SubstitutionCipher.new(MY_SECRET_SEED)
id = hash["_id"].to_s
encrypted_id = cipher.encrypt(id)
decrypted_id = cipher.decrypt(encrypted_id)
Note that it'll only encrypt a-z, A-Z, 0-9 and a space leaving other chars intact. It's sufficient for BSON ids.
The "correct" answer, which is not really a great solution IMHO, is to generate a random ID and then check the DB for a collision. If there is a collision, do it again. Repeat until you've found an unused value. Most of the time the first attempt will work (assuming that your generation process is sufficiently random).
It should be noted that this process is only necessary if you are concerned about the security implications of a time-based UUID or a counter-based ID. Either of these will lead to "guessability", which may or may not be an issue in any given situation. I would consider a time-based or counter-based ID to be sufficient for blog posts, though I don't know the details of your situation and reasoning.
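If you do go the random-ID-plus-collision-check route, a unique index lets the database itself reject collisions, so the check and the insert are a single operation. A minimal sketch with pymongo; the posts collection, the slug field name, and the ID length are assumptions for illustration:

import secrets

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient()
posts = client.blog.posts
posts.create_index("slug", unique=True)  # collisions now fail at insert time

def insert_post_with_random_slug(doc, max_attempts=5):
    for _ in range(max_attempts):
        doc["slug"] = secrets.token_urlsafe(8)  # short, URL-safe, hard to guess
        try:
            posts.insert_one(doc)
            return doc["slug"]
        except DuplicateKeyError:
            continue  # extremely unlikely; just draw another random ID
    raise RuntimeError("could not generate a unique slug")

With 8 random bytes the chance of a collision is negligible, so the retry loop almost never runs more than once.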