delete queue items inside foreach loop - system-verilog

When we need to delete some items inside a queue, we may easily write code like below:
foreach(queue[i]) begin
if(queue[i].value == 1)
queue.delete(i);
end
But the code above has a bug when queue[0].value == 1 and queue[1].value == 1, because queue.delete(0) shifts the indexes of all the items that follow, so the item that moves into index 0 is never checked.
So currently I use code as below:
foreach(queue[i]) begin
if(queue[i].value == 1) begin
queue.delete(i);
i--;
end
end
It works, but it looks confusing at first glance.
So my question is:
Is there a better solution for this issue in SystemVerilog?

I believe this should work (I'm unable to test it right now; make sure order is preserved when you try it out):
queue = queue.find() with ( item.value != 1 );
Another approach would be to find all the indexes that meet your criteria, sort them in descending order, then loop through the indexes:
int qi[$] = queue.find_index() with ( item.value == 1 );
qi.rsort(); // sort highest to lowest (ordering methods sort in place and return void)
foreach(qi[idx]) queue.delete(qi[idx]);
Refer to IEEE1800-2012 § 7.12 Array manipulation methods for details
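To make the index-shifting problem and the two suggestions above concrete, here is a small language-neutral sketch. It is written in Python rather than SystemVerilog, and the function names are hypothetical. It only mirrors the ideas in this thread: a forward loop that deletes in place skips the neighbour of every deleted item, while filtering or deleting from the highest index down does not.

```python
def delete_forward(values, target):
    """Naive approach: delete while walking forward through the indexes."""
    items = list(values)
    i = 0
    while i < len(items):
        if items[i] == target:
            del items[i]  # everything after i shifts down by one...
        i += 1            # ...so the item now sitting at i is never examined
    return items


def delete_descending(values, target):
    """Collect matching indexes, then delete from the highest index down."""
    items = list(values)
    hits = [i for i, v in enumerate(items) if v == target]
    for i in sorted(hits, reverse=True):
        del items[i]      # lower indexes are unaffected by this deletion
    return items


def delete_by_filter(values, target):
    """Build a new list that keeps only the non-matching items, in order."""
    return [v for v in values if v != target]


print(delete_forward([1, 1, 2, 1], 1))     # a matching item survives
print(delete_descending([1, 1, 2, 1], 1))
print(delete_by_filter([1, 1, 2, 1], 1))
```

The filter version is usually the easiest to read; the descending-index version is useful when the container must be modified in place.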


Why am I getting ReferenceOutOfRangeException while PlayerPref a list in Unity [duplicate]

I have some code and when it executes, it throws an IndexOutOfRangeException, saying,
Index was outside the bounds of the array.
What does this mean, and what can I do about it?
Depending on classes used it can also be ArgumentOutOfRangeException
An exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll but was not handled in user code Additional information: Index was out of range. Must be non-negative and less than the size of the collection.
What Is It?
This exception means that you're trying to access a collection item by index, using an invalid index. An index is invalid when it's lower than the collection's lower bound or greater than or equal to the number of elements it contains.
When It Is Thrown
Given an array declared as:
byte[] array = new byte[4];
You can access this array from index 0 to 3; values outside this range will cause IndexOutOfRangeException to be thrown. Remember this when you create and access an array.
Array Length
In C#, usually, arrays are 0-based. It means that the first element has index 0 and the last element has index Length - 1 (where Length is the total number of items in the array), so this code doesn't work:
array[array.Length] = 0;
Moreover, please note that if you have a multidimensional array then you can't use Array.Length for both dimensions; you have to use Array.GetLength():
int[,] data = new int[10, 5];
for (int i=0; i < data.GetLength(0); ++i) {
for (int j=0; j < data.GetLength(1); ++j) {
data[i, j] = 1;
}
}
Upper Bound Is Not Inclusive
In the following example we create a raw bidimensional array of Color. Each item represents a pixel, indices are from (0, 0) to (imageWidth - 1, imageHeight - 1).
Color[,] pixels = new Color[imageWidth, imageHeight];
for (int x = 0; x <= imageWidth; ++x) {
for (int y = 0; y <= imageHeight; ++y) {
pixels[x, y] = backgroundColor;
}
}
This code will then fail because array is 0-based and last (bottom-right) pixel in the image is pixels[imageWidth - 1, imageHeight - 1]:
pixels[imageWidth, imageHeight] = Color.Black;
In another scenario you may get ArgumentOutOfRangeException for this code (for example if you're using GetPixel method on a Bitmap class).
Arrays Do Not Grow
An array is fast. Very fast in linear search compared to every other collection. It is because items are contiguous in memory so memory address can be calculated (and increment is just an addition). No need to follow a node list, simple math! You pay this with a limitation: they can't grow, if you need more elements you need to reallocate that array (this may take a relatively long time if old items must be copied to a new block). You resize them with Array.Resize<T>(), this example adds a new entry to an existing array:
Array.Resize(ref array, array.Length + 1);
Don't forget that valid indices are from 0 to Length - 1. If you simply try to assign an item at Length you'll get IndexOutOfRangeException (this behavior may confuse you if you think they may increase with a syntax similar to Insert method of other collections).
Special Arrays With Custom Lower Bound
The first item in an array usually has index 0, but this is not always true, because you can create an array with a custom lower bound:
var array = Array.CreateInstance(typeof(byte), new int[] { 4 }, new int[] { 1 });
In that example, array indices are valid from 1 to 4. Of course, upper bound cannot be changed.
Wrong Arguments
If you access an array using unvalidated arguments (from user input or from function user) you may get this error:
private static string[] RomanNumbers =
new string[] { "I", "II", "III", "IV", "V" };
public static string Romanize(int number)
{
return RomanNumbers[number];
}
Unexpected Results
This exception may be thrown for another reason too: by convention, many search functions return -1 if they don't find anything (nullables were only introduced with .NET 2.0, and in any case this is a well-known convention that has been in use for many years). Let's imagine you have an array of objects comparable with a string. You may think to write this code:
// Items comparable with a string
Console.WriteLine("First item equals to 'Debug' is '{0}'.",
myArray[Array.IndexOf(myArray, "Debug")]);
// Arbitrary objects
Console.WriteLine("First item equals to 'Debug' is '{0}'.",
myArray[Array.FindIndex(myArray, x => x.Type == "Debug")]);
This will fail if no item in myArray satisfies the search condition, because Array.IndexOf() will return -1 and the array access will then throw.
The next example is a naive way to count occurrences of a given set of numbers (knowing the maximum number and returning an array where the item at index 0 represents number 0, the item at index 1 represents number 1, and so on):
static int[] CountOccurences(int maximum, IEnumerable<int> numbers) {
int[] result = new int[maximum + 1]; // Includes 0
foreach (int number in numbers)
++result[number];
return result;
}
Of course, it's a pretty terrible implementation but what I want to show is that it'll fail for negative numbers and numbers above maximum.
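As a hedged illustration of the fix (in Python, not C#, and not part of the original answer), the same counting routine can check each number against the valid range before using it as an index. The function name and error messages are hypothetical. The explicit check matters even more in Python, where a negative index does not fail but silently counts from the end of the list.

```python
from typing import Iterable, List


def count_occurrences(maximum: int, numbers: Iterable[int]) -> List[int]:
    """Count how often each value in 0..maximum appears in numbers."""
    if maximum < 0:
        raise ValueError("maximum must be non-negative")
    result = [0] * (maximum + 1)  # index 0 counts the number 0
    for number in numbers:
        # Reject anything that is not a valid index instead of letting the
        # lookup fail (or, for negative numbers in Python, wrap around).
        if not 0 <= number <= maximum:
            raise ValueError(f"{number} is outside the range 0..{maximum}")
        result[number] += 1
    return result


print(count_occurrences(3, [0, 1, 1, 3]))
```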
How it applies to List<T>?
The same cases apply as for arrays: the range of valid indexes runs from 0 (List's indexes always start at 0) to list.Count - 1, and accessing elements outside this range will cause the exception.
Note that List<T> throws ArgumentOutOfRangeException for the same cases where arrays use IndexOutOfRangeException.
Unlike arrays, List<T> starts empty, so trying to access items of a just-created list leads to this exception.
var list = new List<int>();
A common mistake is to try to populate the list by indexing (as you would with Dictionary<int, T>), which causes the exception:
list[0] = 42; // exception
list.Add(42); // correct
IDataReader and Columns
Imagine you're trying to read data from a database with this code:
using (var connection = CreateConnection()) {
using (var command = connection.CreateCommand()) {
command.CommandText = "SELECT MyColumn1, MyColumn2 FROM MyTable";
using (var reader = command.ExecuteReader()) {
while (reader.Read()) {
ProcessData(reader.GetString(2)); // Throws!
}
}
}
}
GetString() will throw IndexOutOfRangeException because your dataset has only two columns but you're trying to get a value from the 3rd one (indices are always 0-based).
Please note that this behavior is shared with most IDataReader implementations (SqlDataReader, OleDbDataReader and so on).
You can get the same exception also if you use the IDataReader overload of the indexer operator that takes a column name and pass an invalid column name.
Suppose for example that you have retrieved a column named Column1 but then you try to retrieve the value of that field with
var data = dr["Colum1"]; // Missing the n in Column1.
This happens because the indexer operator is implemented trying to retrieve the index of a Colum1 field that doesn't exist. The GetOrdinal method will throw this exception when its internal helper code returns a -1 as the index of "Colum1".
Others
There is another (documented) case when this exception is thrown: if, in DataView, the data column name supplied to the DataView.Sort property is not valid.
How to Avoid
In this example, let me assume, for simplicity, that arrays are always monodimensional and 0-based. If you want to be strict (or you're developing a library), you may need to replace 0 with GetLowerBound(0) and .Length with GetUpperBound(0) (of course only if you have parameters of type System.Array; it doesn't apply to T[]). Please note that in this case the upper bound is inclusive, so this code:
for (int i=0; i < array.Length; ++i) { }
Should be rewritten like this:
for (int i=array.GetLowerBound(0); i <= array.GetUpperBound(0); ++i) { }
Please note that this is not allowed (it'll throw InvalidCastException), that's why if your parameters are T[] you're safe about custom lower bound arrays:
void foo<T>(T[] array) { }
void test() {
// This will throw InvalidCastException: cannot convert Int32[*] to Int32[]
foo((int[])Array.CreateInstance(typeof(int), new int[] { 1 }, new int[] { 1 }));
}
Validate Parameters
If an index comes from a parameter you should always validate it (throwing an appropriate ArgumentException or ArgumentOutOfRangeException). In the next example, wrong parameters may cause IndexOutOfRangeException; users of this function may expect this because they're passing an array, but it's not always so obvious. I'd suggest always validating parameters for public functions:
static void SetRange<T>(T[] array, int from, int length, Func<int, T> function)
{
if (from < 0 || from >= array.Length)
throw new ArgumentOutOfRangeException("from");
if (length < 0)
throw new ArgumentOutOfRangeException("length");
if (from + length > array.Length)
throw new ArgumentException("...");
for (int i=from; i < from + length; ++i)
array[i] = function(i);
}
If function is private you may simply replace if logic with Debug.Assert():
Debug.Assert(from >= 0 && from < array.Length);
Check Object State
An array index may not come directly from a parameter. It may be part of object state. In general, it is always good practice to validate object state (by itself and with function parameters, if needed). You can use Debug.Assert(), throw a proper exception (more descriptive about the problem) or handle that like in this example:
class Table {
public int SelectedIndex { get; set; }
public Row[] Rows { get; set; }
public Row SelectedRow {
get {
if (Rows == null)
throw new InvalidOperationException("...");
// No or wrong selection, here we just return null for
// this case (it may be the reason we use this property
// instead of direct access)
if (SelectedIndex < 0 || SelectedIndex >= Rows.Length)
return null;
return Rows[SelectedIndex];
}
}
}
Validate Return Values
In one of previous examples we directly used Array.IndexOf() return value. If we know it may fail then it's better to handle that case:
int index = Array.IndexOf(myArray, "Debug");
if (index != -1) { } else { }
How to Debug
In my opinion, most of the questions here on SO about this error could simply be avoided. The time you spend writing a proper question (with a small working example and a short explanation) could easily be much more than the time you'll need to debug your code. First of all, read Eric Lippert's blog post about debugging small programs. I won't repeat his words here, but it's absolutely a must-read.
You have source code, you have exception message with a stack trace. Go there, pick right line number and you'll see:
array[index] = newValue;
You found your error; check how index increases. Is it right? Check how the array is allocated: is it coherent with how index increases? Is it right according to your specifications? If you answer yes to all these questions, then you'll find good help here on StackOverflow, but please check for that by yourself first. You'll save your own time!
A good starting point is to always use assertions and to validate inputs. You may even want to use code contracts. When something goes wrong and you can't figure out what happens with a quick look at your code, then you have to resort to an old friend: the debugger. Just run your application in debug mode inside Visual Studio (or your favorite IDE) and you'll see exactly which line throws this exception, which array is involved and which index you're trying to use. Really, 99% of the time you'll solve it by yourself in a few minutes.
If this happens in production then you'd better add assertions to the incriminated code; we probably won't see in your code what you can't see by yourself (but you can always bet).
The VB.NET side of the story
Everything that we have said in the C# answer is valid for VB.NET with the obvious syntax differences but there is an important point to consider when you deal with VB.NET arrays.
In VB.NET, arrays are declared setting the maximum valid index value for the array. It is not the count of the elements that we want to store in the array.
' declares an array with space for 5 integer
' 4 is the maximum valid index starting from 0 to 4
Dim myArray(4) as Integer
So this loop will fill the array with 5 integers without causing any IndexOutOfRangeException
For i As Integer = 0 To 4
myArray(i) = i
Next
The VB.NET rule
This exception means that you're trying to access a collection item by index, using an invalid index. An index is invalid when it's lower than the collection's lower bound or greater than the maximum allowed index defined in the array declaration.
A simple explanation of what an index-out-of-bounds exception is:
Imagine a train whose compartments are D1, D2 and D3.
A passenger arrives with a ticket for D4.
What happens now? The passenger wants to enter a compartment that does not exist, so obviously a problem arises.
It is the same scenario whenever we access an array, list, etc.: we can only access the indexes that exist. If array[0] and array[1] exist and we try to access array[3], which isn't actually there, an index-out-of-bounds exception is raised.
To easily understand the problem, imagine we wrote this code:
static void Main(string[] args)
{
string[] test = new string[3];
test[0]= "hello1";
test[1]= "hello2";
test[2]= "hello3";
for (int i = 0; i <= 3; i++)
{
Console.WriteLine(test[i].ToString());
}
}
Result will be:
hello1
hello2
hello3
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
Size of array is 3 (indices 0, 1 and 2), but the for-loop loops 4 times (0, 1, 2 and 3). So when it tries to access outside the bounds with (3) it throws the exception.
Aside from the very long, complete accepted answer, there is an important point to make about IndexOutOfRangeException compared with many other exception types, and that is:
Often there is complex program state that may be difficult to control at a particular point in code, e.g. a DB connection goes down so data for an input cannot be retrieved. This kind of issue often results in an exception of some kind that has to bubble up to a higher level, because the place where it occurs has no way of dealing with it.
IndexOutOfRangeException is generally different in that, in most cases, it is pretty trivial to check for at the point where the exception is raised. Generally this kind of exception gets thrown by code that could very easily deal with the issue at the place it occurs, just by checking the actual length of the array. You don't want to 'fix' this by handling the exception higher up, but instead by ensuring it's not thrown in the first place, which in most cases is easy to do by checking the array length.
Another way of putting this is that other exceptions can arise due to a genuine lack of control over input or program state, BUT IndexOutOfRangeException is more often than not simply pilot (programmer) error.
These two exceptions are common in various programming languages and, as others said, they occur when you access an element with an index greater than or equal to the size of the array. For example:
var array = [1,2,3];
/* var lastElement = array[3] this will throw an exception, because indices
start from zero, length of the array is 3, but its last index is 2. */
The main reason behind this is that compilers usually don't check this kind of thing, so these errors only show up at runtime.
Similar to this:
Why don't modern compilers catch attempts to make out-of-bounds access to arrays?

Script is taking 11 - 20 seconds to look up an item in an 18,000 row data set

I have two Google sheets workbooks.
One is the "master" source of lookup data with a key based on manufacturer item #, which could be anything from 1234 to A-01/234-Name_1. This sheet, referenced via SpreadsheetApp.openByUrl, has 18,000 rows and 13 columns. The key column has been converted to plain text and the sheet is sorted by this column.
The second is the "template" where people enter item #s that they need to look up against the master, typically 20 - 1500 items at a time.
The script is in the template. It is very slow and routinely times out after 30 minutes. It was written by someone else and I am new to App Script, but I think I've managed to understand what the script is doing and where the bottleneck is occurring.
It does a bunch of stuff, but this is the meat of the lookup:
var numrows = master.getDataRange().getNumRows();
var masterdata = master.getDataRange().getValues();
var itemnumberlist = template.getDataRange().getValues();
var retreiveddata = [];
// iterate through the manf item number list to find all matches in the
// master and return those matches to another sheet
for (i = 1; i < template.getDataRange().getValues().length; i++) {
for (j = 0; j < numrows; j++) {
if (masterdata[j][1].toString() === itemnumberlist[i][1].toString()) {
retreiveddata.push(data[j]);
anothersheet.appendRow(data[j]);
}
}
}
I used Logger.log() to determine that each time through the i loop is taking 11 - 19 seconds, which just seems insane.
I've been doing some google searching and I've tried a couple of different things...
First I tried moving the writing of found data out of the for loop so the script would be doing all of its reading first and then writing in one big chunk, but I couldn't get it exactly right. My two attempts are below.
var mycounter = 0;
for (i = 0; i < template.getDataRange().getValues().length; i++) {
for (j = 0; j < numrows; j++) {
if (masterdata[j][0].toString() === itemnumberlist[i][0].toString()) {
retreiveddata.push(masterdata[j]);
mycounter = mycounter + 1;
}
}
}
// Attempt 1
// var myrange = retreiveddata.length;
// for(k = 0; k < myrange; k++) {
// anothersheet.appendRow(retreiveddata.pop([k]);
// }
//Attempt 2
var myotherrange = anothersheet.getRange(2,1,myothercounter, 13)
myotherrange.setValues(retreiveddata);
I can't remember for sure, because this was on Friday, but I think both attempts resulted in the script trying to write the entire master file into "anothersheet".
So I temporarily set this aside and decided to try something else. I was trying to recreate the issue in a couple of sample spreadsheets, but I was unable to do so. The same script is getting through my 15,000 row sample "master" file in less than 1 second per lookup. The only thing I can think of is that I used a random number as my key instead of a weird text string.
That led me to think that maybe I could use a hash algorithm on both the master data and the values to be looked up, but this is presenting a whole other set of issues.
I borrowed these functions from another forum post:
function GetMD5Hash(value) {
var rawHash = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5,
value);
var txtHash = '';
for (j = 0; j <rawHash.length; j++) {
var hashVal = rawHash[j];
if (hashVal < 0)
hashVal += 256;
if (hashVal.toString(16).length == 1)
txtHash += "0";
txtHash += hashVal.toString(16);
Utilities.sleep(100);
}
return txtHash;
}
function RangeGetMD5Hash(input) {
if (input.map) { // Test whether input is an array.
return input.map(GetMD5Hash); // Recurse over array if so.
Utilities.sleep(100);
} else {
return GetMD5Hash(input)
}
}
It literally took me all day to get the hash value for all 18,000 item #s in my master spreadsheet. Neither GetMD5Hash nor RangeGetMD5Hash will return a value consistently. I can only do a few rows at a time. Sometimes I get "Loading..." indefinitely. Sometimes I get "#Name" with a message about GetMD5Hash being undefined (despite the fact that it worked on the previous row). And sometimes I get "#Error" with a message about an internal error.
This method actually reduces the lookup time of each item to 2 - 3 seconds (much better, but not great). However, I can't get the hash function to consistently work on the input data.
At this point I'm so frustrated and behind on my other work that I thought I'd reach out to the smart people on these forums and hope for some sort of miracle response.
To summarize, I'm looking for suggestions on these three items:
What am I doing wrong in my attempt to move the write out of the for loop?
Is there a way to get my hash value faster or utilize a different method to accomplish the same goal?
What else can I try to help speed up the script?
Any suggestions you can offer would be greatly appreciated!
-Mandy
It sounds like you hit on the right approach with attempting to move the appendRow() call out of the loop. Anytime you are reading or writing to a spreadsheet you can expect the individual call to take 1 to 2 seconds, so this will eat up a lot of time when you get matches. Storing the matches in an array and writing them all at once is the way to go.
Another thing I notice is that your script calls getValues() in the actual for loop condition statement. The condition statement is executed each time on each iteration of the loop, so this is potentially wasting a lot of time even when you don't have matches.
A final tweak that may be helpful depending on your desired behaviour. You can stop the inner for loop after it finds the first match, which, if you only care about the first match or know there will only be one match, will save you a lot of iterations. To do this, put "break" immediately after the retreiveddata.push(masterdata[j]); line.
To fix the getValues issue, Change:
for (i = 1; i < template.getDataRange().getValues().length; i++) {
To:
for (i = 1; i < itemnumberlist.length; i++) {
And that fix along with the appendRow issue, and including the break call:
for (i = 1; i < itemnumberlist.length; i++) {
for (j = 0; j < numrows; j++) {
if (masterdata[j][0].toString() === itemnumberlist[i][0].toString()) {
retreiveddata.push(masterdata[j]);
break; //stop searching after first match, move on to next item
}
}
}
//make sure you have data to write before trying to write it.
if(retreiveddata.length > 0){
var myotherrange = anothersheet.getRange(2,1,retreiveddata.length, retreiveddata[0].length);
myotherrange.setValues(retreiveddata);
}
If you are re-using the same sheet for "anothersheet" on each execution, you may also want to call anothersheet.clear() to erase any existing data before you write your fresh results.
I would pass on the hashing approach altogether, comparing strings is comparing strings, so whether they are hashes or actual part numbers I wouldn't expect a significant difference.
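One further idea, not part of the answer as written and different from the MD5 approach in the question: the nested loop still compares each requested item against up to 18,000 master rows. Building an in-memory lookup table keyed by item number once, and then doing one table lookup per requested item, avoids the inner loop entirely. The sketch below shows the idea in Python with made-up sample rows; find_matches and key_col are hypothetical names. In Apps Script a plain object or a Map could play the role of the dictionary.

```python
def find_matches(master_rows, item_numbers, key_col=0):
    """Return the first master row for each requested item number, if any."""
    # Build the index once: key text -> first master row with that key.
    index = {}
    for row in master_rows:
        index.setdefault(str(row[key_col]), row)
    # One dictionary lookup per requested item instead of an inner loop.
    return [index[str(item)] for item in item_numbers if str(item) in index]


# Hypothetical sample data standing in for the two sheets.
master = [["1234", "Widget"], ["A-01/234-Name_1", "Gadget"], ["Z-9", "Gizmo"]]
wanted = ["A-01/234-Name_1", "missing", 1234]
matches = find_matches(master, wanted)
print(matches)
```

Like the break in the loop above, this keeps only the first master row per key. The collected rows can then be written in a single setValues() call as shown earlier.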

magento2 - How to get a product's stock status enabled/disabled?

I'm trying to get whether the product's stock status is instock/outofstock (Integers representing each state are fine. i don't necessarily need the "in stock"/"out of stock" strings per se).
I've tried various things to no avail.
1)
$inStock = $obj->get('Magento\CatalogInventory\Api\Data\StockItemInterface')->getisInStock();
// Magento\CatalogInventory\Api\Data\StockItemInterface :: getisInStock returns true no matter what, even for 0qty products
// summary: not useful. How do you get the real one?
2)
$inStock = $obj->get('\Magento\CatalogInventory\Api\StockStateInterface')->verifyStock($_product->getId());
// test results for "verifyStock":
// a 0 qty product is in stock
// a 0 qty product is out of stock
// summary: fail. find correct method, with tests.
3)
$stockItemRepository = $obj->get('Magento\CatalogInventory\Model\Stock\StockItemRepository');
$stockItem = $stockItemRepository->get($_product->getId());
$inStock = $stockItem->getIsInStock();
// Uncaught Magento\Framework\Exception\NoSuchEntityException: Stock Item with id "214"
// summary: is stockitem not 1to1 with productid?
The weird thing is, getting stock quantities works just fine.
$availability = (String)$obj->get('\Magento\CatalogInventory\Api\StockStateInterface')->getStockQty($_product->getId(), $_product->getStore()->getWebsiteId());
So why isn't getIsInStock working?
This was one way I did it.
$stockItemResource = $obj->create('Magento\CatalogInventory\Model\ResourceModel\Stock\Item');
// grab ALL stock items (i.e. object that contains stock information)
$stockItemSelect = $stockItemResource->getConnection()->select()->from($stockItemResource->getMainTable());
$stockItems = $stockItemResource->getConnection()->fetchAll($stockItemSelect);
$inStock = null;
foreach($stockItems as $k => $item) {
if ($item['product_id'] == $_productId) {
$inStock = $item['is_in_stock'];
break; // not breaking properly. 'qz' still prints
}
}
Notes on efficiency:
I'm sure there are other ways to target the single item specifically instead of getting all of them, either through a method or by adjusting the query that is passed in.
But this method is probably more efficient for large n, since it avoids the n+1 query problem.
You do still end up iterating through a lot, but a theta(n) iteration over a cached PHP variable is probably cheaper than querying the database n+1 times. I haven't tested this; it's just a hypothesis.
The returned structure is an array of arrays, where each sub-array (which is also a stock item) holds the product ID and the stock status value. Because the product ID and the stock status value are on the same level of nesting, we have no choice but to iterate through each sub-array, check the product_id, pick the matching sub-array, and grab the stock value. In short, we can't just use the result as a hashmap, since the array keys are not product IDs.
Ultimately, the efficiency of this depends on your use case. Rarely will you grab all stock items, unless you are doing mass exports. So the ultimate goal is really just to stay within the configured time limit that a request is allowed to run.

Summing specific fields in Matlab

How do I sum different fields? I want to sum all of the information for material(1) ...so I want to add 5+4+6+300 but I am unsure how. Like is there another way besides just doing material(1).May + material(1).June etc....
material(1).May= 5;
material(1).June=4;
material(1).July=6;
material(1).price=300;
material(2).May=10;
material(2).price=550;
material(3).May=90;
You can use structfun for this:
result = sum( structfun(@(x)x, material(1)) );
The inner portion (structfun(@(x)x, material(1))) runs a function on each individual field in the structure and returns the results in an array. By using the identity function (@(x)x) we just get the values. sum of course does the obvious thing.
A slightly longer way to do this is to access each field in a loop. For example:
fNames = fieldnames(material(1));
accumulatedValue = 0;
for ix = 1:length(fNames)
accumulatedValue = accumulatedValue + material(1).(fNames{ix});
end
result = accumulatedValue
For some users this will be easier to read, although for expert users the first will be easier to read. The result and (approximate) performance are the same.
I think Pursuit's answer is very good, but here is an alternative off the top of my head:
sum( cell2mat( struct2cell( material(1) )));

Wanted: a quicker way to check all combinations within a very large hash

I have a hash with about 130,000 elements, and I am trying to check all combinations within that hash for something (130,000 x 130,000 combinations). My code looks like this:
foreach $key1 (keys %CNV)
{
foreach $key2 (keys %CNV)
{
if (blablabla){do something that doesn't take as long}
}
}
As you might expect, this takes ages to run. Does anyone know a quicker way to do this? Many thanks in advance!!
-Abdel
Edit: Update on the blablabla.
Hey guys, thanks for all the feedback! Really appreciate it. I changed the foreach statement to:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (keys %{$CNV{$j}})
{
foreach $key2 (keys %{$CNV{$j}})
{
if (blablabla){do something}
}
}
}
The hash is now multidimensional:
$CNV{chromosome}{$start,$end}
I'll elaborate on what I'm exactly trying to do, as requested.
The blablabla is the following:
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
(($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1}))
)
In short: The hash elements represent a specific part of the DNA (a so called "CNV", think of it like a gene for now), with a start and an end (which are integers representing their position on that particular chromosome, stored in hashes with the same keys: %CNVstart & %CNVend). I'm trying to check for every combination of CNVs whether they overlap. If there are two elements that overlap within a family (I mean a family of persons whose DNA I have and read in; there is also a for-statement inside the foreach-statement that let's the program check this for every family, which makes it last even longer), I check whether they also have the same "copy number" (which is stored in another hash with the same keys) and print out the result.
Thank you guys for your time!
It sounds like Algorithm::Combinatorics may help you here. It's intended to provide "efficient generation of combinatorial sequences." From its docs:
Algorithm::Combinatorics is an efficient generator of combinatorial sequences. ... Iterators do not use recursion, nor stacks, and are written in C.
You could use its combinations sub-routine to provide all possible 2 key combos from your full set of keys.
On the other hand, Perl itself is written in C. So I honestly have no idea whether or not this would help at all.
Maybe by using concurrency? But you would have to be careful with what you do on a positive match so as not to run into problems.
E.g. take $key1, split it into $key1A and $key1B. Then create two separate threads, each containing "half of the loop".
I am not sure exactly how expensive it is to start new threads in Perl but if your positive action doesn't have to be synchronized I imagine that on matching hardware you would be faster.
Worth a try imho.
define blah blah.
You could write it like this:
foreach $key1 (keys %CNV)
{
if (blah1)
{
foreach $key2 (keys %CNV)
{
if (blah2){do something that doesn't take as long}
}
}
}
This pass should be O(2N) instead of O(N^2)
The data structure in the question is not a good fit to the problem. Let's try it this way.
use 5.010; # enables say
use Set::IntSpan::Fast::XS;
my @CNV;
for ([3, 7], [4, 8], [9, 11]) {
my $set = Set::IntSpan::Fast::XS->new;
$set->add_range(@{$_});
push @CNV, $set;
}
# The comparison is commutative, so we can cut the total number in half.
for my $index1 (0 .. -1+@CNV) {
for my $index2 (0 .. $index1) {
next if $index1 == $index2; # skip if it's the same CNV
say sprintf(
'overlap of CNV %s, %s at indices %d, %d',
$CNV[$index1]->as_string, $CNV[$index2]->as_string, $index1, $index2
) unless $CNV[$index1]->intersection($CNV[$index2])->is_empty;
}
}
Output:
overlap of CNV 4-8, 3-7 at indices 1, 0
We will not get the overlap of 3-7, 4-8 because it is a duplicate.
There's also Bio::Range, but it doesn't look so efficient to me. You should definitely get in touch with the bio.perl.org/open-bio people; chances are what you're doing has been done a million times before and they already have the optimal algorithm all figured out.
I think I found the answer :-)
Couldn't have done it without you guys though. I found a way to skip most of the comparisons I make:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (sort keys %{$CNV{$j}})
{
foreach $key2 (sort keys %{$CNV{$j}})
{
if (($CNVstart{$j}{$key2} < $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} < $CNVstart{$j}{$key1}))
{
next;
}
if (($CNVstart{$j}{$key2} > $CNVend{$j}{$key1}) && ($CNVend{$j}{$key2} > $CNVend{$j}{$key1}))
{
last;
}
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
(($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1}))
) {print some stuff out}
}
}
}
What I did is:
sort the keys of the hash for each foreach loop
do "next" if the CNVs with $key2 still haven't reached the CNV with $key1 (i.e. start2 and end2 are both smaller than start1)
and probably the most time-saving: end the foreach loop if the CNV with $key2 has overtaken the CNV with $key1 (i.e. start2 and end2 are both larger than end1)
Thanks a lot for your time and feedback guys!
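A side note on the overlap test itself: assuming every CNV has start <= end and both ends are inclusive, the four-clause condition used above reduces to two comparisons, because two ranges overlap exactly when each one starts no later than the other ends. The Python sketch below uses hypothetical function names and is not the poster's code. It checks by brute force over small integer ranges that the two forms agree.

```python
from itertools import product


def overlaps_four_clause(s1, e1, s2, e2):
    """The four-clause test from the question, written with chained comparisons."""
    return ((s2 <= s1 <= e2) or (s2 <= e1 <= e2) or
            (s1 <= s2 <= e1) or (s1 <= e2 <= e1))


def overlaps_short(s1, e1, s2, e2):
    """Two inclusive ranges overlap when each starts no later than the other ends."""
    return s1 <= e2 and s2 <= e1


def all_agree(limit=8):
    """Compare both tests on every pair of valid ranges with ends in 0..limit-1."""
    for s1, e1, s2, e2 in product(range(limit), repeat=4):
        if s1 <= e1 and s2 <= e2:
            if overlaps_four_clause(s1, e1, s2, e2) != overlaps_short(s1, e1, s2, e2):
                return False
    return True


print(all_agree())
```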
Your optimisation with taking out the j into the outer loop was good, but the solution is still far from optimal.
Your problem does have a simple O(N+M) solution where N is the total number of CNVs and M is the number of overlaps.
The idea is: you walk through the length of DNA while keeping track of all the "current" CNVs. If you see a new CNV start, you add it to the list and you know that it overlaps with all the other CNVs currently in the list. If you see a CNV end, you just remove it from the list.
I am not a very good perl programmer, so treat the following as a pseudo-code (it's more like a mix of Java and C# :)):
// input:
Map<CNV, int> starts;
Map<CNV, int> ends;
// temporary:
List<Tuple<int, bool, CNV>> boundaries;
foreach(CNV cnv in starts)
boundaries.add(starts[cnv], false, cnv);
foreach(CNV cnv in ends)
boundaries.add(ends[cnv], true, cnv);
// Sort first by position,
// then where position is equal we put "starts" first, "ends" last
boundaries = boundaries.OrderBy(t => t.first*2 + (t.second?1:0));
HashSet<CNV> current;
// main loop:
foreach((int position, bool isEnd, CNV cnv) in boundaries)
{
if(isEnd)
current.remove(cnv);
else
{
foreach(CNV otherCnv in current)
OVERLAP(cnv, otherCnv); // output of the algorithm
current.add(cnv);
}
}
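The pseudo-code above can be sketched in runnable form as follows. This is a Python illustration under stated assumptions, not the answerer's implementation: CNVs on one chromosome are given as a dictionary from a name to an inclusive (start, end) pair, and find_overlaps is a hypothetical name. Note that sorting the boundaries adds an O(N log N) term on top of the walk itself.

```python
def find_overlaps(intervals):
    """Return pairs of names whose inclusive (start, end) ranges overlap."""
    boundaries = []
    for name, (start, end) in intervals.items():
        boundaries.append((start, 0, name))  # 0 marks a start
        boundaries.append((end, 1, name))    # 1 marks an end
    # Sort by position; at equal positions starts (0) come before ends (1),
    # so ranges that merely touch are still reported as overlapping.
    boundaries.sort()

    current = set()  # CNVs that have started but not yet ended
    pairs = []
    for _position, is_end, name in boundaries:
        if is_end:
            current.remove(name)
        else:
            for other in current:
                pairs.append((other, name))  # output of the algorithm
            current.add(name)
    return pairs


# Same sample ranges as in the Set::IntSpan example earlier in the thread.
cnvs = {"a": (3, 7), "b": (4, 8), "c": (9, 11)}
print(find_overlaps(cnvs))
```

Each overlapping pair is reported once, at the moment the later of the two CNVs starts. To follow the question's layout, the function would be called once per chromosome.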
Now I'm not a Perl warrior, but based on the information given this is the same in any programming language: unless you sort the "hash" on the property you want to check and do a binary search, you won't improve lookup performance.
You could also, if possible, calculate which indexes in your hash have the properties you are interested in, but as you have given no information regarding such a possibility, this is perhaps not a solution.