First I would like to clarify that I have very little experience with Graph Theory and the proper algorithms to parse a directed graph, and that I've searched here on SO but didn't quite find what I was looking for. Hopefully you guys can help me :)
I have a large directed graph (around 3000 nodes) made up of several subgraphs of connected nodes; the subgraphs are not connected to each other. Here is a small representative graph of my data:
I am writing a Perl script to find all possible paths starting from each source node to the sink nodes and store them in an array of arrays. So, for this graph, the possible paths would be:
1,2,3,4,5,6
1,2,3,4,5,7
1,8,9,10
11,12,13
11,14
15,16,17
The way I've done this search in my script was to use the Graph module in the following steps:
Find all source nodes in the graph and store them in an array
Find all sink nodes in the graph and store them in an array
Compute all-pairs shortest paths with the Floyd-Warshall algorithm
Search the APSP Floyd-Warshall graph object for a path between each source node and each sink node. If there is a path, store it in an array of arrays. If there isn't a path, do nothing.
Here is the part of my script that does it:
#Getting all source nodes in the graph:
my @source_nodes = $dot_graph->source_vertices();
my @sink_nodes = $dot_graph->sink_vertices();
# Getting all possible paths between from source to sink nodes in the graph:
print "Calculating all possible overlaps in graph\n";
my $all_possible_paths = $dot_graph->APSP_Floyd_Warshall();
print "Done\n";
# print "Extending overlapping contigs\n";
my @all_paths;
foreach my $source (@source_nodes) {
    foreach my $sink (@sink_nodes) {
        my @path_vertices = $all_possible_paths->path_vertices($source, $sink);
        my $path_length = $all_possible_paths->path_length($source,$sink);
        #Saving only the paths that actually exist:
        push (@all_paths, \@path_vertices) unless (!@path_vertices);
    }
}
The problem with that is that it works fine for small graphs, but now that I have 3000 nodes, it would take a very very long time to finish (assuming that each path would take 1ms to be found, it would take 312.5 days to go through all of them). I know using the Floyd-Warshall algorithm to find all possible paths in the graph to only find the paths between sources and sinks is not efficient, but when I wrote the script I needed the results as soon as possible and my graphs were a lot smaller.
My question is: how can I find all paths that start at a source node and end in a sink node, without finding all possible paths first? Is that what is called a breadth-first or a depth-first search? How can I implement that in Perl (and, if possible, with the Graph module)? Any help would be awesome!
P.S.: To make the program run faster, I started breaking the initial big graph into its subgraphs and running the original script on each, forking the main loop that searches for the paths between sources and sinks using Parallel::ForkManager. What do you guys think of that approach?
You're not interested in finding the shortest path, so forget about all those shortest path algorithms.
You're interested in finding all paths. This is called tree traversal, and it can be performed depth-first or breadth-first. Since you're traversing the entire tree, it doesn't really matter which approach is taken. The following performs a depth-first search using recursion:
sub build_paths {
    my ($graph) = @_;
    my @paths;
    local *_helper = sub {
        my $v = $_[-1];
        my @successors = $graph->successors($v);
        if (@successors) {
            _helper(@_, $_)
                for @successors;
        } else {
            push @paths, [ @_ ];
        }
    };
    _helper($_)
        for $graph->source_vertices();
    return \@paths;
}
die if $graph->has_a_cycle;
my $paths = build_paths($graph);
Yes, it would be possible to parallelize the work, but I'm not writing that for you.
What concerns me the most is memory. Depending on the number of branches in the paths, you could easily end up running out of memory. Note that storing the paths as strings (of comma-separated values) would take less memory than storing them as arrays.
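For readers who want to see the traversal outside Perl, here is a minimal Python sketch of the same depth-first idea. It is not the Graph module's API: the `successors` dictionary, the `build_successors` and `iter_paths` names, and the edge list (inferred from the six paths listed in the question) are all assumptions for illustration. It yields paths one at a time instead of collecting them, which may help with the memory concern, and it assumes the graph has no cycles.

```python
def build_successors(edges):
    """Return (successors, sources) for a list of (u, v) directed edges."""
    successors = {}
    has_predecessor = set()
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        has_predecessor.add(v)
    # A source has outgoing edges but no incoming ones.
    sources = sorted(u for u in successors if u not in has_predecessor)
    return successors, sources


def iter_paths(successors, sources):
    """Yield every source-to-sink path, depth first. Assumes no cycles."""
    def walk(path):
        following = successors.get(path[-1], [])
        if not following:          # no successors: path[-1] is a sink
            yield list(path)       # copy, because `path` is reused below
            return
        for v in following:
            path.append(v)
            yield from walk(path)
            path.pop()

    for source in sources:
        yield from walk([source])


# Hypothetical edge list reconstructed from the paths in the question.
EXAMPLE_EDGES = [
    (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7),
    (1, 8), (8, 9), (9, 10),
    (11, 12), (12, 13), (11, 14),
    (15, 16), (16, 17),
]

if __name__ == "__main__":
    succ, srcs = build_successors(EXAMPLE_EDGES)
    for p in iter_paths(succ, srcs):
        print(",".join(map(str, p)))
```

Because `iter_paths` is a generator, each path can be written to a file or processed as it is produced. The recursion depth equals the path length, so very long paths might need an iterative version with an explicit stack.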
I'm just starting to learn Perl and I have to do an exercise containing references.
I have to create a program that builds a list with two-sided references from the values received as command-line arguments. At the start of the program, the list contains only one element, 0. A reference is used to walk through the list; initially it points to that only element, 0. The command-line arguments are read one by one, and each is added right after the element currently being referenced. After an argument is added, the reference moves one element to the right (it now points to the newly added element). There are also two special arguments, + and -: + moves the reference one element to the right, and - moves it one element to the left. It is also important that the reference never goes beyond the limits of the list.
The output is all the arguments in the correct order of the list.
Additional requirements are that the list must be created by using hashes, that contain links to neighbouring elements. Also, I cannot use arrays to store the whole list.
There are a few examples to make it easier to grasp the concept, this is the most useful one:
./3.pl A D - B C + E
0 A B C D E
All I've got now is just the start of the program, it is nowhere near done and doesn't compile, but I can't figure out where to go from there. I've tried looking for some information about two-sided references(I'm not sure if I'm translating it correctly), but I can't seem to find anything. Any information about two-sided references or any tips how to start writing this program properly would be very appreciated.
My code:
#!/usr/bin/perl
use strict;
use warnings;
my $A= {
    value=>'0',
    prev=>'undef',
    next=>'$B'
};
my $B= {
    value=>'0',
    prev=>'$A',
    next=>'$C'
};
my $C= {
    value=>'0',
    prev=>'$B',
    next=>'undef'
};
for my $smbl(0..@$ARGV) {
    $A-> {value} = $ARGV[$smbl];
    $Α-> {next} = $ARGV[$smbl+1];
}
First of all, what you are building is called a doubly linked list.
Let me tell you the biggest trick for working with linked lists: Create a dummy "head" node and a dummy "tail" node. You won't print their values, but having them will greatly reduce the number of special cases in your code, making it so much simpler!
At the core, you will have three "pointers" (references).
$head points to the first node of the list.
$tail points to the last node of the list.
$cursor initially points to the node in $tail. New nodes will be inserted before this node.
When processing +, there are two different situations you need to handle:
$cursor == $tail: Error! Cursor moved beyond end of list.
$cursor != $tail: Point $cursor to the node following the one it references.
When processing -, there are two different situations you need to handle:
$cursor->{prev} == $head: Error! Cursor moved beyond start of list.
$cursor->{prev} != $head: Point $cursor to the node preceding the one it references.
When inserting nodes, no checks need to be performed because of the dummy nodes!
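The dummy-node idea can be sketched as follows. This is an illustration in Python rather than Perl, with dictionaries standing in for hashes; the names `new_node`, `insert_before`, and `build_list` are hypothetical. Two choices here are assumptions, not part of the answer: the initial 0 is stored as a real node between the dummies, so the left-bound check looks one node further back than `$cursor->{prev} == $head`, and out-of-range moves are silently ignored instead of being treated as errors.

```python
def new_node(value):
    """A list node: a dict holding a value and links to its neighbours."""
    return {"value": value, "prev": None, "next": None}


def insert_before(node, value):
    """Link a new node in front of `node`; no special cases thanks to the dummies."""
    fresh = new_node(value)
    fresh["prev"] = node["prev"]
    fresh["next"] = node
    node["prev"]["next"] = fresh
    node["prev"] = fresh
    return fresh


def build_list(args):
    """Process the arguments and return the list values in order."""
    head, tail = new_node(None), new_node(None)   # dummy nodes, never printed
    head["next"], tail["prev"] = tail, head
    insert_before(tail, "0")                      # the initial element
    cursor = tail                                 # new nodes go before the cursor

    for arg in args:
        if arg == "+":
            if cursor is not tail:                # stay inside the list
                cursor = cursor["next"]
        elif arg == "-":
            # The referenced element is cursor["prev"]; it may only move left
            # if it has a real predecessor (assumption: 0 is a real node).
            if cursor["prev"]["prev"] is not head:
                cursor = cursor["prev"]
        else:
            insert_before(cursor, arg)

    values = []
    node = head["next"]
    while node is not tail:                       # walk the links, skipping dummies
        values.append(node["value"])
        node = node["next"]
    return values


if __name__ == "__main__":
    print(" ".join(build_list("A D - B C + E".split())))
```

The final `values` list is used only to produce the output; the list itself lives entirely in the linked dictionaries.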
I have a Matlab structure, A, that has 3 fields. The first field is SignalName and the third field is Children. Children can either be empty or a Struct array with the same fields, and so on, to an arbitrary depth not immediately known to the user. SignalName is a character array which is the name of a signal.
I have written a recursive function (in Matlab) to retrieve all of the SignalName values for structures that have no Children, and it's pretty slick (I think), but I need to know the absolute path taken to arrive at said SignalName. I cannot figure this out in Matlab.
As an example:
A.SignalName = 'Things'
A.Children = <22x1 struct>
A.Children(1).SignalName = 'Places'
A.Children(1).Children = [8x1 struct]
This goes on for an unknown depth, and the length of the struct arrays is not immediately known. It is easy via recursion to 'dive' down and get all of the SignalNames belonging to Children with no further Children, but how do I trace the route I used to get there? My function would ideally return results as a signal name, and the path taken to said signal.
In my experience with other languages, it seems like something like A* or Breadth-First might help, but I'm not exactly searching for something. I want simply to map every node and the path to it, and I'm not sure how to do that with the strange data-structure I'm given.
Thanks for any help you all can provide!
EDIT: I wanted to provide the code to hopefully shed light on my issue. I can get the paths down to the deepest node, but any other nodes at that level leave me without a complete path to that specific location. This is what I need help with. I am using '*' as my delimiter for a regexp in my post-processing script to break up the strings in PATHS.
For two nodes at the same depth, I might get a full path like 'A.B.C.D.Signal1' for the first node, but the second would give me a path of 'D.Signal2', when what I need is 'A.B.C.D.Signal2'. If the path was the same to the 'D' level with every signal, I would just copy the path over, but I have multiple branches in this struct from each level, and I go 4 or 5 levels deep.
function [NAMES,PATHS]=FindSignals(A,TMP,TMP2)
    persistent SigName;
    persistent path;
    SigName = TMP;
    path = TMP2;
    if(~isempty(A.Children)) % If this struct has Children
        for i = 1:length(A.Children) % Iterate through the Children
            nextStruct = A.Children(i);
            path = strcat([path '*' A.SignalName]);
            [NAMES,PATHS] = FindSignals(nextStruct,SigName,path); % Recurse
        end
    else % If this struct has no Children
        path = strcat([path '*' A.SignalName '*']); % Finish the path to Child
        SigName = char(SigName,A.SignalName); % Grab the signal name
        NAMES = SigName;
        PATHS = path;
    end
end
Again, thanks for any help!
EDIT: 12/14/2015 - I'm still completely stuck on this. Could anyone please take a peek at my source code above? I am unsure how to tack on an absolute path to the recursive function call and allow it to be a full path for each node at the same depth, and then reset to the appropriate depth once I move up or down the 'tree'. Thanks.
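One way to keep the full path for every leaf is to pass the path so far as an ordinary argument instead of storing it in persistent state, so each recursive call works on its own copy and sibling nodes cannot overwrite each other's prefix. The sketch below shows the idea in Python rather than MATLAB. The dictionary layout (`SignalName`, `Children`) mirrors the struct described above, while the `find_signals` name, the `.` separator, and the sample tree are assumptions for illustration.

```python
def find_signals(node, prefix=()):
    """Return a list of (signal_name, full_path) for every childless node."""
    # Build this node's path from the caller's prefix; the prefix itself is
    # never modified, so siblings all start from the same parent path.
    path = prefix + (node["SignalName"],)
    children = node.get("Children") or []
    if not children:
        return [(node["SignalName"], ".".join(path))]
    results = []
    for child in children:
        results.extend(find_signals(child, path))
    return results


# Hypothetical tree with two leaves at the same depth and one shallower leaf.
SAMPLE = {
    "SignalName": "A",
    "Children": [
        {"SignalName": "B", "Children": [
            {"SignalName": "Signal1", "Children": []},
            {"SignalName": "Signal2", "Children": []},
        ]},
        {"SignalName": "Signal3", "Children": []},
    ],
}

if __name__ == "__main__":
    for name, path in find_signals(SAMPLE):
        print(name, path)
```

The same pattern should carry over to MATLAB by making the path a regular input argument and accumulating the results in the return values, without `persistent` variables.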
The related problem comes from the power Grid in Germany. I have a network of substations, which are connected according to the Lines. The shortest way from point A to B was calculated using the graphshortestpath function. The result is a path with the used substation ID's. I am interested in the Line ID's though, so I have written a sequential code to figure out the used Line_ID's for each path.
This algorithm uses two for loops. The first for-loop to access the path from a cell array, the second for-loop looks at each connection and searches the Line_ID from the array.
Question: Is there a better way of coding this? I am looking for the Line_ID's, graphshortestpath only returns the node ID's.
Here is the main code:
for i = i_entries
    path_i = LKzuLK_path{i_entries};
    if length(path_i) > 3 %If length <=3 no lines are used.
        id_vb = 2:length(path_i) - 2;
        for id = id_vb
            node_start = path_i(id);
            node_end = path_i(id+1);
            idx_line = find_line_idx(newlinks_vertices, node_start, ...
                node_end);
            Zuordnung_LKzuLK_pathLines(ind2sub(size_path,i),idx_line) = true;
        end
    end
end
Note: The first and last entries of path_i are area IDs, so they are skipped when searching for the Line_IDs.
function idx_line = find_line_idx(newlinks_vertices, v_id_1, v_id_2)
    % newlinks_vertices includes the Line_ID, and then the two connecting substations
    % Mirror v_id's in newlinks_vertices:
    check_links = [newlinks_vertices; newlinks_vertices(:,1), newlinks_vertices(:,3), newlinks_vertices(:,2)];
    tmp_dist1 = find(check_links(:,2) == v_id_1);
    tmp_dist2 = find(check_links(tmp_dist1,3) == v_id_2,1);
    tmp_dist3 = tmp_dist1(tmp_dist2);
    idx_line = check_links(tmp_dist3,1);
end
Note: I have already tried to shorten the first find-search routine by indexing the links list. This step returns a short list containing only the relevant link entries, which spares the algorithm the first and most time-consuming find call. The result wasn't much better: the calculation time was still approximately 7 hours for 401*401 connections, which is too long to be usable.
I would look into Dijkstra's algorithm to get a faster implementation. This is what Matlab's graphshortestpath uses by default. The linked wiki page probably explains it better than I ever could and even lays it out in pseudocode!
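Separately from the shortest-path computation, the per-connection search in find_line_idx could be replaced by a lookup table that is built once. The sketch below shows that idea in Python rather than MATLAB. The function names and the sample data are hypothetical; the row layout (Line_ID, substation, substation) and the skipping of the first and last path entries follow the description in the question. If two lines connect the same pair of substations, this version keeps the first one listed, which may not match the original in every case.

```python
def build_line_index(links):
    """Map each unordered substation pair to its Line_ID.

    `links` is an iterable of (line_id, node_a, node_b) rows.
    """
    index = {}
    for line_id, node_a, node_b in links:
        # frozenset makes the key direction-independent, replacing the
        # mirrored copy of the table; setdefault keeps the first line found.
        index.setdefault(frozenset((node_a, node_b)), line_id)
    return index


def lines_on_path(path, index):
    """Return the Line_IDs used between consecutive substations of `path`.

    The first and last entries are area IDs and are skipped.
    """
    substations = path[1:-1]
    return [index[frozenset(pair)]
            for pair in zip(substations, substations[1:])]


# Hypothetical sample data.
LINKS = [(100, 1, 2), (101, 2, 3), (102, 3, 4)]

if __name__ == "__main__":
    idx = build_line_index(LINKS)
    print(lines_on_path([900, 1, 2, 3, 4, 901], idx))
```

Each lookup is then a constant-time dictionary access instead of a scan over the whole links table. In MATLAB, a `containers.Map` or a sparse matrix indexed by the two substation IDs could play the same role.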
Especially interesting for me as a PHP/Perl-beginner is this site in Switzerland:
see this link:http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308
Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it.
What we have so far: the harvesting task should be no problem if I use WWW::Mechanize, particularly for doing the form-based search and selecting the individual entries. I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, and the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow_link() function to get to the actual entries, using the following call.
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our mechanized browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the results it should follow next.
If we have several result pages, we would use the same trick to traverse them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it provides powerful DOM selection methods (using XPath).
The actual looping through the pages should be doable in a few lines of Perl, 20 lines at most and likely fewer.
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: In principle, we could implement the same algorithm with a single while loop if we use the back() function smartly.
Can you give me a hint for the beginning, namely the processing of the entry pages, doing this in Perl with WWW::Mechanize?
"Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it."
Not true. See http://perlmonks.org/?node_id=905767
"The data is copyrighted even though it is made available freely: "Downloading or copying of texts, illustrations, photos or any other data does not entail any transfer of rights on the content." (and again, in German, as you've been scraping some other German list to spam before)."
I've started a little pet project to parse log files for Team Fortress 2. The log files have an event on each line, such as the following:
L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")
Notice there are some common parts of the syntax for log files. Names, for example consist of four parts: the name, an ID, a Steam ID, and the team of the player at the time. Rather than rewriting this type of regular expression, I was hoping to abstract this out slightly.
For example:
my $name = qr/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my $kill = qr/"$name" killed "$name"/;
This works nicely, but the regular expression now returns results that depend on the format of $name (breaking the abstraction I'm trying to achieve). The example above would match as:
my ($name_1, $id_1, $steam_1, $team_1, $name_2, $id_2, $steam_2, $team_2)
But I'm really looking for something like:
my ($player1, $player2)
Where $player1 and $player2 would be tuples of the previous data. I figure the "killed" event doesn't need to know exactly about the player, as long as it has information to create the player, which is what these tuples provide.
Sorry if this is a bit of a ramble, but hopefully you can provide some advice!
I think I understand what you are asking. What you need to do is reverse your logic. First use a regex to split the string into two parts, then extract your tuples from each part. That way the outer regex doesn't need to know about the name format, and you just have two generic player-parsing regexes. Here is a short example:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")';
my ($player1_string, $player2_string) = $log =~ m/(".*") killed (".*?")/;
my @player1 = $player1_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my @player2 = $player2_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
print STDERR Dumper(\@player1, \@player2);
Hope this is what you were looking for.
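The same two-step strategy can be sketched in Python for comparison: one pattern isolates the two quoted player strings around "killed", and a second, reusable pattern turns each string into a tuple. The names `PLAYER`, `KILL`, and `parse_kill` are hypothetical, and the sketch assumes player names do not contain double quotes.

```python
import re

# Reusable player pattern: name, numeric ID, Steam ID, team.
PLAYER = re.compile(r'(.*)<(\d+)><(.*)><(Red|Blue)>')
# The event pattern only knows that two quoted strings surround "killed".
KILL = re.compile(r'"([^"]+)" killed "([^"]+)"')


def parse_player(text):
    """Return (name, id, steam_id, team), or None if `text` is not a player."""
    match = PLAYER.fullmatch(text)
    if match is None:
        return None
    name, uid, steam_id, team = match.groups()
    return (name, int(uid), steam_id, team)


def parse_kill(line):
    """Return (attacker, victim) tuples for a kill event, or None."""
    match = KILL.search(line)
    if match is None:
        return None
    attacker, victim = (parse_player(raw) for raw in match.groups())
    if attacker is None or victim is None:
        return None
    return attacker, victim


LOG_LINE = ('L 10/23/2009 - 21:03:43: '
            '"Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed '
            '"monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" '
            '(customkill "headshot") (attacker_position "1848 813 94") '
            '(victim_position "1483 358 221")')

if __name__ == "__main__":
    print(parse_kill(LOG_LINE))
```

With this split, the kill event only depends on `parse_player` returning something that describes a player, so the player format can change without touching the event pattern.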
Another way to do it, but the same strategy as dwp's answer:
my @players =
    map { [ /(.*)<(\d+)><(.*)><(Red|Blue)>/ ] }
    $log_text =~ /"([^\"]+)" killed "([^\"]+)"/
;
Your log data contains several items of balanced text (quoted and parenthesized), so you might consider Text::Balanced for parts of this job, or perhaps a parsing approach rather than a direct attack with regex. The latter might be fragile if the player names can contain arbitrary input, for example.
Consider writing a Regexp::Log subclass.