I am working on creating an operations processor that takes a number of entries and performs some operations on them. The operations are chained: the result of one operation becomes the input for the next one.
For example:
inputs:
5 6 8
operation +
The idea would be to do the operation with the first two values, 5 and 6, and then use the result to do the operation with the third operand, 8 in this case.
To achieve this I'm thinking of using a queue, although I am also considering a recursive function.
Is there a better solution? If not, is a queue a good choice?
Thanks
If you are dealing with complex objects, this is almost a complete description of Pipeline_software, or more specifically Pipes & filters.
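For the simple numeric case described in the question, a left fold (reduceLeft) already expresses the chaining without an explicit queue or recursion. A minimal Scala sketch, with a purely illustrative operation table:

// reduceLeft chains the operation from left to right,
// so "+" over 5 6 8 computes (5 + 6) + 8 = 19.
val entries = List(5, 6, 8)
val ops: Map[String, (Int, Int) => Int] = Map("+" -> (_ + _), "*" -> (_ * _)) // illustrative
def process(entries: List[Int], op: String): Int = entries.reduceLeft(ops(op))
process(entries, "+") // 19

A queue would work too, but the fold keeps the left-to-right chaining explicit without managing intermediate state yourself.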
I am working on a fraudulent-transaction detection project which makes use of Spark and primarily uses a rule-based approach to risk-score incoming transactions. For this rule-based approach, several maps are created from the historical data to represent the various patterns in transactions, and these are then used later while scoring the transactions. Due to a rapid increase in data size, we are now modifying the code to generate these maps at each account level.
The earlier code was, for example:
createProfile(clientdata)
but now it becomes:
accountList.map(account=>createProfile(clientData.filter(s"""account=${account}""")))
Using this approach the profiles are generated, but since these operations happen sequentially it doesn't seem feasible.
Also, the createProfile function itself makes use of DataFrames and the SparkContext/SparkSession, which leads to the issue of not being able to send these tasks to the worker nodes, since (as I understand it) only the driver can access DataFrames and the SparkSession/SparkContext. Hence, the following code is not working:
import sparkSession.implicits._
val accountListRdd = accountList.toSeq.toDF("accountNumber")
accountListRdd.rdd.map(accountrow => createProfile(clientData.filter(s"""account=${accountrow.get(0).toString}""")))
The above code is not working but represents the logic for the desired output behaviour.
Another approach I am looking at is multithreading at the driver level using Scala Futures. But even in this scenario, many JVM objects are created in a single createProfile call, so increasing the number of threads, even if this approach works, can lead to a lot of JVM objects, which itself can lead to garbage collection and memory overhead issues.
Just to put a timing perspective on it: createProfile takes about 10 minutes on average for a single account, and we have 3000 accounts, so sequentially it will take many days. Even if multithreading achieves a factor of 10, it will still take days, so we need parallelism on the order of hundreds.
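For reference, the driver-level Future approach I am considering looks roughly like this (just a sketch; the pool size is an arbitrary choice):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Sketch only: submit a bounded number of createProfile jobs from the
// driver concurrently; each call still runs as a normal Spark job.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8)) // pool size is a guess

val profileJobs = accountList.map { account =>
  Future {
    createProfile(clientData.filter(s"""account=${account}"""))
  }
}
Await.result(Future.sequence(profileJobs), Duration.Inf)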
One thing that could have worked, had it existed, is something like a Spark groupBy within a groupBy kind of operation, where at the first level we could group by "account" and then do the other operations
(currently the issue is that a UDF won't be able to handle the kind of operations we want to perform).
Another option, if practically possible, is the way Spark Streaming works: it has a foreachRDD method and also a spark.streaming.concurrentJobs parameter, which allows processing of multiple RDDs in parallel. I am not sure how it works, but maybe that kind of implementation could help.
Above is the problem description and my current views on it.
Please let me know if anyone has any ideas about this! Also, I would prefer a logical change rather than a suggestion of a different technology.
So I have completed the Coursera course on Scala and have taken it upon myself to do a small POC showing off the multiprocessor capabilities of Scala.
I am looking at creating a very small example where an application can launch multiple tasks (each task will do some network-related queries, etc.) and I can show the usage of multiple cores as well.
Also there will be a thread that will listen on a specific port of a machine and spawn tasks based on what information it receives there.
Any suggestions on how to proceed with this kind of a problem?
I don't want to use AKKA now.
Parallel collections are perhaps the least-effort way to make use of multiple processors in Scala. They naturally lead into the question of how best to organise one's code and data to take advantage of the parallel operations, and, more importantly, of what doesn't get faster.
As a more concrete problem, suppose you have read a CSV file (or XML document, or whatever) and want to parse the data. If the records have already been split into a collection such as a List[String], you can then call .par to get a parallel version of it, and a subsequent .map will use all cores where possible. The resulting collection will retain the same ordering even though the operations were not executed sequentially. Consider summing the values on each line:
val in: List[String] = ...
val out = in.par.map { line =>
  val cols = line split ','
  cols.map(_.toInt).sum
}
So an in of List("1,2,3", "4,5,6") would result in an out of List(6, 15), as it would without the .par, but it'll run across multiple cores. Whether it's faster is another matter, since there is overhead in using parallel collections that likely makes a trivial example such as this slower. You will want to experiment to see where parallel collections are a benefit for your use cases.
There is a more extensive discussion and documentation of parallel collections at http://docs.scala-lang.org/overviews/parallel-collections/overview.html
What about the sleeping barber problem? You could implement it in a distributed manner over the network, with the barbers' service listening on one port and customers being spawned that request the barbers' services over the network.
I think that would be vast and interesting enough while not being impossible.
Then you can build on it and expand it as much as you want, such as adding specialised barbers for different things (haircuts or shaving) and so on from there. The sky (or, better, the thread-count cap) is the limit!
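If it helps to get started, here is a minimal single-machine sketch of that shape (the port, pool size and the "haircut" work are all made up): a bounded thread pool plays the barbers plus the waiting room, and each TCP connection is a customer.

import java.net.ServerSocket
import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

object BarberShop {
  def main(args: Array[String]): Unit = {
    val barbers = 3                                       // worker threads cutting hair
    val waitingRoom = new ArrayBlockingQueue[Runnable](5) // bounded queue of waiting customers
    val shop = new ThreadPoolExecutor(barbers, barbers, 0L, TimeUnit.SECONDS, waitingRoom)

    val door = new ServerSocket(9999)                     // customers connect here
    while (true) {
      val customer = door.accept()
      try {
        shop.execute(new Runnable {
          def run(): Unit = {
            Thread.sleep(1000)                            // the "haircut" (stand-in for real work)
            customer.getOutputStream.write("done\n".getBytes)
            customer.close()
          }
        })
      } catch {
        case _: RejectedExecutionException =>             // waiting room full: customer leaves
          customer.close()
      }
    }
  }
}

You can then point a few clients at it (e.g. nc localhost 9999) and watch the barbers stay busy while the waiting room fills up.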
I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX is suitable for my purposes.
What's the best and most cost efficient way to do it in the next day or so?
My thoughts are:
PostgreSQL using regex to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have regex to do the splitting (major accuracy isn't required here)
MapReduce with Hadoop
MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about
My experience with Hadoop consists of calculating Pi on three EC2 instances very very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce.
PostgreSQL I fear will take a long, long time, as it's not easily parallelisable.
Any other ways to do it? What's my best bet for getting it done in the next couple days?
MapReduce will work just fine, and you could probably do most of the input-output shuffling with Pig.
See http://arxiv.org/abs/1207.4371 for some algorithms.
Of course, to make sure you get a running start, you don't actually need to be using MapReduce for this task; just split the input yourself, write the simplest fast program that calculates ngrams for a single input file, and aggregate the ngram frequencies later.
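For illustration, that simplest-fast-program part could look something like this (word bigrams, naive tokenisation, file name taken from the command line):

import scala.io.Source

// Sketch: count word bigrams in one file; run an instance per split
// and merge the per-split counts afterwards.
object NgramCount {
  def main(args: Array[String]): Unit = {
    val n = 2
    val words = Source.fromFile(args(0)).getLines()
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .toVector

    val counts = words.sliding(n)
      .map(_.mkString(" "))
      .foldLeft(Map.empty[String, Long]) { (acc, gram) =>
        acc.updated(gram, acc.getOrElse(gram, 0L) + 1L)
      }

    counts.toSeq.sortBy(-_._2).take(20).foreach { case (gram, c) => println(s"$c\t$gram") }
  }
}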
Hadoop gives you two good things, which are the main ones in my opinion: parallel task running (map-only jobs) and distributed sort (the shuffle between map and reduce).
For the ngrams, it looks like you need both: parallel tasks (mappers) to emit the ngrams, and the shuffle to count the number of each ngram.
So I think Hadoop is an ideal solution here.
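A rough sketch of what that mapper/reducer pair could look like in Scala against the Hadoop mapreduce API (bigrams only, naive tokenisation; the Job wiring and main class are omitted):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Mapper: emit each bigram with a count of 1.
class NgramMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val words = value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty)
    words.sliding(2).filter(_.length == 2)
      .foreach(g => context.write(new Text(g.mkString(" ")), one))
  }
}

// Reducer: the shuffle groups identical bigrams; sum their counts.
class NgramReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}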
Are parallel collections intended to do operations with side effects? If so, how can you avoid race conditions?
For example:
var sum=0
(1 to 10000).foreach(n=>sum+=n); println(sum)
50005000
no problem with this.
But if I try to parallelize it, race conditions happen:
var sum=0
(1 to 10000).par.foreach(n=>sum+=n);println(sum)
49980037
Quick answer: don't do that. Parallel code should be parallel, not concurrent.
Better answer:
val sum = (1 to 10000).par.reduce(_+_) // depends on commutativity and associativity
See also aggregate.
The parallel case doesn't work because you don't use volatile variables, hence not ensuring visibility of your writes, and because you have multiple threads that each do the following:
read sum into a register
add the next value to the register
write the updated value back to memory
If two threads both do step 1, one after the other, and then proceed with the remaining steps in any order, they will end up overwriting one of the updates.
Use the @volatile annotation to ensure visibility of sum when doing something like this.
Even with @volatile, you will lose some increments due to the non-atomicity of the increment itself. You should use AtomicInteger and its incrementAndGet.
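For this particular sum that means addAndGet rather than incrementAndGet; a correct (if, as noted below, slow) version would be:

import java.util.concurrent.atomic.AtomicInteger

// Correct under .par: the read-modify-write is now a single atomic operation.
val sum = new AtomicInteger(0)
(1 to 10000).par.foreach(n => sum.addAndGet(n))
println(sum.get) // 50005000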
Although using atomic counters will ensure correctness, having shared variables here hinders performance greatly - your shared variable is now a performance bottleneck because every thread will try to atomically write to the same cache line. If you wrote to this variable infrequently, it wouldn't be a problem, but since you do it in every iteration, there will be no speedup here - in fact, due to cache-line ownership transfer between processors, it will probably be slower.
So, as Daniel suggested - use reduce for this.
I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that, for this kind of thing, you usually want something a bit specialized. Here are a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
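For reference, the core of that first (merge-sort-style) strategy on two sorted arrays of distinct keys is just a two-pointer walk; a sketch with Ints:

// Union of two sorted arrays of distinct ints in O(n + m);
// partition the inputs (via binary search) to parallelize it.
def mergeSorted(a: Array[Int], b: Array[Int]): Array[Int] = {
  val out = Array.newBuilder[Int]
  var i = 0
  var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) < b(j)) { out += a(i); i += 1 }
    else if (a(i) > b(j)) { out += b(j); j += 1 }
    else { out += a(i); i += 1; j += 1 } // present in both, keep once
  }
  while (i < a.length) { out += a(i); i += 1 }
  while (j < b.length) { out += b(j); j += 1 }
  out.result()
}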