Count Triangles in Scala - Spark

I am trying to get into data analytics using Spark with Scala. My question is: how do I get the triangles in a graph? I do not mean the triangle count that comes with GraphX, but the actual nodes that make up each triangle.
Suppose we have a graph file. I was able to compute the triangles in plain Scala, but the same technique does not carry over to Spark, since there I have to use RDD operations.
The data I pass to the function is a List of adjacency records, each holding a source node and the List of its destinations, e.g. Adj(5, List(1,2,3)), Adj(4, List(9,8,7)), ...
My Scala version is this:
// Assumed shape of Adj (its definition is not shown in the question)
case class Adj(src: Int, dst: List[Int])
def printTriangles(paths: List[Adj]): Unit =
  for {
    i <- paths; j <- paths; k <- paths
    if i.src != j.src && i.src != k.src && j.src != k.src
    if i.dst.contains(j.src) && j.dst.contains(k.src) && k.dst.contains(i.src)
  } println((i.src, j.src, k.src)) // 3 nodes that make a triangle
And the output would be something like:
(1,2,3)
(4,5,6)
(2,5,6)
In conclusion, I want the same output but executed in a Spark environment. In addition, I am looking for a more efficient way to hold the adjacency information, such as mapping to keys and then reducing by key. Since Spark requires a quite different approach to each problem (big data operations), I would be grateful if you could explain the way of thinking and briefly describe the functions you used.
Thank you.
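One way to think about the key-mapping idea raised above is sketched below in plain Python rather than Spark, so it only illustrates the logic. It assumes the adjacency data is a list of (src, destinations) pairs and that a triangle means a directed cycle a -> b -> c -> a, as in the Scala version. The function name directed_triangles is hypothetical. In Spark the three steps would roughly correspond to a flatMap into edge pairs, a grouping by key, and a join, but that mapping is an assumption, not tested Spark code.

```python
from collections import defaultdict


def directed_triangles(adjacency):
    """Return each directed 3-cycle a -> b -> c -> a once,
    rotated so that the smallest node comes first."""
    # Step 1 ("map"): flatten the adjacency lists into (src, dst) edge pairs.
    edges = [(src, dst) for src, dsts in adjacency for dst in dsts]

    # Step 2 ("group by key"): index the outgoing neighbours of every node.
    out = defaultdict(set)
    for src, dst in edges:
        out[src].add(dst)

    # Step 3 ("join"): extend each edge a -> b with an edge b -> c and
    # keep the path only if an edge c -> a closes the cycle.
    found = set()
    for a, b in edges:
        for c in out.get(b, ()):
            if len({a, b, c}) == 3 and a in out.get(c, ()):
                cycle = (a, b, c)
                k = cycle.index(min(cycle))
                # Rotating to a canonical start removes the three
                # rotations of the same cycle.
                found.add(cycle[k:] + cycle[:k])
    return sorted(found)
```

Unlike the triple loop over all node combinations, this only follows edges that exist, so the work depends on the number of two-step paths rather than on the cube of the node count.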

Related

Flink: How to write DataSet to a variable instead of to a file

I have a flink batch program written in scala using the DataSet API which results in a final dataset I am interested in. I would like to get that dataset as a variable or value (e.g. a list or sequence of String) within my program, without having to write it to any file. Is it possible?
I have seen that flink allows for collection data sinks in order to debug (the only example in their doc is in Java). However, this is only allowed in local execution, and anyway I don't know its equivalent in Scala. What I would like is to write the final resulting dataset after the whole flink parallel execution is done to a program value or variable.
First, try this for the scala version of collection data sink:
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.LocalCollectionOutputFormat;
.
.
val env = ExecutionEnvironment.getExecutionEnvironment
// Create a DataSet from a list of elements
val words = env.fromElements("w1","w2", "w3")
var outData:java.util.List[String]= new java.util.ArrayList[String]()
words.output(new LocalCollectionOutputFormat(outData))
// execute program
env.execute("Flink Batch Scala")
println(outData)
Second, if your dataset fits in the memory of a single machine, why do you need a distributed processing framework? I think you should think more about your use case and try to use the right transformations on your dataset.
I used Flink 1.72 with Scala 2.12, for a streaming prediction using an SVM that I wrapped up in a Model class. I think the most correct answer is to use collect(), which returns a Seq. I found this answer after searching for hours; I got the idea from Flink Git - Line 95
var temp_jaringan : DataSet[(Vector,Double)] = model.predict_jaringan(value)
temp_jaringan.print()
var temp_produk : DataSet[(Vector,Double)] = model.predict_produk(value)
temp_produk.print()
var result_jaringan : Seq[(Vector,Double)] = temp_jaringan.collect()
var result_produk : Seq[(Vector,Double)] = temp_produk.collect()
if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == 1.0) {
  println("Keduanya") // "both"
} else if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == -1.0) {
  println("Jaringan") // "network"
} else if (result_jaringan(0)._2 == -1.0 && result_produk(0)._2 == 1.0) {
  println("Produk") // "product"
} else {
  println("Bukan Keduanya") // "neither"
}
It may vary between versions. After using Flink and searching for Flink material for weeks, even months, for my final graduation project, I know that Flink needs more documentation and tutorials, especially for beginners like me.
Anyway, correct me if I'm wrong. Thanks!

Is it possible to load word2vec pre-trained available vectors into spark?

Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to do the loading and the operations from scratch?
In the post Load Word2Vec model in Spark, Tom Lous suggests converting the bin file to txt and starting from there. I already did that, but what comes next?
In a question I posted yesterday I got an answer that models in Parquet format can be loaded in spark, thus I'm posting this question to be sure that there is no other option.
Disclaimer: I'm pretty new to spark, but the below at least works for me.
The trick is figuring out how to construct a Word2VecModel from a set of word vectors as well as handling some of the gotchas in trying to create the model this way.
First, load your word vectors into a Map. For example, I have saved my word vectors to a parquet format (in a folder called "wordvectors.parquet") where the "term" column holds the String word and the "vector" column holds the vector as an array[float], and I can load it like so in Java:
// Loads the dataset with the "term" column holding the word and the "vector" column
// holding the vector as an array[float]
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");
//convert dataset to a map.
Map<String, List<Float>> vectorMap = Arrays.stream((Row[])vectorModel.collect())
.collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));
//convert to the format that the word2vec model expects float[] rather than List<Float>
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getKey, entry -> (float[]) Floats.toArray(entry.getValue())));
//need to convert to scala immutable map because that's what word2vec needs
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);
private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    final List<Tuple2<K,V>> list = pFromMap.entrySet().stream()
        .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
    Seq<Tuple2<K,V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}
Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and do so in a weird way. Otherwise it defaults to 100 and you get an error when trying to invoke .transform(). Here is a way I've found that works, not sure if everything is necessary:
// not used for fitting, only used for setting the vector size param (not sure if this is needed or if result.set is enough)
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);
Word2VecModel result = new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);
Now you should be able to use result.transform() like you would with a self-trained model.
I haven't tested other Word2VecModel functions to see if they work correctly, I only tested .transform().

Apache Spark: multiple outputs in one map task

TL;DR: I have a large file that I iterate over three times to get three different sets of counts out. Is there a way to get three maps out in one pass over the data?
Some more detail:
I'm trying to compute PMI between words and features that are listed in a large file. My pipeline looks something like this:
val wordFeatureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield ((word, feature), 1)
})
And then I repeat this to get word counts and feature counts separately:
val wordCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (word, 1)
})
val featureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (feature, 1)
})
(I realize I could just iterate over wordFeatureCounts to get the wordCounts and featureCounts, but that doesn't answer my question, and looking at running times in practice I'm not sure it's actually faster to do it that way. Also note that there are some reduceByKey operations and other stuff that I do with this after the counts are computed that aren't shown, as they aren't relevant to the question.)
What I would really like to do is something like this:
val (wordFeatureCounts, wordCounts, featureCounts) = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  val wfCounts = for (feature <- features) yield ((word, feature), 1)
  val wCounts = for (feature <- features) yield (word, 1)
  val fCounts = for (feature <- features) yield (feature, 1)
  ??.setOutput1(wfCounts)
  ??.setOutput2(wCounts)
  ??.setOutput3(fCounts)
})
Is there any way to do this with spark? In looking for how to do this, I've seen questions about multiple outputs when you're saving the results to disk (not helpful), and I've seen a bit about accumulators (which don't look like what I need), but that's it.
Also note that I can't just yield all of these results in one big list, because I need three separate maps out. If there's an efficient way to split a combined RDD after the fact, that could work, but the only way I can think of to do this would end up iterating over the data four times, instead of the three I currently do (once to create the combined map, then three times to filter it into the maps I actually want).
It is not possible to split an RDD into multiple RDDs. This is understandable if you think about how this would work under the hood. Say you split RDD x = sc.textFile("x") into a = x.filter(_.head == 'A') and b = x.filter(_.head == 'B'). Nothing happens so far, because RDDs are lazy. But now you print a.count. So Spark opens the file, and iterates through the lines. If the line starts with A it counts it. But what do we do with lines starting with B? Will there be a call to b.count in the future? Or maybe it will be b.saveAsTextFile("b") and we should be writing these lines out somewhere? We cannot know at this point. Splitting an RDD is just not possible with the Spark API.
But nothing stops you from implementing something if you know what you want. If you want to get both a.count and b.count you can map lines starting with A into (1, 0) and lines with B into (0, 1) and then sum up the tuples elementwise in a reduce. If you want to save lines with B into a file while counting lines with A, you could use an aggregator in a map before filter(_.head == 'B').saveAsTextFile.
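The tuple trick described above can be sketched in plain Python as follows. This is only an illustration of the map-then-reduce logic, not Spark code; the function name count_a_and_b is hypothetical, and functools.reduce stands in for the reduce step an RDD would perform.

```python
from functools import reduce


def count_a_and_b(lines):
    """Count lines starting with 'A' and lines starting with 'B' in one pass."""
    # Map step: every line becomes a pair of indicator values,
    # (1, 0) for an 'A' line, (0, 1) for a 'B' line, (0, 0) otherwise.
    pairs = (
        (1 if line.startswith("A") else 0, 1 if line.startswith("B") else 0)
        for line in lines
    )
    # Reduce step: add the pairs elementwise.
    return reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs, (0, 0))
```

Because the input is consumed through a single generator, the data is read only once even though two counts come out.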
The only generic solution is to store the intermediate data somewhere. One option is to just cache the input (x.cache). Another is to write the contents into separate directories in a single pass, then read them back as separate RDDs. (See Write to multiple outputs by key Spark - one Spark job.) We do this in production and it works great.
This is one of the major disadvantages of Spark compared with traditional map-reduce programming. An RDD/DF/DS can be transformed into another RDD/DF/DS, but you cannot map an RDD into multiple outputs. To avoid recomputation you need to cache the results in some intermediate RDD and then run multiple map operations to generate the multiple outputs. The caching solution works if you are dealing with reasonably sized data. But if the data is large compared to the available memory, the intermediate outputs will be spilled to disk and the advantage of caching will not be that great. Check out the discussion here - https://issues.apache.org/jira/browse/SPARK-1476. This is an old Jira issue, but still relevant. Check out the comment by Mridul Muralidharan.
Spark needs to provide a solution where a map operation can produce multiple outputs without the need to cache. It may not be elegant from the functional programming perspective but I would argue, it would be a good compromise to achieve better performance.
I was also quite disappointed to see that this is a hard limitation of Spark over classic MapReduce. I ended up working around it by using multiple successive maps in which I filter out the data I need.
Here's a schematic toy example that performs different calculations on the numbers 0 to 49 and writes both to different output files.
from functools import partial
import os
from pyspark import SparkContext
# Generate mock data
def generate_data():
    for i in range(50):
        yield 'output_square', i * i
        yield 'output_cube', i * i * i
# Map function to siphon data to a specific output
def save_partition_to_output(part_index, part, filter_key, output_dir):
    # Initialise output file handle lazily to avoid creating empty output files
    file = None
    try:
        for key, data in part:
            if key != filter_key:
                # Pass through non-matching rows and skip
                yield key, data
                continue
            if file is None:
                file = open(os.path.join(output_dir, '{}-part{:05d}.txt'.format(filter_key, part_index)), 'w')
            # Consume data
            file.write(str(data) + '\n')
        yield from []
    finally:
        if file is not None:
            file.close()
def main():
    sc = SparkContext()
    rdd = sc.parallelize(generate_data())
    # Repartition to number of outputs
    # (not strictly required, but reduces number of output files).
    #
    # To split partitions further, use repartition() instead or
    # partition by another key (not the output name).
    rdd = rdd.partitionBy(numPartitions=2)
    # Map and filter to first output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_square', output_dir='.'))
    # Map and filter to second output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_cube', output_dir='.'))
    # Trigger execution.
    rdd.count()
if __name__ == '__main__':
    main()
This will create two output files output_square-part00000.txt and output_cube-part00000.txt with the desired output splits.

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while(stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    //Some code that returns zero or more outputs per input
  ).reduceByKey(_+_).cache
  totalResult = totalResult.union(stageResult)
} while(stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this should be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, then I would indeed cache everything along the way as it stores the data of each stageResult, unioning all of those data points at the end. In fact, each union does not rely on the past as that is not the semantics of RDD.union, it merely puts them together at the end. You could just as easily change your code to use a val due to RDD immutability.
As a final note, maybe the DAG visualization will help understand why there would not be recursive ramifications:

Finding mean and standard deviation of a large dataset

I have about 1500 files on S3. Each file looks like this:
Format :
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val FileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?
While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
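Pick out every score on each line (a line holds several ItemId:Score pairs, so split off the user ID first and then split the pairs):
val scores: RDD[Double] = lines.flatMap(_.split("\\s+").last.split(",").map(_.split(":").last.toDouble)).cache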
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
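For reference, the same calculation together with the parsing of the line format from the question can be sketched in plain Python. This is a single-machine illustration under the assumption that each line is a user ID, whitespace, and then comma-separated ItemId:Score pairs; the function names parse_scores and mean_and_stddev are hypothetical.

```python
import math


def parse_scores(line):
    """Extract every score from a 'UserId<whitespace>ItemId:Score,...' line."""
    _user, items = line.split(None, 1)
    return [float(item.split(":")[1]) for item in items.strip().split(",")]


def mean_and_stddev(lines):
    """Two-pass mean and sample standard deviation over all scores."""
    scores = [s for line in lines for s in parse_scores(line)]
    n = len(scores)
    mean = sum(scores) / n
    # Sample standard deviation: divide the squared deviations by n - 1,
    # so at least two scores are needed.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance)
```

Calling mean_and_stddev on the two sample lines from the question treats all five scores as one population, which matches what the flattened scores RDD represents.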
This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions) since Spark version 1.6.0.
I use Apache Commons Math for this stuff (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream values through the SummaryStatistics class, so you aren't limited by the size of memory. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to iterate through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
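The streaming idea can be illustrated with a small accumulator that never holds more than three numbers in memory. The sketch below uses Welford's update rule in plain Python; it is not the Commons Math API, and the class name RunningStats is hypothetical.

```python
import math


class RunningStats:
    """Accumulates count, mean and variance one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        # Welford's update: adjust the mean first, then the squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.n < 2:
            return float("nan")
        return math.sqrt(self.m2 / (self.n - 1))
```

Each score is passed to add() as it is read from a file, so the 1500 files can be processed one line at a time without building an array of all values.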
I don't think that storage space should be an issue, so I would try putting all of the values into an array of doubles, adding them up, and dividing by the number of elements in the array to get the mean. Then add up the squares of the differences between each value and the mean, divide that by the number of elements, and take the square root.