I'm currently working on an RDD containing protein names and their domains.
Example: 'PO7K9I' as the protein name and 'IPR036291;IPR0023' as the domains.
I would like to compute the similarity between protein domains. For example, similarity = 1 means the two proteins have exactly the same domains; similarity = 0.75 means the two proteins share only 3 common domains out of 4 (e.g., domains {A,B,C,D} vs. {A,B,C,E}); and similarity = 0 means no common domain was found between the two proteins.
Can you please help me?
This is what my RDD looks like:
(P25720,IPR002425;IPR036291;IPR020904;IPR0023)
(Q9X2F4,IPR006047;IPR013780;IPR0178)
(Q29228,IPR016161;IPR016163;IPR016160;IPR029510;IPR016162;IPR0155)
(A5N376,IPR000821;IPR009006;IPR011079;IPR001608;IPR020622;IPR0290)
(Q5HG16,IPR001792;IPR036046;IPR0179)
Can you try it this way?
val rdd = sc.parallelize(Seq(
  ("P25720", "IPR002425;IPR036291;IPR020904;IPR0023"),
  ("Q9X2F4", "IPR006047;IPR013780;IPR0178"),
  ("Q29228", "IPR016161;IPR016163;IPR016160;IPR029510;IPR016162;IPR0155"),
  ("A5N376", "IPR000821;IPR009006;IPR011079;IPR001608;IPR020622;IPR0290"),
  ("Q5HG16", "IPR001792;IPR036046;IPR0179")
))
val combs = rdd.cartesian(rdd) // create all pairs of proteins (the combinations)
combs.map { case (p1, p2) => similarityCheck(p1, p2) } // apply your similarityCheck function to each pair
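A minimal sketch of what such a similarityCheck function could look like, assuming a Jaccard-style measure (shared domains divided by the number of distinct domains across both proteins). The function name comes from the answer above, but the exact denominator is my assumption; if your 3/4 example means "shared domains divided by the size of one protein's domain set", adjust the denominator accordingly:

// Hypothetical sketch: similarity between two (protein, "dom1;dom2;...") records.
def similarityCheck(p1: (String, String), p2: (String, String)): (String, String, Double) = {
  val d1 = p1._2.split(";").toSet          // domains of the first protein
  val d2 = p2._2.split(";").toSet          // domains of the second protein
  val shared = d1.intersect(d2).size.toDouble
  val total  = d1.union(d2).size.toDouble  // assumption: Jaccard denominator
  val sim = if (total == 0) 0.0 else shared / total
  (p1._1, p2._1, sim)
}

On the pairs produced by cartesian above, you would typically also filter out self-pairs (p1._1 != p2._1) before collecting the results.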
The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two of the lines in the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum = linesRdd.filter(line => line.contains("[^0-9]")).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added on both sides so the pattern matches a digit anywhere within the line string.
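For completeness, a small sketch of how either approach behaves on the sample lines above; sc and the parallelized sample data are my own setup, not part of the original answers:

// Both filters should count the two lines that contain a digit.
val linesRdd = sc.parallelize(Seq(
  "I am 12 year old.",
  "I go to school.",
  "I am playing.",
  "Its 4 pm."
))
val byExists = linesRdd.filter(_.exists(_.isDigit)).count()    // 2
val byRegex  = linesRdd.filter(_.matches(".*[0-9].*")).count() // 2
println(s"exists: $byExists, regex: $byRegex")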
In Scala, Spark, and a lot of other "big data"-type frameworks, languages, and libraries, I see methods named "zip*". For instance, in Scala, List types have a built-in zipWithIndex method that you can use like so:
val listOfNames : List[String] = getSomehow()
for ((name, i) <- listOfNames.zipWithIndex) {
  println(s"Names #${i+1}: ${name}")
}
Similarly, Spark has RDD methods like zip, zipPartitions, etc.
But the method name "zip" is totally throwing me off. Is this a concept in computing or discrete math?! What's the motivation for all these methods with "zip" in their names?
They are named zip because you are zipping two datasets together like a zipper: zip is a classic higher-order function from functional programming that pairs up the elements of two sequences position by position.
To visualize it, take two datasets:
x = [1,2,3,4,5,6]
y = [a,b,c,d,e,f]
and then zip them together to get
1 a
2 b
3 c
4 d
5 e
6 f
I put in the extra spacing just to give the zipper illusion as you move down the datasets :)
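As a concrete sketch of the same idea in Scala (my own illustration, not part of the original answer):

// Pairing two collections element by element with zip.
val x = List(1, 2, 3, 4, 5, 6)
val y = List("a", "b", "c", "d", "e", "f")
val zipped = x.zip(y)   // List((1,a), (2,b), (3,c), (4,d), (5,e), (6,f))
zipped.foreach { case (n, s) => println(s"$n $s") }
// The RDD version, rddX.zip(rddY), behaves the same way but requires both RDDs
// to have the same number of partitions and the same number of elements per partition.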
I have been trying to merge the two RDDs below, averagePoints1 and kpoints2. It keeps throwing this error:
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
I have tried many things but I can't fix it; the two RDDs are identical and have the same number of partitions. My next step is to apply a Euclidean distance function on the two lists to measure the difference, so if anyone knows how to solve this error or has a different approach I could follow, I would really appreciate it.
Thanks in advance.
averagePoints1 = averagePoints.map(lambda x: x[1])
averagePoints1.collect()
Out[15]:
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints,4)
In [17]:
kpoints2.collect()
Out[17]:
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
a= [[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
b= [[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
c = rdda.zip(rddb)
print(c.collect())
Check this answer:
Combine two RDDs in pyspark
For future searchers, this is the solution I followed in the end:
newSample = newCenters.collect()  # new centers as a list
samples = list(zip(newSample, sample))  # sample => old centers; list() so it can be parallelized on Python 3
samples1 = sc.parallelize(samples)
totalDistance = samples1.map(lambda pair: distanceSquared(pair[0][1], pair[1]))  # tuple-unpacking lambdas are Python 2 only
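A side note not from the original thread: RDD.zip only works when both RDDs have the same number of partitions and the same number of elements in every partition, which is exactly what the deserialization error above complains about. When that cannot be guaranteed, a common workaround is to key both RDDs by position with zipWithIndex and join them. Below is a minimal sketch in Scala (the rest of this document mostly uses Scala; the question itself uses PySpark), with made-up variable names:

// Pair up two RDDs of the same length but possibly different partitioning.
val left  = sc.parallelize(Seq(1.0, 2.0, 3.0))
val right = sc.parallelize(Seq(10.0, 20.0, 30.0), numSlices = 2)

// zipWithIndex gives each element a stable position; joining on it emulates zip.
val paired = left.zipWithIndex.map(_.swap)
  .join(right.zipWithIndex.map(_.swap))
  .values                                  // RDD[(Double, Double)]
paired.collect().foreach(println)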
I have three separate RDD[mllib ... Vector]s and I need to combine them into one RDD[mllib Vector].
val vvv = my_ds.map(x=>(scaler.transform(Vectors.dense(x(0))),Vectors.dense((x(1)/bv_max_2).toArray),Vectors.dense((x(2)/bv_max_1).toArray)))
More info:
scaler => StandardScaler
bv_max_... is nothing but a DenseVector from the Breeze library, used for normalizing (x / max(x)).
Now I need to make them all into one vector.
I get ([1.],[2.],[3.]) and [[1.],[2.],[3.]],
but I need [1.,2.,3.] as one vector.
Finally I found a solution... I don't know if this is the best one.
I had a 3-D data set and I needed to perform x/max(x) normalization on two dimensions and apply StandardScaler to the other dimension.
My problem was that in the end I had 3 separate Vectors, e.g.:
[ [1.0], [4.0], [5.0] ]
[ [2.0], [5.0], [6.0] ] ... but I needed [1.0, 4.0, 5.0], which can be passed to KMeans.
I changed the above code to:
val vvv = dsx.map(x =>
    scaler.transform(Vectors.dense(x.days_d)).toArray ++  // standardized dimension
    (x.freq_d / bv_max_freq).toArray ++                    // x/max(x) normalized dimension
    (x.food_d / bv_max_food).toArray                       // x/max(x) normalized dimension
  )
  .map(x => Vectors.dense(x(0), x(1), x(2)))               // rebuild a single 3-element vector
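In case it helps, a self-contained sketch of the underlying idea, concatenating the arrays of the per-record vectors and wrapping the result in a single dense vector; the input values below are made up for illustration:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical RDD of three 1-element vectors per record, as in the question.
val parts = sc.parallelize(Seq(
  (Vectors.dense(1.0), Vectors.dense(4.0), Vectors.dense(5.0)),
  (Vectors.dense(2.0), Vectors.dense(5.0), Vectors.dense(6.0))
))

// Concatenate the underlying arrays and rebuild one Vector per record.
val combined = parts.map { case (a, b, c) =>
  Vectors.dense(a.toArray ++ b.toArray ++ c.toArray)
}
combined.collect().foreach(println)   // [1.0,4.0,5.0], [2.0,5.0,6.0]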
I am new to word2vec. Using this method, I am trying to form some clusters based on words extracted by word2vec from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via Stanford NLP and put each sentence on its own line in a text file. Then the text file required by the deeplearning4j word2vec implementation was ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms and brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I ran word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        // System.out.println(sentence.toLowerCase());
        return sentence.toLowerCase();
    }
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words with their related vector values in each row, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file was used to form some clusters via k-means in Spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfIterations = 100
// We use the KMeans object provided by MLlib to run the clustering
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)
modell.clusterCenters.foreach(println)
// Get the cluster index for each row
val AltCompByCluster = rawData.map { row =>
  (modell.predict(Vectors.dense(row.split(' ').slice(2, 101).map(_.toDouble))),
   row.split(',').slice(0, 1).head)
}
AltCompByCluster.foreach(println)
As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I checked my clusters, no obvious common words appeared; that is, I could not get reasonable clusters as I expected. Given this bottleneck, I have a few questions:
1) In some tutorials for word2vec I have seen that no data cleaning is done; in other words, prepositions etc. are left in the text. So how should I apply a cleaning procedure when applying word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use the word2vec word vectors as input to neural networks? If so, which neural network (convolutional, recursive, recurrent) would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.