How to transform Array[(Double, Double)] into Array[Double] in Scala? - scala

I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here is an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is read the last two fields of my files, which are in a specific directory in HDFS, and transform them into an RDD<Vector> to use this method in the KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives: (values:
Array[Double])org.apache.spark.mllib.linalg.Vector (firstValue:
Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into RDD<Vector>, any suggestion?

The previous suggestion using flatMap assumed that you wanted to map over the elements of the array returned by .split(","), and it satisfied the types by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is a single element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each element of the split, which is the source of the confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or, if you want all BUT the first two fields converted to Double (in case there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
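As a minimal sketch of the field selection described above, runnable in a plain Scala REPL without Spark. The helper name parseLonLat and the sample line are made up for illustration:

```scala
// Hypothetical helper: keeps the 3rd and 4th comma-separated fields as Doubles.
def parseLonLat(line: String): Array[Double] =
  line.split(",").drop(2).take(2).map(_.toDouble)

// Invented sample line with 4 fields; the last two stand in for longitude and latitude.
val sample = "id1,2014-10-01,-3.70,40.41"
val coords = parseLonLat(sample)
```

In spark-shell, the same helper should plug into the original code as data.map(s => Vectors.dense(parseLonLat(s))), since Vectors.dense accepts an Array[Double].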

There are two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
The type provided above is Array[Tuple2[Double,Double]], which matches neither signature.
(Extracting the parsing logic from the question:)
val parseLineToTuple: String => Array[(Double,Double)] = s => s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))
What is needed here is to create a new Array out of the input String, like this: (again focusing only on the specific parsing logic)
val parseLineToArray: String => Array[Double] = s => s.split(",").flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))
(You can of course inline that code, I separated it here to focus on the issue at hand)

val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))

Related

Spark-Scala: Map the first element of list with every other element of list when lists are of varying length

I have a dataset of the following type in a text file:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
I read the file using sc.textFile, which returns an RDD[String], after which I perform the operations .map(x=>x.substring(1,x.length()-1)).map(x=>x.split(",").toList)
After split.toList I want to pair the first element of each list with every other element of that list, for which I use .map(x=>(x(0),x(1))).toDF("c1","c2")
This works fine for lists that have only one value after the split, but it skips every element past the second in lists that have more than one value, for obvious reasons. For example:
.map(x=>(x(0),x(1))) returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59] but skips out on the third element here 77a90a1c97b|2021-09-19 01:34:53
How can I write a map function which returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59], [1020,77a90a1c97b|2021-09-19 01:34:53] given that all the lists created using .map(x=>x.split(",").toList) are of varying lengths (have varying number of elements)?
I noted the ',' at the end of each line, but split drops trailing empty strings.
The solution is as follows; try it and you will see it works:
// x._n cannot work here initially.
val rdd = spark.sparkContext.textFile("/FileStore/tables/oddfile_01.txt")
val rdd2 = rdd.map(line => line.split(','))
val rdd3 = rdd2.map(x => (x(0), x.tail.toList))
val rdd4 = rdd3.flatMap{case (x, y) => y.map((x, _))}
rdd4.collect
Cardinality does change in this approach though.
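The same pairing can be sketched on a plain Scala collection, without Spark, to see how flatMap changes the number of rows. The sample lines below are invented but follow the shape of the data in the question, including the trailing comma:

```scala
// Invented sample lines in the same shape as the question's data.
val lines = Seq(
  "1018,36ba7a7|2021-09-19 01:33:00,",
  "1020,23fe40|2021-09-19 01:38:59,77a90a1|2021-09-19 01:34:53,"
)

// split(',') drops the trailing empty string, so tail holds only real values.
// flatMap emits one (key, value) pair per value, however many a line has.
val pairs: Seq[(String, String)] = lines.flatMap { line =>
  val fields = line.split(',')
  fields.tail.map(value => (fields.head, value)).toList
}
```

Here two input lines become three pairs, which is the change in cardinality mentioned above.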

How to create an RDD by selecting specific data from an existing RDD where the output should be of type RDD[String]?

I have a scenario where I need to capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Let's look at example data (empnum, empname, emplocation, empsal) in a text file.
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So, my requirement is to create another RDD with empnum, empname, emplocation (again as RDD[String]).
For that I have tried the code below, but I get an RDD[(String, String, String)].
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have also tried slice, which gives me an RDD[Array[String]].
The RDD I need should be an RDD[String], so that I can pass it to the Scala class that does the operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me achieve this?
I would try this
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)})
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concise, you could do it by taking the first 3 elements of the array and using the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
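For a quick check without Spark, the take/mkString variant can be tried on a plain Scala Seq. This is only a sketch using the two sample rows from the question:

```scala
// Sample rows from the question.
val empLines = Seq("11,John,Paris,1000", "12,Daniel,UK,3000")

// Keep the first three fields and join them back into one comma-separated String.
val empReq: Seq[String] = empLines.map(_.split(',').take(3).mkString(","))
```

One difference worth keeping in mind: take(3) tolerates lines with fewer than three fields, whereas the pattern-matching regex version throws a scala.MatchError on any line that does not match.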

How to create Key-Value RDD (Scala)

I have the following RDD (name: AllTrainingDATA_RDD) which is of type
org.apache.spark.rdd.RDD[(String, Double, Double, String)] :
(ICCH_1,4.3,3.0,Iris-setosa)
(ICCH_1,4.4,2.9,Iris-setosa)
(ICCH_1,4.4,3.0,Iris-setosa)
(ICCH_2,4.4,3.2,Iris-setosa)
1st column : ICCH_ID, 2nd column: X_Coordinates, 3rd Column: Y_Coordinates, 4th column: Class
I would like to end up with an RDD which has 2nd and 3rd column as the Key and 4th column as Value. The column ICCH_ID should remain in the RDD.
My current attempt, based on some Internet research, is this:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD.map(_.split(",")).keyBy(_(X_COORD,Y_COORD)).mapValues(fields => ("CLASS")).groupByKey().collect()
However I get this error:
error: value split is not a member of (String, Double, Double, String)
P.S. I am using Databricks Community Edition. I am new to Scala.
Let's try to break down your solution, part by part:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD
.map(_.split(","))
.keyBy(_(X_COORD,Y_COORD))
.mapValues(fields => ("CLASS"))
.groupByKey()
.collect()
Your first problem is the use of .map(_.split(",")). This is likely a preprocessing stage done on an RDD[String] to extract the comma-separated values from the text input lines. But since you've already done that, we can go ahead and drop that part.
Your second problem will come from .keyBy(_(X_COORD,Y_COORD)), and it's going to look something like this:
error: (String, Double, Double, String) does not take parameters
This is because you supplied keyBy with an anonymous function that attempts to apply (X_COORD,Y_COORD) to each of the tuples in your RDD, but what you actually want is a function that extracts the x and y coordinates (2nd and 3rd values) from your tuple. One way to achieve this is .keyBy{case (_, x, y, _) => (x, y)}
Lastly, your use of mapValues just produces the same String value ("CLASS") for all elements in the RDD. Instead, you can simply take the 4th value in the tuple like so: .mapValues(_._4)
Putting this all together, you get the following code:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD
.keyBy{case (_, x, y, _) => (x, y)}
.mapValues(_._4)
.groupByKey()
.collect()
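To see what the key and value look like, here is a rough local equivalent on a plain Scala Seq. It uses the sample tuples from the question, and groupBy stands in for Spark's groupByKey, so it is a sketch of the shape of the result rather than the RDD code itself:

```scala
// Sample tuples from the question: (ICCH_ID, x, y, class).
val training = Seq(
  ("ICCH_1", 4.3, 3.0, "Iris-setosa"),
  ("ICCH_1", 4.4, 2.9, "Iris-setosa"),
  ("ICCH_1", 4.4, 3.0, "Iris-setosa"),
  ("ICCH_2", 4.4, 3.2, "Iris-setosa")
)

// Key each tuple by its (x, y) coordinates and keep only the class as the value.
val keyed: Seq[((Double, Double), String)] =
  training.map { case (_, x, y, cls) => ((x, y), cls) }

// Local stand-in for groupByKey: collect the classes seen for each coordinate pair.
val grouped: Map[(Double, Double), Seq[String]] =
  keyed.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }
```

Note that the ICCH_ID is dropped here, just as in the RDD version above; keeping it would mean carrying it in the value, for example as (id, cls).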
Since you are new to Scala, I suggest you take some time to get acquainted with its syntax, features and APIs before you continue. It will help you understand and overcome such problems much faster.

How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (double, vector), where the vector requires doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far:
tree config, decision tree, labeledpoint
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, along with each feature's arity (the number of distinct categories of that feature), by specifying a Map[Int, Int] from feature indices to their arities.
For example if you have data as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform data into numerical format as:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
In that format you can load data to Spark. Then if you want to tell Spark the second and the third columns are categorical, you should create a map:
val categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map tells us that the feature with index 1 has arity 3, and the feature with index 2 has arity 5. They will be treated as categorical when we build a decision tree model by passing that map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
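The string-to-number step and the arity map can be sketched in plain Scala as follows. The helper name indexColumn is hypothetical, and categories are numbered in order of first appearance, which is an assumption chosen to reproduce the numeric table above:

```scala
// Rows from the example above, one Array per line.
val rows = Seq(
  Array("1", "a", "add"), Array("2", "b", "more"), Array("1", "c", "thinking"),
  Array("3", "a", "to"), Array("1", "c", "me")
)

// Hypothetical helper: maps each distinct value in a column to 0, 1, 2, ...
// in order of first appearance.
def indexColumn(col: Int): Map[String, Int] =
  rows.map(_(col)).distinct.zipWithIndex.toMap

val categoricalCols = Seq(1, 2)
val indexers: Map[Int, Map[String, Int]] =
  categoricalCols.map(c => c -> indexColumn(c)).toMap

// Arity = number of distinct categories per column, keyed by index as in the map above.
val arities: Map[Int, Int] = indexers.map { case (c, m) => (c, m.size) }

// Replace each categorical string with its numeric index; other columns are parsed as Double.
val encoded: Seq[Array[Double]] = rows.map { r =>
  r.zipWithIndex.map { case (v, i) =>
    indexers.get(i).map(_(v).toDouble).getOrElse(v.toDouble)
  }
}
```

With this data, arities comes out as Map(1 -> 3, 2 -> 5), matching the categoricalFeaturesInfo shown above.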
Strings are not supported by LabeledPoint. Assuming your strings are categorical, one way to fit them into a LabeledPoint is to split each string column into multiple columns.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, turning each distinct string value into a new column:
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
As you have 3 distinct string values, you will convert your string column into 3 new columns, and each value will be represented by a 1 in one of these new columns.
Now your dataset will be
id,String_a,String_b,String_c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
You can now convert these into Double values and use them in your LabeledPoint.
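A minimal plain-Scala sketch of this column splitting, using the rows from the example above. The helper name oneHot is hypothetical, and the category order a, b, c is assumed to be known up front:

```scala
// Hypothetical helper: turns one categorical value into a 0/1 indicator array.
def oneHot(value: String, categories: Seq[String]): Array[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0).toArray

val categories = Seq("a", "b", "c")
// Rows from the example above: (id, String, Intvalue).
val raw = Seq((1, "a", 123), (2, "b", 456), (3, "c", 789), (4, "a", 887))

// Indicator columns followed by the numeric column, all as Double.
val features: Seq[Array[Double]] =
  raw.map { case (_, s, n) => oneHot(s, categories) :+ n.toDouble }
```

Each resulting Array[Double] is in a form that Vectors.dense accepts.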
Another way to convert your strings into a LabeledPoint is to create a distinct list of values for each column and replace each string with its index in that list. This is not recommended, because with this dataset the mapping would be
a = 0
b = 1
c = 2
In this case the algorithms will consider a closer to b than to c, an ordering that the data does not justify.
You need to confirm the type of the array x.
The error log says that the items in array x are Strings, which Vectors.dense does not accept.
Spark Vectors can currently hold only Doubles.

How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format :
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is an integer and the product_titleX is a String, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark with Scala so far, following the tutorials I have found on the official page and the Berkeley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it into tuples of type RDD[Array[String]]
val tuples = file.map(line => line.split(",")).cache
and afterwards I'm transforming the tuples into pairs RDD[(Int, String)]
val pairs = tuples.map(line => (line(0),line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
    # Count how often each token occurs in one document.
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return list(cnt.items())
tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda kv: (kv[0], 1. + log10(float(num_documents) / kv[1]))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
            (k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.
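Since the question is about Scala, the same computation can be sketched on plain Scala collections. This is not Spark code: the toy corpus is invented, tokenization is assumed to have happened already, tokens are used directly instead of token ids, and the idf formula is the one from the answer above, 1 + log10(N / df):

```scala
// Invented toy corpus: document id -> tokens.
val corpus: Map[Int, Seq[String]] = Map(
  1 -> Seq("apple", "iphone", "apple"),
  2 -> Seq("samsung", "galaxy"),
  3 -> Seq("apple", "ipad")
)
val numDocs = corpus.size.toDouble

// tf: raw count of each token within each document.
val tf: Map[Int, Map[String, Int]] =
  corpus.map { case (doc, tokens) =>
    (doc, tokens.groupBy(identity).map { case (t, occ) => (t, occ.size) })
  }

// df: number of documents that contain each token.
val df: Map[String, Int] =
  corpus.values.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }

// idf as in the answer above: 1 + log10(N / df).
val idf: Map[String, Double] =
  df.map { case (t, d) => (t, 1.0 + math.log10(numDocs / d)) }

// tf-idf per document: look up idf by token instead of scanning a list of tuples.
val tfidf: Map[Int, Map[String, Double]] =
  tf.map { case (doc, counts) =>
    (doc, counts.map { case (t, c) => (t, c * idf(t)) })
  }
```

Keeping idf as a Map replaces the nested loop in calc_tfidf with a direct lookup. On a cluster, one option might be to share such a map with sc.broadcast, provided the vocabulary fits in memory.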