Spark 2.2: Load org.apache.spark.ml.feature.LabeledPoint from file - scala

The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint from file to an RDD[LabeledPoint]:
MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)
I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.
Can someone point me to the relevant function?

With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:
val gbt = new GBTClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
To load the LibSVM file as a dataframe, simply do:
val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
This returns a DataFrame with two columns: label, containing labels stored as doubles, and features, containing feature vectors stored as Vectors.
See the documentation for more information.

Related

Populate a Properties Object from Spark Databricks File System

TL;DR
Is there a way to read a Scala/Java properties file from a Databricks file system?
Or, is there a way to convert a spark data frame Rows into a set of text key/value pairs (that Scala will understand)?
Full Problem:
The properties file is not local, it's on the Databricks cluster. Attempts to read a file from "dbfs:/" or "/dbfs" fail to find the file when using the scala.io.Source library. My guess is that Scala Source has no ability to recognize the URI for the Databricks file system(?).
I am able to read the file into a Spark DataFrame, but attempts to populate a java.util.Properties object fail with an error that it doesn't accept the Spark DataFrame Row type. I've tried converting the data frame to an Array and a List, but I run into the same type mismatch: converting the data frame to a list gives me java.util.List[org.apache.spark.sql.Row], for example. I'm guessing that means dataFrameObject.collectAsList() makes a list of Spark rows instead of a text list of key/value pairs.
Obviously I'm new to Scala... If there isn't a way to read/load my properties file directly from DBFS, is there a way to convert the spark Row to a key/value pairs - or a byteStream?
Cheers and thanks,
Simon
If you're using the full version of Databricks, not the Community Edition, you should be able to access files on DBFS via /dbfs/_the_rest_of_your_path_without_dbfs:/_...
But if you can't access /dbfs/..., you can still load the properties as follows:
load the file into Spark using the text format, which converts every line in the file into an individual row
create text from those rows: first collect all rows to the driver node, then extract the string from each row (using .getString(0) to fetch the first element of the row), and then merge all lines together using mkString
create a reader for that text
create a properties object and load the data from the reader (don't forget to close the reader after use):
val path_to_file = "dbfs:/something...."
val df = spark.read.format("text").load(path_to_file)
val allText = df.collect().map(_.getString(0)).mkString("\n")
val reader = new java.io.StringReader(allText)
val props = new java.util.Properties()
props.load(reader)
reader.close()
and you can check that properties are loaded with
props.list(System.out)
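The parsing half of the steps above needs no Spark, so it can be tried in isolation. The following is a minimal sketch in plain Scala, assuming the lines have already been collected to the driver; the object name PropsFromLines and the sample keys are made up for illustration.

```scala
import java.io.StringReader
import java.util.Properties

object PropsFromLines {
  // Joins already-collected lines and parses them as a properties file.
  def load(lines: Seq[String]): Properties = {
    val reader = new StringReader(lines.mkString("\n"))
    val props = new Properties()
    // Close the reader even if parsing fails.
    try props.load(reader) finally reader.close()
    props
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical content; in the answer these lines come from df.collect().
    val lines = Seq("# sample config", "db.host=example.internal", "db.port = 5432")
    val props = load(lines)
    println(props.getProperty("db.host"))
    println(props.getProperty("db.port"))
  }
}
```

Lines starting with # are treated as comments by java.util.Properties, and whitespace around the = separator is ignored, so the file does not need to be cleaned up before it is joined.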

Ques Classification Using Support Vector Machines

I am trying to classify Questions using SVM. I am following this link for reference -
https://shirishkadam.com/2017/07/03/nlp-question-classification-using-support-vector-machines-spacyscikit-learnpandas/
But they have used SPACY,SCIKIT-LEARN and PANDAS. I want to do the same thing using Spark Mllib.
I am using this code to create a Dataframe -
sc = SparkContext(conf=sconf) # SparkContext
sqlContext = SQLContext(sc)
data = sc.textFile("<path_to_csv_file>")
header = data.first()
trainingDF = sqlContext.createDataFrame(data
.filter(lambda line: line != header)
.map(lambda line: line.split("|"))
.map(lambda line: ([line[0]], [line[2]], [line[6]]))).toDF("Question", "WH-Bigram", "Class")
And I am getting following result by printing the dataframe- trainingDF.show(3)
+--------------------+-------------------+------+
| Question| WH-Bigram| Class|
+--------------------+-------------------+------+
|[How did serfdom ...| [How did]|[DESC]|
|[What films featu...| [What films]|[ENTY]|
|[How can I find a...| [How can]|[DESC]|
My sample csv file is -
#Question|WH|WH-Bigram|Class
How did serfdom develop in and then leave Russia ?|How|How did|DESC
I am using word2vec to create training data for SVM and trying to train using SVM.
word2Vec1 = Word2Vec(vectorSize=2, minCount=0, inputCol="Question", outputCol="result1")
training = word2Vec1.fit(trainingDF).transform(trainingDF)
model = SVMWithSGD.train(training, iterations=100)
After using word2vec my data is converted in this format -
[Row(Question=[u'How did serfdom develop in and then leave Russia ?'], WH-Bigram=[u'How did'], Class=[u'DESC'], result1=DenseVector([0.0237, -0.186])), Row(Question=[u'What films featured the character Popeye Doyle ?'], WH-Bigram=[u'What films'], Class=[u'ENTY'], result1=DenseVector([-0.2429, 0.0935]))]
But when I try to train on the dataframe using SVM, I get the error TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
I am stuck here... I think the dataframe that I have created is not correct.
Does anybody know how to create a suitable dataframe for training with SVM? Please let me know if I am doing something wrong.
It's great that you are trying out one of the machine learning methods in Spark, but there are multiple problems with your approach:
1) Your data has multiple classes, so this is not a binary classification problem, and SVM in Spark won't work on this dataset (you can have a look at the source code here). You can try the one-vs-rest approach and train as many models as there are classes in your data. However, you would be better off using something like the MultilayerPerceptronClassifier or the multiclass logistic regression model in Spark.
2) MLlib is very unforgiving about the class labels that you use: you can only specify 0, 1, 2 or 0.0, 1.0, 2.0 etc., i.e. it does not automatically infer the number of classes from your output column. Even if you specify two classes as 1.0 and 2.0, it will not work; they have to be 0.0 and 1.0.
3) You need to use an RDD of LabeledPoint instead of a Spark dataframe. Remember that spark.mllib is for use with RDDs, whereas spark.ml is for use with dataframes. For help on creating a LabeledPoint RDD, you may refer to the Spark documentation here, where there are multiple examples.
4) On a feature engineering note, I don't think you want a vectorSize of 2 for your word2vec model (something like 10 would be more appropriate as a starting point); two dimensions are simply too few to give a reasonable prediction.
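To make point 2 concrete, here is a small sketch in plain Scala (not a Spark API) of turning string classes such as DESC and ENTY into 0-based double labels. The object name LabelIndex is hypothetical, and indices are assigned in order of first appearance, which is only one possible convention; Spark's StringIndexer does a similar job for dataframes but orders labels by frequency by default.

```scala
object LabelIndex {
  // Assigns each distinct class name a 0-based Double index,
  // in order of first appearance in the input.
  def build(classes: Seq[String]): Map[String, Double] =
    classes.distinct.zipWithIndex.map { case (name, i) => name -> i.toDouble }.toMap

  def main(args: Array[String]): Unit = {
    val classes = Seq("DESC", "ENTY", "DESC")
    val index = build(classes)
    // Replace every class name with its numeric label.
    println(classes.map(index))
  }
}
```

The resulting doubles are the kind of values that can go into the label field of a LabeledPoint, with the word2vec vector as the features.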

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this data set (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, and some of those expect an mllib vector and some expect an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the data set in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the data set will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib to an ml vector. A UDF and usage example can look like this:
import org.apache.spark.sql.functions.udf
// Wrap the mllib-to-ml conversion in a UDF so it can be applied column-wise
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

Append a column to Data Frame in Apache Spark 1.3

Is it possible and what would be the most efficient neat method to add a column to Data Frame?
More specifically, column may serve as Row IDs for the existing Data Frame.
In a simplified case, reading from file and not tokenizing it, I can think of something as below (in Scala), but it completes with errors (at line 3), and anyways doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so the row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).
zipWithIndex().
map { case (d, i) => i.toString + delimiter + d }.
map(_.split(delimiter)).
map(s => Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in the Spark API are withColumn and withColumnRenamed. According to the Scala docs, the former "Returns a new DataFrame by adding a column". In my opinion, this is a somewhat confusing and incomplete definition. Both of these functions can operate only on the data frame they are called on, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing dataframe to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or other data frames).
As it was commented above, the workaround solution may be to use a join - this would be pretty messy, although possible - attaching the unique keys like above with zipWithIndex to both data frames or columns might work. Although efficiency is ...
It's clear that appending a column to a data frame is not easy functionality to provide in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think that it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need a value that is not related to the existing columns of the dataframe.
This is similar to @NehaM's answer but simpler.
I took help from the above answer. However, I found it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a tuple of (Row, Long) that contains each row and its corresponding index. We can use it to create a new Row according to our needs.
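import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Prepend each row's index (as a string) to the row's existing values
val rdd = df.rdd.zipWithIndex()
.map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure ).show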
I hope this will be helpful.
You can use row_number with a Window function, as below, to get a distinct ID for each row in a dataframe.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id to get unique IDs:
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.
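One caveat worth illustrating: IDs from monotonically_increasing_id are unique and increasing but not consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below mimics that layout in plain Scala; it is a standalone illustration, not a call into Spark, and the names IdLayout and idFor are made up.

```scala
object IdLayout {
  // Mimics the documented layout: partition index in the upper 31 bits,
  // per-partition record number in the lower 33 bits.
  def idFor(partitionIndex: Int, recordNumber: Long): Long =
    (partitionIndex.toLong << 33) + recordNumber

  def main(args: Array[String]): Unit = {
    // Hypothetical data frame with two partitions of three rows each.
    val ids = for (p <- 0 until 2; r <- 0L until 3L) yield idFor(p, r)
    println(ids.mkString(", "))
  }
}
```

The first row of the second partition jumps to 8589934592 (2 to the power 33), so when a gap-free sequence matters, row_number or zipWithIndex is the safer choice.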

How do I project parquet file in spark?

I load a data set from Parquet files as
val sqc = new org.apache.spark.sql.SQLContext(sc)
val data = sqc.parquetFile("f1,f2,f3,f4,f5")
here files "fN" &c have common columns "c1" and "c2" but some of them may also have other columns.
Thus, when I do
data.registerAsTable("MyTable")
I get the error:
java.lang.RuntimeException: could not merge metadata: key pig.schema has conflicting values
The question is: how do I get those parquet files into a single table
with just two columns?
I.e., how do I project them?
It would seem reasonable to load "fN" one by one, project them, then
merge together using unionAll.
The rough equivalent of a projection on a SchemaRDD is .select(), which takes Expression instances and returns a new SchemaRDD containing only the selected fields. After doing the selects you can use unionAll as suggested, e.g.
val sqc = new org.apache.spark.sql.SQLContext(sc)
import sqc._
val file1 = sqc.parquetFile("file1").select('field1, 'field2)
val file2 = sqc.parquetFile("file2").select('field1, 'field2)
val all_files = file1.unionAll(file2)
(The import sqc._ is required to load the implicit functions for building Expression instances from symbols.)
Do you know how these files are generated?
If you do, then you should already know the schemas and can categorize the files accordingly.
Otherwise I don't think there is another way: you need to load them one by one. Once you have extracted the data into SchemaRDDs, you can call unionAll if they share the same schema.
Check sample code from github project https://github.com/pankaj-infoshore/spark-twitter-analysis where the parquet files are handled.
var path ="/home/infoshore/java/Trends/urls"
var files =new java.io.File(path).listFiles()
var parquetFiles = files.filter(file=>file.isDirectory).map(file=>file.getName)
var tweetsRDD= parquetFiles.map(pfile=>sqlContext.parquetFile(path+"/"+pfile))
var allTweets =tweetsRDD.reduce((s1,s2)=>s1.unionAll(s2))
allTweets.registerAsTable("tweets")
sqlContext.cacheTable("tweets")
import sqlContext._
val popularHashTags = sqlContext.sql("SELECT hashtags,usersMentioned,Url FROM tweets")
Check how I have called unionAll. You cannot call unionAll on SchemaRDDs that represent different schemas.
Let me know if you need specific help.
Regards
Pankaj