Training Spark's word2vec with an RDD[String] - scala

I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to their documentation, one way to do this is
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset consists of a single line containing many words, meaning that input will become an RDD[Seq[String]].
After massaging my own dataset, which has one word per line, using different maps etc., I'm left with an RDD[String], but I can't seem to be able to train the word2vec model on it. I tried doing input.map(v => Seq(v)), which does actually give an RDD[Seq[String]], but that will give one sequence for each word, which I guess is totally wrong.
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. Starting from my clean RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how they intend the fitting to be done? Maybe it is not really parallelizable. Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
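A sketch of that last idea, grouping consecutive words into fixed-length "sentences" entirely on the cluster so that nothing has to be collected to the driver (the chunk size of 1000 and the variable names are arbitrary assumptions):
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

val sentenceLength = 1000
// clean: RDD[String] with one word per line
val sentences: RDD[Seq[String]] = clean
  .zipWithIndex()                                                  // pair each word with its position
  .map { case (word, idx) => (idx / sentenceLength, (idx, word)) } // bucket by chunk number
  .groupByKey()                                                    // gather the words of each chunk
  .map { case (_, indexed) => indexed.toSeq.sortBy(_._1).map(_._2) }

val model = new Word2Vec().fit(sentences)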

It seems that the documentation has been updated in another location (even though I was looking at the "latest" docs). The new docs are at: https://spark.apache.org/docs/latest/ml-features.html
The new example drops the text8 example file altogether. I'm doubting whether the original example ever worked as intended. The RDD input to word2vec should be a set of lists of strings, typically sentences or otherwise constructed n-grams.
Example included for other lost souls:
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

Why not
input.map(v => v.split(" "))
or whatever would be an appropriate delimiter to split your words on. This will give you the desired sequence of strings - but with valid words.

As far as I can recall, Word2Vec in ml takes a DataFrame as its argument, while Word2Vec in mllib takes an RDD. The example you posted is for Word2Vec in ml. Here is the official guide: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
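A minimal sketch of the mllib variant, assuming sentences is an RDD[Seq[String]] built as discussed above (the parameter values are just placeholders):
import org.apache.spark.mllib.feature.Word2Vec

val mllibModel = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(5)
  .fit(sentences)                 // fit takes an RDD[Seq[String]]

// Query the trained model, e.g. the five nearest neighbours of a word
mllibModel.findSynonyms("spark", 5).foreach { case (word, cosine) =>
  println(s"$word $cosine")
}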

Related

Functional Style reading large csv file in Scala

I am new to functional-style programming and Scala, so my question might seem a bit primitive.
Is there a specific way to read a CSV file in Scala using functional style? Also, how are inner joins performed to combine two CSV files in Scala using functional style?
I know Spark and generally use data frames, but I don't have any idea how to do this in Scala and find it tough to search on Google as well, since I don't have much knowledge about it. Also, if anyone knows good links for functional-style programming in Scala, that would be a great help.
The question is indeed too broad.
Is there a specific way to read csv file in scala using functional style?
So far I don't know of a king's road to parse CSVs completely hassle-free.
CSV parsing includes
going through the input line-by-line
understanding, what to do with the (optional) header
accurately parsing each line according to the CSV specification
turning line parts into a business object
I recommend that you:
turn your input into Iterator[String]
split each line into parts using a library of your choice (e.g. opencsv)
manually create a desired domain object from the line parts
Here is a simple example (which ignores error handling and a potential header):
import java.io.FileInputStream
import scala.io.Source
import com.opencsv.CSVParserBuilder

case class Person(name: String, street: String)

val lineParser = new CSVParserBuilder().withSeparator(',').build()
val lines: Iterator[String] = Source.fromInputStream(new FileInputStream("file.csv")).getLines()
val parsedObjects: Iterator[Person] = lines.map { line =>
  val parts: Array[String] = lineParser.parseLine(line)
  Person(parts(0), parts(1))
}
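The inner-join part of the question can be handled in a similarly functional way once both files are parsed into collections. A small sketch (the Address case class and the name-based join key are hypothetical):
case class Address(name: String, city: String)

// Inner join two in-memory collections on the name field
def innerJoin(people: Seq[Person], addresses: Seq[Address]): Seq[(Person, Address)] = {
  val addressesByName = addresses.groupBy(_.name)
  people.flatMap(p => addressesByName.getOrElse(p.name, Nil).map(a => (p, a)))
}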

java.lang.NullPointerException after df.na.fill("Missing") in Scala?

I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you would train the model with one part of the data and then test it with another part. Hence, there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a DataFrame; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the one used for training and produces a DataFrame as output. In other words, you should add .transform(newDf1) after fit(newDf1).
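A minimal sketch of that fit-then-transform flow, reusing the pipeline and newDf1 defined above (the column picked for show() is just an example):
val model = categorical.fit(newDf1)      // trains the StringIndexer stages
val indexed = model.transform(newDf1)    // adds the *_index columns to the data
indexed.select("day_of_week", "day_of_week_index").show(5)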
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct DataFrame is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally; however, if you still get an error, you could try calling cache() after replacing the nulls and then performing an action to make sure all transformations are done.
Hope it helps!

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the Dataframe API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE]
P.S. due to automatic conversion you can skip the .rdd part.
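For example, if the column holds strings, the typed variant would look like this (a sketch; the column name and element type are assumptions):
val values: List[String] =
  dataFrame.select("YOUR_COLUMN_NAME")
    .rdd
    .map(r => r(0).asInstanceOf[String])   // cast each Row's first field
    .collect()
    .toList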
With Spark 2.x and Scala 2.11
I'd think of 3 possible ways to convert values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens now? We are collecting data to Driver with collect() and picking element zero from each record.
This is not an ideal way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
//res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than a single Driver.
I know rdd.map(r => r(0)) may not seem elegant to you. So, let's address it in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
//res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues with DataFrame. So we end up using r => r.getString(0); this should be addressed in future versions of Spark.
Conclusion
All three options give the same output, but 2 and 3 are effective; finally, the third one is both effective and elegant (I'd think).
I know the answer given and asked for assumes Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s').as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given any solution in Java (a real programming language), you can thank me later.
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the action that converts the result into a list.
Beware of using the list on a huge data set. It will decrease performance.
It is good to check the data.
Below is for Python-
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is a Java answer.
df.select("id").collectAsList();

how to make saveAsTextFile NOT split output into multiple files?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it is saved as multiple files is that the computation is distributed. If the output is small enough that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
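For reference, the two one-liners described above, applied to the question's year RDD (the output path is just an example):
// Coalesce to a single partition right before saving
year.coalesce(1, shuffle = true).saveAsTextFile("year-single")
// or, equivalently:
// year.repartition(1).saveAsTextFile("year-single")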
For those working with a larger dataset:
rdd.collect() should not be used in this case, as it will collect all the data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages will be lost: they will be performed on the single node the data ends up stored on.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option, as it will keep the processing of upstream tasks parallel and only then perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile(), as provided below, additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package

import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec

object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile("path/to/file.txt")
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
      path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx files into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
first stores the rdd with rdd.saveAsTextFile("path/to/file.txt") in a temporary folder path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel),
and only then, using the Hadoop file system API, proceeds with the merge (FileUtil.copyMerge()) of the different output files to create our final single output file, path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as @aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it if the output is small enough to handle. Basically what it does is return a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and do it this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version (1.0.0) it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, as this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1). Given the original code:
val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
year.saveAsTextFile("year")
the save line then becomes:
year.coalesce(1).saveAsTextFile("year")

Reading a two-dimensional array using Scala

Suppose I have a txt file named "input.txt" and I want to use Scala to read it in. The dimensions of the file are not known in advance.
So, how do I construct such an Array[Array[Float]]? What I want is a simple and neat way, rather than writing Java-like code that iterates over lines and parses each number. I think functional programming should be quite good at this, but I cannot come up with a solution so far.
Best Regards
If your input is well-formed, you can do it this way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
Update: for additional information about using Source check this thread
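If the file might contain blank lines or vary in whitespace, a slightly more defensive variant could look like this (a sketch, assuming whitespace-separated floats):
import scala.io.Source

val source = Source.fromFile("input.txt")
val data: Array[Array[Float]] =
  try source.getLines()
        .map(_.trim)
        .filter(_.nonEmpty)                       // skip blank lines
        .map(_.split("\\s+").map(_.toFloat))      // split on any whitespace
        .toArray
  finally source.close()

println(s"rows: ${data.length}, cols in first row: ${data.headOption.fold(0)(_.length)}")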