I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:
doc XXXXX term XXXXX
1 x 'a' x
1 x 'a' x
1 x 'b' x
2 x 'b' x
2 x 'd' x
...
Where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As is I hope is clear from the example, there's one line per token in the raw data (so if a given term appears 5 times in a document, that corresponds to 5 lines of text).
In any case, I need to format this data as sparse term-frequency vectors for running a Spark LDA model, but am unfamiliar with Scala so having some trouble.
I start with:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val corpus:RDD[Array[String]] = sc.textFile("path/to/data")
.map(_.split('\t')).map(x => Array(x(0),x(2)))
And then I get the vocabulary data I'll need to generate the sparse vectors:
val vocab: RDD[String] = corpus.map(_(1)).distinct()
val vocabMap: Map[String, Int] = vocab.collect().zipWithIndex.toMap
What I don't know is the proper mapping function to use here such that I end up with a sparse term frequency vector for each document that I can then feed into the LDA model. I think I need something along these lines...
val documents: RDD[(Long, Vector)] = corpus.groupBy(_(0)).zipWithIndex
.map(x =>(x._2,Vectors.sparse(vocabMap.size, ???)))
At which point I can run the actual LDA:
val lda = new LDA().setK(n_topics)
val ldaModel = lda.run(documents)
Basically, I don't what function to apply to each group so that I can feed term frequency data (presumably as a map?) into a sparse vector. In other words, how do I fill in the ??? in the code snippet above to achieve the desired effect?
One way to handle this:
make sure that spark-csv package is available
load data into DataFrame and select columns of interest
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true") // Optional, providing schema is prefered
.option("delimiter", "\t")
.load("foo.csv")
.select($"doc".cast("long").alias("doc"), $"term")
index term column:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("term")
.setOutputCol("termIndexed")
val indexed = indexer.fit(df)
.transform(df)
.drop("term")
.withColumn("termIndexed", $"termIndexed".cast("integer"))
.groupBy($"doc", $"termIndexed")
.agg(count(lit(1)).alias("cnt").cast("double"))
convert to PairwiseRDD
import org.apache.spark.sql.Row
val pairs = indexed.map{case Row(doc: Long, term: Int, cnt: Double) =>
(doc, (term, cnt))}
group by doc:
val docs = pairs.groupByKey
create feature vectors
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.max
val n = indexed.select(max($"termIndexed")).first.getInt(0) + 1
val docsWithFeatures = docs.mapValues(vs => Vectors.sparse(n, vs.toSeq))
now you have all you need to create LabeledPoints or apply additional processing
Related
I used Movielens 20 million dataset which contain file called rating .csv(UserId,MovieId,Rating) .I applied Alternating Least Square(ALS) which give output userId,FeatureVector in 10 parquet files . Dimensinality Reduction
I want to make normalize for featureVector using z-score method.
I want to subtract vector(featureVector) from constant scalar 2.484 ,divide value into 1.8305 and save value in parquet files.features column
val df = sqlContext.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
sqlContext.sql("select features from df")
df.withColumn("output", "features" -2.484).show(20)
How to subtract vector from each value of scalar?
if you want to subtract 2.484 from each vector value you can try it
import spark.implicits._
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.rdd.RDD
val df = Seq(Vector(1.0,2.0,2.5,3.0,3.5)).toDF("features")
val rdd: RDD[DenseVector] = df.select('features)
.rdd
.map(s => s.getAs[Seq[Double]]("features").toArray)
.map(s => Vectors.dense(s.map(s => s - 2.3333)).toDense)
rdd.take(10).foreach(println(_))
output:
[-1.3333,-0.33329999999999993,0.16670000000000007,0.6667000000000001,1.1667]
I have written the following code to feed data to a machine learning algorithm in Spark 2.3. The code below runs fine. I need to enhance this code to be able to convert not just 3 columns but any number of columns, uploaded via the csv file. For instance, if I had loaded 5 columns, how can I put them automatically in the Vector.dense command below, or some other way to generate the same end result? Does anyone know how this can be done?
val data2 = spark.read.format("csv").option("header",
"true").load("/data/c7.csv")
val goodBadRecords = data2.map(
row =>{
val n0 = row(0).toString.toLowerCase().toDouble
val n1 = row(1).toString.toLowerCase().toDouble
val n2 = row(2).toString.toLowerCase().toDouble
val n3 = row(3).toString.toLowerCase().toDouble
(n0, Vectors.dense(n1,n2,n3))
}
).toDF("label", "features")
Thanks
Regards,
Adeel
A VectorAssembler can do the job:
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
Based on your code, the solution would look like:
val data2 = spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true") //1
.load("/data/c7.csv")
val fields = data2.schema.fieldNames
val assembler = new VectorAssembler()
.setInputCols(fields.tail) //2
.setOutputCol("features") //3
val goodBadRecords = assembler.transform(data2)
.withColumn("label", col(fields(0))) //4
.drop(fields:_*) //5
Remarks:
A schema is necessary for the input data, as the VectorAssembler only accepts the following input column types: all numeric types, boolean type, and vector type (same link). You seem to have a csv with doubles, so infering the schema should work. But of course, any other method to transform the string data to doubles is also ok.
Use all but the first column as input for the VectorAssembler
Name the result column of the VectorAssembler features
Create a new column called label as copy of the first column
Drop all orginal columns. This last step is optional as the learning algorithm usually only looks at the label and feature column and ignores all other columns
I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates aMap: Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ingore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.iteritems() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs
I am using spark mllib for one of my projects in which I need to calculate document similarities.
I first converted the documents to vectors using tf-idf transform of the mllib, then converted it into RowMatrix and used the columnSimilarities() method.
I referred to tf-idf documentation and used the DIMSUM implementation for cosine similarities.
in spark-shell this is the scala code is executed:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1 in this code block is a simple file with 5 short documents (less than 10 terms each), one on each row.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities() which is in object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to spark documentation for CoordinateMatrix which is the type of object returned by columnSimilarities method of RowMatrix class and referred by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to RowMatrix, then convert the rows of RowMatrix to arrays and then print like this println(sim.toRowMatrix().rows.toArray().mkString("\n")) .
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
Thanks!
You can try the following, no need to convert to row matrix format
val transformedRDD = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
To retrieve the elements you can invoke the following action
transformedRDD.collect()
I am learning how to use Apache Spark and I am trying to get the average temperature from each hour from a data set. The data set that I am trying to use is from weather information stored in a csv. I am having trouble finding how to first read in the csv file and then calculating the average temperature for each hour.
From the spark documentation I am using the example Scala line to read in a file.
val textFile = sc.textFile("README.md")
I have given the link for the data file below. I am using the file called JCMB_2014.csv as it is the latest one with all months covered.
Weather Data
Edit:
The code I have tried so far is:
class SimpleCSVHeader(header:Array[String]) extends Serializable {
val index = header.zipWithIndex.toMap
def apply(array:Array[String], key:String):String = array(index(key))
}
val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header
val header = new SimpleCSVHeader(data.take(1)(0))
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time")
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toInt)
Here is sample code for calculating averages on hourly basis
Step1:Read file, Filter header,extract time and temp columns
scala> val hourlyTemps = lines.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))
(time/60) discards minutes and keeps only hours
Step2:Aggregate temperatures and no of occurrences
scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggreateTemps.take(1)
res26: Array[(Int, (Double, Int))] = Array((34,(8565.25,620)))
Step2:Calculate Averages using total and no of occurrences
Find the final result below.
val avgTemps=aggregateTemps.map(tuple=>(tuple._1,tuple._2._1/tuple._2._2))
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))
You may want to provide Structure definition of your CSV file and convert your RDD to DataFrame, like described in the documentation. Dataframes provide a whole set of useful predefined statistic functions as well as the possibility to write some simple custom functions. You then will be able to compute the average with:
dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>)