How to print Map[String, Array[Float]] in Scala?

I am using the Word2Vec implementation in Spark's MLlib library. I want to print the word vectors that I get as output from the getVectors method.
My code looks like this:
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object word2vec {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("word2vec")
    val sc = new SparkContext(conf)
    val input = sc.textFile("file:///home/snap-01/balance.csv").map(line => line.split(",").toSeq)
    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)
    model.save(sc, "myModelPath")
    val sameModel = Word2VecModel.load(sc, "myModelPath")
    val vec = sameModel.getVectors
    print(vec)
  }
}
I am getting "Map(Balance -> [F@2932e15f)".

Try this:
vec.foreach { case (key, values) =>
  println("key " + key + " - " + values.mkString("-"))
}

Alternatively,
println(vec.mapValues(_.toList))
But keep an eye on the memory required to do so.
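If the vocabulary is large, a small sketch like the following keeps both the output and the memory in check by printing only a sample (the limits 5 and 10 are arbitrary, not part of the original answer):
// Print only the first few entries, and only the first few components of each vector.
// The limits below are arbitrary; adjust them to taste.
vec.take(5).foreach { case (word, vector) =>
  println(s"$word -> ${vector.take(10).mkString(", ")} ...")
}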

Related

Migrate Scala code to a Databricks notebook

I am working to get this code running in a Databricks notebook (it is already tested and working in an IDE), but I cannot get it working once I change the structure of the code.
import java.io.{BufferedReader, InputStreamReader}
import java.io.{File, PrintWriter}
import java.text.SimpleDateFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TestUnit {
  val dateFormat = new SimpleDateFormat("yyyyMMdd")

  case class Averages(cust: String, Num: String, date: String, credit: Double)

  def main(args: Array[String]): Unit = {
    val inputFile = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"
    val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv"
    val fileSystem = getFileSystem(inputFile)
    val inputData = readCSVFileLines(fileSystem, inputFile, skipHeader = true)
      .toSeq
    val filtinp = inputData.filter(x => x.nonEmpty)
      .map(x => x.split(","))
      .map(x => Revenue(x(6), x(5), x(0), x(8).toDouble))
    // Create output writer
    val writer = new PrintWriter(new File(outputFile))
    // Header for output CSV file
    writer.write("Date,customer,number,Credit,Average Credit/SKU\n")
    filtinp.foreach { x =>
      val (com1, avg1) = com1Average(filtermp, x)
      val (com2, avg2) = com2Average(filtermp, x)
    }
    // Write row to output csv file
    writer.write(s"${x.day},${x.customer},${x.number},${x.credit},${avgcredit1},${avgcredit2}\n")
    writer.close() // close the writer
  }
}

How to implement Seq.grouped(size: Int): Seq[Seq[A]] for Dataset in Spark

I want to implement something like the
def grouped(size: Int): Iterator[Repr]
that Seq has, but for Dataset in Spark.
So the input should be ds: Dataset[A] and size: Int, and the output Seq[Dataset[A]], where each Dataset[A] in the output has at most size rows.
How should I proceed? I tried repartition and mapPartitions, but I am not sure where to go from there.
Thank you.
Edit: I found the glom method on RDD, but it produces an RDD[Array[A]]. How do I go from that to the other way around, an Array[RDD[A]]?
Here you go, something like what you want:
/*
{"countries":"pp1"}
{"countries":"pp2"}
{"countries":"pp3"}
{"countries":"pp4"}
{"countries":"pp5"}
{"countries":"pp6"}
{"countries":"pp7"}
*/
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{SparkConf, SparkContext}

object SparkApp extends App {
  override def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local").set("spark.ui.enabled", "false")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val dataFrame: DataFrame = sqlContext.read.json("/data.json")

    val k = 3
    val windowSpec = Window.partitionBy("grouped").orderBy("countries")
    val newDF = dataFrame.withColumn("grouped", lit("grouping"))
    var latestDF = newDF.withColumn("row", row_number() over windowSpec)

    val totalCount = latestDF.count()
    var lowLimit = 0
    var highLimit = lowLimit + k

    while (lowLimit < totalCount) {
      latestDF.where(s"row <= $highLimit and row > $lowLimit").show(false)
      lowLimit = lowLimit + k
      highLimit = highLimit + k
    }
  }
}
Here is the solution I found, but I am not sure whether it works reliably:
override protected def batch[A](
    input: Dataset[A],
    batchSize: Int
): Seq[Dataset[A]] = {
  val count = input.count()
  // Divide as Double first; otherwise count / batchSize is integer division
  // and the ceil is applied to an already-truncated result.
  val partitionQuantity = Math.ceil(count.toDouble / batchSize).toInt
  input.randomSplit(Array.fill(partitionQuantity)(1.0 / partitionQuantity), seed = 0)
}

Passing functions in Spark

This is my idea:
import java.io.File
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object pizD {
  def filePath = {
    new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
  }

  def regex(line: String): pichA = {
    ......
    ......
    pichA(t1, t2)
  }
}

case class pichA(t1: String, t2: String)

object dushP {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val mirdd: RDD[pichA] = ???
}
How do I integrate sc.textFile with my methods filePath and regex? I want to combine them in order to get a new RDD.
val baseRDD = sc.textFile(pizD.filePath).filter { line =>
  val value = pizD.regex(line)
  value != null
}
This assumes pizD.filePath gives you the file name as a string and regex() returns null when the regex doesn't match. If that understanding is correct, then the above code should do the trick.
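If you would rather avoid the null check, a possible variant (a sketch, not the original answer) is to wrap the result in Option and use flatMap, which drops the non-matching lines and yields the RDD[pichA] you wanted for mirdd in one step:
// Option(...) turns a null result from regex into None, so flatMap
// both filters out the non-matching lines and unwraps the matches.
val mirdd: RDD[pichA] =
  sc.textFile(pizD.filePath)
    .flatMap(line => Option(pizD.regex(line)))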

Writing a Spark RDD to an HBase table using Scala

I am trying to write a Spark RDD to an HBase table using Scala (which I haven't used before). The entire code is this:
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.Partitioner
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._

object HBaseWrite {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseWrite").setMaster("local").set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val outputTable = "tablename"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.setInt("hbase.client.scanner.caching", 10000)
    sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))

    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

    val x = 12
    val y = 15
    val z = 25
    var newarray = Array(x, y, z)
    val newrddtohbase = sc.parallelize(newarray)

    def convert(a: Int): Tuple2[ImmutableBytesWritable, Put] = {
      val p = new Put(Bytes.toBytes(a))
      p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_1"), Bytes.toBytes(a))
      new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(a.toString.getBytes()), p)
    }

    new PairRDDFunctions(newrddtohbase.map(convert)).saveAsHadoopDataset(jobConfig)
    sc.stop()
  }
}
The error I get after calling HBaseWrite.main(Array()) is this:
org.apache.spark.SparkException: Task not serializable
How do I proceed to get it done?
The thing you are doing wrong here is defining convert inside main.
If you write the code this way, it may work:
object HBaseWrite {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseWrite").setMaster("local").set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val outputTable = "tablename"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.setInt("hbase.client.scanner.caching", 10000)
    sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))

    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

    val x = 12
    val y = 15
    val z = 25
    var newarray = Array(x, y, z)
    val newrddtohbase = sc.parallelize(newarray)

    val convertFunc = convert _
    new PairRDDFunctions(newrddtohbase.map(convertFunc)).saveAsHadoopDataset(jobConfig)
    sc.stop()
  }

  def convert(a: Int): Tuple2[ImmutableBytesWritable, Put] = {
    val p = new Put(Bytes.toBytes(a))
    p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_1"), Bytes.toBytes(a))
    new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(a.toString.getBytes()), p)
  }
}
P.S.: The code is not tested, but it should work!
For example, the method below takes an Int as argument and returns a Double:
var toDouble: (Int) => Double = a => {
  a.toDouble
}
You can call toDouble(2) and it returns 2.0.
In the same way, you can convert your method to a function literal as below:
val convert: (Int) => Tuple2[ImmutableBytesWritable, Put] = a => {
  val p = new Put(Bytes.toBytes(a))
  p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_1"), Bytes.toBytes(a))
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(a.toString.getBytes()), p)
}
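Either way, the idea is that the conversion logic is a standalone function value rather than a def nested in main, so the closure Spark ships no longer drags in non-serializable state from main. A usage sketch with the RDD from the question:
// `convert` is now a plain function value, so the task closure stays serializable.
val puts = newrddtohbase.map(convert) // RDD[(ImmutableBytesWritable, Put)]
new PairRDDFunctions(puts).saveAsHadoopDataset(jobConfig)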

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
No implicit view means there is no Scala function like this defined:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note this function doesn't work
The reason this error comes up is that your function returns two different types whose common supertype is java.io.Serializable (one is a String, the other an Array[String]). Also, sortByKey for obvious reasons requires the key to have an Ordering. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function just returns Strings instead of two different types, so the keys have an implicit Ordering and sortByKey compiles.
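As a side note (not part of the original answer), sortByKey also takes an ascending flag, so reverse alphabetical order is just:
// Sort the String keys in descending order instead of the default ascending order.
val wordCountsDesc = names.map((_, 1)).reduceByKey(_ + _).sortByKey(ascending = false)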