appending data to hbase table from a spark rdd using scala - scala

I am trying to add data to a HBase Table. I have done the following so far:
def convert (a:Int,s:String) : Tuple2[ImmutableBytesWritable,Put]={
val p = new Put(a.toString.getBytes())
p.add(Bytes.toBytes("ColumnFamily"),Bytes.toBytes("col_2"), s.toString.getBytes())//a.toString.getBytes())
println("the value of a is: " + a)
new Tuple2[ImmutableBytesWritable,Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p);
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, randomstring))).saveAsHadoopDataset(jobConfig)
newrddtohbaseLambda is this:
val x = 12
val y = 15
val z = 25
val newarray = Array(x,y,z)
val newrddtohbaseLambda = sc.parallelize(newarray)
"randomstring" is this
val randomstring = "abc, xyz, dfg"
Now, what this does is, it adds abc,xyz,dfg to rows 12, 15 and 25 after deleting the already present values in these rows. I want that value to be present and add abc,xyz,dfg instead replace. How can I get it done? Any help would be appreciated.

Related

How to set type of dataset when applying transformations and how to implement transformations without using spark.sql.functions._?

I am a beginner for Scala and have been working on the following problem:
Example dataset named as given_dataset with player number and points scored
|player_no| |points|
1 25.0
1 20.0
1 21.0
2 15.0
2 18.0
3 24.0
3 25.0
3 29.0
Problem 1:
I have a dataset and need to calculate total points scored, average points per game, and number of games played. I am unable to explicitly set the data type to "double", "int", "float", when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help on this and correct my error?
No data type specified (but code is able to run)
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the library spark.sql.functions e.g. through functions such as map, groupByKey etc. If possible, could anyone provide an example for this and point me towards the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType} then you have to cast it either at the time of reading or using as[(Int, Double)] in the dataset. Below is the example while reading from CSV file for your dataset:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
// Split by commas
val fields = line.split(",")
// Extract the player_no and points fields, and convert to integer & float
val player_no = fields(0).toInt
val points = fields(1).toFloat
// Create a tuple that is our result.
(player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 =
dataset
.mapValues(x => (x, 1))
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
.mapValues(x => x._1 / x._2)
.sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I locally tried running both ways and both are working fine, we can check the below output also:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
For details you can see it on mysql page
Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")

Finding average of values against key using RDD in Spark

I have created RDD with first column is Key and rest of columns are values against that key. Every row has a unique key. I want to find average of values against every key. I created Key value pair and tried following code but it is not producing desired results. My code is here.
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want result to be displayed like
1 => 1.1 ,
2 => 2.7
and so on for all keys
First I am not sure what this line is doing,
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this,
lazy val li = li2 zip li1
This will create the List of tuples of the type (Int, List[Double]).
And the solution to find the average values against keys could be as below,
rdd.map{ x => (x._1, x._2.fold(0.0)(_ + _)/x._2.length) }.collect.foreach(x => println(x._1+" => "+x._2))

add columns in dataframes dynamically with column names as elements in List

I have List[N] like below
val check = List ("a","b","c","d")
where N can be any number of elements.
I have a dataframe with only column called "value". Based on the contents of value i need to create N columns with column names as elements in the list and column contents as substring(x,y)
I have tried all possible ways, like withColumn, selectExpr, nothing works.
Please consider substring(X,Y) where X and Y as some numbers based on some metadata
Below are my different codes which I tried, but none worked,
val df = sqlContext.read.text("xxxxx")
val coder: (String => String) = (arg: String) => {
val param = "NULL"
if (arg.length() > Y )
arg.substring(X,Y)
else
val sqlfunc = udf(coder)
val check = List ("a","b","c","d")
for (name <- check){val testDF2 = df.withColumn(name, sqlfunc(df("value")))}
testDF2 has only last column d and other columns such as a,b,c are not added in table
var z:Array[String] = new Array[String](check.size)
var i=0
for ( x <- check ) {
if ( (i+1) == check.size) {
z(i) = s""""substring(a.value,X,Y) as $x""""
i = i+1}
else{
z(i) = s""""substring(a.value,X,Y) as $x","""
i = i+1}}
val zz = z.mkString(" ")
df.alias("a").selectExpr(s"$zz").show()
This throws error
Please help how to add columns in DF dynamically with column names as elements in List
I am expecting an Df like below
-----------------------------
Value| a | b | c | d | .... N
-----------------------------
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
-----------------------------
you can dynamically add columns from your list using for instance this answer by user6910411 to a similar question (see her/his full answer for more possibilities):
val newDF = check.foldLeft(<yourdf>)((df, name) => df.withColumn(name,<yourUDF>$"value"))

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In this below code snippet, there are many empty data frames in twoColDF. Also another question for the below for loop, is there a way that I can make this efficient? I tried rewriting this to below line but didn't work
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2 )
{
val i = 0
for (j <- i + 1 until colCount) {
twoColDF = groupCount(j).map(y => {
finalDF.map(x => x.filter(df(cols(j)) === y))
})
}
}finalDF = twoColDF.flatten
Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove use of vars and imperative conditionals). If I'm guessing the meaning of your inputs, here's something that might be equivalent to what you're trying to do:
var input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
df <- input
index <- groupCount.keySet
value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())

How to take the first element of each element of a RDD[Double,Double] and create a diagonal matrix with it?

I'm working on Spark in Scala and I want to transform
Array[(Double, Double)] = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0), (0.6093264774118677,1.0), (0.19736451516248585,0.0), (0.9952925254744414,1.0), (0.6920511147023924,0.0...
into something like
Array[Double]=Array(0.9398785848878621, 0.25788885483788343, 0.6093264774118677, 0.19736451516248585, 0.9952925254744414 , 0.6920511147023924 ...
How can I do it?
Then how can I use this Array[Double] to create a diagonal matrix ?
Just take the first part of your tuple :
val a = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0))
val result = a.map(_._1)
Try this:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val a = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0), ...)
val res1 = a.map(_._1)
val entries = New Array[Double](res1.length)
for (i <- 0 to res1.length - 1){
entries(i) = MatrixEntry(i,i,res1(i))
}
val mat = CoordinateMatrix(sc.parallelize(res1))