How to create a simple DataFrame with random values - scala

I am trying to create a very simple DataFrame with, for example, 3 columns and 3 rows.
I would like to have something like this:
+------+---+-----+
|nameID|age| Code|
+------+---+-----+
|  2123| 80| 4553|
| 65435| 10| 5454|
+------+---+-----+
How can I create that DataFrame in Scala? (This is just an example.)
I have the following program:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object ejemploApp extends App {
  val schema = StructType(List(
    StructField("name", LongType, true),
    StructField("pandas", LongType, true),
    StructField("id", LongType, true)))
}
val outputDF = sqlContext.createDataFrame(sc.emptyRDD, schema)
First problem:
The outputDF line throws an error: it cannot resolve the symbol schema.
Second problem:
How can I add random numbers to the DataFrame?
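As an aside on the first problem: the createDataFrame call sits after the object's closing brace, so schema is out of scope there. Below is a minimal sketch of the corrected scoping, assuming a local Spark 2.x SparkSession (the master/appName settings are illustrative):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object ejemploApp extends App {
  val spark = SparkSession.builder().master("local[*]").appName("ejemplo").getOrCreate()

  val schema = StructType(List(
    StructField("name", LongType, true),
    StructField("pandas", LongType, true),
    StructField("id", LongType, true)))

  // inside the object, `schema` is in scope and the "cannot resolve symbol" error goes away
  val outputDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
}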

I would do something like this:
val nRows = 10
import scala.util.Random

val df = (1 to nRows)
  .map(_ => (Random.nextLong, Random.nextLong, Random.nextLong))
  .toDF("nameID", "age", "Code")
+--------------------+--------------------+--------------------+
| nameID| age| Code|
+--------------------+--------------------+--------------------+
| 5805854653225159387|-1935762756694500432| 1365584391661863428|
| 4308593891267308529|-1117998169834014611| 366909655761037357|
|-6520321841013405169| 7356990033384276746| 8550003986994046206|
| 6170542655098268888| 1233932617279686622| 7981198094124185898|
|-1561157245868690538| 1971758588103543208| 6200768383342183492|
|-8160793384374349276|-6034724682920319632| 6217989507468659178|
| 4650572689743320451| 4798386671229558363|-4267909744532591495|
| 1769492191639599804| 7162442036876679637|-4756245365203453621|
| 6677455911726550485| 8804868511911711123|-1154102864413343257|
|-2910665375162165247|-7992219570728643493|-3903787317589941578|
+--------------------+--------------------+--------------------+
Of course, the age isn't very realistic, but you can shape the random numbers as you wish (e.g. using Scala's modulo operator and absolute value). For example:
val df = (1 to nRows)
  .map(id => (id.toLong, Math.abs(Random.nextLong % 100L), Random.nextLong))
  .toDF("nameID", "age", "Code")
+------+---+--------------------+
|nameID|age| Code|
+------+---+--------------------+
| 1| 17| 7143235115334699492|
| 2| 83|-3863778506510275412|
| 3| 31|-3839786144396379186|
| 4| 40| 8057989112338559775|
| 5| 67| 7601061291211506255|
| 6| 71| 7393782421106239325|
| 7| 43| 28349510524075085|
| 8| 50| 539042255545625624|
| 9| 41|-8654000375112432924|
| 10| 82|-1300111870445007499|
+------+---+--------------------+
EDIT: make sure you have the implicits imported:
Spark 1.6:
import sqlContext.implicits._
Spark 2:
import sparkSession.implicits._

Related

base64 decoding of a dataframe

I have an encoded DataFrame and I managed to decode it using the following code in PySpark. Is there a simple way to add the decoded values as an additional column in the DataFrame itself, using Scala or PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
encodedColumn = base64.decodestring(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn, dtype='<f4')
I looked up multiple similar questions, but couldn't get them to work.
Edit:
Got it working with help from a colleague.
import java.nio.{ByteBuffer, ByteOrder}
import java.util.Base64
import org.apache.spark.sql.functions.{col, udf}

def binaryToFloatArray(stringValue: String): Array[Float] = {
  val t: Array[Byte] = Base64.getDecoder().decode(stringValue)
  val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val copy = new Array[Float](2048)
  b.get(copy)
  copy
}

val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
You have base64 and unbase64 functions for this.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64
You could do the following:
from pyspark.sql.functions import unbase64,base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
+---+------+
| id| name|
+---+------+
| 1| Jon|
| 2| Danny|
| 3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64_name', base64(got.name))
+---+------+-------------------+
| id| name|encoded_base64_name|
+---+------+-------------------+
| 1| Jon| Sm9u|
| 2| Danny| RGFubnk=|
| 3|Tyrion| VHlyaW9u|
+---+------+-------------------+
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64_name).cast("string"))
# Need to use cast("string") to convert from binary to string
+---+------+--------------+--------------+
| id| name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
| 1| Jon| Sm9u| Jon|
| 2| Danny| RGFubnk=| Danny|
| 3|Tyrion| VHlyaW9u| Tyrion|
+---+------+--------------+--------------+
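Since the question also mentions Scala, here is a hedged sketch of the same approach with the Scala API; base64 and unbase64 live in org.apache.spark.sql.functions, and the snippet assumes a spark-shell style session where spark.implicits._ is available:
import org.apache.spark.sql.functions.{base64, col, unbase64}
import spark.implicits._

val got = Seq((1, "Jon"), (2, "Danny"), (3, "Tyrion")).toDF("id", "name")

val encoded = got.withColumn("encoded_base64_name", base64(col("name")))

// unbase64 returns BinaryType, so cast to string to get readable text back
val decoded = encoded.withColumn(
  "decoded_base64", unbase64(col("encoded_base64_name")).cast("string"))

decoded.show()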

How to point or select a cell in a dataframe, Spark - Scala

I want to find the time difference of 2 cells.
With arrays in Python I would do a for loop computing st[i+1] - st[i] and store the results somewhere.
I have this dataframe sorted by time. How can I do it with Spark 2 or Scala? Pseudo-code is enough.
+-----+----+
|   st|name|
+-----+----+
|15:30| dog|
|15:32| dog|
|18:33| dog|
|18:34| dog|
+-----+----+
If the sliding diffs are to be computed per partition by name, I would use the lag() Window function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  ("a", 100), ("a", 120),
  ("b", 200), ("b", 240), ("b", 270)
).toDF("name", "value")

val window = Window.partitionBy($"name").orderBy("value")

df.
  withColumn("diff", $"value" - lag($"value", 1).over(window)).
  na.fill(0).
  orderBy("name", "value").
  show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 0|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
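For reference, a sketch of the same lag() idea applied to the HH:mm strings from the question, converting them to minutes first; the column names st and name come from the question, and the snippet assumes a spark-shell style session where spark.implicits._ is in scope:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val events = Seq(
  ("15:30", "dog"), ("15:32", "dog"), ("18:33", "dog"), ("18:34", "dog")
).toDF("st", "name")

// convert "HH:mm" into minutes since midnight so the diff is a plain subtraction
val minutes = expr("cast(split(st, ':')[0] as int) * 60 + cast(split(st, ':')[1] as int)")

val byName = Window.partitionBy($"name").orderBy(minutes)
events
  .withColumn("diffMinutes", minutes - lag(minutes, 1).over(byName))
  .na.fill(0)
  .show()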
On the other hand, if the sliding diffs are to be computed across the entire dataset, a Window function without a partition wouldn't scale, hence I would resort to RDD's sliding() function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = df.rdd
val diffRDD = rdd.sliding(2).
  map { case Array(x, y) => Row(y.getString(0), y.getInt(1), y.getInt(1) - x.getInt(1)) }
val headRDD = sc.parallelize(Seq(Row.fromSeq(rdd.first.toSeq :+ 0)))
val headDF = spark.createDataFrame(headRDD, df.schema.add("diff", IntegerType))
val diffDF = spark.createDataFrame(diffRDD, df.schema.add("diff", IntegerType))
val resultDF = headDF union diffDF
resultDF.show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 80|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
Something like:
object Data1 {
  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  Logger.getLogger("org").setLevel(Level.OFF)
  Logger.getLogger("akka").setLevel(Level.OFF)

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Test")
        .master("local[1]")
        .getOrCreate()

    import org.apache.spark.sql.functions.col

    val rows = Seq(Row(1, 1), Row(1, 1), Row(1, 1))
    val schema = List(StructField("int1", IntegerType, true), StructField("int2", IntegerType, true))

    val someDF = spark.createDataFrame(
      spark.sparkContext.parallelize(rows),
      StructType(schema)
    )

    someDF.withColumn("diff", col("int1") - col("int2")).show()
  }
}
gives
+----+----+----+
|int1|int2|diff|
+----+----+----+
| 1| 1| 0|
| 1| 1| 0|
| 1| 1| 0|
+----+----+----+
If you are specifically looking to diff adjacent elements in a collection, then in Scala I would zip the collection with its tail to give a collection of tuples of adjacent pairs.
Unfortunately there isn't a tail method on RDDs or DataFrames/Datasets.
You could do something like:
val a = myDF.rdd
// zipWithIndex pairs each row with its index, so keep everything after the first row
val tail = myDF.rdd.zipWithIndex.collect {
  case (v, index) if index >= 1 => v
}
// note: RDD.zip requires both RDDs to have the same partitioning and element counts
a.zip(tail).map { case (l, r) => /* diff the st column of l and r */ }.collect
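Alternatively, here is a hedged sketch that lines the rows up explicitly by keying both RDDs on an index and joining, using the name/value toy DataFrame df from the earlier answer; the variable names are illustrative:
// key every row by its position, then shift the keys by one so that
// row i and row i+1 meet under the same key in the join
// note: this assumes the DataFrame's row order is the order you want to diff in
val indexed = df.rdd.zipWithIndex.map { case (row, i) => (i, row) }
val shifted = indexed.map { case (i, row) => (i - 1, row) }

val adjacentDiffs = indexed.join(shifted).map {
  case (_, (current, next)) => (next.getString(0), next.getInt(1) - current.getInt(1))
}
adjacentDiffs.collect.foreach(println)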

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a Spark SQL dataset whose schema is defined as follows:
User_id <String> | Item_id <String> | Bought_Status <Boolean>
I would like to convert this to a sparse matrix to apply recommender-system algorithms. This is a very large dataset, so I read that CoordinateMatrix is the right way to create a sparse matrix out of it.
However, I got stuck at the point where the API doc says that an RDD[MatrixEntry] is mandatory to create a CoordinateMatrix. Also, MatrixEntry needs a format of int, int, long.
I am not able to convert my data schema to this format. Can you please help me convert this data to a sparse matrix in Spark? I am currently programming in Scala.
Please note that MatrixEntry is of type (Long, Long, Double).
Reference: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry
Also, as the user/item columns are strings, they need to be indexed before processing. Here is how you can create a CoordinateMatrix with Scala:
//Imports needed
scala> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
scala> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer
//Let's create a dummy dataframe
scala> val df = spark.sparkContext.parallelize(List(
| ("u1","i1" ,true),
| ("u1","i2" ,true),
| ("u2","i3" ,false),
| ("u2","i4" ,false),
| ("u3","i1" ,true),
| ("u3","i3" ,true),
| ("u4","i3" ,false),
| ("u4","i4" ,false))).toDF("user","item","bought")
df: org.apache.spark.sql.DataFrame = [user: string, item: string ... 1 more field]
scala> df.show
+----+----+------+
|user|item|bought|
+----+----+------+
| u1| i1| true|
| u1| i2| true|
| u2| i3| false|
| u2| i4| false|
| u3| i1| true|
| u3| i3| true|
| u4| i3| false|
| u4| i4| false|
+----+----+------+
//Index user/ item columns
scala> val indexer1 = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_2de8d35b8301
scala> val indexed1 = indexer1.fit(df).transform(df)
indexed1: org.apache.spark.sql.DataFrame = [user: string, item: string ... 2 more fields]
scala> val indexer2 = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_493ce45dbec3
scala> val indexed2 = indexer2.fit(indexed1).transform(indexed1)
indexed2: org.apache.spark.sql.DataFrame = [user: string, item: string ... 3 more fields]
scala> val tempDF = indexed2.withColumn("userIndex",indexed2("userIndex").cast("long")).withColumn("itemIndex",indexed2("itemIndex").cast("long")).withColumn("bought",indexed2("bought").cast("double")).select("userIndex","itemIndex","bought")
tempDF: org.apache.spark.sql.DataFrame = [userIndex: bigint, itemIndex: bigint ... 1 more field]
scala> tempDF.show
+---------+---------+------+
|userIndex|itemIndex|bought|
+---------+---------+------+
| 0| 1| 1.0|
| 0| 3| 1.0|
| 1| 0| 0.0|
| 1| 2| 0.0|
| 2| 1| 1.0|
| 2| 0| 1.0|
| 3| 0| 0.0|
| 3| 2| 0.0|
+---------+---------+------+
//Create coordinate matrix of size 4*4
scala> val corMat = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0),m.getLong(1),m.getDouble(2))), 4, 4)
corMat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@16be6b36
//Check the content of coordinate matrix
scala> corMat.entries.collect
res2: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0,1,1.0), MatrixEntry(0,3,1.0), MatrixEntry(1,0,0.0), MatrixEntry(1,2,0.0), MatrixEntry(2,1,1.0), MatrixEntry(2,0,1.0), MatrixEntry(3,0,0.0), MatrixEntry(3,2,0.0))
Hope this helps!

Writing Spark UDAFs in Scala to return Array type as output

I have a dataframe as below -
val myDF = Seq(
(1,"A",100),
(1,"E",300),
(1,"B",200),
(2,"A",200),
(2,"C",300),
(2,"D",100)
).toDF("id","channel","time")
myDF.show()
+---+-------+----+
| id|channel|time|
+---+-------+----+
| 1| A| 100|
| 1| E| 300|
| 1| B| 200|
| 2| A| 200|
| 2| C| 300|
| 2| D| 100|
+---+-------+----+
For each id, I want the channels sorted by time in ascending order. I want to implement a UDAF for this logic.
I would like to call this UDAF as:
scala > spark.sql("""select customerid , myUDAF(customerid,channel,time) group by customerid """).show()
Output dataframe should look like:
+---+-------+
| id|channel|
+---+-------+
| 1|[A,B,E]|
| 2|[D,A,C]|
+---+-------+
I am trying to write a UDAF but am unable to implement it:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class myUDAF extends UserDefinedAggregateFunction {
  // These are the input fields for the aggregate function
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(
      StructField("id", IntegerType) ::
      StructField("channel", StringType) ::
      StructField("time", IntegerType) :: Nil
    )

  // These are the internal fields kept for computing the aggregate output
  override def bufferSchema: StructType =
    StructType(
      StructField("Sequence", ArrayType(StringType)) :: Nil
    )

  // This is the output type of the aggregate function
  override def dataType: DataType = ArrayType(StringType)

  // no comments here
  override def deterministic: Boolean = true

  // initialize
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq("")
  }
}
Please help.
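For reference, here is a hedged sketch of one way the missing update, merge and evaluate methods could be filled in, assuming the Spark 2.x UserDefinedAggregateFunction API; the class name, the "time|channel" string buffer (chosen to avoid a nested struct buffer), and the temp view name myTable in the usage comment are all illustrative:
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class ChannelsByTime extends UserDefinedAggregateFunction {
  // the group-by key (id) is not needed inside the aggregate, only channel and time
  override def inputSchema: StructType = StructType(
    StructField("channel", StringType) ::
    StructField("time", IntegerType) :: Nil)

  // keep "time|channel" strings in the buffer to avoid a nested struct buffer
  override def bufferSchema: StructType = StructType(
    StructField("pairs", ArrayType(StringType)) :: Nil)

  override def dataType: DataType = ArrayType(StringType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[String]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0) && !input.isNullAt(1))
      buffer(0) = buffer.getSeq[String](0) :+ s"${input.getInt(1)}|${input.getString(0)}"

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)

  // sort by the numeric time prefix, then return only the channel names
  override def evaluate(buffer: Row): Any =
    buffer.getSeq[String](0)
      .map(_.split("\\|", 2))
      .sortBy(_(0).toInt)
      .map(_(1))
}

// usage sketch:
// spark.udf.register("myUDAF", new ChannelsByTime)
// myDF.createOrReplaceTempView("myTable")
// spark.sql("select id, myUDAF(channel, time) as channel from myTable group by id").show()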
This will do it (no need to define your own UDF):
import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

myDF.groupBy("id")
  .agg(sort_array(collect_list(   // NOTE: sort based on the first element of the struct
    struct("time", "channel"))).as("stuff"))
  .select("id", "stuff.channel")
  .show(false)
+---+---------+
|id |channel |
+---+---------+
|1 |[A, B, E]|
|2 |[D, A, C]|
+---+---------+
I would not write a UDAF for that. In my experience UDAFs are rather slow, especially with complex types. I would use the collect_list & UDF approach:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, struct, udf}

val sortByTime = udf((rws: Seq[Row]) => rws.sortBy(_.getInt(0)).map(_.getString(1)))

myDF
  .groupBy($"id")
  .agg(collect_list(struct($"time", $"channel")).as("channel"))
  .withColumn("channel", sortByTime($"channel"))
  .show()
+---+---------+
| id| channel|
+---+---------+
| 1|[A, B, E]|
| 2|[D, A, C]|
+---+---------+
A much simpler way, without a UDF:
import org.apache.spark.sql.functions._
myDF.orderBy($"time".asc).groupBy($"id").agg(collect_list($"channel") as "channel").show()

Updating a column of a dataframe based on another column value

I'm trying to update the value of a column using another column's value in Scala.
This is the data in my dataframe :
+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
| 1| 0| 0| Name| 0|Desc| | 0|
| 2| 2.11| 10000|Juice| 0| XYZ|2016/12/31 : Inco...| 0|
| 3|-0.500|-24.12|Fruit| -255| ABC| 1994-11-21 00:00:00| 0|
| 4| 0.087| 1222|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5| 0.087| 1222|Bread|-22.06| | | 0|
+-------------------+------+------+-----+------+----+--------------------+-----------+
Here column _c5 contains a value which is incorrect (the value in row 2 contains the string Incorrect), based on which I'd like to update its isBadRecord field to 1.
Is there a way to update this field?
You can use the withColumn API with a function that meets your needs to fill in 1 for a bad record.
For your case you can write a UDF:
def fillbad = udf((c5 : String) => if(c5.contains("Incorrect")) 1 else 0)
And call it as
val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
Instead of reasoning about updating it, I'll suggest you think about it as you would in SQL; you could do the following:
import org.apache.spark.sql.functions.when
val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._
df.select(
  $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
  $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste on your Spark shell to see the result locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.when

sc.setLogLevel("ERROR")

val schema =
  StructType(Seq(
    StructField("UniqueRowIdentifier", IntegerType),
    StructField("_c0", DoubleType),
    StructField("_c1", DoubleType),
    StructField("_c2", StringType),
    StructField("_c3", DoubleType),
    StructField("_c4", StringType),
    StructField("_c5", StringType),
    StructField("isBadRecord", IntegerType)))

val contents =
  Seq(
    Row(1, 0.0  , 0.0    , "Name" , 0.0   , "Desc", ""                      , 0),
    Row(2, 2.11 , 10000.0, "Juice", 0.0   , "XYZ" , "2016/12/31 : Incorrect", 0),
    Row(3, -0.5 , -24.12 , "Fruit", -255.0, "ABC" , "1994-11-21 00:00:00"   , 0),
    Row(4, 0.087, 1222.0 , "Bread", -22.06, ""    , "2017-02-14 00:00:00"   , 0),
    Row(5, 0.087, 1222.0 , "Bread", -22.06, ""    , ""                      , 0)
  )

val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()

val withBadRecords =
  df.select(
    $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
    $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")

withBadRecords.show()
The relevant output is the following:
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 0|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 1|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
The best option is to create a UDF that tries to convert the value to a Date.
If it can be converted, return 0; otherwise return 1.
This works even if you have a bad date format.
import java.text.SimpleDateFormat
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()
import spark.implicits._

// create test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1, "1994-11-21 Xyz"),
  (2, "1994-11-21 00:00:00"),
  (3, "1994-11-21 00:00:00")
)).toDF("id", "date")

// create udf which tries to convert to date format
// returns 0 on success and 1 on failure
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(d) => 0
    case Failure(e) => 1
  }
})

// Add column
data.withColumn("badData", check($"date")).show
Hope this helps!
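As an aside, a similar flag can be produced without a UDF by using the built-in to_timestamp, which yields null when parsing fails; this is a sketch assuming Spark 2.2+ and the column names from the example above:
import org.apache.spark.sql.functions.{col, to_timestamp, when}

// to_timestamp returns null when the string does not match the pattern,
// so a null check marks the bad rows without a UDF (available since Spark 2.2)
data.withColumn(
  "badData",
  when(to_timestamp(col("date"), "yyyy-MM-dd HH:mm:ss").isNull, 1).otherwise(0)
).show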