Spark sorting of delimited data - scala

I am new to Spark. Can you give me any idea what the problem is with the code below?
val rawData="""USA | E001 | ABC DE | 19850607 | IT | $100
UK | E005 | CHAN CL | 19870512 | OP | $200
USA | E003 | XYZ AB | 19890101 | IT | $250
USA | E002 | XYZ AB | 19890705 | IT | $200"""
val sc = ...
val data= rawData.split("\n")
val rdd= sc.parallelize(data)
val data1=rdd.flatMap(line=> line.split(" | "))
val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
data2.saveAsTextFile("./sample_data1_output")
Here, .sortByKey(false) is not working and the compiler gives me this error:
[error] /home/admin/scala/spark-poc/src/main/scala/SparkApp.scala:26: value sortByKey is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
The question is: how do I get a MappedRDD? Or on what object should I call sortByKey()?

Spark provides additional operations, like sortByKey(), on RDDs of pairs. These operations are available through a class called PairRDDFunctions and Spark uses implicit conversions to automatically perform the RDD -> PairRDDFunctions wrapping.
To import the implicit conversions, add the following lines to the top of your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
This is discussed in the Spark programming guide's section on Working with key-value pairs.
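For example, a minimal sketch of the fix (variable names are the ones from the question; it uses map instead of flatMap so each record stays an Array[String], and a plain '|' split with trimming is assumed to be the intended parsing):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // brings the implicit RDD -> PairRDDFunctions conversion into scope
val data1 = rdd.map(line => line.split('|').map(_.trim))
// Key each record by the name field (index 2), then sort descending by key.
val data2 = data1.map(arr => (arr(2), arr.mkString(" | "))).sortByKey(ascending = false)
data2.saveAsTextFile("./sample_data1_output")
In Spark 1.3 and later the explicit SparkContext._ import is usually no longer needed, since the conversions are provided through the RDD companion object, but importing it is harmless.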

Related

How to merge/join Spark/Scala RDD to List so each value in RDD gets a new row with each List item

Let's say I have a List[String] and I want to merge it with an RDD so that each object in the RDD gets each value in the List added to it:
val myBands: List[String] = List("Band1", "Band2")
Table: BandMembers
|name | instrument |
| ----- | ---------- |
| slash | guitar |
| axl | vocals |
case class BandMembers(name: String, instrument: String)
var myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument))
//join the myRDD to myBands
// how do I do this?
//var result = myRdd.join/merge/union(myBands);
Desired result:
|name | instrument | band |
| ----- | ---------- |------|
| slash | guitar | band1|
| slash | guitar | band2|
| axl | vocals | band1|
| axl | vocals | band2|
I'm not quite sure how to go about this in the best way for Spark/Scala. I know I can convert to a DataFrame and then use Spark SQL to do the joins, but there has to be a better way with the RDD and List, or so I think.
The style is a bit off here, but assuming you really need RDDs instead of Datasets:
With an RDD:
case class BandMembers(name: String, instrument: String)
import org.apache.spark.sql.Row
val myRDD = spark.sparkContext.parallelize(BandMembersTable.map(a => BandMembers(a.name, a.instrument)))
val myBands = spark.sparkContext.parallelize(Seq("Band1", "Band2"))
val res = myRDD.cartesian(myBands).map { case (a, b) => Row(a.name, a.instrument, b) }
With a Dataset:
case class BandMembers(name: String, instrument: String)
import spark.implicits._  // required for .toDS
val myDS = BandMembersTable.map(a => BandMembers(a.name, a.instrument)).toDS
val myBands = Seq("Band1", "Band2").toDS
val res = myDS.crossJoin(myBands)
Input data:
val BandMembersTable = Seq(BandMembers("a", "b"), BandMembers("c", "d"))
val myBands = Seq("Band1","Band2")
Output with Dataset:
+----+----------+-----+
|name|instrument|value|
+----+----------+-----+
|a |b |Band1|
|a |b |Band2|
|c |d |Band1|
|c |d |Band2|
+----+----------+-----+
Printed output with the RDD version (these are Rows):
[a,b,Band1]
[c,d,Band2]
[c,d,Band1]
[a,b,Band2]
Consider using RDD zip for this. From the official docs:
RDD<scala.Tuple2<T,U>> zip(RDD<U> other, scala.reflect.ClassTag<U> evidence$11)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition.
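Note, though, that zip pairs elements positionally (first with first, second with second) and requires both RDDs to have the same number of partitions and the same number of elements per partition, so it does not produce the every-member-with-every-band result shown in the question; cartesian (or crossJoin on Datasets) does. A minimal hypothetical sketch of zip, assuming a SparkContext sc and the BandMembers case class from above, in case a positional pairing is what you actually want:
val members = sc.parallelize(Seq(BandMembers("slash", "guitar"), BandMembers("axl", "vocals")), 2)
val bands = sc.parallelize(Seq("Band1", "Band2"), 2)
// Pairs element i of members with element i of bands.
val zipped = members.zip(bands)
zipped.collect().foreach { case (m, b) => println(s"${m.name}, ${m.instrument}, $b") }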

Spark Scala VectorAssembler IllegalArgumentException: XX does not exist. Available: number

I'm using Scala on Spark and I have a dense matrix like this:
vp
res63: org.apache.spark.ml.linalg.DenseMatrix =
-0.26035262239241047 -0.9349256695883184
0.08719326360909431 -0.06393629243008418
0.006698866707269257 0.04124873027993731
0.011979122705128064 -0.005430767154896149
0.049075485175059275 0.04810618828561001
0.001605411530424612 0.015016736357799364
0.9587151228619724 -0.2534046936998956
-0.04668498310146597 0.06015550772431999
-0.022360873382091598 -0.22147143481166376
-0.014153052584280682 -0.025947327705852636
I want to use VectorAssembler to create a features column, so I transform vp:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val c = vp.toArray.toSeq
val vp_df = c.toDF("number")
vp_df.createOrReplaceTempView("vp_df")
val vp_list = vp_df.collect.map(_.toSeq).flatten
val vp_string = vp_list.map(_.toString)
vp_string
res64: Array[String] = Array(-0.26035262239241047, 0.08719326360909431, 0.006698866707269257, 0.011979122705128064, 0.049075485175059275, 0.001605411530424612, 0.9587151228619724, -0.04668498310146597, -0.022360873382091598, -0.014153052584280682, -0.9349256695883184, -0.06393629243008418, 0.04124873027993731, -0.005430767154896149, 0.04810618828561001, 0.015016736357799364, -0.2534046936998956, 0.06015550772431999, -0.22147143481166376, -0.025947327705852636)
Then I use VectorAssembler:
val assembler = new VectorAssembler().setInputCols(vp_string).setOutputCol("features")
val output = assembler.transform(vp_df)
output.select("features").show(false)
But I get an error and I don't understand why:
IllegalArgumentException: -0.26035262239241047 does not exist. Available: number
I don't know how this is possible; I've used VectorAssembler several times and this is the first time I've seen this error.
Your "number" column is already an array of doubles, so all you need is to convert this column into a dense vector:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
val arrayToVectorUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
vp_df.withColumn("vector", arrayToVectorUDF(col("number")))
Update: I misunderstood your code.
The number column is a DoubleType column, so all you need to do is pass the column name to the vector assembler.
import org.apache.spark.ml.linalg.DenseMatrix
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val data = (1 to 20).map(_.toDouble).toArray
val dm = new DenseMatrix(2, 10, data)
val vp_df = dm.toArray.toSeq.toDF("number")
val assembler = new VectorAssembler().setInputCols(Array("number")).setOutputCol("features")
val output = assembler.transform(vp_df)
output.select("features").show(false)
+--------+
|features|
+--------+
|[1.0] |
|[2.0] |
|[3.0] |
|[4.0] |
|[5.0] |
|[6.0] |
|[7.0] |
|[8.0] |
|[9.0] |
|[10.0] |
|[11.0] |
|[12.0] |
|[13.0] |
|[14.0] |
|[15.0] |
|[16.0] |
|[17.0] |
|[18.0] |
|[19.0] |
|[20.0] |
+--------+

Merging rows into a single struct column in spark scala has efficiency problems, how do we do it better?

I am trying to speed up and limit the cost of taking several columns and their values and inserting them into a map in the same row. This is a requirement because we have a legacy system that is reading from this job and it isn't yet ready to be refactored. There is also another map with some data that needs to be combined with this.
Currently we have a few solutions, all of which seem to result in about the same run time on the same cluster with around 1 TB of data stored in Parquet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
def jsonToMap(s: String, map: Map[String, String]): Map[String, String] = {
implicit val formats = org.json4s.DefaultFormats
val jsonMap = if(!s.isEmpty){
parse(s).extract[Map[String, String]]
} else {
Map[String, String]()
}
if(map != null){
map ++ jsonMap
} else {
jsonMap
}
}
val udfJsonToMap = udf(jsonToMap _)
def addMap(key:String, value:String, map: Map[String,String]): Map[String,String] = {
if(map == null) {
Map(key -> value)
} else {
map + (key -> value)
}
}
val addMapUdf = udf(addMap _)
val output = raw.columns.foldLeft(raw.withColumn("allMap", typedLit(Map.empty[String, String]))) { (memoDF, colName) =>
if(colName.startsWith("columnPrefix/")){
memoDF.withColumn("allMap", when(col(colName).isNotNull, addMapUdf(substring_index(lit(colName), "/", -1), col(colName), col("allMap")) ))
} else if(colName.equals("originalMap")){
memoDF.withColumn("allMap", when(col(colName).isNotNull, udfJsonToMap(col(colName), col("allMap"))))
} else {
memoDF
}
}
This takes about 1 hour on 9 m5.xlarge instances.
val resourceTagColumnNames = raw.columns.filter(colName => colName.startsWith("columnPrefix/"))
def structToMap: Row => Map[String,String] = { row =>
row.getValuesMap[String](resourceTagColumnNames)
}
val structToMapUdf = udf(structToMap)
val experiment = raw
.withColumn("allStruct", struct(resourceTagColumnNames.head, resourceTagColumnNames.tail:_*))
.select("allStruct")
.withColumn("allMap", structToMapUdf(col("allStruct")))
.select("allMap")
This also runs in about 1 hour on the same cluster.
This code all works, but it isn't fast enough; it takes about 10 times longer than every other transform we have right now, and it is a bottleneck for us.
Is there another way to get this result that is more efficient?
Edit: I have also tried limiting the data by a key; however, because the values in the columns I am merging can change even though the key remains the same, I cannot limit the data size without risking data loss.
TL;DR: using only Spark SQL built-in functions can significantly speed up the computation.
As explained in this answer, Spark SQL native functions are more performant than user-defined functions, so we can try to implement a solution to your problem using only Spark SQL native functions.
I show two main versions of the implementation: one using all the SQL functions existing in the latest version of Spark available at the time I wrote this answer, which is Spark 3.0, and another using only the SQL functions available when the question was asked, i.e. Spark 2.3 (all the functions used in this version are also available in Spark 2.2).
Spark 3.0 implementation with SQL functions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
val mapFromPrefixedColumns = map_filter(
map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*),
(_, v) => v.isNotNull
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
map()
)
val comprehensiveMapExpr = map_concat(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Spark 2.2 implementation with SQL functions
In Spark 2.2 we don't have the functions map_concat (available since Spark 2.4) and map_filter (available since Spark 3.0), so I replace them with user-defined functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
def filterNull(map: Map[String, String]): Map[String, String] = map.toSeq.filter(_._2 != null).toMap
val filter_null_udf = udf(filterNull _)
def mapConcat(map1: Map[String, String], map2: Map[String, String]): Map[String, String] = map1 ++ map2
val map_concat_udf = udf(mapConcat _)
val mapFromPrefixedColumns = filter_null_udf(
map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*)
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
map()
)
val comprehensiveMapExpr = map_concat_udf(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Implementation with SQL functions, without JSON mapping
The last part of the question contains simplified code without the mapping of the JSON column and without the filtering of null values in the resulting map. I created the following implementation for this specific case. As I don't use any functions that were added between Spark 2.2 and Spark 3.0, I don't need two versions of this implementation:
import org.apache.spark.sql.functions._
val mapFromPrefixedColumns = map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c), col(c))): _*)
raw.withColumn("allMap", mapFromPrefixedColumns)
Run
For the following dataframe as input:
+--------------------+--------------------+--------------------+----------------+
|columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|originalMap |
+--------------------+--------------------+--------------------+----------------+
|a |1 |x |{"column4": "k"}|
|b |null |null |null |
|c |null |null |{} |
|null |null |null |null |
|d |2 |null | |
+--------------------+--------------------+--------------------+----------------+
I obtain the following allMap column:
+--------------------------------------------------------+
|allMap |
+--------------------------------------------------------+
|[column1 -> a, column2 -> 1, column3 -> x, column4 -> k]|
|[column1 -> b] |
|[column1 -> c] |
|[] |
|[column1 -> d, column2 -> 2] |
+--------------------------------------------------------+
And for the mapping without the JSON column:
+---------------------------------------------------------------------------------+
|allMap |
+---------------------------------------------------------------------------------+
|[columnPrefix/column1 -> a, columnPrefix/column2 -> 1, columnPrefix/column3 -> x]|
|[columnPrefix/column1 -> b, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> c, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 ->, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> d, columnPrefix/column2 -> 2, columnPrefix/column3 ->] |
+---------------------------------------------------------------------------------+
Benchmark
I generated an uncompressed CSV file of 10 million lines (about 800 MB), containing one column without the column prefix, nine columns with the column prefix, and one column containing JSON as a string:
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|id |columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|columnPrefix/column4|columnPrefix/column5|columnPrefix/column6|columnPrefix/column7|columnPrefix/column8|columnPrefix/column9|originalMap |
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|1 |iwajedhor |ijoefzi |der |ob |galsu |ril |le |zaahuz |fuzi |{"column10":"true"}|
|2 |ofo |davfiwir |lebfim |roapej |lus |roum |te |javes |karutare |{"column10":"true"}|
|3 |jais |epciel |uv |piubnak |saajo |doke |ber |pi |igzici |{"column10":"true"}|
|4 |agami |zuhepuk |er |pizfe |lafudbo |zan |hoho |terbauv |ma |{"column10":"true"}|
...
The benchmark is to read this CSV file, create the allMap column, and write this column to Parquet. I ran this on my local machine and obtained the following results:
+--------------------------+--------------------+-------------------------+-------------------------+
| implementations | current (with udf) | sql functions spark 3.0 | sql functions spark 2.2 |
+--------------------------+--------------------+-------------------------+-------------------------+
| execution time | 138 seconds | 48 seconds | 82 seconds |
| improvement from current | 0 % faster | 64 % faster | 40 % faster |
+--------------------------+--------------------+-------------------------+-------------------------+
I also ran the benchmark against the second implementation in the question, which drops the mapping of the JSON column and the filtering of null values in the map.
+--------------------------+-----------------------+------------------------------------+
| implementations | current (with struct) | sql functions without json mapping |
+--------------------------+-----------------------+------------------------------------+
| execution time | 46 seconds | 35 seconds |
| improvement from current | 0 % | 23 % faster |
+--------------------------+-----------------------+------------------------------------+
Of course, the benchmark is rather basic, but we can see a clear improvement compared to the implementations that use user-defined functions.
Conclusion
When you have a performance issue and you are using user-defined functions, it can be a good idea to try to replace those user-defined functions with Spark SQL built-in functions.
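As a minimal, generic illustration of the idea (a hypothetical toy example, not the code from the question):
import org.apache.spark.sql.functions.{col, udf, upper}
import spark.implicits._
val df = Seq("alice", "bob").toDF("name")  // toy data
// UDF version: the lambda is opaque to the Catalyst optimizer and forces serialization of values.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val withUdf = df.withColumn("name_upper", upperUdf(col("name")))
// Built-in version: upper() is a native expression that Catalyst can optimize and code-generate.
val withBuiltin = df.withColumn("name_upper", upper(col("name")))
Both produce the same result, but the built-in version generally runs faster for the reasons discussed above.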

Convert ArrayType(FloatType,false) to VectorUDT

I want to perform cluster analysis using K-Means on the item factors produced by ALS. Although ALSModel.itemFactors returns a DataFrame that contains the id and features of each item factor, this data structure seems to be unsuitable for K-Means.
Here's the code for collaborative filtering using ALS:
val als = new ALS()
.setRegParam(0.01)
.setNonnegative(false)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
val predictions = model.transform(testing)
val item_factors = model.itemFactors
item_factors dataframe looks like
+---+-------------------------------------------------------------------------------------------------------------------------------------+
|id |features |
+---+-------------------------------------------------------------------------------------------------------------------------------------+
|10 |[-0.1317064, 0.07098049, -0.042259596, -0.28769347, 0.58783025, -0.33474237, 0.31248248, -0.34541374, 0.33257273, 0.06327486] |
|20 |[-0.0033912044, 0.31334892, -0.080896676, -0.75597364, -0.016326033, -0.34558973, 0.045129072, -0.38614395, -0.02269395, -0.16486467]|
|30 |[0.19784503, -0.313929, -0.67753965, -0.7700008, 0.08975326, -0.03427274, 0.49707127, 0.05604595, 0.078268416, 0.08767615] |
|40 |[0.29390565, -0.22765353, -0.9278744, -0.59953785, 0.184721, -0.061099682, 0.33711356, 0.094112396, 0.08261518, -0.30668002] |
|50 |[-0.4070981, -0.0013739555, -0.21247752, -0.3771588, 0.3029064, -0.3883846, 0.4752892, 0.30097932, 0.5130039, 0.2938855] |
|60 |[0.1413918, -0.074142076, -0.87392575, -0.07855377, -0.11006678, -0.44359666, 0.33419594, -0.16027139, -0.2440797, -0.1596081] |
|70 |[-0.26080364, -0.11437138, 0.046630252, -0.70999575, 0.014645281, -0.69176155, 0.05397229, -0.24038066, -0.429569, 0.5660369] |
|80 |[0.6104476, -0.35322133, -0.80230886, -0.5302148, -0.26538768, -0.25481275, 0.20784922, -0.10604211, 0.26007786, 0.47488773] |
|90 |[0.6976714, -0.5851011, -0.64844996, -0.82472694, 0.102610275, -0.45195442, 0.24074861, 0.2683314, 0.11396688, -0.52693856] |
|100|[-0.11564436, 0.21467225, -0.42873487, -0.54825515, 0.20628366, -0.28728506, 0.18303588, 0.11490151, -0.033433616, -0.08694091] |
|110|[-0.530162, 0.22694068, -0.30889827, -0.091455124, 0.52988344, -0.7247424, 0.029707031, 0.43658048, 0.21511139, -0.22376455] |
|120|[0.59780246, -0.3396686, -0.58882934, -0.11867501, -0.6055776, -0.82480395, -0.22715187, -0.4544479, 0.012708589, -0.22158282] |
|130|[0.9630984, -0.012603591, -0.37178686, -1.0995674, -0.57324636, -0.7460034, 1.2981551, 0.15384857, -1.0350431, -0.58156097] |
|140|[-0.1617866, 0.3927005, -0.26183906, -0.3666182, -0.015750444, -0.28372696, 0.3577147, -0.18155682, 0.22410324, -0.5632848] |
|150|[-0.20490485, 0.37170428, -0.47898963, 0.0686825, 0.31148073, -0.4663402, 0.2088939, -0.0071071014, 0.44748953, 0.0067634075] |
|160|[0.31892687, 0.30109385, -0.036033046, -0.58646286, 0.015361498, -0.5640331, 0.010378816, -0.52527076, -0.20914118, -0.07263985] |
|170|[0.13082151, -0.082676716, 0.15034986, -0.7333888, 0.14089121, -0.34780806, 0.51327425, -0.43825528, 0.2210635, -0.19778338] |
|180|[-0.45791233, -0.64516217, 0.3496911, -0.6879449, 0.11970334, -0.3473338, 0.30204558, -0.18284592, 0.5934964, 0.06711411] |
|190|[0.41464698, 0.04347724, -0.9297292, -1.2885705, -0.5567429, 0.2531382, 0.11184802, -0.46155334, -0.3385828, 0.789031] |
|200|[0.37707302, -0.023397477, -0.47769275, -0.99200153, -0.11546725, -0.125011, -0.07772487, -0.5624814, -0.026348682, -0.33438805] |
+---+-------------------------------------------------------------------------------------------------------------------------------------+
And here is the code for K-Means clustering.
val kmeans = new KMeans().setK(10).setSeed(1L)
val kmeans_model = kmeans.fit(item_factors)
val predictions = kmeans_model.transform(item_factors)
The error I get when the item_factors dataframe is fed into the K-Means is shown below:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(FloatType,false).
You can map the array to a vector:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import spark.implicits._
val itemFactors = model.itemFactors
// Cast the float array to double, then wrap it in a dense ML vector.
val convertUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
val withVector = itemFactors
  .withColumn("features", convertUDF('features.cast("array<double>")))
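With the features column converted, the K-Means code from the question should then work against withVector (a sketch; KMeans reads the column set by setFeaturesCol, which defaults to "features"):
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(10).setSeed(1L)
val kmeansModel = kmeans.fit(withVector)
// transform adds a "prediction" column with the assigned cluster for each item factor.
val clustered = kmeansModel.transform(withVector)
clustered.select("id", "prediction").show(false)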

How to normalize or standardize the data having multiple columns/variables in spark using scala?

I am new to Apache Spark and Scala. I have a data set like this, which I am taking from a CSV file and converting into an RDD using Scala:
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234 | 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
I want to calculate the z-score values, i.e. standardize the data. So I am calculating the z-score for each column and then trying to combine them so that I get a standard scale.
Here is my code for calculating the z-score for the first column:
val scores = sorted.map(_.split(",")(0).toDouble).cache()
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / count)
val zscore = scores.map(x => math.round((x - mean) / stddev))
How do I calculate this for each column? Or is there another way to normalize or standardize the data?
My requirement is to assign a rank (or scale).
Thanks
If you want to standardize the columns, you can use the StandardScaler class from Spark MLlib. The data should be in the form of RDD[Vector], where Vector is part of the MLlib linalg package. You can choose to use the mean, the standard deviation, or both to standardize your data.
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val data = sc.parallelize(Array(
Array(1.0,2.0,3.0),
Array(4.0,5.0,6.0),
Array(7.0,8.0,9.0),
Array(10.0,11.0,12.0)))
// Converting RDD[Array] to RDD[Vectors]
val features = data.map(a => Vectors.dense(a))
// Creating a Scaler model that standardizes with both mean and SD
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
// Scale features using the scaler model
val scaledFeatures = scaler.transform(features)
This scaledFeatures RDD contains the z-scores for all columns.
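As a quick sanity check (a small sketch using the RDD built above), the per-column means should now be approximately 0 and the variances approximately 1:
import org.apache.spark.mllib.stat.Statistics
val summary = Statistics.colStats(scaledFeatures)
println(summary.mean)      // ~[0.0, 0.0, 0.0]
println(summary.variance)  // ~[1.0, 1.0, 1.0]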
Hope this answer helps. Check the Documentation for more info.
You may want to use the code below to perform standard scaling on the required columns. VectorAssembler is used to select the required columns that need to be transformed, and StandardScaler also lets you choose whether to scale using the mean and/or the standard deviation.
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature
import org.apache.spark.ml.feature.StandardScaler
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/hadoop/data/your_dataset.csv")
df.show(Int.MaxValue)
val assembler = new VectorAssembler().setInputCols(Array("recent","Freq","Monitor")).setOutputCol("features")
val transformVector = assembler.transform(df)
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)
val scalerModel = scaler.fit(transformVector)
val scaledData = scalerModel.transform(transformVector)
scaledData.show(20, false)