This is my first time using Spark or Scala, so I am a newbie. I have a 2D array that I need to convert to a DataFrame. The sample data is a joined table in the form of a rectangle (a double), a point (a, b), also doubles, and a boolean for whether or not the point lies within the rectangle. My end goal is to return a DataFrame with the name of each rectangle and how many times it appears where ST_Contains is true. Since the query returns all the instances where it is true, I am simply trying to sort by rectangle (they are named as doubles) and count each occurrence. I put that in an array and then try to convert it to a dataset. Here is some of my code and what I have tried:
// Join two datasets (not my code)
spark.udf.register("ST_Contains",(queryRectangle:String, pointString:String)=>(HotzoneUtils.ST_Contains(queryRectangle, pointString)))
val joinDf = spark.sql("select rectangle._c0 as rectangle, point._c5 as point from rectangle,point where ST_Contains(rectangle._c0,point._c5)")
joinDf.createOrReplaceTempView("joinResult")
// MY CODE
// above join gets a view with rectangle, point, and true. so I need to loop through and count how many for each rectangle
//sort by rectangle asc first
joinDf.orderBy("rectangle")
var a = Array.ofDim[String](1, 2)
for (row <- joinDf.rdd.collect){
  var count = 1
  var previous_r = -1.0
  var r = row.mkString(",").split(",")(0).toDouble
  var p = row.mkString(",").split(",")(1).toDouble
  var c = row.mkString(",").split(",")(2).toDouble
  if (previous_r != -1){
    if (previous_r == r){
      //add another to the count
      count = count + 1
    }
    else{
      //stick the result in an array
      a ++= Array(Array(previous_r.toString, count.toString))
    }
  }
  previous_r = r
}
//create dataframe from array and return it
val df = spark.createDataFrame(a).toDF()
But I keep getting this error:
inferred type arguments [Array[String]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val df = spark.createDataFrame(a).toDF()
I also tried it without the .toDF() portion and still no luck. I tried it without the createDataFrame command and just the .toDF, but that did not work either.
A few things here:
createDataFrame has multiple variations and the one you end up trying is probably:
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
An Array[String] is not a Seq[A <: Product]: String is not a Product.
The fastest approach I can think of is to go through a Seq and then to a DataFrame:
import spark.implicits._
Array("some string")
.toSeq
.toDF
Or parallelize the Array[String] into an RDD[String] and then create the DataFrame, for example:
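A minimal sketch of that second route (my illustration, with spark.implicits._ in scope as above):
val rdd = spark.sparkContext.parallelize(Array("some string"))  // RDD[String]
val df = rdd.toDF("value")                                      // DataFrame with a single string column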
That second toDF() adds no value; createDataFrame already returns a DataFrame (if it worked).
Related
I have an assignment where I need to load a csv dataset in a spark-shell using spark.read.csv(), and accomplish the following:
Convert the dataset to RDD
Remove the heading (first record (line) in the dataset)
Convert the first two fields to integers
Convert other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
I was able to do steps 1 and 2 with the following code:
//load the dataset as an RDD
val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23
dataRDD.count() //output 574914
//import Row since RDD is of Row
import org.apache.spark.sql.Row
//function to recognize if a string contains "id_1"
def isHeader(r : Row) = r.toString.contains("id_1")
//filter function will take the !isHeader function and apply it to all lines in dataRDD and the return will form another RDD
val nohead = dataRDD.filter(x => !isHeader(x))
nohead.count() //output is now 574913
nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]
nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28
I'm trying to convert the fields but every time I use a function like toDouble I get an error stating not a member of:
<console>:25: error: value toDouble is not a member of org.apache.spark.sql.Row
if ("?".equals(s)) Double.NaN else s.toDouble
I'm not sure what I'm doing wrong and I've taken a look at the website https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull()
but I still don't know what I'm doing wrong.
I'm not sure how to convert something if there isn't a toDouble, toInt, or toBoolean function.
Can someone please guide me in the right direction to figure out what I'm doing wrong? Where can I possibly look for an answer? I need to convert the first two fields to integers and the other fields except for the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
Convert the first two fields to integers
Convert other fields except the last one to doubles. Questions marks should be NaN. The last field should be converted to a Boolean.
You can do both steps 3 and 4 at once using a parse function.
First create the toDouble function since it is used in the parse function:
def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}
After you do this, you can call parse on each row in your RDD using map; however, you still have the type issue. To fix this, you could convert nohead from an RDD[Row] to an RDD[String]; however, it's probably easier to just convert the row to a string as you pass it:
val parsed = nohead.map(line => parse(line.mkString(",")))
This will give parsed as type: RDD[(Int, Int, Array[Double], Boolean)]
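As a quick sanity check (my own illustration, assuming the data looks like the sample line shown above):
val first = parsed.first()  // (Int, Int, Array[Double], Boolean)
first._1                    // 37291
first._3.take(2)            // Array(0.833333333333333, NaN)
first._4                    // true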
I have a data frame doubleSeq whose structure is as below
res274: org.apache.spark.sql.DataFrame = [finalFeatures: vector]
The first record of the column is as follows
res281: org.apache.spark.sql.Row = [[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]]
I want to extract the double array
[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
from this -
doubleSeq.head(1)(0)(0)
gives
Any = [3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
Which is not solving my problem
Scala Spark - split vector column into separate columns in a Spark DataFrame
It does not solve my issue, but it's an indicator.
So you want to extract a Vector from a Row, and turn it into an array of doubles.
The problem with your code is that the get method (and the implicit apply method you are using) returns an object of type Any. Indeed, a Row is a generic, unparametrized object and there is no way to know at compile time what types it contains. It's a bit like Lists in Java 1.4 and before. To solve it in Spark, you can use the getAs method, which you can parametrize with a type of your choosing.
In your situation, you seem to have a dataframe containing a vector (org.apache.spark.ml.linalg.Vector).
import org.apache.spark.ml.linalg._
val firstRow = df.head(1)(0) // or simply df.head
val vect : Vector = firstRow.getAs[Vector](0)
// or all in one: df.head.getAs[Vector](0)
// to transform into a regular array
val array : Array[Double] = vect.toArray
Note also that you can access columns by name like this:
val vect : Vector = firstRow.getAs[Vector]("finalFeatures")
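If you need the arrays for every row rather than just the first one, a similar sketch works (my addition; it collects to the driver, so only for data that fits in memory):
val allArrays: Array[Array[Double]] =
  df.select("finalFeatures").collect().map(_.getAs[Vector](0).toArray)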
I have an RDD that I am trying to filter for only float type. Do Spark RDDs provide any way of doing this?
I have a csv where I need only float values greater than 40 into a new RDD. To achieve this, I am checking if it is an instance of type float and filtering them. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try catch block.
Since you have a plain string and you are trying to get float values from it, you are not actually filtering by type, but by whether the values can be parsed to a Float.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either, perform another filter after or filter the inner Option.
(Both should perform more or less equally due to Spark's laziness, so choose the one that is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
Let me know if you have any questions.
I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
<console>:26: error: value zipWithIndex is not a member of org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, essentially getting the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I am getting stuck on the last part, as I am not sure how to do the same as zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
The important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any] and the resulting RDD is RDD[(Any, Int)].
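If you also want the column name next to each value, here is one sketch (my addition, assuming dataSplit is the DataFrame loaded from the parquet file):
val fieldNames = dataSplit.schema.fieldNames  // Array[String], cheap to capture in the closure
val withNames = dataSplit.rdd.flatMap { r =>
  r.toSeq.zipWithIndex.map { case (value, idx) => (fieldNames(idx), idx, value) }
}
// withNames: RDD[(String, Int, Any)]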
Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.
Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?
The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for the equivalence/equality on a meaningful test data set.
Scala (see below for PySpark)
The spark-fast-tests library has two methods for making DataFrame comparisons (I'm the creator of the library):
The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison
def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
The assertLargeDataFrameEquality method compares DataFrames spread on multiple machines (the code is basically copied from spark-testing-base)
def assertLargeDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  try {
    actualDF.rdd.cache
    expectedDF.rdd.cache

    val actualCount = actualDF.rdd.count
    val expectedCount = expectedDF.rdd.count
    if (actualCount != expectedCount) {
      throw new DataFrameContentMismatch(countMismatchMessage(actualCount, expectedCount))
    }

    val expectedIndexValue = zipWithIndex(actualDF.rdd)
    val resultIndexValue = zipWithIndex(expectedDF.rdd)

    val unequalRDD = expectedIndexValue
      .join(resultIndexValue)
      .filter {
        case (idx, (r1, r2)) =>
          !(r1.equals(r2) || RowComparer.areRowsEqual(r1, r2, 0.0))
      }

    val maxUnequalRowsToShow = 10
    assertEmpty(unequalRDD.take(maxUnequalRowsToShow))
  } finally {
    actualDF.rdd.unpersist()
    expectedDF.rdd.unpersist()
  }
}
assertSmallDataFrameEquality is faster for small DataFrame comparisons and I've found it sufficient for my test suites.
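In a test it is typically called like this (a sketch; it assumes the comparer from spark-fast-tests is mixed in or imported so the method shown above is in scope):
assertSmallDataFrameEquality(actualDF, expectedDF)  // throws a descriptive error on mismatch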
PySpark
Here's a simple function that returns true if the DataFrames are equal:
def are_dfs_equal(df1, df2):
    if df1.schema != df2.schema:
        return False
    if df1.collect() != df2.collect():
        return False
    return True
or simplified
def are_dfs_equal(df1, df2):
    return (df1.schema == df2.schema) and (df1.collect() == df2.collect())
You'll typically perform DataFrame equality comparisons in a test suite and will want a descriptive error message when the comparisons fail (a True / False return value doesn't help much when debugging).
Use the chispa library to access the assert_df_equality method that returns descriptive error messages for test suite workflows.
There are some standard ways in the Apache Spark test suites; however, most of these involve collecting the data locally, and if you want to do equality testing on large DataFrames then that is likely not a suitable solution.
Checking the schemas first, you could then do an intersection to a df3 and verify that the counts of df1, df2 & df3 are all equal (however, this only works if there aren't duplicate rows; if there are different duplicate rows this method could still return true).
Another option would be getting the underlying RDDs of both DataFrames, mapping to (Row, 1), doing a reduceByKey to count the number of each Row, cogrouping the two resulting RDDs, and then doing a regular aggregate and returning false if any of the iterators are not equal, as sketched below.
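A rough sketch of that second idea (my own illustration, not code from the Spark test suites; df1 and df2 are the DataFrames being compared):
val counts1 = df1.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
val counts2 = df2.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
val mismatches = counts1.cogroup(counts2).filter {
  case (_, (c1, c2)) => c1.toSeq != c2.toSeq  // differing counts, or a row present on only one side
}
mismatches.isEmpty()  // true when the data matches up to row order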
I don't know about idiomatic, but I think you can get a robust way to compare DataFrames as you describe as follows. (I'm using PySpark for illustration, but the approach carries across languages.)
a = spark.range(5)
b = spark.range(5)
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
This approach correctly handles cases where the DataFrames may have duplicate rows, rows in different orders, and/or columns in different orders.
For example:
a = spark.createDataFrame([('nick', 30), ('bob', 40)], ['name', 'age'])
b = spark.createDataFrame([(40, 'bob'), (30, 'nick')], ['age', 'name'])
c = spark.createDataFrame([('nick', 30), ('bob', 40), ('nick', 30)], ['name', 'age'])
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
c_prime = c.groupBy(sorted(c.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
assert a_prime.subtract(c_prime).count() != 0
This approach is quite expensive, but most of the expense is unavoidable given the need to perform a full diff. And this should scale fine as it doesn't require collecting anything locally. If you relax the constraint that the comparison should account for duplicate rows, then you can drop the groupBy() and just do the subtract(), which would probably speed things up notably.
Java:
assert resultDs.union(answerDs).distinct().count() == resultDs.intersect(answerDs).count();
There are 4 options, depending on whether you have duplicate rows or not.
Let's say we have two DataFrames, z1 and z2. Options 1 and 2 are good for rows without duplicates. You can try these in spark-shell.
Option 1: do except directly
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def isEqual(left: DataFrame, right: DataFrame): Boolean = {
  if (left.columns.length != right.columns.length) return false // column lengths don't match
  if (left.count != right.count) return false // record counts don't match
  return left.except(right).isEmpty && right.except(left).isEmpty
}
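Usage in spark-shell (my illustration):
isEqual(z1, z2)  // true when both DataFrames contain exactly the same rows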
Option 2: generate row hash by columns
def createHashColumn(df: DataFrame) : Column = {
  val colArr = df.columns
  md5(concat_ws("", (colArr.map(col(_))) : _*))
}
val z1SigDF = z1.select(col("index"), createHashColumn(z1).as("signature_z1"))
val z2SigDF = z2.select(col("index"), createHashColumn(z2).as("signature_z2"))
val joinDF = z1SigDF.join(z2SigDF, z1SigDF("index") === z2SigDF("index")).where($"signature_z1" =!= $"signature_z2").cache
// should be 0
joinDF.count
Option 3: use groupBy (for DataFrames with duplicate rows)
val z1Grouped = z1.groupBy(z1.columns.map(c => z1(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val z2Grouped = z2.groupBy(z2.columns.map(c => z2(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val inZ1NotInZ2 = z1Grouped.except(z2Grouped).toDF()
val inZ2NotInZ1 = z2Grouped.except(z1Grouped).toDF()
// both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Option 4: use exceptAll, which should also work for data with duplicate rows
// Source Code: https://github.com/apache/spark/blob/50538600ec/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2029
val inZ1NotInZ2 = z1.exceptAll(z2).toDF()
val inZ2NotInZ1 = z2.exceptAll(z1).toDF()
// same here, // both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Try doing the following:
df1.except(df2).isEmpty
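Note that except is not symmetric, so if you also want to catch rows that exist only in df2, a common variant (my addition) checks both directions:
df1.except(df2).isEmpty && df2.except(df1).isEmpty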
A scalable and easy way is to diff the two DataFrames and count the non-matching rows:
df1.diff(df2).where($"diff" =!= "N").count
If that number is not zero, then the two DataFrames are not equivalent.
The diff transformation is provided by spark-extension.
It identifies Inserted, Changed, Deleted and uN-changed rows.
You can do this using a little bit of deduplication in combination with a full outer join. The advantage of this approach is that it does not require you to collect results to the driver, and that it avoids running multiple jobs.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// Generate some random data.
def random(n: Int, s: Long) = {
  spark.range(n).select(
    (rand(s) * 10000).cast("int").as("a"),
    (rand(s + 5) * 1000).cast("int").as("b"))
}
val df1 = random(10000000, 34)
val df2 = random(10000000, 17)
// Move all the keys into a struct (to make handling nulls easy), deduplicate the given dataset
// and count the rows per key.
def dedup(df: Dataset[Row]): Dataset[Row] = {
  df.select(struct(df.columns.map(col): _*).as("key"))
    .groupBy($"key")
    .agg(count(lit(1)).as("row_count"))
}
// Deduplicate the inputs and join them using a full outer join. The result can contain
// the following things:
// 1. Both keys are not null (and thus equal), and the row counts are the same. The dataset
// is the same for the given key.
// 2. Both keys are not null (and thus equal), and the row counts are not the same. The dataset
// contains the same keys.
// 3. Only the right key is not null.
// 4. Only the left key is not null.
val joined = dedup(df1).as("l").join(dedup(df2).as("r"), $"l.key" === $"r.key", "full")
// Summarize the differences.
val summary = joined.select(
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" === $"l.row_count", 1)).as("left_right_same_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" =!= $"l.row_count", 1)).as("left_right_different_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNull, 1)).as("left_only"),
  count(when($"l.key".isNull && $"r.key".isNotNull, 1)).as("right_only"))
summary.show()
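If you prefer a single boolean instead of eyeballing the summary, here is a sketch over the same joined dataset (my addition; Dataset.isEmpty needs Spark 2.4+):
val equivalent = joined.filter(
  $"l.key".isNull || $"r.key".isNull || $"l.row_count" =!= $"r.row_count"
).isEmpty
// true when every key appears on both sides with the same row count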
boolean isEqual(Dataset<Row> ds1, Dataset<Row> ds2) {
    try {
        return ds1.union(ds2)
                .groupBy(columns(ds1, ds1.columns()))
                .count()
                .filter("count % 2 > 0")
                .count()
                == 0;
    } catch (Exception e) {
        return false;
    }
}

Column[] columns(Dataset<Row> ds, String... columnNames) {
    List<Column> l = new ArrayList<>();
    for (String cn : columnNames) {
        l.add(ds.col(cn));
    }
    return l.stream().toArray(Column[]::new);
}
The columns method is supplementary and can be replaced by any method that returns a Seq (or array) of columns.
Logic:
Union both datasets; if the columns do not match, it will throw an exception and hence return false.
If the columns match, then group by all columns and add a count column. When the datasets are equal, every row has a count that is a multiple of 2 (even for duplicate rows).
Check if there is any row whose count is not divisible by 2; those are the extra rows.
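For reference, the same idea in Scala (a sketch, my own rendering of the logic above; ds1 and ds2 are DataFrames with matching schemas):
import org.apache.spark.sql.functions.col
val extra = ds1.union(ds2)
  .groupBy(ds1.columns.map(col): _*)
  .count()                   // adds a "count" column per distinct row
  .filter("count % 2 > 0")   // rows whose total count is odd
extra.count() == 0           // true when the two DataFrames contain the same rows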