I am running a Spark application that reads data from a few Hive tables (of IP addresses) and compares each element (IP address) in one dataset with all of the elements (IP addresses) from the other datasets. The end result would be something like:
+---------------+--------+--------+--------+--------+--------+--------+----------+
|     ip_address|dataset1|dataset2|dataset3|dataset4|dataset5|dataset6|      date|
+---------------+--------+--------+--------+--------+--------+--------+----------+
| xx.xx.xx.xx.xx|       1|       1|       0|       0|       0|       0|2017-11-06|
| xx.xx.xx.xx.xx|       0|       0|       1|       0|       0|       1|2017-11-06|
| xx.xx.xx.xx.xx|       1|       0|       0|       0|       0|       1|2017-11-06|
| xx.xx.xx.xx.xx|       0|       0|       1|       0|       0|       1|2017-11-06|
| xx.xx.xx.xx.xx|       1|       1|       0|       1|       0|       0|2017-11-06|
+---------------+--------+--------+--------+--------+--------+--------+----------+
To do the comparison, I convert the dataframes returned by the hiveContext.sql("query") statements into fastutil objects, like this:
val df= hiveContext.sql("query")
val dfBuffer = new it.unimi.dsi.fastutil.objects.ObjectArrayList[String](df.map(r => r(0).toString).collect())
Then, I am using an iterator to iterate over each collection and write the rows to a file using FileWriter.
val dfIterator = dfBuffer.iterator()
while (dfIterator.hasNext) {
  val p = dfIterator.next().toString
  // logic
}
I am running the application with --num-executors 20 --executor-memory 16g --executor-cores 5 --driver-memory 20g
The process runs for about 18-19 hours in total for roughly 4-5 million records, with one-to-one comparisons, on a daily basis.
However, when I checked the Application Master UI, I noticed that no activity takes place after the initial conversion of the dataframes to fastutil collection objects is done (this takes only a few minutes after the job is launched). The count and collect statements in the code produce new jobs until the conversion is done; after that, no new jobs are launched while the comparison is running.
What does this imply? Does it mean that the distributed processing is not happening at all?
I understand that collection objects are not treated as RDDs; could this be the reason?
How is Spark executing my program without using the resources assigned?
Any help would be appreciated, Thank you!
After the line:
val dfBuffer = new it.unimi.dsi.fastutil.objects.ObjectArrayList[String](df.map(r => r(0).toString).collect())
especially this part of the above line:
df.map(r => r(0).toString).collect()
where the collect is the key thing to notice: after it, no Spark jobs are ever performed on dfBuffer (which is a regular local, single-JVM data structure).
Does it mean that the distributed processing is not happening at all?
Correct. collect brings all the data to the single JVM where the driver runs (and that is exactly why you should not do it unless... you know what you are doing and what problems it may cause).
I think the above answers all the other questions.
A possible solution to your problem of comparing two datasets (in Spark and in a distributed fashion) would be to join a dataset with the reference dataset and use count to check whether the number of records changed.
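For example, a minimal sketch of that idea in Scala. The table names reference_table and dataset1_table and the single ip_address column are placeholders for illustration, not taken from the question:

import org.apache.spark.sql.functions.{col, lit}

// Placeholders: substitute your own Hive tables/queries here
val reference = hiveContext.sql("SELECT ip_address FROM reference_table")
val dataset1  = hiveContext.sql("SELECT ip_address FROM dataset1_table")

// Left-join the reference IPs against dataset1 and flag the matches;
// everything stays distributed, nothing is collected to the driver.
val flagged = reference
  .join(dataset1.distinct().withColumn("dataset1", lit(1)), Seq("ip_address"), "left_outer")
  .na.fill(0L, Seq("dataset1"))

// Compare counts instead of iterating element by element on the driver
val totalIps   = flagged.count()
val matchedIps = flagged.filter(col("dataset1") === 1).count()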
Related
I have a dataset that I want to repartition evenly into 10 buckets per unique value of a column, and I want to spread this result over a large number of partitions so that each one is small.
col_1 is guaranteed to be one of the values in ["CREATE", "UPDATE", "DELETE"]
My code looks like the following:
df.show()
"""
+------+-----+-----+
| col_1|col_2|index|
+------+-----+-----+
|CREATE| 0| 0|
|CREATE| 0| 1|
|UPDATE| 0| 2|
|UPDATE| 0| 3|
|DELETE| 0| 4|
|DELETE| 0| 5|
|CREATE| 0| 6|
|CREATE| 0| 7|
|CREATE| 0| 8|
+------+-----+-----+
"""
from pyspark.sql import functions as F

df = df.withColumn(
    "partition_column",
    F.concat(
        F.col("col_1"),
        # Pick a random integer bucket between 0 and 9
        F.floor(F.rand() * F.lit(10)).cast("string"),
    ),
)
df = df.repartition(1000, F.col("partition_column"))
I see that most of my tasks run and finish with zero rows of data. I would expect the data to be evenly distributed on partition_column into 1000 partitions. Why isn't that happening?
It's important to understand that the mechanism Spark uses to distribute its data is based upon the hash value of the columns you provide to the repartition() call.
In this case, you have one column with random values between 0 and 9, combined with another column that only ever has one of 3 different values in it.
Therefore, you'll have 10 * 3 = 30 unique combinations of values going into the repartition() call. This means that when Spark hashes this column, it only ever produces 30 unique hash values, on top of which it applies the modulus 1000. So the maximum number of non-empty partitions you will ever get is 30.
You'll need to spread your data over a greater number of random values if you want more than 30 non-empty partitions, or figure out another partitioning strategy entirely :)
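A rough sketch of the first option (shown in Scala for illustration; the bucket count of 334 is just an example chosen so that 3 values of col_1 times 334 buckets gives about 1000 distinct keys):

import org.apache.spark.sql.functions.{col, concat, floor, lit, rand}

// Use enough random buckets per value of col_1 that the number of distinct
// keys is at least the target partition count (3 * 334 ≈ 1000 here), so up
// to ~1000 partitions can actually receive data.
val salted = df.withColumn(
  "partition_column",
  concat(col("col_1"), lit("_"), floor(rand() * 334).cast("string"))
)

val repartitioned = salted.repartition(1000, col("partition_column"))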
Given the DataFrame:
+------------+---------+
|variableName|dataValue|
+------------+---------+
| IDKey| I1|
| b| y|
| a| x|
| IDKey| I2|
| a| z|
| b| w|
| c| q|
+------------+---------+
I want to create a new column with the corresponding IDKey value, where the value changes whenever the dataValue for IDKey changes. Here's the expected output:
+------------+---------+----------+
|variableName|dataValue|idkeyValue|
+------------+---------+----------+
| IDKey| I1| I1|
| b| y| I1|
| a| x| I1|
| IDKey| I2| I2|
| a| z| I2|
| b| w| I2|
| c| q| I2|
+------------+---------+----------+
I tried the following code, which uses mapPartitions() and a global variable:
var currentVarValue = ""

frame
  .mapPartitions { partition =>
    partition.map { row =>
      val (varName, dataValue) = (row.getString(0), row.getString(1))
      val idKeyValue = if (currentVarValue != dataValue && varName == "IDKey") {
        currentVarValue = dataValue
        dataValue
      } else {
        currentVarValue
      }
      ExtendedData(varName, dataValue, currentVarValue)
    }
  }
But this won't work for two fundamental reasons: Spark doesn't handle global variables (each executor only sees its own copy), and it doesn't comply with a functional programming style.
I would gladly appreciate any help on this. Thanks!
You cannot solve this elegantly and performantly in Spark, because there is not enough information for Spark to guarantee that all the related data ends up in the same partition. And if we do all the processing in a single partition, that defeats the true intent of Spark.
In fact, a sensible partitionBy cannot be issued (for a Window function). The issue is that the data is one long sequential list, so computing the value for a row could require looking into the previous partition to see whether its data relates to the current partition. That could be done, but it's quite a job. zero323 has an answer somewhere here that tries to solve this, but if I remember correctly it is cumbersome.
The logic to do it is easy enough, but using Spark is problematic for this.
Without a partitionBy, all the data gets shuffled to a single partition, which could result in OOM and space problems.
Sorry.
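For reference, a sketch of that single-partition Window approach in Scala. It assumes an ordering column, here a hypothetical rowId that preserves the original row order, and it is only viable when the data fits in one partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

// No partitionBy: Spark moves all the data into a single partition,
// which is exactly the OOM risk described above.
val w = Window
  .orderBy("rowId")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Carry forward the most recent dataValue seen on an IDKey row
val withIdKey = frame.withColumn(
  "idkeyValue",
  last(when(col("variableName") === "IDKey", col("dataValue")), ignoreNulls = true).over(w)
)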
How to flatten a simple (i.e. no nested structures) dataframe into a list?
My problem set is detecting all the node pairs that have been changed/added/removed from a table of node pairs.
This means I have a "before" and "after" table to compare. Combining the before and after dataframe yields rows that describe where a pair appears in one dataframe but not the other.
Example:
+-----------+-----------+-----------+-----------+
| before.id1| before.id2|  after.id1|  after.id2|
+-----------+-----------+-----------+-----------+
|       null|       null|         E2|         E3|
|         B3|         B1|       null|       null|
|         I1|         I2|       null|       null|
|         A2|         A3|       null|       null|
|       null|       null|         G3|         G4|
+-----------+-----------+-----------+-----------+
The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:
{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}
Potential approaches:
Union all the columns separately and distinct
flatMap and distinct
map and flatten
Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or another, would be the simplest?
Other notes
Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
distinct values in the resulting list are nice to have, but not required; I'm assuming that is trivial with the distinct operation
Try the following: convert every row into a Seq, collect all the rows, then flatten the data and remove the null values:
val df = Seq(("A","B"),(null,"A")).toDF
val result = df.rdd.map(_.toSeq.toList)
.collect().toList.flatten.toSet - null
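For the two-row example above, result ends up as Set("A", "B"): the rows flatten to List("A", "B", null, "A"), toSet removes the duplicate, and subtracting null drops the null entry.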
For context, my ultimate goal is to remove nearly-duplicated rows from a very large dataframe. Here is some dummy data:
+---+--------+----------+---+-------+-------+---+-------+-------+
|key|unique_1| unique_2|...|col_125|col_126|...|col_414|col_415|
+---+--------+----------+---+-------+-------+---+-------+-------+
| 1| 123|01-01-2000|...| 1| true|...| 100| 25|
| 2| 123|01-01-2000|...| 0| false|...| 100| 25|
| 3| 321|12-12-2012|...| 3| true|...| 99| 1|
| 4| 321|12-12-2012|...| 3| false|...| 99| 5|
+---+--------+----------+---+-------+-------+---+-------+-------+
In this data, combinations of observations from unique_1 and unique_2 should be distinct, but they aren't always. When they are repeated, they have the same values for the vast majority of the columns, but have variation on a very small set of other columns. I am trying to develop a strategy to deal with the near-duplicates, but it is complicated because each set of near-duplicates has a different set of columns which contain variation.
I'm trying to see the columns that contain variation for a single set of near-duplicates at a time - like this:
+---+-------+-------+
|key|col_125|col_126|
+---+-------+-------+
| 1| 1| true|
| 2| 20| false|
+---+-------+-------+
or this:
+---+-------+-------+
|key|col_126|col_415|
+---+-------+-------+
| 3| true| 1|
| 4| false| 5|
+---+-------+-------+
I've successfully gotten this result with a few different approaches. This was my first attempt:
def findColumnsWithDiffs(df: DataFrame): DataFrame = {
  df.columns.foldLeft(df) { (a, b) =>
    a.select(b).distinct.count match {
      case 1 => a.drop(b)
      case _ => a
    }
  }
}
val smallFrame = originalData.filter(($"key" === 1) || ($"key" === 2))
val desiredOutput = findColumnsWithDiffs(smallFrame)
And this works insofar as it gives me what I want, but it is unbelievably slow. The function above takes approximately 10x longer to run than it takes to display all of the data in smallFrame (and I suspect the performance only gets worse with the size of the data, although I have not tested that hypothesis thoroughly).
I thought that using fold instead of foldLeft might yield some improvements, so I rewrote the findColumnsWithDiffs function like this:
def findColumnsWithDiffsV2(df: DataFrame): DataFrame = {
  val colsWithDiffs = df.columns.map(colName => List(colName)).toList.fold(Nil) { (a, b) =>
    df.select(col(b(0))).distinct.count match {
      case 1 => a
      case _ => a ++ b
    }
  }
  df.select(colsWithDiffs.map(colName => col(colName)): _*)
}
But performance was the same. I also tried mapping each column to the number of distinct values it has and working from there, but again performance was the same. At this point I'm out of ideas. My hunch is that the filter is being re-evaluated for every column, which is why it is so terribly slow, but I don't know how to verify that theory and/or change what I'm doing to fix it if I'm correct. Does anyone have ideas to improve the efficiency of what I'm doing?
I'm currently using spark 2.1.0 / scala 2.11.8
All of the approaches to identifying the distinct values are fine; the issue is the lazy evaluation of the filter. To improve performance, call smallFrame.cache before using findColumnsWithDiffs. This will keep the filtered data in memory, which is fine because it is only a few rows at a time.
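For example, applied to the code from the question, a minimal sketch of the suggested change:

// Cache the filtered rows so the repeated per-column distinct counts
// don't re-run the filter against the full dataset each time.
val smallFrame = originalData.filter(($"key" === 1) || ($"key" === 2)).cache()
val desiredOutput = findColumnsWithDiffs(smallFrame)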
I am attempting the following in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to a date. I would like to randomly select a certain number of entries for each month in dateCountDF from another DataFrame, entitiesDF, where dateCountDF.FirstDate < entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new DataFrame. See below for a data example.
I'm not at all sure how to approach this problem from a Spark-SQl or Spark-MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a dataFrame and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
  val count: Int = x(1).toString().toInt
  val result = entitiesDF.take(count)
  return result
})
DataFrames
**dateCountDF**
+----------+-----+
|      Date|Count|
+----------+-----+
|2016-08-31|    4|
|2015-12-31|    1|
|2016-09-30|    5|
|2016-04-30|    5|
|2015-11-30|    3|
|2016-05-31|    7|
|2016-11-30|    2|
|2016-07-31|    5|
|2016-12-31|    9|
|2014-06-30|    4|
+----------+-----+
only showing top 10 rows
**entitiesDF**
+---+----------+----------+
| ID| FirstDate|  LastDate|
+---+----------+----------+
|296|2014-09-01|2015-07-31|
|125|2015-10-01|2016-12-31|
|124|2014-08-01|2015-03-31|
|447|2017-02-01|2017-01-01|
|307|2015-01-01|2015-04-30|
|574|2016-01-01|2017-01-31|
|613|2016-04-01|2017-02-01|
|169|2009-08-23|2016-11-30|
|205|2017-02-01|2017-02-01|
|433|2015-03-01|2015-10-31|
+---+----------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and, for each row, select a random number of entities from entitiesDF where dateCountDF.FirstDate < entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate.
To select random rows you can do something like this (note: the example below is PySpark, not Scala):
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    vals = random.sample(range(1, colmax), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want, store them in a DataFrame, and then add that data to another DataFrame; for that you can use unionAll.
You can also refer to this answer.
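For the Scala side of the question, a rough sketch of that sample-and-union idea. Assumptions: dateCountDF is small enough to collect to the driver, the dates compare correctly as yyyy-MM-dd values, and the intended condition is that each entity's FirstDate/LastDate range brackets the month's Date, which is how the schemas read:

import org.apache.spark.sql.functions.{col, lit, rand}

// Rough sketch, not a drop-in solution: sample `Count` random entities for
// each row of dateCountDF, then union the per-month samples together.
val perMonthSamples = dateCountDF.collect().map { row =>
  val date  = row(0).toString          // e.g. "2016-08-31"
  val count = row(1).toString.toInt    // how many entities to sample

  entitiesDF
    .filter(col("FirstDate") < lit(date) && lit(date) <= col("LastDate"))
    .orderBy(rand())                   // random order over the matching entities
    .limit(count)                      // keep `count` of them
}

// Combine everything into a single DataFrame, as suggested above
val randomEntities = perMonthSamples.reduce(_ unionAll _)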