I am new to Spark Scala. I have a situation where I need to compare the values of a particular column between consecutive rows of a data set, for example:
Source Data
Source Destination Distance
Austin Houston 200
Dallas Houston 400
Kansas Dallas 700
Resultant
Source1 Destination1 Distance1 Source2 Destination2 Distance2 DistDiff
Dallas Houston 400 Kansas Dallas 700 300
As per the situation, I have to compare the distances of consecutive rows, and if the difference is greater than or equal to 300, save both records in the Resultant data set:
700 - 400 = 300
The examples I have encountered use functions that execute on a per-row basis over a data set, whereas my scenario requires working with consecutive rows.
You mentioned you can sort rows by datetime. So, assuming it's sorted using sortBy or sortByKey to create an ordered rdd, and also assuming you have an even number of rows (so each row has another one to calculate difference with) you can:
Give each row an index using zipWithIndex.
Split the RDD into two RDDs, one with even-numbered indices and one with odd-numbered indices, by filtering on the index created.
zip the split RDDs together, creating a new RDD of Tuple2 with even-indexed rows on the left and odd-indexed rows on the right.
map the result to calculate the difference between left/right of each row.
This may be done in the following way (a minimal code sketch follows the list):
1. Add an index column to your sorted RDD.
2. Make sure the RDD has an even number of rows, N.
3. Make an RDD rdd_even1 containing the even-indexed rows [0, N-2].
4. Make an RDD rdd_odd1 containing the odd-indexed rows [1, N-1].
5. Make an RDD rdd_even2 containing the even-indexed rows [2, N-2].
6. Make an RDD rdd_odd2 containing the odd-indexed rows [1, N-3].
7. Repartition rdd_even1 and rdd_odd1 before zipping, because zipping won't work if the two RDDs do not have the same number of elements in every partition (in PySpark at least). You can do this in memory using collect and parallelize, but most likely you will have to write the RDDs to HDFS and re-read them, controlling for the partitioning.
8. Do the same for rdd_even2 and rdd_odd2.
9. Zip the RDDs from step 7 into rdd_zip1.
10. Zip the RDDs from step 8 into rdd_zip2.
11. Call rdd_zip1.union(rdd_zip2).
12. Now you can call map() on the union to get your "resultant" with the required differences.
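Putting the idea together, here is a minimal, untested Scala sketch. It assumes a SparkContext sc (as in spark-shell) and data that is already sorted, held as tuples of (source, destination, distance); it pairs row i with row i+1 through a keyed join rather than zip, which sidesteps the equal-partitioning issue from step 7. All names are illustrative.

val rows = sc.parallelize(Seq(
  ("Austin", "Houston", 200),
  ("Dallas", "Houston", 400),
  ("Kansas", "Dallas", 700)
))

// Index every row so that consecutive rows can be matched up.
val indexed = rows.zipWithIndex().map { case (row, idx) => (idx, row) }

// Shift a copy of the RDD by one position: row i+1 is re-keyed as i.
val shifted = indexed
  .filter { case (idx, _) => idx > 0 }
  .map { case (idx, row) => (idx - 1, row) }

// Join row i with row i+1 and keep pairs whose distances differ by at least 300.
val resultant = indexed.join(shifted).values.filter {
  case ((_, _, d1), (_, _, d2)) => math.abs(d2 - d1) >= 300
}

resultant.collect().foreach(println)
// expected: ((Dallas,Houston,400),(Kansas,Dallas,700))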
Good luck.
Related
For example,
I have a dataframe with 12000 rows, and I define a threshold of 3000 (externally supplied to my code via config), so I would like to split this dataframe into 4 dataframes with 3000 rows each.
If the dataframe has 12500 rows, I will split it into 5 dataframes, 4 with 3000 rows and last one with 500.
The significance of the threshold here is that if the row count of the dataframe is less than it, I will not tinker with the dataframe at all.
To get close to this behaviour (but not exactly what you specified), you can use the function randomSplit(weights: Array[Double]).
The weights are the proportions of rows that go to each output dataframe; hence, you cannot specify an exact number of rows.
In your case, df.randomSplit(Array(1.0, 1.0, 1.0, 1.0)) separates the dataframe into 4 approximately equal parts. Or, more generally:
val threshold = 3000  // supplied via config, per the question
val n = math.ceil(df.count().toDouble / threshold).toInt
df.randomSplit(Array.fill(n)(1.0))
PS: I do not think you can get cuts of a Spark dataframe with an exact number of rows in Spark.
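As a quick usage sketch (untested; the seed is an illustrative choice), randomSplit returns an Array of dataframes whose sizes are only roughly proportional to the weights:

val parts = df.randomSplit(Array.fill(n)(1.0), seed = 42)
parts.zipWithIndex.foreach { case (part, i) =>
  println(s"part $i has ${part.count()} rows")  // roughly rowCount / n, not exact
}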
I have a dataset of 200 billion rows and I want to extract 1 million rows randomly to start working on a data model.
I am using pyspark.
What would be the best way to handle this many rows?
You can use the sample method on a dataframe.
For example:
# Create a 0.0005% sample without replacement, with a random seed of 42
# (1 million/200 billion) = 0.000005
>>> df.sample(withReplacement=False, fraction=0.000005, seed=42).count()
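Note that fraction-based sampling is probabilistic, so the resulting count will be close to, but usually not exactly, 1 million; if you need an exact row count, you can sample a slightly larger fraction and then call .limit(1000000) on the result.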
How do I bucket a range of values (divide the entire range of values into a series of intervals) and then count how many values fall into each interval?
I have a Spark DataFrame with a few numeric columns. For each column, I want to bucket the range of values and then count how many values fall into each interval.
You can use Spark ML's Bucketizer. There's a good example here:
https://spark.apache.org/docs/2.2.0/ml-features.html#bucketizer
After you use the Bucketizer you have a dataframe with a bucket index column (e.g. indices 1, 2, and 3 might correspond to values 1-5, 6-10, and 11-15, respectively). You can then do a .groupBy and .agg (or use SQL) to get a count of records in each index group.
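A minimal Scala sketch of that flow, assuming a dataframe df with a numeric column named "value" and illustrative split points:

import org.apache.spark.ml.feature.Bucketizer

// Interval boundaries: [-inf, 0), [0, 10), [10, 20), [20, +inf]
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 20.0, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

// Each row gets the index of the interval its value falls into.
val bucketed = bucketizer.transform(df)

// Count how many values landed in each interval.
bucketed.groupBy("bucket").count().show()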
I need to count how many values of one of the columns of df1 are present in one of the columns of df2. (I just need the number of matched values)
I wouldn't be asking this question if efficiency wasn't such a big concern:
df1 contains 100,000,000+ records
df2 contains 1,000,000,000+ records
Just an off-the-top-of-my-head idea for the case where intersection won't cut it:
For the datatype that is contained in the columns, find two hash functions h1, h2 such that
h1 produces hashes roughly uniformly between 0 and N
h2 produces hashes roughly uniformly between 0 and M
such that M * N is approximately 1B, e.g. M = 10k, N = 100k,
then:
map each entry x from the column from df1 to (h1(x), x)
map each entry x from the column from df2 to (h1(x), x)
group both by h1 into buckets with xs
join on h1 (that's gonna be the nasty shuffle)
then locally, for each pair of buckets (b1, b2) that came from df1 and df2 and had the same h1 hash code, do essentially the same:
compute h2 for all bs from b1 and from b2,
group by the hash code h2
Compare the small sub-sub-buckets that remain by converting everything toSet and computing the intersection directly.
Everything that remains after intersection is present in both df1 and df2, so compute size and sum the results across all partitions.
The idea is to pick N small enough that the buckets of roughly M entries still comfortably fit on a single node, while at the same time preventing the whole application from dying on the first shuffle as it tries to work out where everything is by sending every key to everyone else. For example, using SHA-256 as the "hash code" for h1 wouldn't help much, because the hashes would be essentially unique, so you might as well take the original data directly and try to shuffle with that. However, if you restrict N to some reasonably small number, e.g. 10k, you obtain a rough approximation of where everything is, so you can then regroup the buckets and start the second stage with h2.
Essentially this is just a guess; I didn't test it. It could well be that the built-in intersection is smarter than anything I could possibly come up with.
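For what it's worth, here is a rough, untested Scala sketch of the first-level (h1) stage of this idea, with the h2 sub-bucketing omitted for brevity; N, the hash function, and the column name ("key" in both dataframes) are illustrative assumptions:

val N = 100000
def h1(x: String): Int = java.lang.Math.floorMod(x.hashCode, N)

// Pull the two key columns out as keyed RDDs of (bucket, value).
val left  = df1.select("key").rdd.map(_.getString(0)).map(x => (h1(x), x))
val right = df2.select("key").rdd.map(_.getString(0)).map(x => (h1(x), x))

// cogroup ships matching h1 buckets to the same node (the expensive shuffle);
// within a bucket the comparison is a plain in-memory set intersection,
// so this counts the distinct values present in both columns.
val matchedCount = left.cogroup(right).map { case (_, (xs1, xs2)) =>
  (xs1.toSet intersect xs2.toSet).size.toLong
}.sum().toLong

For comparison, the built-in route mentioned above would simply be df1.select("key").intersect(df2.select("key")).count().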
I am using Spark with Scala to process data. I have one question I couldn't figure out. I have a dataframe with one column:
data
1
2
3
4
5
I want to turn it into a single vector:
[1.0,2.0,3.0,4.0,5.0]
How can I implement it? I tried
df.collect().toVector or rdd.foreach, but every time it returns an array of vectors [1.0], [2.0], [3.0], [4.0], [5.0], not one single vector.
This is happening because when you collect a dataframe you get an Array of rows. You need to extract the values from the row objects.
df.collect().map(x => x.getDouble(0)).toVector
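Note that collect() pulls the whole column to the driver, so this only makes sense when the column comfortably fits in driver memory. Also, getDouble(0) assumes the column already has a double type; if it is stored as an integer type it will fail with a cast error. A hedged variant that casts first (assuming the column is named "data", as in the example):

import org.apache.spark.sql.functions.col

df.select(col("data").cast("double")).collect().map(_.getDouble(0)).toVector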