How to sample random rows from a billion-row dataset in PySpark - pyspark

I have a dataset of 200 billion rows and I want to extract 1 million rows randomly to start working on a data model.
I am using PySpark.
What would be the best way to handle this many rows?

You can use the sample method on a dataframe.
For example:
# Create a 0.0005% sample without replacement, with a random seed of 42
# (1 million/200 billion) = 0.000005
>>> df.sample(withReplacement=False, fraction=0.000005, seed=42).count()
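The fraction is easy to get wrong by a factor of 100 when converting between a ratio and a percentage, so here is a minimal plain-Python sketch of the arithmetic. The helper name sample_fraction is made up for illustration. Keep in mind that sample treats the fraction as a per-row probability, so the resulting count will be close to 1 million but not exactly 1 million.

```python
def sample_fraction(target_rows: int, total_rows: int) -> float:
    """Fraction to pass to DataFrame.sample to get roughly target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Cap at 1.0: sampling without replacement cannot return more than everything.
    return min(1.0, target_rows / total_rows)


fraction = sample_fraction(1_000_000, 200_000_000_000)
print(f"fraction = {fraction} ({fraction * 100:.4f}% of the rows)")

# Assumed usage with the df from the answer above (not executed here):
# sampled = df.sample(withReplacement=False, fraction=fraction, seed=42)
```

If exactly 1 million rows are required, one option is to sample with a slightly larger fraction and then apply limit(1000000) to the result.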

Related

Creating a pivot table in PySpark

New to PySpark and would like to make a table that counts the unique pairs of values from two columns and shows the average of another column over all rows with those pairs of values. My code so far is:
df1 = df.withColumn('trip_rate', df.total_amount / df.trip_distance)
df1.groupBy('PULocationID', 'DOLocationID').count().orderBy('count', ascending=False).show()
I want to add the average of the trip rate for each unique pair as a column. Can you help me please?
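To make the requested output concrete, here is a small plain-Python sketch of the aggregation: a row count and an average trip rate per (PULocationID, DOLocationID) pair. The sample trips and the function name pair_stats are hypothetical. In PySpark the same result would presumably come from replacing count() with agg(), as noted in the comment.

```python
from collections import defaultdict


def pair_stats(trips):
    """trips: iterable of (pu_id, do_id, total_amount, trip_distance).

    Returns {(pu_id, do_id): (count, average trip rate)}.
    """
    sums = defaultdict(lambda: [0, 0.0])  # pair -> [row count, sum of rates]
    for pu_id, do_id, total_amount, trip_distance in trips:
        entry = sums[(pu_id, do_id)]
        entry[0] += 1
        entry[1] += total_amount / trip_distance  # trip_rate for this row
    return {pair: (n, total / n) for pair, (n, total) in sums.items()}


# Hypothetical rows; all distances are non-zero to keep the sketch simple.
trips = [(1, 2, 10.0, 2.0), (1, 2, 20.0, 2.0), (3, 4, 9.0, 3.0)]
stats = pair_stats(trips)

# Assumed PySpark equivalent (not executed here):
# from pyspark.sql import functions as F
# df1.groupBy('PULocationID', 'DOLocationID') \
#    .agg(F.count('*').alias('count'), F.avg('trip_rate').alias('avg_trip_rate'))
```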

I want to split a big dataframe into multiple dataframes on the basis of row count in Spark using Scala. I am not able to figure it out.

For example,
I have a dataframe with 12000 rows, and I define a threshold of 3000 (supplied externally to my code via config), so I would like to split this dataframe into 4 dataframes with 3000 rows each.
If the dataframe has 12500 rows, I will split it into 5 dataframes: 4 with 3000 rows and the last one with 500.
The significance of the threshold here is that if the row count of the dataframe is less than this, I will not tinker with it.
To get close to this behavior (though not exactly what you specified), you can use the function randomSplit: randomSplit(weights: Array[Double])
The weights are the relative proportions of each output dataframe. Hence, you cannot specify the exact number of rows.
In your case (12500 rows), use df.randomSplit(Array(1.0, 1.0, 1.0, 1.0)) to separate it into 4 approximately equal parts. Or, to derive the number of parts from the threshold:
val n = math.ceil(df.count().toDouble / 3000).toInt
df.randomSplit(Array.fill(n)(1.0))
PS: I don't think you can cut a Spark dataframe into parts with an exact number of rows using randomSplit.
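For reference, the exact chunk sizes the question asks for follow from integer division. The plain-Python sketch below only works out the sizes and does not split a Spark dataframe. The function name chunk_sizes is made up for illustration.

```python
def chunk_sizes(total_rows: int, threshold: int) -> list:
    """Sizes of the chunks when total_rows is cut into pieces of at most threshold rows."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    full_chunks, remainder = divmod(total_rows, threshold)
    sizes = [threshold] * full_chunks
    if remainder:
        sizes.append(remainder)  # the last, smaller chunk
    return sizes


print(chunk_sizes(12000, 3000))
print(chunk_sizes(12500, 3000))
print(chunk_sizes(2000, 3000))  # below the threshold: a single, untouched chunk
```

In Spark, sizes like these could then be turned into row ranges over an index column, which gives exact counts where randomSplit gives only approximate ones.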

efficient aggregation (sum) on a single column Data Frame in spark scala

I have a Spark Data Frame with a single column and a large number of rows (in the billions). I am trying to calculate the sum of the values across all rows using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
val df = sc.parallelize(Array(1,3,5,6,7,10,30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) //very slow
A similar question was posted here: How to sum the values of one column of a dataframe in spark/scala
The focus of this question, however, is efficiency.

How can I convert one column data to a vector using Spark Scala

I am using Spark with Scala to process data. I have one question I couldn't figure out. I have a dataframe with a single column:
data
1
2
3
4
5
I want to convert it to a single vector
[1.0,2.0,3.0,4.0,5.0]
How can I implement it? I tried
df.collect().toVector or rdd.foreach, but every time it returns an array of vectors [1.0], [2.0], [3.0], [4.0], [5.0], not one single vector.
This is happening because when you collect a dataframe you get an Array of rows. You need to extract the values from the row objects.
df.collect().map(x => x.getDouble(0)).toVector
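The same idea in plain Python, as a rough illustration: collected rows behave like one-element tuples, so the value has to be pulled out of each row before the results form a flat vector. The rows list below is a stand-in for what collect() returns, and the function name is hypothetical.

```python
def flatten_single_column(rows):
    """Take the first (only) field of each row to get one flat list."""
    return [row[0] for row in rows]


# Stand-in for collected rows: each row is a one-element tuple.
rows = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
vector = flatten_single_column(rows)
print(vector)
```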

Compare column values in consecutive rows in Scala

I am new to Spark Scala. I am having a situation where I have to compare values of a particular column in a data set, for example:
Source Data
Source Destination Distance
Austin Houston 200
Dallas Houston 400
Kansas Dallas 700
Resultant
Source1 Destination1 Distance1 Source2 Destination2 Distance2 DistDiff
Dallas Houston 400 Kansas Dallas 700 300
As per the situation, I have to compare the distances of consecutive rows and, if the difference is greater than or equal to 300, save the records in the Resultant data set
700 - 400 = 300
The examples I have encountered use functions that operate row by row on a data set, but my scenario requires working with consecutive rows.
You mentioned you can sort rows by datetime. So, assuming it's sorted using sortBy or sortByKey to create an ordered rdd, and also assuming you have an even number of rows (so each row has another one to calculate difference with) you can:
1. Give each row an index using zipWithIndex.
2. Split the RDD into two RDDs, one with even-numbered indices and one with odd-numbered indices, by filtering on the index created.
3. zip the split RDDs together, creating a new RDD of Tuple2 with even-indexed rows on the left and odd-indexed rows on the right.
4. map the result to calculate the difference between left/right of each row.
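The steps above can be sketched in plain Python, with lists standing in for RDDs. The fourth distance is made up so that the row count is even, as this approach assumes. The function name paired_differences is hypothetical.

```python
def paired_differences(values):
    """Pair even-indexed values with the odd-indexed value that follows each one."""
    indexed = list(enumerate(values))              # stands in for zipWithIndex
    evens = [v for i, v in indexed if i % 2 == 0]  # rows 0, 2, 4, ...
    odds = [v for i, v in indexed if i % 2 == 1]   # rows 1, 3, 5, ...
    # zip pairs (row 0, row 1), (row 2, row 3), ...; then take right minus left.
    return [right - left for left, right in zip(evens, odds)]


# Sorted distances; the last value is an assumption added for illustration.
distances = [200, 400, 700, 750]
diffs = paired_differences(distances)
print(diffs)
```

Note that this compares rows (0, 1) and (2, 3) but never rows (1, 2), so it covers only every other consecutive pair.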
This may be done in the following way:
1. Add an index column to your sorted rdd
2. Make sure the rdd has an even number of rows N
3. Make an rdd rdd_even1 to contain the even rows with indices [0, N-2]
4. Make an rdd rdd_odd1 to contain the odd rows [1, N-1]
5. Make an rdd rdd_even2 to contain the even rows [2, N-2]
6. Make an rdd rdd_odd2 to contain the odd rows [1, N-3]
7. Now you need to repartition rdd_even1 and rdd_odd1 before zipping, because zipping won't work if both rdd's do not have the same number of elements in all partitions (in pyspark at least). You can do it in memory using collect and parallelize, but most likely you have to write the rdd's to HDFS and re-read them, controlling for the partitioning
8. Do the same for rdd_even2 and rdd_odd2
9. zip the rdd's from step 7 to rdd_zip1
10. zip the rdd's from step 8 to rdd_zip2
11. Call rdd_zip1.union(rdd_zip2)
12. Now you can call map() on the union to get your "resultant" with the required differences
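As a rough illustration of this procedure, here is a plain-Python sketch that uses lists in place of RDDs and the sample rows from the question. The function name consecutive_diffs is made up. Python's zip silently truncates to the shorter input, so the even-row-count and repartitioning concerns from the steps do not arise here as they do with RDD zip.

```python
def consecutive_diffs(rows, min_diff):
    """rows: sorted list of (source, destination, distance) tuples.

    Returns every pair of consecutive rows whose distance difference
    is at least min_diff, flattened with the difference appended.
    """
    evens = rows[0::2]                   # rows 0, 2, 4, ...
    odds = rows[1::2]                    # rows 1, 3, 5, ...
    zip1 = list(zip(evens, odds))        # pairs (0, 1), (2, 3), ...
    zip2 = list(zip(odds, evens[1:]))    # pairs (1, 2), (3, 4), ...
    result = []
    for first, second in zip1 + zip2:    # stands in for union followed by map
        diff = second[2] - first[2]
        if diff >= min_diff:
            result.append(first + second + (diff,))
    return result


rows = [
    ("Austin", "Houston", 200),
    ("Dallas", "Houston", 400),
    ("Kansas", "Dallas", 700),
]
resultant = consecutive_diffs(rows, 300)
print(resultant)
```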
Good luck.