Apache Spark: Get number of records per partition - scala

I want to know how to get information about each partition, such as the total number of records in each partition, on the driver side when a Spark job is submitted with deploy mode yarn-cluster, in order to log or print it on the console.

I'd use the built-in function. It should be as efficient as it gets:
import org.apache.spark.sql.functions.spark_partition_id
df.groupBy(spark_partition_id()).count.show

You can get the number of records per partition like this:
df
.rdd
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","number_of_records")
.show
But this will also launch a Spark job by itself (because Spark must read the file to get the number of records).
Spark may also be able to read Hive table statistics, but I don't know how to display that metadata.
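If the data does come from a Hive table, one way to peek at those statistics is through Spark SQL. A rough sketch only, assuming Spark 2.x with Hive support and a hypothetical table name; note these are table-level statistics (filled in by ANALYZE), not the in-memory Spark partitions discussed above:
// mydb.mytable is a placeholder; ANALYZE populates the statistics first
spark.sql("ANALYZE TABLE mydb.mytable COMPUTE STATISTICS")
// the "Statistics" row of the extended description shows size in bytes and row count
spark.sql("DESCRIBE EXTENDED mydb.mytable").show(100, truncate = false)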

For future PySpark users:
from pyspark.sql.functions import spark_partition_id
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()

Spark/Scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions)
val l = a.glom().map(_.length).collect() // get length of each partition
println((l.min, l.max, l.sum / l.length, l.length)) // check if skewed
PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
The same is possible for a DataFrame, not just for an RDD; just use DF.rdd.glom... in the code above.
Credits: Mike Dusenberry, https://issues.apache.org/jira/browse/SPARK-17817
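To make the DataFrame variant concrete, a minimal Scala sketch mirroring the RDD snippet above (df is any DataFrame):
val l = df.rdd.glom().map(_.length).collect() // records per partition
println((l.min, l.max, l.sum / l.length, l.length)) // check if skewed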

Spark 1.5 solution:
(sparkPartitionId() exists in org.apache.spark.sql.functions)
import org.apache.spark.sql.functions._
df.withColumn("partitionId", sparkPartitionId()).groupBy("partitionId").count.show
as mentioned by @Raphael Roth,
mapPartitionsWithIndex is the best approach; it will work with all versions of Spark since it is an RDD-based approach.

PySpark:
from pyspark.sql.functions import spark_partition_id
df.select(spark_partition_id().alias("partitionId")).groupBy("partitionId").count()

Related

How to read partitioned table from BigQuery to Spark dataframe (in PySpark)

I have a BQ table and it's partitioned by the default _PARTITIONTIME. I want to read one of its partitions into a Spark dataframe (PySpark). However, the spark.read API doesn't seem to recognize the partition column. Below is the code (which doesn't work):
table = 'myProject.myDataset.table'
df = spark.read.format('bigquery').option('table', table).load()
df_pt = df.filter("_PARTITIONTIME = TIMESTAMP('2019-01-30')")
The partition is quite large so I'm not able to read as a pandas dataframe.
Thank you very much.
Good question
I filed https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/50 to track this.
A workaround today is to pass the filter option to the reader:
df = spark.read.format('bigquery').option('table', table) \
    .option('filter', "_PARTITIONTIME = '2019-01-30'").load()
This should work today.
Try using the "$" operator: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
So, the table you'd be pulling from is "myProject.myDataset.table$20190130"
table = 'myProject.myDataset.table'
partition = '20190130'
df = spark.read.format('bigquery').option('table', f'{table}${partition}').load()

Long execution time when running PySpark-SQL on hadoop cluster?

I have a data set of weather data and I am trying to query it to get average lows and average highs for each year. I have no problem submitting the job and getting the desired result, but it is taking hours to run. I thought it would run much faster; am I doing something wrong, or is it just not as fast as I think it should be?
The data is a csv file with over 100,000,000 entries.
The columns are date, weather station, measurement (TMAX or TMIN), and value.
I am running the job on my university's Hadoop cluster; I don't have much more information about the cluster than that.
Thanks in advance!
import sys
from random import random
from operator import add
from pyspark.sql import SQLContext, Row
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    sqlContext = SQLContext(sc)
    file = sys.argv[1]
    lines = sc.textFile(file)
    parts = lines.map(lambda l: l.split(","))
    obs = parts.map(lambda p: Row(station=p[0], date=int(p[1]), measurement=p[2], value=p[3]))
    weather = sqlContext.createDataFrame(obs)
    weather.registerTempTable("weather")
    # AVERAGE TMAX/TMIN PER YEAR
    query2 = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
        from weather
        where value<130 AND value>-40
        group by measurement, SUBSTRING(date,1,4)
        order by SUBSTRING(date,1,4) """)
    query2.show()
    query2.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year.csv")
    sc.stop()
Make sure that the Spark job in fact started in cluster (and not local) mode, e.g. if you're using YARN, the job is launched in 'yarn-client' mode.
If that's true, then make sure you've provided enough executors/cores and enough executor and driver memory. You can get the actual cluster/job information from either the resource manager (e.g. YARN) page or from the Spark context (sqlContext.getAllConfs).
100 million records is not that small. Even if each record is only 30 bytes, the overall size is still about 3 GB, and that can take a while if you only have a handful of executors.
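For reference, a hedged example of requesting resources explicitly at submit time; the numbers, script name and input path are placeholders you would tune to your own cluster:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  weather_job.py hdfs:///path/to/weather.csv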
If the above suggestions do not help, try to find out which part of the query is taking long. A few speed-up tips:
Cache the weather dataframe
Break the query into 2 parts: the 1st part does the group by, and its output is cached
The 2nd part does the order by
Instead of coalesce, write the RDD with the default number of partitions and then merge the part files from the shell (e.g. with hadoop fs -getmerge) to get your csv output.

How to write a Dataset to Kafka topic?

I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as streaming.
While looking for an answer on the net, I could only find Kafka integration with Spark Streaming and nothing about integration with a batch job.
Does anyone know if such a thing is feasible?
Thanks
UPDATE:
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a spark shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried:
import org.apache.spark.sql.functions._
val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column):_*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get the error:
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Writes are enabled in Spark 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same as used with Structured Streaming). Include spark-sql-kafka in your dependencies.
Convert the data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write data to Kafka:
df
.write
.format("kafka")
.option("kafka.bootstrap.servers", server)
.save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
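If upgrading beyond 2.1 really isn't an option, a common workaround is to skip the connector and open a plain Kafka producer inside foreachPartition. A rough sketch only, assuming the kafka-clients jar is on the classpath, spark.implicits._ is in scope (it is in spark-shell), and reusing the single-column newdf from the question; the broker address and topic name are placeholders:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
newdf.select("value").as[String].foreachPartition { partition =>
  // one producer per partition, created on the executor
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  partition.foreach(v => producer.send(new ProducerRecord[String, String]("alerts", v)))
  producer.flush()
  producer.close()
}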
If you have a dataframe and you want to write it to a Kafka topic, you first need to convert the columns to a "value" column that contains the data in JSON format. In Scala:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
.selectExpr("CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServer)
.option("topic", topicSampleName)
.save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert the message to a key-value pair; your dataframe should have a value column.
Let's say you have a dataframe with student_id and score columns:
df.show()
>> student_id | score
1 | 99.00
2 | 98.00
then you should modify your dataframe to
value
{"student_id":1,"score":99.00}
{"student_id":2,"score":98.00}
To convert, you can use code similar to this:
df.select(to_json(struct($"student_id",$"score")).alias("value"))

Spark dataframe write method writing many small files

I've got a fairly simple job converting log files to parquet. It's processing 1.1 TB of data (chunked into 64 MB - 128 MB files; our block size is 128 MB), which is approximately 12 thousand files.
The job works as follows:
val events = spark.sparkContext
.textFile(s"$stream/$sourcetype")
.map(_.split(" \\|\\| ").toList)
.collect{case List(date, y, "Event") => MyEvent(date, y, "Event")}
.toDF()
events.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within the partition 'date'.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).
You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.
Try this:
df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
In Python you can rewrite Raphael Roth's answer as:
(df
.repartition("date")
.write.mode("append")
.partitionBy("date")
.parquet("{path}".format(path=path)))
You might also consider adding more columns to .repartition to avoid problems with very large partitions:
(df
.repartition("date", another_column, yet_another_column)
.write.mode("append")
.partitionBy("date")
.parquet("{path}".format(path=path)))
The simplest solution would be to replace your actual partitioning by:
df
.repartition(to_date($"date"))
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
You can also use more precise partitioning for your DataFrame, i.e. the day and maybe the hour of an hour range, and then be less precise for the writer.
That actually depends on the amount of data.
You can reduce entropy by repartitioning the DataFrame and then writing with the partitionBy clause.
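As a sketch of that idea, assuming a hypothetical timestamp column ts from which the hour can be derived: repartition finely by date and hour so no single task holds a whole day, but keep the on-disk layout partitioned by date only.
import org.apache.spark.sql.functions.{hour, to_date}
df
.repartition(to_date($"date"), hour($"ts"))
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")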
I came across the same issue, and using coalesce solved my problem.
df
.coalesce(3) // number of parts/files
.write.mode(SaveMode.Append)
.parquet(s"$path")
For more information on using coalesce or repartition, you can refer to the following: Spark: coalesce or repartition.
Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916
This is working for me very well:
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write-outs), and once it's all settled, use Hadoop FileUtil (or just the aws cli) to copy everything over:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
// ...
def copy(
    in: String,
    out: String,
    sparkSession: SparkSession
) = {
  FileUtil.copy(
    FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
    new Path(in),
    FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
    new Path(out),
    false,
    sparkSession.sparkContext.hadoopConfiguration
  )
}
How about running a script like this as a map job to consolidate all the parquet files into one:
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

Spark DataFrame filtering: retain element belonging to a list

I am using Spark 1.5.1 with Scala on Zeppelin notebook.
I have a DataFrame with a column called userID of Long type.
In total I have about 4 million rows and 200,000 unique userIDs.
I also have a list of 50,000 userIDs to exclude.
I can easily build the list of userID to retain.
What is the best way to delete all the rows that belong to the users to exclude?
Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?
I saw this post and applied its solution (see the code below), but the execution is slow, even though I am running Spark 1.5.1 on my local machine, I have a decent 16 GB of RAM, and the initial DataFrame fits in memory.
Here is the code that I am applying:
import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))
In the code above:
the initialDataFrame has 3,885,068 rows; each row has 5 columns, one of which is called userID and contains Long values.
the listOfUsersToKeep is an Array[Long] and it contains 150,000 Long userIDs.
I wonder if there is a more efficient solution than the one I am using.
Thanks
You can either use a join:
val usersToKeep = sc.parallelize(
listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")
val finalDataFrame = usersToKeep
.join(initialDataFrame, $"userID" === $"userID_")
.drop("userID_")
or a broadcast variable and a UDF:
import org.apache.spark.sql.functions.udf
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
It should also be possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
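If upgrading is an option, on Spark 2.0+ the exclusion can also be expressed directly as a left anti join. A sketch only, where listOfUsersToExclude is a hypothetical Array[Long] holding the 50,000 IDs to drop:
val usersToExclude = listOfUsersToExclude.toSeq.toDF("userID")
// keeps only the rows whose userID does not appear in usersToExclude
val finalDataFrame = initialDataFrame.join(usersToExclude, Seq("userID"), "left_anti")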