Could you please suggest an alternative way of implementing distinct on a Spark DataFrame?
I tried both SQL and the DataFrame distinct, but because of the dataset size (>2 billion rows) it fails on the shuffle.
If I increase the nodes and memory to >250GB, the process runs for a long time (more than 7 hours).
val df = spark.read.parquet(out)
val df1 = df.
  select($"ID", $"col2", $"suffix",
    $"date", $"year", $"codes").distinct()
// pair each code with its array index so it can be exploded into (code, seq) below
val df2 = df1.withColumn("codes", expr("transform(codes, (c, s) -> named_struct('c', c, 's', s))"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
"""SELECT
ID, col2, suffix,
d.s as seq,
d.c as code,
year, date
FROM
df2
LATERAL VIEW explode(codes) exploded_table as d
""")
df3.
  repartition(600, col("year"), col("date")).
  write.
  mode("overwrite").
  partitionBy("year", "date").
  save(OutDir)
I am trying to read a SQL table (15 million rows) into a Spark DataFrame. I want to leverage multiple cores to do the read quickly and to partition the data. Which column(s) should I select to partition on: ID, UUID, sequence, date-time? How should I calculate the number of partitions?
There are multiple complex questions in your question:
- Which column(s) can I select to partition on?
It depends on your needs, your computing goals, and the transformations you will apply next with Spark. (For example, if you will groupBy(key) and your key is date-time, then you should partition by date-time.)
- The number of partitions depends on the size of your data, your hardware resources, and your needs; it is a complex question. You also have to take shuffle partitions for transformations into account (the default is 200; a commonly advised value is 3 * the number of CPU cores). A sketch of a parallel JDBC read follows the code below.
val sparkSession = org.apache.spark.sql.SparkSession.builder
.master("local[*]")
.appName("JdbcMicroService")
.getOrCreate()
sparkSession.conf.set("spark.sql.shuffle.partitions", 3 * nbCpu) // nbCpu = number of available CPU cores
def requestPostgreSql(sparkSession: SparkSession, database : database, dateOfRequest : String) : DataFrame = {
val url = "jdbc:postgresql://" + database.url + "/" + database.databaseName
val requestDF = sparkSession.read.format("jdbc")
.option("Driver", "org.postgresql.Driver")
.option("url", url)
.option("dbtable",database.tableName)
.option("user",database.user)
.option("password",database.passwd)
.load()
.repartition(col("colName"))
requestDF
}
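If the goal is to parallelize the JDBC read itself (rather than repartitioning after loading), Spark's JDBC source can split the query into range-based sub-queries when given a partition column and bounds. A minimal sketch, reusing the url and database values from the function above; the column name "id", the bounds and the partition count are illustrative assumptions, not values from the question:

// Hedged sketch: parallel JDBC read via Spark's built-in range partitioning.
// "id", lowerBound, upperBound and numPartitions are placeholders to adapt to your table.
val parallelDF = sparkSession.read.format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", database.tableName)
  .option("user", database.user)
  .option("password", database.passwd)
  .option("partitionColumn", "id")  // numeric column to split on (date/timestamp also supported in newer Spark versions)
  .option("lowerBound", "1")        // minimum value of the partition column
  .option("upperBound", "15000000") // maximum value (~15M rows in the question)
  .option("numPartitions", "12")    // e.g. roughly 3 * number of CPU cores
  .load()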
I have to generate 3000000 files as the output of a Spark job.
I have two input files:
File 1 -> Size=3.3 Compressed, No.Of Records=13979835
File 2 -> Size=1.g Compressed, No.Of Records=6170229
The Spark job does the following:
Read both files and join them on the common column1. -> DataFrame-A
Group the result of DataFrame-A by column2. -> DataFrame-B
From DataFrame-B, use array_join on the aggregated column and separate its elements with the '\n' character. -> DataFrame-C
Write the result of DataFrame-C partitioned by column2.
val DF1 = sparkSession.read.json("FILE1") // |ID |isHighway|isRamp|pvId |linkIdx|ffs |length |
val DF2 = sparkSession.read.json("FILE2") // |lId |pid |
val joinExpression = DF1.col("pvId") === DF2.col("lId")
val DFA = DF1.join(DF2, joinExpression, "inner").select(col("ID").as("SCAR"), col("lId"), col("length"), col("ffs"), col("ar"), col("pid")).orderBy("linkIdx")
val DFB = DFA.select(col("SCAR"),concat_ws(",", col("lId"), col("length"),col("ffs"), col("ar"), col("pid")).as("links")).groupBy("SCAR").agg(collect_list("links").as("links"))
val DFC = DFB.select(col("SCAR"), array_join(col("links"), "\n").as("links"))
DFC.write.option("quote", "\u0000").partitionBy("SCAR").mode(SaveMode.Append).format("csv").save("/tmp")
After running some tests, I got the idea to run this job in batches (a sketch of this batching idea follows the ranges below):
query startIdx: 0, endIndex:100000
query startIdx: 100000, endIndex:200000
query startIdx: 200000, endIndex:300000
and so on, until
query startIdx: 2900000, endIndex:3000000
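A rough sketch of that batching idea, assuming a numeric row index is added to DataFrame-C with row_number() so each batch can be filtered by an index range; the rowIdx column, the window ordering, and the batch size are assumptions for illustration, not part of the original job:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Add a sequential row index, then write one index range at a time.
// Caveat: a window without partitionBy funnels all rows through a single task,
// so this only illustrates the startIdx/endIndex idea; it is not a tuned solution.
val indexed = DFC.withColumn("rowIdx", row_number().over(Window.orderBy("SCAR")))

val batchSize = 100000
val totalRows = 3000000
for (startIdx <- 0 until totalRows by batchSize) {
  indexed
    .filter(col("rowIdx") > startIdx && col("rowIdx") <= startIdx + batchSize)
    .drop("rowIdx")
    .write
    .option("quote", "\u0000")
    .partitionBy("SCAR")
    .mode(SaveMode.Append)
    .format("csv")
    .save("/tmp")
}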
I am trying to load data from an RDBMS table on Postgres into a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${query}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("numPartitions",15).load()
The Hive table is dynamically partitioned based on two columns: source_system_name,period_year
I have these column names present in a metadata table: metatables
val spColsDF = spark.read.format("jdbc").option("url",hiveMetaConURL)
.option("dbtable", "(select partition_columns from metainfo.metatables where tablename='finance.xx_gl_forecast') as colsPrecision")
.option("user", metaUserName)
.option("password", metaPassword)
.load()
I am trying to move the partition columns source_system_name and period_year to the end of the DataFrame yearDF, because columns used in Hive dynamic partitioning have to come last.
To do that, I came up with the following logic:
val partition_columns = spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
When I execute the code, I get an org.apache.spark.sql.AnalysisException as below:
Exception in thread "main" 18/08/28 18:09:30 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
org.apache.spark.sql.AnalysisException: cannot resolve '`source_system_name,period_year`' given input columns: [cost_center, period_num, period_name, currencies, cc_channel, scenario, xx_pk_id, period_year, cc_region, reference_code, source_system_name, source_record_type, xx_last_update_tms, xx_last_update_log_id, book_type, cc_function, product_line, ptd_balance_text, project, ledger_id, currency_code, xx_data_hash_id, qtd_balance_text, pl_market, version, qtd_balance, period, ptd_balance, ytd_balance_text, xx_hvr_last_upd_tms, geography, year, del_flag, trading_partner, ytd_balance, xx_data_hash_code, xx_creation_tms, forecast_id, drm_org, account, business_unit, gl_source_name, gl_source_system_name];;
'Project [forecast_id#26L, period_year#27, period_num#28, period_name#29, drm_org#30, ledger_id#31L, currency_code#32, source_system_name#33, source_record_type#34, gl_source_name#35, gl_source_system_name#36, year#37, period#38, scenario#39, version#40, currencies#41, business_unit#42, account#43, trading_partner#44, cost_center#45, geography#46, project#47, reference_code#48, product_line#49, ... 20 more fields]
+- Relation[forecast_id#26L,period_year#27,period_num#28,period_name#29,drm_org#30,ledger_id#31L,currency_code#32,source_system_name#33,source_record_type#34,gl_source_name#35,gl_source_system_name#36,year#37,period#38,scenario#39,version#40,currencies#41,business_unit#42,account#43,trading_partner#44,cost_center#45,geography#46,project#47,reference_code#48,product_line#49,... 19 more fields] JDBCRelation((select forecast_id,period_year,period_num,period_name,drm_org,ledger_id,currency_code,source_system_name,source_record_type,gl_source_name,gl_source_system_name,year,period,scenario,version,currencies,business_unit,account,trading_partner,cost_center,geography,project,reference_code,product_line,book_type,cc_region,cc_channel,cc_function,pl_market,ptd_balance,qtd_balance,ytd_balance,xx_hvr_last_upd_tms,xx_creation_tms,xx_last_update_tms,xx_last_update_log_id,xx_data_hash_code,xx_data_hash_id,xx_pk_id,null::integer as del_flag,ptd_balance::character varying as ptd_balance_text,qtd_balance::character varying as qtd_balance_text,ytd_balance::character varying as ytd_balance_text from analytics.xx_gl_forecast where period_year='2017') as year2017) [numPartitions=1]
But if I pass the same column names in another way, as follows, the code works fine:
val lastCols = Seq("source_system_name","period_year")
val allColsOrdered = yearDF.columns.diff(lastCols) ++ lastCols
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
Could anyone tell me what mistake I am making here?
If you look at the error:
cannot resolve '`source_system_name,period_year`
It means that the following line:
spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
is returning something like:
Array("source_system_name,period_year")
That means both column names are concatenated into the first element of the array, instead of being separate elements as you want.
To get the desired result, you need to split it on ','. For example, the following should work:
spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(","))
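For illustration, here is the same fix plugged back into the reordering logic from the question (a sketch reusing the names defined above):

// partition_columns is now Seq("source_system_name", "period_year") instead of one concatenated string
val partition_columns: Seq[String] =
  spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(",")).toSeq

val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val resultDF = yearDF.select(allColsOrdered.map(org.apache.spark.sql.functions.col): _*)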
When extracting a small number of partitions from a large C* (Cassandra) table using RDDs, we can use this:
val rdd = … // rdd including partition data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
.joinWithCassandraTable(keyspace, tableName)
Is there an equally effective approach using DataFrames?
Update (Apr 26, 2017):
To be more concrete, I prepared an example.
I have 2 tables in Cassandra:
CREATE TABLE ids (
id text,
registered timestamp,
PRIMARY KEY (id)
)
CREATE TABLE cpu_utils (
id text,
date text,
time timestamp,
cpu_util int,
PRIMARY KEY (( id, date ), time)
)
The first one contains a list of valid IDs and the second one contains cpu utilization data. I would like to efficiently get the average cpu utilization for each id in table ids for one day, say "2017-04-25".
The most efficient way with the RDDs that I know of is the following:
val sc: SparkContext = ...
val date = "2017-04-25"
val partitions = sc.cassandraTable(keyspace, "ids")
.select("id").map(r => (r.getString("id"), date))
val data = partitions.repartitionByCassandraReplica(keyspace, "cpu_utils")
.joinWithCassandraTable(keyspace, "cpu_utils")
.select("id", "cpu_util").values
.map(r => (r.getString("id"), (r.getDouble("cpu_util"), 1)))
// aggrData in form: (id, (avg(cpu_util), count))
// example row: ("718be4d5-11ad-4849-8aab-aa563c9c290e",(6,723))
val aggrData = data.reduceByKey((a, b) => (
1d * (a._1 * a._2 + b._1 * b._2) / (a._2 + b._2),
a._2 + b._2))
aggrData.foreach(println)
This approach takes about 5 seconds to complete (my setup: Spark on my local machine, Cassandra on a remote server). Using it, I am performing operations on less than 1% of the partitions in table cpu_utils.
With DataFrames, this is the approach I am currently using:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val date = "2017-04-25"
val partitions = sqlContext.read.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "ids", "keyspace" -> keyspace)).load()
.select($"id").withColumn("date", lit(date))
val data: DataFrame = sqlContext.read.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "cpu_utils", "keyspace" -> keyspace)).load()
.select($"id", $"cpu_util", $"date")
val dataFinal = partitions.join(data, partitions.col("id").equalTo(data.col("id")) and partitions.col("date").equalTo(data.col("date")))
.select(data.col("id"), data.col("cpu_util"))
.groupBy("id")
.agg(avg("cpu_util"), count("cpu_util"))
dataFinal.show()
However, this approach seems to load the whole cpu_utils table into memory, as the execution time here is considerably longer (almost 1 minute).
Is there a better approach using DataFrames that would at least match, if not outperform, the RDD approach mentioned above?
P.S.: I am using Spark 1.6.1.