Group by and count on all columns of a Spark DataFrame - Scala

I want to perform a groupBy on each column of the DataFrame using Spark SQL. The DataFrame will have approximately 1000 columns.
I have tried iterating over all the columns in the DataFrame and performing a groupBy on each column, but the program takes more than 1.5 hours to execute:
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "exp", "keyspace" -> "testdata"))
  .load()
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing Dataset: " + groupedData.mkString("\n"))
If the DataFrame has, for example, columns Name and Amount, then the output should look like this:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the group by result for each column.

The only speed-up I can see here is caching the DataFrame right after reading it.
Unfortunately, each per-column computation is independent and has to be performed; there is no real workaround.
Caching like this can speed things up a little, but not by much:
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "exp", "keyspace" -> "testdata"))
  .load()
  .cache()
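For completeness, the per-column loop from the question can then be run against the cached DataFrame. This is only a minimal sketch that reuses the question's own code (df and the take(10) limit); the Cassandra scan happens once, but each groupBy still triggers its own job:

// Reuse the cached DataFrame for every per-column aggregation.
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
df.columns.zip(groupedData).foreach { case (c, counts) =>
  println(s"GroupBy on column $c:")
  counts.foreach(println)
}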

Related

how to increase performance on Spark distinct() on multiple columns

Could you please suggest an alternative way of implementing distinct on a Spark DataFrame?
I tried both the SQL and the DataFrame distinct, but because of the dataset size (> 2 billion rows) it fails during the shuffle.
If I increase the nodes and memory to > 250 GB, the process runs for a long time (more than 7 hours).
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._

val df = spark.read.parquet(out)
val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .distinct()
// build a struct with fields c and s so the SQL below can refer to d.c and d.s
val df2 = df1.withColumn("codes", expr("transform(codes, (c, s) -> named_struct('c', c, 's', s))"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM df2
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df3
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
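This related question is quoted here without an answer, so the following is only an illustrative tuning sketch, not the thread's solution: on very large inputs, a wide distinct often survives only if the shuffle is split into more, smaller partitions, which can be controlled via spark.sql.shuffle.partitions (the value below is an arbitrary example):

// Hypothetical tuning sketch: more shuffle partitions means smaller shuffle blocks.
spark.conf.set("spark.sql.shuffle.partitions", "2000") // example value, tune to the data
val df1 = spark.read.parquet(out)
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .dropDuplicates() // same as distinct() on the selected columns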

Spark JDBC Read, Partition On, Column Type to Select?

I am trying to read a SQL table (15 million rows) into a DataFrame using Spark, and I want to leverage multiple cores to make the read fast and to partition the data. Which column(s) can I select to partition on: ID, UUID, sequence, date-time? How should I calculate the number of partitions?
There are several complex questions within your question:
- Which column(s) can I select to partition on?
It depends on your needs, your computing goals, and the transformations you will apply next in Spark. (For example, if you groupBy(key) and your key is the date-time column, then you should partition by date-time.)
- The number of partitions depends on the size of your data, your hardware resources, your needs, and so on. It is a complex question; you also have to take shuffle partitions into account for transformations (the default is 200, and a value often advised for Spark is 3 * the number of CPU cores).
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val sparkSession = SparkSession.builder
  .master("local[*]")
  .appName("JdbcMicroService")
  .getOrCreate()

// rule of thumb: roughly 3 * the number of CPU cores (numCpuCores is a placeholder)
sparkSession.conf.set("spark.sql.shuffle.partitions", (3 * numCpuCores).toString)

// Database is the answer's own holder for connection info (url, databaseName, tableName, user, passwd)
def requestPostgreSql(sparkSession: SparkSession, database: Database, dateOfRequest: String): DataFrame = {
  val url = "jdbc:postgresql://" + database.url + "/" + database.databaseName
  val requestDF = sparkSession.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", url)
    .option("dbtable", database.tableName)
    .option("user", database.user)
    .option("password", database.passwd)
    .load()
    .repartition(col("colName"))
  requestDF
}
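For the parallel read itself, the answer above relies on repartition after the load; another common option, shown here only as an assumed sketch (the column name, bounds and partition count are placeholders), is to let the JDBC source split the query itself with partitionColumn, lowerBound, upperBound and numPartitions on a numeric or date column:

// Hypothetical sketch: parallel JDBC read split on a numeric ID column.
val parallelDF = sparkSession.read.format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", database.tableName)
  .option("user", database.user)
  .option("password", database.passwd)
  .option("partitionColumn", "id")  // must be a numeric, date or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "15000000") // roughly the max value of the column
  .option("numPartitions", "12")    // e.g. ~3 * the number of cores
  .load()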

Configuration for spark job to write 3000000 file as output

I have to generate 3,000,000 files as the output of a Spark job.
I have two input files:
File 1 -> size = 3.3 (compressed), number of records = 13,979,835
File 2 -> size = 1.g (compressed), number of records = 6,170,229
The Spark job does the following:
- reads both files and joins them on the common column1 -> DataFrame-A
- groups DataFrame-A by column2 -> DataFrame-B
- from DataFrame-B, uses array_join on the aggregated column and separates its elements with the '\n' character -> DataFrame-C
- writes DataFrame-C partitioned by column2
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{array_join, col, collect_list, concat_ws}
val DF1 = sparkSession.read.json("FILE1") // |ID |isHighway|isRamp|pvId |linkIdx|ffs |length |
val DF2 = sparkSession.read.json("FILE2") // |lId |pid |
val joinExpression = DF1.col("pvId") === DF2.col("lId")
val DFA = DF1.join(DF2, joinExpression, "inner").orderBy("linkIdx")
  .select(col("ID").as("SCAR"), col("lId"), col("length"), col("ffs"), col("ar"), col("pid"))
val DFB = DFA.select(col("SCAR"), concat_ws(",", col("lId"), col("length"), col("ffs"), col("ar"), col("pid")).as("links"))
  .groupBy("SCAR").agg(collect_list("links").as("links"))
val DFC = DFB.select(col("SCAR"), array_join(col("links"), "\n").as("links"))
DFC.write.format("csv").option("quote", "\u0000").partitionBy("SCAR").mode(SaveMode.Append).save("/tmp")
Since the job must produce 3,000,000 output files, after running some tests I got the idea to run it in batches, like:
query startIdx: 0, endIndex:100000
query startIdx: 100000, endIndex:200000
query startIdx: 200000, endIndex:300000
and so.... on till
query startIdx: 2900000, endIndex:3000000
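The thread does not show how those batches would be issued, so the following is only a hypothetical sketch of the idea (the batch count, the reuse of DFC, and the output path are assumptions): derive a batch index from the group key and write one batch per pass, so each pass creates a bounded number of files:

// Hypothetical sketch: split the SCAR groups into N batches and write one batch per pass.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numBatches = 30 // e.g. ~100,000 groups (files) per pass; example value only
val batched = DFC.withColumn("batchIdx", pmod(hash(col("SCAR")), lit(numBatches)))

(0 until numBatches).foreach { b =>
  batched.filter(col("batchIdx") === b)
    .drop("batchIdx")
    .write.format("csv").option("quote", "\u0000")
    .partitionBy("SCAR").mode(SaveMode.Append).save("/tmp")
}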

How to fix org.apache.spark.sql.AnalysisException while changing the order of columns in a dataframe?

I am trying to load data from an RDBMS table on Postgres into a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
  .option("dbtable", s"(${query}) as year2017")
  .option("user", devUserName).option("password", devPassword)
  .option("numPartitions", 15).load()
The Hive table is dynamically partitioned based on two columns: source_system_name,period_year
I have these column names present in a metadata table: metatables
val spColsDF = spark.read.format("jdbc").option("url",hiveMetaConURL)
.option("dbtable", "(select partition_columns from metainfo.metatables where tablename='finance.xx_gl_forecast') as colsPrecision")
.option("user", metaUserName)
.option("password", metaPassword)
.load()
I am trying to move the partition columns source_system_name and period_year to the end of the DataFrame yearDF, because the columns used for Hive dynamic partitioning have to come last.
To do that, I came up with the following logic:
val partition_columns = spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
When I execute the code, I get an org.apache.spark.sql.AnalysisException, shown below:
Exception in thread "main" 18/08/28 18:09:30 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
org.apache.spark.sql.AnalysisException: cannot resolve '`source_system_name,period_year`' given input columns: [cost_center, period_num, period_name, currencies, cc_channel, scenario, xx_pk_id, period_year, cc_region, reference_code, source_system_name, source_record_type, xx_last_update_tms, xx_last_update_log_id, book_type, cc_function, product_line, ptd_balance_text, project, ledger_id, currency_code, xx_data_hash_id, qtd_balance_text, pl_market, version, qtd_balance, period, ptd_balance, ytd_balance_text, xx_hvr_last_upd_tms, geography, year, del_flag, trading_partner, ytd_balance, xx_data_hash_code, xx_creation_tms, forecast_id, drm_org, account, business_unit, gl_source_name, gl_source_system_name];;
'Project [forecast_id#26L, period_year#27, period_num#28, period_name#29, drm_org#30, ledger_id#31L, currency_code#32, source_system_name#33, source_record_type#34, gl_source_name#35, gl_source_system_name#36, year#37, period#38, scenario#39, version#40, currencies#41, business_unit#42, account#43, trading_partner#44, cost_center#45, geography#46, project#47, reference_code#48, product_line#49, ... 20 more fields]
+- Relation[forecast_id#26L,period_year#27,period_num#28,period_name#29,drm_org#30,ledger_id#31L,currency_code#32,source_system_name#33,source_record_type#34,gl_source_name#35,gl_source_system_name#36,year#37,period#38,scenario#39,version#40,currencies#41,business_unit#42,account#43,trading_partner#44,cost_center#45,geography#46,project#47,reference_code#48,product_line#49,... 19 more fields] JDBCRelation((select forecast_id,period_year,period_num,period_name,drm_org,ledger_id,currency_code,source_system_name,source_record_type,gl_source_name,gl_source_system_name,year,period,scenario,version,currencies,business_unit,account,trading_partner,cost_center,geography,project,reference_code,product_line,book_type,cc_region,cc_channel,cc_function,pl_market,ptd_balance,qtd_balance,ytd_balance,xx_hvr_last_upd_tms,xx_creation_tms,xx_last_update_tms,xx_last_update_log_id,xx_data_hash_code,xx_data_hash_id,xx_pk_id,null::integer as del_flag,ptd_balance::character varying as ptd_balance_text,qtd_balance::character varying as qtd_balance_text,ytd_balance::character varying as ytd_balance_text from analytics.xx_gl_forecast where period_year='2017') as year2017) [numPartitions=1]
But if I pass the same column names another way, as follows, the code works fine:
val lastCols = Seq("source_system_name","period_year")
val allColsOrdered = yearDF.columns.diff(lastCols) ++ lastCols
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
Could anyone tell me what mistake I am making here?
If you look at the error:
cannot resolve '`source_system_name,period_year`'
it means that the following line:
spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
is returning something like:
Array("source_system_name,period_year")
i.e. both column names are concatenated into the first element of the array instead of being separate elements as you want.
To get the desired result, you need to split on ",". For example, the following should work:
spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(","))
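Putting it together with the reordering logic already shown in the question (this only reuses the question's own variables):

// Split the stored "source_system_name,period_year" string into separate column names,
// then move those columns to the end of yearDF as before.
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(","))
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols: _*)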

Is there an alternative to joinWithCassandraTable for DataFrames in Spark (Scala) when retrieving data from only certain Cassandra partitions?

When extracting a small number of partitions from a large C* table using RDDs, we can use this:
val rdd = … // rdd including partition data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
.joinWithCassandraTable(keyspace, tableName)
Is an equally effective approach available using DataFrames?
Update (Apr 26, 2017):
To be more concrete, I prepared an example.
I have 2 tables in Cassandra:
CREATE TABLE ids (
  id text,
  registered timestamp,
  PRIMARY KEY (id)
)

CREATE TABLE cpu_utils (
  id text,
  date text,
  time timestamp,
  cpu_util int,
  PRIMARY KEY ((id, date), time)
)
The first contains a list of valid IDs, and the second CPU utilization data. I would like to efficiently compute the average CPU utilization for each id in table ids for one day, say "2017-04-25".
The most efficient way with RDDs that I know of is the following:
val sc: SparkContext = ...
val date = "2017-04-25"
val partitions = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))
val data = partitions.repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), (r.getDouble("cpu_util"), 1)))
// aggrData in form: (id, (avg(cpu_util), count))
// example row: ("718be4d5-11ad-4849-8aab-aa563c9c290e", (6, 723))
val aggrData = data.reduceByKey((a, b) => (
  1d * (a._1 * a._2 + b._1 * b._2) / (a._2 + b._2),
  a._2 + b._2))
aggrData.foreach(println)
This approach takes about 5 seconds to complete (Spark on my local machine, Cassandra on a remote server). Using it, I am touching less than 1% of the partitions in table cpu_utils.
With DataFrames, this is the approach I am currently using:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val date = "2017-04-25"
val partitions = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "ids", "keyspace" -> keyspace)).load()
  .select($"id").withColumn("date", lit(date))
val data: DataFrame = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "cpu_utils", "keyspace" -> keyspace)).load()
  .select($"id", $"cpu_util", $"date")
val dataFinal = partitions.join(data,
    partitions.col("id").equalTo(data.col("id")) and partitions.col("date").equalTo(data.col("date")))
  .select(data.col("id"), data.col("cpu_util"))
  .groupBy("id")
  .agg(avg("cpu_util"), count("cpu_util"))
dataFinal.show()
However, this approach seems to load the whole cpu_utils table, since the execution time is considerably longer (almost 1 minute).
Is there a better approach using DataFrames that would at least match, if not outperform, the RDD approach above?
P.S.: I am using Spark 1.6.1.