Scala Spark: Creating a variable that contains column names for the partition

I want to create several columns that sum over another column, partitioned by several columns. I want to make a variable that contains all the column names for the partitionBy so that my code is cleaner. It goes like this:
var myunit = $"reporttype_code", $"review_code", $"businessgroup_code",
$"company_code", $"grade_code", $"terminationtype_code"
var semesterly = Window.partitionBy(myunit, $semester)
.orderBy($year, $month)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
var quarterly = Window.partitionBy(myunit, $quarter)
.orderBy($year, $month)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
test = test.withColumn("turnover_over_semester", sum($"val1").over(semesterly))
.withColumn("turnover_over_quarter", sum($"val1").over(quarterly))
I'm going to create a lot of columns that partition by myunit, but the var declaration raises an error: ';' expected but ',' found. Is it possible to create the var another way?
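One way this can be written (a minimal sketch against the snippet above; it assumes spark.implicits._ is already in scope for the $ syntax) is to keep the shared partition columns in a Seq[Column] and expand it with : _* wherever a window needs it:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Shared partition columns collected in a Seq[Column].
val myunit = Seq($"reporttype_code", $"review_code", $"businessgroup_code",
  $"company_code", $"grade_code", $"terminationtype_code")

// Append the extra column and expand the Seq into partitionBy's varargs.
val semesterly = Window.partitionBy((myunit :+ $"semester"): _*)
  .orderBy($"year", $"month")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val quarterly = Window.partitionBy((myunit :+ $"quarter"): _*)
  .orderBy($"year", $"month")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

test = test.withColumn("turnover_over_semester", sum($"val1").over(semesterly))
  .withColumn("turnover_over_quarter", sum($"val1").over(quarterly))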

Related

Pyspark - iterate on a big dataframe

I'm using the following code
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to go over each dataframe item and add an event header calculated in the generate_event function:
def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)

class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a df with a few items (even with 1,000,000 items), but when we have more than 6 million items the AWS Glue job fails.
Note: using an RDD seems to work better, but I can't use it because I have a problem with dates < 1900-01-01 (issue).
Is there a way to chunk the dataframe and consolidate at the end?
The best solution we can propose is to use Spark's built-in features, like adding new columns using the struct and create_map functions...
events_df = (
    df
    .withColumn(
        "header",
        f.create_map(
            f.lit("id"),
            f.lit(1)
        )
    )
    ...
So we can create as many columns as we need and apply transformations to get the required header structure.
PS: this solution (adding new columns to the dataframe rather than iterating over it) avoids using RDDs and brings a big performance advantage!

How to calculate a value for each row?

I have an input dataframe (created from a Hive table) containing more than 100 rows. For each row of the dataframe, I need to extract the column values (mostly strings) and pass those values to a user-defined function. For each row, the function uses these input values and other intermediate dataframes (created from Hive tables) to calculate a set of rows and stores them in a result dataframe.
How do I achieve this? Please help.
I tried this:
var df1 = hiveContext.sql("Select event_date,channelcode,st,tc,startsec,endsec from program_master")
var count1 = df1.count()
df1 = df1.withColumn("INDEX", monotonically_increasing_id())
var i = 1
while (i <= count1) {
  var ed = df1.filter(df1("INDEX") === s"""$i""").select(to_date(unix_timestamp(df1("ed"), "dd-MM-yy").cast(TimestampType)).cast(DateType)).first().getDate(0)
  var cc = df1.filter(df1("INDEX") === s"""$i""").select(df1("cc")).first().getInt(0)
  var ST = df1.filter(df1("INDEX") === s"""$i""").select(df1("ST")).first().getString(0)
  var TC = df1.filter(df1("INDEX") === s"""$i""").select(df1("TC")).first().getString(0)
  var ss = df1.filter(df1("INDEX") === s"""$i""").select(df1("ss")).first().getInt(0)
  var es = df1.filter(df1("INDEX") === s"""$i""").select(df1("es")).first().getInt(0)
  calculate_values(ed, cc, ST, TC, ss, es, sparkSession)
  i = i + 1
}
The calculate_values definition:
def calculate_values(ed: Date, cc: Integer, ST: String, TC: String, ss: Integer, es: Integer, sparkSession: SparkSession): Unit =
There are 2 problems in what I tried, hence I do not have an output:
Line 3: I expected it to give numbers like 1, 2, 3, ..., 100 so I could iterate using i, but it generates very large, non-sequential numbers.
Line 5: it throws java.util.NoSuchElementException: next on empty iterator.
monotonically_increasing_id() generates numbers that are increasing but not consecutive, so you cannot rely on it to produce serial numbers the way the row_number() function does. row_number(), however, is expensive to apply to the whole dataset, as it collects all the data into one executor unless you apply row_number() within groups.
monotonically_increasing_id() would be helpful in cases where you want to order/sort your data.
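If sequential indexes are really needed, one option is row_number() over a window; a minimal sketch (it assumes event_date is an acceptable ordering column, and a window with no partitionBy pulls all rows into a single partition, which is only acceptable for small data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Global window ordered by event_date: fine for ~100 rows, costly for large data.
val w = Window.orderBy(col("event_date"))
val indexed = df1.withColumn("INDEX", row_number().over(w))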
It seems that you are trying to calculate some values row by row using event_date, channelcode, st, tc, startsec and endsec.
If it's a row-by-row calculation, then I would suggest using a udf function. You can convert your calculate_value function into a udf function as
import org.apache.spark.sql.functions._
import java.sql.Date

def calculate_value = udf((ed: Date, cc: Int, ST: String, TC: String, ss: Int, es: Int) => {
  // write your calculation part here
})
And you call the udf function using withColumn as
df1.withColumn("calculated", calculate_value(col("ed"), col("cc"), col("ST"), col("TC"), col("ss"), col("es")))
A new column will be created with the calculated value.
But if the calculations can be done column-wise, I would recommend looking at the built-in functions too.
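For illustration only, here is a filled-in sketch of the approach above; the udf body (es - ss) is a made-up placeholder for the real calculation, and the column types are assumed from the question's select:

import org.apache.spark.sql.functions._
import java.sql.Date

// Placeholder body: es - ss stands in for the real calculation.
val calculateValue = udf((ed: Date, cc: Int, st: String, tc: String, ss: Int, es: Int) => es - ss)

val result = df1.withColumn(
  "calculated",
  calculateValue(
    to_date(unix_timestamp(col("event_date"), "dd-MM-yy").cast("timestamp")),
    col("channelcode"), col("st"), col("tc"), col("startsec"), col("endsec")))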

Filter columns having a count equal to the input file RDD count in Spark

I'm filtering integer columns from the input parquet file with the logic below, and I've been trying to modify this logic to add an additional validation: check whether any of the input columns has a count equal to the input parquet file's RDD count. I want to filter out any such column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file RDD count. Filtering integer columns is already achieved with the logic below.
e.g input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of integer StructFields
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))

// Select those columns
val z = df.select(columns.map(x => col(x.name)): _*)

// Get the array of column names
val m = z.columns
New Logic would be like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns // pseudocode: keep only the columns whose count is not equal to cnt
I do not want to pass the column names explicitly to the new condition, since the columns having a count equal to the input file count will change (val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to filter in columns with integer as the dataType and whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
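Putting the two checks together, a minimal sketch (it assumes "count" means the number of non-null values in each column; swap in distinct().count() if distinct values are what you need):

import org.apache.spark.sql.functions.col

val cnt = spark.read.parquet("inputfile").count()

// Keep only the integer columns whose non-null value count differs from the file's row count.
val keptFields = df.schema.fields
  .filter(_.dataType.typeName.contains("integer"))
  .filter(f => df.filter(col(f.name).isNotNull).count() != cnt)

val d = df.select(keptFields.map(f => col(f.name)): _*)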
Jeanr and Ramesh suggested the right approach, and here is what I did to get the desired output; it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)

Spark - move data from tables to new table with extra column

So we have a Cassandra project that requires us to migrate a large number of tables, merging 3 separate tables into one.
e.g. table_d_abc, table_m_abc, table_w_abc to table_t_abc
Essentially, data needs to be moved to this new table with an extra column holding a value that was encoded in the original table's name.
There are hundreds of tables like this, so you can imagine what a huge job it would be to hand-write a migration script. Naturally, I thought Spark should be able to do the job.
e.g.:
var tables = List("table_*_abc", "table_*_def") // etc
var periods = List('d', 'w', 'm')
for (table <- tables) {
  for (period <- periods) {
    var rTable = table.replace('*', period)
    var nTable = table.replace('*', 't')
    try {
      var t = sc.cassandraTable("data", rTable)
      var fr = t.first
      var columns = fr.toMap.keys.toArray :+ "period"
      var data = t.map(_.iterator.toArray :+ period)
      // This line does not work, as data is an RDD of Array[Any] and not an RDD of tuples.
      // How to ???
      data.saveToCassandra("data", nTable, SomeColumns(columns.map(ColumnName(_)): _*))
    } // catch {}
  }
}
versus:
var periods = List('d', 'w', 'm')
for (period <- periods) {
  sc.cassandraTable("data", "table_" + period + "_abc")
    .map(v => (v.getString("a"), v.getInt("b"), v.getInt("c"), period))
    .saveToCassandra("data", "table_t_abc", SomeColumns("a", "b", "c", "period"))
  // ... 100s of other scripts like this
}
Is what I'm trying to do possible?
Is there a way to programmatically save an extra column from a source with an unknown number of columns and datatypes?
The issue here is that the RDD objects must be of a type which has a "RowWriter" defined. This maps the data in the object to C*-insertable buffers.
RDD World
Using "CassandraRow" objects this is possible. These objects allow for generic contents and can be constructed on the file. They are also the default output so making a new one from an old one should be relatively cheap.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala
You would make a single RowMetadata (basically schema info) for each table with the additional column, then populate the row with the values of the input row + the new period variable.
Dataframe World
If you wanted to switch to DataFrames, this would be easier, as you could just add the column to the DataFrame before saving:
cassandraDF.withColumn("period",lit("Value based on first row"))
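For example, a minimal sketch using the connector's DataFrame API (keyspace/table names come from the question; it assumes a spark-cassandra-connector version with DataFrame support and a SparkSession in scope):

import org.apache.spark.sql.functions.lit

val cassandraDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "data", "table" -> "table_d_abc"))
  .load()

cassandraDF
  .withColumn("period", lit("d")) // the value recovered from the source table's name
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "data", "table" -> "table_t_abc"))
  .mode("append")
  .save()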

Accessing column in a dataframe using Spark

I am working with Spark version 1.6.1 using Scala and facing an unusual issue. When creating a new column using an existing column created during the same execution, I get an "org.apache.spark.sql.AnalysisException".
WORKING:
val resultDataFrame = dataFrame.withColumn("FirstColumn", lit(2021))
  .withColumn("SecondColumn", when($"FirstColumn" - 2021 === 0, 1).otherwise(10))
resultDataFrame.printSchema()
NOT WORKING
val resultDataFrame = dataFrame.withColumn("FirstColumn", lit(2021))
  .withColumn("SecondColumn", when($"FirstColumn" - max($"FirstColumn") === 0, 1).otherwise(10))
resultDataFrame.printSchema()
Here I am creating my SecondColumn using the FirstColumn created during the same execution. The question is why it does not work when using avg/max functions. Please let me know how I can resolve this problem.
If you want to use aggregate functions together with "normal" columns, the functions should come after a groupBy or within a Window definition clause. Outside of those cases they make no sense. Examples:
val result = df.groupBy($"col1").agg(max($"col2").as("max")) // This works
In the above case, the resulting DataFrame will have both "col1" and "max" as columns.
val minMax = df.select(min("col2"), max("col2"))
This works because there are only aggregate functions in the query. However, the following will not work:
val result = df.filter($"col1" === max($"col2"))
because I am trying to mix a non-aggregated column with an aggregated one.
If you want to compare a column with an aggregated value, you can try a join:
val maxDf = df.select(max("col2").as("maxValue"))
val joined = df.join(maxDf)
val result = joined.filter($"col1" === $"maxValue").drop("maxValue")
Or even use the simple value:
val maxValue = df.select(max("col2")).first.get(0)
val result = df.filter($"col1" === maxValue)
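Coming back to the original SecondColumn example, the aggregate can also be expressed over a window; a sketch (an empty partitionBy makes the whole DataFrame one partition, which is only reasonable for small data, and on Spark 1.6 window functions require a HiveContext):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy() // one global partition
val resultDataFrame = dataFrame
  .withColumn("FirstColumn", lit(2021))
  .withColumn("SecondColumn",
    when($"FirstColumn" - max($"FirstColumn").over(w) === 0, 1).otherwise(10))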