Looping over a Scala list in Spark - scala

I have a Scala list as below.
partList: ListBuffer(2021-10-01, 2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05, 2021-10-06, 2021-10-07, 2021-10-08)
Currently I'm getting all the data from the source into a dataframe based on the above dates.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Later I'm doing a few transformations and loading the data into a Delta table. The sample code is below.
fctDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
if (fctExistingDF.count() > 0) {
  fctDF.createOrReplaceTempView("vw_exist_fct")
  val existingRecordsQuery = getExistingRecordsMergeQuery(azUpdateTS, key)
  ss.sql(existingRecordsQuery)
    .drop("az_insert_ts").drop("az_update_ts")
    .withColumn("az_insert_ts", col("new_az_insert_ts"))
    .withColumn("az_update_ts", col("new_az_update_ts"))
    .drop("new_az_insert_ts").drop("new_az_update_ts")
    .select(mrg_tbl_cols(0), mrg_tbl_cols.slice(1, mrg_tbl_cols.length): _*)
    .coalesce(72 * 2)
    .write.mode("Append").format("delta")
    .insertInto(mergeTable)
  mergedDataDF = ss.read.table(mergeTable).coalesce(72 * 2)
  mergedDataDF.coalesce(72)
    .write.mode("Overwrite").format("delta")
    .insertInto(s"${tgtSchema}.${tgtTbl}")
}
The command below creates a dataframe by filtering on the event_date values present in partList.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Since this creates a dataframe with a huge amount of data, I want to loop over each date in partList and read the data into the dataframe one date at a time, instead of filtering on all the dates in partList at once.
I tried the following.
var counter = 0
while (counter < partList.length) {
  // how do I pass just one date from the list here?
  fctExistingDF = ss.read.table(existingTable).filter(s"event_date in (...)")
  counter = counter + 1
}
I am new to Scala; maybe we should use foreach here?
Could someone please help. Thank you.

You can use foreach or map, depending on whether you want to return the values (map) or not (foreach):
import org.apache.spark.sql.functions.col
val partList = List("2021-10-01", "2021-10-02", "2021-10-03", "2021-10-04", "2021-10-05", "2021-10-06", "2021-10-07", "2021-10-08")
partList.foreach { date =>
  fctExistingDF = ss.read.table(existingTable).filter(col("event_date") === date)
}
If you want to return a list of dataframes, use:
val dfs = partList.map { date =>
  ss.read.table(existingTable).filter(col("event_date") === date)
}
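If the point of looping is to avoid materialising all the dates in one large dataframe, you could also do the whole read, transform and write per date inside the loop. A minimal sketch, assuming the same ss, existingTable and mergeTable as in the question, with the transformation step elided:
import org.apache.spark.sql.functions.col

// sketch only: handle one date per iteration so only that day's data
// is read, transformed and written at a time
partList.foreach { date =>
  val dailyDF = ss.read.table(existingTable).filter(col("event_date") === date)
  if (dailyDF.count() > 0) {
    // ... apply the same merge/transform logic as in the question on dailyDF ...
    dailyDF.write.mode("Append").format("delta").insertInto(mergeTable)
  }
}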

Related

Apache POI: Why Is a Row Missing When I Use shiftRows?

I'm shifting the rows in an Excel sheet and inserting a new row at the beginning of the sheet. However, regardless of how many rows I shift and insert, I seem to be ending up with one less row than I should be.
import org.apache.poi.ss.usermodel.Row
import Row.MissingCellPolicy._
import org.apache.poi.ss.usermodel.Sheet
import org.apache.poi.ss.usermodel.Workbook
import org.apache.poi.ss.util.CellRangeAddress
import org.apache.poi.ss.util.WorkbookUtil.createSafeSheetName
import org.apache.poi.xssf.usermodel.XSSFWorkbook
def shiftAndInsertRow(sheet: Sheet) = {
  val rowInsertionPoint = 0
  // shift all the rows down
  val lastRowNum = sheet.getLastRowNum
  println(s"Last row is $lastRowNum")
  val debugRow1 = sheet.getRow(rowInsertionPoint)
  val debugCell1 = debugRow1.getCell(0)
  // let's get a play-by-play of what's being attempted
  println(s"Current value in row $rowInsertionPoint is " +
    s"${debugCell1.getNumericCellValue}")
  println(s"Shifting rows $rowInsertionPoint and below down one row")
  sheet.shiftRows(rowInsertionPoint, lastRowNum, 1, true, true)
  val debugRow2 = sheet.getRow(rowInsertionPoint + 1)
  val debugCell2 = debugRow2.getCell(0)
  println(s"Current value in row ${rowInsertionPoint + 1} is now " +
    s"${debugCell2.getNumericCellValue}")
  println(s"Creating new row at $rowInsertionPoint in sheet")
  // create the new row
  val newRow = sheet.createRow(rowInsertionPoint)
  // set the field ID of the row
  val newCell = newRow.getCell(0, CREATE_NULL_AS_BLANK)
  println(s"Inserting value $lastRowNum at $rowInsertionPoint in sheet")
  newCell.setCellValue(lastRowNum)
  println()
}
val workbook = new XSSFWorkbook()
val sheet = workbook.createSheet(createSafeSheetName("Test 1"))
val rowNum = 0
val cellValue = -1
println(s"Creating new row at $rowNum in sheet")
// create the new row
val row = sheet.createRow(rowNum)
// set the field ID of the row
val cell = row.getCell(0, CREATE_NULL_AS_BLANK)
println(s"Inserting value $cellValue at $rowNum in sheet")
cell.setCellValue(cellValue)
println()
// insert a second row
shiftAndInsertRow(sheet)
// and a third
shiftAndInsertRow(sheet)
workbook.write(new java.io.FileOutputStream("out/test.xlsx"))
The above code creates a spreadsheet with only two rows instead of three. What am I missing?
I think your code is fine; it looks to me like this is a bug in apache-poi. It works for me on version 3.17 but breaks if I upgrade to 4.0.0.
As far as I can tell, the row num is being updated correctly, but the reference (cell.getReference) is not.
I would suggest trying to find if the bug has already been reported here https://bz.apache.org/bugzilla/buglist.cgi?product=POI and if not, filing a new bug report.
In the meantime, you could perhaps try this workaround which seems to do the trick for me. It calls updateCellReferencesForShifting on every cell in the spreadsheet.
import org.apache.poi.xssf.usermodel.XSSFCell
import scala.collection.JavaConverters._

for {
  row  <- sheet.rowIterator().asScala.toList
  cell <- row.cellIterator().asScala.toList
} yield cell.asInstanceOf[XSSFCell].updateCellReferencesForShifting("")
Place this block of code right after your call to shiftRows. No guarantees that it's not going to break something else though, so use with caution!
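For context, here is a rough sketch of how the workaround might slot into the shiftAndInsertRow method from the question (the method name and structure come from the question; the logging is omitted and this variant is untested beyond POI 4.0.0):
import org.apache.poi.ss.usermodel.Sheet
import org.apache.poi.xssf.usermodel.XSSFCell
import scala.collection.JavaConverters._

def shiftAndInsertRowPatched(sheet: Sheet): Unit = {
  val rowInsertionPoint = 0
  val lastRowNum = sheet.getLastRowNum
  sheet.shiftRows(rowInsertionPoint, lastRowNum, 1, true, true)
  // workaround: refresh the cached cell references after the shift
  for {
    row  <- sheet.rowIterator().asScala.toList
    cell <- row.cellIterator().asScala.toList
  } cell.asInstanceOf[XSSFCell].updateCellReferencesForShifting("")
  // then create the new row at the insertion point as before
  val newRow = sheet.createRow(rowInsertionPoint)
  newRow.createCell(0).setCellValue(lastRowNum)
}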

Sort/Order an Undetermined Number of Columns (LINQ\Entity Framework)

I need to sort/order a list of data based on an undetermined number of columns (1 or more).
What I'm trying to do is loop through the desired columns and add an OrderBy or ThenBy, based on their position, to the queried list, but I'm unsuccessful...
I've done this, but it doesn't compile:
var query = GetAllItems(); //returns an IQueryable list of items
//for each selected column
for (int i = 0; i < param.Columns.Length; i++)
{
    if (i == 0)
    {
        query = query.OrderBy(x => x.GetType().GetProperty(param.Columns[i].Name));
    }
    else
    {
        //ERROR: IQueryable does not contain a definition for "ThenBy" and no extension method "ThenBy"...
        query = query.ThenBy(x => x.GetType().GetProperty(param.Columns[i].Data));
    }
}
How can I resolve this issue? Or is there any alternative to accomplish this requirement?
SOLUTION: @Dave-Kidder's solution is well thought out and resolves the compile errors I had. Just one problem: OrderBy only executes (actually sorts the results) after a ToList() call. This is an issue because I can't convert the result of ToList back to an IOrderedQueryable.
So, after some research I came across a solution that resolves all my issues.
Microsoft assembly for the .Net 4.0 Dynamic language functionality: https://github.com/kahanu/System.Linq.Dynamic
using System.Linq.Dynamic; //need to install this package
Updated Code:
var query = GetAllItems(); //returns an IQueryable list of items
List<string> orderByColumnList = new List<string>(); //list of columns to sort
for (int i = 0; i < param.Columns.Length; i++)
{
    string column = param.Columns[i].Name;
    string direction = param.Columns[i].Dir;
    //ex.: "columnA ASC"
    string orderByColumn = column + " " + direction;
    //add column to list
    orderByColumnList.Add(orderByColumn);
}
//convert list to comma delimited string
string orderBy = String.Join(",", orderByColumnList.ToArray());
//sort by all columns, yay! :-D
var result = query.OrderBy(orderBy).ToList();
The problem is that ThenBy is not defined on IQueryable, but on the IOrderedQueryable interface (which is what IQueryable.OrderBy returns). So you need to define a new variable for the IOrderedQueryable in order to do subsequent ThenBy calls. I changed the original code a bit to use System.Data.DataTable (to get a similar structure to your "param" object). The code also assumes that there is at least one column in the DataTable.
// using System.Data.DataTable to provide similar object structure as OP
DataTable param = new DataTable();
IQueryable<DataTable> query = new List<DataTable>().AsQueryable();
// OrderBy returns IOrderedQueryable<TSource>, which is the interface that defines
// "ThenBy" so we need to assign it to a different variable if we wish to make subsequent
// calls to ThenBy
var orderedQuery = query.OrderBy(x => x.GetType().GetProperty(param.Columns[0].ColumnName));
//for each other selected column
for (int i = 1; i < param.Columns.Count; i++)
{
orderedQuery = orderedQuery.ThenBy(x => x.GetType().GetProperty(param.Columns[i].ColumnName));
}
You should write ThenBy after OrderBy, like this:
query = query
    .OrderBy(t => /* your condition */)
    .ThenBy(t => /* next condition */);

Scala : How to use variable in for loop outside loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems a variable defined in a for loop is not recognized outside the loop block. How do I overcome this issue?
for (jsonfilename <- fileArray) {
  var df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here: cannot resolve tblLanding
Thanks in advance,
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically, you specify the type and initialise the variable before the loop.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // works now, since df is declared outside the loop
Update
Ok, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
So the loop actually creates 4 dataframes, by reading the 4 JSON files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  df.registerTempTable(s"LandingTable_$count")
  count += 1
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read that file. Print it and check.
To query any of those registered LandingTable
val df2 = hiveContext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
// Create list of dataframes with source-file-names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")

How to count new elements from a stream by using spark-streaming

I have implemented a daily computation. Here is some pseudo-code.
"newUser" might also be called a first-activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(newHistoryUser, path + todayDate)
Computation of "activeUser" can be converted to spark-streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
  var firstLine = x.head
  x.foreach(line => {
    if (line._2 < firstLine._2) firstLine = line
  })
  firstLine
})
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is part of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD. For each DStream of "activeUser", find the elements that don't exist in "historyUser". A problem here is when I should update this "historyUser" RDD to make sure I can get the right "newUser" of a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like what I did in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to do this RDD update operation from a DStream. I think updating "historyUser" when the window slides is proper, but I haven't found a proper API to do this.
So what is the best practice to solve this problem?
updateStateByKey would help here as it allows you to set an initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  // Group all values for specific user as needed
  val groupedUserData: UserData = newValues.reduce(...)
  // prevState is defined only for users previously seen in the stream
  // or loaded as initial state from historyUsers RDD
  // For new users it is None
  val isNewUser = !prevState.isDefined
  // as you return state here for the user - prevState won't be None on next iterations
  Some(UserStatusState(isNewUser, groupedUserData))
}
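Two practical notes that are not part of the original answer. First, stateful operations such as updateStateByKey require checkpointing to be enabled on the streaming context (assuming ssc here is the StreamingContext behind sdkLogDs; the directory is just a placeholder):
// required for stateful transformations like updateStateByKey
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
Second, if you prefer to stay closer to the question's own pseudo-code, a rough, untested sketch of the "keep a history RDD and subtract it" idea could look like this (names follow the question; it assumes activeUser and historyUser share the same key and value types):
// known users as a pair RDD, held in a driver-side var
var historyUser = loadFromHdfs(path + yesterdayDate)

// per batch: anything in activeUser whose key is not yet in history is "new"
val newUserStream = activeUser.transform(rdd => rdd.subtractByKey(historyUser))

// fold the newly seen users back into the history set for later batches
newUserStream.foreachRDD { newUsers =>
  historyUser = historyUser.union(newUsers)
}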

Selecting keys of a TreeMap by their index

Imagine you have the following TreeMap:
var dates = new TreeMap[Long, Tuple2[Int, Double]]()
I know I can loop through it with:
dates.foreach { case (date, (id, rotation)) =>
...
}
But in my code, this loop takes place within another loop, and I would therefore like to advance through the dates keys myself, typically with a currIndex: Int variable that I would increment according to a condition.
I thought one could do something like:
date = dates.keys(currIndex)
but it doesn't look like this is possible... any idea how to do that?
Edit: trying to address your comment:
You can convert the whole keys to an IndexedSeq beforehand:
val keysSeq = dates.keySet.toIndexedSeq
// later, obtain an index
val index: Int = /* ... */
// lookup the key
val (valueInt, valueDouble) = dates(keysSeq(index))
Previous answer
You could try something like this:
dates.iterator.zipWithIndex.foreach {
  case ((key, (valueInt, valueDouble)), index) =>
    // use key, valueInt, valueDouble and index here
}
Would that work for you? I'm not sure I properly understand your requirement of “increment[ing currIndex] according to a condition”…