Apache POI: Why Is a Row Missing When I Use shiftRows? - scala

I'm shifting the rows in an Excel sheet and inserting a new row at the beginning of the sheet. However, regardless of how many rows I shift and insert, I seem to end up with one fewer row than I should have.
import org.apache.poi.ss.usermodel.Row
import Row.MissingCellPolicy._
import org.apache.poi.ss.usermodel.Sheet
import org.apache.poi.ss.usermodel.Workbook
import org.apache.poi.ss.util.CellRangeAddress
import org.apache.poi.ss.util.WorkbookUtil.createSafeSheetName
import org.apache.poi.xssf.usermodel.XSSFWorkbook

def shiftAndInsertRow(sheet: Sheet) = {
  val rowInsertionPoint = 0
  // shift all the rows down
  val lastRowNum = sheet.getLastRowNum
  println(s"Last row is $lastRowNum")
  val debugRow1 = sheet.getRow(rowInsertionPoint)
  val debugCell1 = debugRow1.getCell(0)
  // let's get a play-by-play of what's being attempted
  println(s"Current value in row $rowInsertionPoint is " +
    s"${debugCell1.getNumericCellValue}")
  println(s"Shifting rows $rowInsertionPoint and below down one row")
  sheet.shiftRows(rowInsertionPoint, lastRowNum, 1, true, true)
  val debugRow2 = sheet.getRow(rowInsertionPoint + 1)
  val debugCell2 = debugRow2.getCell(0)
  println(s"Current value in row ${rowInsertionPoint + 1} is now " +
    s"${debugCell2.getNumericCellValue}")
  println(s"Creating new row at $rowInsertionPoint in sheet")
  // create the new row
  val newRow = sheet.createRow(rowInsertionPoint)
  // set the field ID of the row
  val newCell = newRow.getCell(0, CREATE_NULL_AS_BLANK)
  println(s"Inserting value $lastRowNum at $rowInsertionPoint in sheet")
  newCell.setCellValue(lastRowNum)
  println()
}
val workbook = new XSSFWorkbook()
val sheet = workbook.createSheet(createSafeSheetName("Test 1"))
val rowNum = 0
val cellValue = -1
println(s"Creating new row at $rowNum in sheet")
// create the new row
val row = sheet.createRow(rowNum)
// set the field ID of the row
val cell = row.getCell(0, CREATE_NULL_AS_BLANK)
println(s"Inserting value $cellValue at $rowNum in sheet")
cell.setCellValue(cellValue)
println()
// insert a second row
shiftAndInsertRow(sheet)
// and a third
shiftAndInsertRow(sheet)
workbook.write(new java.io.FileOutputStream("out/test.xlsx"))
The above code creates a spreadsheet with only two rows instead of three. What am I missing?

I think your code is fine; it looks to me like this is a bug in Apache POI. It works for me on version 3.17 but breaks if I upgrade to 4.0.0.
As far as I can tell, the row num is being updated correctly, but the reference (cell.getReference) is not.
I would suggest trying to find if the bug has already been reported here https://bz.apache.org/bugzilla/buglist.cgi?product=POI and if not, filing a new bug report.
In the meantime, you could perhaps try this workaround which seems to do the trick for me. It calls updateCellReferencesForShifting on every cell in the spreadsheet.
import org.apache.poi.xssf.usermodel.XSSFCell
import scala.collection.JavaConverters._

for {
  row <- sheet.rowIterator().asScala.toList
  cell <- row.cellIterator().asScala.toList
} yield cell.asInstanceOf[XSSFCell].updateCellReferencesForShifting("")
Place this block of code right after your call to shiftRows. No guarantees that it's not going to break something else though, so use with caution!
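For what it's worth, you can sanity-check the result by reopening the written workbook and counting the rows it actually contains. This is just a quick verification sketch, assuming the out/test.xlsx path from the question:
import org.apache.poi.ss.usermodel.WorkbookFactory

// Reopen the file that was just written and report how many rows the sheet holds.
// With the workaround in place you should see one more row after each shift-and-insert.
val written = WorkbookFactory.create(new java.io.File("out/test.xlsx"))
val checkSheet = written.getSheetAt(0)
println(s"Rows present: ${checkSheet.getPhysicalNumberOfRows}")
written.close()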

Related

How to reset a table

Is it possible to reset a Scala Swing table, or remove it from the container, after clicking on a button?
I've tried to create a val with that table, but I always end up with a new table stacked under the old one.
Here is the code:
// here is the most crucial part: when the user clicks on the button, it appends a new table, but if we start again it appends below the old one; I want a kind of reset or removal of the table here
contents = new BoxPanel(Orientation.Vertical) {
  contents += new Label("Hello, you're welcome")
  contents += Button("Query") {
    val query: ScrollPane = new ScrollPane(changeCountry())
    contents -= query
    Try {
      contents += query
    }.getOrElse(Dialog.showMessage(contents.head, "Incorrect input! It seems that input isn't in the list; write a different code or country"))
  }

  // this part asks the user to write text in the input, to display the table based on the parameter of my function
  def changeCountry(): Table = {
    val text = Dialog.showInput(parent = contents.head, message = "Write a code of a country or a country", initial = "test")
    text match {
      case Some(s) => airportRunwayByCountry(s)
    }
  }

  // this part below creates the table
  def airportRunwayByCountry(code: String): Table = {
    val headers = Seq("Airport", "Runway linked")
    val rowData = Functions.findAirportAndRunwayByCountry(code).map(x => x.productIterator.toArray).toArray
    val tableAirportRunway = new Table(rowData, headers)
    tableAirportRunway
  }
}
Solved with the "remove" method of containers.
Here is the code:
Try {
  if (contents.length == 3) { // number of items in my Box
    // at this moment, only add the table because no other table exists
    contents += new ScrollPane(changeCountry())
  }
  else {
    contents -= contents.remove(3) // get the index of the old table and remove it at this position
    contents += new ScrollPane(changeCountry()) // at this moment, this content will have the id n°2, and the loop can start over without errors
  }
}
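An alternative that avoids hard-coding the index is to keep a reference to the previously added pane and remove it directly. This is only a sketch under the same BoxPanel setup as above, with a hypothetical currentTable variable:
// Hypothetical variant: track the last ScrollPane that was added so it can be
// removed by reference instead of by its position in the container.
var currentTable: Option[ScrollPane] = None

contents += Button("Query") {
  currentTable.foreach(contents -= _) // drop the previous table, if any
  val pane = new ScrollPane(changeCountry())
  contents += pane
  currentTable = Some(pane)
  revalidate()
  repaint()
}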

Looping the scala list in Spark

I have a Scala list as below.
partList: ListBuffer(2021-10-01, 2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05, 2021-10-06, 2021-10-07, 2021-10-08)
Currently I'm getting all the data from the source into the dataframe based on the above dates.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Later I'm doing a few transformations and loading the data into a Delta table. The sample code is below.
fctDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
if (fctExistingDF.count() > 0) {
  fctDF.createOrReplaceTempView("vw_exist_fct")
  val existingRecordsQuery = getExistingRecordsMergeQuery(azUpdateTS, key)
  ss.sql(existingRecordsQuery)
    .drop("az_insert_ts").drop("az_update_ts")
    .withColumn("az_insert_ts", col("new_az_insert_ts"))
    .withColumn("az_update_ts", col("new_az_update_ts"))
    .drop("new_az_insert_ts").drop("new_az_update_ts")
    .select(mrg_tbl_cols(0), mrg_tbl_cols.slice(1, mrg_tbl_cols.length): _*)
    .coalesce(72 * 2)
    .write.mode("Append").format("delta")
    .insertInto(mergeTable)

  mergedDataDF = ss.read.table(mergeTable).coalesce(72 * 2)
  mergedDataDF.coalesce(72)
    .write.mode("Overwrite").format("delta")
    .insertInto(s"${tgtSchema}.${tgtTbl}")
}
The below command in the code is creating a dataframe based on the filter condition on the event_date present in the partList.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Since it creates a dataframe with a huge amount of data, I want to loop over each date in the partList and read the data into the dataframe one date at a time, instead of filtering on all the dates in the partList at once.
I tried the following.
var counter = 0
while (counter < partList.length) {
  // I want to pass one date from the list here
  fctExistingDF = ss.read.table(existingTable).filter(s"event_date = '${partList(counter)}'")
  counter = counter + 1
}
I am new to Scala; maybe we should use foreach here?
Could someone please help. Thank you.
You can use foreach or map, depending on whether you want to return the values (map) or not (foreach):
import org.apache.spark.sql.functions.col

val partList = List("2021-10-01", "2021-10-02", "2021-10-03", "2021-10-04", "2021-10-05", "2021-10-06", "2021-10-07", "2021-10-08")

partList.foreach { date =>
  fctExistingDF = ss.read.table(existingTable).filter(col("event_date") === date)
}
If you want to return list of dataframes, use:
val dfs = partList.map { date =>
  ss.read.table(existingTable).filter(col("event_date") === date)
}
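If you eventually still need everything in a single dataframe, one option is to read each date separately and union the pieces afterwards. A minimal sketch, reusing ss, existingTable and partList from the question (union assumes Spark 2.x; use unionAll on older versions):
import org.apache.spark.sql.functions.col

// Read one dataframe per date, then stitch them back together.
val perDateDFs = partList.map { date =>
  ss.read.table(existingTable).filter(col("event_date") === date)
}
val combinedDF = perDateDFs.reduce(_ union _)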

Gatling & Scala : How to split values in loop?

I want to split some values in a loop. I used the split method in a check and it works for me. But there are more than 25 values of two different types.
So I am trying to implement a loop in Scala and struggling.
Consider the following scenario:
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class testSimulation extends Simulation {

  val httpProtocol = http
    .baseURL("https://website.com")
    .doNotTrackHeader("1")
    .disableCaching

  val uri1 = "https://website.com"

  val scn = scenario("EditAttribute")
    .exec(http("LogIn")
      .post(uri1 + "/web/guest/")
      .headers(headers_0))
    .exec(http("getPopupData")
      .post("/website/getPopupData")
      .check(jsonPath("$.data[0].pid").transform(_.split('#').toSeq).saveAs("pID"))) // Saving the split value
    .exec(http("Listing")
      .post("/website/listing")
      .check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // All values are collected in a vector
      // .check(jsonPath("$.data[*].AdId").transform(_.split('#').toSeq).saveAs("aID")) // Split method not working for the batch
      // .check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // To verify the length of the array (vector)
      .check(jsonPath("$.data[0].RcId").findAll.saveAs("rID")))
    .exec(http("UpdatedDataListing")
      .post("/website/search")
      .formParam("entityTypeId", "${pID(0)}") // passing the split value, perfectly done
      .formParam("action_id", "${aID(0)},${aID(1)},${aID(2)},...") // and so on; need to pass the split values, which is not happening
      .formParam("userId", "${rID}"))
    // To verify values on the console (what value I am getting after splitting)...
    .exec { session =>
      val abc = session("pID").as[Seq[String]]
      val xyz = session("aID").as[Seq[String]]
      println("Separated pId ===> " + abc(0)) // output - first split value
      println("Separated pId ===> " + abc(1)) // split separator
      println("Separated pId ===> " + abc(2)) // second split value
      println("Length ===> " + abc.length)    // output - 3
      println("Length ===> " + xyz.length)    // output - 25
      session
    }
    .exec(http("logOut")
      .get(uri1 + "/logout")
      .headers(headers_0))

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
I want to implement a loop which splits all (25) values in the session. I do not want to hard-code them.
I am a newbie to Scala and Gatling as well.
Since it is a session function, the snippet below should give you a direction to continue; use split just like you would in Java:
exec { session =>
  var requestIdValue = new scala.util.Random().nextInt(Integer.MAX_VALUE).toString()
  var length = jobsQue.length
  try {
    var reportElement = jobsQue.pop()
    jobData = reportElement.getData
    xml = Configuration.XML.replaceAll("requestIdValue", requestIdValue)
    println(s"For Request Id : $requestIdValue . Data Value from feeder is : $jobData Current size of jobsQue : $length")
  } catch {
    case e: NoSuchElementException => print("Error")
  }
  session.setAll(
    "xmlRequest" -> xml)
}
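For the original problem of passing all 25 values without hard-coding each index, one option is to join the saved Seq into a single string inside a session function and reference that new attribute from formParam. This is only a sketch under the question's setup, with a hypothetical aIDJoined session key:
// Join every collected AdId into one comma-separated string so the request
// does not need ${aID(0)}, ${aID(1)}, ... spelled out by hand.
.exec { session =>
  val allAdIds = session("aID").as[Seq[String]].mkString(",")
  session.set("aIDJoined", allAdIds)
}
.exec(http("UpdatedDataListing")
  .post("/website/search")
  .formParam("action_id", "${aIDJoined}"))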

Scala : How to use variable in for loop outside loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems the variable defined in the for loop is not recognized outside the loop. How do I overcome this issue?
for (jsonfilename <- fileArray) {
  var df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolved tblLanding
Thanks in advance,
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically, you declare the variable with its type and initialise it before the loop.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // Getting ERROR here
Update
OK, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
So the loop actually creates 4 dataframes, by reading 4 JSON files.
Which one do you want to register as a temp table?
If all of them,
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  df.registerTempTable(s"LandingTable_$count")
  count += 1 // note: Scala has no ++ operator
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read a file. Print it and check.
To query any of those registered LandingTables:
val df2 = hiveContext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
// Create a list of dataframes with source file names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}

// union the dataframes (assuming all have the same schema)
val df = dfList.reduce(_ unionAll _) // or use union on Spark 2.x

// register as table
df.registerTempTable("LandingTable")
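As a quick check that the registered table is usable, you could run an illustrative query against it, for example counting rows per source file (the query itself is just an example):
// Example query against the registered temp table: rows per source file.
val perFileCounts = hivecontext.sql(
  "SELECT source_file_name, COUNT(*) AS cnt FROM LandingTable GROUP BY source_file_name")
perFileCounts.show()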

How to count new element from stream by using spark-streaming

I have implemented a daily computation. Here is some pseudo-code.
A "newUser" could also be called a first-activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(newHistoryUser, path + todayDate)
Computation of "activeUser" can be converted to spark-streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})

val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
  var firstLine = x.head
  x.foreach(line => {
    if (line._2 < firstLine._2) firstLine = line
  })
  firstLine
})
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is part of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD. For each DStream batch of "activeUser", find the elements that don't exist in "historyUser". A problem here is when I should update this "historyUser" RDD to make sure I get the right "newUser" for a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like what I did in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to perform this RDD update from a DStream. I think updating "historyUser" when the window slides is proper, but I haven't found a suitable API to do this.
So what is the best practice to solve this problem?
updateStateByKey would help here, as it allows you to set the initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  // Group all values for the specific user as needed
  val groupedUserData: UserData = newValues.reduce(...)
  // prevState is defined only for users previously seen in the stream
  // or loaded as initial state from the historyUsers RDD.
  // For new users it is None
  val isNewUser = !prevState.isDefined
  // as you return state here for the user, prevState won't be None on the next iterations
  Some(UserStatusState(isNewUser, groupedUserData))
}
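One practical detail not shown in the snippet above: stateful transformations such as updateStateByKey require checkpointing to be enabled on the StreamingContext, roughly like this (ssc and the directory are placeholders for your own context and path):
// Required for updateStateByKey: configure a checkpoint directory
// on the StreamingContext before starting the stream.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")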