iText(Sharp): tables with headers and subheaders

I'm using iText (iTextSharp version 5.5.7) and I am creating a PdfPTable where the data in the rows is sorted. To give a specific example, say my data looks like this (including my headers - h1, h2, and h3):
+---+---+---+
|h1 |h2 |h3 |
+---+---+---+
| A | B | C |
+---+---+---+
| A | B | D |
+---+---+---+
| A | E | F |
+---+---+---+
| K | G | H |
+---+---+---+
| K | G | I |
+---+---+---+
| K | G | J |
+---+---+---+
I've got that working. Then I started setting the Rowspan property of PdfPCell so I can avoid printing repeated text. That's also working great; what I get is this:
+---+---+---+
|h1 |h2 |h3 |
+---+---+---+
| A | B | C |
|   |   +---+
|   |   | D |
|   +---+---+
|   | E | F |
+---+---+---+
| K | G | H |
|   |   +---+
|   |   | I |
|   |   +---+
|   |   | J |
+---+---+---+
The problem is that when I hit a page break, this is what I see:
+---+---+---+
|h1 |h2 |h3 |
+---+---+---+
| A | B | C |
|   |   +---+
|   |   | D |
|   +---+---+
|   | E | F |
+---+---+---+
| K | G | H |
+---+---+---+
Page Break
+---+---+---+
|h1 |h2 |h3 |
+---+---+---+
|   |   | I |
|   |   +---+
|   |   | J |
+---+---+---+
What I want is that when the second page starts, the spanned cells (in this case 'K' and 'G') are re-printed, so the user has some idea of what's going on.
What I need is similar to a header row, except that what the header row should contain changes as the rows are emitted.
Any ideas on how to make this work?

You can define header (and footer) rows for PdfPTable, but that won't solve your problem as these header (or footer) rows repeat a complete row whereas you only want to repeat part of a row.
This doesn't mean your requirement can't be met. You can work around the problem by adding the content in a cell event instead of adding it directly to a cell.
For instance: you currently add content such as A, B, K and G like this:
PdfPCell cell = new PdfPCell(new Phrase("A"));
cell.setRowspan(3);
table.addCell(cell);
If this cell is split and distributed over multiple pages, the content "A" will only appear on the first page. There won't be any content on the subsequent pages.
You can solve this by adding an empty cell for which you define a cell event:
PdfPCell cell = new PdfPCell();
cell.setCellEvent(new MyCellEvent("A"));
cell.setRowspan(3);
table.addCell(cell);
You now have to write an implementation of the PdfPCellEvent interface and implement its cellLayout method so that it adds the content (in this case "A") at the coordinates passed to that method as a parameter.
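A minimal sketch of what that event could look like (in Java, like the snippets above; the exact text placement, i.e. the +2 and -14 offsets, is an illustrative assumption):
import com.itextpdf.text.Element;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfPCell;
import com.itextpdf.text.pdf.PdfPCellEvent;
import com.itextpdf.text.pdf.PdfPTable;

public class MyCellEvent implements PdfPCellEvent {
    private final String content;

    public MyCellEvent(String content) {
        this.content = content;
    }

    // cellLayout is invoked for every rendered portion of the cell, so when a
    // rowspan cell is split across a page break, the content is drawn again
    // at the top of the cell's portion on the new page.
    public void cellLayout(PdfPCell cell, Rectangle position, PdfContentByte[] canvases) {
        PdfContentByte canvas = canvases[PdfPTable.TEXTCANVAS];
        ColumnText.showTextAligned(canvas, Element.ALIGN_LEFT,
                new Phrase(content),
                position.getLeft() + 2, position.getTop() - 14, 0);
    }
}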
For inspiration (and an idea on how to adapt my Java code to .NET), see Can I use an iTextSharp cell event to repeat data on the next page when a row is split?

Related

How to update a cell of a Spark DataFrame

I have a DataFrame in which I'm trying to update a cell depending on some conditions (like SQL UPDATE ... WHERE).
For example, let's say I have the following DataFrame:
+-------+-------+
| datas |isExist|
+-------+-------+
| AA    | x     |
| BB    | x     |
| CC    | O     |
| CC    | O     |
| DD    | O     |
| AA    | x     |
| AA    | x     |
| AA    | O     |
| AA    | O     |
+-------+-------+
How could I update the values to X when datas = AA and isExist is O? Here is the expected output (note that the columns also end up renamed, to IPCOPE2 and IPROPE2):
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
| AA    | x     |
| BB    | x     |
| CC    | O     |
| CC    | O     |
| DD    | O     |
| AA    | x     |
| AA    | x     |
| AA    | X     |
| AA    | X     |
+-------+-------+
I could do a filter and then a union, but I don't think that's the best solution. I could also use when, but then I would have to create a new line containing the same values except for the isExist column; that is acceptable in this example, but what if I have 20 columns?
You can create a new column using withColumn (putting either the original or the updated value) and then drop the isExist column.
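A sketch of that approach (isExistNew is a hypothetical name for the intermediate column; the condition still needs when or something similar):
import org.apache.spark.sql.functions.when
import spark.implicits._  // enables the 'colName symbol syntax used below

df.withColumn("isExistNew",
    when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
  .drop("isExist")
  .withColumnRenamed("isExistNew", "isExist")
  .show()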
I am not sure why you do not want to use when, as it seems to be exactly what you need. The withColumn method, when used with an existing column name, simply replaces the column with the new value:
df.withColumn("isExist",
when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
.show()
+-----+-------+
|datas|isExist|
+-----+-------+
|   AA|      x|
|   BB|      x|
|   CC|      O|
|   CC|      O|
|   DD|      O|
|   AA|      x|
|   AA|      x|
|   AA|      X|
|   AA|      X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns. (e.g. df.withColumnRenamed("datas", "IPCOPE2"))
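Putting both steps together to end up with the column names shown in the expected output (a sketch, continuing with the same imports as above; IPCOPE2 and IPROPE2 are taken from the question's table):
df.withColumn("isExist",
    when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
  .withColumnRenamed("datas", "IPCOPE2")
  .withColumnRenamed("isExist", "IPROPE2")
  .show()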

Forward Fill New Row to Account for Missing Dates

I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the rows with the prior row that maps to the variable in column x.
I've seen some solutions to similar problems using pandas, but ideally I would like to understand how best to approach this with a PySpark UDF.
I'd initially thought about something like the following with pandas, but I also struggled to implement it even as a first pass that just fills while ignoring the aggregator:
df = df.set_index(keys=[df.timestamp]).resample('1H', fill_method='ffill')
But ideally I'd like to avoid using pandas.
In the example below i have two missing rows of hourly data (labeled as MISSING).
| timestamp            | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A          |
| 2018-12-27T10:00:00Z | A          |
| MISSING              | MISSING    |
| 2018-12-27T12:00:00Z | A          |
| 2018-12-27T13:00:00Z | A          |
| 2018-12-27T09:00:00Z | B          |
| 2018-12-27T10:00:00Z | B          |
| 2018-12-27T11:00:00Z | B          |
| MISSING              | MISSING    |
| 2018-12-27T13:00:00Z | B          |
| 2018-12-27T14:00:00Z | B          |
The expected output here would be the following:
| timestamp            | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A          |
| 2018-12-27T10:00:00Z | A          |
| 2018-12-27T11:00:00Z | A          |
| 2018-12-27T12:00:00Z | A          |
| 2018-12-27T13:00:00Z | A          |
| 2018-12-27T09:00:00Z | B          |
| 2018-12-27T10:00:00Z | B          |
| 2018-12-27T11:00:00Z | B          |
| 2018-12-27T12:00:00Z | B          |
| 2018-12-27T13:00:00Z | B          |
| 2018-12-27T14:00:00Z | B          |
Appreciate the help.
Thanks.
Here is a solution to fill in the missing hours, using a window, lag, and a UDF. With a little modification it can be extended to days as well.
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.functions import col, explode, lag, udf
from dateutil.relativedelta import relativedelta

# Timestamps strictly between the current row (t1) and the previous row (t2).
# Note: comparing .hour only works while the gap stays within a single day.
def missing_hours(t1, t2):
    return [t1 + relativedelta(hours=-x) for x in range(1, t1.hour - t2.hour)]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

df = spark.read.csv('dates.csv', header=True, inferSchema=True)

window = Window.partitionBy("aggregator").orderBy("timestamp")

df_missing = df.withColumn("prev_timestamp", lag(col("timestamp"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp")))) \
    .drop("prev_timestamp")

df.union(df_missing).orderBy("aggregator", "timestamp").show()
which results in:
+-------------------+----------+
|          timestamp|aggregator|
+-------------------+----------+
|2018-12-27 09:00:00|         A|
|2018-12-27 10:00:00|         A|
|2018-12-27 11:00:00|         A|
|2018-12-27 12:00:00|         A|
|2018-12-27 13:00:00|         A|
|2018-12-27 09:00:00|         B|
|2018-12-27 10:00:00|         B|
|2018-12-27 11:00:00|         B|
|2018-12-27 12:00:00|         B|
|2018-12-27 13:00:00|         B|
|2018-12-27 14:00:00|         B|
+-------------------+----------+

Tallying in Scala DataFrame Array

I have a two-column Spark Scala DataFrame. The first column holds a single variable; the second holds an array of letters. What I am trying to do is find a way to tally (without using a for loop) the variables in the arrays.
For example, this is what I have (I am sorry it's not that neat, this is my first Stack Overflow post). You have 5 computers, and each person is represented by a letter. I want to find out how many computers each person (A, B, C, D, E) has used.
+-----------------+--------------+
| id              | [person]     |
+-----------------+--------------+
| Computer 1      | [A,B,C,D]    |
| Computer 2      | [A,B]        |
| Computer 3      | [A,B,E]      |
| Computer 4      | [A,C,D]      |
| Computer 5      | [A,B,C,D,E]  |
+-----------------+--------------+
What I would like to code up, or what I'm asking if anyone has a solution for, is something like this:
+---------+-----------+
| Person  | [Count]   |
+---------+-----------+
| A       | 5         |
| B       | 4         |
| C       | 3         |
| D       | 3         |
| E       | 2         |
+---------+-----------+
Somehow, I want to count the people who appear in the arrays within the DataFrame.
There's a function called explode which will expand the arrays into one row for each item:
+------------+--------+
| id         | person |
+------------+--------+
| Computer 1 | A      |
| Computer 1 | B      |
| Computer 1 | C      |
| Computer 1 | D      |
| ...        | ...    |
+------------+--------+
Then you can group by the person and count. Something like:
import org.apache.spark.sql.functions.explode
import spark.implicits._  // for the $"..." column syntax

val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count
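To display it like the tally in the question, you can sort by the generated column (a small usage sketch; groupBy(...).count names its result column "count"):
result.orderBy($"count".desc).show()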

Combine multiple columns to yield unique values

I'm trying to use Tableau (v10.1) to combine 5 separate columns and get a count of each distinct value across that combination. Some rows/columns are empty. For example:
+-------+-------+-------+-------+-------+
| Tag 1 | Tag 2 | Tag 3 | Tag 4 | Tag 5 |
+-------+-------+-------+-------+-------+
| A     | B     | C     | D     | E     |
| B     | D     | E     | -     | -     |
| -     | -     | -     | -     | -     |
| E     | A     | -     | -     | -     |
+-------+-------+-------+-------+-------+
I want to obtain the following in a Tableau worksheet:
+-----+-------+
| Tag | Count |
+-----+-------+
| E   | 3     |
| A   | 2     |
| B   | 2     |
| D   | 2     |
| C   | 1     |
+-----+-------+
I would like to do this in Tableau (using calculated fields, etc.) and not change the original data source.
Click on the Data Source tab, select the five fields named Tag #, and then use the Pivot command to reshape the data without changing the original source.

org mode spreadsheet formula for the number of lines in a cell

I am looking for an org-mode spreadsheet formula to get the number of non-empty lines in a cell. Example:
| col1 | col2 |
|------+------|
| a    | 3    |
| b    |      |
| c    |      |
|      |      |
|------+------|
| a    | 1    |
|      |      |
|------+------|
| a    | 2    |
| b    |      |
|      |      |
|------+------|
I have "col1" as input, and would like to fill "col2" automatically (the values can be anything, not just a b c).
Note that what you call "cell" is actually a group of cells delimited by horizontal separators (hlines).
The following example uses Calc's vlen function to get the size of the vector of cells in column 1, taken from the rows between the previous (@-I) and the next (@+I) hline. Empty fields in such a range are skipped by default, so vlen effectively counts the non-empty lines.
| col1 | col2 |
|------+------|
| a    | 3    |
| b    |      |
| c    |      |
|      |      |
|------+------|
#+TBLFM: @2$2=vlen(@-I$1..@+I$1)
You have to apply this same formula to every row group, as sketched below.
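For the three row groups in the question's table, that could look like the following single #+TBLFM line, with the formulas joined by :: (the row numbers @2, @6, and @8 assume that hlines are not counted when numbering rows):
#+TBLFM: @2$2=vlen(@-I$1..@+I$1)::@6$2=vlen(@-I$1..@+I$1)::@8$2=vlen(@-I$1..@+I$1)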