Display %ROWCOUNT value in a select statement - intersystems-cache

How is the result of %ROWCOUNT displayed in the SQL statement.
Example
Select top 10 * from myTable.
I would like the results to have a rowCount for each row returned in the result set
Ex
+----------+--------+---------+
|rowNumber |Column1 |Column2 |
+----------+--------+---------+
|1 |A |B |
|2 |C |D |
+----------+--------+---------+

There are no any simple way to do it. You can add Sql Procedure with this functionality and use it in your SQL statements.
For example, class:
Class Sample.Utils Extends %RegisteredObject
{
ClassMethod RowNumber(Args...) As %Integer [ SqlProc, SqlName = "ROW_NUMBER" ]
{
quit $increment(%rownumber)
}
}
and then, you can use it in this way:
SELECT TOP 10 Sample.ROW_NUMBER(id) rowNumber, id,name,dob
FROM sample.person
ORDER BY ID desc
You will get something like below
+-----------+-------+-------------------+-----------+
|rowNumber |ID |Name |DOB |
+-----------+-------+-------------------+-----------+
|1 |200 |Quigley,Neil I. |12/25/1999 |
|2 |199 |Zevon,Imelda U. |04/22/1955 |
|3 |198 |O'Brien,Frances I. |12/03/1944 |
|4 |197 |Avery,Bart K. |08/20/1933 |
|5 |196 |Ingleman,Angelo F. |04/14/1958 |
|6 |195 |Quilty,Frances O. |09/12/2012 |
|7 |194 |Avery,Susan N. |05/09/1935 |
|8 |193 |Hanson,Violet L. |05/01/1973 |
|9 |192 |Zemaitis,Andrew H. |03/07/1924 |
|10 |191 |Presley,Liza N. |12/27/1978 |
+-----------+-------+-------------------+-----------+

If you are willing to rewrite your query then you can use a view counter to do what you are looking for. Here is a link to the docs.
The short version is you move your query into a FROM clause sub query and use the special field %vid.
SELECT v.%vid AS Row_Counter, Name
FROM (SELECT TOP 10 Name FROM Sample.Person ORDER BY Name) v
Row_Counter Name
1 Adam,Thelma P.
2 Adam,Usha J.
3 Adams,Milhouse A.
4 Allen,Xavier O.
5 Avery,James R.
6 Avery,Kyra G.
7 Bach,Ted J.
8 Bachman,Brian R.
9 Basile,Angelo T.
10 Basile,Chad L.

Related

Scala/Spark; Add column to DataFrame that increments by 1 when a value is repeated in another column

I have a dataframe called rawEDI that looks something like this;
Line_number
Segment
1
ST
2
BPT
3
SE
4
ST
5
BPT
6
N1
7
SE
8
ST
9
PTD
10
SE
Each row represents a line in a file. Each line is called a segment and is denoted by something called a segment identifier; a short string. Segments are grouped together in chunks that start with an ST segment identifier and end with an SE segment segment identifier. There can be any number of ST chunks in a given file and the size of each any ST chunk is not fixed.
I want to create a new column on the dataframe that represents numerically what ST group a given segment belongs to. This will allow me to use groupBy to perform aggregate operations across all ST segments without having to loop over each individual ST segment, which is too slow.
The final DataFrame would look like this;
Line_number
Segment
ST_Group
1
ST
1
2
BPT
1
3
SE
1
4
ST
2
5
BPT
2
6
N1
2
7
SE
2
8
ST
3
9
PTD
3
10
SE
3
In short, I want to create and populate a DataFrame column with a number that increments by one whenever the value "ST" appears in the Segment column.
I am using spark 2.3.2 and scala 2.11.8
My initial thought was to use iteration. I collected another DataFrame, df, that contained the starting and ending line_number for each segment, looking like this;
Start
End
1
3
4
7
8
10
Then iterate over the rows of the dataframe and use them to populate the new column like this;
var st = 1
for (row <- df.collect()) {
val start = row(0)
val end = row(1)
var labelSTs = rawEDI.filter("line_number > = ${start}").filter("line_number <= ${end}").withColumn("ST_Group", lit(st))
st = st + 1
However, this yields an empty DataFrame. Additionally, the use of a for loop is time-prohibitive, taking over 20s on my machine for this. Achieving this result without the use of a loop would be huge, but a solution with a loop may also be acceptable if performant.
I have a hunch this can be accomplished using a udf or a Window, but I'm not certain how to attack that.
This
val func = udf((s:String) => if(s == "ST") 1 else 0)
var labelSTs = rawEDI.withColumn("ST_Group", func((col("segment")))
Only populates the column with 1 at each ST segment start.
And this
val w = Window.partitionBy("Segment").orderBy("line_number")
val labelSTs = rawEDI.withColumn("ST_Group", row_number().over(w)
Returns a nonsense dataframe.
One way is to create an intermediate dataframe of "groups" that would tell you on which line each group starts and ends (sort of what you've already done), and then join it to the original table using greater-than/less-than conditions.
Sample data
scala> val input = Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),
(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE"))
.toDF("linenumber","segment")
scala> input.show(false)
+----------+-------+
|linenumber|segment|
+----------+-------+
|1 |ST |
|2 |BPT |
|3 |SE |
|4 |ST |
|5 |BPT |
|6 |N1 |
|7 |SE |
|8 |ST |
|9 |PTD |
|10 |SE |
+----------+-------+
Create a dataframe for groups, using Window just as your hunch was telling you:
scala> val groups = input.where("segment='ST'")
.withColumn("endline",lead("linenumber",1) over Window.orderBy("linenumber"))
.withColumn("groupnumber",row_number() over Window.orderBy("linenumber"))
.withColumnRenamed("linenumber","startline")
.drop("segment")
scala> groups.show(false)
+---------+-----------+-------+
|startline|groupnumber|endline|
+---------+-----------+-------+
|1 |1 |4 |
|4 |2 |8 |
|8 |3 |null |
+---------+-----------+-------+
Join both to get the result
scala> input.join(groups,
input("linenumber") >= groups("startline") &&
(input("linenumber") < groups("endline") || groups("endline").isNull))
.select("linenumber","segment","groupnumber")
.show(false)
+----------+-------+-----------+
|linenumber|segment|groupnumber|
+----------+-------+-----------+
|1 |ST |1 |
|2 |BPT |1 |
|3 |SE |1 |
|4 |ST |2 |
|5 |BPT |2 |
|6 |N1 |2 |
|7 |SE |2 |
|8 |ST |3 |
|9 |PTD |3 |
|10 |SE |3 |
+----------+-------+-----------+
The only problem with this is Window.orderBy() on an unpartitioned dataframe, which would collect all data to a single partition and thus could be a killer.
if you want just to add column with a number that increments by one whenever the value "ST" appears in the Segment column, you can filter lines with the ST segment in a separate dataframe,
var labelSTs = rawEDI.filter("segement == 'ST'");
// then group by ST and collect to list the linenumbers
var groupedDf = labelSTs.groupBy("Segment").agg(collect_list("Line_number").alias("Line_numbers"))
// now you need to flat back the data frame and log the line number index
var flattedDf = groupedDf.select($"Segment", explode($"Line_numbers").as("Line_number"))
// log the line_number index in your target column ST_Group
val withIndexDF = flattenedDF.withColumn("ST_Group", row_number().over(Window.partitionBy($"Segment").orderBy($"Line_number")))
and you have this as result:
+-------+----------+----------------+
|Segment|Line_number|ST_Group |
+-------+----------+----------------+
| ST| 1| 1|
| ST| 4| 2|
| ST| 8| 3|
+-------|----------|----------------|
then you concat this with other Segement in the initial dataframe.
Found a more simpler way, add a column which will have 1 when the segment column value is ST, otherwise it will have 0. Then using Window function find the cummulative sum of that new column. This will give you the desired results.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val rawEDI=Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE")).toDF("line_number","segment")
val newDf=rawEDI.withColumn("ST_Group", ($"segment" === "ST").cast("bigint"))
val windowSpec = Window.orderBy("line_number")
newDf.withColumn("ST_Group", sum("ST_Group").over(windowSpec))
.show
+-----------+-------+--------+
|line_number|segment|ST_Group|
+-----------+-------+--------+
| 1| ST| 1|
| 2| BPT| 1|
| 3| SE| 1|
| 4| ST| 2|
| 5| BPT| 2|
| 6| N1| 2|
| 7| SE| 2|
| 8| ST| 3|
| 9| PTD| 3|
| 10| SE| 3|
+-----------+-------+--------+

Get last value of previous partition/group in pyspark

I have a dataframe looking like this (just some example values):
| id | timestamp | mode | trip | journey | value |
1 2021-09-12 23:59:19.717000 walking 1 1 1.21
1 2021-09-12 23:59:38.617000 walking 1 1 1.36
1 2021-09-12 23:59:38.617000 driving 2 1 1.65
2 2021-09-11 23:52:09.315000 walking 4 6 1.04
I want to create new columns which I fill with the previous and next mode. Something like this:
| id | timestamp | mode | trip | journey | value | prev | next
1 2021-09-12 23:59:19.717000 walking 1 1 1.21 bus driving
1 2021-09-12 23:59:38.617000 walking 1 1 1.36 bus driving
1 2021-09-12 23:59:38.617000 driving 2 1 1.65 walking walking
2 2021-09-11 23:52:09.315000 walking 4 6 1.0 walking driving
I have tried to partition by id, trip, journey and mode and ordered by timestamp. Then I tried to use lag() and lead() but I am not sure these work on other partitions. I came across the Window.unboundedPreceding and Window.unboundedFollowing, however I am not sure I completely understand how these work. In my mind I think that if I partition the data as explained above I will always just need the last value of mode from the previous partition and to fill the next I could reorder the partition from ascending to descending on the timestamp and then do the same to fill the next column. However, I am unsure how I get the last value of the previous partition.
I have tried this:
w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explainations using pyspark will be very appreciated!
So, based on what I could understand you could do something like this,
Create a partition based on ID and their journey, within each journey there are multiple trips, so order by trip and lastly the timestamp, and then simply use the lead and lag to get the output!
w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
.withColumn('next', F.lead('mode', 1).over(w)) \
.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
Okay as OP asked, you can do this to achieve it,
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')
# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
.filter(F.col('rn') == 1) \
.withColumn('prev', F.lag('mode', 1).over(w1)) \
.withColumn('next', F.lead('mode', 1).over(w1)) \
.drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
.select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+

How to get the two nearest values in spark scala DataFrame

Hi EveryOne I'm new in Spark scala. I want to find the nearest values by partition using spark scala. My input is something like this:
first row for example: value 1 is between 2 and 7 in the value2 columns
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
My output should like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
Can someone guide me how to resolve this please.
Instead of providing a code answer as you appear to want to learn I've provided you pseudo code and references to allow you to find the answers for yourself.
Group the elements (select id, value1) (aggregate on value2
with collect_list) so you can collect all the value2 into an
array.
select (id, and (add(concat) value1 to the collect_list array)) Sorting the array .
find( array_position ) value1 in the array.
splice the array. retrieving value before and value after
the result of (array_position)
If the array is less than 3 elements do error handling
now the last value in the array and the first value in the array are your 'closest numbers'.
You will need window functions for this.
val window = Window
.partitionBy("id", "value1")
.orderBy(asc("value2"))
val result = df
.withColumn("prev", lag("value2").over(window))
.withColumn("next", lead("value2").over(window))
.withColumn("dist_prev", col("value2").minus(col("prev")))
.withColumn("dist_next", col("next").minus(col("value2")))
.withColumn("min", min(col("dist_prev")).over(window))
.filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
.drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

PySpark sql CASE fails

i've encoutered strange behaviour when working with PySpark sqlContext. The problem is best ilustrated in the code below.
I am checking the value of COLUMN in simple case statement. However WHEN is not triggered even though the condition checks TRUE and always jumps to ELSE. Am I doing something wrong with the syntax here?
dataTest = spark.sql("""SELECT
COLUMN > 1,
CASE COLUMN
WHEN COLUMN > 1 THEN 1
ELSE COLUMN
END AS COLUMN_2,
COLUMN
FROM TABLE
""")
dataTest.sort(col("COLUMN").desc()).show(5, False)
+---------------+-------------+---------+
|COLUMN >1 |COLUMN_2 |COLUMN |
+---------------+-------------+---------+
|true |14 |14 |
|true |5 |5 |
|true |4 |4 |
|true |3 |3 |
|true |2 |2 |
+---------------+-------------+---------+
You are missing the syntax, try:
SELECT
COLUMN > 1,
CASE WHEN COLUMN > 1 THEN 1
ELSE COLUMN
END AS COLUMN_2,
COLUMN
FROM TABLE
Notice there's no COLUMN between CASE and WHEN keywords.

SQL Joining two table

I am struggling, maybe the simplest problem ever. My SQL knowledge pretty much limits me from achieving this. I am trying to build an sql query that should show JobTitle, Note and NoteType. Here is the thing, First job doesn't have any note but we should see it in the results. System notes never and ever should be displayed. An expected result should look like this
Result:
--------------------------------------------
|ID |Title |Note |NoteType |
--------------------------------------------
|1 |FirstJob |NULL |NULL |
|2 |SecondJob |CustomNot1|1 |
|2 |SecondJob |CustomNot2|1 |
|3 |ThirdJob |NULL |NULL |
--------------------------------------------
.
My query (doesn't work, doesn't display third job)
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N ON N.JobId = J.ID
WHERE N.NoteType IS NULL OR N.NoteType = 1
My Tables:
My JOB Table
----------------------
|ID |Title |
----------------------
|1 |FirstJob |
|2 |SecondJob |
|3 |ThirdJob |
----------------------
My NOTE Table
--------------------------------------------
|ID |JobId |Note |NoteType |
--------------------------------------------
|1 |2 |CustomNot1|1 |
|2 |2 |CustomNot2|1 |
|3 |2 |SystemNot1|2 |
|4 |2 |SystemNot3|2 |
|5 |3 |SystemNot1|2 |
--------------------------------------------
This can't be true together (NoteType can't be NULL as well as 1 at the same time):
WHERE N.NoteType IS NULL AND N.NoteType = 1
You may want to use OR instead to check if NoteType is either NULL or 1.
WHERE N.NoteType IS NULL OR N.NoteType = 1
EDIT: With corrected query, your third job will not be retrieved as JOB_ID is matching but its the row getting filtered out because of the where condition.
Try below as work around to get the third job with null values.
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN
( SELECT JOBID NOTE, NOTETYPE FROM NOTE
WHERE N.NoteType IS NULL OR N.NoteType = 1) N
ON N.JobId = J.ID
just exclude the systemNotes and use a sub-select:
select * from job j
left outer join (
select * from note where notetype!=2
) n
on j.id=n.jobid;
if you include the joined table into where then left outer join might work as an inner join.