Clean dirty data - matlab

I have three variables (ID, Name and City) and need to generate a new variable flag.
Something is wrong with some of the observations. I need to find the wrong observations and create the flag, which indicates which column contains the wrong value.
Assume at most one bad value in each row.
Given dirty data:
|ID |Name |City
|1 |IBM |D
|1 |IBM |D
|2 |IBM |D
|3 |Google |F
|3 |Microsoft |F
|3 |Google |F
|8 |Microsoft |A
|8 |Microsoft |B
|8 |Microsoft |A
Result
|ID |Name |City |flag
|1 |IBM |D |0
|1 |IBM |D |0
|2 |IBM |D |1
|3 |Google |F |0
|3 |Microsoft |F |2
|3 |Google |F |0
|8 |Microsoft |A |0
|8 |Microsoft |B |3
|8 |Microsoft |A |0

Here is an answer in Stata; it rests on several assumptions that you pointed out in the comments but not in the initial question:
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
// tag duplicated rows: a duplicated observation is known to be correct
duplicates tag, gen(right)
gen flag = 0
encode Name, gen(Name_n)
encode City, gen(City_n)
qui sum
forvalues start = 1(3)`r(N)' {
    local end = `start' + 2
    // check whether ID is the same within the group
    qui sum ID in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 1 in `start'/`end' if right == 0
        continue
    }
    // check whether Name is the same within the group
    qui sum Name_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 2 in `start'/`end' if right == 0
        continue
    }
    // check whether City is the same within the group
    qui sum City_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 3 in `start'/`end' if right == 0
        continue
    }
}
drop right Name_n City_n
The intuition: because the rows come in groups of 3, two of the three are always right, there is only one issue per group, and the data are sorted by ID (which can be wrong, but never greater than the next correct ID). So we can first check for duplicates; if an observation is duplicated, it is right.
Next, in the forvalues loop, we go through each group of three to see which variable has the wrong value; when we find it, we set flag to the appropriate number.

This code is based on Eric's answer.
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
encode Name, gen(Name_n)
encode City, gen(City_n)
// tag duplicates for each combination of columns
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0
drop Name_n City_n col_*


Get last value of previous partition/group in pyspark

I have a dataframe looking like this (just some example values):
| id | timestamp                  | mode    | trip | journey | value |
|----|----------------------------|---------|------|---------|-------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.04  |
I want to create new columns which I fill with the previous and next mode. Something like this:
| id | timestamp                  | mode    | trip | journey | value | prev    | next    |
|----|----------------------------|---------|------|---------|-------|---------|---------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  | walking | walking |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.0   | walking | driving |
I have tried to partition by id, trip, journey and mode and to order by timestamp. Then I tried to use lag() and lead(), but I am not sure these work across partitions. I came across Window.unboundedPreceding and Window.unboundedFollowing, but I am not sure I completely understand how they work. In my mind, if I partition the data as explained above, I will always just need the last value of mode from the previous partition; to fill next I could reorder the partition from ascending to descending on the timestamp and then do the same. However, I am unsure how to get the last value of the previous partition.
I have tried this:
w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explanations using pyspark would be very much appreciated!
So, based on what I could understand, you could do something like this:
Create a partition based on id and journey (within each journey there are multiple trips), order by trip and then by timestamp, and then simply use lead and lag to get the output.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
    .withColumn('next', F.lead('mode', 1).over(w)) \
    .show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
As the OP asked, you can do the following to achieve it:
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')

# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
    .filter(F.col('rn') == 1) \
    .withColumn('prev', F.lag('mode', 1).over(w1)) \
    .withColumn('next', F.lead('mode', 1).over(w1)) \
    .drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
.select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+

How to get the two nearest values in spark scala DataFrame

Hi everyone, I'm new to Spark Scala. I want to find the nearest values by partition using Spark Scala. My input is something like this:
For example, in the first group, value1 (3) lies between 2 and 7 in the value2 column.
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
+--------+----------+----------+
My output should look like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
+--------+----------+----------+
Can someone guide me on how to resolve this, please?
Since you appear to want to learn, instead of a code answer I've provided pseudo code and references to allow you to find the answer for yourself:
1. Group the elements (select id, value1) and aggregate on value2 with collect_list, so you collect all the value2 into an array.
2. Select id and add (concat) value1 to the collect_list array, then sort the array.
3. Find (array_position) value1 in the array.
4. Splice the array, retrieving the value before and the value after the result of array_position.
5. If the array has fewer than 3 elements, do error handling.
6. Now the last value and the first value of the spliced array are your 'closest numbers'.
A rough Spark Scala sketch of these steps follows.
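The sketch below is untested; it assumes value1 always falls strictly between the smallest and largest value2 of its group (as in your sample data), so the error handling from step 5 is left out. Note that array_union also drops duplicate value2 entries.
import org.apache.spark.sql.functions._

val nearest = df
  // 1. collect every value2 of a group into one array
  .groupBy("id", "value1")
  .agg(collect_list(col("value2")).as("values"))
  // 2. add value1 itself to the array and sort it
  .withColumn("values", array_sort(array_union(col("values"), array(col("value1")))))
  // 3. locate value1 in the sorted array (array_position is 1-based)
  .withColumn("pos", array_position(col("values"), col("value1")).cast("int"))
  // 4. splice out the element just before and just after value1
  .withColumn("nearest", array(
    element_at(col("values"), col("pos") - 1),
    element_at(col("values"), col("pos") + 1)))
  // 6. back to one row per nearest value, matching the expected output
  .select(col("id"), col("value1"), explode(col("nearest")).as("value2"))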
You will need window functions for this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window
  .partitionBy("id", "value1")
  .orderBy(asc("value2"))
// same grouping but no ordering, so aggregates over it see the whole group
val groupWindow = Window.partitionBy("id", "value1")

val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2").minus(col("prev")))
  .withColumn("dist_next", col("next").minus(col("value2")))
  .withColumn("min", min(col("dist_prev")).over(groupWindow))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group (using the unordered groupWindow, so the whole group is visible), and add it as a min column (note that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

How to implement UniqueCount in Spark Scala

I am trying to implement UniqueCount in Spark Scala.
Below is the transformation I am trying to implement:
case when ([last_revision]=1) and ([source]=""AR"") then UniqueCount([review_uuid]) OVER ([encounter_id]) end
Input
|last_revision|source|review_uuid |encounter_id|
|-------------|------|--------------|------------|
|1 |AR |123-1234-12345|7654 |
|1 |AR |123-7890-45678|7654 |
|1 |MR |789-1234-12345|7654 |
Expected Output
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |2 |
|1 |AR |123-7890-45678|7654 |2 |
|1 |MR |789-1234-12345|7654 |null |
My code :
.withColumn("reviews_per_encounter", when(col("last_revision") === "1" && col("source") === "AR", size(collect_set(col("review_uuid")).over(Window.partitionBy(col("encounter_id"))))))
My Output :
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |3 |
|1 |AR |123-7890-45678|7654 |3 |
|1 |MR |789-1234-12345|7654 |null |
Schema :
last_revision : integer
source : string
review_uuid : string
encounter_id : string
reviews_per_encounter : integer
In place of the expected 2 I am getting 3; I am not sure what mistake I am making here.
Please help. Thanks!
The output makes perfect sense. As I commented, this is because this:
size(collect_set(col("review_uuid")))
Means:
give me the count of unique review_uuids in the whole dataframe (result: 3)
What you're looking for is:
give me the count of unique review_uuids only if the source in the corresponding row is equal to "AR" and "last_revision" is 1 (result: 2)
Notice the difference: this doesn't actually need window functions and over. You can achieve it using either subqueries or a self join; here's how you can do it with a self left join:
df.join(
    df.where(col("last_revision") === lit(1) && col("source") === "AR")
      .select(count_distinct(col("review_uuid")) as "reviews_per_encounter"),
    col("last_revision") === lit(1) && col("source") === "AR",
    "left"
)
Output:
+-------------+------+-----------+------------+---------------------+
|last_revision|source|review_uuid|encounter_id|reviews_per_encounter|
+-------------+------+-----------+------------+---------------------+
| 1| AR| 12345| 7654| 2|
| 1| AR| 45678| 7654| 2|
| 1| MR| 78945| 7654| null|
+-------------+------+-----------+------------+---------------------+
(I used some random uuid's, they were too long to copy :) )
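If you want to keep the window/over shape of your original expression, a rough sketch of one way to do it (untested) is to feed only the source = "AR", last_revision = 1 rows into the collect_set, so the MR row's uuid never enters the set and the count per encounter_id becomes 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val isArLastRevision = col("last_revision") === 1 && col("source") === "AR"

df.withColumn("reviews_per_encounter",
  when(isArLastRevision,
    // collect_set skips nulls, so only AR / last_revision = 1 uuids are counted
    size(collect_set(when(isArLastRevision, col("review_uuid")))
      .over(Window.partitionBy("encounter_id")))))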

Scala Pass window partition dataset to UDF

I have a dataframe like below,
| Id1 | Id2 | Id3       | TaskId     | TaskName    | index |
|-----|-----|-----------|------------|-------------|-------|
| 1   | 11  | bc123-234 | dfr3ws-45d | randomName1 | 1     |
| 1   | 11  | bc123-234 | er98d3-lkj | randomName2 | 2     |
| 1   | 11  | bc123-234 | hu77d9-mnb | randomName3 | 3     |
| 1   | 11  | bc123-234 | xc33d5-rew | deployhere4 | 4     |
| 1   | 11  | xre43-876 | dfr3ws-45d | randomName1 | 1     |
| 1   | 11  | xre43-876 | er98d3-lkj | deployhere2 | 2     |
| 1   | 11  | xre43-876 | hu77d9-mnb | randomName3 | 3     |
| 1   | 11  | xre43-876 | xc33d5-rew | randomName4 | 4     |
I partitioned the data using Id3 and Id2 and added the row_number.
I need to apply the following condition:
TaskId "hu77d9-mnb" should come before the task whose name contains "deploy". As the table above suggests, the names are random, so I need to read each name in the partition and see which one contains "deploy".
If the index of the "deploy" TaskName is greater than the index of that TaskId, I mark the result as 1, otherwise 0.
I need to get final table like this:
| Id1 | Id2 | Id3       | TaskId     | TaskName    | index | result |
|-----|-----|-----------|------------|-------------|-------|--------|
| 1   | 11  | bc123-234 | dfr3ws-45d | randomName1 | 1     | 1      |
| 1   | 11  | bc123-234 | er98d3-lkj | randomName2 | 2     | 1      |
| 1   | 11  | bc123-234 | hu77d9-mnb | randomName3 | 3     | 1      |
| 1   | 11  | bc123-234 | xc33d5-rew | deployhere4 | 4     | 1      |
| 1   | 11  | xre43-876 | dfr3ws-45d | randomName1 | 1     | 0      |
| 1   | 11  | xre43-876 | er98d3-lkj | deployhere2 | 2     | 0      |
| 1   | 11  | xre43-876 | hu77d9-mnb | randomName3 | 3     | 0      |
| 1   | 11  | xre43-876 | xc33d5-rew | randomName4 | 4     | 0      |
I am stuck here: how can I pass the partition data to a UDF (or other functions like a UDAF) and perform this task? Any suggestion will be helpful. Thank you for your time.
The index of the "deploy" row and the index of the specific row ("hu77d9-mnb") can be assigned to each row with the window "first" function and then simply compared:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark

val df = Seq(
  (1, 11, "bc123-234", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "bc123-234", "er98d3-lkj", "randomName2", 2),
  (1, 11, "bc123-234", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "bc123-234", "xc33d5-rew", "deployhere4", 4),
  (1, 11, "xre43-876", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "xre43-876", "er98d3-lkj", "deployhere2", 2),
  (1, 11, "xre43-876", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "xre43-876", "xc33d5-rew", "randomName4", 4)
).toDF("Id1", "Id2", "Id3", "TaskID", "TaskName", "index")

val specificTaskId = "hu77d9-mnb"
val idsWindow = Window.partitionBy("Id1", "Id2", "Id3")

df.withColumn("deployIndex",
    // index of the row whose TaskName contains "deploy" (nulls ignored)
    first(when(instr($"TaskName", "deploy") > 0, $"index").otherwise(null), true)
      .over(idsWindow))
  .withColumn("specificTaskIdIndex",
    // index of the row with the specific TaskID (nulls ignored)
    first(when($"TaskID" === lit(specificTaskId), $"index").otherwise(null), true)
      .over(idsWindow))
  .withColumn("result",
    when($"specificTaskIdIndex" > $"deployIndex", 0).otherwise(1))
Output (the "deployIndex" and "specificTaskIdIndex" helper columns should be dropped afterwards):
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|Id1|Id2|Id3 |TaskID |TaskName |index|deployIndex|specificTaskIdIndex|result|
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|1 |11 |bc123-234|dfr3ws-45d|randomName1|1 |4 |3 |1 |
|1 |11 |bc123-234|er98d3-lkj|randomName2|2 |4 |3 |1 |
|1 |11 |bc123-234|hu77d9-mnb|randomName3|3 |4 |3 |1 |
|1 |11 |bc123-234|xc33d5-rew|deployhere4|4 |4 |3 |1 |
|1 |11 |xre43-876|dfr3ws-45d|randomName1|1 |2 |3 |0 |
|1 |11 |xre43-876|er98d3-lkj|deployhere2|2 |2 |3 |0 |
|1 |11 |xre43-876|hu77d9-mnb|randomName3|3 |2 |3 |0 |
|1 |11 |xre43-876|xc33d5-rew|randomName4|4 |2 |3 |0 |
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+

Display %ROWCOUNT value in a select statement

How can the result of %ROWCOUNT be displayed in an SQL statement?
Example
Select top 10 * from myTable.
I would like the results to have a row count for each row returned in the result set.
Ex
+----------+--------+---------+
|rowNumber |Column1 |Column2 |
+----------+--------+---------+
|1 |A |B |
|2 |C |D |
+----------+--------+---------+
There is no simple way to do it. You can add an SQL procedure with this functionality and use it in your SQL statements.
For example, class:
Class Sample.Utils Extends %RegisteredObject
{
ClassMethod RowNumber(Args...) As %Integer [ SqlProc, SqlName = "ROW_NUMBER" ]
{
quit $increment(%rownumber)
}
}
and then, you can use it in this way:
SELECT TOP 10 Sample.ROW_NUMBER(id) rowNumber, id,name,dob
FROM sample.person
ORDER BY ID desc
You will get something like the output below:
+-----------+-------+-------------------+-----------+
|rowNumber |ID |Name |DOB |
+-----------+-------+-------------------+-----------+
|1 |200 |Quigley,Neil I. |12/25/1999 |
|2 |199 |Zevon,Imelda U. |04/22/1955 |
|3 |198 |O'Brien,Frances I. |12/03/1944 |
|4 |197 |Avery,Bart K. |08/20/1933 |
|5 |196 |Ingleman,Angelo F. |04/14/1958 |
|6 |195 |Quilty,Frances O. |09/12/2012 |
|7 |194 |Avery,Susan N. |05/09/1935 |
|8 |193 |Hanson,Violet L. |05/01/1973 |
|9 |192 |Zemaitis,Andrew H. |03/07/1924 |
|10 |191 |Presley,Liza N. |12/27/1978 |
+-----------+-------+-------------------+-----------+
If you are willing to rewrite your query then you can use a view counter to do what you are looking for. Here is a link to the docs.
The short version: move your query into a FROM clause subquery and use the special field %vid.
SELECT v.%vid AS Row_Counter, Name
FROM (SELECT TOP 10 Name FROM Sample.Person ORDER BY Name) v
Row_Counter Name
1 Adam,Thelma P.
2 Adam,Usha J.
3 Adams,Milhouse A.
4 Allen,Xavier O.
5 Avery,James R.
6 Avery,Kyra G.
7 Bach,Ted J.
8 Bachman,Brian R.
9 Basile,Angelo T.
10 Basile,Chad L.