I have a table that looks like the one below.
Table I have
I want an output classifying the PART as either CLASS A or B. CLASS A and CLASS B should be created as two additional columns.
The conditions for classifications are:
-> If Arrangements, PREFIX and RANGE are the same, the rows are classified as CLASS A and CLASS B, and the PART values should go in both columns.
-> If the above condition is false, check the value in the TYPE column:
   If the TYPE column contains CLASS A, the respective PART should be in the CLASS A column.
   If the TYPE column contains CLASS B, the respective PART should be in the CLASS B column.
Here is the sample of the output I would need.
Table I want
So far, I have managed to get the below output but have not been able to merge the results into one row when the three columns are the same.
Results I got till now
Sample Data
Arrangements  PREFIX  RANGE    PART    TYPE
ARR1          1XJ     1-100    191123  TRANSMISSION CLASS A
ARR1          1XJ     1-100    299123  TRANSMISSION CLASS B
ARR1          9TC     1-100    191123  TRANSMISSION CLASS A
ARR2          5TJ     101-120  288123  TRANSMISSION CLASS B
In the Query Editor (Transform Data), you can
Group by Arrangements, Prefix and Range and use the ALL operation
Then add Custom Columns with the formula below for CLASS A
=try Table.SelectRows([Count], each Text.Contains([TYPE],"CLASS A"))[PART]{0} otherwise null
Add another Custom Column for CLASS B, replacing "CLASS A" with "CLASS B" in the formula
Delete the Count column
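The logic of these steps can be sketched outside Power Query to check the expected result. This is only an illustration in plain Python of the same group-then-pick approach, not the M code itself; the helper name first_part is made up for the example, and the rows are the sample data from the question.

```python
# Sample rows from the question: (Arrangements, PREFIX, RANGE, PART, TYPE)
rows = [
    ("ARR1", "1XJ", "1-100", "191123", "TRANSMISSION CLASS A"),
    ("ARR1", "1XJ", "1-100", "299123", "TRANSMISSION CLASS B"),
    ("ARR1", "9TC", "1-100", "191123", "TRANSMISSION CLASS A"),
    ("ARR2", "5TJ", "101-120", "288123", "TRANSMISSION CLASS B"),
]


def first_part(group, label):
    """Return the first PART whose TYPE contains label, else None.

    Mirrors: try Table.SelectRows(...)[PART]{0} otherwise null
    """
    for part, type_text in group:
        if label in type_text:
            return part
    return None


# Step 1: group by Arrangements, PREFIX and RANGE, keeping all rows.
groups = {}
for arrangement, prefix, range_text, part, type_text in rows:
    groups.setdefault((arrangement, prefix, range_text), []).append((part, type_text))

# Steps 2-3: one output row per group with CLASS A and CLASS B columns.
result = [
    key + (first_part(group, "CLASS A"), first_part(group, "CLASS B"))
    for key, group in groups.items()
]

for row in result:
    print(row)
```

The first group ends up as a single row carrying both parts, which is the merge the question was missing.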
Try this in M Advanced Editor
Table.AddColumn(previousStepName, "Text After Delimiter", each Text.AfterDelimiter([TYPE], " "), type text)
I want to compare two tables. Table A consists of 2 crore (20 million) records and around 5p columns, of which I only need tran_name. Table B consists of around 5k records and a single column.
Requirements are as follows:
For example, if row 1 of table B has the value ABC, I have to filter all records from table A whose tran_name is like ABC, and this needs to be done for every row in table B.
I tried using a for loop in PySpark for this, but it took a long time and used a huge amount of memory.
Now I plan to use map with a lambda function.
My code is as follows:
def matches(A,i):
    Rdd2=A.filter(A(col('tran_name'))).rlike(i)
    return rdd2
matches_udf=udf(matches,StringType())
df=B.rdd.map(lambda x: x.matches_udf(A,x)).collect()
But it is showing an error
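One way to avoid a per-row loop over table B is to combine its values into a single alternation pattern and filter table A once. Below is a minimal sketch in plain Python with made-up sample values. It assumes the values in B are literal strings (hence re.escape). In PySpark, the same combined pattern could be passed to Column.rlike in a single filter on A, rather than referencing DataFrame A inside a UDF or inside B.rdd.map, which Spark does not allow.

```python
import re

# Hypothetical sample data standing in for tables A and B.
table_a = [
    {"tran_name": "ABC payment"},
    {"tran_name": "refund XYZ"},
    {"tran_name": "other"},
]
table_b = ["ABC", "XYZ"]

# Build one alternation pattern from all values in B.
# re.escape treats each value as a literal string (an assumption).
pattern = re.compile("|".join(re.escape(value) for value in table_b))

# Single pass over A instead of one pass per row of B.
matches = [row for row in table_a if pattern.search(row["tran_name"])]

print(matches)
```

With around 5k values in B the pattern gets long, so whether this is faster than a join should be measured on real data.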
I'm looking for the most appropriate way to map the information contained in a DataFrame to some case classes I've defined, according to the following situation.
I have 2 Hive tables, and a third table which represents the many-to-many relationship between them. Let's call them "Item", "Group", and "GroupItems".
I'm considering executing a single query joining them all, to get the information of a single group, and all its items.
So, each row of the resulting DataFrame, would contain the fields of the Group, and the fields of an Item.
Then, I've created 4 different case classes to use this information in my application. Let's call them:
- ItemProps1: its properties match with some of the Item fields
- ItemProps2: its properties match with some of the Item fields
- Item: contains some properties which match with some of the Item fields, and has 1 object of type ItemProps1, and another of type ItemProps2
- Group: its properties match with the Group fields, and contains a list of items
What I want to do is to map the info contained in the resulting DataFrame into these case classes, but I don't know which would be the most appropriate way.
I know DataFrame has a method "as[U]" which is very useful to perform this kind of mapping, but I'm afraid in my case it won't be useful.
Then, I've found some options to perform the mapping manually, like the following ones:
df.map {
  case Row(foo: Int, bar: String) => Record(foo, bar)
}
-
df.collect().foreach { row =>
  val foo = row.getAs[Int]("foo")
  val bar = row.getAs[String]("bar")
  Record(foo, bar)
}
Is any of these approaches the most appropriate one, or should I do it in another way?
Thanks a lot!
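The shape of the manual mapping described above (flat joined rows folded into one Group holding a list of Items, each with two nested property objects) can be sketched as follows. Python dataclasses are used only to illustrate the structure. All field names (color, weight, group_name and so on) are hypothetical, and the rows stand in for what the join would return after collecting.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ItemProps1:
    color: str  # hypothetical Item field


@dataclass
class ItemProps2:
    weight: float  # hypothetical Item field


@dataclass
class Item:
    item_id: int
    props1: ItemProps1
    props2: ItemProps2


@dataclass
class Group:
    group_id: int
    name: str
    items: List[Item] = field(default_factory=list)


# Each flat row carries the Group fields plus the fields of one Item.
rows = [
    {"group_id": 1, "group_name": "g1", "item_id": 10, "color": "red", "weight": 1.5},
    {"group_id": 1, "group_name": "g1", "item_id": 11, "color": "blue", "weight": 2.0},
]


def build_groups(flat_rows) -> Dict[int, Group]:
    """Fold flat joined rows into Group objects keyed by group id."""
    groups: Dict[int, Group] = {}
    for r in flat_rows:
        group = groups.setdefault(r["group_id"], Group(r["group_id"], r["group_name"]))
        group.items.append(
            Item(r["item_id"], ItemProps1(r["color"]), ItemProps2(r["weight"]))
        )
    return groups


groups = build_groups(rows)
print(groups[1])
```

The point is that the nesting is done by grouping on the Group key after the join, which a flat one-row-to-one-object mapping cannot express on its own.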
Rephrasing of my questions:
I am writing a program that implements a data mining algorithm. In this program I want to save the input data which is supposed to be mined. Imagine the input data to be a table with rows and columns. Each row is going to be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and the types will change depending on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, and it must be possible for the types to be determined only at runtime. How can I do this? A Vector or a List requires all elements to be of the same type. A Tuple can hold different types (which can be determined at runtime if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the row's columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere, but I forgot where, and I think it was solved more elegantly.)
It might be good to have my collection to be random access (so "Vector" rather than "List").
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more bonus: If I could save matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want the methods related to each row to be close to the data itself. Each data row is also supposed to carry metadata about itself, and it will be possible to pass in functions so that different rows can be manipulated differently. However, I need to store the row somehow within the class. A List or Vector comes to mind, but they only allow all elements to be of one type (Integer, String, etc.), whereas, as we know from data.frame, different columns (here, elements of the Vector or List) can be of different types. I would also like to save the name of each column so that I can access the row values by column name. That seems to be the smallest issue though. I hope it is clear what I mean. How can I implement this?
Data frames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
Hope that helps bridge the R to Scala divide.
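The tuple and case class above fix the column types at compile time. The two-list idea from the question (column names plus values whose types are only known at runtime) can be sketched as below. Python is used only for illustration, and the class name DataRow is hypothetical. In Scala the analogous structure would be a Vector[Any] paired with a name-to-index Map, at the cost of losing static typing.

```python
class DataRow:
    """One row with named columns whose types are decided at runtime."""

    def __init__(self, names, values):
        if len(names) != len(values):
            raise ValueError("names and values must have the same length")
        self._values = list(values)  # ordered, random access by position
        self._index = {name: i for i, name in enumerate(names)}

    def __getitem__(self, key):
        # Access by column name (str) or by position (int).
        if isinstance(key, str):
            return self._values[self._index[key]]
        return self._values[key]

    def __len__(self):
        return len(self._values)


row = DataRow(["c1", "c2"], [1, "a"])
print(row["c1"], row[1], len(row))
```

Unlike a tuple, this has no limit on the number of columns, but type errors only show up at runtime.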
In this Drools decision table I am comparing a field of one class with a field of another class, but the generated rule is not what I expected. Is there a way to do this?
The problem arises when the Excel sheet is converted into rules. For the condition in the third column, where I check that stdId in the College class equals id of the Student class, the rule is generated as follows:
$c2: College(stdId == $s.id == "x")
The =="x" part is undesirable and causes trouble when running the rules.
What should be done to remove the extra, undesired part?
The third column can be written as
CONDITION
$c2: College(stdId==$s.id)/*$param*/
match student id
x
x
...
The x is required to trigger insertion of the conditional expression from row 2.
I have a table called "Tag" which consists of an Id, Name and Description column.
Now let's say I have the tables Character (C), Movie (M), Series (S), etc.
And I want to be able to tag entries in C, M, S with multiple tags and one tag may be used for multiple entries.
So I could realize it like this:
T -> TC <- C
T -> TM <- M
T -> TS <- S
Where TC, TM, TS are the intermediate tables.
I was wondering if I could combine TC, TM, TS into one table with a type column added and still use foreign keys.
As of yet I haven't found a way to do it.
Or is this something I shouldn't be doing?
As the comments above suggested, you can't combine multiple tables into a single one. If you want a single view of the "tag relationships", you can pull the needed information into a view. This way, you only need to write the longer query once and can then use it like a single table. Keep in mind that you can't insert data into a view (there are ways to do so, but they are a little advanced).
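A minimal sketch of that approach, using Python's built-in sqlite3 module so it can be run as-is. The table and column names (Characters, Movies, TagCharacter, TagMovie, TagLinks, EntityType) are made up for the example, Series is left out for brevity, and the exact DDL will differ on other database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tag (Id INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE Characters (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Movies (Id INTEGER PRIMARY KEY, Name TEXT);

-- One intermediate table per entity, each with real foreign keys.
CREATE TABLE TagCharacter (
    TagId INTEGER REFERENCES Tag(Id),
    CharacterId INTEGER REFERENCES Characters(Id),
    PRIMARY KEY (TagId, CharacterId)
);
CREATE TABLE TagMovie (
    TagId INTEGER REFERENCES Tag(Id),
    MovieId INTEGER REFERENCES Movies(Id),
    PRIMARY KEY (TagId, MovieId)
);

-- The view presents the intermediate tables as one table with a type column.
CREATE VIEW TagLinks AS
    SELECT TagId, 'Character' AS EntityType, CharacterId AS EntityId FROM TagCharacter
    UNION ALL
    SELECT TagId, 'Movie' AS EntityType, MovieId AS EntityId FROM TagMovie;

INSERT INTO Tag VALUES (1, 'hero', 'Main protagonist');
INSERT INTO Characters VALUES (1, 'Neo');
INSERT INTO Movies VALUES (1, 'The Matrix');
INSERT INTO TagCharacter VALUES (1, 1);
INSERT INTO TagMovie VALUES (1, 1);
""")

rows = conn.execute(
    "SELECT TagId, EntityType, EntityId FROM TagLinks ORDER BY EntityType"
).fetchall()
print(rows)
conn.close()
```

Writes still go to the individual intermediate tables; only reads go through the view.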