Pig - Creating groups of a given size

Pig - Creating groups of a given size - group-by

recs = load 'a.txt';
grp = group recs with each group having 5 records;
I need to do something like the above.
So if recs has 10 records, I want the groups to be created so that
each group has 5 records each.
How to do it?

A scalable solution could be to have a UDF accumulates records into a new bag and outputs the bag when it has 5 elements and empty bag (or null) if it does not have 5 elements yet.
One drawback is that the last group of each map might have less than 5 elements (so can try to pad with nulls or discard/regroup them all).
recs = load 'a.txt';
grp_5 = foreach recs generate GROUPER(*, 5) as group;
grp = filter grp by not IsEmpty(group);
Edit for comment:
A regular Bag attribute would be good as the UDF is normally instantiated at the beginning of the map then its exec() is called for each record. Think something like streaming all the records through it like a MAX function.

Related

groupBy Id and get multiple records for multiple columns in scala

I have a spark dataframe as below.
val df = Seq(("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),("b",4,200),("b",4,100),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
I am working in scala language to make use of this data frame and trying to get result as shown below.
val df = Seq(("a",1,1400),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",4,250),("b",4,200),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
Rules: Grouped by id, if min(hierarchy)==1 then I take the row with the highest amount and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of the amount. On the other hand, if min(hierarchy)==2 then I take two rows with the highest amount and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of the amount. And so on for all the id's in the data.
Thanks for the suggestions..

You may use window functions to generate the criteria which you will filter upon eg
val results = df.withColumn("minh",min("hierarchy").over(Window.partitionBy("id")))
.withColumn("rnk",rank().over(Window.partitionBy("id").orderBy(col("amount").desc())))
.withColumn(
"rn4",
when(col("hierarchy")>=4, row_number().over(
Window.partitionBy("id",when(col("hierarchy")>=4,1).otherwise(0)).orderBy(col("amount").desc())
) ).otherwise(5)
)
.filter("rnk <= minh or rn4 <=3")
.select("id","hierarchy","amount")
NB. More verbose filter .filter("(rnk <= minh or rn4 <=3) and (minh in (1,2))")
Above temporary columns generated by window functions to assist in the filtering criteria are
minh : used to determine the minimum hierarchy for a group id and subsequently select the top minh number of columns from the group .
rnk used to determine the rows with the highest amount in each group
rn4 used to determine the rows with the highest amount in each group with hierarchy >=4

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file(millions of records) where all the records contain a 2 character Identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is SOLELY based on the sequential order of the Identifier
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
Load the Input file into the dataframe.
Create an Identifier column based on substring of record.
Create a new column, TempID and a variable, x that is set to 0
Iterate through the dataframe
if Identifier =1A, x = x+1
TempID= variable x
Then create a UDF to concat records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of Identifier column, then assign a tempID(whose value increases by 1 if the value of identifier column is 1A)

This is dangerous. The issue is that spark is not guaranteed keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip spark entirely and run it as regular scala code as a preproccessing step before getting to spark.
My recommendation would be to either look into writing a custom data inputformat/data source, or perhaps you could use "1A" as a record delimiter similar to this question.

First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) using these "counts" as the "group id" that ties all records of each group together, and group by it:
import functions._
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
val getSortedValues = udf { (input: mutable.Seq[Row]) => input
.map { case Row (id: Long, v: String) => (id, v) }
.sortBy(_._1)
.map(_._2)
}
Then, replace the row .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these rows:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).

Compare two dataframes to find substring in spark

I have three dataframes, dictionary,SourceDictionary and MappedDictionary. The dictionary andSourceDictionary have only one column, say words as String. The dictionary which has million records, is a subset of MappedDictionary (Around 10M records) and each record in MappedDictionary is substring of dictionary. So, I need to map the ditionary with SourceDictionary to MappedDictionary.
Example:
Records in ditionary : BananaFruit, AppleGreen
Records in SourceDictionary : Banana,grape,orange,lemon,Apple,...
Records to be mapped in MappedDictionary (Contains two columns) :
BananaFruit Banana
AppleGreen Apple
I planned to do like two for loops in java and make substring operation but the problem is 1 million * 10 million = 10 Trillion iterations
Also, I can't get correct way to iterate a dataframe like a for loop
Can someone give a solution for a way to make iteration in Dataframe and perform substring operations?
Sorry for my poor English, I am a non-native
Thanks for stackoverflow community members in advance :-)

Though you have million record in sourceDictionary because it has only one column broadcasting it to every node won't take up much memory and it will speed up the total performance.
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.catalyst.encoders.RowEncoder
//Assuming the schema names
val SourceDictionarySchema = StructType(StructField("value",StringType,false))
val dictionarySchema = StructType(StructField("value",StringType,false))
val MappedDictionary = StructType(StructField("value",StringType,false), StructField("key",StringType,false))
val sourceDictionaryBC = sc.broadcast(
sourceDictionary.map(row =>
row.getAs[String]("value")
).collect.toList
)
val MappedDictionaryN = dictionary.map(row =>
val value = row.getAs[String]("value")
val matchedKey = sourceDictionaryBC.value.find(value.contains)
Seq(value, matchedKey.orNull)
)(RowEncoder(MappedDictionary))
After this you have all the new mapped records. If you want to combine it with the existing MappedDictionary just do a simple union.
MappedDictionaryN.union(MappedDictionary)

how to get grouped query data from the resultset?

I want to get grouped data from a table in sqlite. For example, the table is like below:
Name Group Price
a 1 10
b 1 9
c 1 10
d 2 11
e 2 10
f 3 12
g 3 10
h 1 11
Now I want get all data grouped by the Group column, each group in one array, namely
array1 = {{a,1,10},{b,1,9},{c,1,10},{h,1,11}};
array2 = {{d,2,11},{e,2,10}};
array3 = {{f,3,12},{g,3,10}}.
Because i need these 2 dimension arrays to populate the grouped table view. the sql statement maybe NSString *sql = #"SELECT * FROM table GROUP BY Group"; But I wonder how to get the data from the resultset. I am using the FMDB.
Any help is appreciated.

Get the data from sql with a normal SELECT statement, ordered by group and name:
SELECT * FROM table ORDER BY group, name;
Then in code, build your arrays, switching to fill the next array when the group id changes.

Let me clear about GroupBy. You can group data but that time its require group function on other columns.
e.g. Table has list of students in which there are gender group mean Male & Female group so we can group this table by Gender which will return two set . Now we need to perform some operation on result column.
e.g. Maximum marks or Average marks of each group
In your case you want to group but what kind of operation you require on price column ?.
e.g. below query will return group with max price.
SELECT Group,MAX(Price) AS MaxPriceByEachGroup FROM TABLE GROUP BY(group)

T-SQL Query to process data in batches without breaking groups

I am using SQL 2008 and trying to process the data I have in a table in batches, however, there is a catch. The data is broken into groups and, as I do my processing, I have to make sure that a group will always be contained within a batch or, in other words, that the group will never be split across different batches. It's assumed that the batch size will always be much larger than the group size. Here is the setup to illustrate what I mean (the code is using Jeff Moden's data generation logic: http://www.sqlservercentral.com/articles/Data+Generation/87901)
DECLARE #NumberOfRows INT = 1000,
#StartValue INT = 1,
#EndValue INT = 500,
#Range INT
SET #Range = #EndValue - #StartValue + 1
IF OBJECT_ID('tempdb..#SomeTestTable','U') IS NOT NULL
DROP TABLE #SomeTestTable;
SELECT TOP (#NumberOfRows)
GroupID = ABS(CHECKSUM(NEWID())) % #Range + #StartValue
INTO #SomeTestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
This will create a table with about 435 groups of records containing between 1 and 7 records in each. Now, let's say I want to process these records in batches of 100 records per batch. How can I make sure that my GroupID's don't get split between different batches? I am fine if each batch is not exactly 100 records, it could be a little more or a little less.
I appreciate any suggestions!

This will result in slightly smaller batches than 100 entries, it'll remove all groups that aren't entirely in the selection;
WITH cte AS (SELECT TOP 100 * FROM (
SELECT GroupID, ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY GroupID) r
FROM #SomeTestTable) a
ORDER BY GroupID, r DESC)
SELECT c1.GroupID FROM cte c1
JOIN cte c2
ON c1.GroupID = c2.GroupID
AND c2.r = 1
It'll select the groups with the lowest GroupID's, limited to 100 entries into a common table expression along with the row number, then it'll use the row number to throw away any groups that aren't entirely in the selection (row number 1 needs to be in the selection for the group to be, since the row number is ordered descending before cutting with TOP).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Pig - Creating groups of a given size - group-by

recs = load 'a.txt'; grp = group recs with each group having 5 records; I need to do something like the above. So if recs has 10 records, I want the groups to be created so that each group has 5 records each. How to do it?

Related

groupBy Id and get multiple records for multiple columns in scala

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Compare two dataframes to find substring in spark

how to get grouped query data from the resultset?

T-SQL Query to process data in batches without breaking groups

Categories

Resources