In my program I want to group rows of a DataFrame that share a common element in a list column. For example, take the following DataFrame:
visitorId |trackingIds      |emailIds
----------+-----------------+---------
[a158]    |[666b,666b,777c] |[12]
[7g21]    |[c0b5,c0b4]      |[45, 87]
[p9098]   |[666b]           |[90]
[8u7t]    |[c0b5]           |[40]
It should be grouped on the trackingIds column, which is actually a List[String], merging rows whose lists share an element:
visitorId     |trackingIds      |emailIds
--------------+-----------------+-------------
[a158, p9098] |[666b,666b,777c] |[12, 90]
[7g21, 8u7t]  |[c0b5,c0b4]      |[45, 87, 40]
I have a solution using a simple function that finds the matching element in other rows and merges accordingly, but I am looking for a solution that is cost-effective, since the operation has to be performed on a big DataFrame with millions of rows.
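One approach that can scale to millions of rows is to treat this as a connected-components problem: two visitors belong to the same group if they share at least one tracking id, directly or transitively. Below is a minimal sketch using GraphFrames (a separate Spark package, so this is only one possible route, not the only one). It assumes the example DataFrame is called df, that visitorId is a plain string column (the brackets in the example are just display formatting), and that the checkpoint path is a placeholder; the final aggregation keeps duplicates, so adjust the merge semantics to your needs.

import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

// connectedComponents requires a checkpoint directory (hypothetical path)
spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoint")

// one row per (visitorId, trackingId)
val pairs = df.select($"visitorId", explode($"trackingIds") as "tid")

// visitors sharing a trackingId become edges of a graph
val edges = pairs.as("a").join(pairs.as("b"), "tid")
  .select($"a.visitorId" as "src", $"b.visitorId" as "dst")
  .where($"src" =!= $"dst")
  .distinct()

val vertices = df.select($"visitorId" as "id").distinct()

// each connected component is one output group
val components = GraphFrame(vertices, edges).connectedComponents.run() // columns: id, component

val grouped = df.join(components, df("visitorId") === components("id"))
  .groupBy($"component")
  .agg(
    collect_list($"visitorId")            as "visitorIds",
    flatten(collect_list($"trackingIds")) as "trackingIds",   // flatten requires Spark 2.4+
    flatten(collect_list($"emailIds"))    as "emailIds")
  .drop("component")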
I have the following code that cleans a corpus of documents (pipelineClean(corpus)) and returns a DataFrame with two columns:
"id": Long
"tokens": Array[String].
After that, the code produces a DataFrame with the following columns:
"term": String
"postingList": List[(Long, Long)] (the first Long is the document id, the other the term frequency inside that document)
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
My question is: would it be possible to make this code shorter and/or more efficient? For example, is it possible to do the grouping within a single groupBy?
It doesn't look like you can do much more than what you've got apart from skipping the withColumn call and using a straight select:
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
instead of
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
I have a DataFrame with 900 columns and I need the sum of each column in PySpark, so the result will be 900 values in a list. How can I do this? The data has around 280 million rows, all binary data.
Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.
For example:
sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])
from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()
# Out: [Row(sum(a)=3, sum(b)=7)]
Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.
sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()
Notice how you need to unpack the arguments from the list using the * operator.
Currently I have an input file (millions of records) where every record contains a 2-character identifier. Multiple lines in this input file will be concatenated into a single record in the output file, and how this is determined is based solely on the sequential order of the identifier.
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using Scala/Spark.
My strategy is to:
Load the input file into a DataFrame.
Create an Identifier column based on a substring of each record.
Create a new column, TempID, and a variable x that is set to 0.
Iterate through the DataFrame:
if Identifier == "1A", then x = x + 1
TempID = x
Then create a UDF to concatenate records with the same TempID.
To summarize my question:
How would I iterate through the DataFrame, check the value of the Identifier column, and then assign a TempID (whose value increases by 1 whenever the Identifier column is "1A")?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark.
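If you do go the plain-Scala preprocessing route, here is a minimal sketch (the file path is hypothetical, it assumes each line starts with its 2-character identifier, and the grouped records are kept in memory):

import scala.io.Source

// Every line that starts with "1A" opens a new record; all following lines are
// appended to the current record until the next "1A".
val lines = Source.fromFile("input.txt").getLines()   // hypothetical path
val records = lines
  .foldLeft(List.empty[List[String]]) {
    case (acc, line) if line.startsWith("1A") => List(line) :: acc          // start a new record
    case (Nil, line)                          => List(List(line))           // file does not start with "1A"
    case (current :: rest, line)              => (line :: current) :: rest  // append to current record
  }
  .map(_.reverse.mkString("+"))
  .reverse
// records: List("1A+1B+1C+2A+2B+2C", "1A+1C+2B+2C", "1A+1B+1C")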
My recommendation would be either to look into writing a custom Hadoop InputFormat/Spark data source, or perhaps to use "1A" as a record delimiter, similar to this question.
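And for the record-delimiter route, a rough sketch of what that could look like (the path is hypothetical; it relies on Hadoop's textinputformat.record.delimiter setting and assumes no other line starts with "1A"):

val sc = spark.sparkContext

// split the file on "\n1A" instead of on newlines, so each split is one logical record
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n1A")

val records = sc.textFile("path/to/input.txt")   // hypothetical path
  .map(_.trim)
  .filter(_.nonEmpty)
  .map { chunk =>
    // the delimiter is consumed by the reader, so re-attach "1A" to every chunk
    // except the first, which still carries its own leading "1A"
    val full = if (chunk.startsWith("1A")) chunk else "1A\n" + chunk
    full.split("\n").map(_.trim).mkString("+")
  }
// records: "1A+1B+1C+2A+2B+2C", "1A+1C+2B+2C", "1A+1B+1C"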
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or almost achieve it, details to follow) using Window functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) use these counts as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the line .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these lines:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (at the price of sorting these small lists).
I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * unpacks the generator, so each column's aggregate is passed as a separate argument and the return value will be 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list with column names], [minimum frequency (default = 1%)])
This returns a DataFrame with those values, but if you want a DataFrame with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part was taken from here: check number of unique values in each column of a matrix in spark
I have a use case in which I have a set of data (e.g. a CSV file containing around 10 million rows and around 25 columns).
I also have a set of rules (around 1000) which I need to use to update the records, and these rules have to execute sequentially.
I wrote code in which I loop over every rule and, for each rule, update the data.
Suppose a rule is like:
col1=5 and col2=10 then col25=updatedValue
rulesList.foreach { rule =>
  data = data.map { case line(col1, col2, .., col25) =>
    if (rule) line(col1, col2, .., updatedValue)
    else line(col1, col2, .., col25)
  }
}
These rules execute sequentially, and finally I get the updated records.
The problem is that with few rules and little data it executes properly, but when the data is large I get a StackOverflowError. The reason may be that it builds up the mapping for all the rules and only executes it at the end, like map-reduce.
Is there any way in which I can update this data incrementally?
Try mapping once over the RDD and looping over the rules inside the map: a lot less data movement. All the rules will be applied locally to the data, resulting in the updated record, instead of creating 1000 RDDs.
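A minimal sketch of that idea, assuming (hypothetically) that each record is a simple case class and each rule is a predicate plus a replacement value:

import org.apache.spark.rdd.RDD

// Hypothetical record and rule shapes, for illustration only.
case class Record(col1: Int, col2: Int, col25: String)
case class Rule(matches: Record => Boolean, updatedValue: String)

def applyRules(data: RDD[Record], rulesList: Seq[Rule]): RDD[Record] =
  data.map { record =>
    // all rules are applied locally to each record, in order,
    // instead of building one RDD transformation per rule
    rulesList.foldLeft(record) { (current, rule) =>
      if (rule.matches(current)) current.copy(col25 = rule.updatedValue)
      else current
    }
  }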
Given a record in the RDD, if you can apply all updates to it incrementally but independently of the other records, I would suggest you do the map first and then iterate through the rulesList inside the map:
val result = data.map { case line(col1, col2, ..., col25) =>
  var col25_mutable = col25
  rulesList.foreach { rule =>
    col25_mutable = if (rule) updatedValue else col25_mutable
  }
  line(col1, col2, ..., col25_mutable)
}
This approach should be safe as long as rulesList is a simple serializable collection, such as an Array or List.
I hope it works for you, or that it at least helps you achieve your goal.
Cheers