I have a Spark dataframe that looks something like this
x |count
1 |3
3 |5
4 |3
Below is my Spark code:
sdf.createOrReplaceTempView('sdf_view')
spark.sql('SELECT MAX(count), x FROM sdf_view')
This seems like a perfectly valid SQL query, and I'm wondering why it doesn't work with Spark. What I want to find is the maximum count along with the x corresponding to it.
Any leads appreciated.
The error message is:
AnalysisException: u"grouping expressions sequence is empty, and 'sdf_view.`x`' is not an aggregate function. Wrap '(max(sdf_view.`count`) AS `max(count)`)' in windowing function(s) or wrap 'sdf_view.`x`' in first() (or first_value) if you don't care which value you get.
I added another row:
x = [{"x": 1, "count": 3}, {"x": 3, "count": 5}, {"x": 4, "count": 3}, {"x": 4, "count": 60}]
sdf = spark.createDataFrame(x)
+-----+---+
|count| x|
+-----+---+
| 3| 1|
| 5| 3|
| 3| 4|
| 60| 4|
+-----+---+
Your SQL statement is incomplete: you need to say how you want to group things. I'm guessing you want the max count for each of the unique x values?
y = spark.sql('SELECT MAX(count), x FROM sdf_view GROUP BY x ')
y.show()
+----------+---+
|max(count)| x|
+----------+---+
| 3| 1|
| 5| 3|
| 60| 4|
+----------+---+
Or do you just want to find the highest count of them all?
y = spark.sql('SELECT MAX(count) FROM sdf_view')
y.show()
+----------+
|max(count)|
+----------+
| 60|
+----------+
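If the goal is really the single row with the highest count together with its x (which is what the question asks for), one simple sketch against the same temp view is to sort descending and keep one row. For the sample data above this should return x = 4 with count 60; if several rows tie for the maximum, an arbitrary one of them is returned.
y = spark.sql('SELECT x, `count` FROM sdf_view ORDER BY `count` DESC LIMIT 1')
y.show()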
I have two dataframes; one looks like this:
+------------------------------------------------------------+
|docs |
+------------------------------------------------------------+
|{doc1.txt -> 1, doc2.txt -> 3, doc3.txt -> 5, doc4.txt -> 1}|
|{doc1.txt -> 2, doc2.txt -> 2, doc3.txt -> 4} |
|{doc1.txt -> 3, doc2.txt -> 2, doc4.txt -> 2} |
+------------------------------------------------------------+
and the other like this:
+--------------+----------+
| Document|doc_length|
+--------------+----------+
| doc1.txt| 0|
| doc2.txt| 0|
| doc3.txt| 0|
| doc3.txt| 0|
| doc4.txt| 0|
+--------------+----------+
For the sake of the example the documents are in order, but in my use case I cannot expect them to be.
Now I want to iterate through the first dataframe and update the values in the second as I go. I have a loop like this:
df1.foreach(r =>
  for (keyValPair <- r(0).asInstanceOf[Map[String, Long]]) {
    // Something needs to happen here
  })
In every iteration I want to take the key of the key-value pair to select a specific row in the second dataframe and then add the value to its doc_length, so my final output of df2.show() would look like this:
EDIT: Later down the line I will probably want to do more complicated mathematical operations here than just summing all the values up; that's why I was trying to use the structure described above.
+--------------+----------+
| Document|doc_length|
+--------------+----------+
| doc1.txt| 6|
| doc2.txt| 7|
| doc3.txt| 9|
| doc4.txt| 0|
+--------------+----------+
This doesn't look like it should be too hard, but I don't know how to access specific rows of a dataframe using a specific column as a key, and change them.
You can explode the map column and group by key to sum up the lengths:
import org.apache.spark.sql.functions.{col, explode, sum}

val df2 = df1.select(explode(col("docs")))
  .groupBy(col("key").as("document"))
  .agg(sum("value").as("doc_length"))
df2.show
+--------+----------+
|document|doc_length|
+--------+----------+
|doc1.txt| 6|
|doc4.txt| 3|
|doc3.txt| 9|
|doc2.txt| 7|
+--------+----------+
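If you also want to carry the result over to the second dataframe rather than build a standalone one, the same explode-and-sum idea can be followed by a left join. Below is a minimal PySpark sketch of that idea (the Scala version is analogous); the names df1, df2, docs, Document and doc_length are taken from the question, and documents that never occur in any map keep a length of 0.
from pyspark.sql import functions as F

# Explode the map column into (key, value) rows and sum the values per document.
lengths = (df1.select(F.explode("docs"))
              .groupBy(F.col("key").alias("Document"))
              .agg(F.sum("value").alias("doc_length")))

# Left-join onto the document list so documents missing from every map keep 0.
result = (df2.select("Document").distinct()
             .join(lengths, "Document", "left")
             .fillna(0, subset=["doc_length"]))
result.show()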
I have a Spark DataFrame where, for each set of rows with a given column value (col1), I want to grab a sample of the values in (col2). The number of rows for each possible value of col1 may vary widely, so I'm just looking for a set number, say 10, of each type.
There may be a better way to do this, but the natural approach seemed to be df.groupby('col1').
In pandas, I could do df.groupby('col1').col2.head().
I understand that Spark DataFrames are not pandas DataFrames, but this is a good analogy.
I suppose I could loop over all the col1 values with a filter, but that seems terribly icky.
Any thoughts on how to do this? Thanks.
Let me create a sample Spark dataframe with two columns.
df = spark.createDataFrame([[1, 'r1'],
                            [1, 'r2'],
                            [1, 'r2'],
                            [2, 'r1'],
                            [3, 'r1'],
                            [3, 'r2'],
                            [4, 'r1'],
                            [5, 'r1'],
                            [5, 'r2'],
                            [5, 'r1']], schema=['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| r1|
| 1| r2|
| 1| r2|
| 2| r1|
| 3| r1|
| 3| r2|
| 4| r1|
| 5| r1|
| 5| r2|
| 5| r1|
+----+----+
After grouping by col1, we get a GroupedData object (instead of a Spark DataFrame). You can use aggregate functions like min, max, and average, but getting a head() is a little bit tricky. We need to convert the GroupedData object back to a Spark DataFrame, which can be done using the PySpark collect_list() aggregation function.
from pyspark.sql import functions
df1 = df.groupBy(['col1']).agg(functions.collect_list("col2"))
df1.show(n=3)
Output is:
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
| 5| [r1, r2, r1]|
| 1| [r1, r2, r2]|
| 3| [r1, r2]|
+----+------------------+
only showing top 3 rows
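If what is needed is a fixed number of rows per group (the "say 10, of each type" in the question) rather than full lists, one common approach is a window with row_number. A sketch of that idea is below; note that, unlike pandas' head(), row_number requires picking some ordering within each group.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows inside each col1 group (ordered by col2 here, since
# row_number needs some ordering) and keep at most 10 rows per group.
w = Window.partitionBy('col1').orderBy('col2')
sampled = (df.withColumn('rn', F.row_number().over(w))
             .filter(F.col('rn') <= 10)
             .drop('rn'))
sampled.show()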
I have a dataframe where I need to first group by id and then compute a weighted average as shown in the output calculation below. What is an efficient way to do that in PySpark?
data = sc.parallelize([
[111,3,0.4],
[111,4,0.3],
[222,2,0.2],
[222,3,0.2],
[222,4,0.5]]
).toDF(['id', 'val','weight'])
data.show()
+---+---+------+
| id|val|weight|
+---+---+------+
|111| 3| 0.4|
|111| 4| 0.3|
|222| 2| 0.2|
|222| 3| 0.2|
|222| 4| 0.5|
+---+---+------+
Output:
id   weighted_val
111  (3*0.4 + 4*0.3)/(0.4 + 0.3)
222  (2*0.2 + 3*0.2 + 4*0.5)/(0.2 + 0.2 + 0.5)
You can multiply columns weight and val, then aggregate:
import pyspark.sql.functions as F
data.groupBy("id").agg(
    (F.sum(data.val * data.weight) / F.sum(data.weight)).alias("weighted_val")
).show()
+---+------------------+
| id| weighted_val|
+---+------------------+
|222|3.3333333333333335|
|111|3.4285714285714293|
+---+------------------+
Imagine that I have the following DataFrame df:
+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1| a| 3|
|id1| b| 4|
|id2| a| 2|
|id2| c| 5|
|id3| d| 9|
+---+-----------+------------+
Imagine that I run:
df.groupBy("id")
.agg(collect_list($"featureIndex").as("idx"),
collect_list($"featureValue").as("val"))
Am I GUARANTEED that "idx" and "val" will be aggregated and keep their relative order? i.e.
GOOD GOOD BAD
+---+------+------+ +---+------+------+ +---+------+------+
| id| idx| val| | id| idx| val| | id| idx| val|
+---+------+------+ +---+------+------+ +---+------+------+
|id3| [d]| [9]| |id3| [d]| [9]| |id3| [d]| [9]|
|id1|[a, b]|[3, 4]| |id1|[b, a]|[4, 3]| |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]| |id2|[c, a]|[5, 2]| |id2|[a, c]|[5, 2]|
+---+------+------+ +---+------+------+ +---+------+------+
NOTE: the last one is BAD because, for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]); same for id2.
I think you can rely on "their relative order" as Spark goes over rows one by one in order (and usually does not re-order rows if not explicitly needed).
If you are concerned about the order, merge these two columns using the struct function before doing the groupBy.
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
You could also use the monotonically_increasing_id function to number records and pair it with the other columns (perhaps using struct):
monotonically_increasing_id(): Column A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
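A minimal PySpark sketch of the struct idea (the Scala version is analogous): collecting the two columns as one struct per row means a name and its value can never be separated, and the pair columns can be pulled back apart afterwards. This does not fix the order of elements inside each list, but it does guarantee that idx[i] and val[i] always belong together.
from pyspark.sql import functions as F

# Collect (featureName, featureValue) as a single struct per row so the
# pairing survives the aggregation, then split the collected structs back
# into two aligned array columns.
paired = (df.groupBy("id")
            .agg(F.collect_list(F.struct("featureName", "featureValue")).alias("pairs"))
            .select("id",
                    F.col("pairs.featureName").alias("idx"),
                    F.col("pairs.featureValue").alias("val")))
paired.show()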
Say I have two PySpark DataFrames df1 and df2.
df1= 'a'
1
2
5
df2= 'b'
3
6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) over all y in df2['b'] (note: we can assume there is only one y that achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def dist(x, y):
    return abs(x - y)

udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way to construct the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then find the min one)
What's wrong with the last line of my code and how to fix it?
Starting with your second question: you can only apply a udf to the columns of an existing dataframe. I think you were looking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
| a| b|distance|
+---+---+--------+
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
+---+---+--------+
But there is a more efficient way to compute this distance, using the built-in abs (min is imported here as well, since it is needed in the next step):
>>> from pyspark.sql.functions import abs, min
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find matching numbers by calculating:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| 5| 6|
| 1| 3|
| 2| 3|
+---+---+
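As an aside on efficiency: the cross join still compares every a with every b, but the second join can be avoided with a window that keeps only the closest b per a. A rough sketch of that variant:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank each candidate b by its distance to a and keep only the closest one.
w = Window.partitionBy('a').orderBy('distance')
closest = (df1.join(df2)
              .withColumn('distance', F.abs(df1.a - df2.b))
              .withColumn('rn', F.row_number().over(w))
              .filter(F.col('rn') == 1)
              .select('a', 'b'))
closest.show()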