I have a pyspark RDD (myRDD) that is a variable-length list of IDs, such as
[['a', 'b', 'c'], ['d','f'], ['g', 'h', 'i','j']]
I have a pyspark dataframe (myDF) with columns ID and value.
I want to query myDF with the query:
outputDF = myDF.where(col("ID").isin(id_list)).select(F.collect_set("value").alias("my_values"))
where id_list is an element from the myRDD, such as ['d','f'] or ['a', 'b', 'c'].
An example would be:
outputDF = myDF.where(col("ID").isin(['d','f'])).select(F.collect_set("value").alias("my_values"))
What is a parallelizable way to use the RDD to query the DF like this?
Assuming your dataframe's "ID" column is of type StringType(), you want to keep the rows whose ID appears in any of your RDD's rows.
First, let's transform the RDD into a one-column dataframe with a unique ID for each row:
import pyspark.sql.functions as psf
from pyspark.sql import HiveContext

hc = HiveContext(sc)
ID_df = hc.createDataFrame(
    myRDD.map(lambda row: [row]),
    ['ID']
).withColumn("uniqueID", psf.monotonically_increasing_id())
We'll explode it so that each row contains only one ID value:
ID_df = ID_df.withColumn('ID', psf.explode(ID_df.ID))
We can now join it to our original dataframe; the inner join will serve as a filter:
myDF = myDF.join(ID_df, "ID", "inner")
collect_set is an aggregation function, so you need some kind of groupBy before using it, for instance on the newly created uniqueID:
outputDF = myDF.groupBy("uniqueID").agg(
    psf.collect_set("value").alias("my_values")
)
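Putting the steps together, a minimal end-to-end sketch (assuming sc, myRDD and myDF exist as described above):
import pyspark.sql.functions as psf
from pyspark.sql import HiveContext

hc = HiveContext(sc)

# One row per id_list, tagged with a unique row ID, then exploded to one ID per row
ID_df = hc.createDataFrame(myRDD.map(lambda row: [row]), ['ID']) \
    .withColumn("uniqueID", psf.monotonically_increasing_id()) \
    .withColumn("ID", psf.explode("ID"))

# The inner join plays the role of the isin() filter; aggregating per uniqueID
# yields one set of values for each id_list of the original RDD
outputDF = myDF.join(ID_df, "ID", "inner") \
    .groupBy("uniqueID") \
    .agg(psf.collect_set("value").alias("my_values"))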
How to create a column with a JSON structure based on other columns of a pyspark dataframe?
For example, I want to achieve the below in a pyspark dataframe. I am able to do this on a pandas dataframe as shown below, but how do I do the same on a pyspark dataframe?
import pandas as pd

df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns = ['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis = 1)
Expected output: a new_col column holding the JSON string of the Address and zip fields for each row, e.g. {"Address":"abc","zip":34567}.
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
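With the sample data above (and assuming the Spark dataframe df holds the same three columns as the pandas example), the show output should look roughly like this:
+-------+-----+-----+-----------------------------+
|Address|zip  |state|new_col                      |
+-------+-----+-----+-----------------------------+
|abc    |34567|VA   |{"Address":"abc","zip":34567}|
|dvf    |12345|TN   |{"Address":"dvf","zip":12345}|
|bgh    |78905|MA   |{"Address":"bgh","zip":78905}|
+-------+-----+-----+-----------------------------+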
I have two lists, one is col_name = ['col1','col2','col3'] and the other is col_value = ['val1', 'val2', 'val3']. I am trying to create a dataframe from the two lists, with col_name providing the column names. I need the output to have 3 columns and 1 row, with col_name as the header.
I am finding it difficult to get a solution for this. Please help.
Construct data directly, and use the createDataFrame method to create a dataframe.
col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']
data = [col_value]
df = spark.createDataFrame(data, col_name)
df.printSchema()
df.show(truncate=False)
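This should produce a one-row dataframe with the expected schema and contents:
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
+----+----+----+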
I have a big dataframe (~30M rows) and a function f. The job of f is to run through each row, check some logic and feed the outputs into a dictionary. The function needs to be performed row by row.
I tried:
dic = dict()
for row in df.rdd.collect():
    f(row, dic)
But I always run into an OOM error, even with the Docker memory set to 8GB.
How can I perform this effectively?
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, MapType
#sample data
df = sc.parallelize([
['a', 'b'],
['c', 'd'],
['e', 'f']
]).toDF(('col1', 'col2'))
#add logic to create dictionary element using rows of the dataframe
def add_to_dict(l):
    d = {}
    d[l[0]] = l[1]
    return d
add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))
#struct is used to pass rows of dataframe
df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
df.show()
#list of dictionary elements
dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
print(dictionary_list)
Output is:
[{'a': 'b'}, {'c': 'd'}, {'e': 'f'}]
By using collect you pull all the data out of the Spark executors into your driver. You really should avoid this, as it makes using Spark pointless (you could just use plain Python in that case).
What you could do:
reimplement your logic using the functions already available in pyspark.sql.functions (see the doc, and the sketch below)
if you cannot do that because some functionality is missing, you can define a User Defined Function (UDF)
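For instance, a minimal sketch that builds the per-row map with the built-in create_map function instead of a Python UDF, reusing the sample dataframe from the previous answer:
import pyspark.sql.functions as F

# Same sample data as in the previous answer
df = sc.parallelize([
    ['a', 'b'],
    ['c', 'd'],
    ['e', 'f']
]).toDF(('col1', 'col2'))

# create_map builds a MapType column from alternating key/value columns,
# so the work stays in the JVM instead of going through a Python UDF
df = df.withColumn("dictionary_item", F.create_map(F.col("col1"), F.col("col2")))
df.show()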
I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by its numerical index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select that you want to perform in the reduce function are similar to the ones in Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the column you select from d2 (index 1, i.e. Seq(1)) should not have the same name as any of d1's columns, otherwise the unqualified column reference will be ambiguous after the join.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)
I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
  withColumn("feat2", explode(col("feat2"))).
  agg(collect_set("feat1").alias("distinct_feat1"),
      collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution, for Spark 2.4+:
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
The method provided by Psidom works great, here is a function that does the same given a Dataframe and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)