I am trying to convert an RDD of lists to a Dataframe in Spark.
RDD:
['ABC', 'AA', 'SSS', 'color-0-value', 'AAAAA_VVVV0-value_1', '1', 'WARNING', 'No test data for negative population! Re-using negative population for non-backtest.']
['ABC', 'SS', 'AA', 'color-0-SS', 'GG0-value_1', '1', 'Temp', 'After, date differences are outside tolerance (10 days) 95.1% of the time']
This is the content of the RDD: multiple lists.
How can I convert this to a DataFrame? Currently it ends up in a single column, but I need multiple columns.
Dataframe
+--------------+
| _1|
+--------------+
|['ABC', 'AA...|
|['ABC', 'SS...|
Just use Row.fromSeq:
import org.apache.spark.sql.Row
rdd.map(x => Row.fromSeq(x)).toDF
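If toDF is not available for an RDD[Row] in your environment, the same idea works by passing an explicit schema to createDataFrame. A minimal sketch, assuming rdd contains sequences of strings and spark is the active SparkSession (the column names are invented for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Invented column names; replace them with your real field names.
val colNames = Seq("col1", "col2", "col3", "col4", "col5", "col6", "level", "message")
val schema = StructType(colNames.map(name => StructField(name, StringType, nullable = true)))

// Build a Row from each list and let Spark apply the schema.
val rowRdd = rdd.map(x => Row.fromSeq(x))
val df = spark.createDataFrame(rowRdd, schema)
df.show()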
Related
In my case, how can I split a StringType column with the format '1-1235.0 2-1248.0 3-7895.2' into another ArrayType column containing ['1','2','3']?
This is relatively simple with a UDF:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))

df.withColumn("newCol", extractFirst($"input"))
  .show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy solution with Spark internals (other than using split in combination with explode etc. and then re-aggregating).
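For reference, a sketch of that split/explode/re-aggregate route (my own illustration on the same single-column df as above; note that collect_list does not guarantee element order):
import org.apache.spark.sql.functions.{split, explode, substring_index, collect_list, monotonically_increasing_id}

// Tag each row, explode the space-separated pieces, take the part before '-', then group back.
df.withColumn("id", monotonically_increasing_id())
  .withColumn("piece", explode(split($"input", " ")))
  .withColumn("first", substring_index($"piece", "-", 1))
  .groupBy("id", "input")
  .agg(collect_list($"first").as("newCol"))
  .show()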
You can split the string into an array using the split function and then transform that array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is a native approach; no UDF is applied.
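If you need the result as integers rather than strings (an assumption, not part of the original answer), the lambda inside TRANSFORM can also cast each element:
df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> CAST(substring_index(x, '-', 1) AS INT))"))
  .show(false)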
Is there a way to select the entire row as a column to input into a Pyspark filter udf?
I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:
my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))
But
col("*")
throws an error because that's not a valid operation.
I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a dataframe. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD into a dataframe again.
You should list all the columns explicitly. For example:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# create a sample df
df = sc.parallelize([
    (1, 'b'),
    (1, 'c'),
]).toDF(["id", "category"])

# simple filter function
@F.udf(returnType=BooleanType())
def my_filter(col1, col2):
    return (col1 > 0) & (col2 == "b")

df.filter(my_filter('id', 'category')).show()
Results:
+---+--------+
| id|category|
+---+--------+
| 1| b|
+---+--------+
If you have many columns and you are sure of the column order:
cols = df.columns
df.filter(my_filter(*cols)).show()
Yields the same output.
I got the following RDD after Kafka streaming. I want to convert it into a DataFrame without defining a schema.
[
{u'Enrolment_Date': u'2008-01-01', u'Freq': 78},
{u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'},
{u'Freq': 70, u'Group': u'Recorded Data'},
{u'Enrolment_Date': u'2008-04-01', u'Freq': 96}
]
You can use an OrderedDict to convert an RDD containing key-value pairs to a DataFrame. However, in your case not all keys are present in each row, so you first need to populate the missing ones with None values. See the solution below:
#Define test data
data = [{u'Enrolment_Date': u'2008-01-01', u'Freq': 78}, {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'}, {u'Freq': 70, u'Group': u'Recorded Data'}, {u'Enrolment_Date': u'2008-04-01', u'Freq': 96}]
rdd = sc.parallelize(data)
from pyspark.sql import Row
from collections import OrderedDict
#Determine all the keys in the input data
schema = rdd.flatMap(lambda x: x.keys()).distinct().collect()
#Add missing keys with a None value
rdd_complete= rdd.map(lambda r:{x:r.get(x) for x in schema})
#Use an OrderedDict to convert your data to a dataframe
#This ensures data ends up in the right column
df = rdd_complete.map(lambda r: Row(**OrderedDict(sorted(r.items())))).toDF()
df.show()
This gives as output:
+--------------+----+-------------+
|Enrolment_Date|Freq| Group|
+--------------+----+-------------+
| 2008-01-01| 78| null|
| 2008-02-01|null|Recorded Data|
| null| 70|Recorded Data|
| 2008-04-01| 96| null|
+--------------+----+-------------+
I have a DataFrame with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:
Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._

val distinct_df = df
  .withColumn("feat1", explode(col("feat1")))
  .withColumn("feat2", explode(col("feat2")))
  .agg(collect_set("feat1").alias("distinct_feat1"),
       collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution, for Spark 2.4+:
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
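If that can happen in your data, one way to guard against it (a sketch, not part of the original answer, assuming both columns are array<string>) is to coalesce each column to an empty array before concatenating:
import org.apache.spark.sql.functions.{array_distinct, coalesce, concat, typedLit}

// Replace a null array with an empty one so that concat does not return null.
val emptyStrings = typedLit(Seq.empty[String])

df.withColumn("distinct",
  array_distinct(concat(coalesce($"array_col1", emptyStrings),
                        coalesce($"array_col2", emptyStrings))))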
The method provided by Psidom works great; here is a function that does the same given a DataFrame and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce

    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)
Spark newbie here, attempting to use Spark to do some ETL and having trouble finding a clean way of unifying the data into the destination schema.
I have multiple DataFrames with these keys/values in a Spark context (streaming).
DataFrame of long values:
entry         | long
--------------+--------------
alert_refresh | 1446668689989
assigned_on   | 1446668689777
DataFrame of string values:
entry         | string
--------------+--------------
statusmsg     | alert msg
url           | http:/svr/pth
DataFrame of boolean values:
entry         | boolean
--------------+--------------
led_1         | true
led_2         | true
DataFrame of integer values:
entry         | int
--------------+--------------
id            | 789456123
I need to create a unified dataframe based on these where the key is the fieldName and it maintains the type from each source dataframe. It would look something like this:
id        | led_1 | led_2 | statusmsg | url           | alert_refresh | assigned_on
----------+-------+-------+-----------+---------------+---------------+--------------
789456123 | true  | true  | alert msg | http:/svr/pth | 1446668689989 | 1446668689777
What is the most efficient way to do this in Spark?
BTW - I tried doing a matrix transform:
val seq_b = df_booleans.flatMap(row => row.toSeq.map(col => (col, row.toSeq.indexOf(col))))
  .map(v => (v._2, v._1))
  .groupByKey.sortByKey(true)
  .map(_._2)
val b_schema_names = seq_b.first.flatMap(r => Array(r))
val b_schema = StructType(b_schema_names.map(r => StructField(r.toString(), BooleanType, true)))
val b_data = seq_b.zipWithIndex.filter(_._2 == 1).map(_._1).first()
val boolean_df = sparkContext.createDataFrame(b_data, b_schema)
Issue: Takes 12 seconds and .sortByKey(true) does not always sort values last
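For what it's worth, one possible direction (my own sketch, not from the original thread; the DataFrame and value-column names below are assumed from the tables above) is to pivot each (entry, value) DataFrame into a single wide row and join the results, which keeps each column's source type:
import org.apache.spark.sql.functions.{first, lit}

// Pivot each (entry, value) DataFrame into one wide row keyed by a dummy constant.
val wideLongs   = df_longs.groupBy(lit(1).as("k")).pivot("entry").agg(first("long"))
val wideStrings = df_strings.groupBy(lit(1).as("k")).pivot("entry").agg(first("string"))
val wideBools   = df_booleans.groupBy(lit(1).as("k")).pivot("entry").agg(first("boolean"))
val wideInts    = df_ints.groupBy(lit(1).as("k")).pivot("entry").agg(first("int"))

// Join the single-row DataFrames on the dummy key and drop it.
val unified = wideInts.join(wideBools, "k").join(wideStrings, "k").join(wideLongs, "k").drop("k")
unified.show()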