Spark: merging all lists in a column into a single list - pyspark

I want to merge the lists in the column below into a single list for an n-gram calculation. I am not sure how I can merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info:
I would like this as a Spark DataFrame column, with all the words, including the repeated ones, in a single list. The data is fairly big, so I want to avoid methods like collect().

The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin','Lee'],), (['Chatbots','were'],), (['Our','hopes','were'],),
          (['And','why','wouldn'],), (['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values, ['author'])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
Aggregating with collect_list gets everything into one row. Note that, as the output below shows, the result is still a list of lists (one WrappedArray per original row); to end up with one flat list of words you can apply pyspark.sql.functions.flatten (Spark 2.4+) on top of collect_list.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

DataFrames, like other distributed data structures, are not iterable and can only be accessed using dedicated higher-order functions and/or SQL methods.
Suppose your DataFrame is DF1 and the output is DF2. You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author'])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)

Related

pyspark join 2 columns if condition is met, and insert string into the result

I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|            t_filter|
+-------+-------+--------------------+
|  MANDT|   true|               !='E'|
|  WERKS|   true|0010_0020_0021_00...|
+-------+-------+--------------------+
As a first step, I split t_filter on _ with f.split(f.col("t_filter"), "_"):
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|            t_filter|               t_filter_1|
+-------+-------+--------------------+-------------------------+
|  MANDT|   true|               !='E'|                  [!='E']|
|  WERKS|   true|0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input, while doing a logic check for !=. Ultimate aim:
+------------------------------+
|t_filter_2                    |
+------------------------------+
|MANDT != 'E'                  |
|WERKS in ('0010', '0020', ...)|
+------------------------------+
I have tried using withColumn, but I keep getting an error saying col must be Column.
I am also not sure what the proper approach should be in order to achieve this.
Note: there is a large number of rows, around 10k. I understand that using a UDF would be quite expensive, so I'm interested to know if there are other ways this can be done.
You can achieve this with withColumn and conditional evaluation using the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, first convert the t_filter_1 array to a string with ',' as the separator, then concatenate it with s_field together with literals for in and the surrounding parentheses.
from pyspark.sql import functions as f

filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter            |t_filter_1               |t_filter_2                             |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true   |MANDT  |!='E'               |[!='E']                  |MANDT!='E'                             |
|true   |WERKS  |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

filters_data = [
    {"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
    {"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))

filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
).show(200, False)

spark merge datasets based on the same input of one column and concat the others

Currently I have several Dataset[UserRecord] instances, and they look like this:
case class UserRecord(
  Id: String,
  ts: Timestamp,
  detail: String
)
Let's call the several datasets datasets.
Previously I tried this
datasets.reduce(_ union _)
  .groupBy("Id")
  .agg(collect_list("ts", "detail"))
  .as[(String, Seq[DetailRecord])]
but this code gives me an OOM error. I think the root cause is collect_list.
Now I'm thinking if I can do the groupBy and agg for each of the dataset first and then join them together to solve the OOM issue. Any other good advice is welcome too :)
I have an IndexedSeq of datasets that look like this:
|name| lists |
| x |[[1,2], [3,4]]|
|name| lists |
| y |[[5,6], [7,8]]|
|name| lists |
| x |[[9,10], [11,12]]|
How can I combine them to get a Dataset that looks like
|name| lists |
| x |[[1,2], [3,4],[9,10], [11,12]]|
| y |[[5,6], [7,8]] |
I tried ds.reduce(_ union _) but it didn't seem to work
You can aggregate after the union:
import org.apache.spark.sql.functions._

val ds2 = ds.reduce(_ unionAll _)
  .groupBy("name")
  .agg(flatten(collect_list("lists")).as("lists"))
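For reference, here is a minimal self-contained sketch of that approach (my own illustration, assuming a spark-shell style session; flatten requires Spark 2.4+, and a single pre-unioned frame stands in for the IndexedSeq of datasets):
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data standing in for the result of reducing the datasets with union
val ds = Seq(
  ("x", Seq(Seq(1, 2), Seq(3, 4))),
  ("y", Seq(Seq(5, 6), Seq(7, 8))),
  ("x", Seq(Seq(9, 10), Seq(11, 12)))
).toDF("name", "lists")

// collect_list gathers each name's array-of-arrays; flatten removes the extra
// level of nesting that collect_list introduces
ds.groupBy("name")
  .agg(flatten(collect_list("lists")).as("lists"))
  .show(false)
// x -> [[1, 2], [3, 4], [9, 10], [11, 12]], y -> [[5, 6], [7, 8]] (row and element order may vary)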

How to combine several Dataframes together in scala?

I have several dataframes, each containing a single column. Let's say I have 4 such dataframes, all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that the fields of one dataframe are not present in the other. I am not sure how I can combine these 4 dataframes together. They don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in the same format. I want a dataframe like the one below:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting the column names on the fly, and as such I would need to append them on the fly, so I don't know exactly how many columns I will get. That is why I cannot use the above command.
Your goal is a little unclear. You ask about joining these dataframes, but perhaps you just want to select those 4 columns:
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", $"author", $"price")
newdf.show()
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
  col("UserData.UserValue._valueRef"),
  col("UserData.UserValue._title"),
  col("author"),
  col("price"))
newdf.show()
So I looked at various ways, and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the fly. Basically, you create an index column in each dataframe, join the dataframes on that index, and then drop the index column altogether; a rough sketch of the idea is shown below.
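This is my own illustration of the index-and-join idea, not necessarily the exact code from the referenced answer; withRowIndex is a hypothetical helper. Ordering by monotonically_increasing_id pulls all rows into a single partition and assumes the current row order of each dataframe is the pairing you want, so treat it as a sketch for modest data sizes:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper: tag every row with a contiguous index so otherwise
// unrelated single-column dataframes can be joined positionally
def withRowIndex(in: DataFrame): DataFrame = {
  val w = Window.orderBy(monotonically_increasing_id())
  in.withColumn("row_idx", row_number().over(w))
}

val combined = withRowIndex(df)
  .join(withRowIndex(df2), "row_idx")
  .join(withRowIndex(df3), "row_idx")
  .join(withRowIndex(df4), "row_idx")
  .drop("row_idx")
combined.show()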

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.functions._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  regexp_extract($"value", r, 1).as("id"),
  regexp_extract($"value", r, 2).as("community")
).show()
A couple of regular expressions should give you the required result.
import org.apache.spark.sql.functions._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful

Spark-scala: Select distinct arrays from a column dataframe ignoring ordering

I've been thinking about the following problem but I haven't reached a solution: I have a dataframe df with only one column A, whose elements have data type Array[String]. I'm trying to get all the distinct arrays of A, ignoring the order of the Strings inside the arrays.
For example, if the dataframe is the following:
df.select("A").show()
+--------+
|A |
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
|[e,d] |
|[c,a,b] |
+--------+
I would like to get the dataframe
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
+--------+
I've tried distinct(), dropDuplicates() and other functions, but it doesn't work.
I would appreciate any help. Thank you in advance.
You can use the collect_list function to collect all the arrays in that column, then use a udf to sort each individual array and return the distinct arrays of the collected list, and finally use the explode function to distribute the distinct arrays back into separate rows:
import scala.collection.mutable
import org.apache.spark.sql.functions._

def distinctCollectUDF = udf((a: mutable.WrappedArray[mutable.WrappedArray[String]]) => a.map(array => array.sorted).distinct)
df.select(distinctCollectUDF(collect_list("A")).as("A")).withColumn("A", explode($"A")).show(false)
You should have your desired result.
You might also try using the contains method.
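As an alternative to the udf-based approach above, on Spark 2.4+ you can get the same effect with built-in functions only. This is a sketch under that version assumption; it keeps the sorted form of each array as the representative row:
import org.apache.spark.sql.functions._

// Sort each array so that permutations of the same elements become identical,
// then drop the duplicates
df.select(array_sort(col("A")).as("A"))
  .dropDuplicates("A")
  .show(false)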