I have tried various Stack Overflow questions but have been unable to reach the end goal of an array of distinct values from the DataFrame.
val df = Seq(
List("Mandrin","Hindi","English"),
List("French","English")
).toDF("languages")
df.collect.map(_.toSeq).flatten
Returns
Array[Any] = Array(WrappedArray(Mandrin, Hindi, English), WrappedArray(French, English))
Desired result is
Array(Mandrin, Hindi, English, French)
If I could get it to a flat array with duplicates, then I can call distinct.
Thanks.
You don't need that additional map step. If you read the single array column as a Dataset of sequences, collect already gives you sequences of strings, and you just need to flatten them to get an Array of Strings.
val languagesArray: Array[String] = df.as[Seq[String]].collect().flatten
However, when working with huge sets of data it is usually not a good idea to collect the data to the driver; consider using explode instead:
import org.apache.spark.sql.functions._
df.select(explode($"languages")).show()
This generates the following output:
+-------+
| col|
+-------+
|Mandrin|
| Hindi|
|English|
| French|
|English|
+-------+
On either of these outputs you can then call distinct to get the distinct languages.
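For completeness, a minimal sketch of both variants, reusing the languagesArray value and the explode import from above (the distinctLanguages name is only for illustration):
// driver side: distinct on the collected flat array
val distinctLanguages: Array[String] = languagesArray.distinct
// distributed: distinct on the exploded column
df.select(explode($"languages").as("language")).distinct().show()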
I'm trying to modify a column of my DataFrame by removing the suffix from all the rows under that column, and I need to do it in Scala.
The values in the column have different lengths, and the suffixes differ as well.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove the first "_" and everything after it, so that I get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this; it is not necessary to wrap it in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, a native Spark function is preferable for performance when one is available.[1]
val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
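// substring_index with count = 1 keeps everything before the first "_"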
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs
I am new to Apache Spark. I want to find the products that are unique to a single store, using Scala Spark.
The data in the file is like below, where the first column in each row is the store name.
Sears,shoe,ring,pan,shirt,pen
Walmart,ring,pan,hat,meat,watch
Target,shoe,pan,shirt,hat,watch
I want the output to be
Only Walmart has Meat.
Only Sears has Pen.
I tried the below in Scala Spark and was able to get the unique products, but I don't know how to get the store names for those products. Please help.
val filerdd = sc.textFile("file:///home/hduser/stores_products")
val uniquerdd = filerdd.map(x => x.split(","))
  .map(x => Array(x(1), x(2), x(3), x(4), x(5)))
  .flatMap(x => x)
  .map(x => (x, 1))
  .reduceByKey((a, b) => a + b)
  .filter(x => x._2 == 1)
uniquerdd now holds Array((pen,1), (meat,1)).
Now I want to find which rows of filerdd contain these products, and display the output as below:
Only Walmart has Meat.
Only Sears has Pen.
Can you please help me get the desired output?
The DataFrame API is probably easier than the RDD API for this: you can explode the list of products and filter those with count = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.read.csv("filepath")  // headerless CSV, so the columns come in as _c0, _c1, ...
val result = df.select(
  $"_c0".as("store"),
  explode(array(df.columns.tail.map(col): _*)).as("product")
).withColumn(
  "count",
  count("*").over(Window.partitionBy("product"))  // number of stores carrying each product
).filter(
  "count = 1"
).select(
  format_string("Only %s has %s.", $"store", $"product").as("output")
)
result.show(false)
+----------------------+
|output |
+----------------------+
|Only Walmart has meat.|
|Only Sears has pen. |
+----------------------+
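If you would rather finish the RDD version you started with, here is a minimal sketch under the same assumptions, reusing your filerdd (the names storeProducts and messages are made up for illustration):
val storeProducts = filerdd.map(_.split(","))
  .flatMap(arr => arr.tail.map(product => (product, arr.head)))  // (product, store) pairs
val messages = storeProducts
  .groupByKey()
  .filter { case (_, stores) => stores.size == 1 }  // keep products carried by exactly one store
  .map { case (product, stores) => s"Only ${stores.head} has $product." }
messages.collect().foreach(println)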
I want the lists in the column below to be merged into a single list for an n-gram calculation. I am not sure how I can merge all the lists in the column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info:
I would like this as a Spark DataFrame column, with all the words, including the repeated ones, in a single list. The data is quite big, so I want to avoid methods like collect.
The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin','Lee'],),(['Chatbots','were'],),(['Our','hopes','were'],),
(['And','why','wouldn'],),(['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values,['author',])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This single step suffices:
from pyspark.sql import functions as F
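# no grouping keys: collect_list gathers every row's 'author' array into one list of lists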
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
DataFrames, like other distributed data structures, are not iterable; they can only be accessed through dedicated higher-order functions and/or SQL methods.
Suppose your DataFrame is DF1 and the output is DF2.
You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author', ])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if it works.
Every row in the dataframe contains a CSV-formatted string, line, plus another simple string, category. What I'm trying to get in the end is a dataframe composed of the fields extracted from the line string, together with the category.
So I proceeded as follows to explode the line string:
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
In the end I managed to get a new dataframe composed of the line fields, but I can't carry the category over to the new dataframe.
I can't join the new dataframe with the initial one, since the common field id was not a separate column at first.
Sample of input:
line | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "dan#gmail.com"| "premium"
Any suggestions? Thanks in advance.
If the structure of the strings in the line column is fixed as shown in the question, then the following simple solution should work: the split built-in function splits the string into an array, and the elements of the array are then selected and aliased to get the final dataframe.
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan@gmail.com'|premium |
+---+--------+---------------+--------+
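If you also want to drop the surrounding single quotes, as in the desired output shown in the question, one possible tweak (assuming the quotes never occur inside the values) is to strip them with regexp_replace before splitting:
df.withColumn("line", split(regexp_replace(col("line"), "'", ""), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)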
I hope the answer is helpful
I've been thinking about the following problem but haven't reached a solution: I have a dataframe df with only one column A, whose elements have dataType Array[String]. I'm trying to get all the distinct arrays of A, ignoring the order of the Strings in the arrays.
For example, if the dataframe is the following:
df.select("A").show()
+--------+
|A |
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
|[e,d] |
|[c,a,b] |
+--------+
I would like to get the dataframe
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
+--------+
I've tried distinct(), dropDuplicates() and other functions, but it doesn't work.
I would appreciate any help. Thank you in advance.
You can use the collect_list function to collect all the arrays in that column, then use a udf function to sort the individual arrays and return the distinct arrays of the collected list. Finally, you can use the explode function to distribute the distinct collected arrays into separate rows.
import scala.collection.mutable
import org.apache.spark.sql.functions._

def distinctCollectUDF = udf((a: mutable.WrappedArray[mutable.WrappedArray[String]]) => a.map(array => array.sorted).distinct)
df.select(distinctCollectUDF(collect_list("A")).as("A")).withColumn("A", explode($"A")).show(false)
You should have your desired result.
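If you would rather avoid a UDF, a minimal alternative sketch is to sort each array with the built-in sort_array function and then drop duplicates; this assumes, as the question states, that the order inside the arrays does not matter (note that the arrays come back sorted):
df.withColumn("A", sort_array($"A")).dropDuplicates("A").show(false)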
You might try and use the contains method.