Combining two columns from different datasets into one table in Spark - scala

I want to take two columns from two different tables and combine them into one table, without using any primary key common to both. For example:
import java.util
import org.apache.spark.sql.{DataFrame, Encoders}

val testDSArray: java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS: DataFrame = spark.createDataset(testDSArray)(Encoders.INT).toDF("col1")
val testDS2: DataFrame = spark.createDataset(testDSArray)(Encoders.INT).toDF("col2")
val columns = testDS.withColumn("col2", testDS2.col("col2"))
columns.show(5)
I would expect this code to show something like:
---------------
| col1 | col2 |
---------------
|    4 |    4 |
|    7 |    7 |
|   10 |   10 |
---------------
However, the code above fails to run with the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) col2#12 missing from col1#6 in operator !Project [col1#6, col2#12 AS col2#15];
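The error appears because withColumn can only reference columns of the DataFrame it is called on; Spark has no notion of aligning two independent DataFrames by position. A minimal sketch of one common workaround, reusing testDS and testDS2 from above and assuming both DataFrames have the same number of rows, is to add a synthetic row index to each and join on it:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Give each DataFrame a positional index; the window has no partitioning,
// so it collapses to a single partition, which is fine for small data like this.
val w = Window.orderBy(monotonically_increasing_id())
val left  = testDS.withColumn("row_id", row_number().over(w))
val right = testDS2.withColumn("row_id", row_number().over(w))

val combined = left.join(right, Seq("row_id")).drop("row_id")
combined.show(5)

The alignment relies on the current physical ordering of each DataFrame, which works for a small example like this but is not something Spark guarantees in general.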

Related

pyspark column type casting in pivot

I have a dataframe from which I want to create a pivot table using two columns. I'm pivoting on the question header column, whose values look like age and age_numeric,
and the answer header column supplies the values. My problem is that I want to collect the answer header values into a list, which I'm doing with the collect_list function, but I want the new age_numeric column to be a list of ints while the age column stays a list of strings, based on the question type column. When I try the code it always gives me a list of strings. Any idea how to solve this problem?
This is the code:
y = (output.groupby("sessionId")
     .pivot("questionHeader")
     .agg(collect_list(when(col("questionType") == "numericAnswer",
                            col("answerHeader").cast("float"))
                       .when(col("questionType") != "numericAnswer",
                             col("answerHeader")))))
This is what I get:
| session id | Age             | Age_numeric |
|------------|-----------------|-------------|
| 1          | ["20-25 years"] | ["20"]      |
| 3          | ["20-25 years"] | ["20"]      |
This is what I want:
| session id | Age             | Age_numeric |
|------------|-----------------|-------------|
| 1          | ["20-25 years"] | [20]        |
| 3          | ["20-25 years"] | [20]        |
If you want the output as in the last two rows, then you do not need a pivot; just groupBy and collect_list on each of the two columns. To get a list of integers for Age_numeric, apply .cast("array<int>") to the collected column, or change the type of the Age_numeric column before collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
            .agg(F.collect_list("Age").alias("Age"),
                 F.collect_list("Age_numeric")
                  .cast("array<int>")
                  .alias("Age_numeric")))

.isin() with a column from a dataframe

How can I query a table using isin() with another dataframe? For example there is this dataframe, df1:
| id      | rank |
|---------|------|
| SE34SER | 1    |
| SEF3445 | 2    |
| 5W4G4F  | 3    |
I want to query a table where a column in the table isin(df1.id). I tried doing so like this:
t = (
    spark.table('mytable')
    .where(sf.col('id').isin(df1.id))
    .select('*')
).show()
However it errors:
AttributeError: 'NoneType' object has no attribute 'id'
Unfortunately, you can't pass another dataframe's column to the isin() method. You can collect all of that column's values into a list and pass the list to isin(), but that is not the best approach.
You can instead do an inner join between those two dataframes.
df2 = spark.table('mytable')
df2.join(df1.select('id'), df1.id == df2.id, 'inner')

How to find duplicated columns with all values in spark dataframe?

I'm preprocessing my data (2,000K+ rows) and want to count the duplicated columns in a Spark dataframe, for example:
 id | col1 | col2 | col3 | col4 |
----+------+------+------+------+
  1 |    3 |  999 |    4 |  999 |
  2 |    2 |  888 |    5 |  888 |
  3 |    1 |  777 |    6 |  777 |
In this case, col2 and col4 have the same values, which is what I'm interested in, so the count should increase by 1.
I tried toPandas(), transposing, and then dropDuplicates() in PySpark, but it's too slow.
Is there any function that could solve this?
Any ideas would be appreciated, thank you.
So you want to count the number of duplicate values based on the columns col2 and col4? The snippet below should do the trick.
val dfWithDupCount = df.withColumn("isDup", when($"col2" === $"col4", 1).otherwise(0))
This will create a new dataframe with an indicator column, isDup, that is 1 when col2 equals col4 and 0 otherwise.
To find the total number of matching rows, all you need to do is group by isDup and sum.
import org.apache.spark.sql.functions._

val grouped = dfWithDupCount.groupBy("isDup").agg(sum("isDup"))
display(grouped)
Apologies if I misunderstood you. You could probably use the same approach if you were trying to match any of the columns against each other, but that would require nested when statements.
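For illustration, a rough sketch of that nested-when idea, assuming the example schema above and flagging a row when any pair of the value columns col1 to col4 matches:

import org.apache.spark.sql.functions.{col, sum, when}

// Sketch only: isDup is 1 when any pair of the value columns matches, else 0.
val dfAnyDup = df.withColumn(
  "isDup",
  when(col("col1") === col("col2"), 1)
    .when(col("col1") === col("col3"), 1)
    .when(col("col1") === col("col4"), 1)
    .when(col("col2") === col("col3"), 1)
    .when(col("col2") === col("col4"), 1)
    .when(col("col3") === col("col4"), 1)
    .otherwise(0)
)
dfAnyDup.groupBy("isDup").agg(sum("isDup")).show()

Each chained when adds one pairwise comparison, so the number of conditions grows quadratically with the number of columns being compared.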

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is to provide a lambda/UDF to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties of the JSON out into columns, something like this; might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do this, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']

    # joins here...

    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f

new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

In spark and scala, how to convert or map a dataframe to specific columns info?

Scala. Spark. IntelliJ IDEA.
I have a dataframe (multiple rows, multiple columns) from a CSV file,
and I want to map it to another, specific set of column info.
I think a Scala class (not a case class, because the column count is > 22) or map()...
But I don't know how to convert it.
Example
A dataframe from the CSV file:
---------------------
| No | price | name |
---------------------
| 1  | 100   | "A"  |
---------------------
| 2  | 200   | "B"  |
---------------------
The other, specific column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a dataframe with the other column info:
--------------------------------------
| product_id | product_name | seller |
--------------------------------------
| 1          | "A"          |        |
--------------------------------------
| 2          | "B"          |        |
--------------------------------------
If you already have a dataframe (e.g. old_df):
val new_df = old_df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name").
  drop("price").
  withColumn("seller", ...)
Let's say your CSV file is "products.csv".
First you have to load it in Spark; you can do that using:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("products.csv")
Once the data is loaded you will have all the column names in the dataframe df. As you mentioned, your column names are "No", "price", "name".
To change the name of a column you just have to use the withColumnRenamed API of the dataframe.
val renamedDf = df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you assigned.
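Neither snippet spells out the seller column. A minimal sketch of that final step, assuming the renamedDf from above and using an empty string (the question allows either null or ""):

import org.apache.spark.sql.functions.lit

// Assumption: seller is a constant empty string, per the question;
// use lit(null).cast("string") instead if a null is preferred.
val resultDf = renamedDf
  .drop("price")                 // the target layout has no price column
  .withColumn("seller", lit(""))

resultDf.show()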