Scala DataFrame: Explode an array

Scala DataFrame: Explode an array - scala

I am using the spark libraries in Scala. I have created a DataFrame using
val searchArr = Array(
StructField("log",IntegerType,true),
StructField("user", StructType(Array(
StructField("date",StringType,true),
StructField("ua",StringType,true),
StructField("ui",LongType,true))),true),
StructField("what",StructType(Array(
StructField("q1",ArrayType(IntegerType, true),true),
StructField("q2",ArrayType(IntegerType, true),true),
StructField("sid",StringType,true),
StructField("url",StringType,true))),true),
StructField("where",StructType(Array(
StructField("o1",IntegerType,true),
StructField("o2",IntegerType,true))),true)
)
val searchSt = new StructType(searchArr)
val searchData = sqlContext.jsonFile(searchPath, searchSt)
I am now what to explode the field what.q1, which should contain an array of integers, but the documentation is limited:
http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)
So far I tried a few things without much luck
val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())
Any ideas/examples of how to use explode on an array?

Did you try with an UDF on field "what"? Something like that could be useful:
val explode = udf {
(aStr: GenericRowWithSchema) =>
aStr match {
case null => ""
case _ => aStr.getList(0).get(0).toString()
}
}
val newDF = df.withColumn("newColumn", explode(col("what")))
where:
getList(0) returns "q1" field
get(0) returns the first element of "q1"
I'm not sure but you could try to use getAs[T](fieldName: String) instead of getList(index: Int).

I'm not used to Scala; but in Python/pyspark, the array type column nested within a struct type field can be exploded as follows. If it works for you, then you can convert it to corresponding Scala representation.
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType
schema = StructType([
StructField("log", IntegerType()),
StructField("user", StructType([
StructField("date", StringType()),
StructField("ua", StringType()),
StructField("ui", LongType())])),
StructField("what", StructType([
StructField("q1", ArrayType(IntegerType())),
StructField("q2", ArrayType(IntegerType())),
StructField("sid", StringType()),
StructField("url", StringType())])),
StructField("where", StructType([
StructField("o1", IntegerType()),
StructField("o2", IntegerType())]))
])
data = [((1), ("2022-01-01","ua",1), ([1,2,3],[6],"sid","url"), (7,8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
Output:
+---+-------------------+--------------------------+------+
|log|user |what |where |
+---+-------------------+--------------------------+------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+
With what.q1 exploded:
df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)
Output:
+---+-------------------+--------------------------+------+----------------+
|log|user |what |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3 |
+---+-------------------+--------------------------+------+----------------+

Related

Transform a column included in array column

I need to transform an array column in my dataframe, array is called 'cities' and has the type Array(City) and I want put city name in upper case.
Structure:
val cities: StructField = StructField("cities", ArrayType(CityType), nullable = true)
def CityType: StructType =
StructType(
Seq(
StructField(code, StringType, nullable = true),
StructField(name, StringType, nullable = true)
)
)
Code I tried:
.withColumn(
newColumn,
forall(
col(cities),
(col: Column) =>
struct(
Array(
col(code),
upper(col(name))
): _*
)
)
)
Error says
cannot resolve 'forall(...

There is no such thing called forall. You can use transform instead:
// sample data
val df = spark.sql("select array(struct('1' as code, 'abc' as name), struct('2' as code, 'def' as name)) cities")
import org.apache.spark.sql.Column
val df2 = df.withColumn(
"newcol",
transform(
col("cities"),
(c: Column) => struct(c("code"), upper(c("name")))
)
)
df2.show
+--------------------+--------------------+
| cities| newcol|
+--------------------+--------------------+
|[[1, abc], [2, def]]|[[1, ABC], [2, DEF]]|
+--------------------+--------------------+

Spark create a dataframe from multiple lists/arrays

So, I have 2 lists in Spark(scala). They both contain the same number of values. The first list a contains all strings and the second list b contains all Long's.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2=StructType(
Array(
StructField("check_name", StringType, true),
StructField("metric", DecimalType(38,0), true)
)
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?

You can create an RDD[Row] and convert to Spark dataframe with the given schema:
val df = spark.createDataFrame(
sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+

Using Dataset:
import spark.implicits._
case class Schema2(a: String, b: Long)
val el = (a zip b) map { case (a, b) => Schema2(a, b)}
val df = spark.createDataset(el).toDF()

how to compliment a df from another df values

i have 2 dataframes, so one df has unique values with a good format, and the other df has values with wrong values, so how can i compliment the df with wrong values with respect to the other dataframe?.
Example: df with correct and unique values
+----------------------------------------+--------------+
|company_id |company_name |
+----------------------------------------+--------------+
|8f642dc67fccf861548dfe1c761ce22f795e91f0|Muebles |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
+----------------------------------------+--------------+
Example df with wrong values:
+----------------------------------------+------------+
|company_id |company_name|
+----------------------------------------+------------+
|******* |MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|NaN |
|NaN |MiPasajefy |
+----------------------------------------+------------+
The columns : company_id and company_name are key columns,
so the wrong df with corrected values has to be:
+----------------------------------------+------------+
|company_id |company_name|
+----------------------------------------+------------+
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
+----------------------------------------+------------+

from datetime import datetime
from pyspark.sql import *
from collections import *
from pyspark.sql.functions import udf,explode
from pyspark.sql.types import StringType
table_schema = StructType([StructField('key1', StringType(), True),
StructField('key2', IntegerType(), True),
StructField('list1', ArrayType(StringType()), False),
StructField('list2', ArrayType(StringType()), False),
StructField('list3', ArrayType(IntegerType()), False),
StructField('list4', StringType(), False),
StructField('list5', ArrayType(FloatType()), False),
StructField('list6', ArrayType(StringType()), False)
])
df= spark.createDataFrame(
[
("8f642dc67fccf861548dfe1c761ce22f795e91f0","Muebles"),
("cbf1c8b09cd5b549416d49d220a40cbd317f952e","MiPasajefy")
],("company_id","company_name")
)
df2= spark.createDataFrame(
[
( "*****" ,"MiPasajefy" ),
("cbf1c8b09cd5b549416d49d220a40cbd317f952e","NaN" ),
("NaN","MiPasajefy")
],("company_id","company_name")
)
df.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("select a.Company_name,a.company_id from B b left join A a on (a.company_id=b.company_id or a.Company_name=b.Company_name )").show(truncate=False)
+------------+----------------------------------------+
|Company_name|company_id |
+------------+----------------------------------------+
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
+------------+----------------------------------------+

what is the order guarantee when joining two columns of a spark dataframe which are processed separately?

I have dataframe with 3 columns
date
jsonString1
jsonString2
I want to expand attributes inside json into columns. so i did something like this.
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
now i want to put these table together. so i did the following
val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)
var finalResult = json1Table
.join(json2Table, Seq("columnindex"))
.drop("columnindex")
def addColumnIndex(df: DataFrame) = spark.createDataFrame(
df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling few rows I observe that rows match exactly as in the source dataframe
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem. Any help is appreciated.

It is always risky to rely on undocumented behaviours, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without using any split and join approach. Use a from_json function to create two nested columns and then flatten out the nested columns and finally drop out the intermediate JSON string columns and nested columns.
Here is an example fo the whole process.
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val df = (Seq(
("09-02-2020","{\"id\":\"01\", \"status\":\"Active\"}","{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
("10-02-2020","{\"id\":\"02\", \"status\":\"Dormant\"}","{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date","jsonString1","jsonString2"))
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = (StructType(Seq(
StructField("id", StringType, true),
StructField("status", StringType, true)
)))
val schema2 = (StructType(Seq(
StructField("name", StringType, true),
StructField("address", StringType, true)
)))
val dfFlattened = (df.withColumn("jsonData1", from_json(col("jsonString1"), schema1))
.withColumn("jsonData2", from_json(col("jsonString2"), schema2))
.withColumn("id", col("jsonData1.id"))
.withColumn("status", col("jsonData1.status"))
.withColumn("name", col("jsonData2.name"))
.withColumn("address", col("jsonData2.address"))
.drop("jsonString1")
.drop("jsonString2")
.drop("jsonData1")
.drop("jsonData2"))
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+

Manually create a pyspark dataframe

I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
[
StructField("time_epocs", DecimalType(), True),
StructField("lat", DecimalType(), True),
StructField("long", DecimalType(), True),
]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
However, the Spark documentation seems to be a bit convoluted to me, and I got similar errors when I tried to follow those instructions.
Does anyone know how to do this?

Simple dataframe creation:
df = spark.createDataFrame(
[
(1, "foo"), # create your data here, be consistent in the types.
(2, "bar"),
],
["id", "label"] # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to official doc:
when schema is a list of column names, the type of each column will be inferred from data. (example above ↑)
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)
# Example with a datatype string
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
"id int, label string", # add column names and types here
)
# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
T.StructType( # Define the whole schema within a StructType
[
T.StructField("id", T.IntegerType(), True),
T.StructField("label", T.StringType(), True),
]
),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from Pandas dataframe, schema will be inferred from Pandas dataframe's types :
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
{
"col1": [np.random.randint(10) for x in range(10)],
"col2": [np.random.randint(100) for x in range(10)],
}
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+

To elaborate/build off of #Steven's answer:
field = [
StructField("MULTIPLIER", FloatType(), True),
StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
Will create a blank dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = wtp_multiplier_df.union(wtp_multiplier_df_temp)

This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame a RDD and schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
schema = schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+

With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
(1, "foo"),
(2, "bar"),
],
StructType(
[
StructField("id", IntegerType(), False),
StructField("txt", StringType(), False),
]
),
)
print(df.dtypes)
df.show()

Extending #Steven's Answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt'] # add your columns label here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to specifically define schema then do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]

for beginners, a full example importing data from file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
ShortType,
StringType,
StructType,
StructField,
TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
[
StructField("id", ShortType(), nullable=False),
StructField("string", StringType(), nullable=False),
StructField("datetime", TimestampType(), nullable=False),
]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Scala DataFrame: Explode an array - scala

Related

Transform a column included in array column

Spark create a dataframe from multiple lists/arrays

how to compliment a df from another df values

what is the order guarantee when joining two columns of a spark dataframe which are processed separately?

Manually create a pyspark dataframe

Categories

Resources