pyspark AnalysisException: Attribute contains invalid characters even after renaming the columns

I'm getting an error about illegal column names even after renaming the columns. How can I fix it?
I start with:
df=spark.read.parquet('abfss://myblob.dfs.core.windows.net/somedir')
df
# DataFrame[Region: string, Date: date, OnOff Peak: string, Hourbegin: int, Hourend: int, Inflation: string, Price_Type: string, Reference_Year: int, Area: string, Price: double, Case: string]
df.columns
# ['Region', 'Date', 'OnOff Peak', 'Hourbegin', 'Hourend', 'Inflation','Price_Type', 'Reference_Year', 'Area', 'Price', 'Case']
df.head()
# AnalysisException: Attribute name "OnOff Peak" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
OK, so my 'OnOff Peak' column has an illegal space. Just rename it, right?
df=df.withColumnRenamed('OnOff Peak', 'OnOff_Peak')
df
#DataFrame[Region: string, Date: date, OnOff_Peak: string, Hourbegin: int, Hourend: int, Inflation: string, Price_Type: string, Reference_Year: int, Area: string, Price: double, Case: string]
df.columns
# ['Region', 'Date', 'OnOff_Peak', 'Hourbegin', 'Hourend', 'Inflation', 'Price_Type', 'Reference_Year', 'Area', 'Price', 'Case']
df.head()
# AnalysisException: Attribute name "OnOff Peak" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
Even though I renamed the column, I'm still getting the same error.
Maybe if I rename it using select, col, and alias...
from pyspark.sql.functions import col
df=spark.read.parquet('abfss://myblob.dfs.core.windows.net/somedir').select(col("Region"),
col("Date"),
col("OnOff Peak").alias("OnOff_Peak"),
col("Hourbegin"),
col("Hourend"),
col("Inflation"),
col("Price_Type"),
col("Reference_Year"),
col("Area"),
col("Price"),
col("Case"))
df
# DataFrame[Region: string, Date: date, OnOff_Peak: string, Hourbegin: int, Hourend: int, Inflation: string, Price_Type: string, Reference_Year: int, Area: string, Price: double, Case: string]
df.columns
# ['Region', 'Date', 'OnOff_Peak', 'Hourbegin', 'Hourend', 'Inflation','Price_Type', 'Reference_Year', 'Area', 'Price', 'Case']
df.head()
# AnalysisException: Attribute name "OnOff Peak" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
Nope. Same error as before.
If I omit that column then I don't get the error.
df.select(col("Region"),
col("Date"),
col("Hourbegin"),
col("Hourend"),
col("Inflation"),
col("Price_Type"),
col("Reference_Year"),
col("Area"),
col("Price"),
col("Case")).show(1)
#+-----------------+----------+---------+-------+---------+------------+--------------+------------+-----+--------------------+
#| Region| Date|Hourbegin|Hourend|Inflation| Price_Type|Reference_Year| Area|Price| Case|
#+-----------------+----------+---------+-------+---------+------------+--------------+------------+-----+--------------------+
#My Actual Data
#+-----------------+----------+---------+-------+---------+------------+--------------+------------+-----+--------------------+

What if you try to rename using regular expressions to make sure you only have alphanumeric characters in this column name?
df = spark.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
        (3, 'gee'),
        (4, 'noo'),
    ],
    ["OnOff Peak", "description"])
import re
col_nm = df.columns[0]
new_col_nm = re.sub('[^0-9a-zA-Z]+', '', col_nm)
df=df.withColumnRenamed(col_nm, new_col_nm)
df.show()
+---------+-----------+
|OnOffPeak|description|
+---------+-----------+
| 1| foo|
| 2| bar|
| 3| gee|
| 4| noo|
+---------+-----------+
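If you want to sanitize every column name rather than just the first one, the same regex can be applied in a loop (a sketch of that idea only; it assumes df is whichever DataFrame you want to clean up):
import re
# strip every character that is not 0-9 or a-z/A-Z from each column name,
# e.g. 'OnOff Peak' -> 'OnOffPeak'
for old_nm in df.columns:
    new_nm = re.sub('[^0-9a-zA-Z]+', '', old_nm)
    if new_nm != old_nm:
        df = df.withColumnRenamed(old_nm, new_nm)
df.printSchema()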

Related

how to persist a spark Date type column into a DB Date column

I have a Dataset containing a Date type column. When I try to store this Dataset to the DB I'm getting:
ERROR: column "processDate" is of type date but expression is of type character varying
which is obviously telling me that I'm trying to store a varchar column into a date column. However, I'm using to_date (from sql.functions) to convert processDate from string to Date (which works, I tried it).
Can anyone help?
Make sure that your data is in the specified date format; you can also pass the format explicitly to the to_date() function.
Sample data:
import spark.implicits._   // for toDF and the 'col Symbol syntax (already in scope in spark-shell)
val somedata = Seq(
  (1, "11-11-2019"),
  (2, "11-11-2019")
).toDF("id", "processDate")
somedata.printSchema()
Schema for this sample data is:
somedata:org.apache.spark.sql.DataFrame = [id: integer, processDate: string]
You can do something like this:
import org.apache.spark.sql.functions.to_date
val newDF = somedata.withColumn("processDate", to_date('processDate, "MM-dd-yyyy"))
newDF.show()
newDF.printSchema()
Above code will output:
+---+-----------+
| id|processDate|
+---+-----------+
| 1| 2019-11-11|
| 2| 2019-11-11|
+---+-----------+
With the following schema:
newDF: org.apache.spark.sql.DataFrame = [id: int, processDate: date]
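Since the question at the top of this page is pyspark, the same fix would look roughly like this in Python (a sketch; the column name and date format are copied from the Scala example above):
from pyspark.sql import functions as F

somedata = spark.createDataFrame(
    [(1, "11-11-2019"), (2, "11-11-2019")],
    ["id", "processDate"])

newDF = somedata.withColumn("processDate", F.to_date("processDate", "MM-dd-yyyy"))
newDF.show()
newDF.printSchema()  # processDate: date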

create new columns from string column

I have a DataFrame with a string column:
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")).toDF("col_str")
+--------------------+
| col_str|
+--------------------+
|0003C32C-FC1D-482...|
+--------------------+
The string column is comprised of character sequences separated by whitespace. If a character sequence starts with 0, I want to return the second number and the last number of the sequence. The second number can be any number between 0 and 8.
Array("8,3", "6,1", "7,1", "6,1", "7,1", "8,3")
I then want to transform the array of pairs into 9 columns, with the first number of the pair as the column and the second number as the value. If a number is missing, it will get a value of 0.
For example
val df = Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:1", 0, 0, 0, 0, 0, 0, 1, 1, 3)).toDF("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
+--------------------+----+----+----+----+----+----+----+----+----+
| col_str|col0|col1|col2|col3|col4|col5|col6|col7|col8|
+--------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482...| 0| 0| 0| 0| 0| 0| 1| 1| 3|
+--------------------+----+----+----+----+----+----+----+----+----+
I don't care whether the solution is in Scala or Python.
You can do the following (commented for clarity)
//string defining
val str = """0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3"""
//string splitting with space
val splittedStr = str.split(" ")
//for every element that starts with 0, map the second number to a column name (col<second number>) and the last number to its value, and prepend the id as col_str
val parsedStr = (List("col_str" -> splittedStr.head) ++ splittedStr.tail.filter(_.startsWith("0")).map{value => val splittedValue = value.split("[,:]"); "col" + splittedValue(1) -> splittedValue.last}).toMap
//expected header names
val expectedHeader = Seq("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
//populating 0 for the missing header names in the parsed string in above step
val missedHeaderWithValue = expectedHeader.diff(parsedStr.keys.toSeq).map((_->"0")).toMap
//combining both the maps
val expectedKeyValues = parsedStr ++ missedHeaderWithValue
//converting to a dataframe
Seq(expectedDF(expectedKeyValues(expectedHeader(0)), expectedKeyValues(expectedHeader(1)), expectedKeyValues(expectedHeader(2)), expectedKeyValues(expectedHeader(3)), expectedKeyValues(expectedHeader(4)), expectedKeyValues(expectedHeader(5)), expectedKeyValues(expectedHeader(6)), expectedKeyValues(expectedHeader(7)), expectedKeyValues(expectedHeader(8)), expectedKeyValues(expectedHeader(9))))
.toDF()
.show(false)
which should give you
+------------------------------------+----+----+----+----+----+----+----+----+----+
|col_str |col0|col1|col2|col3|col4|col5|col6|col7|col8|
+------------------------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482F-B543-3CBD7F0A0E36|0 |0 |0 |0 |0 |0 |1 |1 |3 |
+------------------------------------+----+----+----+----+----+----+----+----+----+
and of course you would need the expectedDF case class defined somewhere outside of this snippet:
case class expectedDF(col_str: String, col0: String, col1: String, col2: String, col3: String, col4: String, col5: String, col6: String, col7: String, col8: String)
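If you would rather do this from pyspark, one option is a UDF that returns a struct and is then expanded with select("r.*") (a rough sketch of the same parsing idea, not the answer's code; the function name, schema, and variable names are mine):
from pyspark.sql import functions as F, types as T

s = ("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 "
     "2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")
df = spark.createDataFrame([(s,)], ["col_str"])

out_schema = T.StructType(
    [T.StructField("col_str", T.StringType())]
    + [T.StructField("col%d" % i, T.IntegerType()) for i in range(9)])

@F.udf(out_schema)
def parse(value):
    tokens = value.split(" ")                    # first token is the id
    pairs = {}
    for t in tokens[1:]:
        nums = t.replace(":", ",").split(",")
        if nums[0] == "0":                       # keep only groups starting with 0
            pairs[int(nums[1])] = int(nums[-1])  # second number -> last number
    return [tokens[0]] + [pairs.get(i, 0) for i in range(9)]

df.select(parse("col_str").alias("r")).select("r.*").show(truncate=False)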

How to unpack multiple keys in a Spark DataSet

I have the following DataSet, with the following structure.
case class Person(age: Int, gender: String, salary: Double)
I want to determine the average salary by gender and age, so I group the DS by both keys. I've encountered two main problems: one is that both keys end up mixed into a single column, but I want to keep them in two different columns; the other is that the aggregated column gets a silly long name and I can't figure out how to rename it (apparently as and alias won't work), all of this using the DS API.
val df = sc.parallelize(List(Person(27, "male", 100000.00),
Person(27, "male", 120000.00),
Person(26, "male", 95000),
Person(31, "female", 89000),
Person(51, "female", 250000),
Person(51, "female", 120000)
)).toDF.as[Person]
df.groupByKey(p => (p.gender, p.age)).agg(typed.avg(_.salary)).show()
+-----------+------------------------------------------------------------------------------------------------+
| key| TypedAverage(line2503618a50834b67a4b132d1b8d2310b12.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Person)|
+-----------+------------------------------------------------------------------------------------------------+
|[female,31]| 89000.0...
|[female,51]| 185000.0...
| [male,27]| 110000.0...
| [male,26]| 95000.0...
+-----------+------------------------------------------------------------------------------------------------+
Aliasing is an untyped action, so you must retype it after. And the only way to unpack the key is to do it after, via a select or something:
df.groupByKey(p => (p.gender, p.age))
.agg(typed.avg[Person](_.salary).as("average_salary").as[Double])
.select($"key._1",$"key._2",$"average_salary").show
The easiest way to achieve both goals is to map() from the aggregation result to the Person instance again:
.map{case ((gender, age), salary) => Person(gender, age, salary)}
The result will look best if you slightly re-arrange the order of the arguments in the case class's constructor:
case class Person(gender: String, age: Int, salary: Double)
+------+---+--------+
|gender|age| salary|
+------+---+--------+
|female| 31| 89000.0|
|female| 51|185000.0|
| male| 27|110000.0|
| male| 26| 95000.0|
+------+---+--------+
Full code:
import session.implicits._
val df = session.sparkContext.parallelize(List(
Person("male", 27, 100000),
Person("male", 27, 120000),
Person("male", 26, 95000),
Person("female", 31, 89000),
Person("female", 51, 250000),
Person("female", 51, 120000)
)).toDS
import org.apache.spark.sql.expressions.scalalang.typed
df.groupByKey(p => (p.gender, p.age))
.agg(typed.avg(_.salary))
.map{case ((gender, age), salary) => Person(gender, age, salary)}
.show()
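For what it's worth, the untyped API avoids both problems, so from pyspark (as in the question at the top of this page) the whole thing would be roughly (a sketch; it assumes a DataFrame called people with gender, age and salary columns):
from pyspark.sql import functions as F

(people
    .groupBy("gender", "age")
    .agg(F.avg("salary").alias("average_salary"))
    .show())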

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join.
Let's say I have two datasets with the fields date | value; in the DataFrame case my join would look like:
val dfA : DataFrame
val dfB : DataFrame
dfA.join(dfB, dfB("date") === dfA("date") )
However for Dataset there is the .joinWith method, but the same approach does not work:
val dfA : Dataset
val dfB : Dataset
dfA.joinWith(dfB, ? )
What is the argument required by .joinWith ?
To use joinWith you first have to create a DataSet, and most likely two of them. To create a DataSet, you need to create a case class that matches your schema and call DataFrame.as[T] where T is your case class. So:
case class KeyValue(key: Int, value: String)
val df = Seq((1,"asdf"),(2,"34234")).toDF("key", "value")
val ds = df.as[KeyValue]
// org.apache.spark.sql.Dataset[KeyValue] = [key: int, value: string]
You could also skip the case class and use a tuple:
val tupDs = df.as[(Int,String)]
// org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]
Then if you had another case class / DF, like this say:
case class Nums(key: Int, num1: Double, num2: Long)
val df2 = Seq((1,7.7,101L),(2,1.2,10L)).toDF("key","num1","num2")
val ds2 = df2.as[Nums]
// org.apache.spark.sql.Dataset[Nums] = [key: int, num1: double, num2: bigint]
Then, while the syntax of join and joinWith are similar, the results are different:
df.join(df2, df.col("key") === df2.col("key")).show
// +---+-----+---+----+----+
// |key|value|key|num1|num2|
// +---+-----+---+----+----+
// | 1| asdf| 1| 7.7| 101|
// | 2|34234| 2| 1.2| 10|
// +---+-----+---+----+----+
ds.joinWith(ds2, df.col("key") === df2.col("key")).show
// +---------+-----------+
// | _1| _2|
// +---------+-----------+
// | [1,asdf]|[1,7.7,101]|
// |[2,34234]| [2,1.2,10]|
// +---------+-----------+
As you can see, joinWith leaves the objects intact as parts of a tuple, while join flattens out the columns into a single namespace. (Which will cause problems in the above case because the column name "key" is repeated.)
Curiously enough, I have to use df.col("key") and df2.col("key") to create the conditions for joining ds and ds2 -- if you use just col("key") on either side it does not work, and ds.col(...) doesn't exist. Using the original df.col("key") does the trick, however.
From https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html
it looks like you could just do
dfA.as("A").joinWith(dfB.as("B"), $"A.date" === $"B.date" )
For the above example, you can try the below:
Define a case class for your output
case class JoinOutput(key:Int, value:String, num1:Double, num2:Long)
Join the two Datasets on Seq("key"). This helps you avoid two duplicate key columns in the output, which also makes it easier to apply the case class or fetch the data in the next step:
val joined = ds.join(ds2, Seq("key")).as[JoinOutput]
// res27: org.apache.spark.sql.Dataset[JoinOutput] = [key: int, value: string ... 2 more fields]
The result will be flat instead:
joined.show
+---+-----+----+----+
|key|value|num1|num2|
+---+-----+----+----+
| 1| asdf| 7.7| 101|
| 2|34234| 1.2| 10|
+---+-----+----+----+
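A side note for pyspark readers: there is no joinWith on the Python side, but joining on a column name gives the same flat, de-duplicated result as the Seq("key") version above (a sketch; it assumes two pyspark DataFrames df and df2 that share a key column):
# equi-join on the shared column name; the result keeps a single "key" column
df.join(df2, on="key", how="inner").show()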

groupBykey in spark

New to Spark here, and I'm trying to read a pipe-delimited file in Spark. My file looks like this:
user1|acct01|A|Fairfax|VA
user1|acct02|B|Gettysburg|PA
user1|acct03|C|York|PA
user2|acct21|A|Reston|VA
user2|acct42|C|Fairfax|VA
user3|acct66|A|Reston|VA
and I do the following in scala:
scala> case class Accounts (usr: String, acct: String, prodCd: String, city: String, state: String)
defined class Accounts
scala> val accts = sc.textFile("accts.csv").map(_.split("|")).map(
| a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4)))
| )
I then try to group the key-value pairs by the key, and I'm not sure if I'm doing this right... is this how I do it?
scala> accts.groupByKey(2)
res0: org.apache.spark.rdd.RDD[(String, Iterable[Accounts])] = ShuffledRDD[4] at groupByKey at <console>:26
I thought the (2) would give me the first two results back, but I don't seem to get anything back at the console...
If I run a distinct...I get this too..
scala> accts.distinct(1).collect(1)
<console>:26: error: type mismatch;
found : Int(1)
required: PartialFunction[(String, Accounts),?]
accts.distinct(1).collect(1)
EDIT:
Essentially I'm trying to get to a key-value pair nested mapping. For example, user1 would look like this:
user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
I'm trying to learn this step by step, so I thought I'd break it down into chunks to understand...
I think you might have better luck if you put your data into a DataFrame, since you've already gone through the process of defining a schema. First off, you need to modify the split call to use single quotes (see this question). Also, you can get rid of the a(0) at the beginning. Then, converting to a DataFrame is trivial. (Note that DataFrames are available in Spark 1.3+.)
val accts = sc.textFile("/tmp/accts.csv").map(_.split('|')).map(a => Accounts(a(0), a(1), a(2), a(3), a(4)))
val df = accts.toDF()
Now df.show produces:
+-----+------+------+----------+-----+
| usr| acct|prodCd| city|state|
+-----+------+------+----------+-----+
|user1|acct01| A| Fairfax| VA|
|user1|acct02| B|Gettysburg| PA|
|user1|acct03| C| York| PA|
|user2|acct21| A| Reston| VA|
|user2|acct42| C| Fairfax| VA|
|user3|acct66| A| Reston| VA|
+-----+------+------+----------+-----+
It should be easier for you to work with the data. For example, to get a list of the unique users:
df.select("usr").distinct.collect()
produces
res42: Array[org.apache.spark.sql.Row] = Array([user1], [user2], [user3])
For more details, check out the docs.
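If you end up loading this from pyspark instead, the equivalent read is short (a sketch; the path and column names are the ones from this question):
df = (spark.read
      .csv("/tmp/accts.csv", sep="|")
      .toDF("usr", "acct", "prodCd", "city", "state"))
df.select("usr").distinct().show()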
3 observations that may help you understand the problem:
1) groupByKey(2) does not return the first 2 results; the parameter 2 is used as the number of partitions for the resulting RDD. See docs.
2) collect does not take an Int parameter. See docs.
3) split takes 2 types of parameters, Char or String. The String version uses regex, so "|" needs escaping if intended as a literal.