I have the following data frame:
+-----+-----------------+
|_name|data             |
+-----+-----------------+
|Test |{[{0, 0, 1, 0 }]}|
+-----+-----------------+
I want the output as:
+--------+----+
|allNames|data|
+--------+----+
|Test    |0   |
|Test    |0   |
|Test    |1   |
|Test    |1   |
+--------+----+
I tried the explode function, but the following code just returns the same data frame as above with just the headers changed. How can I change the code to get the expected output?
val t = cabinetDF.select(col("_attrname").as("allNames"),functions.explode(array(col("ROWDATA"))))
t.show(false)
This worked for me, but then I'm creating an ordinary array. I'm not sure why yours is wrapped in an additional set of curly brackets and an additional set of square brackets:
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._
val df = Seq( ("Test", Array( 0, 0, 1, 0 ) ) )
.toDF("_name", "data")
df.show
val df2 = df
.withColumn("x", explode($"data"))
.select("_name", "x")
df2.show
My results:
Perhaps you need to unpack it first or look at how it's being created in the first place.
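One likely reason the original attempt changed nothing: array(col("ROWDATA")) wraps the column in a new one-element array, so explode emits exactly one row that still holds the whole original value. The sketch below mimics that behaviour on ordinary Python lists. It is not Spark code, and the helper name explode_rows is made up for illustration.

```python
def explode_rows(rows):
    """Mimic explode: emit one (name, element) pair per element of each row's list."""
    out = []
    for name, items in rows:
        for item in items:
            out.append((name, item))
    return out


rows = [("Test", [0, 0, 1, 0])]

# Exploding the list itself yields one output row per element.
exploded = explode_rows(rows)

# Wrapping the list in another one-element list first (the effect of
# array(col(...))) yields a single row that still contains the whole list.
wrapped = explode_rows([(name, [items]) for name, items in rows])

print(exploded)
print(wrapped)
```

If the real column is nested as the curly and square brackets suggest, the inner array would have to be selected first and then exploded on its own.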
I have a dataframe with 100 million rows and ~ 10,000 columns. The columns are of two types, standard (C_i) followed by dynamic (X_i). This dataframe was obtained after some processing, and the performance was fast. Now only 2 steps remain:
Goal:
1. A particular operation needs to be done on every X_i using an identical subset of C_i columns.
2. Convert each X_i column to FloatType.
Difficulty:
Performance degrades terribly with increasing number of columns.
After a while, only 1 executor seems to work (%CPU use < 200%), even on a sample data with 100 rows and 1,000 columns. If I push it to 1,500 columns, it crashes.
Minimal code:
import spark.implicits._
import org.apache.spark.sql.functions._ // needed for udf and col
import org.apache.spark.sql.types.FloatType
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val columns = Seq("C1", "C2", "X1", "X2", "X3", "X4")
val data = Seq(("abc", "212", "1", "2", "3", "4"),("def", "436", "2", "2", "1", "8"),("abc", "510", "1", "2", "5", "8"))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns:_*)
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols, foos_udf(col("C2"),col(cols)))
}
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols,col(cols).cast(FloatType))
}
df.show()
Error on 1,500 column data:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.isStreaming(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
...
Thoughts:
1. Perhaps var could be replaced, but the size of the data is close to 40% of the RAM.
2. Perhaps the for loop used for dtype casting is degrading performance, though I can't see how, or what the alternatives are. From searching the internet, I have seen people suggest a foldLeft-based approach, but that apparently still gets translated to a for loop internally.
Any inputs on this would be greatly appreciated.
A faster solution was to call the UDF once on the whole row rather than on each column. As Spark stores data as rows, the earlier approach exhibited terrible performance.
def my_udf(names: Array[String]) = udf[String,Row]((r: Row) => {
val row = Array.ofDim[String](names.length)
for (i <- 0 until row.length) {
row(i) = r.getAs(i)
}
...
})
...
val df2 = df1.withColumn(results_col,my_udf(df1.columns)(struct("*"))).select(col(results_col))
Type casting can then be done as suggested by Riccardo.
I'm not sure whether this will fix the performance on your side with ~10,000 columns, but I was able to run it locally with 1,500 columns using the following code.
I addressed points #1 and #2, which may have had some impact on performance. One note: to my understanding, foldLeft should be a pure recursive function without an internal for loop, so it might have an impact on performance in this case.
Also, the two for loops can be merged into a single loop, which I refactored as a foldLeft.
We might also get a performance increase if we replace the UDF with a built-in Spark function.
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val numberOfColumns = 1500
val numberOfRows = 100
val colNames = (1 to numberOfColumns).map(s => s"X$s")
val colValues = (1 to numberOfColumns).map(_.toString)
val columns = Seq("C1", "C2") ++ colNames
val schema = StructType(columns.map(field => StructField(field, StringType)))
val rowFields = Seq("abc", "212") ++ colValues
val listOfRows = (1 to numberOfRows).map(_ => Row(rowFields: _*))
val listOfRdds = spark.sparkContext.parallelize(listOfRows)
val df = spark.createDataFrame(listOfRdds, schema)
df.show()
val newDf = df.columns.drop(2).foldLeft(df)((df, colName) => {
df.withColumn(colName, foos_udf(col("C2"), col(colName)) cast FloatType)
})
newDf.show()
Hope this helps!
*** EDIT
Found a much better solution that avoids loops altogether. Simply build a single list of expressions and pass it to selectExpr; this way Spark casts all columns in one go, without any kind of recursion.
Starting from my previous example, replace the foldLeft with the lines below. I just tested it with 10k columns and 100 rows on my local computer, and it took a few seconds.
val selectExpression = Seq("C1", "C2") ++ colNames.map(s => s"cast($s as float)")
val newDf = df.selectExpr(selectExpression:_*)
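The expression list is nothing more than a sequence of SQL strings, so it can be generated and inspected before Spark is involved at all. Here is a small Python sketch of the same idea. The helper name build_cast_exprs is hypothetical, and the trailing "as <name>" alias is my addition so that each column keeps its original name after the cast.

```python
def build_cast_exprs(key_cols, value_cols, target_type="float"):
    """Return select expressions: key columns as-is, value columns cast and re-aliased."""
    exprs = list(key_cols)
    for name in value_cols:
        exprs.append(f"cast({name} as {target_type}) as {name}")
    return exprs


number_of_columns = 1500
col_names = [f"X{i}" for i in range(1, number_of_columns + 1)]

exprs = build_cast_exprs(["C1", "C2"], col_names)

print(len(exprs))   # two key columns plus one expression per X column
print(exprs[:4])
```

In PySpark a list like this would presumably be passed as df.selectExpr(*exprs); that call is not run here.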
I have a df in which one of the columns is a set of words. How can I make them lower case in an efficient way?
The df has many columns, but the column that I am trying to make lower case looks like this:
B
['Summer','Air Bus','Got']
['Parmin','Home']
Note:
In pandas I do df['B'].str.lower()
If I understood you correctly, you have a column that is an array of strings.
To lowercase the strings, you can apply the lower function to each element with transform, like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
{"B": ["Summer", "Air Bus", "Got"]},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = df.withColumn("result", F.expr("transform(B, x -> lower(x))"))
Result:
+----------------------+----------------------+
|B |result |
+----------------------+----------------------+
|[Summer, Air Bus, Got]|[summer, air bus, got]|
+----------------------+----------------------+
A slight variation on @vladsiv's answer, which tries to answer a question in the comments about passing a dynamic column name.
# set column name
m = "B"
# use F.transform directly (available in PySpark 3.1+), rather than inside an F.expr
df = df.withColumn("result", F.transform(F.col(m), lambda x:F.lower(x)))
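For readers coming from pandas, it may help to see what the transform is doing per row: it walks each array and applies lower to every element. Below is a plain-Python sketch of that per-row logic. It is not Spark code, the function name lower_all is made up, and passing None through unchanged is my assumption about how missing values should behave.

```python
def lower_all(words):
    """Lowercase every string in a list, passing None through unchanged."""
    if words is None:
        return None
    lowered = []
    for word in words:
        lowered.append(word.lower() if word is not None else None)
    return lowered


rows = [["Summer", "Air Bus", "Got"], ["Parmin", "Home"], None]

result = []
for words in rows:
    result.append(lower_all(words))

print(result)
```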
EX1. This RDD version fails with a serialization error, as expected, with or without the enclosing object; val num is the culprit. Fine:
object Example {
val r = (1 to 1000000).toList
val rdd = sc.parallelize(r,3)
val num = 1
val rdd2 = rdd.map(_ + num)
rdd2.collect
}
Example
EX2. Using a DataFrame in a similar fashion, however, does not fail. Why is that, when the code looks much the same? What am I missing here?
object Example {
import spark.implicits._
import org.apache.spark.sql.functions._
val n = 1
val df = sc.parallelize(Seq(
("r1", 1, 1),
("r2", 6, 4),
("r3", 4, 1),
("r4", 1, 2)
)).toDF("ID", "a", "b")
df.repartition(3).withColumn("plus1", $"b" + n).show(false)
}
Example
The reasons for the DataFrame behaviour are not entirely clear to me; I would expect similar behaviour. It looks like Datasets circumvent some issues, but I may well be missing something.
Running on Databricks gives plenty of serialization issues, so I do not think the platform is affecting things; it is handy for testing.
The reason is simple and more fundamental than the distinction between RDD and Dataset:
The first piece of code evaluates a function
_ + num
therefore its closure has to be computed and then serialized.
The second piece of code doesn't. The following
$"b" + n
is just a value, therefore no closure computation and subsequent serialization is required.
If this is still not clear you can think about it this way:
The former piece of code tells Spark how to do something.
The latter piece of code tells Spark what to do. Actual code that is executed is generated in different scope.
If your Dataset code was closer to its RDD counterpart, for example:
object Example {
import spark.implicits._
val num = 1
spark.range(1000).map(_ + num).collect
}
or
object Example {
import spark.implicits._
import org.apache.spark.sql.functions._
val num = 1
val f = udf((x: Int) => x + num)
spark.range(1000).select(f($"id")).collect
}
it would fail with a serialization exception, same as the RDD version does.
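The "how" versus "what" distinction can be shown with a loose analogy in plain Python. This is only a sketch, not Spark's actual mechanism: the tuple-based expression format and the evaluate helper are invented for illustration. A closure is code plus captured variables that must be shipped as a unit, whereas an expression is plain data with the value of the variable already copied into it.

```python
import pickle

num = 1

# "How": a closure. Its body and the variable it refers to (num) would have to
# be shipped to whoever runs it. The standard pickle module refuses to
# serialize a lambda, which makes the shipping problem visible.
add_num = lambda x: x + num
try:
    pickle.dumps(add_num)
    closure_serialized = True
except (pickle.PicklingError, AttributeError):
    closure_serialized = False

# "What": a description of the computation as plain data. The current value of
# num is copied into the description when it is built, so nothing is captured.
expr = ("add", "b", num)
payload = pickle.dumps(expr)


def evaluate(expression, row):
    """Interpret an ("add", column, literal) description against one row."""
    op, column, literal = expression
    if op != "add":
        raise ValueError("unsupported operation: " + op)
    return row[column] + literal


# The receiving side rebuilds the description and runs its own code for it.
result = evaluate(pickle.loads(payload), {"b": 4})

print(closure_serialized, result)
```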
So, what I'm doing below is I drop a column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting data frames.
df = df_data.drop('A').join(
df_data[['ID', 'A']].rdd\
.map(lambda x: (x.ID, json.loads(x.A))
if x.A is not None else (x.ID, None))\
.toDF()\
.withColumnRenamed('_1', 'ID')\
.withColumnRenamed('_2', 'A'),
['ID']
)
What I dislike about this is, of course, the overhead of the withColumnRenamed operations.
With pandas, all I'd do is something like this:
pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So is there an easier way than what I'm doing in my first code snippet?
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
.map(
lambda row: tuple(
[row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
for c in cols]
)
)\
.toDF(cols)
*I haven't actually tested this code, but I think this should work.
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
Where my_transformation_function() can be a udf or a pyspark sql function.
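Whatever goes into my_transformation_function(), the per-value logic is the same None-safe JSON parse as in the question. Here is a minimal plain-Python sketch of that logic, applied to a list of dictionaries standing in for rows. The name parse_json_or_none and the sample rows are made up for illustration, and no Spark is involved.

```python
import json


def parse_json_or_none(value):
    """Parse a JSON string, leaving missing values as None."""
    return json.loads(value) if value is not None else None


rows = [
    {"ID": 1, "A": "[0, 0, 0]"},
    {"ID": 2, "A": None},
]

# Replace column A in place and leave every other field untouched.
transformed = []
for row in rows:
    new_row = dict(row)
    new_row["A"] = parse_json_or_none(row["A"])
    transformed.append(new_row)

print(transformed)
```

The same function could then be wrapped as a udf and applied with withColumn, so no drop, join, or rename is needed.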
From what I could understand, is something like this what you are trying to achieve?
import pyspark.sql.functions as F
import json
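# F.udf defaults to a StringType return type; pass returnType explicitly if a typed column is needed
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)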
cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] + \
[col for col in cols if col != 'A'])
I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I got confused about key-value format using Scala.
I have a txt file in home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
that is, with ".txt" at the end of the path (the first call had none), then rdd.collect() fails with an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But the output of rddd.collect() does not look right, because each word occurs twice. As a result I cannot go on to the remaining operations such as mapbykey, reducebykey or others. What did I do wrong?
Just as an example, I create a String with your dataset, split it by line, and use SparkContext's parallelize method to create an RDD. Notice that after creating the RDD I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
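About your first question: files do not need any extension, because in this case they are simply read as plain text. The path only has to match the actual file name, so appending ".txt" to the name of a file that was saved without it points at a path that does not exist.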
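On the second question, the tuple built in the question is (first word, whole line), which is why every word shows up twice. Using the second field of the split as the value gives proper key/value pairs; in the Scala shell that would presumably be x => (x.split(" ")(0), x.split(" ")(1)). Below is a plain-Python sketch of the same shape, followed by a per-key sum of the kind reduceByKey would produce. It is not Spark code, and the variable names are made up.

```python
lines = ["panda 0", "pink 3", "pirate 3", "panda 1", "pink 4"]

# Split each line once and keep the two parts as (key, value).
pairs = []
for line in lines:
    key, value = line.split(" ")
    pairs.append((key, int(value)))

# Sum the values per key, similar in spirit to reduceByKey(_ + _) on an RDD.
totals = {}
for key, value in pairs:
    totals[key] = totals.get(key, 0) + value

print(pairs)
print(totals)
```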