How to modify one column value in one row using PySpark

I want to update a value where userid=22650984. How can I do this in PySpark? Thank you for helping.
>>>xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+--------+----------------------------+
|userid |registration_time |
+--------+----------------------------+
|22650984|270972-04-26 13:14:46.345152|
+--------+----------------------------+

If you want to modify a subset of your DataFrame and keep the rest unchanged, the best option would be to use pyspark.sql.functions.when() as using filter or pyspark.sql.functions.where() would remove all rows where the condition is not met.
from pyspark.sql.functions import col, when

valueWhenTrue = None  # for example

df = df.withColumn(
    "existingColumnToUpdate",
    when(
        col("userid") == 22650984,
        valueWhenTrue
    ).otherwise(col("existingColumnToUpdate"))
)
when() evaluates the first argument as a boolean condition. If the condition is True, it returns the second argument. You can chain together multiple when statements, as shown in this post and also this post, or use otherwise() to specify what to return when the condition is False.
In this example, I am updating the existing column "existingColumnToUpdate". When userid equals the specified value, the column is set to valueWhenTrue; otherwise the existing value is kept unchanged.
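For completeness, a minimal self-contained sketch of the same pattern (the sample data and the chosen replacement value are illustrative, not taken from the original question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for xxDF
xxDF = spark.createDataFrame(
    [(22650984, "270972-04-26 13:14:46.345152"), (12345, "2018-01-01 00:00:00")],
    ["userid", "registration_time"],
)

# Overwrite registration_time only for the matching userid; all other rows keep their value.
# The replacement value here is just an example.
fixedDF = xxDF.withColumn(
    "registration_time",
    when(col("userid") == 22650984, lit("2018-04-08 10:57:00")).otherwise(col("registration_time")),
)
fixedDF.show(truncate=False)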

Change Value of a Dataframe Column Based on a Filter:
from pyspark.sql.functions import lit

new_df = xxDf.filter(xxDf.userid == "22650984").withColumn("column_to_update", lit(<update_expression>))

You can use withColumn to achieve what you are looking to do:
new_df = xxDf.filter(xxDf.userid == "22650984").withColumn("field_to_update", <update_expression>)
The update_expression would contain your update logic - it could be a UDF, a derived field, etc.
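For instance, a sketch of using a hypothetical UDF as the update expression, assuming xxDf from the answer above exists (the function body is purely illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical UDF: blank out obviously corrupted registration_time values (illustrative logic only)
fix_time = udf(lambda ts: None if ts is None or ts.startswith("270972") else ts, StringType())

new_df = xxDf.filter(xxDf.userid == "22650984").withColumn("registration_time", fix_time(xxDf.registration_time))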

Related

In Spark I am not able to filter by existing column

I am trying to filter by one of the columns in the dataframe using Spark, but Spark throws the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Inv. Pty' given input columns: [Pstng Date, Name 1, Inv. Pty, Year]
invDF.filter(col("Inv. Pty") === "2001075").show()
Try this with backticks (`):
invDF.filter(col("`Inv. Pty`") === "2001075").show()
The issue is that Spark treats a column name containing a dot as a reference to a struct field. To counter that, you need to escape the name with backticks (`). This should work:
invDF.filter(col("`Inv. Pty`") === "2001075").show()
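Since the page is tagged pyspark: the same backtick escaping applies in PySpark as well (a minimal sketch, assuming invDF exists there too):

from pyspark.sql.functions import col

# Backticks make Spark treat "Inv. Pty" as a literal column name rather than a struct access
invDF.filter(col("`Inv. Pty`") == "2001075").show()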
Not sure, but given input columns: [Pstng Date, Name 1, Inv. Pty, Year], the column name Inv. Pty contains an extra space - that might be the problem.

Replacing a character in dataframe string columns in Scala

I have a task to remove all line delimiters (\n) from all string columns in a table.
The number of table columns is unknown, the code should process any table.
I wrote code that goes through all columns in a loop, retrieves each column's data type, and replaces the line delimiter:
// let's assume we already have a dataframe 'df' (declared as a var) that can contain any table
df.cache()
val dfTypes = df.dtypes
for (i <- 0 to (dfTypes.length - 1)) {
  val tupCol = dfTypes(i)
  if (tupCol._2 == "StringType") {
    df.unpersist()
    df = df.withColumn(tupCol._1, regexp_replace(col(tupCol._1), "\n", " "))
    df.cache()
  }
}
df.unpersist()
The code itself works fine, but when I run this code for ~50 tables in parallel I constantly get the following error for one random table:
18/11/20 04:31:41 WARN TaskSetManager: Lost task 9.0 in stage 6.0 (TID 29, ip-10-114-4-145.us-west-2.compute.internal, executor 1): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:260)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:190)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.close(UnsafeRowSerializer.scala:96)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:156)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I can run fewer or more than 50 jobs, but exactly one (random) job keeps failing.
The jobs are running on EMR cluster with the following configuration:
Master node: r4.2xlarge x 1
Core nodes: m5.2xlarge x 3
Task nodes: m5.2xlarge x (Autoscaling from 1 to 10)
I think my code consumes a lot of memory and disk space because it creates new dataframes in a loop, but I do not see any other way to process a table without knowing the number of string columns.
I need a suggestion on how to optimize the code.
Thanks.
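One way to avoid building a new cached DataFrame per column is to apply all replacements in a single projection. A minimal PySpark sketch of that idea (the original code is Scala, but the pattern translates directly; df is assumed to already exist):

from pyspark.sql.functions import col, regexp_replace

# One select that replaces "\n" in every string column and passes the other columns through unchanged
cleaned = df.select(
    [
        regexp_replace(col(name), "\n", " ").alias(name) if dtype == "string" else col(name)
        for name, dtype in df.dtypes
    ]
)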

RandomForestClassifier for multiclass classification Spark 2.x

I'm trying to use random forest for multiclass classification using Spark 2.1.1.
After defining my pipeline as usual, it fails during the indexing stage.
I have a dataframe with many string type columns. I have created a StringIndexer for each of them.
I am creating a Pipeline by chaining the StringIndexers with VectorAssembler and finally a RandomForestClassifier following by a label converter.
I've checked all my columns with distinct().count() to make sure I do not have too many categories and so on...
After some debugging, I found that whenever I start indexing certain columns I get the following errors...
When calling:
val indexer = udf { label: String =>
if (labelToIndex.contains(label)) {
labelToIndex(label)
} else {
throw new SparkException(s"Unseen label: $label.")
}
}
Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'
Then inside the transformation, there is this error when defining the metadata:
Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex
Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()
This is happening because I have nulls in some of the columns that I'm indexing.
I could reproduce the error with the following example.
val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0),
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1),
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")
indexer.fit(df).transform(df).show
// java.lang.NullPointerException
https://issues.apache.org/jira/browse/SPARK-11569
https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
The solution presented here can be used, and as of Spark 2.2.0 the issue is fixed upstream.
You can use
DataFrame.na.fill(Map("colName1" -> val1, "colName2" -> val2, ...))
Where:
DataFrame is your DataFrame object, "colName" is the name of the column, and val is the value used to replace any nulls found in column "colName".
Apply the feature transformations after filling all nulls.
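A minimal PySpark sketch of the same workaround, filling nulls with a placeholder category before indexing (the placeholder string and sample data are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("asd2s", 1.1), (None, 1.2), ("bd34t", 5.1)], ["x0", "x2"])

# Replace nulls with a placeholder category so StringIndexer does not hit a NullPointerException
filled = df.na.fill({"x0": "missing"})

indexer = StringIndexer(inputCol="x0", outputCol="x0idx")
indexer.fit(filled).transform(filled).show()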
You can check the number of nulls in all columns as follows:
for (column <- DataFrame.columns) {
  println(column + ": " + DataFrame.filter(DataFrame(column).isNull || DataFrame(column).isNaN).count())
}
OR
DataFrame.count() will give you the total number of rows in the DataFrame; the number of nulls per column can then be judged from DataFrame.describe(), since its count row excludes nulls.
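For reference, a single-pass PySpark sketch that counts nulls per column without a loop (df is assumed to exist):

from pyspark.sql.functions import col, count, when

# One aggregation row: for each column, count the rows where that column is null
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()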

How to get the difference between minimum and maximum stock for each product?

I want to get the maximum and minimum stock for each product and calculate the difference between these values. If a stock value is null or empty, it should be replaced with 0.
This is my code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("product_pk")
val sales = df
.select($"product_pk",$"stock")
.withColumn("sold",max($"stock")-min($"stock").over(windowSpec))
But I get this error when I run my code. I do not understand why I cannot use Window with product_pk:
diagnostics: User class threw exception:
org.apache.spark.sql.AnalysisException: grouping expressions sequence
is empty, and 'mytable.product_pk' is not an aggregate function.
Wrap '(max(mytable.stock) AS _w0)' in windowing function(s) or
wrap 'mytable.product_pk' in first() (or first_value) if you don't
care which value you get.;;
Or should I use groupBy product_pk?
Currently you use max in the wrong context (no window is applied to it; the .over(windowSpec) only attaches to min). Try:
val sales = df
.select($"product_pk",$"stock")
.withColumn("sold",max($"stock").over(windowSpec)-min($"stock").over(windowSpec))
You can also use groupBy:
val sales = df
  .groupBy($"product_pk")
  .agg(
    max($"stock").as("max_stock"),
    min($"stock").as("min_stock")
  )
  .withColumn("sold", coalesce($"max_stock" - $"min_stock", lit(0.0)))

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The * unpacks the generator so that each column expression is passed as a separate argument, and the return value will be 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct

distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [support fraction, default 0.01])
This returns a dataframe with the frequent values per column, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark