I know this seems like a really simple question, and I have scoured Google and Stack Overflow for it, but could not find exactly what I need.
I have aggregated some data from one DataFrame, config, into another, config1, with the following code. The basis of the code was provided by another Stack Overflow member, #Sunny Shukla. Thank you.
import pyspark.sql.functions as f
# Take the max of every column so the original column names are retained
exprs = [f.max(c).alias(c) for c in config.columns]
config1 = (config.groupBy(["seq_id", "tool_id"])
           .agg(f.count(f.lit(1)).alias('count'), *exprs)
           .where('count = 1')
           .drop('count'))
The config DataFrame has 20 columns and the config1 DataFrame has 22 columns, because I grouped on 2 columns (seq_id and tool_id) but mapped all of the original columns so as to retain the original column names (I'm sure there is a more elegant way to do this).
The resulting DataFrame config1 therefore has duplicated seq_id and tool_id columns. If I call
config1.drop('seq_id', 'tool_id'), it drops all 4 of those columns and I end up with 18 columns instead of 20.
Is there a more elegant way to do this without writing UDFs?
Thank You
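For what it's worth, a rough sketch of one way to avoid the duplicated columns in the first place, assuming the same config DataFrame as above: build the aggregation expressions only for the non-grouping columns, so seq_id and tool_id appear exactly once in the result.
import pyspark.sql.functions as f

# Aggregate only the columns that are not grouping keys,
# so seq_id and tool_id are not duplicated in config1.
group_cols = ["seq_id", "tool_id"]
exprs = [f.max(c).alias(c) for c in config.columns if c not in group_cols]
config1 = (config.groupBy(group_cols)
           .agg(f.count(f.lit(1)).alias("count"), *exprs)
           .where("count = 1")
           .drop("count"))  # 20 columns, no duplicates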
I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far, this is what I've done:
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
I have almost 500 columns in my DataFrame, and when I execute that code it becomes incredibly slow, for a reason I don't know. Do you have any clue how I can make it work, and how I can optimize it? I need an optimized solution in PySpark. Thanks in advance.
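That snippet launches a separate filter-and-count job for every column, which is why it crawls with ~500 columns. A hedged sketch of a single-pass alternative, assuming df and tbl_columns_list are the same objects as above: aggregate the null counts for all columns in one job, then keep the names whose count is positive.
import pyspark.sql.functions as F

# One aggregation job: count the nulls in every column at once.
null_counts = df.select([
    F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in tbl_columns_list
]).first().asDict()

# Names of the columns that contain at least one null.
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]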
So, I have Parquet files separated into folders with a date in the path, something like
root_folder
|_date=20210101
|_ file_A.parquet
|_date=20210102
|_ file_B.parquet
file_A has 2 columns, X and Y; file_B has 3 columns, X, Y, and Z.
But when I query the date 20210102 using a SparkSession, it uses the schema from the topmost folder, which is 20210101, and when I try to query column Z it doesn't exist.
I've tried the mergeSchema=true option, but it doesn't fit my use case because I need to treat the files that have column Z differently, and I'm checking whether column Z exists using DataFrame.columns.
Is there any workaround for this? I need to get the schema only from the date I query.
If computational cost is not a concern, you can solve this problem by reading the entire dataset into Spark, filtering to the date you are looking for, and then dropping the column if it is entirely null.
This performs a pass over the data just to figure out whether the column should be dropped, which is not great. Luckily, .where and .count parallelize pretty well, so if you have enough compute it might be okay.
import org.apache.spark.sql.functions.col

val base = spark.read
  .option("mergeSchema", true)
  .parquet("root_folder/")
  .where(col("date") === "20210101")

// Drop Z only when it is entirely null in the selected date
val df = if (base.where(col("Z").isNotNull).count == 0) base.drop("Z") else base
df.schema // Should only have X, Y (plus the partition column date)
If you want to generalize this into a function that drops all empty columns, you can compute the .isNotNull count for all columns in 1 pass.
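A hedged sketch of that generalization, shown here in PySpark (the Scala version is analogous), assuming the DataFrame plays the role of base above: count the non-null values of every column in one aggregation, then drop the columns whose count is zero.
import pyspark.sql.functions as F

def drop_all_null_columns(df):
    # One pass: non-null count for every column.
    counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
    empty = [c for c, n in counts.items() if n == 0]
    return df.drop(*empty)

# e.g. drop_all_null_columns(base).schema keeps only X, Y and date for 20210101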
I'm comparing the data ingested into a Hive table with that of the source, and storing the differences in MariaDB. There are no primary keys on the tables, and I would like an optimized solution. Though I've used the except method to check for differences, I'm finding it difficult to print out which columns differ for the same row.
As far as I can tell, it's not possible to solve your problem in the absence of a primary key, because in that case each row of one DataFrame is potentially different from each row of the other DataFrame, and in practice you wouldn't want to report a difference against every row of the other DataFrame.
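For illustration, a minimal PySpark sketch of what an except-style comparison gives you without a key; hive_df and source_df are placeholder names for the two inputs. It reports whole mismatched rows from each side, which is about as far as you can go when there is nothing to join on.
# Placeholder DataFrames: hive_df is the ingested Hive table, source_df is the source.
# exceptAll keeps duplicate rows, which matters when there is no primary key.
only_in_hive = hive_df.exceptAll(source_df)
only_in_source = source_df.exceptAll(hive_df)

# Without a key there is no reliable way to pair these rows up,
# so per-column differences for "the same row" are not well defined.
only_in_hive.show()
only_in_source.show()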
I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name = df.select("name")
val name1 = name.collect()
But neither of the above returns the values of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the DataFrame. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will provide an output of 10 elements, but the output doesn't look good.
So, the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." in place of the actual value, which is annoying.
Hence, there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all of these cases you won't get a fair sample of the data, as only the first 10 rows will be picked. So, to truly pick randomly from the DataFrame, you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on DataFrame for more details.
The first will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
I have a table of distinct users, which has 400,000 users. I would like to split it into 4 parts, and expect each user to be located in one part only.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), for i from 0 to 3, into Parquet files. When I read the output directory back, group by user id, and count by part, there are some users located in two or more parts.
I still have no idea why.
I have found the solution: cache the DataFrame before you split it.
It should be:
val data = userList.cache().randomSplit(nsizes)
I still have no idea why. My guess is that each time the randomSplit function "fills" one of the splits, it reads records from userList, which is re-evaluated from the Parquet file(s) and returns a different order of rows; that's why some users are lost and some users are duplicated.
That's what I think. If someone has an answer or explanation, I will update. (A small verification sketch follows the references below.)
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/
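For what it's worth, a hedged PySpark sketch of how one might verify the problem and the fix, assuming the users DataFrame is called userList and has a user_id column (both names are placeholders for your actual schema): tag each split with its index, stack the splits back together, and count how many distinct parts each user appears in.
from functools import reduce
from pyspark.sql import functions as F

# Cache before splitting so every split sees the same evaluation of userList.
parts = userList.cache().randomSplit([0.25, 0.25, 0.25, 0.25])
tagged = [p.withColumn("part", F.lit(i)) for i, p in enumerate(parts)]
all_parts = reduce(lambda a, b: a.union(b), tagged)

# Users that ended up in more than one part; should be 0 with the cache in place.
dupes = (all_parts.groupBy("user_id")
         .agg(F.countDistinct("part").alias("n_parts"))
         .where("n_parts > 1"))
print(dupes.count())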
If your goal is to split it into different files, you can use functions.hash to calculate a hash, then mod 4 to get a number between 0 and 3, and when you write the Parquet output use partitionBy, which will create a directory for each of the 4 values.
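A minimal PySpark sketch of that suggestion, again assuming a user_id column and a placeholder output path: derive a stable bucket from the hash and let partitionBy create one directory per bucket, so each user always lands in exactly one part.
from pyspark.sql import functions as F

# hash() can return negative numbers, so normalize the result into 0..3.
bucketed = userList.withColumn("part", (F.hash("user_id") % 4 + 4) % 4)

# Writes part=0 ... part=3 directories; the same user_id always hashes to the
# same part, so no user is split across parts.
bucketed.write.partitionBy("part").parquet("root_output_folder")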