I am trying to get the first 15 columns from a DataFrame that contains more than 500 columns, but I don't know how to do it because this is my first time using Scala Spark.
I searched but only found how to select columns by name, for example:
val df2 = df.select("firstColName", "secondColeName")
How can I do this by index?
Thanks in advance!
Scala example:
df.selectExpr(df.columns.take(15):_*)
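An equivalent sketch using select instead of selectExpr, assuming the same df (take(15) keeps the first 15 column names in their original order):
import org.apache.spark.sql.functions.col
df.select(df.columns.take(15).map(col): _*)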
I have a DataFrame with two columns, id1 and id2, and I'd like to count the number of distinct values across these two columns combined. Essentially this is count(set(id1 + id2)).
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate: I'd like PySpark to calculate the count(). Of course it's possible to collect the two distinct lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
You can combine the two columns into one with union and then apply countDistinct:
import pyspark.sql.functions as F
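# union stacks df.id2 underneath df.id1 (the result keeps the column name 'id1');
# countDistinct then counts the unique values across both original columns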
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
I am trying to drop rows of a Spark DataFrame which contain a specific value in a specific column.
For example, if I have the following DataFrame, I'd like to drop all rows which have "two" in column "A", so I'd like to drop the rows with index 1 and 2.
I want to do this using Scala 2.11 and Spark 2.4.0.
      A  B   C
0   one  0   0
1   two  2   4
2   two  4   8
3   one  6  12
4 three  7  14
I tried something like this:
df = df.filer(_.A != "two")
or
df = df.filter(df("A") != "two")
Anyway, neither worked. Any suggestions on how that can be done?
Try:
df.filter(not($"A".contains("two")))
Or, if you are looking for an exact match:
df.filter(not($"A".equalTo("two")))
I finally found the solution in a very old post:
Is there a way to filter a field not containing something in a spark dataframe using scala?
The trick which does it is the following:
df = df.where(!$"A".contains("two")
I have a use case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want to have the latest value from the received_time column of the original DataFrame.
I want a DataFrame with the distinct src_ip values, their count, and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new to this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a DataFrame df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
import org.apache.spark.sql.functions._
val mydf = df.groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The above calculates, for each combination of the groupBy columns, the count of received_time values as well as the maximum timestamp for that group.
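If you instead want one row per distinct src_ip with its count and the latest received_time exposed as last_seen, as asked above, a minimal sketch along the same lines (column names assumed from the question):
import org.apache.spark.sql.functions._
val lastSeen = df.groupBy("src_ip")
  .agg(count("received_time").as("count"), max("received_time").as("last_seen"))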
I am trying to improve my Spark Scala skills and I have a case which I cannot find a way to handle, so please advise!
I have the original data as shown in the figure below:
I want to calculate the percentage that each value of the count column represents. E.g. the last error value is 64; what percentage of the sum of all the count values is 64? Please note that I am reading the original data as DataFrames using sqlContext:
Here is my code:
val df1 = df.groupBy(" Code")
.agg(sum("count").alias("sum"), mean("count")
.multiply(100)
.cast("integer").alias("percentage"))
I want results similar to this:
Thanks in advance!
Use agg and window functions:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
df
  .groupBy("code")
  .agg(sum("count").alias("count"))
  .withColumn("fraction", col("count") / sum("count").over())
Assume the following two DataFrames in PySpark with an equal number of rows:
df1:
|_ Column1a
|_ Column1b
df2:
|_ Column2a
|_ Column2b
I wish to create a new DataFrame "df" which has Column1a and Column2a only. What could be the best possible solution for it?
Denny Lee's answer is the way.
It involves adding a column to both DataFrames that holds a Unique_Row_ID for every row, joining on Unique_Row_ID, and then dropping Unique_Row_ID if required.
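A sketch of that idea in Scala Spark (the same functions exist under the same names in PySpark); it assumes both DataFrames have the same number of rows and that Unique_Row_ID is just an illustrative column name:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number over the order of monotonically_increasing_id gives each row a
// sequential ID; note this window pulls the data into a single partition
val w = Window.orderBy(monotonically_increasing_id())
val left = df1.withColumn("Unique_Row_ID", row_number().over(w))
val right = df2.withColumn("Unique_Row_ID", row_number().over(w))

val df = left.join(right, Seq("Unique_Row_ID"))
  .select("Column1a", "Column2a")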