I have a dataframe as below:
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe that gathers all terms for an id into one sequence, so that I can use them with Word2Vec. That is:
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
I then want to apply the sample code given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried converting the df to an RDD and mapping it, but I could not manage to convert the result back to a df.
Thanks in advance.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
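// Assumes master and app name are supplied externally (e.g. via spark-submit)
val conf = new SparkConf()
val sc = new SparkContext(conf)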
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
"url" -> "jdbc:oracle:thin:...",
"dbtable" -> "table"))
You can use collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
In Spark < 2.0 this requires a HiveContext, and in Spark 2.0+ you have to enable Hive support on the SparkSession builder. See "Use collect_list and collect_set in Spark SQL".
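To make the target shape concrete, here is a rough plain-Python sketch (no Spark involved) of what gathering terms per key produces. The function name `collect_terms` and the second id are made up for illustration; in Spark the same shape would come from a groupBy followed by collect_list.

```python
from collections import defaultdict

def collect_terms(rows):
    """Group (key, term) pairs into key -> list of terms, keeping input order."""
    grouped = defaultdict(list)
    for key, term in rows:
        grouped[key].append(term)
    return dict(grouped)

# Sample rows modelled on the question; the id 5 row is hypothetical.
rows = [
    (4, "inhibitori_effect"),
    (4, "novel_therapeut"),
    (4, "cell_line"),
    (5, "cancer_cell"),
]
sequences = collect_terms(rows)
```

Unlike this sketch, Spark does not guarantee the order of the elements that collect_list returns, and collect_set additionally drops duplicates.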
This is the data frame I get after reading a text file with the Spark context:
| _1| _2| _3|
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
The dataframe I need is:
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
Here is the code:
## from spark context
df_txt1=df_txt.map(lambda x: x.split(" "))
You can use the Spark CSV reader to read your comma-separated file.
To read it as a text file instead, take the first row as the header, split it into a Seq of String, and pass that to the toDF function. Also remove the header row from the RDD.
Note: the code below is written in Scala. You can convert it to lambda functions to make it work in PySpark.
import org.apache.spark.sql.functions._
val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
val filteredRDD = df.filter(x=> x!= header)
val finaldf = filteredRDD.map( _.split(",")).map(w => (w(0),w(1),w(2))).toDF(headerCol: _*)
w(0), w(1), w(2): you have to hard-code a fixed number of columns that matches your file.
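For anyone porting this to Python, the per-line logic can be sketched without Spark at all. This is only an illustration of the header handling: `split_header` is a hypothetical helper, and the header names `name,age,sal` are assumed, since the question does not show them.

```python
def split_header(lines, sep=","):
    """Return (column names, data rows) from raw text lines.

    The first line is treated as the header; every line equal to it
    is dropped, mirroring the filter in the Scala snippet.
    """
    header = lines[0]
    columns = header.split(sep)
    rows = [tuple(line.split(sep)) for line in lines if line != header]
    return columns, rows

# Header names are assumed for illustration; data rows follow the question.
raw = ["name,age,sal", "sai,25,1000", "bum,30,1500"]
columns, rows = split_header(raw)
```

Note that filtering by equality with the header also drops any data line that happens to be identical to the header, which is the same behaviour as the Scala filter.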
This is the current code:
from pyspark.sql import SparkSession
spark_session = SparkSession\
lines = spark_session\
.option("host", "")\
.option("port", 9998)\
The 'lines' looks like this:
| value |
| a,b,c |
But I want to look like this:
| a | b | c |
I tried using the 'split()' method, but it only splits each string into a list inside a single column, not into multiple columns.
What should I do?
Split the value column, then create the new columns by accessing the resulting array by index, with element_at (from Spark 2.4), or with getItem().
from pyspark.sql.functions import *
#| a| b| c|
from pyspark.sql.functions import *
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark_session = SparkSession\
lines = spark_session\
.option("host", "")\
.option("port", 9998)\
split_col = f.split(lines['value'], ",")
# Start from 'lines' (no 'df' exists yet) and give each new column a distinct name
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))
In case the rows contain varying numbers of delimiters rather than a fixed count, you can use the following:
|value |
|a,b,c |
import pyspark.sql.functions as F
max_size = df.select(F.max(F.length(F.regexp_replace('value','[^,]','')))).first()[0]
out = df.select([F.split("value",',')[x].alias(f"Col{x+1}") for x in range(max_size+1)])
| a| b| c|null|
| d| e| f| g|
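The idea behind max_size can be sketched in plain Python, without Spark: count the delimiters in each value, take the maximum, and pad shorter rows with None. The helper name `split_to_columns` is made up for this illustration.

```python
def split_to_columns(values, sep=","):
    """Split each string on sep and pad every row to the widest row."""
    # Widest row = highest delimiter count + 1, as in the max_size step above.
    width = max(v.count(sep) for v in values) + 1
    table = []
    for v in values:
        parts = v.split(sep)
        table.append(parts + [None] * (width - len(parts)))
    return table

table = split_to_columns(["a,b,c", "d,e,f,g"])
```

Here the first row comes out with a trailing None, which corresponds to the null in the Col4 position of the output above.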
Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?
I am not sure this is what you are looking for.
approx_count_distinct and countDistinct are what the Spark API offers; there is no approx_count_groupby.
Examples:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object CountAgg extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
val spark = SparkSession.builder.appName(getClass.getName)
import spark.implicits._
import org.apache.spark.sql.functions._
val df =
).toDF("Page", "Visitor")
println("groupBy and count example")
println("group by and countDistinct")
.agg( countDistinct('visitor)).show
println("group by and approx_count_distinct")
.agg( approx_count_distinct('visitor)).show
| page|count|
|PAGE2| 4|
|PAGE1| 8|
group by and countDistinct
| page|count(DISTINCT visitor)|
|PAGE2| 2|
|PAGE1| 3|
group by and approx_count_distinct
[2020-04-06 01:04:24,488] WARN Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. (org.apache.spark.util.Utils:66)
| page|approx_count_distinct(visitor)|
|PAGE2| 2|
|PAGE1| 3|
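To spell out what the two kinds of numbers above mean, here is a small plain-Python sketch (not Spark code) that computes a row count and an exact distinct count per group. The function name and the sample visits are hypothetical; approx_count_distinct returns an estimate of the second number rather than the exact value.

```python
from collections import defaultdict

def count_and_distinct(visits):
    """Return page -> (number of rows, number of distinct visitors)."""
    counts = defaultdict(int)
    visitors = defaultdict(set)
    for page, visitor in visits:
        counts[page] += 1
        visitors[page].add(visitor)
    return {page: (counts[page], len(visitors[page])) for page in counts}

# Hypothetical sample data, not the data behind the output above.
visits = [("PAGE1", "u1"), ("PAGE1", "u1"), ("PAGE1", "u2"), ("PAGE2", "u3")]
stats = count_and_distinct(visits)
```

The exact distinct count has to remember every visitor per page, which is the cost that an approximate counter is meant to avoid.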
I have a dataframe to which I want to manually add a header and a first column.
Here is the dataframe:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache()
the content of the dataframe
The expected output is
What you want to do is:
df.withColumn("columnName", column) //here "columnName" should be "define" for you
Now you just need to create the said column (this might help)
Here is a solution that depends on Spark 2.4:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
//First off the dataframe needs to be loaded with the expected schema
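// appName requires a name; master and app name are reused from the question
val spark = SparkSession.builder().master("local").appName("my-spark-app").getOrCreate()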
val schema = new StructType()
val df = spark.read.format("csv").schema(schema).load("C:\\gg.csv").cache()
val rddWithId = df.rdd.zipWithIndex
// Prepend "define" column of type String
val newSchema = StructType(Array(StructField("define", StringType, false)) ++ df.schema.fields)
val dfZippedWithId = spark.createDataFrame(rddWithId.map{
case (row, index) =>
Row.fromSeq(Array("c" + index) ++ row.toSeq)}, newSchema)
// Show results
| c0| 12| 13| 14|
| c1| 11| 10| 5|
| c2| 3| 2| 45|
This is a mix of the documentation here and this example.
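The zipWithIndex step boils down to pairing every row with its position and prepending a label built from it. A plain-Python sketch of just that step, with no Spark involved and a hypothetical helper name, could look like this:

```python
def prepend_index_label(rows, prefix="c"):
    """Prepend a label such as 'c0', 'c1', ... to each row, based on its position."""
    return [[f"{prefix}{index}"] + list(row) for index, row in enumerate(rows)]

# Row values taken from the output shown above.
labelled = prepend_index_label([[12, 13, 14], [11, 10, 5], [3, 2, 45]])
```

In this sketch the index simply follows list order; in the Spark version it follows the order of the RDD's partitions and the rows within them.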
I am learning to work with Apache Spark (Scala) and still figuring out how things work out here
I am trying to achieve a simple task of
Finding max of column
Subtract each value of the column from this max and create a new column
The code I am using is
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))
However I am getting a lot of errors. I am more or less fluent in R and had I been doing the same task my code would have been:
new_data <- training %>%
mutate(Subs= max(Values) - Values)
Here is a solution using window functions. You'll need a HiveContext to use them.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")
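// Column name "Subs" is taken from the question; the max is computed over one global window
training.withColumn("Subs",
  max($"values").over(Window.partitionBy()) - $"values").show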
Which produces the expected output :
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
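As a sanity check on the numbers, the same computation can be sketched in plain Python, outside Spark: find the maximum once, then subtract each value from it. The function name is made up for the illustration.

```python
def subtract_from_max(values):
    """Pair each value with (max of all values - value)."""
    top = max(values)
    return [(value, top - value) for value in values]

result = subtract_from_max([10, 13, 14, 21])
```

This is the same thing the R mutate call in the question expresses, and what the window expression computes for every row.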