I have a dataframe as below:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as sequence so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
As a result I want to apply this sample code as given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried converting the df to an RDD and mapping it, but then I could not manage to convert it back to a df.
Thanks in advance.
EDIT:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
"url" -> "jdbc:oracle:thin:...",
"dbtable" -> "table"))
df.show(20)
df.groupBy($"label").agg(collect_list($"term").alias("term"))
You can use collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
In Spark < 2.0 this requires a HiveContext, and in Spark 2.0+ you have to enable Hive support on the SparkSession builder. See Use collect_list and collect_set in Spark SQL.
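From there, a minimal sketch of feeding the grouped column into the Word2Vec example from the linked docs (the variable names and the vectorSize/minCount values here are illustrative, not tuned):
import org.apache.spark.ml.feature.Word2Vec

val grouped = df.groupBy($"label").agg(collect_list($"term").alias("term"))

val word2Vec = new Word2Vec()
  .setInputCol("term")      // Word2Vec expects an array-of-strings column
  .setOutputCol("result")
  .setVectorSize(100)
  .setMinCount(0)

val model = word2Vec.fit(grouped)
model.transform(grouped).show()   // adds a "result" vector column per label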
This is the data frame I get after reading a text file in the Spark context:
+----+---+------+
| _1| _2| _3|
+----+---+------+
|name|age|salary|
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
The dataframe I require is:
+----+---+------+
|name|age|salary|
+----+---+------+
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
Here is the code:
# from spark context
df_txt=spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
df_txt1=df_txt.map(lambda x: x.split(" "))
ddf=df_txt1.toDF().show()
You can use the Spark CSV reader to read your comma-separated file.
When reading it as a plain text file, you have to take the first row as the header, build a Seq of strings from it, and pass that to the toDF function. You also have to filter the header row out of the RDD.
Note: the code below is written in Spark Scala; you can convert it to lambda functions to make it work in PySpark.
import org.apache.spark.sql.functions._
val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
val filteredRDD = df.filter(x=> x!= header)
val finaldf = filteredRDD.map( _.split(",")).map(w => (w(0),w(1),w(2))).toDF(headerCol: _*)
finaldf.show()
w(0), w(1), w(2): you have to define a fixed number of columns for your file here.
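For comparison, a sketch using the CSV reader mentioned above, assuming the file really is comma-separated with the header in its first line (the path is reused from the question):
val csvDf = spark.read
  .option("header", "true")       // take the first row as column names
  .option("inferSchema", "true")  // infer column types instead of keeping everything as string
  .csv("/FileStore/tables/simple-2.txt")
csvDf.show()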
This is the current code:
from pyspark.sql import SparkSession
spark_session = SparkSession\
.builder\
.appName("test")\
.getOrCreate()
lines = spark_session\
.readStream\
.format("socket")\
.option("host", "127.0.0.1")\
.option("port", 9998)\
.load()
The 'lines' dataframe looks like this:
+-------------+
| value |
+-------------+
| a,b,c |
+-------------+
But I want it to look like this:
+---+---+---+
| a | b | c |
+---+---+---+
I tried using the split() method, but it didn't work: it only splits each string into an array inside a single column, not into multiple columns.
What should I do?
Split the value column, then use array indexing, element_at (from Spark 2.4), or getItem() on the result to create the new columns.
from pyspark.sql.functions import *
lines.withColumn("tmp",split(col("value"),',')).\
withColumn("col1",col("tmp")[0]).\
withColumn("col2",col("tmp").getItem(1)).\
withColumn("col3",element_at(col("tmp"),3))
drop("tmp","value").\
show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| a| b| c|
#+----+----+----+
from pyspark.sql.functions import *
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark_session = SparkSession\
.builder\
.appName("test")\
.getOrCreate()
lines = spark_session\
.readStream\
.format("socket")\
.option("host", "127.0.0.1")\
.option("port", 9998)\
.load()
split_col = f.split(lines['value'], ",")
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))
df.show()
In case you have different numbers of delimiters, and not just 3 per row, you can use the below:
Input:
+-------+
|value |
+-------+
|a,b,c |
|d,e,f,g|
+-------+
Solution
import pyspark.sql.functions as F
max_size = df.select(F.max(F.length(F.regexp_replace('value','[^,]','')))).first()[0]
out = df.select([F.split("value",',')[x].alias(f"Col{x+1}") for x in range(max_size+1)])
Output
out.show()
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
| a| b| c|null|
| d| e| f| g|
+----+----+----+----+
Is there a way of counting approximately after a group by on an SQL dataset in Spark? Or, more generally, what is the fastest way of doing a group-by count in Spark?
I am not sure these are what you are looking for... approx_count_distinct and countDistinct are what is available with the Spark API; there is no approx_count_groupby.
Examples:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object CountAgg extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
import spark.implicits._
import org.apache.spark.sql.functions._
val df =
Seq(("PAGE1","VISITOR1"),
("PAGE1","VISITOR1"),
("PAGE2","VISITOR1"),
("PAGE2","VISITOR2"),
("PAGE2","VISITOR1"),
("PAGE1","VISITOR1"),
("PAGE1","VISITOR2"),
("PAGE1","VISITOR1"),
("PAGE1","VISITOR2"),
("PAGE1","VISITOR1"),
("PAGE2","VISITOR2"),
("PAGE1","VISITOR3")
).toDF("Page", "Visitor")
println("groupby abd count example ")
df.groupBy($"page").agg(count($"visitor").as("count")).show
println("group by and countDistinct")
df.select("page","visitor")
.groupBy('page)
.agg( countDistinct('visitor)).show
println("group by and approx_count_distinct")
df.select("page","visitor")
.groupBy('page)
.agg( approx_count_distinct('visitor)).show
}
Result
+-----+-----+
| page|count|
+-----+-----+
|PAGE2| 4|
|PAGE1| 8|
+-----+-----+
group by and countDistinct
+-----+-----------------------+
| page|count(DISTINCT visitor)|
+-----+-----------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+-----------------------+
group by and approx_count_distinct
[2020-04-06 01:04:24,488] WARN Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. (org.apache.spark.util.Utils:66)
+-----+------------------------------+
| page|approx_count_distinct(visitor)|
+-----+------------------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+------------------------------+
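If some accuracy can be traded for speed, approx_count_distinct also accepts a maximum relative standard deviation (rsd). A sketch, assuming it runs inside the same example as above (0.05 is just an illustrative tolerance):
println("group by and approx_count_distinct with rsd")
df.groupBy('page)
  .agg(approx_count_distinct('visitor, 0.05).as("approx_visitors"))
  .show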
I have got a dataframe to which I want to add a header and a first column manually. Here is the dataframe:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache()
The content of the dataframe is:
12,13,14
11,10,5
3,2,45
The expected output is
define,col1,col2,col3
c1,12,13,14
c2,11,10,5
c3,3,2,45
What you want to do is:
df.withColumn("columnName", column) //here "columnName" should be "define" for you
Now you just need to create the said column (this might help)
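A minimal sketch of one way to create such a column, assuming Spark 2.x (note that monotonically_increasing_id() is increasing but not guaranteed to be consecutive, so the labels may not come out exactly as c1, c2, c3; the zipWithIndex answer below gives consecutive indices):
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

val labeled = df.withColumn("define",
  concat(lit("c"), (monotonically_increasing_id() + 1).cast("string")))
labeled.show()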
Here is a solution that depends on Spark 2.4:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
//First off the dataframe needs to be loaded with the expected schema
val spark = SparkSession.builder().appName("my-spark-app").getOrCreate()
val schema = new StructType()
.add("col1",IntegerType,true)
.add("col2",IntegerType,true)
.add("col3",IntegerType,true)
val df = spark.read.format("csv").schema(schema).load("C:\\gg.csv").cache()
val rddWithId = df.rdd.zipWithIndex
// Prepend "define" column of type Long
val newSchema = StructType(Array(StructField("define", StringType, false)) ++ df.schema.fields)
val dfZippedWithId = spark.createDataFrame(rddWithId.map{
case (row, index) =>
Row.fromSeq(Array("c" + index) ++ row.toSeq)}, newSchema)
// Show results
dfZippedWithId.show
Displays:
+------+----+----+----+
|define|col1|col2|col3|
+------+----+----+----+
| c0| 12| 13| 14|
| c1| 11| 10| 5|
| c2| 3| 2| 45|
+------+----+----+----+
This is a mix of the documentation here and this example.
I am learning to work with Apache Spark (Scala) and am still figuring out how things work here.
I am trying to achieve the simple task of:
finding the max of a column, and
subtracting each value of the column from this max to create a new column.
The code I am using is
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
(10),
(13),
(14),
(21)
)).toDF("Values")
val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))
However, I am getting a lot of errors. I am more or less fluent in R, and had I been doing the same task there my code would have been:
library(dplyr)
new_data <- training %>%
mutate(Subs= max(Values) - Values)
Here is a solution using window functions; you'll need a HiveContext to use them:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")
training.withColumn("subs",
max($"values").over(Window.partitionBy()) - $"values").show
This produces the expected output:
+------+----+
|values|subs|
+------+----+
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
+------+----+
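For Spark 2.x, window functions work with a plain SparkSession (no HiveContext needed). Since an empty Window.partitionBy() pulls every row into one partition, a cross join with the aggregated max is a common alternative for larger data; a sketch under that assumption:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.master("local").appName("max-subtract").getOrCreate()
import spark.implicits._

val training = Seq(10, 13, 14, 21).toDF("values")
val maxDf = training.agg(max($"values").as("max_value"))

training.crossJoin(maxDf)
  .withColumn("subs", $"max_value" - $"values")   // subtract each value from the max
  .drop("max_value")
  .show()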