structtype add method to add arraytype

structtype add method to add arraytype - pyspark

structtype has a method call add. i see example to use
schema = Structype()
schema.add('testing',string)
schema.add('testing2',string)
how can I add Structype and array type in the schema , using add()?

You need to use it as below -
from pyspark.sql.types import *
schema = StructType()
schema.add('testing',StringType())
schema.add('testing2',StringType())
Sample example to create a dataframe using this schema -
df = spark.createDataFrame(data=[(1,2), (3,4)],schema=schema)
df.show()
+-------+--------+
|testing|testing2|
+-------+--------+
| 1| 2|
| 3| 4|
+-------+--------+

Related

How to creat a pyspark DataFrame inside of a loop?

How to creat a pyspark DataFrame inside of a loop? In this loop in each iterate I am printing 2 values print(a1,a2). now I want to store all these value in a pyspark dataframe.

Initially, before the loop, you could create an empty dataframe with your preferred schema. Then, create a new df for each loop with the same schema and union it with your original dataframe. Refer the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
StructField('a1', StringType(), True),
StructField('a2', StringType(), True)
])
df = spark.createDataFrame([],schema)
for i in range(1,5):
a1 = i
a2 = i+1
newRow = spark.createDataFrame([(a1,a2)], schema)
df = df.union(newRow)
print(df.show())
This gives me the below result where the values are appended to the df in each loop.
+---+---+
| a1| a2|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+

How to make first row as header in PySpark reading text file as Spark context

The data frame what I get after reading text file in spark context
+----+---+------+
| _1| _2| _3|
+----+---+------+
|name|age|salary|
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
the dataframe I required is
+----+---+------+
|name|age|salary|
+----+---+------+
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
Here is the the code:
## from spark context
df_txt=spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
df_txt1=df_txt.map(lambda x: x.split(" "))
ddf=df_txt1.toDF().show()

You can use spark csv reader to read your comma seperate file.
For reading text file, you have to take first row as header and create a Seq of String and pass to toDF function. Also, remove first header to the rdd.
Note: Below code has written in spark scala. you can convert into lambda function to make it work in pyspark
import org.apache.spark.sql.functions._
val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
val filteredRDD = df.filter(x=> x!= header)
val finaldf = filteredRDD.map( _.split(",")).map(w => (w(0),w(1),w(2))).toDF(headerCol: _*)
finaldf.show()
w(0),w(1),w(2) - you have to define fixed number of column from your file.

Remove all records which are duplicate in spark dataframe

I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name) but it will only drop duplicate entries but still keep one record in the dataframe. What I need is to remove all entries which were initially containing duplicate entries.
I am using Spark 1.6 and Scala 2.10.

I would use window-functions for this. Lets say you want to remove duplicate id rows :
import org.apache.spark.sql.expressions.Window
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()

This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternativly, it can be done by using a join. It will be slower, but if there is a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")

I added a killDuplicates() method to the open source spark-daria library that uses #Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
object DataFrameExt {
implicit class DataFrameMethods(df: DataFrame) {
def killDuplicates(cols: Column*): DataFrame = {
df
.withColumn(
"my_super_secret_count",
count("*").over(Window.partitionBy(cols: _*))
)
.where(col("my_super_secret_count") === 1)
.drop(col("my_super_secret_count"))
}
}
}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two methods from pyspark.sql.functions as the documentation on PySpark official website is not very informative. For example the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
so what are the difference between the two and when should I use one and not the other?

The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value
Say if we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name, F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
While lit will create a constant column with the given string as the values.
Both of them return a Column object but the content and meaning are different.

To explain in a very succinct manner, col is typically used to refer to an existing column in a DataFrame, as opposed to lit which is typically used to set the value of a column to a literal
To illustrate with an example:
Assume i have a DataFrame df containing two columns of IntegerType, col_a and col_b
If i wanted a column total which were the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
Instead of i wanted a column fixed_val having the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))

How to convert a dataframe column to sequence

I have a dataframe as below:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as sequence so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
As a result I want to apply this sample code as given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried to convert df to RDD and map it. And then I could not manage to re-convert it to a df.
Thanks in advance.
EDIT:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
"url" -> "jdbc:oracle:thin:...",
"dbtable" -> "table"))
df.show(20)
df.groupBy($"label").agg(collect_list($"term").alias("term"))

You can use collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
In Spark < 2.0 it requires HiveContext and in Spark 2.0+ you have to enable hive support in SessionBuilder. See Use collect_list and collect_set in Spark SQL

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

structtype add method to add arraytype - pyspark

structtype has a method call add. i see example to use schema = Structype() schema.add('testing',string) schema.add('testing2',string) how can I add Structype and array type in the schema , using add()?

Related

How to creat a pyspark DataFrame inside of a loop?

How to make first row as header in PySpark reading text file as Spark context

Remove all records which are duplicate in spark dataframe

PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

How to convert a dataframe column to sequence

Categories

Resources