How to split column into multiple columns in Spark 2? - scala

I am reading data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data into the two columns shown above.

The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
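Note that regexp_extract returns string columns; if you want numeric values like in the expected output, a cast can be chained on. A minimal sketch, reusing the regex and imports above:
df.select(
  F.regexp_extract($"value", r, 1).cast("int").as("id"),
  F.regexp_extract($"value", r, 2).cast("int").as("community")
).show()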

A combination of regular expressions should give you the required result.
import org.apache.spark.sql.functions._
import spark.implicits._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
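Note that this explodes every key:value pair into its own row. If only the community entry is wanted, the key part can be filtered on first; a sketch reusing the same imports, with the intermediate column renamed to kv for clarity:
df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "kv"
).where(split($"kv", ":")(0) === "community")
 .withColumn("value", split($"kv", ":")(1))
 .drop("kv")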

If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._

df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful

Related

Logic to manipulate dataframe in spark scala [Spark]

Take, for example, the following DataFrame:
x.show(false)
+-----+--------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                              |timestamp    |
+-----+--------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/done/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+--------------------------------------------------------------------------------------+-------------+
Now I am trying to update the existing DF to create a new DF based on the column hdfsPath.
The new DF should look like the following:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                                            |timestamp    |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
So the done segment of the path changes to target; then from the compiled-20200218050518-1-0-0-1582020318751.snappy portion I take the date 20200218, then the colId 11, and finally the snappy file name. What would be the easiest and most efficient way to achieve this?
It's not a hard requirement to create a new DF; I can update the existing DF with a new column.
To summarize:
Current hdfsPath:
hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy
Expected hdfsPath:
hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy
Based on colID.
The simplest way I can imagine doing this is converting your DataFrame to a Dataset, applying a map operation, and then going back to a DataFrame:
// Define a case class; the fields need to match the column names and types
case class MyType(colId: Int, hdfsPath: String, timestamp: Long)

dataframe.as[MyType].map(x => <<Your Transformation code>>).toDF()
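For illustration only, the transformation inside map could be plain string manipulation. A minimal sketch, assuming the date is the eight digits after compiled- and that dataframe is the input DF shown above:
import spark.implicits._

val newDf = dataframe.as[MyType].map { x =>
  // pull the yyyyMMdd portion out of the file name (assumed to always follow "compiled-")
  val date = "compiled-([0-9]{8})".r.findFirstMatchIn(x.hdfsPath).map(_.group(1)).getOrElse("")
  // swap the done segment for target/<date>/<colId>
  x.copy(hdfsPath = x.hdfsPath.replace("/done/", s"/target/$date/${x.colId}/"))
}.toDF()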
Here is what you can do with regexp_replace and regexp_extract: extract the values you want and build the replacement from them.
import org.apache.spark.sql.functions._

df.withColumn("hdfsPath", regexp_replace(
  $"hdfsPath",
  lit("/done"),
  concat(
    lit("/target/"),
    regexp_extract($"hdfsPath", "compiled-([0-9]{1,8})", 1),
    lit("/"),
    $"colId")
))
Output:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath |timestamp |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11 |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12 |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
Hope this helps!

Spark Column merging all list into 1 single list

I want the column below merged into a single list for n-gram calculation. I am not sure how I can merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info: I would like this as a Spark DF column, with all the words, including the repeated ones, in a single list. The data is kind of big, so I want to avoid methods like collect.
The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin','Lee'],),(['Chatbots','were'],),(['Our','hopes','were'],),
(['And','why','wouldn'],),(['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values,['author',])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
DataFrames, like other distributed data structures, are not iterable and can only be accessed using dedicated higher-order functions and/or SQL methods.
Suppose your DataFrame is DF1 and the output is DF2.
You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author', ])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if it works.

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work, and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient. Thank you for your help ahead of time.
Convert the first dataframe df1 to a List[String], then create a UDF and apply the filter condition.
Spark-shell-
import org.apache.spark.sql.functions._
// Convert df1 to a List[String]
val df1List = df1.select("key").map(row => row.getString(0).toLowerCase).collect.toList

// Create the UDF ("spark" is the SparkSession)
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)

// Apply the filter
df2.filter("filterUDF(Value)=0").show

// Output:
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE -
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1 = sparkSession.read.format("csv").option("header", "true").load("C:\\spark\\programs\\df1.csv")
val df2 = sparkSession.read.format("csv").option("header", "true").load("C:\\spark\\programs\\df2.csv")

import sparkSession.implicits._
val df1List = df1.select("key").map(row => row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to a List and convert df2 to a Dataset.
case class s(key: String, Value: String)
val df2Ds = df2.as[s] // requires import spark.implicits._
Then we can use the filter method to filter out the records, somewhat like this:
// df1List is the List[String] built from df1, as shown above
def check(str: String): Boolean = {
  for (i <- df1List) {
    if (str.contains(i))
      return false
  }
  true
}

df2Ds.filter(row => check(row.Value)).collect
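A collect-free alternative, not taken from the answers above but just a sketch using the same df1/df2 names: compare the two frames with a contains condition and keep only the df2 rows that have no match, via a left_anti join.
import org.apache.spark.sql.functions.lower

// keep only df2 rows whose Value contains none of the keys in df1
val result = df2.join(df1, lower(df2("Value")).contains(lower(df1("key"))), "left_anti")
result.show(false)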

How to remove the fractional part from a dataframe column?

Input dataframe:
val ds = Seq((1,34.44),
(2,76.788),
(3,54.822)).toDF("id","mark")
Expected output:
val ds = Seq((1,34),
(2,76),
(3,54)).toDF("id","mark")
I want to remove the fractional part from the column mark as above. I have searched for a built-in function but did not find anything. What should a UDF look like to achieve the above result?
You can just use cast to integer as
import org.apache.spark.sql.functions._
ds.withColumn("mark", $"mark".cast("integer")).show(false)
which should give you
+---+----+
|id |mark|
+---+----+
|1 |34 |
|2 |76 |
|3 |54 |
+---+----+
I hope the answer is helpful
Update
You commented:
But if any string values are there in the column, it is getting null since we are casting into integer. I don't want that kind of behaviour
So I guess your mark column must be a StringType() and you can use regexp_replace
import org.apache.spark.sql.functions._
ds.withColumn("mark", regexp_replace($"mark", "(\\.\\d+)", "")).show(false)
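If you really want the udf asked about in the question, a minimal sketch, assuming mark is a numeric (double) column and truncating toward zero:
import org.apache.spark.sql.functions.{col, udf}

val dropFraction = udf((mark: Double) => mark.toInt)
ds.withColumn("mark", dropFraction(col("mark"))).show(false)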

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
---------------------------------
|ID | words |
---------------------------------
1 | ['apple','ball','ballon'] |
2 | ['cat','camel','james'] |
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.extend([(i, my_data[i])])

df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+-----+