I am trying to parse a string and append the results to new fields in a dataframe. In SQL, it would work like this:
UPDATE myDF
SET theyear = SUBSTRING(filename, 52, 4),
    themonth = SUBSTRING(filename, 57, 2),
    theday = SUBSTRING(filename, 60, 2),
    thefile = SUBSTRING(filename, 71, 99)
I want to use Scala to do the work because the dataframes I'm working with are really huge, and doing it this way will be orders of magnitude faster than doing the same thing in SQL. Based on my research, I think it would look something like the code below, but I don't know how to count the number of characters in a field.
Here is some sample data:
abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz
I want to get the year, the month, the day, and the file name, so for this example the dataframe should contain theyear = 2018, themonth = 10, theday = 27, and thefile = CPPP1027.Mid.414.gz.
val modifiedDF = df
.withColumn("theyear", )
.withColumn("themonth", )
.withColumn("theday", )
.withColumn("thefile", )
modifiedDF.show(false)
So, I want to append four fields to a dataframe: theyear, themonth, theday, and thefile. Then, do the parsing based on the count of characters in a string. Thanks.
I would probably rather use a regex for pattern matching than rely on string length. In this simple example, I extract the main date pattern using regexp_extract and then build the other columns from there using substring:
%scala
import org.apache.spark.sql.functions._
val df = Seq(
  "abc://path_to_all_files_in_data_lake/2018/10/27/Parent/CPPP1027.Mid.414.gz",
  "abc://path_to_all_files_in_data_lake/2019/02/28/Parent/CPPP77.Mid.303.gz"
)
.toDF("somePath")
.withColumn("theDate", regexp_extract($"somePath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0) )
.withColumn("theYear", substring($"theDate", 1, 4 ) )
.withColumn("theMonth", substring($"theDate", 6, 2 ) )
.withColumn("theDay", substring($"theDate", 9, 2 ) )
.withColumn("theFile", regexp_extract($"somePath", "[^/]+\\.gz", 0) )
df.show
My results:
Does that work for you?
Using built-in functions on the DataFrame -
You can use length(Column) from org.apache.spark.sql.functions to find the size of the data in a column.
val modifiedDF = df
.withColumn("theyear", when(length($"columnName") === ??, ??).otherwise(??))
Using Scala -
df.map { row =>
  val c = row.getAs[String]("columnName")
  // length of c: c.length
  // build all the output columns here
  // return (column1, column2, ...)
}.toDF("column1", "column2")
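For completeness, here is a minimal sketch of that map approach, assuming the path column is named "filepath" (as in the final solution below) and splitting on "/" rather than counting characters; the array indices are based on the sample path above:
import spark.implicits._

val parsedDF = df.map { row =>
  val path  = row.getAs[String]("filepath")
  val parts = path.split("/") // e.g. Array("abc:", "", "path_to_all_files_in_data_lake", "2018", "10", "27", "Parent", "CPPP1027.Mid.414.gz")
  (parts(3), parts(4), parts(5), parts.last) // (theyear, themonth, theday, thefile)
}.toDF("theyear", "themonth", "theday", "thefile")
parsedDF.show(false)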
Here is the final working solution!
%scala
import org.apache.spark.sql.functions._
val dfMod = df
.withColumn("thedate", regexp_extract($"filepath", "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]", 0) )
.withColumn("theyear", substring($"thedate", 1, 4 ) )
.withColumn("themonth", substring($"thedate", 6, 2 ) )
.withColumn("theday", substring($"thedate", 9, 2 ) )
.withColumn("thefile", regexp_extract($"filepath", "[^/]+\\.gz", 0) )
dfMod.show(false)
Thanks for the assist wBob!!!
print(
(
df1.lazy()
.with_context(df2.lazy())
.select(
pl.col("df1_date")
.apply(lambda s: pl.col("df2_date").filter(pl.col("df2_date") >= s).first())
.alias("release_date")
)
).collect()
)
Instead of getting actual data, I get a DataFrame of query plans. Is there any other way to solve my problem? Thanks!
In pandas, I can get what I want by using:
df1["release_date"] = df1.index.map(
lambda x: df2[df2.index < x].index[-1]
)
Edit:
Please try the code below and you will see that Polars only returns query plans for this, while pandas gives the data I want.
import polars as pl
df1 = pl.DataFrame(
{
"df1_date": [20221011, 20221012, 20221013, 20221014, 20221016],
"df1_col1": ["foo", "bar", "foo", "bar", "foo"],
}
)
df2 = pl.DataFrame(
{
"df2_date": [20221012, 20221015, 20221018],
"df2_col1": ["1", "2", "3"],
}
)
print(
(
df1.lazy()
.with_context(df2.lazy())
.select(
pl.col("df1_date")
.apply(lambda s: pl.col("df2_date").filter(pl.col("df2_date") <= s).last())
.alias("release_date")
)
).collect()
)
df1 = df1.to_pandas().set_index("df1_date")
df2 = df2.to_pandas().set_index("df2_date")
df1["release_date"] = df1.index.map(
lambda x: df2[df2.index <= x].index[-1] if len(df2[df2.index <= x]) > 0 else 0
)
print(df1)
It looks like you're trying to do an asof join, in other words a join where you take the last value that matched rather than requiring exact matches.
You can do
df1 = (
    df1.lazy()
    .join_asof(df2.lazy(), left_on="df1_date", right_on="df2_date")
    .select(["df1_date", "df1_col1",
             pl.col("df2_date").fill_null(0).alias("release_date")])
    .collect()
)
The first difference is that in Polars you don't assign new columns; you assign the whole DataFrame, so it's always just the name of the DataFrame on the left side of the equals sign. The join_asof replaces your index/map/lambda approach. The last thing is to replace the null value with 0 using fill_null and then rename the column. There was a bug in an old version of Polars that prevented the collect from working at the end; that is fixed in at least 0.15.1 (maybe an earlier version too, but that is the version I checked with).
I have a dataset with columns month, id and value, something like this:
val df = Seq(
(201801, "fghufhg", 3),
(201801, "bhfbhgf", 6),
(201801, "dgdjjh", 5),
(201802, "ehfjrnfj", 6),
(201802, "ehghghfj", 98),
(201803, "nfrghj", 75),
(201803, "nfnrjfj", 7)
).toDF("month", "id", "value")
I created the function below to select a month in my dataset
def selectMonth(input:org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], col:Column , month:Int) : org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
input.where(col === month)
}
So when I do this
val month201801 = selectMonth(df, $"month", 201801)
I get a dataframe (org.apache.spark.sql.DataFrame) with only the rows with info for this month.
Now I want to find an easier way to create several dataframes like this from a list of months like:
Seq(201801, 201802, 201803, 201804, 201805)
I wanted to do something like the code below, but I am clearly not thinking about this in the right way:
val listCohorts = Seq(201801, 201802, 201803, 201804, 201805)
for (i <- listCohorts) {
val (month +i) = selectMonth(df, $"month", i)
}
Because I get this error:
notebook:4: error: recursive value i needs type
val (C +i) = selectMonth(df, $"month", i)
^
notebook:4: error: not found: value +
val (C +i) = selectMonth(df, $"month", i)
^
notebook:4: error: not found: value C
val (C +i) = selectMonth(df, $"month", i)
^
The "month +i" was my attempt to name each dataframe like month201801, month201802, and the "i" was supposed to be the input of the month in the function
In other words, what I want is a way to create several dataframes (org.apache.spark.sql.DataFrame) performing only a where operation in the original dataset and naming it based on the condition used on the where. And to be able to adapt this (like choose other months to create other dataframes) by changing only the list that contains the information for the where.
In python this would be as simple as this:
monthlist = ['201801', '201802', '201803']
column = 'month'
for i in monthlist:
globals()[column + i] = df[df[column] == i]
This would create 3 dataframes named month201801, month201802, and month201803, each one containing only the rows of the original dataframe for the month in their name
This can be done without a separate function; the list of dates is converted to a Map with specific keys:
import org.apache.spark.sql.functions.col

val column = "month"
val df = Seq(201801, 201802, 201803, 201804, 201805).toDF(column)
val dates = Seq(201801, 201802, 201803, 201804, 201805)
val monthDfMap = dates.map(date => column + date -> df.where(col(column) === date)).toMap
val may: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = monthDfMap("month201805")
may.show(false)
Output is:
+------+
|month |
+------+
|201805|
+------+
You can't dynamically name variables in Scala. Use a Map instead. (Map is called dict in Python.)
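As a sketch, the same idea applied to the original (month, id, value) DataFrame from the question (the column name "month" and the month list are taken from above):
import org.apache.spark.sql.functions.col

val months = Seq(201801, 201802, 201803, 201804, 201805)
val monthDfs = months.map(m => s"month$m" -> df.where(col("month") === m)).toMap

// Look up each filtered DataFrame by its generated name
monthDfs("month201801").show(false)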
I have a dataframe that looks like this
+--------------------
| unparsed_data|
+--------------------
|02020sometext5002...
|02020sometext6682...
I need to get it split it up into something like this
+--------------------
|fips | Name | Id ...
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...
I have a list like this
val fields = List(
("fips", 5),
("Name", 8),
("Id", 27)
....more fields
)
I need the split to take the first 5 characters in unparsed_data and map them to fips, take the next 8 characters in unparsed_data and map them to Name, then the next 27 characters and map them to Id, and so on. I need the split to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
My Scala is still pretty weak, and I assume the answer would look something like this:
df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)
Any suggestions/ideas are much appreciated.
You can use foldLeft to traverse your fields list to iteratively create columns from the original DataFrame using
substring. It applies regardless of the size of the fields list:
import org.apache.spark.sql.functions._
val df = Seq(
("02020sometext5002"),
("03030othrtext6003"),
("04040moretext7004")
).toDF("unparsed_data")
val fields = List(
("fips", 5),
("name", 8),
("id", 4)
)
val resultDF = fields.foldLeft( (df, 1) ){ (acc, field) =>
val newDF = acc._1.withColumn(
field._1, substring($"unparsed_data", acc._2, field._2)
)
(newDF, acc._2 + field._2)
}._1.
drop("unparsed_data")
resultDF.show
// +-----+--------+----+
// | fips| name| id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+
Note that a Tuple2[DataFrame, Int] is used as the accumulator for foldLeft to carry both the iteratively transformed DataFrame and next offset position for substring.
This can get you going. Depending on your needs it can get more and more complicated, with variable lengths etc., which you do not state. But I think you can use a column list.
import org.apache.spark.sql.functions._
val df = Seq(
("12334sometext999")
).toDF("X")
val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6,8)", "substring(X, 14,3)")
df2.show
Gives in this case (you can rename cols again):
+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
| 12334| sometext| 999|
+------------------+------------------+-------------------+
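If you do want to drive those substring expressions from the (name, length) list instead of hardcoding them, here is a rough sketch (the offsets are computed with a running total, as in the foldLeft answer above; the field lengths are just chosen to match the sample string "12334sometext999"):
val fields = List(("fips", 5), ("name", 8), ("id", 3))

// scanLeft carries the running start position; drop the seed and keep the expression strings
val exprs = fields.scanLeft(("", 1)) { case ((_, pos), (name, len)) =>
  (s"substring(X, $pos, $len) as $name", pos + len)
}.tail.map(_._1)
// exprs: List("substring(X, 1, 5) as fips", "substring(X, 6, 8) as name", "substring(X, 14, 3) as id")

val df3 = df.selectExpr(exprs: _*)
df3.show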
I have a table which contains id, offset, and text. Suppose the input is:
id offset text
1 1 hello
1 7 world
2 1 foo
I want output like:
id text
1 hello world
2 foo
I'm using:
df.groupby("id").agg(concat_ws("", collect_list("text")))
But I don't know how to ensure the order of the text. I sorted the data before the groupby, but I've heard that a groupby might shuffle the data. Is there a way to sort within each group after grouping the data?
This will create the required df:
df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)
Then you can use the following code (it could be optimized further):
from pyspark.sql.functions import udf, col, lit, concat, concat_ws, collect_list

@udf
def sort_by_offset(col):
    result = ""
    text_list = col.split("-")
    for i in range(len(text_list)):
        text_list[i] = text_list[i].split(" ")
        text_list[i][0] = int(text_list[i][0])
    text_list = sorted(text_list, key=lambda x: x[0], reverse=False)
    for i in range(len(text_list)):
        result = result + " " + text_list[i][1]
    return result.lstrip()
df2 = df1.withColumn("offset_text",concat(col("offset"),lit(" "),col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-",collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text",sort_by_offset(col("offset_text")))
display(df4)
Final Output:
Add sort_array:
from pyspark.sql.functions import sort_array
df.groupby("id").agg(concat_ws("", sort_array(collect_list("text"))))
How do I select all the columns of a dataframe that are at certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns 10, 12, 13, 14, and 15, how can I do that?
The code below selects all columns from dataframe df whose names appear in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is a similar colNos array which has
colNos = Array(10,20,25,45)
How do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based or Column-based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks / the logical plan is similar to my approach below. But my approach is a bit faster.
So, I would suggest going with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut method too:
val colNames = df.schema.fieldNames
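For example, a rough sketch combining that shortcut with index-based selection (the indexes are just the colNos values from the question):
val selectColNames = Array(10, 20, 25, 45).map(colNames(_))   // pick names by position
val selectCols = selectColNames.map(name => df.col(name))
val selected = df.select(selectCols: _*)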
Example: grab the first 14 columns of a Spark DataFrame by index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert the Array[String] into an Array[org.apache.spark.sql.Column] in order for the select to work.
Or wrap it in a function using currying (high five to my colleague for this):
import org.apache.spark.sql.DataFrame

// Subsets the DataFrame to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}

// Get the first 25 columns as a subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))