I'm working with Databricks in a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values, and casting them as double makes them unreadable.
How can I convert them to a human-readable format on which .sort() would also work? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via the two arguments.
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
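If you want both at once, one option is to sort on a decimal helper column while still displaying the original strings; a minimal sketch building on the df above (the helper column name sort_key is made up here):
(
    df
    .withColumn('sort_key', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('sort_key')
    .select('string_column')  # ordering comes from the decimal key; the displayed value stays the readable string
    .show(truncate=False)
)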
Related
I have one Spark DF1 with these datatypes:
string (nullable = true)
integer (nullable = true)
timestamp (nullable = true)
And one more Spark DF2 (which I created from a Pandas DF):
object
int64
object
Now I need to change the DF2 datatypes to match the DF1 datatypes. Is there any common way to do that? Every time I may get different columns and different datatypes.
Is there any way to assign the DF1 data types to some StructType and use that for DF2?
Suppose you have 2 dataframes, data1_sdf and data2_sdf. You can use a dataframe's schema to extract a column's data type with data1_sdf.select('column_name').schema[0].dataType.
Here's an example where the data2_sdf columns are cast to data1_sdf's types within a select:
from pyspark.sql import functions as func

data2_sdf. \
    select(*[func.col(c).cast(data1_sdf.select(c).schema[0].dataType) if c in data1_sdf.columns else c
             for c in data2_sdf.columns])
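As a quick end-to-end illustration (a hedged sketch with tiny made-up dataframes; only the casting pattern above comes from the answer):
from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# data1_sdf plays the role of DF1 (string, integer, timestamp)
data1_sdf = spark.sql("select 'a' as c1, 1 as c2, current_timestamp() as c3")

# data2_sdf plays the role of DF2, e.g. created with looser inferred types
data2_sdf = spark.createDataFrame([('b', 2, '2020-01-01 00:00:00')], ['c1', 'c2', 'c3'])

data2_sdf = data2_sdf. \
    select(*[func.col(c).cast(data1_sdf.select(c).schema[0].dataType) if c in data1_sdf.columns else c
             for c in data2_sdf.columns])

data2_sdf.printSchema()  # schema now matches data1_sdf: c1 string, c2 integer, c3 timestamp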
If you make sure that your first object is a string-like column and your third object is a timestamp-like column, you can try this method:
df2 = spark.createDataFrame(
df2.rdd, schema=df1.schema
)
However, this method is not guaranteed to work; some conversions are not valid (e.g. transforming a string to an integer). It might also be poor from a performance perspective. Therefore, you are better off using cast to change the data type of each column.
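For example, a per-column cast driven by df1's schema could look roughly like this (an untested sketch; it assumes df1 and df2 have the same columns in the same order):
from pyspark.sql import functions as F

# cast each df2 column to the data type of the corresponding df1 field
df2_casted = df2.select([
    F.col(c).cast(field.dataType).alias(field.name)
    for c, field in zip(df2.columns, df1.schema.fields)
])
df2_casted.printSchema()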
I have one column in a DataFrame with the format
'[{jsonobject},{jsonobject}]'. Here the length would be 2.
I have to find the length of this array and store it in another column.
I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
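For instance, with a tiny made-up DataFrame (reusing f, json_schema and the column name input from above, and assuming an existing spark session; an untested sketch), it would behave like this:
df = spark.createDataFrame([('[{"a": "1"}, {"b": "2"}]',)], ["input"])
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects")).show()
# +-----------+
# |num_objects|
# +-----------+
# |          2|
# +-----------+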
I'm trying to convert a String column, which is essentially a date, into a TimestampType column; however, I'm having problems with how to split the value.
-RECORD 0-------------------------------------
year | 2016
month | 4
arrival_date | 2016-04-30
date_added | 20160430
allowed_date | 10292016
I have 3 columns, all of which are in different formats, so I'm trying to find a way to split the string in a custom way, since the date_added column is yyyymmdd and the allowed_date is mmddyyyy.
I've tried something along the lines of:
df_imigration.withColumn('cc', F.date_format(df_imigration.allowed_date.cast(dataType=t.TimestampType()), "yyyy-mm-dd"))
But with no success, and I'm kind of stuck trying to find the right or best way to solve this.
The t and F aliases are for the following imports:
from pyspark.sql import functions as F
from pyspark.sql import types as t
The problem with your code is that you are casting the date without specifying the date format.
To specify the format you should use the function to_timestamp().
Here I have created a dataframe with three different formats and it worked (note that months are uppercase MM in the format patterns; lowercase mm means minutes):
from pyspark.sql import functions as f

df1 = spark.createDataFrame([("20201231", "12312020", "31122020"), ("20201231", "12312020", "31122020")], ["ID", "Start_date", "End_date"])
df1 = df1.withColumn('cc', f.date_format(f.to_timestamp(df1.ID, 'yyyyMMdd'), "yyyy-MM-dd"))
df1 = df1.withColumn('dd', f.date_format(f.to_timestamp(df1.Start_date, 'MMddyyyy'), "yyyy-MM-dd"))
df1.withColumn('ee', f.date_format(f.to_timestamp(df1.End_date, 'ddMMyyyy'), "yyyy-MM-dd")).show()
Output:
+--------+----------+--------+----------+----------+----------+
| ID|Start_date|End_date| cc| dd| ee|
+--------+----------+--------+----------+----------+----------+
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
+--------+----------+--------+----------+----------+----------+
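Applied to the columns from the question, that would look something like this (an untested sketch; df_imigration and the source column names come from the question, and the new *_ts column names are made up):
df_imigration = (
    df_imigration
    .withColumn('date_added_ts', f.to_timestamp(df_imigration.date_added, 'yyyyMMdd'))
    .withColumn('allowed_date_ts', f.to_timestamp(df_imigration.allowed_date, 'MMddyyyy'))
)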
Hope it helps!
I need to save a file delimited by the "|~" characters, but I get an error when I execute the command below. Can I save a file using a multi-character delimiter in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
// Error : pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open and Spark SQL still relies on univocity-parsers. In univocity's CSV settings, the CSV delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date format? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢) as the delimiter, then substitute it with |~.
(this hasn't been benchmarked, but IMO it's very likely to be the fastest)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
Within PySpark, concatenate each row into a string, then write as a text file.
(can be flexible to deal with locality and special needs -- with a bit more work)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf
def mkstr(name, age):
    """
    for example, the string field {name} should be quoted with `"`
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of builtins if you care about the precision of floating-point numbers; note that this route collects the whole DataFrame to the driver via toPandas())
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
From Spark 3.0 we don't have this issue, but if you are using a prior version (later than Spark 2.3) this can also be used as a solution: basically, concatenate all columns, fill nulls with blanks, and write the data with the desired delimiter along with the header. This is a more generic solution than hardcoding, and it also retains the header.
from pyspark.sql.functions import col, concat_ws

target_row_delimited = "|,"
df = df.select([col(c).cast("string") for c in df.columns])
df = df.na.fill("")
headername = target_row_delimited.join(df.columns)
df = df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
# modeval is your write mode (e.g. "overwrite") and tgt_path your target directory
df.select(df[headername]).write.format("csv").mode(modeval).option("quoteAll", "false").option("quote", "\u0000").option("header", "true").save(tgt_path + "/")
In case we need to read with multiple delimiters, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
# note: the header line itself remains in the RDD below and may need to be filtered out
df = sc.textFile(source_filename).map(lambda x: x.split(source_delimiter)).toDF(header_column)
I am new to Scala and Spark and I am trying to read a csv file locally (for testing):
val spark = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate;
val topics_df = spark.read.format("csv").option("header", "true").load("path-to-file.csv")
topics_df.show(10)
Here's what the file looks like:
+-----+--------------------+--------------------+
|topic| termindices| termweights|
+-----+--------------------+--------------------+
| 15|[21,31,51,108,101...|[0.0987100701,0.0...|
| 16|[42,25,121,132,55...|[0.0405490884,0.0...|
| 7|[1,23,38,7,63,0,1...|[0.1793091892,0.0...|
| 8|[13,40,35,104,153...|[0.0737646511,0.0...|
| 9|[2,10,93,9,158,18...|[0.1639456608,0.1...|
| 0|[28,39,71,46,123,...|[0.0867449145,0.0...|
| 1|[11,34,36,110,112...|[0.0729913664,0.0...|
| 17|[6,4,14,82,157,61...|[0.1583892199,0.1...|
| 18|[9,27,74,103,166,...|[0.0633899386,0.0...|
| 19|[15,81,289,218,34...|[0.1348582482,0.0...|
+-----+--------------------+--------------------+
with
ReadSchema: struct<topic:string,termindices:string,termweights:string>
The termindices column is supposed to be of type Array[Int], but when saved to CSV it is a String (This usually would not be a problem if I pull from databases).
How do I convert the type and eventually cast the DataFrame to a:
case class TopicDFRow(topic: Int, termIndices: Array[Int], termWeights: Array[Double])
I have the function ready to perform the conversion:
termIndices.substring(1, termIndices.length - 1).split(",").map(_.toInt)
I have looked at udf and a few other solutions but I am convinced that there should be a much cleaner and faster way to perform said conversion. Any help is greatly appreciated!
UDFs should be avoided when it's possible to use the more efficient in-built Spark functions. To my knowledge there is no better way than the one proposed: remove the first and last characters of the string, split, and convert.
Using the in-built functions, this can be done as follows:
df.withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices")-2), ",").cast("array<int>"))
.withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights")-2), ",").cast("array<double>"))
.as[TopicDFRow]
substr is 1-indexed, so to remove the first character we start from 2. The second argument is the length to take (not the end point), hence the -2. The topic column also needs a cast to int so it matches the case class.
The last command will cast the dataframe to a dataset of type TopicDFRow.
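For reference, a rough PySpark equivalent of the same approach (an untested sketch; the column names are the ones from the question):
from pyspark.sql import functions as F

df_typed = (
    df
    .withColumn("topic", F.col("topic").cast("int"))
    .withColumn("termindices",
                F.split(F.expr("substring(termindices, 2, length(termindices) - 2)"), ",").cast("array<int>"))
    .withColumn("termweights",
                F.split(F.expr("substring(termweights, 2, length(termweights) - 2)"), ",").cast("array<double>"))
)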