brand,month,price
abc,jan, - \n
abc,feb, 29 \n
abc,mar, - \n
abc,apr, 45.23 \n
bb-c,jan, 34 \n
bb-c,feb,-35 \n
bb-c,mar, - \n
sum(price) groupby(brand)
Challenges:
1) The CSV file is available in an Excel sheet
2) Trim the extra spaces in price
3) Replace the non-numeric values (" - ") with zero
4) Sum the price grouped by brand
-- read the csv file into df1
-- changed the price data type from string to double
-- created a registered temp table on df1
-- but still facing issues with the trim and with
-- replacing the non-numeric values with zero
Can someone please help me with this issue?
Theoretical explanation:
A simple use of sqlContext to read the csv file, the regexp_replace built-in function to clean the strings (followed by a cast to double), and a groupBy with sum aggregation should get you your desired output.
Programmatic explanation:
//1)csv file available in xl sheet
val df = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", true)
.load("path to the csv file")
df.show(false)
//+-----+-----+------+
//|brand|month|price |
//+-----+-----+------+
//|abc |jan | - |
//|abc |feb | 29 |
//|abc |mar | - |
//|abc |apr | 45.23|
//|bb-c |jan | 34 |
//|bb-c |feb |-35 |
//|bb-c |mar | - |
//+-----+-----+------+
import org.apache.spark.sql.functions._
//2)trim the extra spaces in price
//3)replace non-numeric(" - ") with zero
df.withColumn("price", regexp_replace(col("price"), "[\\s+a-zA-Z- :]", "").cast("double"))
//4)sum the price group by brand
.groupBy("brand")
.agg(sum("price").as("price_sum"))
.show(false)
//+-----+-----------------+
//|brand|price_sum |
//+-----+-----------------+
//|abc |74.22999999999999|
//|bb-c |69.0 |
//+-----+-----------------+
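Since the question also mentions a registered temp table on df1, here is a rough SQL sketch of the same logic (a sketch only; the table name df1 is taken from the question, and the regex simply strips every character that is not a digit or a dot before casting, which matches the output above):
// SQL sketch over the temp table mentioned in the question (table name "df1" assumed)
df.registerTempTable("df1")
sqlContext.sql(
  """SELECT brand,
    |       SUM(CAST(regexp_replace(trim(price), '[^0-9.]', '') AS double)) AS price_sum
    |FROM df1
    |GROUP BY brand""".stripMargin)
  .show(false)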
I hope the answer is helpful.
Related
Persons who have the same salary should come in the same record, and their names should be separated by ",".
Input Dataset:
Expected Dataset:
You can achieve this as below:
Apply a groupBy on Salary and use collect_list to club all the Name values inside an ArrayType().
Further, you can choose to convert it to a StringType using concat_ws.
Data Preparation
import pandas as pd
from io import StringIO
from pyspark.sql import functions as F
# `sql` below is an existing SQLContext / SparkSession
df = pd.read_csv(StringIO("""Name,Salary
abc,100000
bcd,20000
def,100000
pqr,20000
xyz,30000
""")
,delimiter=','
).applymap(lambda x: str(x).strip())
sparkDF = sql.createDataFrame(df)
sparkDF.groupby("Salary").agg(F.collect_list(F.col("Name")).alias('Name')).show(truncate=False)
+------+----------+
|Salary|Name |
+------+----------+
|100000|[abc, def]|
|20000 |[bcd, pqr]|
|30000 |[xyz] |
+------+----------+
Concat WS
sparkDF.groupby("Salary").agg(F.concat_ws(",",F.collect_list(F.col("Name"))).alias('Name')).show(truncate=False)
+------+-------+
|Salary|Name |
+------+-------+
|100000|abc,def|
|20000 |bcd,pqr|
|30000 |xyz |
+------+-------+
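For reference, since the other answers in this thread use Scala, the same aggregation could look roughly like this (a sketch, assuming a Scala DataFrame named sparkDF with the same Name and Salary columns):
// Hedged Scala sketch of the same aggregation (sparkDF with "Name" and "Salary" columns assumed)
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}
sparkDF.groupBy("Salary")
  .agg(concat_ws(",", collect_list(col("Name"))).as("Name"))
  .show(false)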
I have a SQL query that generates a table with the below format:
|sex |country|popularity|
|null |null | x |
|null |value | x |
|value|null | x |
|value|null | x |
|null |value | x |
|value|value | x |
The value for the sex column could be woman or man.
The value for country could be Italy, England, US, etc.
x is an int.
Now I would like to save four files based on the (value, null) combinations. So file1 consists of (value, value) for the columns sex and country,
file2 consists of (value, null), file3 consists of (null, value) and file4 consists of
(null, null).
I have searched a lot but couldn't find any useful info. I have also tried the below:
val df1 = data.withColumn("combination",concat(col("sex") ,lit(","), col("country")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
but I receive more files, because this command generates files based on all the actual (sex, country) values.
Same with the below:
val df1 = data.withColumn("combination",concat(col("sex")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
Is there any command similar to partitionBy that partitions by the (value, null) pattern of the pair rather than by the actual column values?
You can convert the columns into Booleans depending on whether they are null or not, and concat them into a string, which will look like "true_true", "true_false", etc.
import org.apache.spark.sql.functions.{col, concat, lit}
val dfWithType = df.withColumn("coltype", concat(col("sex").isNull, lit("_"), col("country").isNull))
dfWithType.coalesce(1)
  .write
  .partitionBy("coltype")
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("output")
DataFrame:
+-------------------+-------------------+
| Desc| replaced_columns|
+-------------------+-------------------+
|India is my Country|India is my Country|
| Delhi is my Nation| Delhi is my Nation|
| I Love India\Delhi| I Love India\Delhi|
| I Love USA| I Love USA|
|I am stay in USA\SA|I am stay in USA\SA|
+-------------------+-------------------+
"Desc" column is the original column name from DataFrame. replace_columns is after we are doing some transformation. In desc column , i need to replace "India\Delhi" value to "-". I tried below code.
dataDF.withColumn("replaced_columns", regexp_replace(dataDF("Desc"), "India\\Delhi", "-")).show()
It is NOT replacing it with the "-" string. How can I do that?
I found 3 approaches for the above question:
val approach1 = dataDF.withColumn("replaced_columns", regexp_replace(col("Desc"), "\\\\", "-")).show() // four backslashes in the Scala source are needed to match a single literal backslash
val approach2 = dataDF.select($"Desc",translate($"Desc","\\","-").as("replaced_columns")).show()
The below one is for the specific requirement you asked about above (replacing the "India\Delhi" value in the Desc column with "-"):
val approach3 = dataDF
  .withColumn("replaced_columns", when(col("Desc").like("%Delhi"),
    regexp_replace(col("Desc"), "\\\\", "-")).otherwise(col("Desc")))
  .show()
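For completeness, if the goal is to replace only the exact "India\Delhi" substring with "-" (rather than every backslash), a sketch could look like the following; a single literal backslash ends up as four backslashes in the Scala source (two for the Scala string literal, doubled again for the regex engine):
// Sketch: replace only the literal "India\Delhi" with "-"
import org.apache.spark.sql.functions.{col, regexp_replace}
dataDF.withColumn("replaced_columns", regexp_replace(col("Desc"), "India\\\\Delhi", "-")).show(false)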
I'm trying to save a DataFrame to CSV partitioned by id; for that I'm using Spark 1.6 and Scala.
The function partitionBy("id") doesn't give me the right result.
My code is here:
validDf.write
.partitionBy("id")
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ";")
.mode("overwrite")
.save("path_hdfs_csv")
My DataFrame looks like:
-----------------------------------------
| ID | NAME | STATUS |
-----------------------------------------
| 1 | N1 | S1 |
| 2 | N2 | S2 |
| 3 | N3 | S1 |
| 4 | N4 | S3 |
| 5 | N5 | S2 |
-----------------------------------------
This code creates 3 default CSV partitions (part_0, part_1, part_2) that are not based on the column ID.
What I expect is to get a sub-directory or partition for each id.
Any help?
spark-csv in Spark 1.6 (and all Spark versions lower than 2) does not support partitioning.
Your code would work for Spark 2.0.0 and above.
For your Spark version, you will need to prepare the CSV lines first and save them as text (partitioning works for the text data source):
import org.apache.spark.sql.functions.{col, concat_ws}
val key = col("ID")
val concat_col = concat_ws(",", df.columns.map(c => col(c)): _*) // concat all cols into one string col
val final_df = df.select(col("ID"), concat_col) // dataframe with 2 columns: id and the csv string
final_df.write.partitionBy("ID").text("path_hdfs_csv") // save to hdfs
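If you really need CSV files with a header per id on Spark 1.6, a workaround (only a sketch, and it assumes the number of distinct ids is small enough to loop over on the driver) is to filter and write each id separately with spark-csv:
// Sketch: one spark-csv output directory per id (assumes few distinct ids)
import org.apache.spark.sql.functions.col
val ids = df.select("ID").distinct.collect.map(_.get(0))
ids.foreach { id =>
  df.filter(col("ID") === id)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .mode("overwrite")
    .save(s"path_hdfs_csv/ID=$id")
}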
Scala.
Spark.
IntelliJ IDEA.
I have a dataframe (multiple rows, multiple columns) from a CSV file.
And I want to map it to another specific set of columns.
I think I need a Scala class (not a case class, because the column count is > 22) or map()...
But I don't know how to convert it.
Example
a dataframe from CSV file.
----------------------
| No | price| name |
----------------------
| 1 | 100 | "A" |
----------------------
| 2 | 200 | "B" |
----------------------
The other set of columns:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a dataframe that has the other set of columns:
-----------------------------------------
| product_id | product_name | seller |
-----------------------------------------
| 1 | "A" | |
-----------------------------------------
| 2 | "B" | |
-----------------------------------------
If you already have a dataframe (e.g. old_df):
val new_df=old_df.withColumnRenamed("No","product_id").
withColumnRenamed("name","product_name").
drop("price").
withColumn("seller", ... )
Let's say your CSV file is "products.csv".
First you have to load it into Spark; you can do that using:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
Once the data is loaded, you will have all the column names in the dataframe df. As you mentioned, your columns will be "No", "price" and "name".
To change the name of a column, you just have to use the withColumnRenamed API of the dataframe:
val renamedDf = df.withColumnRenamed("No","product_id").
  withColumnRenamed("name","product_name")
Your renamedDf will have the column names you have assigned.