Preprocessing weird data in pyspark

I am working with a set of climate data that comes in a very strange layout and is difficult to work with. I decided to use pyspark because it is a large volume of data, with the idea of saving time.
The data format is .ascii/.text/.dat, whatever you want to call it, and the layout is as follows:
Date 1
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Value 9
Value 10
Value 11
Value 12
.
.
.
.
.
Value 101178
Date 2
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Value 9
Value 10
Value 11
Value 12
.
.
.
.
.
Value 101178
That is, the file is a sequence of blocks, each consisting of a date line followed by 101178 values laid out over 6 columns (16863 rows).
In case the explanation is not very clear, I attach a link to a small fragment of the file (the original file is >50 GB):
https://drive.google.com/file/d/1-aJRTWzpQ5lHyZgt-h7DuEY5GpYZRcUh/view?usp=sharing
My idea is to generate a matrix with the following structure:
Date 1      Date 2      ...  Date n
Value 1     Value 1.2   ...  Value 1.n
Value 2     Value 2.2   ...  Value 2.n
...
Value n     Value n.2   ...  Value n.n
I have tried to make the question as clear as possible. As I said, I am working with pyspark, so if anyone has a solution for doing this processing with that tool I would be very grateful.
Thank you all very much!

I managed to get very close to, but not exactly, your expected dataframe structure.
To test the code's output I made this dummy dataset to play with, because it's super hard to follow the long runs of numbers in your original dataset:
+--------------------------------------------+
|value |
+--------------------------------------------+
| 1990010100 0 0 24|
| 001 002 003 004 005 006 |
| 007 008 009 010 011 012 |
| 013 014 015 016 017 018 |
| 019 020 021 022 023 024 |
| 1990010101 0 0 24|
| 101 102 103 104 105 106 |
| 107 108 109 110 111 112 |
| 113 114 115 116 117 118 |
| 119 120 121 122 123 124 |
| 1990010102 0 0 24|
| 201 202 203 204 205 206 |
| 207 208 209 210 211 212 |
| 213 214 215 216 217 218 |
| 219 220 221 222 223 224 |
+--------------------------------------------+
And here is the complete code that I tested. The main idea is to mark each date block and its records so that they can be joined with each other.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
# I'm not sure what cluster you have, or whether you're going to run this on your
# local machine, so I added this piece here for your reference. The most important
# thing is that you provide enough resources to handle the entire dataset at once.
spark = (SparkSession
    .builder
    .master('local[*]')
    .appName('SO')
    .config('spark.driver.cores', '4')
    .config('spark.driver.memory', '4g')
    .config('spark.executor.cores', '4')
    .config('spark.executor.memory', '4g')
    .getOrCreate()
)

# load the raw "weird" data and apply some transformations to make it processable
df = (spark
    .read.text('t2.dat')
    .withColumn('id', F.row_number().over(W.orderBy(F.lit(1))))  # make an ID per row, very important to mark the dates
    .withColumn('value', F.trim(F.col('value')))                 # trim spaces before and after
    .withColumn('value', F.split(F.col('value'), r'\s+'))        # turn a single line into individual values
)
# +------------------------------+---+
# |value |id |
# +------------------------------+---+
# |[1990010100, 0, 0, 24] |1 |
# |[001, 002, 003, 004, 005, 006]|2 |
# |[007, 008, 009, 010, 011, 012]|3 |
# |[013, 014, 015, 016, 017, 018]|4 |
# |[019, 020, 021, 022, 023, 024]|5 |
# |[1990010101, 0, 0, 24] |6 |
# |[101, 102, 103, 104, 105, 106]|7 |
# |[107, 108, 109, 110, 111, 112]|8 |
# |[113, 114, 115, 116, 117, 118]|9 |
# |[119, 120, 121, 122, 123, 124]|10 |
# |[1990010102, 0, 0, 24] |11 |
# |[201, 202, 203, 204, 205, 206]|12 |
# |[207, 208, 209, 210, 211, 212]|13 |
# |[213, 214, 215, 216, 217, 218]|14 |
# |[219, 220, 221, 222, 223, 224]|15 |
# +------------------------------+---+
# Extracting available date blocks
date_df = (df
    .where(F.size('value') == 4)
    .withColumn('grp', ((F.col('id') - 1) / 5).cast('int'))  # replace 5 with 16864 when running on your actual dataset
    .select('grp', F.col('value')[0].alias('date'))
)
date_df.show(10, False)
# +---+----------+
# |grp|date      |
# +---+----------+
# |0  |1990010100|
# |1  |1990010101|
# |2  |1990010102|
# +---+----------+
# Extracting available value blocks
value_df = (df
    .where(F.size('value') == 6)
    .withColumn('grp', ((F.col('id') - 1) / 5).cast('int'))  # replace 5 with 16864 when running on your actual dataset
    .groupBy('grp')
    .agg(F.collect_list('value').alias('value'))
    .withColumn('value', F.flatten('value'))
)
# +---+------------------------------------------------------------------------------------------------------------------------+
# |grp|value |
# +---+------------------------------------------------------------------------------------------------------------------------+
# |0 |[001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015, 016, 017, 018, 019, 020, 021, 022, 023, 024]|
# |1 |[101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124]|
# |2 |[201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224]|
# +---+------------------------------------------------------------------------------------------------------------------------+
# join them together and "explode" array to different rows
joined_df = (date_df
    .join(value_df, on=['grp'])
    .withColumn('value', F.explode('value'))
)
# +---+----------+-----+
# |grp|date |value|
# +---+----------+-----+
# |0 |1990010100|001 |
# |0 |1990010100|002 |
# |0 |... |... |
# |0 |1990010100|023 |
# |0 |1990010100|024 |
# |1 |1990010101|101 |
# |1 |1990010101|102 |
# |1 |... |... |
# |1 |1990010101|123 |
# |1 |1990010101|124 |
# |2 |1990010102|201 |
# |2 |1990010102|202 |
# |2 |... |... |
# |2 |1990010102|223 |
# |2 |1990010102|224 |
# +---+----------+-----+
# now joined_df is basically holding your entire dataset; it's totally up to you how you want to handle it.
# One option is to save each date as a partition of one Hive table.
# Another option is to save each date as a separate file.
# It's just for the sake of simplicity whenever you need to read that painful dataset again.
for date in [row['date'] for row in date_df.collect()]:
    (joined_df
        .where(F.col('date') == date)
        .write
        .mode('overwrite')
        .csv(date)
    )
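If you'd rather go the Hive-table/partition route mentioned above instead of looping, a single partitioned write avoids launching one Spark job per date. A minimal sketch, assuming a placeholder output path (swap .parquet() for .saveAsTable() if you want an actual Hive table):
# Alternative to the per-date loop above: one partitioned write.
# 'output/t2_by_date' is a placeholder path.
(joined_df
    .write
    .mode('overwrite')
    .partitionBy('date')
    .parquet('output/t2_by_date')
)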

After your comment I have revised my answer; note that I have used pandas and its dataframe, but this should translate directly to Spark if necessary.
Also, I think your data is corrupt, as the last array is not the correct length. MY CODE DOES NOT HANDLE THIS, so you will need to use the value captured by the regex, e.g. expected_values = m.group(4), to deal with it.
WARNING: The := operator requires Python 3.8+, but you can rework that if you need to.
NOTES:
The header of each 'section' is captured by the regex and is used to form the column name.
Split the file on the date row:
import pandas as pd
import numpy as np
import re
from pathlib import Path

header = re.compile(r"^\s+(\d{10})\s+(\d*)\s+(\d*)\s+(\d*)$")

df = pd.DataFrame()
with open("t2.dat", "r") as ifp:
    rows = []
    date = None
    count = 0
    while line := ifp.readline():
        # Get the header and start a new file
        if m := header.match(line):
            # We have a header so convert to array then flatten to a vector
            # before appending to the dataframe.
            if rows and date:
                df[date] = np.array(rows, dtype=float).flatten(order="C")
                rows = []
            # Get the header
            date = m.group(1)
        else:
            rows.append(line.strip().split())
    print(f"Appending the last {len(rows)*len(rows[0])} values")
    df[date] = np.array(rows, dtype=float).flatten(order="C")
Output in abbreviated form (there is 1 column per date, with 101178 rows):
1990010100 1990010101 1990010102 1990010103 1990010104
0 10.4310 10.0490 9.7269 9.3801 9.0038
1 10.3110 9.9225 9.5431 9.1758 8.7899
2 10.2290 9.8144 9.4156 9.0304 8.6171
3 10.1500 9.7154 9.2999 8.8890 8.4713
4 9.8586 9.3968 8.9156 8.4328 7.9764
... ... ... ... ... ...
101173 -1.5511 -1.5472 -1.5433 -1.5251 -1.5399
101174 -1.8659 -1.8719 -1.8485 -1.8481 -1.8325
101175 -1.9044 -1.8597 -1.7963 -1.8094 -1.7653
101176 -2.0564 -2.0404 -1.9779 -1.9893 -1.9521
101177 -2.1842 -2.2840 -2.3216 -2.2794 -2.2655
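As flagged above, the loop does not verify block lengths. A rough sketch of how the captured header field could be used for that check, assuming the fourth captured field really is the expected value count for the block (append_block is a hypothetical helper; inside the loop you would capture expected_values = m.group(4) alongside the date and call this helper wherever df[date] is assigned):
import numpy as np

# Hypothetical helper, assuming m.group(4) holds the expected number of
# values per block; skip, pad, or raise as suits your data.
def append_block(df, date, rows, expected_values):
    n_values = sum(len(r) for r in rows)
    if expected_values is not None and n_values != int(expected_values):
        print(f"Skipping block {date}: {n_values} values, expected {expected_values}")
        return
    df[date] = np.array(rows, dtype=float).flatten(order="C")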

Related

Hive/pyspark: pivot non numeric data for huge dataset

I'm looking for a way to pivot an input dataset with the below structure in Hive or PySpark. The input contains more than half a billion records, and for each emp_id there are up to 8 rows and 5 columns possible, so I will end up with 40 columns. I did refer to this link, but there the pivoted output column already exists in the dataset, while in mine it does not. I also tried this link, but the SQL becomes very large (not that it matters). Is there a better way to do this where the resultant pivoted columns are concatenated with the rank?
input
emp_id, dept_id, dept_name, rank
1001, 101, sales, 1
1001, 102, marketing, 2
1002, 101, sales, 1
1002, 102, marketing, 2
expected output
emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001, 101, sales, 102, marketing
1002, 101, sales, 102, marketing
You can use aggregations after pivoting; that also gives you the option to rename the columns, like so:
import pyspark.sql.functions as F

(df
    .groupBy('emp_id')
    .pivot('rank')
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    )
    .show()
)
# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# | 1002| 101| sales| 102| marketing|
# | 1001| 101| sales| 102| marketing|
# +------+---------+-----------+---------+-----------+
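If you need the exact column names from the expected output (dept_id_1 rather than 1_dept_id), a follow-up rename works; a quick sketch, assuming the pivoted result above has been assigned to pivoted:
import pyspark.sql.functions as F

# Sketch: turn '1_dept_id' style names into 'dept_id_1' style.
renamed = pivoted.select(
    'emp_id',
    *[F.col(c).alias('_'.join(c.split('_', 1)[::-1]))
      for c in pivoted.columns if c != 'emp_id']
)
renamed.show()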

how to identify digital chars as date from a string column in spark dataframe

I would like to extract the digit characters from a string in a column of a Spark dataframe.
e.g.
id val (string)
58 [dttg] 201805_mogtca_onvt
91 20050221_frcas
17 201709 dcsevas
I need:
id a_date year month
58 201805 2018 05
91 20050221 2005 02
17 201709 2017 09
I am trying:
df.withColumn('date', DF.to_date(F.col('val').isdigit() # how to get digital chars ?
You should start by removing all non-numeric characters with regexp_replace, for instance:
df.withColumn("a_date", regexp_replace($"val", "[^0-9]", ""))
Then, since you seem to have a different time format in each row, the easiest way is to use substrings:
df.withColumn("a_date", regexp_replace($"val", "[^0-9]", ""))
.withColumn("year", substring($"a_date", 0, 4))
.withColumn("month", substring($"a_date", 5, 2))
.drop("val")
INPUT
+---+-------------------------+
|id |val |
+---+-------------------------+
|58 |[dttg] 201805_mogtca_onvt|
|91 |20050221_frcas |
|17 |201709 dcsevas |
+---+-------------------------+
OUTPUT
+---+--------+----+-----+
|id |a_date |year|month|
+---+--------+----+-----+
|58 |201805 |2018|05 |
|91 |20050221|2005|02 |
|17 |201709 |2017|09 |
+---+--------+----+-----+
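The answer above is in Scala; since the question's attempt looks like PySpark, here is a rough PySpark sketch of the same idea (regexp_replace plus substring), assuming the input dataframe is df with columns id and val:
from pyspark.sql import functions as F

# Sketch: strip non-digit characters, then slice out year and month.
result = (df
    .withColumn('a_date', F.regexp_replace('val', '[^0-9]', ''))
    .withColumn('year', F.substring('a_date', 1, 4))
    .withColumn('month', F.substring('a_date', 5, 2))
    .drop('val')
)
result.show(truncate=False)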

Merge rows into List for similar values in SPARK

Spark version 2.0.2.6 and Scala Version 2.11.11
I have the following csv file:
sno name number
1 hello 1
1 hello 2
2 hai 12
2 hai 22
2 hai 32
3 how 43
3 how 44
3 how 45
3 how 46
4 are 33
4 are 34
4 are 45
4 are 44
4 are 43
I want output as:
sno name number
1 hello [1,2]
2 hai [12,22,32]
3 how [43,44,45,46]
4 are [33,34,44,45,43]
Order of the elements in the list is not important.
Using dataframes or RDDs, whichever is appropriate.
Thanks
Tom
import org.apache.spark.sql.functions._
scala> df.groupBy("sno", "name").agg(collect_list("number").alias("number")).sort("sno").show()
+---+-----+--------------------+
|sno| name| number|
+---+-----+--------------------+
| 1|hello| [1, 2]|
| 2| hai| [12, 22, 32]|
| 3| how| [43, 44, 45, 46]|
| 4| are|[33, 34, 45, 44, 43]|
+---+-----+--------------------+
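A PySpark sketch of the same aggregation, in case you prefer the Python API (assuming the CSV is already loaded into df with those column names):
from pyspark.sql import functions as F

# Sketch: collect the numbers into one list per (sno, name) pair.
(df
    .groupBy('sno', 'name')
    .agg(F.collect_list('number').alias('number'))
    .orderBy('sno')
    .show()
)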

Spark Dataframe Group by having New Indicator Column

I need to group by the "KEY" column and check whether the "TYPE_CODE" column has both "PL" and "JL" values; if so, I need to add an Indicator column with "Y", else "N".
Example :
//Input Values
val values = List(List("66","PL"),
  List("67","JL"), List("67","PL"), List("67","PO"),
  List("68","JL"), List("68","PO")).map(x => (x(0), x(1)))
import spark.implicits._
//created a dataframe
val cmc = values.toDF("KEY","TYPE_CODE")
cmc.show(false)
------------------------
KEY |TYPE_CODE |
------------------------
66 |PL |
67 |JL |
67 |PL |
67 |PO |
68 |JL |
68 |PO |
-------------------------
Expected Output :
For each "KEY", If it has "TYPE_CODE" has both PL & JL then Y
else N
-----------------------------------------------------
KEY |TYPE_CODE | Indicator
-----------------------------------------------------
66 |PL | N
67 |JL | Y
67 |PL | Y
67 |PO | Y
68 |JL | N
68 |PO | N
---------------------------------------------------
For example,
67 has both PL & JL - So "Y"
66 has only PL - So "N"
68 has only JL - So "N"
One option:
1) collect TYPE_CODE as list;
2) check if it contains the specific strings;
3) then flatten the list with explode:
(cmc.groupBy("KEY")
  .agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
  .withColumn("Indicator",
    when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y").otherwise("N"))
  .withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68| JL| N|
| 68| PO| N|
| 67| JL| Y|
| 67| PL| Y|
| 67| PO| Y|
| 66| PL| N|
+---+---------+---------+
Another option:
Group by KEY and use agg to create two separate indicator columns (one for JL and one for PL), then calculate the combined indicator
join with the original DataFrame
Altogether:
val indicators = cmc.groupBy("KEY").agg(
  sum(when($"TYPE_CODE" === "PL", 1).otherwise(0)) as "pls",
  sum(when($"TYPE_CODE" === "JL", 1).otherwise(0)) as "jls"
).withColumn("Indicator", when($"pls" > 0 && $"jls" > 0, "Y").otherwise("N"))

val result = cmc.join(indicators, "KEY")
  .select("KEY", "TYPE_CODE", "Indicator")
This might be slower than #Psidom's answer, but might be safer - collect_list might be problematic if you have a huge number of matches for a specific key (that list would have to be stored in a single worker's memory).
EDIT:
In case the input is known to be unique (i.e. JL / PL would only appear once per key, at most), indicators could be created using simple count aggregation, which is (arguably) easier to read:
val indicators = cmc
  .where($"TYPE_CODE".isin("PL", "JL"))
  .groupBy("KEY").count()
  .withColumn("Indicator", when($"count" === 2, "Y").otherwise("N"))

How to find distinct values for different groups on a dataframe in Pyspark and recode the dataframe

I have a big dataframe; it contains groups of people, which are flagged in the variable called "groups".
What I need to do with this dataframe now is to present it in a more meaningful way.
For example, for group 148, this is the table below:
df.select('gender','postcode','age','groups','bought').filter(df.groups==148).show()
+------+--------+---+----------+----------+
|gender|postcode|age| groups|bought |
+------+--------+---+----------+----------+
| 0| 2189| 25| 148|car |
| 0| 2192| 34| 148|house |
| 1| 2193| 37| 148|car |
| 1| 2194| 38| 148|house |
| 1| 2196| 54| 148|laptop |
| 1| 2197| 27| 148|laptop |
| 0| 2198| 44| 148|laptop |
+------+--------+---+----------+----------+
Gender has both 0 and 1, so all the people in this group will be changed to "person" (if all were 1, then "female"; if all were 0, then "male"; that is the rule, but it does not apply to this group).
For postcodes, the lowest in this group is 2189 and the highest is 2198, so each case will change to [2189 - 2198].
For age, the lowest is 25 and the highest is 54, so it will be [25-54].
For bought, I need to check which items have been bought; these are [car, house, laptop].
So, this group recoding will end up as:
+------+-------------+--------+----------+------------------+
|gender| postcode| age| groups| bought |
+------+-------------+--------+----------+------------------+
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
|person|[2189 - 2198]| [25-54]| 148|[car,house,laptop]|
+------+-------------+--------+----------+------------------+
and that will be done for all groups in the dataframe.
Any ideas?
Here I found something similar, but it is in Scala.
Thank you in advance!
Hope this helps!
import pyspark.sql.functions as f
from pyspark.sql.types import StringType
df = sc.parallelize([
    [0, 2189, 25, 148, 'car'],
    [0, 2192, 34, 148, 'house'],
    [1, 2193, 37, 148, 'car'],
    [1, 2194, 38, 148, 'house'],
    [1, 2196, 54, 148, 'laptop'],
    [1, 2197, 27, 148, 'laptop'],
    [0, 2198, 44, 148, 'laptop']
]).toDF(('gender', 'postcode', 'age', 'groups', 'bought'))
df.show()

df1 = df.groupBy("groups").agg(f.collect_set("bought")).withColumnRenamed("collect_set(bought)", "bought")
df2 = df.groupBy("groups").agg(f.min("age"), f.max("age")). \
    withColumn("age", f.concat(f.col("min(age)"), f.lit("-"), f.col("max(age)"))).select("groups", "age")
df3 = df.groupBy("groups").agg(f.min("postcode"), f.max("postcode")). \
    withColumn("postcode", f.concat(f.col("min(postcode)"), f.lit("-"), f.col("max(postcode)"))).select("groups", "postcode")

def modify_values(l):
    # collect_set does not guarantee order, so compare on the sorted list
    if sorted(l) == [0, 1]:
        return "person"
    elif l == [0]:
        return "male"
    else:
        return "female"

modified_val = f.udf(modify_values, StringType())
df4 = df.groupBy("groups").agg(f.collect_set("gender")).withColumn("gender", modified_val("collect_set(gender)")).select("groups", "gender")

merged_df = df1.join(df2, "groups").join(df3, "groups").join(df4, "groups")
merged_df.show()
Output is:
+------+--------------------+-----+---------+------+
|groups| bought| age| postcode|gender|
+------+--------------------+-----+---------+------+
| 148|[laptop, house, car]|25-54|2189-2198|person|
+------+--------------------+-----+---------+------+
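If you want the row-per-person layout from your expected output (one recoded row for every original row), one way is to join the per-group summary back onto the original frame; a small sketch:
# Sketch: attach the per-group recoded values back to every original row.
recoded = (df
    .select('groups')
    .join(merged_df, 'groups')
    .select('gender', 'postcode', 'age', 'groups', 'bought')
)
recoded.show(truncate=False)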
Don't forget to let us know if it solved your problem