Issue when I try to connect two columns by a third column - networkx

I have a CSV file that contains the following:

id_1     id_2      date
FD345    MER3345   06/12/2020
I want to connect id_1 -> id_2, with the date on the edge between them: id_1 should have a direct edge to id_2, and that edge should carry the date as its label.
So what I did is something like this:
import networkx as nx
import pandas as pd
df = pd.read_csv('data.csv')
G = nx.from_pandas_edgelist(df, source = "id_1", target = "id_2", edge_attr='date', create_using=nx.DiGraph())
But this way it did not seem to connect id_1 and id_2 by the date; it only stored the date as an edge attribute. Or maybe I am not understanding it correctly, because when I print G.edges() the output looks like this:
('UCU6lC', 'vOGN5A'), ........
It connects the nodes, but I am not sure whether the edges carry the date or not.
Thank you for clearing this up for me.

You need to use draw_networkx_edge_labels() to draw edge labels.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

df = pd.DataFrame({'id_1': ['FD345'],
                   'id_2': ['MER3345'],
                   'date': ['06/12/2020']})

G = nx.from_pandas_edgelist(df, source="id_1", target="id_2", create_using=nx.DiGraph())

# Compute the layout once and reuse it, so the edge labels line up with the drawn nodes.
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos)
nx.draw_networkx_edge_labels(G, pos,
                             edge_labels=dict(zip(G.edges, df['date'].tolist())),
                             verticalalignment='center_baseline')
plt.show()
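As a side note on the original concern: if you keep edge_attr='date' in from_pandas_edgelist (as in the question), the date really is stored on each edge and can be inspected without drawing anything. A minimal sketch, reusing the column names from the question:

import pandas as pd
import networkx as nx

df = pd.DataFrame({'id_1': ['FD345'],
                   'id_2': ['MER3345'],
                   'date': ['06/12/2020']})

# Keep the 'date' column as an edge attribute.
G = nx.from_pandas_edgelist(df, source='id_1', target='id_2',
                            edge_attr='date', create_using=nx.DiGraph())

print(list(G.edges(data=True)))       # [('FD345', 'MER3345', {'date': '06/12/2020'})]
print(G['FD345']['MER3345']['date'])  # '06/12/2020'

# The same mapping can also feed draw_networkx_edge_labels directly.
edge_labels = nx.get_edge_attributes(G, 'date')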

Related

GeoPandas to PostgreSQL

I have a GeoPandas dataframe gdf_shp with the columns 'name_', 'kwd_ypes', 'geom', 'fid'. My question is how to set fid as the primary key, and whether there is a geometry type in the sqlalchemy.types package. Here is my code:
import psycopg2
import geopandas as gpd
from sqlalchemy import create_engine, types

shapefile = 'file.shp'
gdf_shp = gpd.read_file(shapefile, encoding='windows-1253')
gdf_shp['fid'] = gdf_shp.index
gdf_shp.rename(columns={'NAME': 'name_',
                        'KWD_YPES': 'kwd_ypes',
                        'geometry': 'geom'}, inplace=True)
gdf_shp.set_geometry('geom', inplace=True)
engine = create_engine(...)
gdf_shp.to_postgis(name=table_name,
                   con=engine,
                   dtype={'name_': types.VARCHAR(),
                          'kwd_ypes': types.INTEGER(),
                          'geom': 'geometry',
                          'fid': types.INTEGER(primary_key=True)},
                   if_exists='replace')
I was also wondering if I could somehow skip the
gdf_shp.set_geometry('geom', inplace=True)
line by setting the column to some sort of geometry type in the dtype argument of to_postgis.
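No answer is attached here, but as a hedged sketch of one common pattern: sqlalchemy.types itself has no geometry type (that lives in geoalchemy2.Geometry), and to_postgis already writes the active geometry column as a PostGIS geometry via GeoAlchemy2, so the geometry does not need to be declared in dtype. to_postgis also does not create a primary key, but one can be added afterwards with a plain ALTER TABLE. The connection string and table name below are placeholders:

from sqlalchemy import create_engine, text, types

engine = create_engine('postgresql://user:password@host:5432/dbname')  # placeholder

# The active geometry column ('geom') is written as a PostGIS geometry automatically.
gdf_shp.to_postgis(name='my_table',                  # hypothetical table name
                   con=engine,
                   if_exists='replace',
                   dtype={'name_': types.VARCHAR(),
                          'kwd_ypes': types.INTEGER(),
                          'fid': types.INTEGER()})

# to_postgis does not set a primary key, so add one in a follow-up statement.
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE my_table ADD PRIMARY KEY (fid);'))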

PySpark parent-child recursion on the same DataFrame

I have the following two DataFrames, which store diagnostics and part changes for helicopter parts.
The diagnostic DataFrame stores the dates on which maintenance activities were carried out. The part change DataFrame stores all part removals for all the helicopter parts: the parent (rotor) and the children (turbofan, axle, module).
What I am trying to achieve is quite complex: based on the diagnostic df, I want the first removal for the same part along with its parents, rolled all the way up, so that I get the helicopter serial number at that maintenance date.
Here is the initial code to generate the sample datasets:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql import Window as W
diagnostic_list = [
    {"SN": "tur26666", "PN": "turbofan", "maintenance": "23/05/2016"}
]
part_change_list = [
    {"SN": "tur26666", "PN": "turbofan",   "removal": "30/03/2019", "Parent_SN": "com26666", "Parent_PN": "compressor"},
    {"SN": "tur26666", "PN": "turbofan",   "removal": "30/06/2016", "Parent_SN": "com26666", "Parent_PN": "compressor"},
    {"SN": "com26666", "PN": "compressor", "removal": "13/03/2019", "Parent_SN": "rot71777", "Parent_PN": "rotorcraft"},
    {"SN": "com26666", "PN": "compressor", "removal": "29/06/2016", "Parent_SN": "rot26666", "Parent_PN": "rotorcraft"},
    {"SN": "rot26666", "PN": "rotorcraft", "removal": "31/12/2019", "Parent_SN": "OYAAA",    "Parent_PN": "helicopter"},
    {"SN": "rot26666", "PN": "rotorcraft", "removal": "24/06/2016", "Parent_SN": "OYHZZ",    "Parent_PN": "helicopter"},
]

spark = SparkSession.builder.getOrCreate()
diagnostic_df = spark.createDataFrame(Row(**x) for x in diagnostic_list)
part_change_df = spark.createDataFrame(Row(**x) for x in part_change_list)
diagnostic_df.show()
+--------+--------+-----------+
| SN| PN|maintenance|
+--------+--------+-----------+
|tur26666|turbofan| 23/05/2016|
+--------+--------+-----------+
part_change_df.show()
+--------+----------+----------+---------+----------+
| SN| PN| removal|Parent_SN| Parent_PN|
+--------+----------+----------+---------+----------+
|tur26666| turbofan|30/03/2019| com26666|compressor|
|tur26666| turbofan|30/06/2016| com26666|compressor|
|com26666|compressor|13/03/2019| rot71777|rotorcraft|
|com26666|compressor|29/06/2016| rot26666|rotorcraft|
|rot26666|rotorcraft|31/12/2019| OYAAA|helicopter|
|rot26666|rotorcraft|24/06/2016| OYHZZ|helicopter|
+--------+----------+----------+---------+----------+
I was able to get the first removal for the child turbofan with the code below:
working_df = (
    diagnostic_df.join(part_change_df, ["SN", "PN"], how="inner")
    .filter(F.col("removal") >= F.col("maintenance"))
    .withColumn(
        "rank",
        F.rank().over(
            W.partitionBy([F.col(col) for col in ["SN", "PN", "maintenance"]]).orderBy(
                F.col("removal")
            )
        ),
    )
    .filter(F.col("rank") == 1)
    .drop("rank")
)
working_df.show()
+--------+--------+-----------+----------+---------+----------+
| SN| PN|maintenance| removal|Parent_SN| Parent_PN|
+--------+--------+-----------+----------+---------+----------+
|tur26666|turbofan| 23/05/2016|30/06/2016| com26666|compressor|
+--------+--------+-----------+----------+---------+----------+
How can I create a for loop or a recursive loop over part_change_df that takes each parent of the first child, makes it the next child, and gets the first removal information after the first child's (the turbofan's) maintenance date, so that I end up with results like this?
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
| SN| PN|maintenance| removal|Parent_SN| Parent_PN|Parent_removal|next_Parent_SN|next_Parent_PN|next_Parent_removal|next_next_Parent_PN|next_next_Parent_SN|
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
|tur26666|turbofan| 23/05/2016|30/06/2016| com26666|compressor| 29/06/2016| rot26666| rotorcraft| 24/06/2016| helicopter| OYHZZ|
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
I could hardcode each parent and join the working DataFrame with the part change DataFrame, but the problem is that I don't know exactly how many levels of parents a child will have.
The ultimate goal is to take the child maintenance date and roll up all the way to the final parent removal date and the helicopter serial number:
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
| SN| PN|maintenance| removal|next_Parent_removal|next_next_Parent_PN|next_next_Parent_SN|
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
|tur26666|turbofan| 23/05/2016|30/06/2016| 24/06/2016| helicopter| OYHZZ|
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
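No answer is attached to this question, but here is a rough sketch of one iterative approach, using only the schema from the sample data above: repeatedly treat the current parent as the next child, join it against part_change_df, keep the earliest removal on or after the maintenance date, and stop once a level yields no parent. The MAX_LEVELS bound is an assumption, and dates are compared as strings exactly as in the question (converting them with to_date first would be more robust):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

def earliest_after(df, removal_col):
    # Keep, per (SN, PN, maintenance), the earliest removal on/after the maintenance date.
    w = W.partitionBy("SN", "PN", "maintenance").orderBy(F.col(removal_col))
    return (df.filter(F.col(removal_col) >= F.col("maintenance"))
              .withColumn("rank", F.rank().over(w))
              .filter(F.col("rank") == 1)
              .drop("rank"))

# Level 0: the diagnosed child itself (this is working_df from the question).
result = earliest_after(diagnostic_df.join(part_change_df, ["SN", "PN"], "inner"), "removal")

parent_sn, parent_pn = "Parent_SN", "Parent_PN"
MAX_LEVELS = 10  # assumed upper bound on the depth of the part hierarchy

for level in range(1, MAX_LEVELS + 1):
    # Rename part_change_df so its child columns line up with the current parent columns.
    step = (part_change_df
            .withColumnRenamed("SN", parent_sn)
            .withColumnRenamed("PN", parent_pn)
            .withColumnRenamed("removal", "Parent_removal_{}".format(level))
            .withColumnRenamed("Parent_SN", "Parent_SN_{}".format(level))
            .withColumnRenamed("Parent_PN", "Parent_PN_{}".format(level)))
    joined = earliest_after(result.join(step, [parent_sn, parent_pn], "inner"),
                            "Parent_removal_{}".format(level))
    if joined.count() == 0:
        break  # the top of the hierarchy (the helicopter) has no parent of its own
    result = joined
    parent_sn = "Parent_SN_{}".format(level)
    parent_pn = "Parent_PN_{}".format(level)

result.show(truncate=False)

With the sample data this stops after two extra levels and keeps the compressor and rotorcraft removals plus the helicopter serial number on a single row; the flattened columns can then be trimmed to match the desired final output.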

Exporting PySpark Dataframe to Azure Data Lake Takes Forever

The code below ran perfectly well on the standalone version of PySpark 2.4 on macOS (Python 3.7) when the input data was small (around 6 GB). However, when I ran the code on an HDInsight cluster (HDI 4.0, i.e. Python 3.5, PySpark 2.4, 4 worker nodes each with 64 cores and 432 GB of RAM, 2 head nodes each with 4 cores and 28 GB of RAM, 2nd-generation data lake) with larger input data (169 GB), the last step, writing the data to the data lake, took forever to complete (I killed it after 24 hours of execution). Given that HDInsight is not popular in the cloud computing community, I could only find posts complaining about low speed when writing a dataframe to S3. Some suggested repartitioning the dataset, which I did, but it did not help.
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
    print("Creating Spark Environment...")
    spark = SparkSession.builder.appName("Menu").getOrCreate()
    print("Spark Environment Created!")

    print("Working on Checkpoint1...")
    orders = spark.read.csv(order_path)
    orders.createOrReplaceTempView("orders")
    orders = spark.sql(
        "SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
    )
    orders.createOrReplaceTempView("orders")
    beer = spark.read.csv(
        beer_path,
        header=True
    )
    beer.createOrReplaceTempView("beer")
    beer = spark.sql(
        """
        SELECT
            order_id AS beer_order_id,
            in_menu_id AS beer_in_menu_id,
            '-999' AS beer_in_menu_name
        FROM beer
        """
    )
    beer.createOrReplaceTempView("beer")
    orders = spark.sql(
        """
        WITH orders_beer AS (
            SELECT *
            FROM orders
            LEFT JOIN beer
            ON orders.in_menu_id = beer.beer_in_menu_id
        )
        SELECT
            order_id,
            in_menu_id,
            CASE
                WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
                WHEN beer_in_menu_name IS NULL THEN in_menu_name
            END AS menu_name
        FROM orders_beer
        """
    )
    print("Checkpoint1 Completed!")

    print("Working on Checkpoint2...")
    corpus = spark.read.csv(
        corpus_path,
        header=True
    )
    keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
    orders = orders.withColumn(
        "keyword",
        regexp_extract(
            "menu_name",
            "(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
            0
        )
    )
    orders.createOrReplaceTempView("orders")
    orders = spark.sql("""
        SELECT order_id, in_menu_id, keyword
        FROM orders
        WHERE keyword != ''
    """)
    orders.createOrReplaceTempView("orders")
    orders = orders.groupBy("order_id").agg(
        collect_set("keyword").alias("items")
    )
    print("Checkpoint2 Completed!")

    print("Working on Checkpoint3...")
    fpGrowth = FPGrowth(
        itemsCol="items",
        minSupport=0,
        minConfidence=0
    )
    model = fpGrowth.fit(orders)
    print("Checkpoint3 Completed!")

    print("Working on Checkpoint4...")
    frequency = model.freqItemsets
    frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
    frequency = frequency.withColumn(
        "items",
        array_remove("items", "-999")
    )
    frequency = frequency.filter(size(col("items")) > 0)
    frequency = frequency.orderBy(asc("items"), desc("freq"))
    frequency = frequency.dropDuplicates(["items"])
    frequency = frequency.withColumn(
        "antecedent",
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(frequency.items)
    )
    frequency.createOrReplaceTempView("frequency")
    lift = model.associationRules
    lift = lift.drop("confidence")
    lift = lift.filter(col("lift") > LIFT_THRESHOLD)
    lift = lift.filter(
        udf(
            lambda x: x == ["-999"], BooleanType()
        )(lift.consequent)
    )
    lift = lift.drop("consequent")
    lift = lift.withColumn(
        "antecedent",
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(lift.antecedent)
    )
    lift.createOrReplaceTempView("lift")
    result = spark.sql(
        """
        SELECT lift.antecedent, freq AS frequency, lift
        FROM lift
        INNER JOIN frequency
        ON lift.antecedent = frequency.antecedent
        """
    )
    print("Checkpoint4 Completed!")

    print("Writing Result to Data Lake...")
    result.repartition(1024).write.mode("overwrite").parquet(output_path)
    print("All Done!")


def main():
    work(
        order_path="...",   # 169.1 GB of txt
        beer_path="...",    # 4.9 GB of csv
        corpus_path="...",  # 210 KB of csv
        output_path="final_result.parquet"
    )


if __name__ == "__main__":
    main()
I first thought this was caused by the file format parquet. However, when I tried csv, I met with the same problem. I tried result.count() to see how many rows the table result has. It took forever to get the row number, just like writing the data to the data lake.
There was a suggestion to use broadcast hash join instead of the default sort-merge join if a large dataset is joined with a small dataset. I thought it was worth trying because the smaller samples in the pilot study told me the row number of frequency is roughly 0.09% of that of lift (See the query below if you have difficulty tracking frequency and lift).
SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent
With that in mind, I revised my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
    print("Creating Spark Environment...")
    spark = SparkSession.builder.appName("Menu").getOrCreate()
    print("Spark Environment Created!")

    print("Working on Checkpoint1...")
    orders = spark.read.csv(order_path)
    orders.createOrReplaceTempView("orders")
    orders = spark.sql(
        "SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
    )
    orders.createOrReplaceTempView("orders")
    beer = spark.read.csv(
        beer_path,
        header=True
    )
    beer.createOrReplaceTempView("beer")
    beer = spark.sql(
        """
        SELECT
            order_id AS beer_order_id,
            in_menu_id AS beer_in_menu_id,
            '-999' AS beer_in_menu_name
        FROM beer
        """
    )
    beer.createOrReplaceTempView("beer")
    orders = spark.sql(
        """
        WITH orders_beer AS (
            SELECT *
            FROM orders
            LEFT JOIN beer
            ON orders.in_menu_id = beer.beer_in_menu_id
        )
        SELECT
            order_id,
            in_menu_id,
            CASE
                WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
                WHEN beer_in_menu_name IS NULL THEN in_menu_name
            END AS menu_name
        FROM orders_beer
        """
    )
    print("Checkpoint1 Completed!")

    print("Working on Checkpoint2...")
    corpus = spark.read.csv(
        corpus_path,
        header=True
    )
    keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
    orders = orders.withColumn(
        "keyword",
        regexp_extract(
            "menu_name",
            "(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
            0
        )
    )
    orders.createOrReplaceTempView("orders")
    orders = spark.sql("""
        SELECT order_id, in_menu_id, keyword
        FROM orders
        WHERE keyword != ''
    """)
    orders.createOrReplaceTempView("orders")
    orders = orders.groupBy("order_id").agg(
        collect_set("keyword").alias("items")
    )
    print("Checkpoint2 Completed!")

    print("Working on Checkpoint3...")
    fpGrowth = FPGrowth(
        itemsCol="items",
        minSupport=0,
        minConfidence=0
    )
    model = fpGrowth.fit(orders)
    print("Checkpoint3 Completed!")

    print("Working on Checkpoint4...")
    frequency = model.freqItemsets
    frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
    frequency = frequency.withColumn(
        "antecedent",
        array_remove("items", "-999")
    )
    frequency = frequency.drop("items")
    frequency = frequency.filter(size(col("antecedent")) > 0)
    frequency = frequency.orderBy(asc("antecedent"), desc("freq"))
    frequency = frequency.dropDuplicates(["antecedent"])
    frequency = frequency.withColumn(
        "antecedent",
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(frequency.antecedent)
    )
    lift = model.associationRules
    lift = lift.drop("confidence")
    lift = lift.filter(col("lift") > LIFT_THRESHOLD)
    lift = lift.filter(
        udf(
            lambda x: x == ["-999"], BooleanType()
        )(lift.consequent)
    )
    lift = lift.drop("consequent")
    lift = lift.withColumn(
        "antecedent",
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(lift.antecedent)
    )
    result = lift.join(
        frequency.hint("broadcast"),
        ["antecedent"],
        "inner"
    )
    print("Checkpoint4 Completed!")

    print("Writing Result to Data Lake...")
    result.repartition(1024).write.mode("overwrite").parquet(output_path)
    print("All Done!")


def main():
    work(
        order_path="...",   # 169.1 GB of txt
        beer_path="...",    # 4.9 GB of csv
        corpus_path="...",  # 210 KB of csv
        output_path="final_result.parquet"
    )


if __name__ == "__main__":
    main()
The code ran perfectly well with the same sample data on my macOS machine and, as expected, took less time (34 seconds vs. 26 seconds). Then I decided to run the code on HDInsight with the full datasets. In the last step, which is writing the data to the data lake, the task failed and I was told Job cancelled because SparkContext was shut down. I am rather new to big data and have no idea what this means. Posts on the internet said there could be many reasons behind it. Whatever method I should use, how can I optimize my code so that I get the desired output in the data lake within a bearable amount of time?
I would try several things, ordered by the amount of energy they require:
Check if the ADL storage is in the same region as your HDInsight cluster.
Add calls to df = df.cache() after heavy calculations, or even write the dataframes to and read them back from intermediate storage between these calculations.
Replace your UDFs with "native" Spark code, since UDFs are one of the well-known performance anti-patterns in Spark (see the sketch below).
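As an illustration of that last point (not part of the original answer): the lambda-based UDFs from the question can be expressed with built-in functions available since Spark 2.4, which keeps the work inside the JVM. A sketch against the question's own columns:

from pyspark.sql import functions as F

# "|".join(sorted(x)) as native Spark: sort the array, then join it with a separator.
frequency = frequency.withColumn(
    "antecedent", F.array_join(F.array_sort("items"), "|")
)

# x == ["-999"] as native Spark: compare against a literal one-element array.
lift = lift.filter(F.col("consequent") == F.array(F.lit("-999")))

lift = lift.withColumn(
    "antecedent", F.array_join(F.array_sort("antecedent"), "|")
)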
I figured it out after five days of struggle. Here are the approaches I took to optimize the code. The execution time dropped from more than 24 hours to around 10 minutes. Code optimization is really, really important.
As David Taub pointed out, use df.cache() after heavy computation or before feeding the data to the model. I used df.cache().count(), since .cache() on its own is lazily evaluated, but the following .count() forces an evaluation of the entire dataset.
Use flashtext instead of regular expressions to extract keywords (a rough sketch follows this list). This greatly improves performance.
Be careful with joins / merges. They can get extremely slow due to data skew. Always think about ways to avoid unnecessary joins.
Set minSupport for FPGrowth. This significantly reduces the time spent on model.freqItemsets.
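A rough sketch of the first two points, not taken from the original post (flashtext has no native Spark integration, so the keyword extraction below is wrapped in a UDF, which trades the regex cost for a much cheaper per-row lookup):

from flashtext import KeywordProcessor
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Point 1: cache and force materialisation so later stages reuse the cached data.
orders = orders.cache()
orders.count()  # .cache() alone is lazy; count() forces the full evaluation

# Point 2: flashtext keyword extraction instead of a giant alternation regex.
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(keywords)  # 'keywords' as built in the question

def extract_first_keyword(menu_name):
    found = keyword_processor.extract_keywords(menu_name or "")
    return found[0] if found else ""

extract_udf = udf(extract_first_keyword, StringType())
orders = orders.withColumn("keyword", extract_udf(col("menu_name")))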

Window functions (lag, lead) implementation in PySpark?

Below is the T-SQL code. I tried to convert it to PySpark using window functions, which is also attached below.
case
when eventaction = 'IN' and lead(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid) in('IN','OUT')
then lead(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid)
else ''
end as next_action
The PySpark code below gives an error when using the window function lead:
Tgt_df = Tgt_df.withColumn((('Lead', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'IN' )|
('1', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'OUT')
, (lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate"))).otherwise('').alias("next_action")))
But it's not working. What to do!?
The withColumn method should be used as df.withColumn('name_of_col', value_of_column); that is why you get an error.
From your T-SQL request, the corresponding PySpark code should be:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Order by eventdate, then transactionid, to match the T-SQL window.
w = Window.partitionBy("barcode").orderBy("barcode", "eventdate", "transactionid")

Tgt_df = Tgt_df.withColumn(
    'next_action',
    F.when((F.col('eventaction') == 'IN') & (F.lead('eventaction', 1).over(w).isin(['IN', 'OUT'])),
           F.lead('eventaction', 1).over(w)
    ).otherwise('')
)

Select from pandas dataframe using multiple columns

I have the following dataframe:

              Date  Adj Close
Ticker
ZTS     2014-12-22      43.41
ZTS     2014-12-19      43.51
ZTS     2014-12-18      43.15
ZTS     2014-12-17      41.13

There are many more tickers than just ZTS; it continues for many more rows.
I would like to select using both the Ticker and the Date but I cannot figure out how. I would like to select as if I were saying in SQL:
Select 'Adj Close' from prices where Ticker = 'ZTS' and 'Date' = '2014-12-22'
Thanks!
The following should work:
df[(df['Date'] == '2014-12-22') & (df.index == 'ZTS')]['Adj Close']
Here we have to use the array & operator rather than and, and you must use parentheses because of operator precedence.
>>> import pandas
>>> from pandas import *
>>> L = [['2014-12-22',43.41],['2014-12-19',43.51],['2014-12-18',43.15], ['2014-12-17',41.13]]
>>> C = ['ZTS', 'ZTS','ZTS','ZTS']
>>> df = DataFrame(L, columns=['Date','Adj Close'], index=[C])
>>> df
Date Adj Close
ZTS 2014-12-22 43.41
ZTS 2014-12-19 43.51
ZTS 2014-12-18 43.15
ZTS 2014-12-17 41.13
>>> D1 = df.ix['ZTS'][df['Date']=='2014-12-22']['Adj Close']
>>> D1
ZTS 43.41
I've figured out how to separate the Ticker into a subset dataframe, then index by Date, and then select by date. But I'm still wondering if there's a more efficient way.
cur_df = df.loc['A']
cur_df = cur_df.set_index(['Date'])
print(cur_df['Adj Close']['2014-11-20'])
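Not part of the original thread, but a small sketch of an alternative that avoids chained indexing: put both Ticker and Date into a MultiIndex and select with a single .loc call (the frame below is rebuilt from the sample data, with Ticker as an ordinary column first):

import pandas as pd

# Sample frame shaped like the question's data.
df = pd.DataFrame({'Ticker': ['ZTS', 'ZTS', 'ZTS', 'ZTS'],
                   'Date': ['2014-12-22', '2014-12-19', '2014-12-18', '2014-12-17'],
                   'Adj Close': [43.41, 43.51, 43.15, 41.13]})

# Index by both keys; a single .loc lookup then mirrors the SQL WHERE clause.
prices = df.set_index(['Ticker', 'Date']).sort_index()
print(prices.loc[('ZTS', '2014-12-22'), 'Adj Close'])  # 43.41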