I'm trying to use Google Datalab, but I can't write a CSV to GCS (Google Cloud Storage) properly.
import pandas as pd
from pandas import DataFrame
from io import BytesIO
df = DataFrame({"a":[1,2],"b":1})
print(df)
>> | a | b
>> 0 | 1 | 1
>> 1 | 2 | 1
On Stack Overflow, I found this command:
%storage write --object gs://my-bucket/data/test.csv --variable df
But if I use this command, reading the data back doesn't work well, because the data is separated by spaces instead of commas, and it includes the index.
%storage read --object gs://my-bucket/data/test.csv --variable test_file
df2 = pd.read_csv(BytesIO(test_file))
print(df2)
>> | a b
>> 0 | 0 1 1
>> 1 | 1 2 1
I want to write the CSV without the index (like df.to_csv('test_file.csv', index=False)).
How should I do this? Please advise.
Can you try the following?
import pandas as pd
from io import BytesIO
df = pd.DataFrame({"a":[1,2],"b":1})
# Write the CSV locally without the index, then copy it to GCS with gsutil.
df.to_csv('text.csv', index=False)
!gsutil cp 'text.csv' 'gs://path-to-your-bucket/test.csv'
# Read it back through the %gcs magic to check the result.
%gcs read --object gs://path-to-your-bucket/test.csv --variable test_file
df2 = pd.read_csv(BytesIO(test_file))
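Alternatively, you may be able to skip the temporary local file by serializing the frame to a string first and writing that string directly (a sketch, assuming your Datalab version offers a %gcs write command mirroring the %gcs read used above):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": 1})
# Serialize without the index; the magic then writes the string variable to GCS.
csv_str = df.to_csv(index=False)
%gcs write --variable csv_str --object gs://path-to-your-bucket/test.csv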
Related
I am trying to import an unstructured CSV from Data Lake storage into Databricks, and I want to read the entire content of this file:
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type A A
Date 1/1/2022 B
Time 0:00:00 A
X 1 m OK 1 2 3 B
Y - A
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type B C
Date 1/1/2022 D
Time 0:00:00 C
X 1 m OK 1 2 3 D
Y - C
1. Method 1: I tried reading the first line as a header
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load('abfss://xyz/sample.csv')
I get only this:
2. Method 2: I skipped reading the header
No improvement:
3. Method 3: I defined a custom schema
The query returns no result:
If you know the schema ahead of time, it should be possible to read the CSV file and drop the malformed data.
See this as an example:
name_age.csv
Hello
name,age
aj,19
Hello
name,age
test,20
And the code to read this would be:
>>> from pyspark.sql.types import StringType,IntegerType,StructField,StructType
>>> schema=StructType([StructField("name",StringType(),True),StructField("age",IntegerType(),True)])
>>> df=spark.read.csv("name_age.csv",sep=",",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+----+---+
|name|age|
+----+---+
| aj| 19|
|test| 20|
+----+---+
Other helpful link: Remove first and last row from the text file in pyspark
I'm a newbie in PySpark, and I want to translate the following NLP-based feature code, which is plain Python, into PySpark.
#Python
N = 2
n_grams = lambda input_text: 0 if pd.isna(input_text) else len(set([input_text[character_index:character_index+N] for character_index in range(len(input_text)-N+1)]))
#quick test
n_grams_example = 'zhang1997' #output = ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']
n_grams(n_grams_example) # 8
I checked the NGram Python docs and I tried the following unsuccessfully:
#PySpark
from pyspark.ml.feature import NGram
ndf = spark.createDataFrame([
(0, ["zhang1997"])], ["id", "words"])
ndf.show()
+---+-----------+
| id| words|
+---+-----------+
| 0|[zhang1997]|
+---+-----------+
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(ndf)
ngramDataFrame.select("ngrams").show(truncate=False)
+------+
|ngrams|
+------+
|[] |
+------+
Am I missing something here? I get an empty [] as a result instead of ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']. I'm interested in getting the length of the n-gram set, which is 8 in this case.
Update: I found a way to do this without using NGram, but I'm not happy with its performance.
def n_grams(input_text):
    if input_text is None:
        return 0
    N = 2
    return len(set([input_text[character_index:character_index+N] for character_index in range(len(input_text)-N+1)]))
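For what it's worth, NGram operates on an array of tokens, so a one-element array like ["zhang1997"] has no 2-grams, which is why the transform returns []. If you want to apply the plain Python function to a DataFrame column, a minimal sketch using a UDF looks like this (the column name text and the count column name are illustrative assumptions):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Wrap the plain Python n_grams function as a UDF that returns the distinct-bigram count.
n_grams_udf = F.udf(n_grams, IntegerType())

sdf = spark.createDataFrame([(0, "zhang1997")], ["id", "text"])
sdf.withColumn("n_gram_count", n_grams_udf(F.col("text"))).show()  # expect 8 for "zhang1997"
A plain Python UDF will not be fast; it is just the most direct translation of the function above.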
My input dataframe df is:
valx valy
1: 600060 09283744
2: 600131 96733110
3: 600194 01700001
and I want to create a graph treating the above two columns as an edge list; my output should then contain all the vertices of the graph with their membership.
I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results.
My output should look like below (it's basically all valx and valy values under V1, as vertices, with their membership info under V2):
V1 V2
600060 1
96733110 1
01700001 3
I tried the following:
import networkx as nx
import pandas as pd
filelocation = r'Pathtodataframe df csv'
Panda_edgelist = pd.read_csv(filelocation)
g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
g2 = g.to_undirected(g)
list(g.nodes)
I'm not sure if you are violating any rules here by asking the same question twice.
To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:
from graphframes import *
sc.setCheckpointDir("/tmp/connectedComponents")
l = [
    ('600060', '09283744'),
    ('600131', '96733110'),
    ('600194', '01700001')
]
columns = ['valx', 'valy']
#this is your input dataframe
edges = spark.createDataFrame(l, columns)
#graphframes requires two dataframes: an edge and a vertex dataframe.
#the edge dataframe has to have at least two columns labeled with src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()
#the vertex dataframe requires at least one column labeled with id
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()
g = GraphFrame(vertices, edges)
Output:
+------+--------+
| src| dst|
+------+--------+
|600060|09283744|
|600131|96733110|
|600194|01700001|
+------+--------+
+--------+
| id|
+--------+
| 600060|
| 600131|
| 600194|
|09283744|
|96733110|
|01700001|
+--------+
You wrote in the comments of your other question that the community detection algorithm doesn't matter to you currently. Therefore I will pick connected components:
result = g.connectedComponents()
result.show()
Output:
+--------+------------+
| id| component|
+--------+------------+
| 600060|163208757248|
| 600131| 34359738368|
| 600194|884763262976|
|09283744|163208757248|
|96733110| 34359738368|
|01700001|884763262976|
+--------+------------+
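To get the V1/V2 layout from the question, the large component ids can be renumbered into small consecutive membership values, for example with a window function (a sketch; the exact numbers assigned to each group may differ from the question's example):
from pyspark.sql import functions as F, Window

# Map each component id to a dense rank so the memberships become 1, 2, 3, ...
w = Window.orderBy('component')
membership = result.select(F.col('id').alias('V1'), F.dense_rank().over(w).alias('V2'))
membership.show()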
Other community detection algorithms (like LPA) can be found in the user guide.
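If you wanted to try one of those instead, a minimal label propagation sketch on the same GraphFrame g would be:
# Label propagation; maxIter controls the number of iterations.
lpa_result = g.labelPropagation(maxIter=5)
lpa_result.show()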
In Python I am doing this to replace a leading 0 in the column phone with 91.
But how do I do it in PySpark?
The con dataframe is:
id phone1
1 088976854667
2 089706790002
The output I want is:
1 9188976854667
2 9189706790002
# Replace leading Zeros in a phone number with 91
con.filter(regex='[_]').replace('^0','91',regex=True)
You are looking for the regexp_replace function. This function takes three parameters:
column name
pattern
replacement
from pyspark.sql import functions as F
columns = ['id', 'phone1']
vals = [(1, '088976854667'),(2, '089706790002' )]
df = spark.createDataFrame(vals, columns)
df = df.withColumn('phone1', F.regexp_replace('phone1',"^0", "91"))
df.show()
Output:
+---+-------------+
| id| phone1|
+---+-------------+
| 1|9188976854667|
| 2|9189706790002|
+---+-------------+
I'm running the following code on databricks:
dataToShow = jDataJoined.\
    withColumn('id', monotonically_increasing_id()).\
    filter(
        (jDataJoined.containerNumber == 'SUDU8108536')).\
    select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This will give me tabular data like this:
Now I would like to display a line graph with returnTemperature and supplyTemperature as categories.
As far as I understand, the display method in Databricks wants the category as the second argument, so basically what I should have is something like:
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the SQL functions create_map and explode:
#creates a example df
from pyspark.sql import functions as F
l1 = [(1,25.0,27.0),(2,24.0,28.0)]
df = spark.createDataFrame(l1,['id','returnTemperature','supplyTemperature'])
#creates a map column which contains the values of the returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
    F.lit('returnTemperature'), df.returnTemperature,
    F.lit('supplyTemperature'), df.supplyTemperature
))
#The explode function creates a new row for each element of the map
df = df.select('id',F.explode(df.mapCol).alias('temperatureCategory','value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
|  1|  returnTemperature| 25.0|
|  1|  supplyTemperature| 27.0|
|  2|  returnTemperature| 24.0|
|  2|  supplyTemperature| 28.0|
+---+-------------------+-----+
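With the data in this long format you can then plot it directly in the notebook (a usage sketch; display() is the built-in Databricks helper, and the grouping by temperatureCategory is chosen in the chart's plot options):
# Render the dataframe; pick a line chart with 'id' as key, 'temperatureCategory' as series and 'value' as values.
display(df)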