I am able to create a .csv file using a Talend job; how can I convert the .csv to a .parquet file using the tSystem component? (Talend)

I have a Talend job that creates a .csv file, and now I want to convert it to .parquet format using Talend v6.5.1. The only option I can think of is the tSystem component calling a Python script from the local directory where the .csv lands temporarily. I know I can convert this easily using pandas or PySpark, but I am not sure the same code will work from tSystem in Talend. Can you please provide suggestions or instructions?
Code:
import pandas as pd
DF = pd.read_csv("Path")
DF.to_parquet("Path")  # to_parquet is a DataFrame method; this writes the parquet file

If you have an external script on your file system, you can try
"python \"myscript.py\" "
Here is a link on the Talend forum regarding this problem:
https://community.talend.com/t5/Design-and-Development/how-to-execute-a-python-script-file-with-an-argument-using/m-p/23975#M3722

I was able to resolve the problem with the following steps:
import pandas as pd
import pyarrow as pa  # pandas uses pyarrow as its parquet engine
import numpy as np
import sys

# The file name (without extension) is passed by the caller as the first argument
filename = sys.argv[1]
test = pd.read_csv("C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".csv")
test.to_parquet("C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".parquet")

Related

How to read .shp files in Databricks from the FileStore?

I'm using Databricks Community Edition, and I saved a .shp file in the FileStore, but when I try to read it I get this error:
DriverError: /dbfs/FileStore/tables/World_Countries.shp: No such file or directory
This is my code:
import geopandas as gpd
gdf = gpd.read_file("/dbfs/FileStore/tables/World_Countries.shp")
I also tried
gdf = gpd.read_file("/FileStore/tables/World_Countries.shp")
You should first verify that the file path is correct and that the file exists in the specified location. You can use the dbutils.fs.ls command to list the contents of the directory and check if the file is present. You can do this using:
dbutils.fs.ls("dbfs:/FileStore/path/to/your/file.shp")
Also, make sure that you have the correct permissions to access the file. In Databricks, you may need to be an administrator or have the correct permissions to access the file.
Try to read the file using the full path, including the file extension:
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
There are then several methods to read such files in Databricks (note that these rely on a geospatial Spark extension being installed, since plain Spark has no built-in shapefile reader):
1.
from pyspark.sql.functions import *
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
df.show()
2.
df = spark.read.shape(file_path)
3.
from pyspark.sql import functions as F
from pyspark.sql.types import *
from shapely.geometry import Point
geo_df = df.select("shape").withColumn("geometry", F.shape_to_geometry("shape")).drop("shape").select("geometry")
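If no geospatial extension is available, another route (a sketch, not from the original answers; paths are assumptions) is to copy the shapefile and its companion files out of DBFS and read them with geopandas, since the /dbfs FUSE mount can be unavailable on Community Edition:
import geopandas as gpd

# A shapefile needs its .shx/.dbf (and usually .prj) companions next to the .shp,
# so copy all of them to local disk before reading (paths are assumptions)
for ext in ["shp", "shx", "dbf", "prj"]:
    dbutils.fs.cp("dbfs:/FileStore/tables/World_Countries." + ext,
                  "file:/tmp/World_Countries." + ext)

gdf = gpd.read_file("/tmp/World_Countries.shp")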

PySpark ModuleNotFoundError when importing a custom package

Context: I'm running a script on Azure Databricks, and I'm using imports to import functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow') \
    .config("spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as its argument a string that was read as a PySpark DataFrame, like this:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install the old_file in the cluster for this to work? If so, how would this work and will the package update if I change old_file again?
Thanks
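A minimal sketch of one common fix (not from the original thread; the folder path is an assumption): put the directory that contains old_file.py on sys.path on the driver, and ship the file to the executors with addPyFile so worker-side code can import it too.
import sys

# Directory containing old_file.py (path is an assumption)
module_dir = "/Workspace/Repos/my_repo/my_project"
sys.path.append(module_dir)

# Also distribute the module to the executors
spark.sparkContext.addPyFile(module_dir + "/old_file.py")

from old_file import x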

Reading a Python file from Scala

I'm trying to work with a file stored in HDFS, but when I try to access this file, I get an error: No such file or directory.
Can you tell me how to access files in HDFS correctly?
UPD:
The author of the answer pointed me in the right direction.
As a result, this is how I execute the Python script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# this is hello.py
# import pandas as pd
import sys

for line in sys.stdin:
    print('Hello, ' + line)
And Scala application:
spark.sparkContext.addFile(getClass.getResource("hello.py").getPath, true)
val test = spark.sparkContext.parallelize(List("Body!")).repartition(1)
val piped = test.pipe(SparkFiles.get("./hello.py"))
val c = piped.collect()
c.foreach(println)
Output: Hello, Body!
Now I have to think about whether, as a cluster user, I can install pandas on workers.
I think you should try directly referencing the external file rather than attempting to download it to your Spark driver just to upload it again:
spark.sparkContext.addFile(s"hdfs://$srcPy")

How to read .pkl file in pyspark

I have a dictionary saved in .pkl format using the following code in Python 3.x:
import pickle as cpick

OutputDirectory = "My data file path"
with open("".join([OutputDirectory, 'data.pkl']), mode='wb') as fp:
    cpick.dump(data_dict, fp, protocol=cpick.HIGHEST_PROTOCOL)
I want to read this file in PySpark. Can you suggest how to do that? Currently I'm using Spark 2.0 and Python 2.7.13.
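A minimal sketch of one way to do this (not from the original thread; the path is a placeholder): pull the pickled bytes through Spark and rebuild the dictionary on the driver.
import pickle

# binaryFiles yields (path, bytes) pairs; take the bytes of the single .pkl file
raw_bytes = spark.sparkContext.binaryFiles("path/to/data.pkl").values().first()
data_dict = pickle.loads(raw_bytes)

# Optionally turn the dictionary into a DataFrame of key/value rows
# (assumes the values are simple types Spark can infer)
df = spark.createDataFrame(list(data_dict.items()), ["key", "value"])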

How to import a .py file in Google Colaboratory?

I want to simplify my code, so I made a utils.py, but the Google Colaboratory working directory is "/content". I read other questions, but they did not solve my problem.
In Google's Colab notebook, How do I call a function from a Python file?
%%writefile example.py
def f():
    print 'This is a function defined in a Python source file.'
# Bring the file into the local Python environment.
execfile('example.py')
f()
This is a function defined in a Python source file.
It looks like this just uses def() inline.
With this approach I always have to write the code in a cell, but I want to write this instead:
import example
example.f()
A sample that may be what you want:
!wget https://raw.githubusercontent.com/tensorflow/models/master/samples/core/get_started/iris_data.py -P local_modules -nc
import sys
sys.path.append('local_modules')
import iris_data
iris_data.load_data()
I have also had this problem recently.
I addressed the issue with the following steps, though it's not a perfect solution.
from google.colab import files

# Upload the file through the picker and write it into the working directory
src = list(files.upload().values())[0]
open('util.py', 'wb').write(src)
import util
This code should work with Python 3:
from google.colab import drive
import importlib.util
# Mount your drive. It will be at this path: "/content/gdrive/My Drive/"
drive.mount('/content/gdrive')
# Load your module
spec = importlib.util.spec_from_file_location("YOUR_MODULE_NAME", "/content/gdrive/My Drive/utils.py")
your_module_name = importlib.util.module_from_spec(spec)
spec.loader.exec_module(your_module_name)
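A hypothetical usage line, assuming utils.py defines a function f():
# Call a function from the dynamically loaded module (f is a hypothetical name)
your_module_name.f()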
import importlib.util
import sys
from google.colab import drive
drive.mount('/content/gdrive')
# To add a directory with your code into a list of directories
# which will be searched for packages
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')
import example
This works for me.
Use this if you are outside the content folder. Hope this helps!
import sys
sys.path.insert(0,'/content/my_project')
from example import *
STEP 1. I created a folder 'common_module' as shown in the image:
STEP 2. Called the required class from my Colab code cell:
import sys
sys.path.append('/content/common_module/')
from DataPreProcessHelper import DataPreProcessHelper as DPPHelper
My class file 'DataPreProcessHelper.py' looks like this:
Add the path of the 'sample.py' file to the system paths:
import sys
sys.path.append('drive/codes/')
import sample