How do I update the source file name in Talend

I have the source file on HDFS and I want to write the output file with a new column that holds the name of the source file for each row. My Talend job looks like:
tHDFSGet --> tInputFilePositional --> tmap --> tfileoutputfile
Please help me get the file name for each row in the new column.

I used tHDFSList to get the file path and then used:
StringHandling.RIGHT(StringHandling.LEFT(((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH")),StringHandling.LEN(((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH")))+6),7)
This trimmed the file path down to just the file name.
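A shorter alternative, assuming the path separator is "/" and reusing the same tHDFSList_2_CURRENT_FILEPATH variable, is to cut the path at the last separator; this is a plain Java expression, so it can go straight into a tMap variable or output column:

((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH"))
    .substring(((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH")).lastIndexOf("/") + 1)

Unlike the fixed-width LEFT/RIGHT combination above, this does not depend on the file name always being exactly seven characters long.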

Related

Talend - how to configure tFileInputDelimited to not throw an error when the file is not found

Good day,
I am using tFileInputDelimited in Talend Data Studio to read a txt file and get some values from it.
The input file name contains the day, something like the following:
checksum_150123.txt
This file is only created in the last few steps before the job ends, so it does not exist yet when the job starts.
Thus, on the job's first run each day there is no file, and tFileInputDelimited throws a file-not-found error:
C:\LandingZone\jx\checksum_180123.txt (The system cannot find the file specified)
[ERROR] 14:13:35 my_track.my_precheck_registration_0_1.DL_PRECHECK_REGISTRATION- CollectCheckSum_1_tFileInputDelimited_1 - C:\LandingZone\jx\checksum_180123.txt (The system cannot find the file specified)
I have a requirement to not show this error. May I know how I can configure this?
For that I recommend you use the tFileExist component and then check its Exists variable (((Boolean)globalMap.get("tFileExist_1_EXISTS")), for example) in a Run if trigger.
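As a minimal sketch, assuming the component is named tFileExist_1 and that the date part of the file name follows a ddMMyy pattern (both assumptions), the two expressions involved could look like this:

// "File name" of tFileExist_1: today's expected checksum file (date pattern assumed)
"C:/LandingZone/jx/checksum_" + TalendDate.formatDate("ddMMyy", TalendDate.getCurrentDate()) + ".txt"

// Condition of the "Run if" trigger leading into the read subjob
((Boolean)globalMap.get("tFileExist_1_EXISTS"))

With the trigger in place, tFileInputDelimited simply does not run on the day's first execution, so no file-not-found error is raised.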
Hope this answers your question

Error: `path` does not exist: ‘MIS_655_RS_T3_Wholesale_Customers’

I imported my Excel file into the R environment and saved the path by creating a new file in an R script. However, when I tried to check my directory and load the dataset, I received the following message: "Error: `path` does not exist: ‘MIS_655_RS_T3_Wholesale_Customers’"
What am I doing wrong here?
Thanks
Have you perhaps left out the file format of your dataset, e.g. .csv or .xlsx?
I suggest you first set the folder containing the file as your working directory; then the following code might help you with it.
Dat_customers <- readxl::read_excel("MIS_655_RS_T3_Wholesale_Customers.xlsx")

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load a parquet file using a manifest file and I'm getting the error below.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
Here is my manifest file:
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using the COPY command by specifying the file name directly:
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length. You can check it by executing an s3 ls command:
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (file size in bytes) should be the same as the content_length value in your manifest file.
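With the file above, the manifest therefore needs content_length 539 rather than 1000, e.g.:

{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 539
      }
    }
  ]
}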
I don't know why they are using this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
The only way I've gotten the parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather in my error logs, the COPY command for parquet (with a manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

Facing an issue adding a parallelism feature to an Avroconvertor application

I have an application which takes zip files and converts the text files inside them into Avro files.
It executes the process serially, in the following way:
1) Pick a zip file and unzip it
2) Take each text file under that zip file and read its content
3) Take the avsc files (schema files) from a different location
4) Merge the text file content with the respective schema, thereby producing an Avro file
But this process is done serially (one file at a time).
Now I want to execute this process in parallel. I have all the zip files under a folder:
folder/
A.zip
B.zip
C.zip
1) Under each zip file there are text files which consist only of data (no schema/headers).
My text file looks like this:
ABC 1234
XYZ 2345
EFG 3456
PQR 4567
2) Secondly, I have avsc files which hold the schema for these text files.
My avsc file looks like:
{
  "type": "record",
  "name": "Employee",
  "fields": [{"name": "Name", "type": "string"}, {"name": "EmployeeId", "type": "int"}]
}
As an experiment I used:
sc.parallelize(<paths of all the zip files in the folder>).map { zipFile => /* code of the Avro conversion */ }
but in the Avro conversion code (which runs inside SparkContext.parallelize) I have used Spark's SparkContext.newAPIHadoopFile, which also returns an RDD.
So when I run the application with these changes I get a Task not serializable issue.
I suspect this issue arises for two reasons:
1) I have used SparkContext inside SparkContext.parallelize
2) I have created an RDD inside an RDD.
org.apache.spark.SparkException: Task not serializable
Now I need the parallelism, but I am not sure whether there is an alternative approach to achieve it for this use case, or how to resolve this Task not serializable issue.
I am using Spark 1.6 and Scala version 2.10.5.
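A minimal sketch of one way to keep the parallelism without nesting Spark inside Spark, assuming the zip files are on a filesystem the executors can read and that only plain JDK and Avro library code runs inside the closure (all names here are illustrative, not taken from the original job):

import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Driver side: parallelize the *paths* of the zip files, not their contents.
val zipPaths = Seq("folder/A.zip", "folder/B.zip", "folder/C.zip")

sc.parallelize(zipPaths).foreach { zipPath =>
  // Executor side: no SparkContext, no newAPIHadoopFile, no nested RDD in here --
  // referencing those inside the closure is what triggers "Task not serializable".
  val zip = new ZipFile(zipPath)
  try {
    for (entry <- zip.entries().asScala if !entry.isDirectory) {
      val in = zip.getInputStream(entry)
      // read the text lines, pick the matching .avsc schema and write the Avro file
      // with Avro's GenericDatumWriter/DataFileWriter -- plain library calls only
      in.close()
    }
  } finally {
    zip.close()
  }
}

If the zip files live on HDFS instead, sc.binaryFiles("folder/*.zip") returns (path, PortableDataStream) pairs whose streams can be fed to java.util.zip.ZipInputStream, again without touching SparkContext inside the closure.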

How to access Hive data using Spark

I have a table, e.g. employee, stored as a text file in Hive, and I want to access it using Spark.
First I have set up the SQL context object using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then I have created the table:
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
Further, I was trying to load the contents of the text file by using:
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting the error:
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If I have to place the text file in the current directory where the spark-shell is running, how do I do that?
Do you run Hive on Hadoop?
Try using an absolute path... if this doesn't work, try loading your file to HDFS and then give the absolute path to your file (the HDFS location).
Try the steps below:
Start spark-shell in local mode, e.g. spark-shell --master local[*]
Give the full file path when loading the file,
e.g. file:///home/username/employee.txt
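Put together in the spark-shell, the load would look something like this (assuming the file really is at /home/username/employee.txt, as the error message suggests):

scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'file:///home/username/employee.txt' INTO TABLE employee")
scala> sqlContext.sql("SELECT * FROM employee").show()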