Is there a way to read an Excel file stored in a GCS bucket using Dataflow?
And I would also like to know if we can access the metadata of an object in GCS using Dataflow. If yes then how?
CSV files are often used to export data from Excel. They can be split and read line by line, so they are ideal for Dataflow. You can use TextIO.Read to pull in each line of the file, then parse each line as CSV (a short sketch follows the links below).
If you need a binary Excel format (.xls/.xlsx) instead, then I believe you would need to read in the entire file and use a library to parse it. I recommend using CSV files if you can.
As for reading the GCS metadata: I don't think you can do this with TextIO, but you could call the GCS API directly to access the metadata. If you only do this for a few files at the start of your program, it will work and not be too expensive. If you need to read many files this way, you'll be adding an extra RPC for each file.
Be careful not to read the same file multiple times; I suggest reading each file's metadata once and then writing the metadata out to a side input. Then in one of your ParDos you can access the side input for each file.
Useful links:
ETL & Parsing CSV files in Cloud Dataflow
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Read
https://cloud.google.com/dataflow/model/par-do#side-inputs
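As a rough illustration of the CSV approach, here is a minimal sketch in the (old) Dataflow SDK style. The class name, the bucket path, and the naive comma split are assumptions for the example; use a real CSV parser if your fields can contain quoted commas.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class CsvPipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read each line of the CSV file(s) stored in GCS
        PCollection<String> lines =
            p.apply(TextIO.Read.named("ReadCsv").from("gs://my-bucket/data/*.csv"));

        // Parse each line into fields (naive split, for illustration only)
        PCollection<String> firstColumn = lines.apply(ParDo.of(new DoFn<String, String>() {
            @Override
            public void processElement(ProcessContext c) {
                String[] fields = c.element().split(",");
                c.output(fields[0]);
            }
        }));

        p.run();
    }
}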
// Reads an Excel workbook from GCS with Apache POI and writes each sheet out as a local CSV file.
// Requires the google-cloud-storage client library and Apache POI.
private static final int BUFFER_SIZE = 64 * 1024;

private static void printBlob(com.google.cloud.storage.Storage storage, String bucketName, String blobPath) throws IOException, InvalidFormatException {
    try (ReadChannel reader = storage.reader(bucketName, blobPath)) {
        InputStream inputStream = Channels.newInputStream(reader);
        Workbook wb = WorkbookFactory.create(inputStream);
        StringBuilder data = new StringBuilder();
        for (int i = 0; i < wb.getNumberOfSheets(); i++) {
            String fName = wb.getSheetAt(i).getSheetName();
            File outputFile = new File("D:\\excel\\" + fName + ".csv");
            FileOutputStream fos = new FileOutputStream(outputFile);
            Sheet sheet = wb.getSheetAt(i);
            Iterator<Row> rowIterator = sheet.iterator();
            data.delete(0, data.length());
            while (rowIterator.hasNext()) {
                // Get each row
                Row row = rowIterator.next();
                data.append('\n');
                // Iterate through each column of the row
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    // Check the cell format
                    switch (cell.getCellType()) {
                        case Cell.CELL_TYPE_NUMERIC:
                            data.append(cell.getNumericCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_STRING:
                            data.append(cell.getStringCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_BOOLEAN:
                            data.append(cell.getBooleanCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_BLANK:
                            data.append(",");
                            break;
                        default:
                            data.append(cell + ",");
                    }
                }
            }
            fos.write(data.toString().getBytes());
            fos.close();
        }
    }
}
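Hypothetical usage of the method above; the bucket and object names are placeholders, not values from the original question:
// Build a GCS client and dump each sheet of the workbook to a local CSV file
Storage storage = StorageOptions.getDefaultInstance().getService();
printBlob(storage, "my-bucket", "reports/workbook.xlsx");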
You should be able to read the metadata of a GCS file by using the GCS API, but you would need the filenames. You can do this with a ParDo or another transform over a PCollection<String> that holds the filenames (see the sketch after this answer).
We don't have any default readers for Excel files. You can parse a CSV file by using a text input (see ETL & Parsing CSV files in Cloud Dataflow).
I'm not very knowledgeable about Excel and how the file format is stored. If you want to process one file at a time, you can use a PCollection<String> of filenames and then use a library to parse each Excel file in turn.
If an Excel file can be split into easily-parallelizable parts, I'd suggest you take a look at this doc (https://beam.apache.org/documentation/io/authoring-overview/). (If you are still using the Dataflow SDK, it should be similar.) It may be worth splitting into smaller chunks before reading to get more parallelization out of your pipeline. In this case you could use IOChannelFactory to read from the file.
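A minimal sketch of that per-filename pattern, assuming the old Dataflow SDK and the google-cloud-storage client; the bucket name and output format are illustrative, and the same shape works if you open and parse each file instead of just reading its metadata:
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

class ReadGcsMetadataFn extends DoFn<String, String> {
    private transient Storage storage;  // not serializable, so recreate it per bundle

    @Override
    public void startBundle(Context c) {
        storage = StorageOptions.getDefaultInstance().getService();
    }

    @Override
    public void processElement(ProcessContext c) {
        // Each element is an object name inside a known bucket ("my-bucket" is a placeholder)
        Blob blob = storage.get(BlobId.of("my-bucket", c.element()));
        // Emit "objectName,updateTimeMillis"; note this is one extra RPC per element
        c.output(c.element() + "," + blob.getUpdateTime());
    }
}
The resulting PCollection could then be turned into a side input with View.asIterable() or View.asMap(), as described in the side-inputs doc linked above.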
Related
I want to get the creation time of files in GCS. I used the code below:
println(Files
  .getFileAttributeView(Paths.get("gs://datalake-dev/mu/tpu/file.0450138"), classOf[BasicFileAttributeView])
  .readAttributes.creationTime)
The problem is that Paths.get normalizes // to /, so I get gs:/datalake-dev/mu/tpu/file.0450138 instead of gs://datalake-dev/mu/tpu/file.0450138.
Can anyone help me with this?
Thanks a lot!
I solved the problem by adding the following Java code and then calling the Java method from Scala.
import com.google.cloud.storage.*;
import java.sql.Timestamp;

public class ExtractDate {

    public static String getTime(String fileName) {
        String bucketName = "bucket-data";
        String blobName = "doc/files/" + fileName;

        // Instantiate a client
        Storage storageClient = StorageOptions.getDefaultInstance().getService();

        // Fetch the blob's metadata; getCreateTime() returns milliseconds since the epoch
        BlobId blobId = BlobId.of(bucketName, blobName);
        Blob blob = storageClient.get(blobId);
        Timestamp tmp = new Timestamp(blob.getCreateTime());

        // Return the year of the file's creation date
        return tmp.toString().substring(0, 4);
    }
}
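For reference, a hypothetical call (the filename is just an example; calling it from Scala looks the same):
// Returns the 4-digit creation year of the object, e.g. "2017"
String year = ExtractDate.getTime("file.0450138");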
You can use the file_get_contents function to read the contents of the path. From the documentation on Reading and Writing Files:
Read object contents using PHP to fetch an object's custom metadata from Google Cloud Storage. An App Engine PHP 5 app must use the Cloud Storage stream wrapper to write files at runtime. However, if an app needs to read files, and these files are static, you can optionally read static files uploaded with your app using PHP filesystem functions such as file_get_contents.
$fileContents = file_get_contents($filePath);
where the path specified must be a path relative to the script accessing them.
You must upload the file or files in an application subdirectory when you deploy your app to App Engine, and must configure the app.yaml file so your app can access those files. For complete details, see PHP 5 Application Configuration with app.yaml.
In the app.yaml configuration, notice that if you use a static file or directory handler (static_files or static_dir) you must specify application_readable set to true or your app won't be able to read the files. However, if the files are served by a script handler, this isn't necessary, because these files are readable by script handlers by default.
I am receiving streaming data in myDStream (a DStream[String]) that I want to save to S3 (basically, for this question, it doesn't matter where exactly I want to save the output, but I am mentioning it just in case).
The following code works well, but it saves folders with names like jsonFile-19-45-46.json, and inside those folders it saves the files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) into a JSON file, not a folder? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-"+datetimeString+".json")
}
AFAIK there is no option to save it as a single file. Because it's a distributed processing framework, it's not good practice to write to a single file; instead, each partition writes its own file under the specified path.
We can only pass the output directory where we want the data saved. The output writer will create one or more files (depending on the number of partitions) inside the specified path, with a part- filename prefix.
As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on a local filesystem and on HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
// Write to a temporary directory first (produces part-00000, _SUCCESS, etc.)
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Move the single part file to its final name, then drop the temporary directory
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For Java, I implemented the following. Hope it helps:
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
// Find the part file that Spark wrote into the temporary output directory
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
// Move it to its final name, then remove the temporary directory
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);
http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
The link shows how to turn a text file into an RDD and then into a DataFrame.
So how do I deal with a binary file?
I'm asking for an example. Thank you very much.
There is a similar question without an answer here: reading binary data into (py)spark DataFrame
To be more specific, I don't know how to parse the binary file. For example, I can parse a text file into lines or words like this:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
    new Function<String, Person>() {
        public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            return person;
        }
    });
It seems that I just need an API that can parse a binary file or binary stream in the same way:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.bin").map(
    new Function<String, Person>() {
        public Person call(/*stream or binary file*/) throws Exception {
            /*code to construct every row*/
            return person;
        }
    });
EDIT:
The binary file contains structured data (a table from a relational database; the database is home-made) and I know the metadata of the structured data. I plan to turn the structured data into RDD[Row].
I can change everything about the binary file, since I use the FileSystem API (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) to write the binary stream into HDFS, and the binary file is splittable. I have no idea how to parse the binary file in the style of the example code above, so I can't try anything so far.
There is a binary record reader already available for Spark (I believe in 1.3.1, at least in the Scala API):
sc.binaryRecords(path: String, recordLength: Int, conf: Configuration)
It's on you, though, to convert those binary records into a format acceptable for processing.
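As a sketch of what that conversion might look like from the Java API: the fixed record layout of a 4-byte int age followed by a 20-byte UTF-8 name is an assumption for the example, and Person is the class from the question's code.
// Each fixed-length 24-byte record becomes one byte[] element
JavaRDD<byte[]> records = sc.binaryRecords("examples/src/main/resources/people.bin", 24);
JavaRDD<Person> people = records.map(bytes -> {
    java.nio.ByteBuffer buf = java.nio.ByteBuffer.wrap(bytes);
    Person person = new Person();
    person.setAge(buf.getInt());                 // first 4 bytes: age
    byte[] nameBytes = new byte[20];
    buf.get(nameBytes);                          // remaining 20 bytes: padded name
    person.setName(new String(nameBytes, java.nio.charset.StandardCharsets.UTF_8).trim());
    return person;
});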
This is the first time I am integrating an email service with Liftweb.
I want to send email with attachments (like documents, images, PDFs).
My code looks like this:
case class CSVFile(bytes: Array[Byte],
  filename: String = "file.csv",
  mime: String = "text/csv; charset=utf8; header=present")
val attach = CSVFile(fileupload.mkString.getBytes("utf8"))
val body = <p>Please research the enclosed.</p>
val msg = XHTMLPlusImages(body,
  PlusImageHolder(attach.filename, attach.mime, attach.bytes))
Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg)
This code is taken from the Lift Cookbook, but it's not working the way I need.
It runs, but only the attached file name comes through (file.csv) with no data in it (I uploaded the file gsy.docx).
Best Regards
GSY
You don't specify what type fileupload is, but assuming it is of type net.liftweb.http.FileParamHolder, the issue is that you can't just call mkString and expect it to have any data, since there is no data in the object, just a fileStream method for retrieving it (either from disk or memory).
The easiest way to accomplish what you want would be to use a ByteArrayOutputStream and copy the data to it. I haven't tested it, but the code below should solve your issue. For brevity, it uses Apache Commons IO to copy the streams, but you could just as easily do it natively.
val data = {
  val os = new ByteArrayOutputStream()
  IOUtils.copy(fileupload.fileStream, os)
  os.toByteArray
}
val attach = CSVFile(data)
BTW, you say you are uploading a Word (DOCX) file and expecting it to automatically be CSV when the extension is changed? You will just get a DOCX file with a csv extension unless you actually do some conversion.
How do I convert Excel data into an XML file using ADO.NET?
You can use the Microsoft Jet OLEDB 4.0 Data Provider to read the Excel file. Information about how to establish a connection to an Excel file can be found here.
This article explains how to read an Excel file using the provider. Once you have read the data, you can compose your XML document using LINQ to XML or the System.Xml classes.
In Excel, you can save the file to XML by using the File menu and changing the saved file type to XML spreadsheet.
If you want to read an Excel XML file with ADO.Net, try the XmlReader.
Or see this step-by-step example from Microsoft.
I've not used ADO.NET, but I've used XQuery very successfully for this. Use Excel's export to create an XML file, then write XQuery/XPath commands to convert it as you want. The Excel XML export format is pretty gnarly, but it does the job. Use the Oxygen 30-day eval license to lighten the XQuery debugging job.
Use this code:
public static DataSet exceldata(string filelocation)
{
    DataSet ds = new DataSet();
    OleDbCommand excelCommand = new OleDbCommand();
    OleDbDataAdapter excelDataAdapter = new OleDbDataAdapter();
    string excelConnStr = "Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" + filelocation + "; Extended Properties =Excel 8.0;";
    OleDbConnection excelConn = new OleDbConnection(excelConnStr);
    excelConn.Open();
    DataTable dtPatterns = new DataTable();
    excelCommand = new OleDbCommand("SELECT UUID, `PATTERN` as PATTERN, `PLAN` as PLAN FROM [PATTERNS$]", excelConn);
    excelDataAdapter.SelectCommand = excelCommand;
    excelDataAdapter.Fill(dtPatterns);
    dtPatterns.TableName = "Patterns";
    ds.Tables.Add(dtPatterns);
    return ds;
}
and then convert the returned DataTable to XML with DataTable.WriteXml().