I'm reading the Apache Beam programming guide, which starts off very well but becomes a bit harder to get through starting with the Schemas section.
My main question here is: Are schemas relevant if you are using Beam in Python? It seems like they might only be relevant if you are using a strongly typed language like Java, but I'm not sure. And while the programming guide is good about using different wording for Java vs Python early on in the guide, once you get to the Schemas section it is focused entirely on Java. So it's hard for me to tell if this is a topic I should know anything about if I am using Python.
Here is the section of the guide I am asking about: https://beam.apache.org/documentation/programming-guide/#schemas
There is a direct mapping between Python's NamedTuple and Java's Schema. For Python it's especially important in the cross-language part of Beam (currently the JDBC and Snowflake IOs).
For example, Java's
Schema javaSchema = Schema.builder().addInt32Field("f_int32").build();
has its equivalent in Python:
PythonSchema = NamedTuple("PythonSchema", [("f_int32", int)])
The same mapping holds between Java's Row and instances of PythonSchema.
When you send these tuples from your Python transform through a cross-language transform, e.g.:
(pipeline
 | GenerateSequence(start=0, end=3)  # returns a sequence of ints
 | Map(lambda x: PythonSchema(f_int32=x)).with_output_types(PythonSchema)
 | WriteToJdbc(...))  # cross-language transform, accepts Java Row elements
Then WriteToJdbc will receive, in its Java pipeline, a PCollection with objects equal to:
Row.withSchema(javaSchema).addValue(x).build()
The cross-language feature is quite new and still experimental, and there are few transforms that use it yet. But it's considered to be Beam's future: further Python/Go transforms are expected to be cross-language ones, since they require just one (mostly Java) implementation and then only a mapping to the other languages. The native transforms usually use their own data types and don't bother with schemas, AFAIK.
I wrote this in quite a hurry, so if something is not clear I will edit it. I hope this helps.
You're right, this section is missing details for Python. Schemas are definitely useful in Beam Python, however. You can do things like:
# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import apache_beam as beam

input_row = beam.Row(my_first_row="x", my_second_row=1)

with beam.Pipeline() as pipeline:
    (pipeline
     | "Create input" >> beam.Create([input_row])
     | "Select output data" >> beam.Select(
         new_first_row=lambda row: "Row: " + row.my_first_row,
         new_second_row=lambda row: row.my_second_row + 1))
If you're using regular Beam, I find this alone makes schemas worth using. Plus, when you add keys, you can keep the values as beam.Rows or lists of beam.Rows, which is really convenient.
That being said, if you're new to Beam, I would definitely recommend checking out Beam DataFrames in Python [link]. This allows you to operate on PCollections as if they were Pandas DataFrames.
I'm reading and writing some text files in Scala. As a complete beginner in the language, I wanted to make sure to find the right way to do it, e.g. get the encoding right.
So most of the stuff I found (also on SO) recommends I use io.Source.fromFile. However, after trying it out like so, reading a UTF-8 file:
val user_list = Source.fromFile("usernames.txt").getLines.toList
val user_list = Source.fromFile("usernames.txt", enc="UTF8").getLines.toList
I looked at the docs but was left with some questions.
Get the encoding right:
the docs show that I can set an encoding in Source.fromFile, as I tried above. Looking at the documentation for Codec and the types listed there, I was wondering if those are all my codec options - is there e.g. no UTF-16, big-endian vs. little-endian, etc.?
I am slightly obsessed with this since it used to trip me up in Python a lot. Is this less of a concern with Scala for some reason?
Get the reading in right:
All the examples I looked at used the getLines method and post-processed the result with mkString or toList, etc. Is there any advantage to that over just reading in the entire file (my files are small) in one go?
Get the writing out right:
Every source I could find tells me that Scala has no file writing function and to use the Java FileWriter. I was surprised by this - is this still accurate?
Looking at it, I feel the question might be a little broad for SO, so I'd be happy to withdraw it if it does not meet the requirements. At this point I'm not struggling with specific examples, but rather trying to set things up in a way that won't get me in trouble later.
Thanks!
Scala only has a basic IO API in the standard library. For the most part you just use the Java APIs. The fact that a decent API already exists on the Java side is probably why the Scala team has not prioritized a robust, fully featured IO API of its own.
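For example, writing a text file from Scala usually just means calling the Java APIs directly. A minimal sketch using java.nio (the file name and contents here are placeholders):
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write a list of lines out as UTF-8; Files.write creates or truncates the file.
val users = List("alice", "bob")
Files.write(
  Paths.get("usernames_out.txt"),
  users.mkString("\n").getBytes(StandardCharsets.UTF_8)
)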
There are also third-party Scala libraries you could use. I've never used Better Files, but I've heard good things about it as a Scala file API. There is also fs2, which provides functional, streaming IO. I'm sure there are others out there as well.
For encoding, there are many possible encodings available. It's just that only a couple of the most common ones are exposed as static fields; the rest you typically access through Codec("Encoding Name"). Most APIs will also let you pass a String directly instead of needing to get a Codec instance first. The Codec is really just a wrapper over java.nio.charset.Charset. You can run java.nio.charset.Charset.availableCharsets() to see all of the encodings available on your system.
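For instance (a small sketch; the file name is a placeholder), you can pass any charset name your JVM knows about, including UTF-16 variants with an explicit byte order:
import scala.io.{Codec, Source}
import java.nio.charset.Charset

// Codec is a thin wrapper over java.nio.charset.Charset.
val utf16be = Codec("UTF-16BE")
val lines = Source.fromFile("usernames.txt")(utf16be).getLines().toList

// Print every charset name the JVM supports (these are the names Codec(...) accepts).
Charset.availableCharsets().keySet().forEach(name => println(name))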
As far as reading goes, if the files are small you can load them fully into memory if you prefer. The only reason not to do so is to avoid the extra memory use of loading the entire file at once, when reading through it line by line is enough. You may also want to use Vector instead of List for efficiency reasons (Vector is better in many cases and should probably be preferred as a default collection, but old habits die hard and most people/guides seem to default to List - that's a whole other topic).
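For example, both styles look like this (just a sketch; "usernames.txt" is a placeholder path):
import scala.io.Source

// Small file: slurp everything in one go.
val src = Source.fromFile("usernames.txt")
val wholeFile: String = try src.mkString finally src.close()

// Or keep the lines, preferring Vector over List.
val src2 = Source.fromFile("usernames.txt")
val lines: Vector[String] = try src2.getLines().toVector finally src2.close()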
I want to apply a preprocessing phase to a large amount of text data in Spark/Scala, such as lemmatization, stop-word removal (using TF-IDF), and POS tagging. Is there any way to implement these in Spark with Scala?
for example here is one sample of my data:
The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.
after preprocessing:
perfect fit iPod photo great sound great price use everywhere very useful
and the tokens have POS tags, e.g. (iPod, NN), (photo, NN)
There is a POS tagging library (the University of Arizona's sista) - is it applicable in Spark?
Anything is possible. The question is what YOUR preferred way of doing this would be.
For example, do you have a stop word dictionary that works for you (it could simply be a Set), or would you want to run TF-IDF to automatically pick the stop words? (Note that this would require some supervision, such as picking the threshold at which a word is considered a stop word.) You can provide the dictionary yourself, and Spark's MLlib already comes with TF-IDF.
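For illustration, here is a rough sketch of both routes (it assumes an existing SparkContext sc, a placeholder input file, and a made-up stop-word set):
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// Tokenize crudely: lower-case and split on whitespace ("reviews.txt" is a placeholder).
val docs: RDD[Seq[String]] =
  sc.textFile("reviews.txt").map(_.toLowerCase.split("\\s+").toSeq)

// Route 1: a hand-rolled stop-word dictionary is just a Set plus a filter.
val stopWords = Set("the", "for", "my", "a", "it", "is")
val filtered = docs.map(_.filterNot(w => stopWords.contains(w)))

// Route 2: compute TF-IDF with MLlib; terms with very low IDF behave like stop words,
// but you still have to pick the threshold yourself.
val tf = new HashingTF().transform(filtered)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)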
The POS tags step is tricky. Most NLP libraries on the JVM (e.g. Stanford CoreNLP) don't implement java.io.Serializable, but you can perform the map step using them, e.g.
myRdd.map(functionToEmitPOSTags)
On the other hand, don't emit an RDD that contains non-serializable classes from that NLP library, since steps such as collect(), saveAsNewAPIHadoopFile, etc. will fail. Also to reduce headaches with serialization, use Kryo instead of the default Java serialization. There are numerous posts about this issue if you google around, but see here and here.
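A sketch of that pattern (loadPosTagger and tagSentence are hypothetical stand-ins for whatever NLP library you end up using, and myRdd is assumed to be an RDD of sentences):
import org.apache.spark.SparkConf

// Build the (non-serializable) tagger once per partition on the executor,
// and emit only plain Scala types such as (token, tag) pairs.
val tagged = myRdd.mapPartitions { sentences =>
  val tagger = loadPosTagger()                 // hypothetical; never shipped over the wire
  sentences.map(s => tagSentence(tagger, s))   // hypothetical; e.g. Seq(("iPod", "NN"), ("photo", "NN"))
}

// Switching to Kryo serialization is just a SparkConf setting:
val conf = new SparkConf()
  .setAppName("pos-tagging")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")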
Once you figure out the serialization issues, you need to figure out which NLP library to use to generate the POS tags. There are plenty of those, e.g. Stanford CoreNLP, LingPipe and Mallet for Java, Epic for Scala, etc. Note that you can of course use the Java NLP libraries with Scala, including with wrappers such as the University of Arizona's Sista wrapper around Stanford CoreNLP, etc.
Also, why didn't your example lower-case the processed text? That's pretty much the first thing I would do. If you have special cases such as iPod, you could apply lower-casing everywhere except those cases; in general, though, I would lower-case everything. If you're removing punctuation, you should probably first split the text into sentences (e.g. splitting on periods with a regex); removing the punctuation itself can then also be done with a regex.
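A crude sketch of the regex route (this skips sentence splitting and special cases like iPod):
// Lower-case, strip punctuation, and split into tokens.
val text = "The perfect fit for my iPod photo. Great sound for a great price."
val tokens = text.toLowerCase
  .replaceAll("""\p{Punct}""", " ")
  .split("\\s+")
  .toSeq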
How deeply do you want to stem? For example, the Porter stemmer (there are implementations in every NLP library) stems so deeply that "universe" and "university" become the same resulting stem. Do you really want that? There are less aggressive stemmers out there, depending on your use case. Also, why use stemming if you can use lemmatization, i.e. splitting the word into the grammatical prefix, root and suffix (e.g. walked = walk (root) + ed (suffix)). The roots would then give you better results than stems in most cases. Most NLP libraries that I mentioned above do that.
Also, what's your distinction between a stop word and a non-useful word? For example, you removed the pronoun in the subject form "I" and the possessive form "my," but not the object form "me." I recommend picking up an NLP textbook like "Speech and Language Processing" by Jurafsky and Martin (for the ambitious), or just reading one of the engineering-centered books about NLP tools such as LingPipe for Java, NLTK for Python, etc., to get a good overview of the terminology, the steps in an NLP pipeline, and so on.
There is no built-in NLP capability in Apache Spark. You would have to implement it for yourself, perhaps based on a non-distributed NLP library, as described in marekinfo's excellent answer.
I would suggest you take a look at Spark's ML Pipeline API. You may not get everything out of the box yet, but you can build your own capabilities and use the Pipeline as a framework.
In The Pragmatic Programmer:
Normally, you can simply hide a third-party product behind a well-defined, abstract interface. In fact, we've always been able to do so on any project we've worked on. But suppose you couldn't isolate it that cleanly. What if you had to sprinkle certain statements liberally throughout the code? Put that requirement in metadata, and use some automatic mechanism, such as Aspects (see page 39) or Perl, to insert the necessary statements into the code itself.
Here the author is referring to Aspect Oriented Programming and Perl as tools that support "automatic mechanisms" for inserting metadata.
In my mind I envision some type of run-time injection of code. How does Perl allow for "automatic mechanisms" for inserting metadata?
Skip ahead to the section on Code Generators. The author provides a number of examples of processing input files to generate code, including this one:
Another example of melding environments using code generators happens when different programming languages are used in the same application. In order to communicate, each code base will need some information in common - data structures, message formats, and field names, for example. Rather than duplicate this information, use a code generator. Sometimes you can parse the information out of the source files of one language and use it to generate code in a second language. Often, though, it is simpler to express it in a simpler, language-neutral representation and generate the code for both languages, as shown in Figure 3.4 on the following page. Also see the answer to Exercise 13 on page 286 for an example of how to separate the parsing of the flat file representation from code generation.
The answer to Exercise 13 is a set of Perl programs used to generate C and Pascal data structures from a common input file.
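As a toy illustration of the same idea in Scala (the book's own answer uses Perl, and the "spec" here is made up rather than parsed from a real flat file):
// A language-neutral description of one record type...
val spec = Seq(("id", "int"), ("name", "string"))

// ...and per-language type mappings.
val cTypes      = Map("int" -> "int",     "string" -> "char*")
val pascalTypes = Map("int" -> "Integer", "string" -> "String")

// Generate the same record for both languages from the single source of truth.
val cStruct = spec
  .map { case (name, tpe) => s"  ${cTypes(tpe)} $name;" }
  .mkString("typedef struct {\n", "\n", "\n} Record;")

val pascalRecord = spec
  .map { case (name, tpe) => s"    $name : ${pascalTypes(tpe)};" }
  .mkString("type\n  TRecord = record\n", "\n", "\n  end;")

println(cStruct)
println(pascalRecord)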
I'm using the scala^Z3 tool for a small library that (among other things) prints the constraints of a Z3Context in latex format. While it's possible to traverse the Z3AST and latex-ify the expressions by string comparison, it would be much nicer to use the object structure of the z3.scala.dsl package. Is there a way to obtain a z3.scala.dsl.Tree from a Z3AST?
It's true that the DSL is currently "write only", in that you can use it to create trees and ship them to Z3 but not to read them back.
The standard way to read Z3 trees is to use getASTKind and getDeclKind from Z3Context. The classes that represent the results are Z3ASTKind and Z3DeclKind respectively. (Since most trees are applications, the latter is where most of the information is).
It looks like the way to do this is to create the original constraints using z3.scala.dsl, then add each constraint using Z3Context.assertCnstr(tree: Tree[BoolSort]). This way I have the whole DSL tree available for easy transformation to LaTeX. For some reason the examples on the scala^Z3 website assemble the AST without using the DSL at all, so this alternative wasn't obvious.
Given a Scala AST, is there a way to generate Scala source code?
I'm looking into ways to autogenerate Scala source by parsing/analyzing other Scala source. Any tips would be appreciated!
I have been successfully using Scala-Refactoring by Mirko Stocker for this task.
For synthetically constructing ASTs, it relies strongly on the existing Tree DSL of Scala's NSC.
Although the code is a bit messy, you can find an example usage in my project ScalaCollider-UGens.
I have also come across a very useful class by Johannes Rudolph.
See our DMS Software Reengineering Toolkit.
DMS provides a complete ecosystem for parsing/analyzing/optimizing/transforming source code in many languages. It achieves this by providing generic machinery for these tasks as its core capabilities, and specializing that machinery according to explicitly supplied language definitions ("front ends"). DMS has front ends for many languages (C, C++, C#, Java, COBOL, ...) that have been used in anger, and a process for defining new ones very quickly.
We work on expanding the language set more or less continuously. DMS already has parts of a Scala front end implemented, and we know how to finish it based on the other 30+ front ends we have built, with special emphasis on knowledge of Java.