Issues with files in pyspark data frame - pyspark
I've been working on a PySpark tool that filters based on a search and then sorts the results. The data frame is a compilation of over 1,400 CSV files. When I attempt to run the code, I get a very lengthy error message. It appears to break down to a Java error for an unexpected EOF:
Py4JJavaError: An error occurred while calling o1331.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 202 in stage 56.0 failed 4 times, most recent failure: Lost task 202.3 in stage 56.0 (TID 7632, emr-master-f35-eels.sss.local, executor 31): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: [2]
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
Comment character=\0
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character="
Quote escape escape character=null
Internal state when error was thrown: line=737459, column=3, record=235, charIndex=457399297, headers=[attachment_md5_checksum, attachment_filename, attachment_text, attachment_urlsafe_base64_bytes, notice_id, title, solicitation_number, department_ind_agency, cgac, sub_tier, fpds_code, office, aac_code, posted_date, type, base_type, archive_type, archive_date, set_aside_code, set_aside, response_deadline, naice_code, classification_code, pop_street_address, pop_city, pop_state, pop_zip, pop_country, active, award_number, award_date, award_dollars, awardee, primary_contact_title, primary_contact_full_name, primary_contact_email, primary_contact_phone, primary_contact_fax, secondary_contact_title, secondary_contact_full_name, secondary_contact_email, secondary_contact_phone, secondary_contact_fax, organization_type, state, city, zip_code, country_code, additional_info_link, link, description]
at com.univocity.parsers.common.AbstractParser.handleException(
at com.univocity.parsers.common.AbstractParser.parseNext(
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.TraversableOnce$FlattenOps$$anon$1.hasNext(TraversableOnce.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:561)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1990)
at org.apache.spark.api.python.BasePythonRunner$
Caused by: java.lang.IllegalStateException: Error reading from input
at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(
at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(
at com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(
at com.univocity.parsers.common.input.NoopCharAppender.appendUntil(
at com.univocity.parsers.csv.CsvParser.parseRecord(
at com.univocity.parsers.common.AbstractParser.parseNext(
... 18 more
Caused by: Unexpected end of input stream
at sun.nio.cs.StreamDecoder.readBytes(
at sun.nio.cs.StreamDecoder.implRead(
at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(
... 23 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
at org.apache.spark.util.EventLoop$$anon$
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1080)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1062)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1484)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1471)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3267)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3264)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3264)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(
Caused by: com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: [2]
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
Comment character=\0
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character="
Quote escape escape character=null
Internal state when error was thrown: line=737459, column=3, record=235, charIndex=457399297, headers=[attachment_md5_checksum, attachment_filename, attachment_text, attachment_urlsafe_base64_bytes, notice_id, title, solicitation_number, department_ind_agency, cgac, sub_tier, fpds_code, office, aac_code, posted_date, type, base_type, archive_type, archive_date, set_aside_code, set_aside, response_deadline, naice_code, classification_code, pop_street_address, pop_city, pop_state, pop_zip, pop_country, active, award_number, award_date, award_dollars, awardee, primary_contact_title, primary_contact_full_name, primary_contact_email, primary_contact_phone, primary_contact_fax, secondary_contact_title, secondary_contact_full_name, secondary_contact_email, secondary_contact_phone, secondary_contact_fax, organization_type, state, city, zip_code, country_code, additional_info_link, link, description]
at com.univocity.parsers.common.AbstractParser.handleException(
at com.univocity.parsers.common.AbstractParser.parseNext(
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.TraversableOnce$FlattenOps$$anon$1.hasNext(TraversableOnce.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:561)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1990)
at org.apache.spark.api.python.BasePythonRunner$
Caused by: java.lang.IllegalStateException: Error reading from input
at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(
at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(
at com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(
at com.univocity.parsers.common.input.NoopCharAppender.appendUntil(
at com.univocity.parsers.csv.CsvParser.parseRecord(
at com.univocity.parsers.common.AbstractParser.parseNext(
... 18 more
Caused by: Unexpected end of input stream
at sun.nio.cs.StreamDecoder.readBytes(
at sun.nio.cs.StreamDecoder.implRead(
at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(
... 23 more
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o1331.collectToPython.\n', JavaObject id=o1338), <traceback object at 0x7ff3f2ff9230>)
I went through the process of running this code on each csv individually and narrowed it down to 6 of them that are causing this error. Obviously I can remove those 6 files from the list and then the code can run just fine, but if there is a way to use code to diagnose and potentially repair these files I'd like to try that route first. Any suggestions/ideas for how I can go about this?
Per the suggestion below, I attempted to open the file with the following code and then print the contents:
with open('file.csv.bz2', 'r', encoding='ISO-8859-1') as f:
    lines = f.readlines()
This can run without issue. However, I then tried to open it in pandas and got an EOFError.
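A note on the test above: the built-in open() does not decompress .bz2 data, so that snippet reads the raw compressed bytes as ISO-8859-1 text, which can decode any byte sequence and therefore cannot fail. To check whether a file is truncated or damaged, it has to be decompressed all the way to the end. The following is a minimal sketch using only the standard library; the function name check_bz2 and the chunk size are illustrative choices, not part of the original code.

```python
import bz2


def check_bz2(path, chunk_size=1024 * 1024):
    """Decompress the whole file and report whether it ends cleanly.

    Returns (ok, bytes_read, error). A truncated archive normally
    surfaces as EOFError; other damage usually surfaces as OSError.
    """
    bytes_read = 0
    try:
        with bz2.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                bytes_read += len(chunk)
    except (EOFError, OSError) as exc:
        return False, bytes_read, exc
    return True, bytes_read, None


# Hypothetical usage: run it over each of the suspect files.
# for path in suspect_paths:
#     print(path, check_bz2(path))
```

If a file fails this check, the archive itself is incomplete, and the usual remedy is to re-download or regenerate it rather than to repair the CSV content.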
You can load the files, while preserving all records:
1. Load all records, excluding these 6 files, into a single data frame.
2. Load the 6 files using the schema from #1, in PERMISSIVE mode (see example), and preserve the malformed columns.
3. Optionally, rename the corrupted-record column. See "Not able to retain the corrupted rows in pyspark using PERMISSIVE mode".
4. Now you can see what is malformed and decide whether to drop those records, write them to a dead-letter queue, log them, or whatever you decide.
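The steps above can be sketched roughly as follows. This is a sketch under assumptions, not a tested recipe for this dataset: it assumes every column was read as a string (the CSV reader's default when no schema is inferred), the helper names permissive_schema_ddl and read_permissive are made up for illustration, and spark is an existing SparkSession. The options mode and columnNameOfCorruptRecord are standard CSV reader options, and the schema is passed as a DDL string.

```python
CORRUPT_COLUMN = "_corrupt_record"  # Spark's default name for this column


def permissive_schema_ddl(column_names, corrupt_column=CORRUPT_COLUMN):
    """Build a DDL schema: every column STRING, plus the corrupt-record column."""
    names = list(column_names) + [corrupt_column]
    return ", ".join(f"`{name}` STRING" for name in names)


def read_permissive(spark, paths, column_names):
    """Read CSV files in PERMISSIVE mode, keeping malformed rows in CORRUPT_COLUMN."""
    return (
        spark.read
        .schema(permissive_schema_ddl(column_names))
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", CORRUPT_COLUMN)
        .csv(paths)
    )


# Hypothetical usage (good_df is the data frame from step 1):
# suspect = read_permissive(spark, suspect_paths, good_df.columns).cache()
# suspect.filter(suspect[CORRUPT_COLUMN].isNotNull()).show(truncate=False)
```

Caching before filtering on the corrupt-record column avoids a restriction in newer Spark versions on queries that reference only that column. Note also that PERMISSIVE mode deals with rows that fail to parse; an "Unexpected end of input stream" coming from a truncated compressed file is an I/O failure and may still abort the task.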
Execution failed for task ':app:compileFlutterBuildDebug'. Error: Illegal character in opaque part at index 2: C:\#\mobile\build\app\intermediates\flutter\debug\flutter_assets\AssetManifest.json
{Supplemental Symbols and Pictographs} Not being identified by Scala
I am trying to identify emojis within a sentence:
def extractEmojiFromSentence(sentence: Any): Seq[String] = {
  return raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\p{block=Supplemental Symbols and Pictographs}]".r.findAllIn(sentence.toString).toSeq
}
This gives the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character block name {Supplemental Symbols and Pictographs} near index 112
[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\p{block=Supplemental Symbols and Pictographs}]
Do I have to import some libraries into my build.sbt, or what is the reason for the above error?
UPDATE: I'm trying the code below, as suggested in the comment:
val x = raw"\p{block=Supplemental Symbols and Pictographs}".r.findAllIn(mySentence.toString).toSeq
But I'm getting the error below:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character block name {Supplemental Symbols and Pictographs} near index 45
\p{block=Supplemental Symbols and Pictographs}
^
It appears that the regex engine in your JVM version does not recognize that block label. (Mine doesn't either.) You can just supply the equivalent character range instead.
def extractEmojiFromSentence(sentence: String): Seq[String] =
  ("[\\p{block=Emoticons}" +
   "\\p{block=Miscellaneous Symbols and Pictographs}" +
   "\uD83E\uDD00-\uD83E\uDDFF]") // Supplemental Symbols & Pictographs
    .r.findAllIn(sentence).toSeq
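For comparison, the same idea of spelling out code point ranges instead of block names can be sketched in Python, whose standard re module has no \p{block=...} syntax at all. The surrogate pair \uD83E\uDD00-\uD83E\uDDFF used above corresponds to the range U+1F900 to U+1F9FF. The names EMOJI_PATTERN and extract_emoji are illustrative only.

```python
import re

# Explicit code point ranges for the three Unicode blocks named in the question.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "]"
)


def extract_emoji(sentence):
    """Return every character of the sentence that falls in one of the three blocks."""
    return EMOJI_PATTERN.findall(sentence)
```

Spelling out the ranges keeps the pattern independent of which block names a particular regex engine happens to know.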
In YAML template validation fails to recognize "try"
Code:
DEBUG_MODE = True  # Manually change when debugging
try:
    CFN_CLIENT = boto3.client('cloudformation')
except Exception as error:
    print('Error creating boto3.client, error text follows:\n%s' % error)
    raise Exception(error)
Question:
Template validation error: Template format error: YAML not well-formed. (line 37, column 1)
In my case (above code) the word "try" is placed at line 37, column 1. I wonder what is incorrect. Thanks.
How to recognize ID, Literals and Comments in Lex file
I have to write a lex program that has these rules:
Identifiers: String of alphanumeric characters (and _), starting with an alphabetic character
Literals: Integers and strings
Comments: Start with the ! character and continue until the end of the line
Here is what I came up with:
[a-zA-Z][a-zA-Z0-9]+ return(ID);
[+-]?[0-9]+ return(INTEGER);
[a-zA-Z]+ return ( STRING);
!.*\n return ( COMMENT );
However, I still get a lot of errors when I compile this lex file. What do you think the error is?
It would have helped if you'd shown more clearly what the problem was with your code. For example, did you get an error message or did it not function as desired?
There are a couple of problems with your code, but it is mainly correct. The first issue I see is that you have not divided your lex program into the necessary parts with the %% divider. The first part of a lex program is the declarations section, where regular expression patterns are specified. The second part is where the actions for matched patterns are specified. The (optional) third section is where any code (for the compiler) is placed. Code for the compiler can also be placed in the declarations section when delineated by %{ and %} at the start of a line.
If we put your code through lex we would get this error:
"SoNov16.l", line 1: bad character: [
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: ]
"SoNov16.l", line 1: bad character: +
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: (
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: )
"SoNov16.l", line 1: bad character: ;
Did you get something like that?
In your example code you are specifying actions (the return(ID); is an example of an action) and thus your code is for the second section. You therefore need to put a %% line ahead of it. It will then be a valid lex program.
Your code is dependent on (probably) a parser, which consumes (and declares) the tokens. For testing purposes it is often easier to just print the tokens first. I solved this problem by making a C macro which will do the print and can be redefined to do the return at a later stage. Something like this:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z][a-zA-Z0-9]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
[a-zA-Z]+ TOKEN (STRING);
!.*\n TOKEN (COMMENT);
If we build and test this, we get the following:
abc
String: abc Matched: ID
abc123
String: abc123 Matched: ID
! comment text
String: ! comment text Matched: COMMENT
Not quite correct. We can see that the ID rule is matching what should be a string. This is due to the ordering of the rules: lex prefers the longest match and, when two rules match the same length, the rule listed first. We have to put the String rule first to ensure it matches first, unless of course you were supposed to match strings inside some quotes? You also missed the underscore from the ID pattern. It's also a good idea to match and discard any whitespace characters:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z]+ TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT);
[ \t\r\n]+ ;
Which when tested shows:
abc
String: abc Matched: STRING
abc123_
String: abc123_ Matched: ID
-1234
String: -1234 Matched: INTEGER
abc abc123 ! comment text
String: abc Matched: STRING
String: abc123 Matched: ID
String: ! comment text Matched: COMMENT
Just in case you wanted strings in quotes, that is easy too:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
\"[^"]+\" TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT );
[ \t\r\n] ;
"abc"
String: "abc" Matched: STRING
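To see why the rule order changes the result, here is a small Python imitation of how lex picks a rule: the longest match wins, and on a tie the rule listed first wins. This is only an illustration of the selection behavior, not lex itself; the names RULES and tokenize are made up, and the patterns are copied from the second lex example above.

```python
import re

# Rules in the same order as the corrected lex file above.
RULES = [
    ("STRING", re.compile(r"[a-zA-Z]+")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9_]+")),
    ("INTEGER", re.compile(r"[+-]?[0-9]+")),
    ("COMMENT", re.compile(r"!.*\n")),
    ("SKIP", re.compile(r"[ \t\r\n]+")),  # whitespace is matched and discarded
]


def tokenize(text, rules=RULES):
    """Split text into (rule_name, lexeme) pairs using lex-style rule selection."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strictly greater: on equal length the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

Passing the rules with ID listed before STRING makes a plain word such as abc come out as ID again, which mirrors the first test run above.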
unicode error preventing creation of text file
What is causing this error and how can I fix it?
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have also tried reading different files in the same directory and get this same unicode error as well.
file1 = open("C:\Users\Cameron\Desktop\newtextdocument.txt", "w")
for i in range(1000000):
    file1.write(str(i) + "\n")
You should escape backslashes inside the string literal. Compare:
>>> print("\U00000023") # single character
#
>>> print(r"\U00000023") # raw-string literal
\U00000023
>>> print("\\U00000023") # 10 characters
\U00000023
>>> print("a\nb") # three characters (literal newline)
a
b
>>> print(r"a\nb") # four characters (note: `r""` prefix)
a\nb
\U is being treated as the start of a Unicode escape sequence. Use a raw string (a preceding r) to prevent this translation:
>>> 'C:\Users'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> r'C:\Users'
'C:\\Users'
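Applied to the code in the question, the fix might look like the sketch below. It shows two equivalent spellings of the path and wraps the write loop in a with block so the file is closed afterwards; the function name write_numbers and the constant names are illustrative. Note that in the original literal, the \n in \newtextdocument.txt would also have been read as a newline had the \U not failed first.

```python
# Two equivalent ways to spell the same Windows path safely.
RAW_PATH = r"C:\Users\Cameron\Desktop\newtextdocument.txt"          # raw string
ESCAPED_PATH = "C:\\Users\\Cameron\\Desktop\\newtextdocument.txt"   # doubled backslashes


def write_numbers(path, count):
    """Write the integers 0..count-1 to path, one per line."""
    with open(path, "w") as f:
        for i in range(count):
            f.write(str(i) + "\n")


# Usage matching the question:
# write_numbers(RAW_PATH, 1000000)
```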