Replace the two-digit number in a file name to get the correct format in Python

I have file names like these:
Love Scene (2021) 01 TMP.mp4
Love.Scene.2021 02.TMP.mp4
Love.Scene.2021.03.TMP.mp4
LoveScene04.TMP.mp4
01, 02, 03, and 04 are the episode numbers.
The correct names should be:
Love Scene (2021) E01 TMP.mp4
Love.Scene.2021 E02.TMP.mp4
Love.Scene.2021.E03.TMP.mp4
LoveScene.E04.TMP.mp4
How can I use a regular expression to add "E" before the episode number?

Assuming the file names are stored in data.txt, the replacement can be done with a regular expression as follows.
The output is written to output.txt.
import re

def replaceString(d, prefx='E'):
    # Find every two-digit run that ends at a word boundary
    val = re.findall(r'\d{2}\b', d)
    if val:
        # Prefix only the last occurrence, which is the episode number
        d = (prefx + val[-1]).join(d.rsplit(val[-1], 1))
    print(d)
    return d

def handleData():
    prefx = "E"
    with open('data.txt', 'r') as file1, open('output.txt', 'w') as file2:
        for line in file1:
            if not line.strip():
                continue  # skip blank lines
            file2.write(replaceString(line, prefx))

handleData()
input data.txt:
Love Scene (2021) 01 TMP.mp4
Love.Scene.2021 02.TMP.mp4
Love.Scene.2021.03.TMP.mp4
LoveScene04.TMP.mp4
output:
Love Scene (2021) E01 TMP.mp4
Love.Scene.2021 E02.TMP.mp4
Love.Scene.2021.E03.TMP.mp4
LoveSceneE04.TMP.mp4
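As a variation on the answer above, here is a minimal sketch that works on a single name without files. The function name add_episode_prefix is hypothetical, and the sketch assumes the episode number is the last two-digit run that is not part of a longer number. Unlike the \d{2}\b pattern above, the lookarounds never match the "21" inside "2021".

```python
import re

def add_episode_prefix(name, prefix="E"):
    """Insert `prefix` before the last standalone two-digit number in `name`."""
    # (?<!\d) and (?!\d) keep the match from landing inside a longer number such as 2021
    matches = list(re.finditer(r"(?<!\d)\d{2}(?!\d)", name))
    if not matches:
        return name  # no episode number found; leave the name unchanged
    start = matches[-1].start()
    return name[:start] + prefix + name[start:]

for name in [
    "Love Scene (2021) 01 TMP.mp4",
    "Love.Scene.2021 02.TMP.mp4",
    "Love.Scene.2021.03.TMP.mp4",
    "LoveScene04.TMP.mp4",
]:
    print(add_episode_prefix(name))
```

Note that this sketch does not check whether a name already carries the prefix, so it should be run only once per name.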

pyspark error with reduceByKey call using simple wordcount from a file

I am trying to run the PySpark word count program from this page: https://www.learntospark.com/2020/01/word-count-program-in-apache-spark.html
Here is my code:
import findspark
findspark.init()
# Create SparkSession and SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .master("spark://my-ubuntu.xxx.com:7077")\
    .appName('wordcount')\
    .getOrCreate()
sc = spark.sparkContext
# Read the input file and calculate word counts
text_file = sc.textFile("peterpan.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)
# Print each word with its respective count
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
# Stop the Spark session and Spark context
sc.stop()
spark.stop()
I am getting the following error: tuple index out of range.
I am trying to learn pyspark. Any help is appreciated.
PicklingError Traceback (most recent call last)
Cell In [9], line 16
12 # Read the input file and Calculating words count
13 text_file = sc.textFile("peterpan1.txt")
14 counts = text_file.flatMap(lambda line: line.split(" ")) \
15 .map(lambda word: (word, 1)) \
---> 16 .reduceByKey(lambda x, y: x + y)
18 # Printing each word with its respective count
19 output = counts.collect()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\rdd.py:2275, in RDD.reduceByKey(self, func, numPartitions, partitionFunc)
2252 def reduceByKey(
2253 self: "RDD[Tuple[K, V]]",
2254 func: Callable[[V, V], V],
2255 numPartitions: Optional[int] = None,
2256 partitionFunc: Callable[[K], int] = portable_hash,
2257 ) -> "RDD[Tuple[K, V]]":
2258 """
2259 Merge the values for each key using an associative and commutative reduce function.
2260
(...)
2273 [('a', 2), ('b', 1)]
2274 """
-> 2275 return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
..
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\rdd.py:3345, in _prepare_for_python_RDD(sc, command)
3342 def _prepare_for_python_RDD(sc: "SparkContext", command: Any) -> Tuple[bytes, Any, Any, Any]:
3343 # the serialized command will be compressed by broadcast
3344 ser = CloudPickleSerializer()
-> 3345 pickled_command = ser.dumps(command)
3346 assert sc._jvm is not None
3347 if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
3348 # The broadcast will have same life cycle as created PythonRDD
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\serializers.py:468, in CloudPickleSerializer.dumps(self, obj)
466 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
467 print_exec(sys.stderr)
--> 468 raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: IndexError: tuple index out of range
=== peterpan.txt is a story with 6k+ lines. A sample of lines from the file is listed below. ===
The Project Gutenberg EBook of Peter Pan, by James M. Barrie
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
** Please follow the copyright guidelines in this file. **
Title: Peter Pan
Peter Pan and Wendy
Author: James M. Barrie
Posting Date: June 25, 2008 [EBook #16]
Release Date: July, 1991
Last Updated: October 14, 2016
Language: English
Character set encoding: UTF-8
*** START OF THIS PROJECT GUTENBERG EBOOK PETER PAN ***
PETER PAN
[PETER AND WENDY]
By J. M. Barrie [James Matthew Barrie]
A Millennium Fulcrum Edition (c)1991 by Duncan Research
Contents:
Chapter 1 PETER BREAKS THROUGH
Chapter 2 THE SHADOW
..
..
EBooks posted prior to November 2003, with eBook numbers BELOW #10000,
are filed in directories based on their release date. If you want to
download any of these eBooks directly, rather than using the regular
search system you may utilize the following addresses and just
download by the etext year.
http://www.ibiblio.org/gutenberg/etext06
(Or /etext 05, 04, 03, 02, 01, 00, 99,
98, 97, 96, 95, 94, 93, 92, 92, 91 or 90)
EBooks posted since November 2003, with etext numbers OVER #10000, are
filed in a different way. The year of a release date is no longer part
of the directory path. The path is based on the etext number (which is
identical to the filename). The path to the file is made up of single
digits corresponding to all but the last digit in the filename. For
example an eBook of filename 10234 would be found at:
http://www.gutenberg.org/1/0/2/3/10234
or filename 24689 would be found at:
http://www.gutenberg.org/2/4/6/8/24689
An alternative method of locating eBooks:
http://www.gutenberg.org/GUTINDEX.ALL
*** END: FULL LICENSE ***
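For reference, the flatMap / map / reduceByKey chain in the question is meant to produce one count per distinct word. The following is a plain-Python sketch of that same computation without Spark, which may help when checking expected counts on a small sample of the file. The function name word_count is hypothetical, and this sketch does not address the serialization error itself.

```python
from collections import Counter

def word_count(lines):
    """Count words the way the Spark pipeline intends: split on single spaces, sum per word."""
    counts = Counter()
    for line in lines:
        # Same tokenization as line.split(" ") in the question's flatMap step
        counts.update(line.split(" "))
    return counts

sample = ["Peter Pan", "Peter and Wendy"]
for word, count in word_count(sample).items():
    print("%s: %i" % (word, count))
```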

Errors that I don't understand

I am trying to write code following this tutorial:
https://www.youtube.com/watch?v=9mAmZIRfJBs&t=197s
As far as I can tell, I wrote it exactly the same way, but it still gives an error. Can someone explain why Spyder (Python 3.7) does this?
I tried using another input function, raw_input instead of input. I also tried changing my working directory and saving the document.
This is my code:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 29 14:47:27 2019
#author: johan
"""
import random
restaurantsList = ['boloco', 'clover', 'sweetgreens']
def pickRestaurant():
    print(restaurantsList[random.randint(0,2)])
def addRestaurant(name):
    restaurantsList.append(name)
def removeRestaurant(name):
    restaurantsList.remove(name)
def listRestaurant():
    for restaurant in restaurantsList:
        print(restaurant)
while True:
    print('''
[1] - List restaurant
[2] - Add restaurant
[3] - Remove restaurant
[4] - Pick restaurant
[5] - Exit
''')
   selection = raw_input(prompt='Please select an option: ')
   if selection == '1':
       print('')
       listRestaurant()
   elif selection == '2':
       inName = raw_input(prompt='Type name of the restaurant that you want to add: ')
       addRestaurant(inName)
   elif selection == '3':
       inName = raw_input(prompt='Type name of the restaurant that you want to remove: ')
       removeRestaurant(inName)
   elif selection == '4':
       pickRestaurant()
   elif selection == '5':
       break
And this is the error:
runfile('C:/Users/johan/Desktop/Unie jaar 2/untitled2.py', wdir='C:/Users/johan/Desktop/Unie jaar 2')
Traceback (most recent call last):
File "C:\Users\johan\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-93-2d7193d6cafb>", line 1, in <module>
runfile('C:/Users/johan/Desktop/Unie jaar 2/untitled2.py', wdir='C:/Users/johan/Desktop/Unie jaar 2')
File "C:\Users\johan\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "C:\Users\johan\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/johan/Desktop/Unie jaar 2/untitled2.py", line 35
selection = raw_input(prompt='Please select an option: ')
^
IndentationError: unindent does not match any outer indentation level
The code should list the restaurants if 1 is entered. Entering 2 lets you add a restaurant to the list. 3 works like 2, but removes a restaurant. 4 picks a random restaurant from the list. 5 exits the program.
It's imperative that you indent correctly in Python, as follows:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 29 14:47:27 2019
#author: johan
"""
import random

restaurantsList = ['boloco', 'clover', 'sweetgreens']

def pickRestaurant():
    # random.choice stays valid after restaurants are added or removed,
    # unlike the hard-coded index range randint(0,2)
    print(random.choice(restaurantsList))

def addRestaurant(name):
    restaurantsList.append(name)

def removeRestaurant(name):
    restaurantsList.remove(name)

def listRestaurant():
    for restaurant in restaurantsList:
        print(restaurant)

while True:
    print('''
[1] - List restaurant
[2] - Add restaurant
[3] - Remove restaurant
[4] - Pick restaurant
[5] - Exit
''')
    selection = input('Please select an option: ')
    if selection == '1':
        print('')
        listRestaurant()
    elif selection == '2':
        inName = input('Type name of the restaurant that you want to add: ')
        addRestaurant(inName)
    elif selection == '3':
        inName = input('Type name of the restaurant that you want to remove: ')
        removeRestaurant(inName)
    elif selection == '4':
        pickRestaurant()
    elif selection == '5':
        break
Python is indentation-sensitive: the body of a function, loop, or conditional must be indented consistently, or you will get the error shown above.
Additional note: you're using print() as a function, which is Python 3, and raw_input(), which exists only in Python 2, so I've assumed Python 3 and replaced raw_input() with input(). Note that input() does not accept keyword arguments, so the prompt is passed positionally.
You have 4 spaces before the print statement within the while loop, but all other lines in that loop, starting from selection = raw_input..., are indented by only 3 spaces.
You should add one space at the start of every line from selection = raw_input... onward.
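The mismatch described above can be reproduced in isolation. The following is a minimal sketch, assuming Python 3, that uses the built-in compile() to check a snippet for indentation errors without running it. The helper name indentation_error and the two sample snippets are hypothetical.

```python
def indentation_error(source):
    """Return the compiler's message if `source` has an indentation error, else None."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as err:
        return "line %s: %s" % (err.lineno, err.msg)
    return None

# Loop body starts with 4 spaces, then drops to 3: no outer block uses 3 spaces
bad = (
    "while True:\n"
    "    print('menu')\n"
    "   selection = '5'\n"
    "   break\n"
)

# Same code with every line of the loop body indented by 4 spaces
good = (
    "while True:\n"
    "    print('menu')\n"
    "    selection = '5'\n"
    "    break\n"
)

print(indentation_error(bad))
print(indentation_error(good))
```

The first call reports the problem on the line where the indentation drops, and the second returns None because the block is consistent.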

How to detect a 4-digit number using regexp

How do I get the year (4 digits) from a source text? I can only detect the day (29), but not the year (1997). There is something wrong with my regexp.
age = regexp(CharData,'(\d{1,4})','match','once')
For example,
Registered On
March 29, 1997
Desired output: 1997
Actual output: 29
for i = 1:2
    data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F22_TR\f22_TR_pdata_',int2str(i),''),'r')
    CharData = fread(data2, '*char'); % read text file and store data in CharData
    fclose(data2);
    age = regexp(CharData,'(\d{4})','match','once')
end
file : f22_TR_pdata_1 --> Registered On
June 24, 1997
file : f22_TR_pdata_2 --> Registered On
March 29, 1997
Age: 1997
To grab only four digits:
age = regexp(CharData,'(\d{4})','match','once')
\d{1,4} matches a run of digits with a length between 1 and 4, so 1, 29, 123, and 4444 would all match. With the 'once' option, regexp returns only the first match, and in "March 29, 1997" that is 29.
\d{4} matches exactly four consecutive digits, so 1997, 2001, and 1800 would all match.
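The same quantifier behavior can be checked outside MATLAB. The following is a small Python sketch using the standard re module as an analogue, where re.search plays the role of regexp with the 'once' option by returning only the first match. The sample text is taken from the question.

```python
import re

text = "Registered On\nMarch 29, 1997"

# {1,4} accepts any run of one to four digits, so the first hit is the day
first_run = re.search(r"\d{1,4}", text).group()

# {4} requires four digits in a row, so the day is skipped
year = re.search(r"\d{4}", text).group()

print(first_run)  # 29
print(year)       # 1997
```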

How to read files using for loop in Matlab?

I need to read 8760 small files (365 days * 24 hours = 8760, about 60 KB each), aggregate their values, and take the average of some of them.
Previously, I used the code below to read *.csv files:
for a=1:365
    for b=1:24
        s1=int2str(a);
        s2=int2str(b);
        s3=strcat('temperature_humidity',s1,'_',s2);
        data = load(s3);
        % Code for aggregation, etc
    end
end
I was able to run this code. However, the file names are now a little different, and I am not sure how to read these files.
Files are named like this:
2005_M01_D01_0000(UTC-0800)_L00_NOX_1HR_CONC.DAT
where M = Month, so the values are 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12;
D = Day, so the values are 01, 02, 03, ..., 31;
Hours are in this format: 0000, 0100, 0200, ..., 1800, ..., 2300.
Please take a look at the attached image for the file names. I need to read these files. Please help me.
Thank you very much.
I would use dir:
files = dir('*.DAT')
Or you can construct the file names with sprintf, using %02d to zero-pad the month and day:
name = sprintf('%d_M%02d etc.',...)
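To illustrate the second suggestion, here is a Python sketch that builds every hourly file name for one year from the pattern shown in the question. It assumes that the suffix (UTC-0800)_L00_NOX_1HR_CONC.DAT is identical for all files, which the question does not confirm, and the function name hourly_filenames is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: this suffix is the same for every file
SUFFIX = "(UTC-0800)_L00_NOX_1HR_CONC.DAT"

def hourly_filenames(year):
    """Yield one file name per hour of `year`, in chronological order."""
    t = datetime(year, 1, 1)
    while t.year == year:
        # :02d zero-pads month, day, and hour, like %02d in sprintf
        yield f"{t.year}_M{t.month:02d}_D{t.day:02d}_{t.hour:02d}00{SUFFIX}"
        t += timedelta(hours=1)

names = list(hourly_filenames(2005))
print(len(names))
print(names[0])
print(names[-1])
```

Stepping through real dates avoids generating names for days that do not exist, such as February 30.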

UNIX: Convert Unix Date in Specific format

I have a date other than the current date in Unix, and I want to convert it into a specific format.
Original Format
D="Mon Dec 30 06:35:02 EST 2013"
New Format
E=20131230063502
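E=`date +%Y%m%d%H%M%S`
This formats the output of the date command and saves it in the variable E. Note that it formats the current date. With GNU date, a stored date string can be passed with -d, as in date -d "$D" +%Y%m%d%H%M%S, in which case the result is shown in the local time zone.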
Using Python:
def data(dstr):
    # Map month abbreviations to two-digit month numbers
    m = {'Jan': '01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
    # split() with no argument also handles the double space before single-digit days
    val = dstr.split()
    month = m[val[1]]
    day = val[2].zfill(2)
    time = val[3].split(':')
    return '{}{}{}{}{}{}'.format(val[-1], month, day, time[0], time[1], time[2])

if __name__ == '__main__':
    print(data("Mon Dec 30 06:35:02 EST 2013"))
In: Mon Dec 30 06:35:02 EST 2013
Out: 20131230063502
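An alternative sketch lets the standard datetime module do the parsing instead of a hand-written month table. It assumes the input always has the six fields shown in the question and that day and month names are in English. The time-zone abbreviation is dropped before parsing because %Z support in strptime varies by platform, so no time-zone conversion takes place. The function name reformat_date is hypothetical.

```python
from datetime import datetime

def reformat_date(dstr):
    """Convert 'Mon Dec 30 06:35:02 EST 2013' style strings to YYYYmmddHHMMSS."""
    parts = dstr.split()
    # Drop the fifth field (the time-zone abbreviation) and keep the rest
    cleaned = " ".join(parts[:4] + parts[5:])
    parsed = datetime.strptime(cleaned, "%a %b %d %H:%M:%S %Y")
    return parsed.strftime("%Y%m%d%H%M%S")

print(reformat_date("Mon Dec 30 06:35:02 EST 2013"))  # 20131230063502
```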