Session 8 - Advanced Programming Techniques Flashcards
We know how to read data from a single CSV file.
But often we want to read data from
many files
For example, we might run the same experiment on 100 participants
Each experiment generated a data file and now you want to process them all in one script.
How do we know which files to read from?
One good way is to put all the files into a single directory like /home/alex/subject_data and then find all the files in that directory that match a certain pattern (e.g. ending in .csv).
What does this ‘real directory’ show? - (3)
Here, the directory stores lots of different files from a single experiment.
The ones we care about are .csv files but there are also some other ones in there as well (like the .log files).
In the analysis we want to find and load in all the .csv files and ignore the other ones.
Two ways of listing the contents of a directory - (2)
- glob
- listdir command (part of os)
One way to find out what is in a directory is with the
listdir command (part of os).
One way of listing contents of a directory is using
os.listdir
To download data from YNiC to practise on, we use the command ‘git’
Git is a
free version-control tool (and protocol) that allows you to manage files that are synchronized across the internet.
The very common use for ‘Git’ is for
distributing software source code and data files
One website that runs ‘git’ is called ‘github.com’, which - (2)
is a favourite place for people to store their software projects and has become almost synonymous with ‘git’ itself.
Currently, Github says that they host over 100 million developers and over 420 million software projects
YNiC runs its own git server, which we can use to download
some useful data files like this
Example of downloading some useful data files from YNiC git server
What does this code do? - (6)
- This code is used to download a smaller version of a repository called ‘pin-material’ from a specific URL.
- The ‘!cd /content’ command changes the directory to ‘/content’.
- ‘!git clone --branch small --depth 1 https://vcs.ynic.york.ac.uk/cn/pin-material.git’ is the main command.
- It clones the ‘small’ branch of the repository with a depth of 1, meaning it only gets the latest version of the files, not the entire history.
- The comment explains that this smaller version doesn’t include neuroimaging data, making it much smaller than the full repository.
- The ‘!ls -lat’ command lists the contents of the current directory in detail, showing the latest changes first.
Git is a protocol for getting source files and text files from a server in a
particular order
os.listdir() is a function from the …. module
os
We can use the command listdir to see what is in a
particular directory
We can check the current working directory with the os module by using
os.getcwd()
We can list the contents of the current working directory using a function from os called
os.listdir()
We can list the contents of a different directory by passing its path to os.listdir()
For instance, os.listdir('/content/pin-material')
lists the contents of the ‘/content/pin-material’ directory.
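The calls above can be tried out with a minimal sketch like the one below. It assumes nothing about the YNiC material: it builds a throwaway directory with hypothetical file names (`subject01.csv` and so on) so that it runs on any machine.

```python
import os
import tempfile

# Build a throwaway directory so the example runs anywhere.
# The file names are hypothetical stand-ins for real data files.
demo_dir = tempfile.mkdtemp()
for name in ["subject01.csv", "subject02.csv", "notes.log"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("placeholder\n")

# Where are we right now?
print(os.getcwd())

# List the current directory ('.' means 'this directory').
contents = os.listdir(".")
print(type(contents))

# List a different directory by passing its path.
newcontents = os.listdir(demo_dir)
print(sorted(newcontents))
```

The result of `os.listdir` is an ordinary list of names (not full paths), and its order is not guaranteed, which is why the sketch sorts before printing.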
Explain this code (using YNiC’s git server to download useful files like pin-material-git) - (8)
- This code uses the os module.
- os.getcwd() prints the current working directory.
- os.listdir('.') lists the contents of the current directory and stores them in a variable called ‘contents’.
- '.' represents the current directory.
- type(contents) prints the type of the variable contents.
- print(contents) prints the contents of the current directory.
- os.listdir('/content/pin-material') lists the contents of the ‘/content/pin-material’ directory (a different directory) and stores them in the variable ‘newcontents’.
- The contents of the ‘newcontents’ variable are printed out.
Output of this code
Both os.listdir('.')
and os.listdir()
refer to the same thing:
listing the contents of the current directory.
Remember .. means
‘go up one directory’
‘.’ means
‘this directory’
This is what pin-material directory looks like in file format:
os.listdir()
includes hidden files, which start with a…
Hidden files may….
You may need to filter… - (3)
- with a dot (e.g., .DS_Store)
- Hidden files may not be useful and can clutter the list.
- You may need to filter out hidden files from the list returned by os.listdir()
Example of os.listdir() including hidden files (e.g., .DS_Store)
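One possible way to filter out hidden files is sketched below. It is only an illustration: the temporary directory and the three file names are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Hypothetical directory containing one hidden file and two normal files.
demo_dir = tempfile.mkdtemp()
for name in [".DS_Store", "README.md", "fft_bw.jpg"]:
    open(os.path.join(demo_dir, name), "w").close()

all_names = os.listdir(demo_dir)

# Keep only the names that do not start with a dot.
visible = []
for name in all_names:
    if not name.startswith("."):
        visible.append(name)

print(sorted(all_names))
print(sorted(visible))
```

Checking `name.startswith(".")` is enough here because `os.listdir` returns bare names rather than full paths.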
A more useful function than os.listdir is
the glob function from the glob module
What does ‘glob’ stand for?
It is short for ‘global pattern match’
- The glob function from the glob module is used to
find files and directories matching a specific pattern.
The ‘glob’ function from glob module allows you to use special characters such as ‘*’ and ‘?’ to
search for strings that match certain patterns.
Example of using glob on YNiC pin material
Example of using YNiC pin material directory
Explain the code - (5)
- Importing the glob function is achieved with from glob import glob.
- filelist = glob('/content/pin-material/*.jpg') finds all .jpg files in the ‘pin-material’ directory.
- print(filelist) displays the list of .jpg files found.
- pyFiles = glob('/content/pin-material/*.py') finds all Python script files.
- print(sorted(pyFiles)) prints the Python script files as a sorted list, in ascending order.
Output of this code:
We see in this code that glob returns whatever path we used in the argument
Therefore, if we use the full path (as we did above), we get back a list of full paths
In other words:
- When provided with the full path as an argument, glob returns a list of full paths.
We could then use this list in a loop to open multiple files and load the data from
each one in turn
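The idea of globbing with a full path and then looping over the result might look like the sketch below. The directory, the `subject*.csv` names and the one-column file layout are all hypothetical; real data files would normally be read with a proper CSV reader.

```python
import os
import tempfile
from glob import glob

# Hypothetical data directory: three small CSV files plus one log file.
demo_dir = tempfile.mkdtemp()
for i in [1, 2, 3]:
    with open(os.path.join(demo_dir, f"subject{i}.csv"), "w") as f:
        f.write(f"score\n{i * 10}\n")
open(os.path.join(demo_dir, "run.log"), "w").close()

# Because the pattern contains the full directory, glob returns full paths.
filelist = sorted(glob(os.path.join(demo_dir, "*.csv")))

# Open each file in turn and pull out the value on its second line.
scores = []
for path in filelist:
    with open(path) as f:
        lines = f.read().splitlines()
    scores.append(lines[1])
    print(path, lines[1])
```

Since every entry in `filelist` already includes the directory, each path can be passed straight to `open` without any further joining.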
We can use the sorted function on the list returned by os.listdir so that hidden files (names starting with a dot) are listed first
What are wildcard characters in the context of glob?
Wildcard characters are special symbols used in glob patterns to match filenames or paths.
List all the wildcard characters used in the glob function - (4)
- * (an asterisk)
- ? (a question mark)
- [1234] a list of characters
- [1-9] a range of characters
Explain the wildcard ‘*’ in glob - (2)
- It matches any set of characters, including no characters at all.
- For example, ‘file*.txt’ matches ‘file.txt’ and ‘file123.txt’.
What does the ‘?’ wildcard match in glob? - (2)
- It matches any single character.
- For example, ‘file?.txt’ matches ‘file1.txt’, ‘fileA.txt’, but not ‘file12.txt’.
How does the wildcard ‘[1234]’ work in glob? - (2)
- ‘[1234]’ is a wildcard character in glob that matches any single character from the list [1234].
- For example, ‘file[1234].txt’ matches ‘file1.txt’, ‘file2.txt’, but not ‘file5.txt’.
Explain the ‘[1-9]’ wildcard in glob - (2)
- ‘[1-9]’ is a wildcard character in glob that matches any single character in the range from 1 to 9.
- For example, ‘file[1-9].txt’ matches ‘file1.txt’, ‘file2.txt’, but not ‘file10.txt’.
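The four wildcards can be checked without touching the disk by using `fnmatchcase` from the standard `fnmatch` module, which applies the same style of pattern to plain strings. This is a sketch for experimenting; the file names are the illustrative ones used in the cards above.

```python
from fnmatch import fnmatchcase

# Illustrative names only; no files are created.
names = ["file.txt", "file1.txt", "file5.txt", "file10.txt", "fileA.txt"]
patterns = ["file*.txt", "file?.txt", "file[1234].txt", "file[1-9].txt"]

matches = {}
for pattern in patterns:
    matched = []
    for name in names:
        if fnmatchcase(name, pattern):
            matched.append(name)
    matches[pattern] = matched
    print(pattern, "->", matched)
```

One difference worth remembering: with `glob`, a name beginning with a dot is only matched by a pattern that also begins with a dot, so `*` on its own skips hidden files.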
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/fft*’) print? - (4)
The glob pattern ‘/content/pin-material/fft*’ matches all files in the ‘/content/pin-material’ directory that start with ‘fft’.
- From the given list of files:
- ‘fft_colour.jpg’ and ‘fft_bw.jpg’ match the pattern.
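- Therefore, glob(‘/content/pin-material/fft*’) would print the full paths [‘/content/pin-material/fft_colour.jpg’, ‘/content/pin-material/fft_bw.jpg’], because glob keeps the directory given in the pattern (the order of the list is not guaranteed).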
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/*md’) print? - (4)
- The glob pattern ‘/content/pin-material/*md’ matches all files in the ‘/content/pin-material’ directory that end with ‘md’.
- Based on the * wildcard, which matches any set of characters, it will find files ending with ‘md’.
- From the given list of files, ‘README.md’ matches the pattern.
- Therefore, glob(‘/content/pin-material/*md’) would print the full path [‘/content/pin-material/README.md’].
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/pop?_*’) print? - (7)
- The glob pattern ‘/content/pin-material/pop?_*’ utilizes two wildcard characters: ‘?’ and ‘*’.
- ‘?’ matches any single character, allowing for flexibility in matching filenames.
- ‘*’ matches any set of characters, including no characters at all.
- Therefore, the pattern matches files in the ‘/content/pin-material’ directory that start with ‘pop’, followed by any single character, and then an underscore, and then any set of characters.
- Based on this pattern:
- ‘pop2_tidy_script2.py’, ‘pop2_tidy_script1.py’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’ and ‘pop3_test_script.py’ would all match (in ‘pop3_test_script.py’ the ‘?’ matches the ‘3’).
- Therefore, glob(‘/content/pin-material/pop?_*’) would print the full paths of all five files: [‘/content/pin-material/pop2_tidy_script2.py’, ‘/content/pin-material/pop2_tidy_script1.py’, ‘/content/pin-material/pop2_debug_script2.py’, ‘/content/pin-material/pop2_debug_script1.py’, ‘/content/pin-material/pop3_test_script.py’].
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/pop*’) print? - (4)
- The glob pattern ‘/content/pin-material/pop*’ matches all files in the ‘/content/pin-material’ directory that start with ‘pop’.
- Based on the ‘*’ wildcard, which matches any set of characters, it will find files that start with ‘pop’.
- From the given list of files, ‘pop2_tidy_script2.py’, ‘pop2_tidy_script1.py’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, and ‘pop3_test_script.py’ match the pattern.
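- Therefore, glob(‘/content/pin-material/pop*’) would print the full paths [‘/content/pin-material/pop2_tidy_script2.py’, ‘/content/pin-material/pop2_tidy_script1.py’, ‘/content/pin-material/pop2_debug_script2.py’, ‘/content/pin-material/pop2_debug_script1.py’, ‘/content/pin-material/pop3_test_script.py’].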
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/pop?_tidy_script[1-2]*’) print? - (6)
- The glob pattern ‘/content/pin-material/pop?_tidy_script[1-2]*’ matches files in the ‘/content/pin-material’ directory that start with ‘pop’, followed by any single character, then ‘_tidy_script’, then either ‘1’ or ‘2’, and then any set of characters.
- ’?’ matches any single character, allowing flexibility in matching filenames.
- ‘[1-2]’ matches either ‘1’ or ‘2’.
- ‘*’ matches any set of characters, including no characters at all.
- From the given list of files, ‘pop2_tidy_script2.py’ and ‘pop2_tidy_script1.py’ match the pattern.
- Therefore, glob(‘/content/pin-material/pop?_tidy_script[1-2]*’) would print the full paths [‘/content/pin-material/pop2_tidy_script2.py’, ‘/content/pin-material/pop2_tidy_script1.py’].
Here are the files in the directory:
[‘pop2_tidy_script2.py’, ‘s3’, ‘README.md’, ‘app_headshape.xlsx’, ‘fft_colour.jpg’, ‘pop2_tidy_script1.py’, ‘fft_bw.jpg’, ‘.DS_Store’, ‘s4’, ‘app_headshape.bin’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, ‘.git’, ‘pop3_test_script.py’]
What would glob(‘/content/pin-material/fft*.jpg’) print? - (4)
- The glob pattern ‘/content/pin-material/fft*.jpg’ matches files in the ‘/content/pin-material’ directory that start with ‘fft’, followed by any set of characters, and end with ‘.jpg’.
- ‘*’ matches any set of characters, including no characters at all.
- From the given list of files, ‘fft_colour.jpg’ and ‘fft_bw.jpg’ match the pattern.
- Therefore, glob(‘/content/pin-material/fft*.jpg’) would print the full paths [‘/content/pin-material/fft_colour.jpg’, ‘/content/pin-material/fft_bw.jpg’].
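The pattern questions above can be checked with a sketch like the following. It recreates the listed names as empty files in a temporary directory, which is an assumption made for illustration: in the real repository ‘s3’, ‘s4’ and ‘.git’ are presumably directories, and the real location is /content/pin-material.

```python
import os
import tempfile
from glob import glob

# Recreate the names from the cards as empty files in a throwaway directory.
names = ["pop2_tidy_script2.py", "s3", "README.md", "app_headshape.xlsx",
         "fft_colour.jpg", "pop2_tidy_script1.py", "fft_bw.jpg", ".DS_Store",
         "s4", "app_headshape.bin", "pop2_debug_script2.py",
         "pop2_debug_script1.py", ".git", "pop3_test_script.py"]
demo_dir = tempfile.mkdtemp()
for name in names:
    open(os.path.join(demo_dir, name), "w").close()

patterns = ["fft*", "*md", "pop?_*", "pop*",
            "pop?_tidy_script[1-2]*", "fft*.jpg"]

# For each pattern, store the sorted list of full paths that glob returns.
results = {}
for pattern in patterns:
    results[pattern] = sorted(glob(os.path.join(demo_dir, pattern)))
    print(pattern, "->", results[pattern])
```

Every entry printed is a full path beginning with the directory used in the pattern, and ‘pop?_*’ picks up ‘pop3_test_script.py’ as well as the four ‘pop2_’ scripts.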
There are cases where you might have full paths (e.g. from glob above) and need to split them up into directory and filename. You may also want to split out the extension of a file from the main part of it (i.e. turn myfile.txt into myfile and .txt).
You are already thinking of the split() function right? Well that can work but in addition, there are three os functions that can help you with that - (3)
1) basename
2) dirname
3) splitext
How do we import the three functions that help with splitting full file paths
(basename, dirname and splitext)?
Functions like basename, dirname, and splitext from the os.path module can help split full paths into directory, filename, and file extension.
- These functions provide a convenient way to
extract different parts of a file path.
Explain the basename function from the os.path module - (3)
- The basename function, from the os.path module, extracts the filename from a full path.
- It returns the last component of the path, excluding the directory.
- For example, basename(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return ‘s4_rt_data_part01.hdf5’.
Explain the dirname function from the os.path module.
- The dirname function, from the os.path module, extracts the directory name from a full path.
- It returns the directory component of the path, excluding the filename.
- For example, dirname(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return ‘/content/pin-contents/s4’.
Explain the splitext function from the os.path module - (3)
- The splitext function, from the os.path module, splits a filename into its base name and extension.
- It returns a tuple containing the base name and the extension separately.
- For example, splitext(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return (‘/content/pin-contents/s4/s4_rt_data_part01’, ‘.hdf5’)
The splitext function returns a tuple (you can treat it as a list) of two items. - (2)
The first element is everything except the extension of the file and the second element is the extension (including the leading .).
We can use basename, dirname and splitext on variables - (3)
e.g., my_path = ‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’
dname = dirname(my_path)
fname = basename(my_path)
print(splitext(my_path))
What does the splitext function do when applied to the full path? - (3)
- The splitext function splits the full path into its base name and extension.
- When applied to the full path, it returns a tuple containing the base name and the extension separately.
- For example, splitext(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return (‘/content/pin-contents/s4/s4_rt_data_part01’, ‘.hdf5’).
What does the splitext function do when applied to just the filename? - (3)
- When applied to just the filename, the splitext function splits the filename into its base name and extension.
- It returns a tuple containing the base name and the extension separately.
- For example, if fname is ‘s4_rt_data_part01.hdf5’, splitext(fname) would return (‘s4_rt_data_part01’, ‘.hdf5’).
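The three functions can be tried together in a short sketch. The path is the example string used in the cards above; nothing needs to exist on disk, because these functions only manipulate the text of the path.

```python
from os.path import basename, dirname, splitext

my_path = '/content/pin-contents/s4/s4_rt_data_part01.hdf5'

dname = dirname(my_path)     # everything before the final slash
fname = basename(my_path)    # the last component of the path
print(dname)
print(fname)

# splitext on the full path keeps the directory in the first element.
print(splitext(my_path))

# splitext on just the filename gives the bare name and the extension.
root, ext = splitext(fname)
print(root, ext)
```

Unpacking the tuple into two names (`root` and `ext`, both chosen here for illustration) is often more readable than indexing with `[0]` and `[1]`.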
Write code that uses glob to find all of the files in /content/pin-material/s4 that end in .hdf5. Sort this list, loop over it and print out just the filename without the extension. Your output should look like:
Explain this code - (8)
The code first imports the glob function from the glob module and the basename, splitext and dirname functions from the os.path module.
Use glob to find files:
The glob function searches for all files ending with ‘.hdf5’ in the ‘/content/pin-material/s4/’ directory.
The resulting list of file paths is stored in the fileList variable.
A for loop iterates over each file path in fileList.
In each iteration, thisFileName holds one element of fileList.
fNameOnly stores the filename extracted from the full path thisFileName (e.g., if thisFileName is ‘/content/pin-material/s4/s4_rt_data_part04.hdf5’, then fNameOnly will store ‘s4_rt_data_part04.hdf5’).
parts = splitext(fNameOnly): splitext splits the filename stored in fNameOnly into its base name and extension. The base name is stored in parts[0] and the extension in parts[1].
For example, if fNameOnly is ‘s4_rt_data_part04.hdf5’, then after splitting:
parts[0] will store ‘s4_rt_data_part04’ (the base name).
print(parts[0]): Only the base name stored in parts[0] is printed. For example, if parts[0] is ‘s4_rt_data_part04’, then this base name will be printed.
The for loop continues until every element of fileList has been processed.
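The steps described above could be sketched as follows. This is not the original script: it uses a temporary stand-in for the ‘s4’ directory, and the number of .hdf5 files (four) and the extra notes.txt file are assumptions made for illustration. The variable names follow the explanation.

```python
import os
import tempfile
from glob import glob
from os.path import basename, splitext

# Hypothetical stand-in for /content/pin-material/s4 with empty files.
data_dir = os.path.join(tempfile.mkdtemp(), "s4")
os.makedirs(data_dir)
for part in range(1, 5):
    open(os.path.join(data_dir, f"s4_rt_data_part{part:02d}.hdf5"), "w").close()
open(os.path.join(data_dir, "notes.txt"), "w").close()

# Find every .hdf5 file and sort the resulting full paths.
fileList = sorted(glob(os.path.join(data_dir, "*.hdf5")))

printed = []
for thisFileName in fileList:
    fNameOnly = basename(thisFileName)   # drop the directory
    parts = splitext(fNameOnly)          # (base name, extension)
    printed.append(parts[0])
    print(parts[0])
```

Calling `basename` before `splitext` matters: applying `splitext` to the full path would leave the directory attached to the base name.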
Output of this code
There are two additional things we can do with lists which can make our code more concise and easier to read and write
These are list comprehensions and list enumeration.
We know how to make a list both by hand and by the range function
Explain this code - (3)
- list1=[0,1,2,3,4,5]: Defines a list named list1 containing integers 0 through 5, entered manually.
- list2=list(range(6)): Creates a list named list2 using the range() function to generate integers from 0 to 5.
- Prints both list1 and list2.
We often need to manipulate the contents of data in lists and have learned to do this by using
for loops
Example of manipulating the contents of data in a list using a for loop:
Explain this code - (8)
- input_list = range(10): Creates a range object containing integers from 0 to 9 (not including 10), assigned to input_list.
- output_list = []: Initializes an empty list named output_list.
- for value in input_list: Iterates over each value in input_list.
- Inside the loop:
  - value takes on each value from input_list in sequence.
  - output_list.append(value * 2): Multiplies each value by 2 and appends the result to output_list.
- print(list(input_list)): Prints the contents of input_list, displaying integers from 0 to 9.
- print(output_list): Prints the contents of output_list, displaying each element multiplied by 2.
What would be its output?
For cases where we need to implement a simple transformation like this (such as multiplying by a number or calling a function on each member of a list), like in this example,
Python gives us an alternative: the list comprehension.
What does list comprehension mean in Python?
A list comprehension is simply a statement inside square brackets which tells Python how to construct the list.
How do we write the list ‘output_list’ as a list comprehension?
Explain this code - (2)
The example above therefore reads as (x * 2) for each value (x) in range(10). i.e., for each value in the list produced by range(10), put it in the variable x, then put the value x*2 into the list.
Note that the variable x is just a placeholder and could be called anything.
What would be output of this code?
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
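To see that the loop and the comprehension build the same list, the two versions can be placed side by side. The name `comprehension_list` is introduced here only for the comparison.

```python
# Loop version: start with an empty list and append to it.
input_list = range(10)
output_list = []
for value in input_list:
    output_list.append(value * 2)

# Comprehension version: "x * 2 for each x in range(10)".
comprehension_list = [x * 2 for x in range(10)]

print(output_list)
print(comprehension_list)
print(output_list == comprehension_list)
```

The comprehension replaces three lines (create, loop, append) with one, which is the main reason to prefer it for simple transformations.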
The trick with list comprehensions is to read them out
loud to yourself.
List comprehension works with
any sort of list and any sort of data, e.g.,
Explain this code - (6)
- original_data = ['Alex', 'Bob', 'Catherine', 'Dina']: Defines a list named original_data containing four strings.
- new_list = ['Hello ' + x for x in original_data]: Utilizes list comprehension to create a new list named new_list.
  - For each element x in original_data, the expression 'Hello ' + x concatenates ‘Hello ’ with the value of x, which represents each name in original_data.
  - The resulting strings are added to new_list.
- print(original_data): Prints the contents of original_data, displaying the original list of names.
- print(new_list): Prints the contents of new_list, displaying each name prefixed with ‘Hello ’.
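The lines described above, assembled into a runnable form, would look roughly like this (reconstructed from the explanation rather than copied from the original code):

```python
original_data = ['Alex', 'Bob', 'Catherine', 'Dina']

# Build a new list by putting 'Hello ' in front of every name.
new_list = ['Hello ' + x for x in original_data]

print(original_data)
print(new_list)
```

The original list is left unchanged; the comprehension always creates a new list.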
What would be the output of this code?
We can also call functions in
list comprehension
e.g.,
Explain this code - (6)
- original_data = ['This', 'is', 'a', 'test']: Defines a list named original_data containing four strings.
- new_list = [len(x) for x in original_data]: Utilizes list comprehension to create a new list named new_list.
  - For each element x in original_data, the expression len(x) calculates the length of the string x.
  - The resulting lengths are added to new_list.
- print(new_list): Prints the contents of new_list, displaying the length of each string in original_data.
- For example, ‘This’ has 4 characters, ‘is’ has 2 characters, ‘a’ has 1 character, and ‘test’ has 4 characters.
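A runnable version of the walkthrough above is sketched below. The second comprehension, which calls the string method `upper()`, is an extra illustration that is not part of the original example.

```python
original_data = ['This', 'is', 'a', 'test']

# Call a function on every element: the length of each string.
new_list = [len(x) for x in original_data]
print(new_list)

# Methods work the same way (extra illustration, not from the original).
upper_list = [x.upper() for x in original_data]
print(upper_list)
```

Any expression that uses the placeholder variable can appear before the `for`, including function calls, method calls and arithmetic.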