Last year's notes, now exported as an HTML file:
Python Basics¶
This is an introductory guide to Python. Ideal for beginners.
Let's create a simple bank program¶
P.S. To run a cell and create the next one simultaneously, press Shift+Enter
Also, when outside of a cell (the left side of the cell goes blue), you can press 'a' to make a cell above and 'b' to make a cell below
Variables¶
action = "Withdraw"
# Action we're doing
balance = 50000
# Our original balance
interest = 0.04
# Interest rate of our account (4%)
amount = 50
# Amount used in an action
Notice how in the code cell above a small [1] appears. This means it is the first cell executed in this session. Keep an eye on this number in case you run cells in the wrong order
print(type(action))
print(type(balance))
print(type(interest))
print(type(amount))
The type command tells us the datatype Python has chosen, since we don't declare it ourselves. We can also convert between compatible datatypes...
balance = float(balance)
print(type(balance))
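A few more conversions, as a minimal sketch. The variable names below are made up for illustration and are not part of the bank example.

```python
# Illustrative values (not from the bank example above)
amount_text = "50"                            # a string that happens to contain digits
amount_number = int(amount_text)              # str -> int
amount_as_float = float(amount_number)        # int -> float
amount_back_to_text = str(amount_as_float)    # float -> str

print(type(amount_number), amount_number)
print(type(amount_as_float), amount_as_float)
print(type(amount_back_to_text), amount_back_to_text)
```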
Selection (If/Else)¶
if action == "Withdraw":
    print("You're withdrawing funds")
Note that selection statements don't require brackets
if action == "Withdraw" and amount <= balance:
    print("You've decided to withdraw and have sufficient funds")
Chaining logic uses the keywords 'and' / 'or'. Let's also copy this cell
if action == "Withdraw" and amount <= balance:
    balance -= amount
print(balance)
Let's add other cases such as 'Deposit' and 'Calculate Interest'
if action == "Withdraw":
    print("Withdraw")
elif action == "Deposit":
    print("Deposit")
else:
    print("Calculate interest")
Python uses 'elif' rather than 'else if'
Iteration¶
To calculate compound interest, let's test both types of loop in Python
We will do this by saying that each month the balance increases by the rate we set in interest (4%)
months = 5
for i in range(0, 3):
    print(i)
Above is a for (Count-controlled) loop. i is the counter and will increase within the range we set. Our range is 0 to 3 which means it will go up to (but not including) 3.
for j in range(0,months):
    balance = balance * (1 + interest)
    print("Balance:"+balance)
Be careful with datatypes. Although Python will let you print any datatype on its own, when you join values to a string with +, all other variables must first be converted to a string. The cell above raises a TypeError because balance is a number
balance = 50000.0
print("Starting balance = "+str(balance))
for j in range(0,months):
    balance = balance * (1 + interest)
    print("Balance:"+str(balance))
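As an alternative to converting with str(), print also accepts several values separated by commas and converts each one for you. The sketch below assumes a monthly rate of 4% and uses its own variable names (rate, month), which are not from the cells above.

```python
balance = 50000.0
rate = 0.04      # assumed monthly rate of 4%
months = 5

for month in range(1, months + 1):
    balance = balance * (1 + rate)
    # Separate arguments are converted to text automatically and joined by spaces
    print("Month", month, "balance:", round(balance, 2))
```

Rounding with round(balance, 2) only affects what is printed here; the balance variable keeps its full precision.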
We could also do this with a while loop (condition-controlled), although for a task with a fixed number of repetitions a for loop would typically be preferred
balance = 50000.0
print("Starting balance = "+str(balance))
counter = 0
while counter < months:
    balance = balance * (1 + interest)
    print("Balance:"+str(balance))
    counter += 1
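Both loops repeat the same multiplication, so the final balance can be checked against the compound interest formula: start balance times (1 + rate) to the power of the number of months. A small sketch, assuming a 4% monthly rate and illustrative variable names:

```python
start_balance = 50000.0
rate = 0.04   # assumed 4% per month
months = 5

# Loop version: apply the interest once per month
looped = start_balance
for _ in range(months):
    looped = looped * (1 + rate)

# Formula version: multiply by (1 + rate) raised to the number of months
closed_form = start_balance * (1 + rate) ** months

print(round(looped, 2), round(closed_form, 2))
```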
Lists¶
Although understanding variables is essential, you're likely to mostly use constructs such as lists and dictionaries
myList = list()
# or
myList = []
Creating an empty list
myList = [1,2,3]
Creating a simple 3 item list
myList = ["abc", 2, "def"]
Not fussy about mixing data
myList.append(4)
print(myList)
Also happy to add items
myList[0] = "xyz"
print(myList)
Changing items in a list also uses standard zero indexing (first item is 0)
for i in range(0, len(myList)):
    print(myList[i])
Iterating through a list can be done two ways. The first is simply setting a range up to the length of the list
for i in myList:
    print(i)
Even easier is looping over the list directly, which visits every item in turn. Instead of i holding the current index, it will hold the current item
Functions¶
You may not need to produce many functions in your project but understanding them is essential
def myFunc():
    print("abc")
Defining a function uses the keyword 'def' followed by the function name with brackets
myFunc()
To call it we simply type its name with brackets. As you can see the function prints "abc" as expected
def myFuncWithParams(x):
    print(x.lower())
Functions may also require parameters. In this case our function should print whatever is passed to it in lowercase
myFuncWithParams("I lovE PythoN")
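Functions can also hand a value back with return instead of printing it, which lets you store and reuse the result. A minimal sketch; the name to_lower is made up for this example.

```python
def to_lower(text):
    # Return the lowercase text instead of printing it
    return text.lower()

result = to_lower("I lovE PythoN")
print(result)
print(len(result))   # the returned value can be used like any other string
```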
Python Essentials¶
This should build upon the last code demo with a few more advanced concepts.
Imports¶
import numpy as np # importing numpy as np
You can import a Python module using the import command. You can also give it a shorter name (e.g. numpy as np)
data_new = [6, 7, 8, 0, 1]
data = np.array(data_new) # accessing numpy as np. Here I am converting a list to array
print(data)
Above we've used numpy to create a numpy array out of the list. This will be useful later as numpy arrays are used by modules later in your project
More Strings¶
a = 'Big data'
print(type(a))
print(isinstance(a, str))
You can get the type of an object using the type command. You can check whether an object is an instance of a particular type using the isinstance function.
x = ' This is big data examiner'
x[10] = 'f' # raises a TypeError: strings do not support item assignment
x = x[0:9] + "a lecture"
print(x)
Strings cannot be edited by character index, but a new string can be built using operations such as slicing
x = 'Java is a powerful programming language'
y = x.replace('Java', 'Python')
print(y)
Replace is another useful function to replace characters and words
a = 'Python'
print(list(a))
print(a[:3])
print(a[3:])
Python is also quite flexible in converting between datatypes. For example it will turn a string into a list of characters with ease
# String concatenation is very important
p = "Python is the best programming language"
q = ", I have ever seen"
print(p+q)
String concatenation
print("Costs £%.3f for a %s" %(1.35, 'bag of sweets'))
print("Costs £%.2f for a %s" %(0.73, 'apple'))
print("Costs £%d for a %s" %(1.13, 'chocolate bar'))
You have to do a lot of string formatting while doing data analysis. You can format an argument as a string using %s, as an integer using %d, or as a number with 3 decimal places using %.3f. To do more with strings, look into string formatting in Python
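As a starting point for further reading, the same format codes also work with f-strings and str.format. This is a small sketch; the variable names are illustrative.

```python
price = 1.35
item = "bag of sweets"

# The format code goes after a colon inside the braces
line_fstring = f"Costs £{price:.3f} for a {item}"
line_format = "Costs £{:.2f} for a {}".format(0.73, "apple")

print(line_fstring)
print(line_format)
```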
Date-time¶
# Python date and time module provides datetime, date and time types
from datetime import datetime, date, time
td = datetime(1989,6,9,5,1,30) # do not write the number 6 as 06, you will get a syntax error.
print(td.day)
print(td.minute)
print(td.date())
print(td.time())
print (td.strftime('%d/%m/%y %H:%M:%S'))
Datetime is a useful module for date and time formatting. It allows you to create a datetime object and then print and format elements as you wish.
Note that pressing shift + tab on a function should tell you its parameters
td = datetime(1989,6,9,5,1, 30)
td1 = datetime(1988,8, 31, 11, 2, 23)
new_time = td1 - td # you can subtract one datetime object from another
print(new_time)
Dates and times can also be subtracted from one another to calculate difference
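The result of a subtraction is a timedelta object, which you can query for days or total seconds. A short sketch using the same two dates, subtracted the other way round so the difference is positive:

```python
from datetime import datetime

td = datetime(1989, 6, 9, 5, 1, 30)
td1 = datetime(1988, 8, 31, 11, 2, 23)

difference = td - td1   # later minus earlier gives a positive timedelta
print(difference.days)              # whole days between the two moments
print(difference.total_seconds())   # the full difference expressed in seconds
```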
Handling Exceptions¶
print (float('7.968'))
print (float('Big data'))
For obvious reasons, a string that doesn't contain a number cannot be converted to a float (a numeric datatype). To avoid hitting this error we should use a try-except statement (just like a try-catch in Java)
def return_float(x):
    try:
        return float(x)
    except ValueError:
        # float() raises ValueError for text that is not a number
        return 0
print (return_float('4.55'))
print (return_float('big data'))
The error for converting a string has been handled in the except section of the statement, so instead of printing an error, it returns 0
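A variation on the same idea, sketched under the assumption that you may also receive values that are not strings at all (such as None). The names to_float and default are made up for this example.

```python
def to_float(value, default=0.0):
    # `to_float` and `default` are hypothetical names used for illustration
    try:
        return float(value)
    except (ValueError, TypeError):
        # ValueError: text that is not a number; TypeError: e.g. None
        return default

print(to_float("4.55"))
print(to_float("big data"))
print(to_float(None, default=-1.0))
```

Naming the exceptions you expect means any other, unexpected error still shows up instead of being silently turned into the default.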
Tuples¶
deep_learning = ('SkLearn', 'Open cv', 'Torch') # you can unpack a tuple
print(deep_learning[0])
Tuples are immutable, which means their items and length can't be changed once created, but just like lists, items can be fetched by index
x,y,z= deep_learning
print (x)
print (y)
print (z)
Because our tuple is 3 items long it can also be unpacked into 3 separate variables using x,y,z. (The same can be done with a list)
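To see immutability in action, here is a small sketch. The variable names libraries and editable are illustrative.

```python
libraries = ("SkLearn", "Open cv", "Torch")

try:
    libraries[0] = "Keras"      # tuples do not support item assignment
except TypeError as error:
    print("Cannot change a tuple:", error)

# A list built from the tuple can be changed freely
editable = list(libraries)
editable[0] = "Keras"
print(editable)
```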
More Lists¶
countries = ['Usa', 'Russia', 'Usa', 'Germany', 'France', 'Italy']
print(countries.count('Usa')) # .count can be used to count how many times a value appears in a list/tuple
Use of the count function
x = [3,2,3]
x.extend([4,9,6])
print(x)
When adding multiple items extend is used rather than append
x.sort()
print(x)
Python also has a handy sort function
countries.sort()
print(countries)
countries.sort(key=len) # countries are sorted according to number of characters
print(countries)
You can also define the sort key if you don't want the default (e.g. sorting by length rather than alphabetically)
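Note that .sort() changes the list itself. If you want to keep the original order, the built-in sorted() returns a new list and accepts the same key and reverse options. A short sketch:

```python
countries = ["Usa", "Russia", "Usa", "Germany", "France", "Italy"]

by_length = sorted(countries, key=len)                    # new list, shortest first
longest_first = sorted(countries, key=len, reverse=True)  # new list, longest first

print(by_length)
print(longest_first)
print(countries)   # the original list is unchanged
```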
languages = ['Python', 'Pandas', 'Keras', 'Tensorflow']
for i,val in enumerate(languages):
    print (i,val)
When iterating over a sequence, you can use 'enumerate' to keep track of the index of the current element. It gives the counter (i) and the item (val)
first_name = ['Ben', 'John', 'Kevin']
last_name = ['Andrew', 'Bustard', 'McLaughlin']
combined = zip(first_name, last_name)
for i in combined:
    print(i)
Zipping is also useful for combining lists into tuples (grouping items from separate lists)
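A zip object can only be looped over once, so it is often converted straight into a list or dictionary. A minimal sketch; the names pairs, full_names and lookup are illustrative.

```python
first_name = ["Ben", "John", "Kevin"]
last_name = ["Andrew", "Bustard", "McLaughlin"]

# zip() yields its pairs only once, so build each result from a fresh zip()
pairs = list(zip(first_name, last_name))
full_names = [first + " " + last for first, last in zip(first_name, last_name)]
lookup = dict(zip(first_name, last_name))

print(pairs)
print(full_names)
print(lookup["John"])
```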
list(reversed(range(20)))
Reversed list
Dictionaries¶
myDict = {'a' : 3, 'b' : 6}
# key : value
Dictionaries are another important construct. Dictionaries allow you to map keys to values, which allows us to get items by id rather than index
print(myDict.get('a'))
How you get an item by key from a dictionary
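Square brackets also fetch by key, but they behave differently from .get() when the key is missing. A short sketch of the difference:

```python
myDict = {"a": 3, "b": 6}

print(myDict["a"])             # square brackets also fetch by key
print(myDict.get("z"))         # a missing key gives None with .get()
print(myDict.get("z", 0))      # ...or a default value you choose

try:
    myDict["z"]                # square brackets raise KeyError if the key is missing
except KeyError:
    print("'z' is not in the dictionary")
```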
for value in myDict.values():
    print(value)
Printing values (looping over myDict directly, or over myDict.keys(), gives the keys instead)
for key, value in myDict.items():
    print(key)
    print(value)
Printing both key and value in loop
myDict.update({'a' : 4})
print(myDict)
Updating an item by key
myDict.update({'c' : 12})
print(myDict)
Adding an item is the same (will add if item is not found)
print(myDict.pop('a'))
print(myDict)
Delete using pop (which also returns the value of the item)
Cleaning data¶
Raw data is messy, so you have to clean the data set to make it ready for analysis. Here we have a list of countries that contains unnecessary punctuation, inconsistent capitalization and stray white space.
- First, I am importing a Python module called [regular expression](https://docs.python.org/2/library/re.html)
- Second, I am creating a function called removeBadCharacters, to remove the unnecessary punctuation
- Third, I am using some of Python's inbuilt functions to clean text
countries = [' Argentina', '$USA$', 'france', 'GerMany', 'Kenya!', 'India##', 'Spain(www.spain.com)']
import re
Above is a typical example of the kind of messy data you may need to clean. Creating functions to apply cleaning is a very good idea (especially when you start using pandas)
def removeBadCharacters(text):
    return re.sub(r'[^\w\s]','',text)
for i in range(0, len(countries)):
    countries[i] = removeBadCharacters(countries[i])
print(countries)
The function removeBadCharacters uses re to apply a regex (a special character sequence) that substitutes all punctuation with an empty string (removes it). It returns the cleaned text, as printed above
for i in range(0, len(countries)):
    countries[i] = countries[i].strip()
print(countries)
Strip is one of python's inbuilt functions. It removes all leading and ending whitespace
for i in range(0, len(countries)):
    countries[i] = countries[i].lower().capitalize()
print(countries)
lower() makes all characters lowercase and .capitalize() makes the first letter uppercase
The formatting is nearly complete but the last item which had a URL in brackets is still incorrect. Try creating a function called removeUrl which turns a string such as "Spain(www.spain.com)" to "Spain"
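If you get stuck, here is one possible approach, offered as a sketch rather than the intended answer. It assumes the URL always sits inside round brackets, and the helper name cleanCountry is made up for this example. The URL is removed first, because once the punctuation has been stripped the brackets are no longer there to find.

```python
import re

def removeUrl(text):
    # Drop anything inside round brackets, e.g. "(www.spain.com)"
    return re.sub(r'\(.*?\)', '', text)

def cleanCountry(text):
    # Hypothetical helper: remove the URL first, while the brackets still exist
    text = removeUrl(text)
    text = re.sub(r'[^\w\s]', '', text)   # same pattern as removeBadCharacters
    return text.strip().capitalize()

countries = [' Argentina', '$USA$', 'france', 'GerMany', 'Kenya!', 'India##', 'Spain(www.spain.com)']
cleaned = [cleanCountry(c) for c in countries]
print(cleaned)
```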
Visualisation¶
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('Some numbers')
plt.show()
y = [3, 10, 7, 5, 3, 4.5, 6, 8.1]
x = range(len(y))
print(x)
plt.bar(x, y, color="blue")
plt.show()
myDict = {'PersonA':26, 'PersonB': 17, 'PersonC':30}
plt.bar(list(myDict.keys()), list(myDict.values()))
x = np.linspace(0,10,10)
y1 = x
y2 = x**2
y3 = x**3
y4 = np.sqrt(x)
fig, ax_lst = plt.subplots(2, 2) # a figure with a 2x2 grid of axes
plt.subplot(2,2,1)
plt.plot(x, y1, 'ro')
plt.subplot(2,2,2)
plt.plot(x, y2, 'bo')
plt.subplot(2,2,3)
plt.plot(x, y3, 'go')
plt.subplot(2,2,4)
plt.plot(x, y4, 'yo')
The Live EDA Demo¶
This demo extends what we've covered about Python and also introduces the basics of some other essential libraries such as pandas. This guide should act as a simplified version of the kind of notebook you will produce through your project.
Importing Data & Cleaning¶
import pandas as pd
Pandas is the package you will be learning next. It's used for accessing and modifying datasets. For now we will just use it to open our 'games' dataset and print
df = pd.read_csv('games.csv')
df.head()
# df = pd.read_csv('_____.tab', sep='\t') --> Needed for household dataset tab files
df2 = pd.read_csv("example.tab", sep='\t')
df2.head()
Pandas allows us to open a csv/tab file and save it as a dataframe (called 'df'). This is extremely useful as it will mostly organise the data as we want, and allows us to easily get columns/rows. df.head() is used to print the top of the table just to have a quick peek.
print(df.shape)
Shape tells us its structure. This dataset has 16598 rows and 11 columns.
print(df.dtypes)
dtypes tells us datatypes of the columns
print(df.loc[0])
Using df.loc (locate) gets row data by index label, so here we've selected the first row and printed out all of its details
for i in range(0, 20):
    print(df.loc[i]["Genre"])
Above is a basic example of iterating through rows in our data. At each iteration we've used df.loc[i] to get the current row's data, and then specify we only want "Genre" returned. Through learning pandas you will find more efficient ways of doing this
for i in range(0, df.shape[0]):
    if df.loc[i]["Genre"] == "Misc":
        df.drop(i, inplace=True)
print(df["Genre"])
Above we use drop in the loop to remove any games with a "Misc" genre. It is a simple way (and by no means the most efficient) of removing rows based on a check
You'll have to do a lot more cleaning than this, so follow the Codecademy pandas course closely to learn how to use it more effectively and efficiently
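As a taste of the more efficient style, pandas can filter rows in one step using a condition on a column. The sketch below uses a tiny made-up table rather than the real games dataset, so the names and values are assumptions.

```python
import pandas as pd

# A tiny made-up table standing in for the games dataset
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Genre": ["Sports", "Misc", "Racing", "Misc"],
})

# Keep only the rows whose Genre is not "Misc"
filtered = games[games["Genre"] != "Misc"]
print(filtered)
```

The comparison inside the square brackets produces a True/False value for every row, and only the True rows are kept.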
Analysing data¶
Once your data is clean and you've set up your dataframe for analysis, begin testing and further understanding your data
platforms = dict(df["Platform"].value_counts())
print(platforms)
Pandas has some simple tools to see your data. For example, by counting we can see how popular each category is, but visualisations are much better...
import matplotlib.pyplot as plt
Matplotlib is a basic graphical package for python. We'll use it to make some simple graphs and draw some meaning from our data
plt.plot(list(platforms.keys()), list(platforms.values()))
As you can see the default can be messy. To clear it up let's make it bigger and change to a bar chart
plt.figure(figsize=(15,9))
plt.plot(list(platforms.keys()), list(platforms.values()))
We've added a parameter for figsize when we initialize the figure. Now it's a little easier to read but the chart type still doesn't make sense
plt.figure(figsize=(15,9))
plt.bar(list(platforms.keys()), list(platforms.values()))
Great, now we can see the data and understand it effectively. By creating multiple charts and comparing variables we can start to draw greater meaning from the data and help us make a discovery
platform = []
popularity = []
numberOfPlatforms = 9
for key in sorted(platforms, key=platforms.get, reverse=True):
    platform.append(key)
    popularity.append(platforms.get(key))
platform_sample = platform[0:numberOfPlatforms]
popularity_sample = popularity[0:numberOfPlatforms]
print(platform_sample)
print(popularity_sample)
Before moving onto the next visualisation I'm sorting the keys/values we have for each platform and splitting these into lists. This way I can get the data for the 9 most popular platforms which will help simplify the chart
other = 0
for i in range(numberOfPlatforms, len(popularity)):
    other += popularity[i]
platform_sample.append("Other")
popularity_sample.append(other)
print(platform_sample)
print(popularity_sample)
I'm also including a loop to sum all other items not included in these lists, which is named the "Other" category. This is important to reflect the fact that there are more than 9 platforms
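The same "top N plus Other" grouping can also be sketched with collections.Counter from the standard library. The counts below are made up and only stand in for the platforms dictionary; top_n is set to 3 to keep the example short.

```python
from collections import Counter

# Made-up counts standing in for the `platforms` dictionary
platform_counts = Counter({"DS": 9, "PS2": 8, "PS3": 5, "Wii": 4, "X360": 3, "PSP": 1})
top_n = 3   # the notebook uses 9; a smaller number keeps the example short

top = platform_counts.most_common(top_n)           # list of (platform, count) pairs
labels = [name for name, count in top] + ["Other"]
sizes = [count for name, count in top]
sizes.append(sum(platform_counts.values()) - sum(sizes))   # everything not in the top N

print(labels)
print(sizes)
```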
fig1, ax1 = plt.subplots()
ax1.pie(popularity_sample, labels=platform_sample, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Now we have a pie chart that accurately reflects our data, with some summarisation in the "Other" category.
For a challenge, when you've got a better understanding of pandas, try creating dataframes for different years and build a chart to compare how popular certain gaming platforms were in each year side-by-side
What else should you include in your notebook¶
- More advanced visualisations; There are more effective and advanced charts to use in matplotlib and seaborn
- Extra datasets; Gather more insight and collect more data to build a greater understanding in your analysis
- Look from a different angle; Analyse all relevant data you have in several different ways and think outside the box to make a useful discovery
- Consider using machine learning; Use machine learning to build a model which can use your analysis to predict future trends
Don't worry about this too much yet; we'll cover all of this throughout the course. For now, get more familiar with Python and begin learning pandas and then matplotlib/seaborn
The Live Pandas Demo¶
This demo will show you how pandas works.
import pandas as pd
data
%time