Final Exam: Olympics! (Version 1.0.0)
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of Pandas, Python data structures, SQL, string manipulation, feature engineering, train/test splits, and model performance metrics. It has 11 exercises, numbered 0 to 10. There are 23 available points; however, to earn 100% the threshold is 14 points. (Therefore, once you hit 14 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not have printed them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercise 0: 1 point (FREE)
Exercise 1: 4 points
Exercise 2: 1 point
Exercise 3: 3 points
Exercise 4: 2 points
Exercise 5: 2 points
Exercise 6: 2 points
Exercise 7: 3 points
Exercise 8: 2 points
Exercise 9: 2 points
Exercise 10: 1 point
Final reminders:
### Global imports
import dill
from cse6040_devkit import plugins, utils
For this exam, you are given Olympic athlete data and you are tasked with analyzing historical data from the Olympic Games to uncover insights and build a predictive model. The primary dataset contains information on athletes, including their demographics, events, medals, and countries across 120 years of Olympic Games (Summer and Winter). A secondary dataset contains information about the countries hosting the Olympics, including the year, cities, and costs.
🔍 Part 1: Exploratory Data Analysis (EDA)
🧠 Part 2: Feature Engineering
🔮 Part 3: Predictive Modeling
The primary dataset includes 2 components from Kaggle:
- results_df, which includes athlete-level data for over 120 years of Olympic events
- noc_regions_df, which contains country-level enrichment

The secondary dataset (included as SQL tables) includes:

- olympic_costs, which details Olympiad budgets and costs from Wikipedia
- inflation, which comprises annual inflation data from 1929-2023 from Investopedia

Run the cell below to load the data!
### Run Me!!!
import pandas as pd
import numpy as np
from pprint import pprint
import sqlite3
results_df=utils.load_object_from_publicdata('results_df.dill')
noc_regions_df=utils.load_object_from_publicdata('noc_regions_df.dill')
conn=sqlite3.connect('resource/asnlib/publicdata/olympic.db')
explore_olympics_data__FREE
Example: we have defined explore_olympics_data__FREE as follows:
This is a free exercise!
Please run the test cell below to collect your FREE point!
We encourage you to review the samples of both results_df and noc_regions_df in the cell below:
- results_df: A DataFrame of historic results from athletes competing in both Summer and Winter Olympics for over 120 years.
- noc_regions_df: A DataFrame mapping NOCs to their respective country. A National Olympic Committee (NOC) is a country's organization that is responsible for its participation in the Olympic Games.

### Solution - Exercise 0
def explore_olympics_data__FREE(results_df: pd.DataFrame, noc_regions_df: pd.DataFrame, headsize: int = 10) -> tuple:
    results_df_preview = results_df.head(n=headsize)
    noc_regions_df_preview = noc_regions_df.head(n=headsize)
    return (results_df_preview, noc_regions_df_preview)
### Demo function call
results_df_preview, noc_regions_df_preview = explore_olympics_data__FREE(results_df,noc_regions_df)
print(f'results_df DataFrame preview: size {results_df.shape}')
display(results_df.head(5))
print(f'\n\nnoc_regions_df DataFrame preview: size {noc_regions_df.shape}')
display(noc_regions_df_preview.head(5))
### Test Cell - Exercise 0
print('Passed! Please submit.')
cleanse_results
Your task: define cleanse_results as follows:
'With great risk comes great reward' - Thomas Jefferson. Approach this problem with caution.
Input: results_df: A Pandas DataFrame, as described in Exercise 0
Return: sorted_df: A Pandas DataFrame that has been cleansed to remove duplicate values and sorted
Requirements:
To begin, we need to clean up results_df by cleaning names, selecting relevant columns, removing duplicate rows, and sorting the data as described below.
Name column:
Given that most, but not all, of the names are of the form First Middle Last or First "nickname" Middle Last, Suffix (Maiden), follow the rules below:

- Rule A: Keep the first name (including any ., -, or ' if present).
- Rule B: Keep the last name (including any ., -, or ' if present).
- Rule C: Keep any accepted suffix (Jr, Sr, II, III, IV, V), dropping the comma and any trailing period.
- Rule D: Remove anything in parentheses, e.g., maiden names such as (Maiden).
- Rule E: Remove anything in double quotes, e.g., nicknames such as "Susie".
- Rule F: Strip any leading/trailing whitespace.
- Rule G: If a comma is not followed by an accepted suffix, the text after the comma often represents titles of nobility. Only keep the names prior to the comma.

Examples:

- John Bartling Pearson, Jr. -> John Pearson Jr [Rules A, B, C]
- Susan "Susie" O'Neill -> Susan O'Neill [Rules A, B, E]
- Alfrd (Arnold-) Hajs (Guttmann-) -> Alfrd Hajs [Rules A, B, D]
- Britt-Marie Louise Smedh (-Alshammar, -Blomberg) -> Britt-Marie Smedh [Rules A, B, D]
- Liston Donneal Bochette, III -> Liston Bochette III [Rules A, B, C]
- H. MacHenry -> H. MacHenry [Rules A, B]
- Abudoureheman -> Abudoureheman [Rule A]
- ' Gabrielle Marie "Gabby" Adcock (White-)' -> Gabrielle Adcock  # note the whitespace at the beginning [Rules A, B, D, E, F]
- Druart, Jr. -> Druart Jr [Rules A, C]
- Sndor Bernt Ede Blint Kzmr, Count Trk de Szendr -> Sndor Kzmr [Rules A, B, G]

Then:

- Keep only the columns ['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'].
- Remove duplicate rows.
- Sort by 'Year', 'Season', 'NOC', 'Name', 'Event'. All columns should be sorted in ascending order.

### Solution - Exercise 1
def cleanse_results(results_df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    import re
    # acceptable suffixes; check 'III' before 'II' so the longer suffix matches first
    accepted_suffix = ['Jr', 'Sr', 'III', 'II', 'IV', 'V']
    # helper that cleans up each athlete name
    def clean_name(name: str) -> str:
        name = re.sub(r'"[^"]+"', '', name)    # remove anything in quotes entirely (rule E)
        name = re.sub(r'\([^)]*\)', '', name)  # remove anything in parentheses entirely (rule D)
        name = name.strip()                    # remove leading/trailing whitespace (rule F)
        # extract suffix (rule C)
        suffix = None
        for s in accepted_suffix:
            if f', {s}' in name:
                suffix = s
                name = re.sub(rf',\s*{s}\.?', '', name)  # drop the comma, suffix, and any trailing period
                break
        name = name.split(',')[0]  # remove comma and anything after (rule G)
        parts = name.split()
        if not parts:
            return ''
        first = parts[0]
        last = parts[-1] if len(parts) > 1 else ''
        full_name = f"{first} {last}".strip()
        if suffix:
            full_name += f" {suffix}"
        return full_name
    # good practice: avoid mutating the input
    results_df2 = results_df.copy()
    results_df2['Name'] = results_df2['Name'].apply(clean_name)
    cols = ['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight',
            'Team', 'NOC', 'Year', 'Season', 'City', 'Sport', 'Event', 'Medal']
    results_df2 = results_df2[cols]
    results_df2 = results_df2.drop_duplicates()
    results_df2 = results_df2.sort_values(by=['Year', 'Season', 'NOC', 'Name', 'Event']).reset_index(drop=True)
    return results_df2
### Demo function call
cleanse_results_demo = utils.load_object_from_publicdata('cleanse_results_demo.dill')
print(f'cleanse_results_demo shape: {cleanse_results_demo.shape}')
demo_result_sorted_df = cleanse_results(cleanse_results_demo)
demo_result_sorted_df
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
cleanse_results_demo_CORRECT=utils.load_object_from_publicdata('cleanse_results_demo_CORRECT.dill')
#display(cleanse_results_demo_CORRECT)
The demo should display this output.
Successfully loaded cleanse_results_demo.dill.
cleanse_results_demo shape: (13, 15)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 104736 | Johan Salila | M | 27.0 | 176.0 | 92.0 | Finland | FIN | 1912 | Summer | Stockholm | Wrestling | Wrestling Men's Middleweight B, Greco-Roman | NaN |
| 1 | 8090 | Charles Barrett | M | NaN | NaN | NaN | Ireland | IRL | 1924 | Summer | Paris | Water Polo | Water Polo Men's Water Polo | NaN |
| 2 | 31036 | Ren Dybkr | M | 26.0 | NaN | NaN | Denmark | DEN | 1952 | Summer | Helsinki | Fencing | Fencing Men's epee, Team | NaN |
| 3 | 883 | Charles Addo-Odametey | M | 27.0 | 170.0 | 67.0 | Ghana | GHA | 1964 | Summer | Tokyo | Football | Football Men's Football | NaN |
| 4 | 883 | Charles Addo-Odametey | M | 31.0 | 170.0 | 67.0 | Ghana | GHA | 1968 | Summer | Mexico City | Football | Football Men's Football | NaN |
| 5 | 60416 | Kenji Kimura | M | 23.0 | 185.0 | 83.0 | Japan | JPN | 1968 | Summer | Mexico City | Volleyball | Volleyball Men's Volleyball | Silver |
| 6 | 94203 | Norbert Petschel | M | 23.0 | 184.0 | 70.0 | Austria | AUT | 1984 | Summer | Los Angeles | Sailing | Sailing Mixed Multihull | NaN |
| 7 | 27536 | Dudley Dulk | M | 20.0 | 180.0 | 82.0 | Netherlands Antilles | AHO | 1992 | Winter | Albertville | Bobsleigh | Bobsleigh Men's Two | NaN |
| 8 | 56067 | Rebecca Joyce | F | 25.0 | 178.0 | 59.0 | Australia | AUS | 1996 | Summer | Atlanta | Rowing | Rowing Women's Lightweight Double Sculls | Bronze |
| 9 | 5360 | Jos Artze | M | 63.0 | 175.0 | 87.0 | Puerto Rico | PUR | 1996 | Summer | Atlanta | Shooting | Shooting Men's Double Trap | NaN |
| 10 | 95952 | Vladislav Polyakov | M | 20.0 | 191.0 | 78.0 | Kazakhstan | KAZ | 2004 | Summer | Athina | Swimming | Swimming Men's 100 metres Breaststroke | NaN |
The cell below will test your solution for cleanse_results (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 1
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=cleanse_results,
ex_name='cleanse_results',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to cleanse_results did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
update_team
Your task: define update_team as follows:
Historically, teams competing in the Olympics did not always share the same name as the country they represented. For example, a team named "Vesper Boat Club" competed for the USA in the 1900 Summer Olympics. These unique team names make it difficult to analyze a country's performance over time. Let's fix this by updating the Team name in sorted_df to accurately reflect the region name for each.
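A left join keeps every row of the left DataFrame and attaches matching columns from the right. Below is a minimal sketch of that pattern on hypothetical data (the toy team/region values are made up for illustration):

import pandas as pd

left = pd.DataFrame({'NOC': ['USA', 'GRE'], 'Team': ['Vesper Boat Club', 'Toy Club']})
right = pd.DataFrame({'NOC': ['USA', 'GRE'], 'region': ['USA', 'Greece']})
merged = left.merge(right, how='left', on='NOC')  # all rows of `left` are kept
merged['Team'] = merged['region']                 # overwrite Team with the region name
print(merged.drop(columns=['region']))
#    NOC    Team
# 0  USA     USA
# 1  GRE  Greece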
Input:
- sorted_df: The result of Exercise 1 - A sorted Pandas DataFrame containing results by athlete (provided in solution/demo below)
- noc_regions_df: Contains region information for each NOC, as described in Exercise 0

Return: updated_df: A Pandas DataFrame containing the information in sorted_df with the Team column updated to contain the region name found in noc_regions_df
Requirements:
- Update the Team column in sorted_df to match the corresponding region from noc_regions_df, using a left join on the NOC column.
- Use noc_regions_df to find the region name for each NOC. Use a left join on the column NOC with noc_regions_df, keeping all rows in sorted_df.
- Return sorted_df with the Team name overwritten by the region value corresponding to each NOC.

### Solution - Exercise 2
def update_team(sorted_df: pd.DataFrame, noc_regions_df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    # merge the two DataFrames on 'NOC';
    # use a left merge to retain all rows in sorted_df
    merged_df = sorted_df.merge(noc_regions_df, how='left', on='NOC')
    # replace values in the Team column with values from the region column
    merged_df['Team'] = merged_df['region']
    # drop the region column
    merged_df = merged_df.drop(columns=['region'])
    return merged_df
### Demo function call
update_team_sorted_df_demo, update_team_noc_regions_df_demo = utils.load_object_from_publicdata('update_team_demo.dill')
demo_result_updated_df = update_team(update_team_sorted_df_demo, update_team_noc_regions_df_demo)
demo_result_updated_df
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
update_team_demo_CORRECT=utils.load_object_from_publicdata('update_team_demo_CORRECT.dill')
#display(update_team_demo_CORRECT)
The demo should display this output.
Successfully loaded update_team_demo.dill.
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 94945 | Nikolaos Pindos | M | NaN | NaN | NaN | Greece | GRE | 1906 | Summer | Athina | Football | Football Men's Football | Bronze |
| 1 | 404 | Louis Abell | M | 19.0 | NaN | NaN | USA | USA | 1904 | Summer | St. Louis | Rowing | Rowing Men's Coxed Eights | Gold |
| 2 | 66847 | Tonje Larsen | F | 25.0 | 184.0 | 72.0 | Norway | NOR | 2000 | Summer | Sydney | Handball | Handball Women's Handball | Bronze |
| 3 | 89899 | Knut stby | M | 25.0 | NaN | NaN | Norway | NOR | 1948 | Summer | London | Canoeing | Canoeing Men's Kayak Doubles, 10,000 metres | Silver |
| 4 | 124227 | George Cleaf | M | 24.0 | 173.0 | 66.0 | USA | USA | 1904 | Summer | St. Louis | Swimming | Swimming Men's 4 x 50 Yard Freestyle Relay | NaN |
| 5 | 89640 | Martina Orzan | F | 24.0 | 180.0 | 62.0 | Italy | ITA | 1996 | Summer | Atlanta | Rowing | Rowing Women's Lightweight Double Sculls | NaN |
The cell below will test your solution for update_team (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=update_team,
ex_name='update_team',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to update_team did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
rank_teams
Your task: define rank_teams as follows:
Now that we've cleaned up our DataFrame, we can start analyzing the data. Let's start with the basics and rank teams by the number of medals earned!
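Since the requirements below suggest a pivot table, here is a minimal sketch of counting occurrences per group on hypothetical data:

import pandas as pd

toy = pd.DataFrame({'Team': ['USA', 'USA', 'France'],
                    'Medal': ['Gold', 'Bronze', 'Gold']})
counts = toy.pivot_table(index='Team', columns='Medal', aggfunc='size', fill_value=0)
print(counts)
# Medal   Bronze  Gold
# Team
# France       0     1
# USA          1     1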
Input:
- updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with the updated Team names (provided in solution/demo below)
- list_of_teams: A list containing team names

Return: ranked_df: A Pandas DataFrame ranking the Teams found in list_of_teams by the total number of medals earned.
Requirements:
- Filter updated_df to the rows containing the teams listed in list_of_teams.
- Produce the columns Team, Bronze, Silver, Gold, Total with the number of medals each team won of each type. Total should be the sum of medals earned by each team. A pivot table may be helpful.
- The Bronze, Silver, Gold, and Total columns should be of data type int64.
- Sort by Total in descending order, with ties broken by the Team name in ascending order.

### Solution - Exercise 3
def rank_teams(updated_df: pd.DataFrame, list_of_teams: list) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    # only keep rows where the team is in the provided list
    filtered_df = updated_df[updated_df['Team'].isin(list_of_teams)]
    # pivot table to count medals
    medal_counts = filtered_df.pivot_table(index='Team',
                                           columns='Medal',
                                           aggfunc='size',  # counts the number of occurrences
                                           fill_value=0)    # assigns 0 if a team has no medals of a certain type
    # remove the extra column name 'Medal'
    medal_counts.columns.name = None
    # add the 'Total' column
    medal_counts['Total'] = medal_counts.sum(axis=1)
    # reset index, keep Team as a column
    medal_counts.reset_index(inplace=True)
    # sort to set the ranking
    medal_counts = medal_counts.sort_values(by=['Total', 'Team'],
                                            ascending=[False, True])
    return medal_counts.reset_index(drop=True)
### Demo function call
rank_teams_updated_df_demo,rank_teams_list_of_teams_demo=utils.load_object_from_publicdata('rank_teams_demo.dill')
demo_result_ranked_df = rank_teams(rank_teams_updated_df_demo, rank_teams_list_of_teams_demo)
demo_result_ranked_df
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
rank_teams_demo_CORRECT=utils.load_object_from_publicdata('rank_teams_demo_CORRECT.dill')
#display(rank_teams_demo_CORRECT)
The demo should display this output.
Successfully loaded rank_teams_demo.dill.
| | Team | Bronze | Gold | Silver | Total |
|---|---|---|---|---|---|
| 0 | USA | 1 | 5 | 0 | 6 |
| 1 | Germany | 1 | 0 | 2 | 3 |
| 2 | France | 0 | 2 | 0 | 2 |
| 3 | Russia | 1 | 1 | 0 | 2 |
| 4 | Australia | 1 | 0 | 0 | 1 |
| 5 | China | 0 | 1 | 0 | 1 |
| 6 | Hungary | 0 | 0 | 1 | 1 |
| 7 | Sweden | 1 | 0 | 0 | 1 |
The cell below will test your solution for rank_teams (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=rank_teams,
ex_name='rank_teams',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to rank_teams did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
roman_numeral
Your task: define roman_numeral as follows:
The Olympic Games use Roman numerals to number each iteration. For example, the Paris 2024 Olympics were officially called the "Games of the XXXIII Olympiad". Sounds fancier than saying "the 33rd Summer Olympic Games", right? Let's now find the corresponding Roman numeral for a given Olympic game!
Input:
- updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)
- year: An integer containing the desired year
- season: A string containing the desired season (either 'Summer' or 'Winter')

Return: roman_numeral_result: A string containing the Roman numeral for that Olympics.
Requirements:
- Using updated_df and the year/season provided, find how many Olympics have occurred up to (and including) the year provided for that season. Let's call this your Olympic count.
- Use the roman_numeral_dict provided below to convert this Olympic count into the correct Roman numeral.

Hint: While other approaches exist, here is a simple algorithm for converting a number to its Roman numeral equivalent:

- Iterate through the keys of roman_numeral_dict in the same order as listed.
- While your Olympic count is greater than or equal to the key, append the key's value to your roman_numeral_result string, and then subtract the key from your Olympic count.

Hint: Each season's Olympic counts should be counted separately. Technically, the first Summer Olympics held in 1896 has a count of 1, and the first Winter Olympics held in 1924 has a count of 1. The test cases will have different combinations of seasons and years, so your function should account for those combinations and calculate the Roman numeral based on what is in the input updated_df.
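To make the greedy conversion concrete, here is a standalone sketch using a hypothetical count of 33 (the XXXIII Olympiad mentioned above):

roman_numeral_dict = {1000: "M", 900: "CM", 500: "D", 400: "CD", 100: "C", 90: "XC",
                      50: "L", 40: "XL", 10: "X", 9: "IX", 5: "V", 4: "IV", 1: "I"}
count = 33  # hypothetical Olympic count
result = ''
for key, symbol in roman_numeral_dict.items():  # dicts preserve insertion order (Python 3.7+)
    while count >= key:  # append the symbol while the key still fits
        result += symbol
        count -= key
print(result)  # XXXIII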
### Solution - Exercise 4
def roman_numeral(updated_df: pd.DataFrame, year: int, season: str) -> str:
    roman_numeral_dict = {1000: "M", 900: "CM", 500: "D", 400: "CD", 100: "C", 90: "XC", 50: "L", 40: "XL", 10: "X", 9: "IX", 5: "V", 4: "IV", 1: "I"}
    ###
    ### YOUR CODE HERE
    ###
    # Task: given a season and year, output the corresponding Roman numeral as a string
    # only keep rows from the provided season
    roman_df = updated_df[updated_df['Season'] == season]
    # filter by the desired year (inclusive)
    roman_df = roman_df[roman_df['Year'] <= year]
    # prune the DataFrame down to just the Year and Season columns; remove duplicates
    roman_df = roman_df[['Year', 'Season']].drop_duplicates()
    # determine which number the given Olympics is
    olympics_num = roman_df.shape[0]  # number of rows
    # use the provided algorithm to determine the Roman numeral string;
    # the inner while loop stops contributing once olympics_num hits 0
    roman_numeral = ''
    for value, symbol in roman_numeral_dict.items():
        while olympics_num >= value:
            roman_numeral += symbol
            olympics_num -= value
    return roman_numeral
### Demo function call
roman_numeral_updated_df_demo,roman_numeral_year_demo,roman_numeral_season_demo=utils.load_object_from_publicdata('roman_numeral_demo.dill')
demo_result_roman_numeral_result = roman_numeral(roman_numeral_updated_df_demo, roman_numeral_year_demo, roman_numeral_season_demo)
display(demo_result_roman_numeral_result)
The demo should display this printed output.
Successfully loaded roman_numeral_demo.dill.
II
The cell below will test your solution for roman_numeral (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=roman_numeral,
ex_name='roman_numeral',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to roman_numeral did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
going_for_gold
Your task: define going_for_gold as follows:
Many athletes have won multiple gold medals. The real question is: Who has won the most gold medals relative to the number of times they competed? Calculate each athlete's "gold percentage" as the number of gold medals earned divided by the number of times they competed in any event at the Olympics.
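As a quick illustration (a sketch on hypothetical data, not the graded solution), the per-athlete ratio can be computed with a groupby:

import pandas as pd

toy = pd.DataFrame({'Name': ['A', 'A', 'B'],
                    'Medal': ['Gold', None, 'Silver']})
stats = toy.groupby('Name')['Medal'].agg(
    opportunities='size',                 # rows per athlete, medal or not
    golds=lambda m: (m == 'Gold').sum())  # gold medals per athlete
stats['gold_pct'] = stats['golds'] / stats['opportunities']
print(stats)  # A: 2 opportunities, 1 gold, 0.5; B: 1 opportunity, 0 golds, 0.0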
Input:
- updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)

Return: A new Pandas DataFrame containing the following columns: Name, Opportunity Count, Golds, Gold Pct
Requirements:
- Count the number of times each athlete competed (i.e., the number of rows for each Name in updated_df). Call this column Opportunity Count.
- Count the number of gold medals each athlete earned. Call this column Golds, and this should be of data type int64.
- Compute each athlete's gold percentage as Golds divided by Opportunity Count. Call this column Gold Pct, and this should be of data type float64.
- Sort by Gold Pct in descending order, with ties broken by Golds in descending order and then by Name in ascending order.

### Solution - Exercise 5
def going_for_gold(updated_df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    # only need the following columns
    df_gold = updated_df[['Name', 'Medal']]
    # use groupby to create the new columns
    df_grouped = df_gold.groupby('Name').agg({'Medal': ['size',                          # total number of opportunities
                                                        lambda x: (x == 'Gold').sum()]  # total number of gold medals
                                              }).reset_index()
    # rename columns per the instructions
    df_grouped.columns = ['Name', 'Opportunity Count', 'Golds']
    # calculate the percentage and store it in a new column
    df_grouped['Gold Pct'] = df_grouped['Golds'] / df_grouped['Opportunity Count']
    # convert NaN values to 0
    df_grouped['Gold Pct'] = df_grouped['Gold Pct'].fillna(0.0)
    # sort to rank
    df_grouped = df_grouped.sort_values(by=['Gold Pct', 'Golds', 'Name'],
                                        ascending=[False, False, True])
    return df_grouped.reset_index(drop=True)
### Demo function call
going_for_gold_updated_df_demo = utils.load_object_from_publicdata('going_for_gold_demo.dill')
demo_result_gold_rankings_df = going_for_gold(going_for_gold_updated_df_demo)
demo_result_gold_rankings_df
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
going_for_gold_demo_CORRECT=utils.load_object_from_publicdata('going_for_gold_demo_CORRECT.dill')
#display(going_for_gold_demo_CORRECT)
The demo should display this output.
Successfully loaded going_for_gold_demo.dill.
| | Name | Opportunity Count | Golds | Gold Pct |
|---|---|---|---|---|
| 0 | Kristin Otto | 6 | 6 | 1.000000 |
| 1 | Danuta Kozk | 6 | 5 | 0.833333 |
| 2 | Yin Jian | 2 | 1 | 0.500000 |
The cell below will test your solution for going_for_gold (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 5
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=going_for_gold,
ex_name='going_for_gold',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to going_for_gold did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
olympic_costs
Your task: define olympic_costs as follows:
Hosting the Olympics can be staggeringly expensive, with recent editions costing billions of dollars. But has it always been this way? Let's use inflation data to convert historic Olympic costs to today's dollars and find out!
Input: None
Return: A Python string containing a SQLite query that converts Olympic costs to most-recent-year USD, using inflation data
Requirements:
- The olympic_costs table contains columns (year, country, city, cost).
- The inflation table contains columns (year, interest_rate, cpi_index).
- Return the following columns:
  - year: from the olympic_costs table
  - country: from the olympic_costs table
  - cost: from the olympic_costs table. Exclude NULL/NaN values.
  - maxyear_cpi_index: this represents the cpi_index for the max(year) in the inflation table. In other words, the cpi_index for the latest year in the inflation table.
  - olympicyear_cpi_index: this represents the cpi_index for the year of the Olympics in the inflation table.
  - recentyearUSD: this represents $$recentyearUSD = cost \times maxyear\_cpi\_index \div olympicyear\_cpi\_index$$
- Sort by recentyearUSD in descending order, breaking ties by year in ascending order.
- Return only the top 10 rows.

Hint: Subqueries/CTEs might be helpful on this exercise!
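As a sanity check on the formula, the top row of the expected demo output below can be reproduced by hand (values taken from that table):

# Russia 2014: cost = 5.1e10, latest-year cpi_index = 1780.850595, 2014 cpi_index = 1362.086956
recentyearUSD = 5.1e10 * 1780.850595 / 1362.086956
print(f'{recentyearUSD:.6e}')  # ~6.667958e+10, matching the demo's first row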
### Solution - Exercise 6
def olympic_costs() -> str:
    ###
    ### YOUR CODE HERE
    ###
    query = """
        SELECT
            c.year,
            c.country,
            c.cost,
            m.cpi_index AS maxyear_cpi_index,
            i.cpi_index AS olympicyear_cpi_index,
            1.0 * c.cost * m.cpi_index / i.cpi_index AS recentyearUSD
        FROM olympic_costs c
        JOIN inflation i ON i.year = c.year
        JOIN (
            SELECT a.cpi_index
            FROM inflation a
            JOIN (SELECT MAX(year) AS maxyear FROM inflation) b ON a.year = b.maxyear
        ) m
        WHERE c.cost IS NOT NULL
        ORDER BY recentyearUSD DESC, c.year ASC
        LIMIT 10
    """
    return query
### Demo function call
demo_result_olympic_costs_results = pd.read_sql(olympic_costs(),conn)
demo_result_olympic_costs_results
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
olympic_costs_demo_CORRECT=utils.load_object_from_publicdata('olympic_costs_demo_CORRECT.dill')
#display(olympic_costs_demo_CORRECT)
The demo should display this output.
| | year | country | cost | maxyear_cpi_index | olympicyear_cpi_index | recentyearUSD |
|---|---|---|---|---|---|---|
| 0 | 2014 | Russia | 5.100000e+10 | 1780.850595 | 1362.086956 | 6.667958e+10 |
| 1 | 2008 | China | 4.400000e+10 | 1780.850595 | 1219.224262 | 6.426826e+10 |
| 2 | 2004 | Greece | 1.500000e+10 | 1780.850595 | 1103.962662 | 2.419716e+10 |
| 3 | 2020 | Japan | 1.540000e+10 | 1780.850595 | 1511.379619 | 1.814574e+10 |
| 4 | 2016 | Brazil | 1.310000e+10 | 1780.850595 | 1400.425618 | 1.665861e+10 |
| 5 | 2012 | United Kingdom | 8.361285e+09 | 1780.850595 | 1331.307135 | 1.118465e+10 |
| 6 | 1988 | South Korea | 4.000000e+09 | 1780.850595 | 698.705866 | 1.019514e+10 |
| 7 | 1972 | West Germany | 1.062908e+09 | 1780.850595 | 246.928322 | 7.665708e+09 |
| 8 | 2000 | Australia | 4.290000e+09 | 1780.850595 | 1008.059612 | 7.578767e+09 |
| 9 | 2022 | China | 3.900000e+09 | 1780.850595 | 1722.292645 | 4.032600e+09 |
The cell below will test your solution for olympic_costs (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_costs),
ex_name='olympic_costs',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to olympic_costs did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
olympic_countryrank
Your task: define olympic_countryrank as follows:
Input: None
Return: A Python string containing a SQL query that returns the 3 most recent years and cities for each host country. Additionally, it ranks the results ordinally and tallies a running cost for each host country.
Requirements:
- countryrank: for each host country, this represents an ordinal ranking of that host country's most recent Olympics (1 = most recent).
- countryrunningcosts: for each host country, a running total of costs, accumulated from that country's most recent Olympics backward.
- Sort by country and then countryrank in ascending order.

Hints:

For example, Australia's three most recent Olympics cost:

- USD 4.5 billion in 2032
- USD 4.29 billion in 2000
- USD 3.51 million in 1956

so its running costs would be:

- 4.5e9 for 2032
- 4.5e9+4.29e9=8.79e9 for 2000
- 4.5e9+4.29e9+3.51e6=8.79351e9 for 1956
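SQL window functions do the heavy lifting here. A minimal, self-contained sketch (a hypothetical in-memory table mirroring the Australia numbers above, not the graded olympic_costs table; requires SQLite 3.25+ for window functions):

import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE t (country TEXT, year INT, cost REAL)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [('Australia', 2032, 4.5e9),
                 ('Australia', 2000, 4.29e9),
                 ('Australia', 1956, 3.51e6)])
rows = con.execute("""
    SELECT country, year,
           RANK() OVER (PARTITION BY country ORDER BY year DESC) AS countryrank,
           SUM(cost) OVER (PARTITION BY country ORDER BY year DESC) AS countryrunningcosts
    FROM t
    ORDER BY country, countryrank
""").fetchall()
print(rows)
# [('Australia', 2032, 1, 4500000000.0),
#  ('Australia', 2000, 2, 8790000000.0),
#  ('Australia', 1956, 3, 8793510000.0)]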
### Solution - Exercise 7
def olympic_countryrank() -> str:
    ###
    ### YOUR CODE HERE
    ###
    query = """
        SELECT
            country,
            year,
            city,
            countryrank,
            countryrunningcosts
        FROM (
            SELECT
                country,
                year,
                city,
                RANK() OVER (PARTITION BY country
                             ORDER BY year DESC) AS countryrank,
                SUM(cost) OVER (PARTITION BY country
                                ORDER BY year DESC) AS countryrunningcosts
            FROM olympic_costs
            WHERE cost IS NOT NULL
        ) AS ranked_costs
        WHERE countryrank <= 3
        ORDER BY country, countryrank
    """
    return query
### Demo function call
demo_result_olympic_countryrank_results = pd.read_sql(olympic_countryrank(),conn)
# display(demo_result_olympic_countryrank)
demo_result_olympic_countryrank_results
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
olympic_countryrank_demo_CORRECT=utils.load_object_from_publicdata('olympic_countryrank_demo_CORRECT.dill')
#display(olympic_countryrank_demo_CORRECT)
The demo should display this output.
| | country | year | city | countryrank | countryrunningcosts |
|---|---|---|---|---|---|
| 0 | Australia | 2032 | Brisbane | 1 | 4.500000e+09 |
| 1 | Australia | 2000 | Sydney | 2 | 8.790000e+09 |
| 2 | Australia | 1956 | Melbourne | 3 | 8.793510e+09 |
| 3 | Brazil | 2016 | Rio de Janeiro | 1 | 1.310000e+10 |
| 4 | Canada | 2010 | Vancouver | 1 | 1.260000e+09 |
| 5 | Canada | 1988 | Calgary | 2 | 1.570980e+09 |
| 6 | Canada | 1976 | Montreal | 3 | 1.717950e+09 |
| 7 | China | 2022 | Beijing | 1 | 3.900000e+09 |
| 8 | China | 2008 | Beijing | 2 | 4.790000e+10 |
| 9 | France | 2030 | French Alps | 1 | 2.300000e+09 |
| 10 | France | 2024 | Paris | 2 | 1.050000e+10 |
| 11 | Greece | 2004 | Athens | 1 | 1.500000e+10 |
| 12 | Italy | 2026 | Milan and Cortina d'Ampezzo | 1 | 1.600000e+09 |
| 13 | Italy | 2006 | Turin | 2 | 2.300000e+09 |
| 14 | Japan | 2020 | Tokyo | 1 | 1.540000e+10 |
| 15 | Japan | 1964 | Tokyo | 2 | 1.547200e+10 |
| 16 | Mexico | 1968 | Mexico City | 1 | 1.760000e+08 |
| 17 | Norway | 1994 | Lillehammer | 1 | 1.100000e+09 |
| 18 | Russia | 2014 | Sochi | 1 | 5.100000e+10 |
| 19 | Socialist Federal Republic of Yugoslavia | 1984 | Sarajevo | 1 | 5.540000e+07 |
| 20 | South Korea | 2018 | Pyeongchang | 1 | 2.190000e+09 |
| 21 | South Korea | 1988 | Seoul | 2 | 6.190000e+09 |
| 22 | Soviet Union | 1980 | Moscow | 1 | 2.310000e+08 |
| 23 | Spain | 1992 | Barcelona | 1 | 8.500000e+08 |
| 24 | United Kingdom | 2012 | London | 1 | 8.361285e+09 |
| 25 | United Kingdom | 1948 | London | 2 | 8.362252e+09 |
| 26 | United States | 2034 | Salt Lake City–Utah | 1 | 3.900000e+09 |
| 27 | United States | 2028 | Los Angeles | 2 | 1.070000e+10 |
| 28 | United States | 2002 | Salt Lake City | 3 | 1.270000e+10 |
| 29 | West Germany | 1972 | Munich | 1 | 1.062908e+09 |
The cell below will test your solution for olympic_countryrank (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_countryrank),
ex_name='olympic_countryrank',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to olympic_countryrank did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
label_encoding
Your task: define label_encoding as follows:
There are different techniques for converting categorical data, such as strings, into a numerical representation. One such technique is label encoding, in which each unique category is assigned a unique integer, thereby converting categorical data into numerical values. Let's practice this encoding technique on our updated_df, which contains athlete results over time.
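As a quick illustration on a hypothetical column (not the graded solution), label encoding simply maps each sorted unique value to an integer:

import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green'])
mapping = {val: idx for idx, val in enumerate(sorted(colors.unique()))}
print(mapping)                       # {'blue': 0, 'green': 1, 'red': 2}
print(colors.map(mapping).tolist())  # [2, 0, 2, 1]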
Input:
- updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)

Return: A tuple containing the following:

- encoding_map: A dictionary of dictionaries where the outermost keys are the categorical column names, and their values are dictionaries mapping the unique values in the column to their numerical encoding.
- encoded_df: A new Pandas DataFrame where all categorical columns have been encoded using the mapping you generated. All column names should remain the same.

Requirements:
- A column is considered categorical if its data type is object.
- Fill missing values in categorical columns with the string 'NONE' before encoding.
- For each categorical column, sort its unique values and assign integer codes 0, 1, 2, ... in that order.
- Do not modify the input updated_df.

### Solution - Exercise 8
def label_encoding(updated_df: pd.DataFrame) -> tuple:
    ###
    ### YOUR CODE HERE
    ###
    # task: replace categorical values with integers,
    # which is better for modeling
    df = updated_df.copy()
    encoding_map = {}
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('NONE')
            unique_values = sorted(df[col].unique())
            # string becomes the key, index becomes the value
            col_mapping = {val: idx for idx, val in enumerate(unique_values)}
            # replace each string with its index value (int)
            df[col] = df[col].map(col_mapping)
            # store the mapping
            encoding_map[col] = col_mapping
    # reset the index just in case
    return encoding_map, df.reset_index(drop=True)
### Demo function call
label_encoding_updated_df_demo = utils.load_object_from_publicdata('label_encoding_demo.dill')
(demo_result_encoding_map, demo_result_encoded_df) = label_encoding(label_encoding_updated_df_demo)
print('Your Encoding Map: ')
pprint(demo_result_encoding_map)
print('\nYour Encoded DataFrame: ')
demo_result_encoded_df
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
encoding_map_demo_CORRECT, encoded_df_demo_CORRECT=utils.load_object_from_publicdata('label_encoding_demo_CORRECT.dill')
# pprint(encoding_map_demo_CORRECT)
# display(encoded_df_demo_CORRECT)
The demo should display this output.
Successfully loaded label_encoding_demo.dill.
Your Encoding Map:
{'City': {'Atlanta': 0,
'London': 1,
'Los Angeles': 2,
'Rio de Janeiro': 3,
'Seoul': 4,
'Stockholm': 5,
'Sydney': 6},
'Event': {"Athletics Men's Shot Put, Both Hands": 0,
"Athletics Women's 100 metres": 1,
"Athletics Women's Javelin Throw": 2,
"Canoeing Men's Kayak Fours, 1,000 metres": 3,
"Cycling Women's Team Sprint": 4,
"Tennis Men's Singles": 5},
'Medal': {'Gold': 0, 'NONE': 1, 'Silver': 2},
'NOC': {'AUS': 0, 'FRA': 1, 'FRG': 2, 'GER': 3, 'KIR': 4, 'THA': 5, 'USA': 6},
'Name': {'Fabrice Santoro': 0,
'Karitaake Tewaaki': 1,
'Lawrence Whitney': 2,
'Louise McPaul-Currey': 3,
'Miriam Welte': 4,
'Thomas Reineck': 5,
'Walapa Tangjitsusorn': 6},
'Season': {'Summer': 0},
'Sex': {'F': 0, 'M': 1},
'Sport': {'Athletics': 0, 'Canoeing': 1, 'Cycling': 2, 'Tennis': 3},
'Team': {'Australia': 0,
'France': 1,
'Germany': 2,
'Kiribati': 3,
'Thailand': 4,
'USA': 5}}
Your Encoded DataFrame:
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 119428 | 1 | 0 | 18.0 | 155.0 | 52.0 | 3 | 4 | 2016 | 0 | 3 | 0 | 1 | 1 |
| 1 | 129570 | 4 | 0 | 25.0 | 171.0 | 67.0 | 2 | 3 | 2012 | 0 | 1 | 2 | 4 | 0 |
| 2 | 105609 | 0 | 1 | 27.0 | 177.0 | 68.0 | 1 | 1 | 2000 | 0 | 6 | 3 | 5 | 1 |
| 3 | 77929 | 3 | 0 | 27.0 | 175.0 | 68.0 | 0 | 0 | 1996 | 0 | 0 | 0 | 2 | 2 |
| 4 | 118447 | 6 | 0 | 31.0 | 155.0 | 48.0 | 4 | 5 | 1984 | 0 | 2 | 0 | 1 | 1 |
| 5 | 130080 | 2 | 1 | 21.0 | 180.0 | 86.0 | 5 | 6 | 1912 | 0 | 5 | 0 | 0 | 1 |
| 6 | 99779 | 5 | 1 | 20.0 | 191.0 | 84.0 | 2 | 2 | 1988 | 0 | 4 | 1 | 3 | 1 |
The cell below will test your solution for label_encoding (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=label_encoding,
ex_name='label_encoding',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to label_encoding did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
build_traintest
Your task: define build_traintest as follows:
Now that we've encoded our Pandas DataFrame in the prior exercise, we can split our data into random training and testing subsets.
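For reference, a minimal sketch of sklearn's splitter on hypothetical data (the graded version below additionally separates out the response column by name):

import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'x1': range(10), 'x2': range(10, 20), 'y': [0, 1] * 5})
X = toy.drop('y', axis=1)  # independent variables
y = toy['y']               # dependent variable
# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6040)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)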
Input:
- encoded_df: The result of Exercise 8 - A label encoded Pandas DataFrame (provided in solution/demo below)
- featurecol: a string representing which column contains the response variable
- testsize: a float representing the train/test split
- random_state: an integer representing a random seed

Return: A tuple containing the following: (X_train, X_test, Y_train, Y_test), where X_train and X_test are Pandas DataFrames and Y_train and Y_test are Python lists.

Requirements:
- Use the train_test_split function within the sklearn.model_selection module.
- The column featurecol must be excluded from the independent variables and included in the dependent variables.

### Solution - Exercise 9
def build_traintest(encoded_df: pd.DataFrame, featurecol: str, testsize: float = .2, random_state: int = 6040) -> tuple:
    from sklearn.model_selection import train_test_split
    ###
    ### YOUR CODE HERE
    ###
    X = encoded_df.drop(featurecol, axis=1)
    Y = encoded_df[featurecol]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        test_size=testsize,
                                                        random_state=random_state)
    # convert the Y splits to plain Python lists before returning the tuple
    return (X_train, X_test, list(Y_train), list(Y_test))
### Demo function call
build_traintest_encoded_df_demo, featurecol_demo, testsize_demo, random_state_demo = utils.load_object_from_publicdata('build_traintest_demo.dill')
(X_train, X_test, Y_train, Y_test) = build_traintest(build_traintest_encoded_df_demo, featurecol_demo, testsize_demo, random_state_demo)
print('Your X_train: ')
display(X_train)
print('\nYour X_test: ')
display(X_test)
print('\nYour Y_train: ')
print(Y_train)
print('\nYour Y_test: ')
print(Y_test)
Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.
build_traintest_demo_CORRECT=utils.load_object_from_publicdata('build_traintest_demo_CORRECT.dill')
# print('Demo Correct X_train: ')
# display(build_traintest_demo_CORRECT[0])
# print('\nDemo Correct X_test: ')
# display(build_traintest_demo_CORRECT[1])
# print('\nDemo Correct Y_train: ')
# print(build_traintest_demo_CORRECT[2])
# print('\nDemo Correct Y_test: ')
# print(build_traintest_demo_CORRECT[3])
The demo should display this printed output.
Successfully loaded build_traintest_demo.dill.
Your X_train:
Sex Age Height Weight NOC Year Season City Sport Event
189804 0 27.0 170.0 58.0 74 2000 0 38 60 708
28047 1 39.0 NaN NaN 69 1932 0 18 4 50
102686 1 22.0 179.0 107.0 32 1976 0 21 64 718
259561 1 22.0 201.0 96.0 52 2016 0 27 62 710
88999 1 22.0 168.0 65.0 109 1968 1 12 53 612
187971 0 25.0 176.0 59.0 53 2000 0 38 5 145
79598 1 53.0 188.0 72.0 12 1968 0 20 21 323
29818 0 21.0 NaN NaN 216 1932 0 18 20 302
178014 0 22.0 170.0 55.0 171 1996 0 4 6 155
9839 1 20.0 NaN NaN 74 1912 0 37 27 379
122260 1 24.0 190.0 90.0 29 1984 0 18 62 710
236755 0 22.0 159.0 58.0 162 2010 1 41 1 10
50189 0 26.0 NaN NaN 162 1952 0 13 5 135
261211 1 21.0 188.0 90.0 69 2016 0 27 54 644
1172 1 48.0 NaN NaN 69 1900 0 26 47 520
188844 1 22.0 171.0 69.0 63 2000 0 38 27 375
Your X_test:
Sex Age Height Weight NOC Year Season City Sport Event
235312 1 27.0 178.0 82.0 69 2010 1 41 51 601
233031 1 26.0 193.0 96.0 216 2008 0 6 8 159
177214 1 24.0 180.0 78.0 163 1996 0 4 24 348
85948 1 29.0 170.0 58.0 178 1968 0 20 47 571
Your Y_train:
[2, 2, 3, 2, 2, 2, 0, 0, 2, 0, 3, 2, 2, 2, 2, 2]
Your Y_test:
[2, 1, 2, 2]
The cell below will test your solution for build_traintest (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 9
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=build_traintest,
ex_name='build_traintest',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to build_traintest did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
evaluation_results
Your task: define evaluation_results as follows:
Input:

- ytest: a list of true labels from the test set
- ypred: a list of predicted labels from a fitted model

Return: A dictionary {key:value} containing the following:

- 'accuracy': the accuracy score
- 'precision': the weighted-average precision score
- 'recall': the weighted-average recall score
- 'f1_score': the weighted-average F1 score
- 'confusion_matrix': the confusion matrix as a nested Python list
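For intuition, here is a small sketch on hypothetical labels; weighted averaging (used in the solution below) weights each class's score by its support:

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy_score(y_true, y_pred))                          # 0.75
print(round(f1_score(y_true, y_pred, average='weighted'), 3))  # 0.733
print(confusion_matrix(y_true, y_pred).tolist())               # [[1, 1], [0, 2]]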
Requirements:
- Using sklearn.metrics, return the evaluation results mentioned above.
- Round accuracy, precision, recall, and F1 to 3 decimal places, and use weighted averaging for precision, recall, and F1.

### Solution - Exercise 10
def evaluation_results(ytest: list, ypred: list) -> dict:
    Y_test = pd.Series(ytest)
    Y_pred = pd.Series(ypred)
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
    ###
    ### YOUR CODE HERE
    ###
    acc = round(accuracy_score(Y_test, Y_pred), 3)
    prec = round(precision_score(Y_test, Y_pred,
                                 average='weighted',
                                 zero_division=0), 3)
    recall = round(recall_score(Y_test, Y_pred,
                                average='weighted'), 3)
    f1 = round(f1_score(Y_test, Y_pred,
                        average='weighted'), 3)
    conf = confusion_matrix(Y_test, Y_pred).tolist()
    metrics = {
        'accuracy': acc,
        'precision': prec,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': conf
    }
    return metrics
### Demo function call
ypred, ytest = utils.load_object_from_publicdata('evaluation_results_demo.dill')
demo_results = evaluation_results(ytest, ypred)
pprint(demo_results)
evaluation_results_demo_CORRECT=utils.load_object_from_publicdata('evaluation_results_demo_CORRECT.dill')
#pprint(evaluation_results_demo_CORRECT)
The demo should display this printed output.
Successfully loaded evaluation_results_demo.dill.
{'accuracy': 0.8,
'confusion_matrix': [[6, 1], [1, 2]],
'f1_score': 0.8,
'precision': 0.8,
'recall': 0.8}
The cell below will test your solution for evaluation_results (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 10
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=evaluation_results,
ex_name='evaluation_results',
key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to evaluation_results did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Congratulations. Remember to submit your notebook.
We wrote all these functions, so what? It's possible to make numerous predictions, but let's try to predict who would win medals at the 2016 Olympics using the data we've manipulated. In particular, we're going to use the columns ['Sex','Age','Height','Weight','NOC','Year','Season','City','Sport','Event'] as our independent (feature) variables, and the type of medal as the dependent (response) variable. Code inspiration was drawn from Kaggle notebooks.
As above, we start from cleansed data. Using data from 1992-2016, we one-hot encoded the categorical variables (somewhat similar to the function you wrote above). Next, we created train/test sets from 1992-2014, imputed missing values using sklearn's IterativeImputer, and scaled the data using sklearn's StandardScaler. We then ran the data through a robust suite of algorithms, including: Decision Trees, Random Forests, XGBoost, LightGBM, Logistic Regression, K-Nearest Neighbors, Naive Bayes, AdaBoost, and Gradient Boosting.
Random Forests and LightGBM appear to have performed the best.

The 2016 Olympics served as our true holdout test set. Once the models were properly tuned on data from 1992 to 2014, we used them to make predictions on the 2016 data. It is great to see that there isn't much deviation in the evaluation metrics, suggesting our models may be robust. Once again, Random Forests and LightGBM appear to perform best.

The next step is to tune the models' hyperparameters. We'll leave that for future work. The code for this analysis is mostly below, with the One-Hot Encoding, Train/Test, and Evaluation Metrics functionalities missing.
modeldata=sorted_df.copy(deep=True)
keepcols=['Sex','Age','Height','Weight','NOC','Year','Season','City','Sport','Event','Medal']
modeldata=modeldata[keepcols]
modeldata=modeldata[(modeldata['Year']>=1992)]
#########################################
#One-Hot Encoding code HERE
###
#########################################
#Train/Test sets
olympics_non2016_medal_winners=alldata[alldata['Year']!=2016].copy(deep=True)
X = olympics_non2016_medal_winners.drop('Medal', axis=1)
Y = olympics_non2016_medal_winners['Medal']
###
#Train/Test Set code HERE
#########################################
#Data Imputation 1992-2014
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
from tqdm import tqdm
# **Check for NaN values before imputation**
print("NaN values in X_train before imputation:", np.isnan(X_train).sum())
print("NaN values in X_test before imputation:", np.isnan(X_test).sum())
# **Advanced Imputation using IterativeImputer**
imputer = IterativeImputer(random_state=6040)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# **Check for NaN values after imputation**
print("NaN values in X_train after imputation:", np.isnan(X_train_imputed).sum())
print("NaN values in X_test after imputation:", np.isnan(X_test_imputed).sum())
# Scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
#########################################
#Data Imputation and 2016 split
olympics_2016_medal_winners=alldata[alldata['Year']==2016].copy(deep=True)
olympics_2016_medal_winners.head()
# Define the categorical columns
categorical_columns = ['Sex', 'NOC', 'Season', 'City', 'Sport', 'Event']
# Filter the 2016 data for prediction
X_test_2016 = olympics_2016_medal_winners.drop('Medal', axis=1)
actual_medals_2016 = olympics_2016_medal_winners['Medal']
# **Check for NaN values before imputation**
print("NaN values in X_test_2016 before imputation:", np.isnan(X_test_2016).sum())
# **Advanced Imputation using IterativeImputer**
imputer = IterativeImputer(random_state=6040)
a = imputer.fit_transform(X_train)
X_test_2016_imputed = imputer.transform(X_test_2016)
# **Check for NaN values after imputation**
print("NaN values in X_test_2016 after imputation:", np.isnan(X_test_2016_imputed).sum())
# Scale the data using StandardScaler
scaler = StandardScaler()
b = scaler.fit_transform(X_train_imputed)
X_test_2016_scaled = scaler.transform(X_test_2016_imputed)
#########################################
# Modeling
#########################################
# Initialize the models with hyperparameters to be tuned
models = {
"Decision Tree": DecisionTreeClassifier(random_state=6040, class_weight='balanced'),
"Random Forest": RandomForestClassifier(random_state=6040, class_weight='balanced'),
"XGBoost": XGBClassifier(random_state=6040, use_label_encoder=False, eval_metric='logloss'),
"LightGBM": lgb.LGBMClassifier(random_state=6040),
"Logistic Regression": LogisticRegression(random_state=6040, max_iter=1000, class_weight='balanced'),
"K-Nearest Neighbors": KNeighborsClassifier(),
"Naive Bayes": GaussianNB(),
"AdaBoost": AdaBoostClassifier(random_state=6040),
"Gradient Boosting": GradientBoostingClassifier(random_state=6040)
}
# Hyperparameter grids for GridSearchCV
param_grids = {
"Decision Tree": {
'max_depth': [3, 5],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 4]
},
"Random Forest": {
'n_estimators': [100],
'max_depth': [None, 10],
'min_samples_split': [2, 5],
'min_samples_leaf': [1]
},
"XGBoost": {
'n_estimators': [100],
'max_depth': [3, 5],
'learning_rate': [0.01, 0.1]
},
"LightGBM": {
'n_estimators': [100],
'num_leaves': [31, 50],
'learning_rate': [0.01, 0.1]
},
"Logistic Regression": {
'C': [0.1, 1.0],
'solver': ['lbfgs', 'liblinear']
},
"K-Nearest Neighbors": {
'n_neighbors': [3, 5],
'weights': ['uniform', 'distance']
},
"Neural Network": {
'hidden_layer_sizes': [(50, 50)],
'activation': ['relu', 'tanh'],
'solver': ['adam'],
'alpha': [0.001]
},
"AdaBoost": {
'n_estimators': [50],
'learning_rate': [0.01, 0.1]
},
"Gradient Boosting": {
'n_estimators': [100],
'max_depth': [3, 5],
'learning_rate': [0.01, 0.1]
}
}
# Cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6040)
#########################################
# Store evaluation metrics as lists
# Training the models with hyperparameter tuning using GridSearchCV
for model_name, model in models.items():
    print(f"\nTuning {model_name}...")
    with tqdm(total=1, desc=model_name) as pbar:
        grid_search = GridSearchCV(model, param_grids.get(model_name, {}), cv=cv, scoring='accuracy')
        grid_search.fit(X_train_scaled, Y_train)
        # Get the best model after hyperparameter tuning
        best_model = grid_search.best_estimator_
        # Evaluate train accuracy with cross-validation
        train_accuracy = grid_search.best_score_
        # Predict on the test data
        Y_pred = best_model.predict(X_test_scaled)
        #########################################
        # Calculate and store evaluation metrics for 1992-2014 data as well as the 2016 holdout data.
        # YOUR CODE from the evaluation metrics function would go HERE
        # Append to the lists initialized above
        # Update the progress bar
        pbar.update(1)
# Show evaluation metrics as a Pandas DataFrame
dd = {}
for i in range(len(evaluation_metrics['Model'])):
    dd[i] = {v: evaluation_metrics[v][i] for v in ['Model', 'Test Accuracy', 'Precision', 'Recall', 'F1-Score']}
d = pd.DataFrame.from_dict(dd, orient='index')
display(d)