Write a Pandas program to find and drop the missing values from the World alcohol consumption dataset

In [1]:

# conventional way to import pandas
import pandas as pd

# get pandas version
print('Pandas version', pd.__version__)

('Pandas version', u'0.18.1')

2. How do I read a tabular data file into pandas?

In [2]:

# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')

In [3]:

# examine the first 5 rows
orders.head()

Out[3]:

Documentation for read_table

In [4]:

users = pd.read_table('data/u.user')

# examine the first 5 rows
users.head()

Out[4]:

In [5]:

users = pd.read_table('data/u.user', sep='|')

# examine the first 5 rows
users.head()

Out[5]:

In [6]:

users = pd.read_table('data/u.user', sep='|', header=None)

# examine the first 5 rows
users.head()

Out[6]:

In [7]:

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('data/u.user', sep='|', header=None, names=user_cols)

# examine the first 5 rows
users.head()

Out[7]:

3. How do I select a pandas Series from a DataFrame?

In [7]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_table('data/ufo.csv', sep=',')

In [8]:

# read_csv is equivalent to read_table, except it assumes a comma separator
ufo = pd.read_csv('data/ufo.csv')
type(ufo)

Out[8]:

pandas.core.frame.DataFrame

In [9]:

# examine the first 5 rows
ufo.head()

Out[9]:

In [10]:

# select the 'Colors Reported' Series using bracket notation
ufo['Colors Reported']

Out[10]:

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ...
18236    NaN
18237    NaN
18238    NaN
18239    RED
18240    NaN
Name: Colors Reported, dtype: object

In [16]:

# confirm that the selection returns a Series
type(ufo['City'])

Out[16]:

pandas.core.series.Series

In [17]:

# or equivalently, use dot notation - see notes below
ufo.City

Out[17]:

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
                 ...
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object

Bracket notation will always work, whereas dot notation has limitations:

  • Dot notation doesn't work if there are spaces in the Series name
  • Dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (like 'head' or 'shape')
  • Dot notation can't be used to define the name of a new Series (see below)
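A minimal sketch of the first two limitations, using a small throwaway DataFrame (hypothetical, not part of the UFO dataset):

# hypothetical two-column DataFrame for illustration only
demo = pd.DataFrame({'Colors Reported': ['RED'], 'shape': ['DISK']})

demo['Colors Reported']   # works: bracket notation handles the space
# demo.Colors Reported    # SyntaxError: dot notation can't handle the space
demo['shape']             # works: returns the 'shape' column
demo.shape                # returns the (rows, columns) tuple, not the column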

In [10]:

# create a new 'Location' Series (must use bracket notation to define the Series name)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()

Out[10]:

4. Why do some pandas commands end with parentheses (and others don't)?

In [11]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')

Methods end with parentheses, while attributes don't:

In [12]:

# example method: show the first 5 rows
movies.head()

Out[12]:

In [13]:

# example method: calculate summary statistics
movies.describe()

Out[13]:

In [14]:

# example attribute: number of rows and columns
movies.shape

In [15]:

# example attribute: data type of each column
movies.dtypes

Out[15]:

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [16]:

# use an optional parameter to the describe method to summarize only 'object' columns
movies.describe(include=['object'])

Out[16]:

Documentation for describe

5. How do I rename columns in a pandas DataFrame?

In [11]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')

In [12]:

# examine the column names
ufo.columns

Out[12]:

Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')

In [13]:

# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns

Out[13]:

Index([u'City', u'Colors_Reported', u'Shape_Reported', u'State', u'Time'], dtype='object')

In [22]:

# replace all of the column names by overwriting the 'columns' attribute
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns

Out[22]:

Index([u'city', u'colors reported', u'shape reported', u'state', u'time'], dtype='object')

In [23]:

# replace the column names during the file reading process by using the 'names' parameter
ufo = pd.read_csv('data/ufo.csv', header=0, names=ufo_cols)
ufo.head()

Out[23]:

Documentation for read_csv

In [24]:

# replace all spaces with underscores in the column names by using the 'str.replace' method
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.columns

Out[24]:

Index([u'city', u'colors_reported', u'shape_reported', u'state', u'time'], dtype='object')

6. How do I remove columns from a pandas DataFrame?

In [25]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()

Out[25]:

In [24]:

# remove a single column (axis=1 refers to columns)
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()

Out[24]:

In [25]:

# remove multiple columns at once
ufo.drop(['City', 'State'], axis=1, inplace=True)
ufo.head()

Out[25]:

In [26]:

# remove multiple rows at once (axis=0 refers to rows)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()

Out[26]:

7. How do I sort a pandas DataFrame or a Series?

In [28]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()

Out[28]:

Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary).
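To persist a sort, assign the result back (a minimal sketch; these methods also accept inplace=True):

# sorting returns a new object; the original DataFrame is untouched
movies_sorted = movies.sort_values('title')

# reassign to keep the sorted order
movies = movies.sort_values('title')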

In [29]:

# sort the 'title' Series in ascending order (returns a Series)
movies.title.sort_values()

Out[29]:

542     (500) Days of Summer
5               12 Angry Men
201         12 Years a Slave
698                127 Hours
110    2001: A Space Odyssey
               ...
955         Zero Dark Thirty
677                   Zodiac
615               Zombieland
526                     Zulu
864                    [Rec]
Name: title, dtype: object

In [29]:

# sort in descending order instead
movies.title.sort_values(ascending=False).head()

Out[29]:

864               [Rec]
526                Zulu
615          Zombieland
677              Zodiac
955    Zero Dark Thirty
Name: title, dtype: object

Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)

In [31]:

# sort the entire DataFrame by the 'title' Series (returns a DataFrame)
movies.sort_values('title').head()

Out[31]:

In [33]:

# sort in descending order instead
movies.sort_values('title', ascending=False).head()

Out[33]:

Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)

In [32]:

# sort the DataFrame first by 'content_rating', then by 'duration'
movies.sort_values(['content_rating', 'duration']).head()

Out[32]:

8. How do I filter rows of a pandas DataFrame by column value?

In [35]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()

Out[35]:

In [34]:

# examine the number of rows and columns
movies.shape

Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.

In [35]:

# create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwise
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)

In [36]:

# confirm that the list has the same length as the DataFrame
len(booleans)

In [37]:

# examine the first five list elements
booleans[0:5]

Out[37]:

[False, False, True, False, False]

In [38]:

# convert the list to a Series
is_long = pd.Series(booleans)
is_long.head()

Out[38]:

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [39]:

# use bracket notation with the boolean Series to tell the DataFrame which rows to display
movies[is_long]

Out[39]:

In [40]:

# simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparison
is_long = movies.duration >= 200
movies[is_long]

# or equivalently, write it in one line (no need to create the 'is_long' object)
movies[movies.duration >= 200]

Out[40]:

In [41]:

# select the 'genre' Series from the filtered DataFrame
is_long = movies.duration >= 200
movies[is_long].genre

# or equivalently, use the 'loc' method
movies.loc[movies.duration >= 200, 'genre']

Out[41]:

2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object

Documentation for loc

9. How do I apply multiple filter criteria to a pandas DataFrame?

In [36]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()

Out[36]:

In [37]:

# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
movies[movies.duration >= 200]

Out[37]:

Understanding logical operators:

  • and: True only if both sides of the operator are True
  • or: True if either side of the operator is True

In [44]:

# demonstration of the 'and' operator
print(True and True)
print(True and False)
print(False and False)

In [45]:

# demonstration of the 'or' operator
print(True or True)
print(True or False)
print(False or False)

Rules for specifying multiple filter criteria in pandas:

  • use & instead of and
  • use | instead of or
  • add parentheses around each condition to specify evaluation order
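The reason `&` is required: Python's `and` tries to reduce each boolean Series to a single True/False, which pandas refuses to do. A quick sketch of the failure:

try:
    movies[(movies.duration >= 200) and (movies.genre == 'Drama')]
except ValueError as e:
    # prints "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
    print(e)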

Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'

In [46]:

# CORRECT: use the '&' operator to specify that both conditions are required
movies[(movies.duration >= 200) & (movies.genre == 'Drama')]

Out[46]:

In [47]:

# INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both)
movies[(movies.duration >= 200) | (movies.genre == 'Drama')].head()

Out[47]:

Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'

In [44]:

# use the '|' operator to specify that a row can match any of the three criteria
movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].tail(20)

# or equivalently, use the 'isin' method
#movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)

Out[44]:

Documentation for isin

10. Your pandas questions answered!

Question: When reading from a file, how do I read in only a subset of the columns?

In [49]:

# read a dataset of UFO reports into a DataFrame, and check the columns
ufo = pd.read_csv('data/ufo.csv')
ufo.columns

Out[49]:

Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time'], dtype='object')

In [50]:

# specify which columns to include by name
ufo = pd.read_csv('data/ufo.csv', usecols=['City', 'State'])

# or equivalently, specify columns by position
ufo = pd.read_csv('data/ufo.csv', usecols=[0, 4])
ufo.columns

Out[50]:

Index([u'City', u'Time'], dtype='object')

Question: When reading from a file, how do I read in only a subset of the rows?

In [51]:

# specify how many rows to read
ufo = pd.read_csv('data/ufo.csv', nrows=3)
ufo

Out[51]:

Documentation for read_csv

Question: How do I iterate through a Series?

In [52]:

# Series are directly iterable (like a list)
for c in ufo.City:
    print(c)

Ithaca
Willingboro
Holyoke

Question: How do I iterate through a DataFrame?

In [53]:

# various methods are available to iterate through a DataFrame
for index, row in ufo.iterrows():
    print(index, row.City, row.State)

(0, 'Ithaca', 'NY')
(1, 'Willingboro', 'NJ')
(2, 'Holyoke', 'CO')

Documentation for iterrows
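`itertuples` is a commonly recommended alternative: it is generally faster than `iterrows` and yields namedtuples (a sketch using the same three-row DataFrame):

# iterate as namedtuples; row.Index holds the index value
for row in ufo.itertuples():
    print(row.Index, row.City, row.State)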

Question: How do I drop all non-numeric columns from a DataFrame?

In [45]:

# read a dataset of alcohol consumption into a DataFrame, and check the data types
drinks = pd.read_csv('data/drinks.csv')
drinks.dtypes

Out[45]:

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [55]:

# only include numeric columns in the DataFrame
import numpy as np
drinks.select_dtypes(include=[np.number]).dtypes

Out[55]:

beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object

Documentation for select_dtypes

Question: How do I know whether I should pass an argument as a string or a list?

In [56]:

# describe all of the numeric columns
drinks.describe()

Out[56]:

In [57]:

# pass the string 'all' to describe all columns
drinks.describe(include='all')

Out[57]:

In [58]:

# pass a list of data types to only describe certain types
drinks.describe(include=['object', 'float64'])

Out[58]:

In [59]:

# pass a list even if you only want to describe a single data type
drinks.describe(include=['object'])

Out[59]:

Documentation for describe

11. How do I use the "axis" parameter in pandas?

In [60]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[60]:

In [61]:

# drop a column (temporarily)
drinks.drop('continent', axis=1).head()

Out[61]:

In [62]:

# drop a row (temporarily)
drinks.drop(2, axis=0).head()

Out[62]:

When referring to rows or columns with the axis parameter:

  • axis 0 refers to rows
  • axis 1 refers to columns

In [46]:

# calculate the mean of each numeric column
drinks.mean()

# or equivalently, specify the axis explicitly
drinks.mean(axis=0)

Out[46]:

beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

In [64]:

# calculate the mean of each row
drinks.mean(axis=1).head()

Out[64]:

0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

When performing a mathematical operation with the axis parameter:

  • axis 0 means the operation should "move down" the row axis
  • axis 1 means the operation should "move across" the column axis

In [65]:

# 'index' is an alias for axis 0
drinks.mean(axis='index')

Out[65]:

beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

In [66]:

# 'columns' is an alias for axis 1
drinks.mean(axis='columns').head()

Out[66]:

0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

12. How do I use string methods in pandas?

In [67]:

# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()

Out[67]:

In [68]:

# normal way to access string methods in Python
'hello'.upper()

In [69]:

# string methods for pandas Series are accessed via 'str'
orders.item_name.str.upper().head()

Out[69]:

0                 CHIPS AND FRESH TOMATO SALSA
1                                         IZZE
2                             NANTUCKET NECTAR
3        CHIPS AND TOMATILLO-GREEN CHILI SALSA
4                                 CHICKEN BOWL
Name: item_name, dtype: object

In [70]:

# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()

Out[70]:

0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [71]:

# use the boolean Series to filter the DataFrame
orders[orders.item_name.str.contains('Chicken')].head()

Out[71]:

In [72]:

# string methods can be chained together
orders.choice_description.str.replace('[', '').str.replace(']', '').head()

Out[72]:

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

In [73]:

# many pandas string methods support regular expressions (regex)
orders.choice_description.str.replace('[\[\]]', '').head()

Out[73]:

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object
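Note that the default behavior of `str.replace` changed over time: since pandas 2.0 it treats the pattern as a literal string unless told otherwise, so the regex version above needs an explicit flag in newer versions (a hedged sketch):

# newer pandas: pass regex=True explicitly for regex patterns
orders.choice_description.str.replace('[\[\]]', '', regex=True).head()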

13. How do I change the data type of a pandas Series?

In [74]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[74]:

In [75]:

# examine the data type of each Series
drinks.dtypes

Out[75]:

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [76]:

# change the data type of an existing Series
drinks['beer_servings'] = drinks.beer_servings.astype(float)
drinks.dtypes

Out[76]:

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [77]:

# alternatively, change the data type of a Series while reading in a file
drinks = pd.read_csv('data/drinks.csv', dtype={'beer_servings':float})
drinks.dtypes

Out[77]:

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [78]:

# read a dataset of Chipotle orders into a DataFrame
orders = pd.read_table('data/chipotle.tsv')
orders.head()

Out[78]:

In [79]:

# examine the data type of each Series
orders.dtypes

Out[79]:

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [80]:

# convert a string to a number in order to do math
orders.item_price.str.replace('$', '').astype(float).mean()

Out[80]:
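
A related option (an assumption about your data quality, not shown above): pd.to_numeric with errors='coerce' converts unparseable values to NaN instead of raising:

# coerce any malformed prices to NaN rather than raising an error
pd.to_numeric(orders.item_price.str.replace('$', ''), errors='coerce').mean()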

In [81]:

# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()

Out[81]:

0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [82]:

# convert a boolean Series to an integer (False = 0, True = 1)
orders.item_name.str.contains('Chicken').astype(int).head()

Out[82]:

0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int32

14. When should I use a "groupby" in pandas?

In [83]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[83]:

In [84]:

# calculate the mean beer servings across the entire dataset
drinks.beer_servings.mean()

Out[84]:

In [85]:

# calculate the mean beer servings just for countries in Africa
drinks[drinks.continent=='Africa'].beer_servings.mean()

Out[85]:

In [86]:

# calculate the mean beer servings for each continent
drinks.groupby('continent').beer_servings.mean()

Out[86]:

continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64

Documentation for groupby

In [87]:

# other aggregation functions (such as 'max') can also be used with groupby
drinks.groupby('continent').beer_servings.max()

Out[87]:

continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64

In [88]:

# multiple aggregation functions can be applied simultaneously
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])

Out[88]:

In [89]:

# specifying a column to which the aggregation function should be applied is not required
drinks.groupby('continent').mean()

Out[89]:
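
One caveat for newer pandas versions (this cell assumes an older release): calling mean() on a groupby that still contains non-numeric columns now raises an error, so the aggregation must be restricted explicitly:

# newer pandas: restrict the aggregation to numeric columns
drinks.groupby('continent').mean(numeric_only=True)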

In [90]:

# allow plots to appear in the notebook
%matplotlib inline

In [91]:

# side-by-side bar plot of the DataFrame directly above
drinks.groupby('continent').mean().plot(kind='bar')

Out[91]:

<matplotlib.axes._subplots.AxesSubplot at 0x9a4fd30>

Documentation for plot

15. How do I explore a pandas Series?

In [92]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('data/imdb_1000.csv')
movies.head()

Out[92]:

In [93]:

# examine the data type of each Series
movies.dtypes

Out[93]:

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

Exploring a non-numeric Series:

In [94]:

# count the non-null values, unique values, and frequency of the most common value
movies.genre.describe()

Out[94]:

count       979
unique       16
top       Drama
freq        278
Name: genre, dtype: object

Documentation for describe

In [95]:

# count how many times each value in the Series occurs
movies.genre.value_counts()

Out[95]:

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

Documentation for value_counts

In [96]:

# display percentages instead of raw counts
movies.genre.value_counts(normalize=True)

Out[96]:

Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Thriller     0.005107
Sci-Fi       0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64

In [97]:

# 'value_counts' (like many pandas methods) outputs a Series
type(movies.genre.value_counts())

Out[97]:

pandas.core.series.Series

In [98]:

# thus, you can add another Series method on the end
movies.genre.value_counts().head()

Out[98]:

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Name: genre, dtype: int64

In [99]:

# display the unique values in the Series
movies.genre.unique()

Out[99]:

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography', 'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi', 'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [100]:

# count the number of unique values in the Series
movies.genre.nunique()

Documentation for unique and nunique

In [101]:

# compute a cross-tabulation of two Series
pd.crosstab(movies.genre, movies.content_rating)

Out[101]:

Documentation for crosstab

Exploring a numeric Series:

In [102]:

# calculate various summary statistics
movies.duration.describe()

Out[102]:

count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64

In [103]:

# many statistics are implemented as Series methods
movies.duration.mean()

Out[103]:

In [104]:

# 'value_counts' is primarily useful for categorical data, not numerical data
movies.duration.value_counts().head()

Out[104]:

112    23
113    22
102    20
101    20
129    19
Name: duration, dtype: int64

In [105]:

# allow plots to appear in the notebook
%matplotlib inline

In [106]:

# histogram of the 'duration' Series (shows the distribution of a numerical variable)
movies.duration.plot(kind='hist')

Out[106]:

<matplotlib.axes._subplots.AxesSubplot at 0x9ead668>

In [107]:

# bar plot of the 'value_counts' for the 'genre' Series
movies.genre.value_counts().plot(kind='bar')

Out[107]:

<matplotlib.axes._subplots.AxesSubplot at 0xa1a6240>

Documentation for plot

16. How do I handle missing values in pandas?

In [108]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.tail()

Out[108]:

What does "NaN" mean?

  • "NaN" is not a string, rather it's a special value: numpy.nan.
  • It stands for "Not a Number" and indicates a missing value.
  • read_csv detects missing values (by default) when reading the file, and replaces them with this special value.

Documentation for read_csv
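A tiny sketch of why NaN needs special handling (it is a float value that is not even equal to itself):

import numpy as np

np.nan == np.nan     # False: NaN never compares equal, even to itself
pd.isnull(np.nan)    # True: use isnull/notnull to detect it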

In [109]:

# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull().tail()

Out[109]:

In [110]:

# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull().tail()

Out[110]:

Documentation for isnull and notnull

In [111]:

# count the number of missing values in each Series
ufo.isnull().sum()

Out[111]:

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

This calculation works because:

  1. The sum method for a DataFrame operates on axis=0 by default (and thus produces column sums).
  2. In order to add boolean values, pandas converts True to 1 and False to 0.
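A minimal demonstration of step 2 (booleans behave as 0/1 in arithmetic), plus the row-wise variant:

pd.Series([True, False, True]).sum()   # 2: True counts as 1, False as 0

# count the number of missing values in each row instead
ufo.isnull().sum(axis=1).head()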

In [112]:

# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()].head()

Out[112]:

How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:

In [113]:

# examine the number of rows and columns
ufo.shape

In [114]:

# if 'any' values are missing in a row, then drop that row
ufo.dropna(how='any').shape

In [115]:

# 'inplace' parameter for 'dropna' is False by default, thus rows were only dropped temporarily
ufo.shape

In [116]:

# if 'all' values are missing in a row, then drop that row (none are dropped in this case)
ufo.dropna(how='all').shape

In [117]:

# if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

In [118]:

# if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
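
`dropna` also accepts a `thresh` parameter, which keeps only rows with at least that many non-missing values (a sketch; the threshold of 4 is arbitrary):

# keep only rows that have at least 4 non-missing values
ufo.dropna(thresh=4).shape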

In [119]:

# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts().head()

Out[119]:

LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
CIRCLE      1365
Name: Shape Reported, dtype: int64

In [120]:

# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False).head()

Out[120]:

LIGHT       2803
NaN         2644
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

Documentation for value_counts

In [121]:

# fill in missing values with a specified value
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

In [122]:

# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts().head()

Out[122]:

VARIOUS     2977
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

17. What do I need to know about the pandas index?

In [123]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[123]:

In [124]:

# every DataFrame has an index (sometimes called the "row labels")
drinks.index

Out[124]:

RangeIndex(start=0, stop=193, step=1)

In [125]:

# column names are also stored in a special "index" object
drinks.columns

Out[125]:

Index([u'country', u'beer_servings', u'spirit_servings', u'wine_servings', u'total_litres_of_pure_alcohol', u'continent'], dtype='object')

In [126]:

# neither the index nor the columns are included in the shape
drinks.shape

In [127]:

# index and columns both default to integers if you don't define them
pd.read_table('data/u.user', header=None, sep='|').head()

Out[127]:

What is the index used for?

  1. identification
  2. selection
  3. alignment (covered in the next video)

In [128]:

# identification: index remains with each row when filtering the DataFrame
drinks[drinks.continent=='South America']

Out[128]:

In [129]:

# selection: select a portion of the DataFrame using the index
drinks.loc[23, 'beer_servings']

In [130]:

# set an existing column as the index
drinks.set_index('country', inplace=True)
drinks.head()

Out[130]:

Documentation for set_index

In [131]:

# 'country' is now the index
drinks.index

Out[131]:

Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola', u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia', u'Austria', ... u'Tanzania', u'USA', u'Uruguay', u'Uzbekistan', u'Vanuatu', u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'], dtype='object', name=u'country', length=193)

In [132]:

# 'country' is no longer a column
drinks.columns

Out[132]:

Index([u'beer_servings', u'spirit_servings', u'wine_servings', u'total_litres_of_pure_alcohol', u'continent'], dtype='object')

In [133]:

# 'country' data is no longer part of the DataFrame contents
drinks.shape

In [134]:

# country name can now be used for selection
drinks.loc['Brazil', 'beer_servings']

In [135]:

# index name is optional
drinks.index.name = None
drinks.head()

Out[135]:

In [136]:

# restore the index name, and move the index back to a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()

Out[136]:

Documentation for reset_index

In [137]:

# many DataFrame methods output a DataFrame
drinks.describe()

Out[137]:

In [138]:

# you can interact with any DataFrame using its index and columns
drinks.describe().loc['25%', 'beer_servings']

18. What do I need to know about the pandas index? (Part 2)

In [139]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[139]:

In [140]:

# every DataFrame has an index
drinks.index

Out[140]:

RangeIndex(start=0, stop=193, step=1)

In [141]:

# every Series also has an index (which carries over from the DataFrame)
drinks.continent.head()

Out[141]:

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

In [142]:

# set 'country' as the index
drinks.set_index('country', inplace=True)

Documentation for set_index

In [143]:

# Series index is on the left, values are on the right
drinks.continent.head()

Out[143]:

country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

In [144]:

# another example of a Series (output from the 'value_counts' method)
drinks.continent.value_counts()

Out[144]:

Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for value_counts

In [145]:

# access the Series index
drinks.continent.value_counts().index

Out[145]:

Index([u'Africa', u'Europe', u'Asia', u'North America', u'Oceania', u'South America'], dtype='object')

In [146]:

# access the Series values
drinks.continent.value_counts().values

Out[146]:

array([53, 45, 44, 23, 16, 12], dtype=int64)

In [147]:

# elements in a Series can be selected by index (using bracket notation)
drinks.continent.value_counts()['Africa']

In [148]:

# any Series can be sorted by its values
drinks.continent.value_counts().sort_values()

Out[148]:

South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64

In [149]:

# any Series can also be sorted by its index
drinks.continent.value_counts().sort_index()

Out[149]:

Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for sort_values and sort_index

What is the index used for?

  1. identification (covered in the previous video)
  2. selection (covered in the previous video)
  3. alignment

In [150]:

# 'beer_servings' Series contains the average annual beer servings per person
drinks.beer_servings.head()

Out[150]:

country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64

In [151]:

# create a Series containing the population of two countries
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people

Out[151]:

Albania    3000000
Andorra      85000
Name: population, dtype: int64

In [152]:

# calculate the total annual beer servings for each country
(drinks.beer_servings * people).head()

Out[152]:

Afghanistan            NaN
Albania        267000000.0
Algeria                NaN
Andorra         20825000.0
Angola                 NaN
dtype: float64

  • The two Series were aligned by their indexes.
  • If a value is missing in either Series, the result is marked as NaN.
  • Alignment enables us to easily work with incomplete data.
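If the NaN results are unwanted, the arithmetic methods accept a fill_value that substitutes for missing entries before aligning (a sketch; filling with 0 here treats a missing population as zero):

# treat a missing value in either Series as 0 before multiplying
drinks.beer_servings.mul(people, fill_value=0).head()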

In [153]:

# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()

Out[153]:

19. How do I select multiple rows and columns from a pandas DataFrame?

In [154]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head(3)

Out[154]:

The loc method is used to select rows and columns by label. You can pass it:

  • A single label
  • A list of labels
  • A slice of labels
  • A boolean Series
  • A colon (which indicates "all labels")

In [155]:

# row 0, all columns
ufo.loc[0, :]

Out[155]:

City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object

In [156]:

# rows 0 and 1 and 2, all columns
ufo.loc[[0, 1, 2], :]

Out[156]:

In [157]:

# rows 0 through 2 (inclusive), all columns
ufo.loc[0:2, :]

Out[157]:

In [158]:

# this implies "all columns", but explicitly stating "all columns" is better ufo.loc[0:2]

Out[158]:

In [159]:

# rows 0 through 2 (inclusive), column 'City'
ufo.loc[0:2, 'City']

Out[159]:

0         Ithaca
1    Willingboro
2        Holyoke
Name: City, dtype: object

In [160]:

# rows 0 through 2 (inclusive), columns 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]

Out[160]:

In [161]:

# accomplish the same thing using double brackets - but using 'loc' is preferred since it's more explicit
ufo[['City', 'State']].head(3)

Out[161]:

In [162]:

# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)
ufo.loc[0:2, 'City':'State']

Out[162]:

In [163]:

# accomplish the same thing using 'head' and 'drop'
ufo.head(3).drop('Time', axis=1)

Out[163]:

In [164]:

# rows in which the 'City' is 'Oakland', column 'State'
ufo.loc[ufo.City=='Oakland', 'State']

Out[164]:

1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object

In [165]:

# accomplish the same thing using "chained indexing" - but using 'loc' is preferred since chained indexing can cause problems ufo[ufo.City=='Oakland'].State

Out[165]:

1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object

The iloc method is used to select rows and columns by integer position. You can pass it:

  • A single integer position
  • A list of integer positions
  • A slice of integer positions
  • A colon (which indicates "all integer positions")

In [166]:

# rows in positions 0 and 1, columns in positions 0 and 3
ufo.iloc[[0, 1], [0, 3]]

Out[166]:

In [167]:

# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)
ufo.iloc[0:2, 0:4]

Out[167]:

In [168]:

# rows in positions 0 through 2 (exclusive), all columns
ufo.iloc[0:2, :]

Out[168]:

In [169]:

# accomplish the same thing - but using 'iloc' is preferred since it's more explicit
ufo[0:2]

Out[169]:

The ix method is used to select rows and columns by label or integer position, and should only be used when you need to mix label-based and integer-based selection in the same call.

In [170]:

# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv('data/drinks.csv', index_col='country')
drinks.head()

Out[170]:

In [171]:

# row with label 'Albania', column in position 0
drinks.ix['Albania', 0]

In [172]:

# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']

Rules for using numbers with ix:

  • If the index is strings, numbers are treated as integer positions, and thus slices are exclusive on the right.
  • If the index is integers, numbers are treated as labels, and thus slices are inclusive.

In [173]:

# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)
drinks.ix['Albania':'Andorra', 0:2]

Out[173]:

In [174]:

# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)
ufo.ix[0:2, 0:2]

Out[174]:
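
Note: `ix` was deprecated in pandas 0.20 and removed in 1.0. In newer versions, mixed selections are written with loc/iloc plus an explicit lookup (a hedged sketch of equivalents for the cells above):

# label-based and position-based equivalents of the 'ix' examples
drinks.loc['Albania', 'beer_servings']               # all labels
drinks.iloc[1, 0]                                    # all positions
drinks.iloc[drinks.index.get_loc('Albania'), 0]      # mixed: convert the label to a position first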

20. When should I use the "inplace" parameter in pandas?

In [175]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('data/ufo.csv')
ufo.head()

Out[175]:

In [177]:

# remove the 'City' column (doesn't affect the DataFrame since inplace=False)
ufo.drop('City', axis=1).head()

Out[177]:

In [178]:

# confirm that the 'City' column was not actually removed
ufo.head()

Out[178]:

In [179]:

# remove the 'City' column (does affect the DataFrame since inplace=True)
ufo.drop('City', axis=1, inplace=True)

In [180]:

# confirm that the 'City' column was actually removed
ufo.head()

Out[180]:

In [181]:

# drop a row if any value is missing from that row (doesn't affect the DataFrame since inplace=False)
ufo.dropna(how='any').shape

In [182]:

# confirm that no rows were actually removed
ufo.shape

In [183]:

# use an assignment statement instead of the 'inplace' parameter
ufo = ufo.set_index('Time')
ufo.tail()

Out[183]:

In [184]:

# fill missing values using "backward fill" strategy (doesn't affect the DataFrame since inplace=False) ufo.fillna(method='bfill').tail()

Out[184]:

In [185]:

# compare with "forward fill" strategy (doesn't affect the DataFrame since inplace=False) ufo.fillna(method='ffill').tail()

Out[185]:

21. How do I make my pandas DataFrame smaller and faster?

In [186]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('data/drinks.csv')
drinks.head()

Out[186]:

In [187]:

# exact memory usage is unknown because object columns are references elsewhere
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB

In [188]:

# force pandas to calculate the true memory usage
drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 24.4 KB

In [189]:

# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep=True)

Out[189]:

Index                             72
country                         9500
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       9244
dtype: int64

Documentation for info and memory_usage

In [190]:

# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes

Out[190]:

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [191]:

# 'continent' Series appears to be unchanged
drinks.continent.head()

Out[191]:

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

In [192]:

# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()

Out[192]:

0    1
1    2
2    0
3    2
4    0
dtype: int8

In [193]:

# memory usage has been drastically reduced
drinks.memory_usage(deep=True)

Out[193]:

Index                             72
country                         9500
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                        488
dtype: int64

In [194]:

# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)

Out[194]:

Index                             72
country                         9886
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                        488
dtype: int64

In [195]:

# memory usage increased because we created 193 categories
drinks.country.cat.categories

Out[195]:

Index([u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola', u'Antigua & Barbuda', u'Argentina', u'Armenia', u'Australia', u'Austria', ... u'United Arab Emirates', u'United Kingdom', u'Uruguay', u'Uzbekistan', u'Vanuatu', u'Venezuela', u'Vietnam', u'Yemen', u'Zambia', u'Zimbabwe'], dtype='object', length=193)

The category data type should only be used with a string Series that has a small number of possible values.

In [196]:

# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df

Out[196]:

In [197]:

# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')

Out[197]:

In [198]:

# define a logical ordering for the categories
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality

Out[198]:

0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
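
In newer pandas, the categories and ordered arguments to astype were removed; the ordering is defined with a CategoricalDtype instead (a hedged sketch of the modern equivalent):

from pandas.api.types import CategoricalDtype

# newer pandas: define the ordered dtype first, then cast
quality_dtype = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_dtype)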

In [199]:

# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')

Out[199]:

In [200]:

# comparison operators work with ordered categories
df.loc[df.quality > 'good', :]

Out[200]:

22. How do I use pandas with scikit-learn to create Kaggle submissions?

In [201]:

# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()

Out[201]:

In [202]:

# create a feature matrix 'X' by selecting two DataFrame columns
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape

In [203]:

# create a response vector 'y' by selecting a Series
y = train.Survived
y.shape

Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and have the proper shapes.

In [204]:

# fit a classification model to the training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)

Out[204]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [205]:

# read the testing dataset from Kaggle's Titanic competition into a DataFrame
test = pd.read_csv('http://bit.ly/kaggletest')
test.head()

Out[205]:

In [206]:

# create a feature matrix from the testing data that matches the training data
X_new = test.loc[:, feature_cols]
X_new.shape

In [207]:

# use the fitted model to make predictions for the testing set observations
new_pred_class = logreg.predict(X_new)

In [208]:

# create a DataFrame of passenger IDs and testing set predictions
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()

Out[208]:

Documentation for the DataFrame constructor

In [209]:

# ensure that PassengerId is the first column by setting it as the index
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()

Out[209]:

In [210]:

# write the DataFrame to a CSV file that can be submitted to Kaggle
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')

In [211]:

# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')

In [212]:

# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()

Out[212]:

23. More of your pandas questions answered!

Question: What is the difference between ufo.isnull() and pd.isnull(ufo)?

In [213]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()

Out[213]:

In [214]:

# use 'isnull' as a top-level function
pd.isnull(ufo).head()

Out[214]:

In [215]:

# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()

Out[215]:

Question: Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?

In [216]:

# label-based slicing is inclusive of the start and stop
ufo.loc[0:4, :]

Out[216]:

In [217]:

# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]

Out[217]:

Documentation for loc and iloc

In [218]:

# 'iloc' is simply following NumPy's slicing convention...
ufo.values[0:4, :]

Out[218]:

array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'], ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'], ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'], ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)

In [219]:

# ...and NumPy is simply following Python's slicing convention
'python'[0:4]

In [220]:

# 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']

Out[220]:

Question: How do I randomly sample rows from a DataFrame?

In [221]:

# sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1)
ufo.sample(n=3)

Out[221]:

In [222]:

# use the 'random_state' parameter for reproducibility
ufo.sample(n=3, random_state=42)

Out[222]:

In [223]:

# sample 75% of the DataFrame's rows without replacement
train = ufo.sample(frac=0.75, random_state=99)

In [224]:

# store the remaining 25% of the rows in another DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]

Documentation for isin

24. How do I create dummy variables in pandas?

In [225]:

# read the training dataset from Kaggle's Titanic competition
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()

Out[225]:

In [226]:

# create the 'Sex_male' dummy variable using the 'map' method
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()

Out[226]:

In [227]:

# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(train.Sex).head()

Out[227]:

Generally speaking:

  • If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.
  • One convention is to drop the first dummy variable, which defines that level as the "baseline".

In [228]:

# drop the first dummy variable ('female') using the 'iloc' method
pd.get_dummies(train.Sex).iloc[:, 1:].head()

Out[228]:

In [229]:

# add a prefix to identify the source of the dummy variables
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()

Out[229]:

In [230]:

# use 'get_dummies' with a feature that has 3 possible values
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)

Out[230]:

In [231]:

# drop the first dummy variable ('C')
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)

Out[231]:

How to translate these values back to the original 'Embarked' value:

  • 0, 0 means C
  • 1, 0 means Q
  • 0, 1 means S

In [232]:

# save the DataFrame of dummy variables and concatenate them to the original DataFrame
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()

Out[232]:

In [233]:

# reset the DataFrame
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()

Out[233]:

In [234]:

# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()

Out[234]:

In [235]:

# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()

Out[235]:

25. How do I work with dates and times in pandas?

In [236]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()

Out[236]:

In [237]:

# 'Time' is currently stored as a string
ufo.dtypes

Out[237]:

City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [238]:

# hour could be accessed using string slicing, but this approach breaks too easily
ufo.Time.str.slice(-5, -3).astype(int).head()

Out[238]:

0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int32

In [239]:

# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()

Out[239]:

In [240]:

# confirm that 'Time' is now stored in datetime format
ufo.dtypes

Out[240]:

City                       object
Colors Reported            object
Shape Reported             object
State                      object
Time               datetime64[ns]
dtype: object

Documentation for to_datetime

In [241]:

# convenient Series attributes are now available
ufo.Time.dt.hour.head()

Out[241]:

0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int64

In [242]:

ufo.Time.dt.weekday_name.head()

Out[242]:

0     Sunday
1     Monday
2     Sunday
3     Monday
4    Tuesday
Name: Time, dtype: object
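
Note: weekday_name was removed in later pandas versions; the replacement is the day_name() method (a hedged sketch):

# newer pandas: day_name() replaces the removed weekday_name attribute
ufo.Time.dt.day_name().head()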

In [243]:

ufo.Time.dt.dayofyear.head()

Out[243]:

0    152
1    181
2     46
3    152
4    108
Name: Time, dtype: int64

In [244]:

# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts

Out[244]:

Timestamp('1999-01-01 00:00:00')

In [245]:

# compare a datetime Series with a timestamp
ufo.loc[ufo.Time >= ts, :].head()

Out[245]:

In [246]:

# perform mathematical operations with timestamps (outputs a timedelta object)
ufo.Time.max() - ufo.Time.min()

Out[246]:

Timedelta('25781 days 01:59:00')

In [247]:

# timedelta objects also have attributes you can access
(ufo.Time.max() - ufo.Time.min()).days

In [248]:

# allow plots to appear in the notebook
%matplotlib inline

In [249]:

# count the number of UFO reports per year
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()

Out[249]:

1930    2
1931    2
1933    1
1934    1
1935    1
Name: Year, dtype: int64

In [250]:

# plot the number of UFO reports per year (line plot is the default)
ufo.Year.value_counts().sort_index().plot()

Out[250]:

<matplotlib.axes._subplots.AxesSubplot at 0xd571278>

26. How do I find and remove duplicate rows in pandas?

In [251]:

# read a dataset of movie reviewers into a DataFrame
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')
users.head()

Out[251]:

In [253]:

# detect duplicate zip codes: True if an item is identical to a previous item
users.zip_code.duplicated().tail()

Out[253]:

user_id
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool

In [254]:

# count the duplicate items (True becomes 1, False becomes 0)
users.zip_code.duplicated().sum()

In [255]:

# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()

Out[255]:

user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool

In [256]:

# count the duplicate rows
users.duplicated().sum()

Logic for duplicated:

  • keep='first' (default): Mark duplicates as True except for the first occurrence.
  • keep='last': Mark duplicates as True except for the last occurrence.
  • keep=False: Mark all duplicates as True.

In [257]:

# examine the duplicate rows (ignoring the first occurrence)
users.loc[users.duplicated(keep='first'), :]

Out[257]:

In [258]:

# examine the duplicate rows (ignoring the last occurrence)
users.loc[users.duplicated(keep='last'), :]

Out[258]:

In [259]:

# examine the duplicate rows (including all duplicates)
users.loc[users.duplicated(keep=False), :]

Out[259]:

In [260]:

# drop the duplicate rows (inplace=False by default)
users.drop_duplicates(keep='first').shape

In [261]:

users.drop_duplicates(keep='last').shape

In [262]:

users.drop_duplicates(keep=False).shape

Documentation for drop_duplicates

In [263]:

# only consider a subset of columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).sum()

In [264]:

users.drop_duplicates(subset=['age', 'zip_code']).shape

27. How do I avoid a SettingWithCopyWarning in pandas?

In [265]:

# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()

Out[265]:

In [266]:

# count the missing values in the 'content_rating' Series
movies.content_rating.isnull().sum()

In [267]:

# examine the DataFrame rows that contain those missing values
movies[movies.content_rating.isnull()]

Out[267]:

In [268]:

# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()

Out[268]:

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.

In [269]:

# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()

Out[269]:

In [270]:

# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()

Out[270]:

5     NOT RATED
6     NOT RATED
41    NOT RATED
63    NOT RATED
66    NOT RATED
Name: content_rating, dtype: object

In [271]:

# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan

c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\generic.py:2701: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.

  • If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
  • But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.

In [272]:

# the 'content_rating' Series has not changed
movies.content_rating.isnull().sum()

Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.

In [273]:

# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan

In [274]:

# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()

Summary: Use the loc method any time you are selecting rows and columns in the same statement.

More information: Modern Pandas (Part 1)

In [275]:

# create a DataFrame only containing movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies

Out[275]:

Goal: Fix the 'duration' for 'The Shawshank Redemption'.

In [276]:

# overwrite the relevant cell with the correct duration
top_movies.loc[0, 'duration'] = 150

c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\indexing.py:465: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.

In [277]:

# 'top_movies' DataFrame has been updated
top_movies

Out[277]:

In [278]:

# 'movies' DataFrame has not been updated
movies.head(1)

Out[278]:

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

In [279]:

# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

In [280]:

# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)
top_movies.loc[0, 'duration'] = 150

In [281]:

# 'top_movies' DataFrame has been updated
top_movies

Out[281]: