- Sun 01 May 2016
- Data Science
- M Hendra Herviawan
- #Data Wrangling, #Python, #Pandas
In [1]:
# import modules
import pandas as pd
In [2]:
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
Out[2]:
In [3]:
# Create a groupby variable that groups preTestScores by regiment
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
Out[3]:
"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." - Python for Data Analysis
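To make the quoted point concrete, this sketch inspects a GroupBy object before any aggregation is requested. It uses a shortened, illustrative copy of the data; `ngroups`, `groups`, and `get_group` are existing pandas GroupBy members, while the variable names are assumptions made for the example.

```python
import pandas as pd

# Shortened, illustrative version of the regiment data
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'preTestScore': [4, 24, 3, 4, 2, 3],
})

# Building the GroupBy object only records how rows map to groups
grouped = df['preTestScore'].groupby(df['regiment'])

# Number of distinct groups
n_groups = grouped.ngroups

# Row labels that belong to each group
group_rows = {name: list(labels) for name, labels in grouped.groups.items()}

# The values of a single group, selected by its key
dragoons = grouped.get_group('Dragoons')

# The aggregation itself is only computed when a method such as mean() is called
means = grouped.mean()

print(n_groups)
print(group_rows)
print(means)
```

The group-to-rows mapping is what later operations such as `mean()` or `describe()` are applied to, one group at a time.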
In [5]:
# Display the mean of each regiment's pre-test score
groupby_regiment.mean()
Out[5]:
1. Descriptive statistics by group
In [5]:
df['preTestScore'].groupby(df['regiment']).describe()
Out[5]:
1.1 Mean of each regiment's preTestScore
In [6]:
groupby_regiment.mean()
Out[6]:
1.2 Mean preTestScores grouped by regiment and company
In [7]:
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
Out[7]:
1.3 Mean preTestScores grouped by regiment and company without hierarchical indexing
In [8]:
df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
Out[8]:
1.4 Group the entire dataframe by regiment and company
In [9]:
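# numeric_only=True skips the non-numeric 'name' column, which pandas 2.0+ would otherwise refuse to average
df.groupby(['regiment', 'company']).mean(numeric_only=True)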
Out[9]:
1.5 Number of observations in each regiment and company
In [10]:
df.groupby(['regiment', 'company']).size()
Out[10]:
1.6 Add a prefix to the aggregated column names
In [13]:
# numeric_only=True leaves out the non-numeric columns before averaging
df.groupby('regiment').mean(numeric_only=True).add_prefix('mean_')
Out[13]:
2. Iterate an operation over groups
In [11]:
# Group the dataframe by regiment, and for each regiment,
for name, group in df.groupby('regiment'):
    # print the name of the regiment
    print(name)
    # print the data of that regiment
    print(group)
2.1 View a grouping
Use list() to show what a grouping looks like: a list of (group name, group data) tuples.
In [4]:
list(df['preTestScore'].groupby(df['regiment']))
Out[4]:
2.2 Group by columns
Here the grouping key is the data type of each column, so the columns rather than the rows are split into groups (hence axis=1). list() then shows what that grouping looks like.
In [12]:
# Note: the axis=1 argument of DataFrame.groupby is deprecated since pandas 2.1
list(df.groupby(df.dtypes, axis=1))
Out[12]:
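Grouping columns by data type can also be sketched without `axis=1`, by looping over `df.dtypes` directly. This is a minimal sketch rather than the post's original method; the small `df`, `cols_by_dtype`, and `frames_by_dtype` are illustrative names and data introduced here.

```python
from collections import defaultdict

import pandas as pd

# Small illustrative frame with the same kinds of columns as the post's df
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Dragoons', 'Scouts'],
    'name': ['Miller', 'Cooze', 'Sloan'],
    'preTestScore': [4, 3, 2],
    'postTestScore': [25, 70, 62],
})

# Map each dtype (as a string) to the list of columns that have it
cols_by_dtype = defaultdict(list)
for column, dtype in df.dtypes.items():
    cols_by_dtype[str(dtype)].append(column)

# Build one sub-DataFrame per dtype, one per group of columns
frames_by_dtype = {dtype: df[columns] for dtype, columns in cols_by_dtype.items()}

for dtype, frame in frames_by_dtype.items():
    print(dtype)
    print(frame)
```

The text columns end up in one group and the two integer score columns in another.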
2.6 Apply the get_stats() function to each postTestScore bin
In [14]:
# Create a function to get the stats of a group
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
In [4]:
# Create bins and assign each postTestScore to one of them
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)
In [16]:
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
Out[16]:
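For these particular statistics a hand-written function is not strictly required, since pandas' built-in `agg` accepts a list of aggregation names. The following is a minimal sketch of that alternative, not the post's method; it rebuilds the postTestScore values as a standalone Series so that it runs on its own, and `scores`, `categories`, and `stats` are illustrative names.

```python
import pandas as pd

# The postTestScore values from the dataframe above, as a standalone Series
scores = pd.Series([25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70], name='postTestScore')

# Same bins and labels as in the post
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
categories = pd.cut(scores, bins, labels=group_names)

# One row per bin, one column per statistic;
# observed=True drops bins that contain no scores
stats = scores.groupby(categories, observed=True).agg(['min', 'max', 'count', 'mean'])
print(stats)
```

A custom function such as get_stats() remains useful when the statistic is not one of the built-in aggregations.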