02 Pandas

Pandas I + II

Data Structures

Data Structures in picture

lifecycle lifecycle

  • Data Frame: 2D tabular data (collection of series)
  • Series: 1D data (columnar data)
  • Index: A sequence of row labels
    • can be non-numeric, can have a name (nonunique)
    • column names almost always unique
    • lifecycle

Indexing with the [ ] Operator

  • Our dataframe: elections
 CandidateParty%YearResult
0ObamaDemocratic52.92008win
1McCainRepublican45.72008loss
2ObamaDemocratic51.12012win
3RomneyRepublican47.22012loss
4ClintonDemocratic48.22016loss
5TrumpRepublican46.12016win
  • indexing by column names
    column name argument elections['Candidate'].head(6): series
      0 Obama
      1 McCain
      2 Obama
      3 Romney
      4 Clinton
      5 Trump
      Name: Candidate, dtype: object
    
    list argument (even one argument) elections[["Candidate", "Party"]].head(6) : DF
    #CandidateParty
    0ObamaDemocratic
    1McCainRepublican
    2ObamaDemocratic
    3RomneyRepublican
    4ClintonDemocratic
    5TrumpRepublican
  • indexing by row slices
    use row numberselections[0:3] : DF
     CandidateParty%yearResult
    0ObamaDemocratic52.92008win
    1McCainRepublican45.72008loss
    2ObamaDemocratic51.12012win
    • [참고]:
      • elections[0] \(\rightarrow\) error (requires slicing!)
      • elections[0:] \(\rightarrow\) all dataframe
      • elections[0:1] \(\rightarrow\) 1 line of df
  • Example Question:

    • weird = pd.DataFrame({1:['topdog', 'botdog'], "1":['topcat', 'botcat']})
    • wierd:

       11
      0topdogtopcat
      1botdogbotcat
    weird[1]
      0 topdog
      1 botdog
      Name: 1, dtype: object
    
    weird["1"]
      0 topcat
      1 botcat
      Name: 1, dtype: object
    
    weird[1:]
     11
    1botdogbotcat

Boolean Array Selection and Querying

  • Boolean Array

    elections[[True, False, True, False, False, False]] (`True` on index 0 and 2)
     CandidateParty%yearResult
    0ObamaDemocratic52.92008win
    2ObamaDemocratic51.12012win
  • Boolean Array using Logical Operators

    elections[elections['Party']=='Democratic']
     CandidateParty%yearResult
    0ObamaDemocratic52.92008win
    2ObamaDemocratic51.12012win
    4ClintonDemocratic48.22016loss
    • [참고] elections['Party']=='Democratic' :
    0     True
    1    False
    2     True
    3    False
    4     True
    5    False
    Name: Party, dtype: bool // 대충 1D array of True and False
    
  • Boolean Series combined by & Operator

    elections[(elections['Result']=='win') & (elections['%']<50)]
     CandidateParty%yearResult
    5TrumpRepulican46.12016win
    • [참고] (elections['Result']=='win'), (elections['%']<50) :
    0    True                   0    False
    1    False                  1    False
    2    True                   2    False
    3    False                  3     True
    4    False                  4     True
    5    True                   5     True
    Name: Party, dtype: bool    Name: Party, dtype: bool
    

_isin function

  • makes it more convenient & neat to find rows that match one of many
elections[ (elections['Party']=='Democratic') | (elections['Party'] =='Republican') ]
# == is equal to
elections[elections['Party'].isin(['Republican', 'Democratic'])] # better

Query Command

elections.query( "Result=='win' and year < 2010" )
 CandidateParty%yearResult
0ObamaDemocratic52.92008win

Indexing with .loc and .iloc

  • loc and iloc: alternatives to index into DF

    loc : location based indexing iloc: integer-based indexing

_loc

  1. access values by labels
  2. access values using boolean array
  • Most basic: provide list of row and col labels \(\Rightarrow\) DF

    elections.loc[ [0,1,2,3,4], ['Candidate', 'Party'] ]
     CandidateParty
    0ObamaDemocratic
    1McCainRepublican
    2ObamaDemocratic
    3RomneyRepublican
    4ClintonDemocratic
  • when row label is not a number (index)
    elections.loc[ [ 1980, 1984 ], ['Candidate', 'Party'] ]

    lifecycle

  • loc with slices

    • BTW: slicing works with all label types
    • Slicing with loc are inclusive, not exclusive
    elections.loc[ 0:4 , 'Candidate':'Party' ]
     CandidateParty
    0ObamaDemocratic
    1McCainRepublican
    2ObamaDemocratic
    3RomneyRepublican
    4ClintonDemocratic
    • elections.loc[ 1980:1984 , 'Candidate':'Party' ] would also work.
  • single label as column argument \(\Rightarrow\) Series
    elections.loc[ 0:4, 'Candidate' ]
      0      Obama
      1     McCain
      2      Obama
      3     Romney
      4    Clinton
      Name: Candidate, dtype: object
    
  • single column argument in a list \(\Rightarrow\) DataFrame
    elections.loc[ 0:4, ['Candidate'] ]
     Candidate
    0Obama
    1McCain
    2Obama
    3Romney
    4Clinton
  • single label as row argument \(\Rightarrow\) Series
    elections.loc[ 0, 'Candidate':'Party' ]
    Candidate         Obama
    Party Democratic
    Name: 0, dtype: object
    
    </div></details>
    
  • single row argument in a list \(\Rightarrow\) DataFrame
    elections.loc[ 0:4, ['Candidate'] ]
     CandidateParty
    0ObamaDemocratic
  • loc supports Boolean Arrays
    elections.loc[(elections['Result']=='win') & (elections['%']<50), 'Candidate':'%' ]
     CandidateParty%
    5TrumpRepublican46.1
    • [참고] (elections['Result']=='win'), (elections['%']<50) :
    0    False
    1    False
    2    False
    3    False
    4    False
    5     True
    dtype: bool
    

_iloc

  • doesn’t think about labels at all
  • returns items that appear in the numberical positions specified

    • even tho the row name is a string (not index) iloc
  • advantages of loc:
    1. harder to make mistakes
    2. easier to read
    3. not vulnerable to changes to the ordering of rows/cols in raw data files
      • still, iloc can be more convenient \(\Rightarrow\) use iloc judiciously

Handy Properties and Utility Functions

_sample

  • DF of random selection of rows
  • default : without replacement (but replace=True allowed)
  • can be chained with selection operators [], loc, iloc

iloc

_numpy opeartions

  • np.mean( winners )
  • max( winners )

descriptions

  • head: displays only the top few rows (5 by default) (df.head(4))
  • size: gives total number of data points (df.size : 30 )
  • shape: gives size of data in r x c (df.shape : (6, 5))
  • describe : provide summary of data
  • index : Returns the index (aka row labels)
    • RangeIndex(start=0, stop=6, step=1)
  • columns : returns labels for the columns
    • Index(['Candidate', 'Party', '%', 'year', 'Result'], dtype='object')

sort

  • sort_values : sort dataframe according to a specific column
    elections.sort_values('%', ascending=False)
     CandidateParty % YearResult
    0ObamaDemocratic 52.9 2008win
    2ObamaDemocratic 51.1 2012win
    4ClintonDemocratic48.2 2016loss
    3RomneyRepublican47.2 2012loss
    5TrumpRepublican46.1 2016win
    1McCainRepublican45.7 2008loss
  • sort_values on series
    elections['Candidate'].sort_values().head(5)
    4    Clinton
    1     McCain
    0      Obama
    2      Obama
    3     Romney
    Name: Candidate, dtype: object
    

values

  • value_counts: creates new Series showing counts of every value
    elections['Party'].value_counts()
    Democratic    3
    Republican    3
    Name: Party, dtype: int64
    
    elections.value_counts()
    Candidate  Party       %     year  Result
    Clinton    Democratic  48.2  2016  loss      1
    McCain     Republican  45.7  2008  loss      1
    Obama      Democratic  51.1  2012  win       1
                           52.9  2008  win       1
    Romney     Republican  47.2  2012  loss      1
    Trump      Republican  46.1  2016  win       1
    dtype: int64
    

_unique

  • must be done at series; returns numpy array
    elections['Party'].unique()
    array(['Democratic', 'Republican'], dtype=object)
    
    • elections.unique() \(\Rightarrow\) AttributeError

_str

  • manipulating string data
  • Scenario: Find all rows where name ends with n

  • Approach #1: Use list comprehnesions

    • create a list of booleans where i-th entry == True if i-th name startswith J

      elections.loc[0, 'Candidate'].endswith('n')
      > False
      
      ends_with_n = [x.endswith('n') for x in elections['Candidate']]
      ends_with_n
      > [False, True, False, False, True, False]
      
      result = elections[[x.endswith('n') for x in elections['Candidate']]]
      
      • result:

         CandidateParty%YearResult
        1McCainRepublican45.72008loss
        4ClintonDemocratic48.22016loss
    • pass this list to [] or loc[]

  • Approach #2: use str from Series class

    result = elections[elections['Candidate'].str.endswith('n')]
    
    • \(\Rightarrow\) more readable & efficient
    • approach #1 is not idiomatic (관용적)
  • str.function_name works on 1-D array (series) and returns series of boolean
    • if you want to change 1D series to Dataframe, use .to_frame()
  • str.startswith('')
  • str.contains('')
  • str.split('o')
    0       [Obama]
    1      [McCain]
    2       [Obama]
    3     [R, mney]
    4    [Clint, n]
    5       [Trump]
    Name: Candidate, dtype: object
    

Adding, Modifying, and Removing Columns

  • Scenario: Sort names by length.
  • sort_values does not provide custom comparison

  • Approach #1: Create temp column and sort

    # create new series of length
    candidate_names = elections['Candidate'].str.len()
    # add that series to dataframe as a column
    elections['name_length'] = candidate_names
    
    elections.sort_values('name_length')
     CandidateParty%YearResultname_length
    0ObamaDemocratic52.92008win5
    2ObamaDemocratic51.12012win5
    5TrumpRepublican46.12016win5
    1McCainRepublican45.72008loss6
    3RomneyRepublican47.22012loss6
    4ClintonDemocratic48.22016loss7
    • since we just made a new column (modified original elections df) drop that temp column
      # axis refers to dropping column (df drops row by default)
      elections = elections.drop('name_length', axis = 1)
      
    • sub-scenario: sort by # of occurences of \(a\) and \(m\)

      • sort by Arbitrary Functions : Series.map
      def count_a_m(name):
        return name.count('a')+name.count('m')
      
      elections['countA_M'] = elections['Candidate'].map(count_a_m)
      
      elections.sort_values(by='countA_M', ascending=False)
       CandidateParty%YearResultcountA_M
      0ObamaDemocratic52.92008win3
      2ObamaDemocratic51.12012win3
      1McCainRepublican45.72008loss1
      3RomneyRepublican47.22012loss1
      5TrumpRepublican46.12016win1
      4ClintonDemocratic48.22016loss0
  • Approach #2: Create sorted index + use loc
    1. Create Series of only lengths of names
      name_length = elections['Candidate'].str.len()
      
    2. Sort series of only name lengths
      name_length = name_length.sort_values()
      
    3. Pass sorted index as argument of .loc to original DF
      indexOf_name_length = name_length.index
      indexOf_name_length
      > Int64Index([0, 2, 5, 1, 3, 4], dtype='int64')
      elections.loc[indexOf_name_length]
      

_groupby .agg

  • example data: baby names inlcuding states

  • too many lines, so babies = babies[babies['Year']>2000] (only the names after 2001)
  • babies.head()

     IdNameYearGenderStateCount
    1073710737Madison2001FAK54
    1073810738Emily2001FAK47
    1073910739Hannah2001FAK46
    1074010740Ashley2001FAK40
    1074110741Abigail2001FAK39
  • Scenario: Find the names that have changed the most in popularity

    • keep it simple and use absolute max/min difference (max(count)-min(count))
    • ex) $$Jennifer$’s abs max/min diff = 2026 - 5 = 2021
    def ammd(series):
      return max(series)-min(series)
    
    jennifer_counts = babies.query("Name == 'Jennifer'")["Count"]
    > 10893      6
      11063      7
      11450      8
      11873      8
      12121      7
                ..
      5585891    8
      5587162    6
      5632370    8
      5632465    9
      5632632    6
      Name: Count, Length: 647, dtype: int64
    
    ammd(jennifer_counts)
    > 2021
    
  • Approach #1: Getting AMMD for every name

    • use python knowledge
    ammd_of_babyname_counts = {}
    for name in sorted(babies['Name'].unique()):
      counts_of_current_name = babies[babies['Name']==name]['Count']
      ammd_of_babyname_counts[name] = ammd(counts_of_current_name)
    
    ammd_of_babyname_counts = pd.Series(ammd_of_babyname_counts)
    
    • extremely slow and complicated
  • Approach #2: Using groupby.agg

    babies.groupby('Name').agg(ammd)
    
    • simpler, faster, more versatile

      ammd_of_babyname_counts.head()
       IdNameYear
      Name   
      Aaban206910
      Aadan436937362
      Aadarsh000
      Aaden54706539153
      Aadhav000
      Zyra437971363
      Zyrah438016321
      Zyren000
      Zyria3680790132
      Zyriah369887385