[파이썬 머신러닝 완벽 가이드] 1장-3.판다스

2023-04-17 9 분 소요

3. 판다스

1) 파일을 DataFrame 객체로 로딩, 기본 API

csv 파일 읽어서 DataFrame 객체로 변환

import pandas as pd

titanic_df = pd.read_csv('/content/gdrive/MyDrive/titanic_train.csv')

print(type(titanic_df))     # <class 'pandas.core.frame.DataFrame'>
print('DataFrame 크기: ', titanic_df.shape)     # (891, 12)

titanic_df.head(3)

titanic_df.info()
- Non_Null Count : 몇 개의 데이터가 null이 아닌지
- Dtype : 컬럼별 데이터 타입

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890          # 전체 row 수
Data columns (total 12 columns):           # 전체 column 수
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64     # 전체 891개의 데이터중 714개만 not null
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)       # 전체 12개의 컬럼들의 타입을 요약한 것
memory usage: 83.7+ KB

titanic_df.describe()

titanic_df['Pclass'].value_counts()

Pclass 컬럼에 해당하는 값들 중 유니크한 값을 센다.

  value_counts = titanic_df['Pclass'].value_counts()    ## value_counts() : 유니크한 값을 센다
  print(value_counts)
  -----------------------------------------------------------
  3    491         # Pclass = 3인 값의 개수 : 491개
  1    216
  2    184
  Name: Pclass, dtype: int64

DataFrame 객체의 열 하나는 Series 타입으로 인식한다.

  titanic_pclass = titanic_df['Pclass']
  print(type(titanic_pclass)) # 열 하나는 Series로 인식한다
  ---------------------------------------------------
  <class 'pandas.core.series.Series'>

2) DataFrame과 리스트, 딕셔너리, 넘파이 ndarray 상호 변환

넘파이 ndarray, 리스트 → DataFrame으로 변환하기

import numpy as np

col_name1=['col1']

list1 = [1, 2, 3]
array1 = np.array(list1)

df_list1 = pd.DataFrame(list1, columns=col_name1)
print('1차원 리스트로 만든 DataFrame:\n', df_list1)

df_array1 = pd.DataFrame(array1, columns=col_name1)
print('1차원 ndarray로 만든 DataFrame:\n', df_array1)

------------------------------------
		col1
0     1
1     2
2     3

# 3개의 컬럼명이 필요함. 
col_name2=['col1', 'col2', 'col3']

# 2행x3열 형태의 리스트와 ndarray 생성 한 뒤 이를 DataFrame으로 변환. 
list2 = [[1, 2, 3],
         [11, 12, 13]]
array2 = np.array(list2)

df_list2 = pd.DataFrame(list2, columns=col_name2)
print('2차원 리스트로 만든 DataFrame:\n', df_list2)

df_array2 = pd.DataFrame(array2, columns=col_name2)
print('2차원 ndarray로 만든 DataFrame:\n', df_array2)
--------------------------------------------------------------

		col1  col2  col3
0     1     2     3
1    11    12    13

Dictionary → DataFrame으로 변환하기

# Key는 컬럼명으로 매핑, Value는 리스트 형(또는 ndarray)
dict = {'col1':[1, 11], 'col2':[2, 22], 'col3':[3, 33]}
df_dict = pd.DataFrame(dict)
print('딕셔너리로 만든 DataFrame:\n', df_dict)
----------------------------------------------------
col1  col2  col3
0     1     2     3
1    11    22    33

DataFrame → ndarray, list, dictionary로 변환

# DataFrame을 ndarray로 변환
array3 = df_dict**.values**
print(type(array3))            # <class 'numpy.ndarray'>
print(array3)
---------------------------------------------------------------
[[ 1  2  3]
 [11 22 33]]

# DataFrame을 리스트로 변환
list3 = df_dict**.values.tolist()**
print(type(list3))                # <class 'list'>
print(list3)
---------------------------------------------------------------
[[1, 2, 3], [11, 22, 33]]

# DataFrame을 딕셔너리로 변환
dict3 = df_dict.to_dict('list')        # 딕셔너리의 값이 list 형으로 반환됨
print(type(dict3))                     # <class 'dict'>
print(dict3)
----------------------------------------------------------------
{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

3) DataFrame의 컬럼 데이터 세트 생성과 수정

# 새로운 컬럼 Age_0 추가하고 그 값은 모두 0으로 지정
titanic_df['Age_0']=0
titanic_df.head(3)

# 기존 컬럼에 0 세팅
# titanic_df['Age']=0
# titanic_df.head(3)

titanic_df['Age_by_10'] = titanic_df['Age']*10   
# Age_by_10이라는 컬럼 생성하고, Age 컬럼 값에 10을 곱한 값을 저장
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch']+1
titanic_df.head(3)

titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)

4) DataFrame 데이터 삭제

데이터 프레임에서 데이터의 삭제는 drop() 메서드를 이용한다.
drop 메서드의 원형은 다음과 같다.
- DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors=’raise’)
- axis
  - axis = 0 : row 축 방향으로 drop → labels에 오는 값은 인덱스가 됨
  - axis = 1 : column 축 방향으로 drop → labels에 오는 값은 칼럼이 됨
- inplace
  - inplace=True : 원본 DataFrame이 수정됨
  - inplace=False (default) : 원본은 그대로, 반환된 DataFrame은 drop된 dataFrame이 반환됨

titanic_drop_df = titanic_df**.drop**('Age_0', axis=1 )  
# 연산하고 결과를 titanic_drop_df에 담은 것. 원래 titanic_df에는 Age_0 여전히 존재

titanic_drop_df.head(3)

titanic_df  # Age_0가 여전히 존재함을 확인할 수 있다.

한번에 여러 컬럼 제거

drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)      #titanic_df에서 바로 컬럼 제거 옵션. 결과 반환 X
print(' inplace=True 로 drop 후 반환된 값:',drop_result)   # None
titanic_df.head(3)

행을 drop

titanic_df.drop([0,1,2], axis=0, inplace=True)
titanic_df.head(3)

5) Index 객체

index 객체 추출

indexes = titanic_df**.index**
print(indexes)  # Index 객체는 RDBMS의 Primary Key 처럼 레코드를 고유하게 식별하는 객체임.
----------------------------------------
RangeIndex(start=0, stop=891, step=1)

Index 객체를 실제 값 array로 변환

print(**indexes.values**)
-----------------------------------------
[  0   1   2   3   4   5    ...... 888 889 890]

ndarray와 유사하게 단일 값 반환 및 슬라이싱 가능

print(type(indexes.values))     # <class 'numpy.ndarray'>
print(indexes.values.shape)     # (891,)
print(indexes.values[:5])       # [0 1 2 3 4]

print(indexes[:5].values)       # [0 1 2 3 4]
print(indexes[6])               # 6

한 번 만들어진 DataFrame 및 Series의 Index 객체는 함부로 변경할 수 없다.

indexes[0] = 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-2fe1c3d18d1a> in <module>
----> 1 indexes[0] = 5

/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   5033     @final
   5034     def __setitem__(self, key, value):
-> 5035         raise TypeError("Index does not support mutable operations")
   5036 
   5037     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Series 객체는 Index 객체를 포함하지만 Series 객체에 연산 함수를 적용할 때 Index는 연산에서 제외 됨. Index는 오직 식별용으로만 사용됨

series_fair = titanic_df['Fare']

print('Fair Series max 값:', series_fair.max())
print('Fair Series sum 값:', series_fair.sum())
print('sum() Fair Series:', sum(series_fair))
print()
print('Fair Series + 3:\n',(series_fair + 3).head(3) )

-------------------------------------------------------
Fair Series max 값: 512.3292
Fair Series sum 값: 28693.9493
sum() Fair Series: 28693.949299999967

Fair Series + 3:
 0    10.2500
1    74.2833
2    10.9250
Name: Fare, dtype: float64

reset_index
- 새롭게 인덱스를 연속 숫자 형으로 할당하며, 기존 인덱스는 index라는 새로운 칼럼 명으로 추가됨
```
  titanic_reset_df = titanic_df.reset_index(inplace=False)
  titanic_reset_df.head()
```

인덱스가 연속된 int 숫자형 데이터가 아닐 경우에 다시 이를 연속 int 숫자형 데이터로 만들때 주로 사용

예 : ‘Pclass’ 칼럼 Series의 value_counts()를 수행하면 ‘Pclass’ 고유 값이 식별자 인덱스 역할을 했다. 이보다는 연속 숫자형 인덱스가 고유 식별자로 더 적합해 보임. → reset_index를 이용해 인덱스를 다시 만들면 된다.
단, Series에 reset_index()를 적용하면 Series가 아닌 DataFrame이 반환되니 유의해야 한다. 기존 인덱스가 칼럼으로 추가돼 칼럼이 2개가 되므로 Series가 아닌 DataFrame이 반환된다.
reset_index 이전

  value_counts = titanic_df['Pclass'].value_counts()
  print(value_counts) 
  ------------------
  3    491
  1    216
  2    184
  Name: Pclass, dtype: int64
    
  print('value_counts 객체 변수 타입:',type(value_counts))  # <class 'pandas.core.series.Series'>

reset_index 이후

  new_value_counts = value_counts.reset_index(inplace=False)   ####e, drop=True
  print(new_value_counts)
  -------------------------
  index  Pclass
  0      3     491
  1      1     216
  2      2     184
    
  print('new_value_counts 객체 변수 타입:',type(new_value_counts)) # <class 'pandas.core.frame.DataFrame'>

6) 데이터 셀렉션 및 필터링

넘파이의 경우 [] 연산자 내 단일 값 추출, 슬라이싱, 팬시 인덱싱, 불린 인덱싱을 통해 데이터를 추출함
판다스의 경우 ix[], iloc[], loc[] 연산자를 통해 동일한 작업을 수행한다.
판다스 DataFrame 뒤에 오는 [] 안에는 칼럼 명 문자 or 인덱스로 변환 가능한 표현식이 들어 갈 수 있다.

# 단일 컬럼의 데이터 추출
titanic_df['Pclass'].head(3)    

# 여러 컬럼들의 데이터 추출
titanic_df[['Survived', 'Pclass']].head(3)    

titanic_df[0] # 오류 발생 -> 컬럼명만 올 수 있음

앞에서 DataFrame의 [] 내에 숫자 값을 입력할 경우 오류가 발생한다고 했는데, 판다스의 인덱스 형태로 변환 가능한 표현식은 [] 내에 입력할 수 있다.
- 가령 titanic_df의 처음 2개 데이터를 추출하고자 titanic_df[0:2]와 같은 슬라이싱을 이용할 수 있다. → 별로 좋은 방법은 아니다.
```
  titanic_df[0:2]
```
불린 인덱싱 표현도 가능

titanic_df[ **titanic_df['Pclass'] == 3**].head(3)

iloc[] 연산자 : 위치기반 인덱싱 → 행과 열 값으로 integer 또는 integer 형의 스랄이싱, 팬시 리스트 값을 입력해줘야 한다 .

  data = {'Name': ['Chulmin', 'Eunkyung','Jinwoong','Soobeom'],
          'Year': [2011, 2016, 2015, 2015],
          'Gender': ['Male', 'Female', 'Male', 'Male']
         }
  data_df = pd.DataFrame(data, index=['one','two','three','four'])

  data_df.iloc[0, 0]     # 'Chulmin'

loc[] 연산자 : 명칭 기반 인덱싱 → 행 위치에는 Dataframe index 값을, 열 위치에는 컬럼명을 입력한다.
- 아래는 인덱스 값이 ‘one’인 행의 칼럼명이 ‘name’인 데이터를 추출하는 코드이다.
```
  data_df.loc['one', 'Name']      # 'Chulmin'
```

iloc과 loc 비교

  # 위치기반 iloc slicing
  data_df.iloc[0:1, 0]
  -----------------------
  one    Chulmin
  Name: Name, dtype: object
    
  # 명칭 기반 loc slicing 
  data_df.loc['one':'two', 'Name']    #명칭기반의 경우 명칭은 숫자형이 아닐 수 있으니 -1 연산이 안됨.
  ---------------------------------
  one     Chulmin
  two    Eunkyung
  Name: Name, dtype: object

참고) reset_index() 로 새로운 숫자형 인덱스를 생성 → loc 슬라이싱

  # data_df 를 reset_index() 로 새로운 숫자형 인덱스를 생성
  data_df_reset = data_df.reset_index()
  data_df_reset = data_df_reset.rename(columns={'index':'old_index'})
    
  # index 값에 1을 더해서 1부터 시작하는 새로운 index값 생성
  data_df_reset.index = data_df_reset.index+1
  data_df_reset

  data_df_reset.loc[1:2 , 'Name']
  --------------------------------------
  1     Chulmin
  2    Eunkyung
  Name: Name, dtype: object

불린 인덱싱

  titanic_boolean = titanic_df[titanic_df['Age'] > 60]
  titanic_boolean 

  titanic_df[titanic_df['Age'] > 60][['Name','Age']].head(3)  # Age가 60보다 큰 데이터 중 Name, Age 컬럼에 해당하는 데이터 프레임 반환

불린 인덱싱을 여러 개 조합해서도 사용 가능

titanic_df[ (titanic_df['Age'] > 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]    #### &(and), |(or), ~(not)

#### 조건을 변수에 할당하여 나중에 사용할 수도 있음
cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass']==1
cond3 = titanic_df['Sex']=='female'
titanic_df[ cond1 & cond2 & cond3]

7) 정렬, Aggregation 함수, Group By 적용

sort_values() : 정렬
- by=[’컬럼’] : 해당 컬럼을 기준으로 정렬 수행
- ascending : True이면 오름차순 정렬, False이면 내림차순 정렬 (default는 True)
- inplace : False이면 원본 데이터 프레임은 그대로 유지, 정렬된 데이터 프레임 반환, True이면 원본 데이터 프레임이 정렬됨 (default는 False)
```
  # Name 컬럼을 기준으로 오름차순 정렬
  titanic_sorted = titanic_df.sort_values(by=['Name'])
    
  # Pclass, Name 컬럼을 기준으로 내림차순 정렬   
  titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'], ascending=False)
```

Aggregation 함수

DataFrame에서 min(), max(0, sum(), count(0와 같은 aggregation 함수의 적용은 RDBMS SQL의 aggregation 함수 적용과 유사하다.
다만 DataFrame에서 바로 aggregation을 호출할 경우 모든 칼럼에 해당 aggregation을 적용한다.

  titanic_df.count()
  ---------------------------
  PassengerId    891
  Survived       891
  Pclass         891
  Name           891
  Sex            891
  Age            714
  SibSp          891
  Parch          891
  Ticket         891
  Fare           891
  Cabin          204
  Embarked       889
  dtype: int64

특정 컬럼에 aggregation 함수 적용 → DataFrame에서 대상 컬럼들만 추출해 적용

  titanic_df[['Age', 'Fare']].mean()
  --------------------------------
  Age     29.699118
  Fare    32.204208
  dtype: float64

groupby() 적용
- 입력 파라미터 by 에 칼럼을 입력하면 대상 칼럼으로 groupby 된다.
- DataFrame에 groupby()를 호출하면 DataFrameGroupBy라는 또다른 형태의 DataFrame을 반환한다.
```
  titanic_groupby = titanic_df.groupby(by='Pclass')
  type(titanic_groupby)     # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
```
- DataFrame에 groupby()를 호출해 반환한 결과에 aggregation 함수를 호출하면 groupby() 대상 컬럼을 제외한 모든 칼럼에 해당 aggregation 함수를 적용한다.
```
  # PClass 별 모든 컬럼들의 count
  titanic_groupby = titanic_df.groupby('Pclass').count()
  titanic_groupby
```
- group by에 특정 칼럼만 aggregation 함수를 적용하려면 groupby로 반환된 DataFrameGroupBy 객체에 해당 컬럼을 필터링 한 뒤 aggregation 함수를 적용한다.
```
  titanic_groupby = titanic_df.groupby('Pclass')[['PassengerId', 'Survived']].count()
  titanic_groupby
```
서로 다른 aggregation 함수를 적용하는 경우
- SQL : select max(Age), min(Age) from titanic_table group by Pclass

titanic_df.groupby('Pclass')['Age'].agg([max, min])

SQL에서 select max(Age), sum(SibSp), avg(Fare) from titanic_table group by Pclass를 DataFrame groupby에서는 어떻게 처리하는가?
```
  agg_format={'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
  titanic_df.groupby('Pclass').agg(agg_format)
```

8) 결손 데이터 처리하기

isna()로 결손 데이터 여부 확인
```
  titanic_df.isna().head(3)
```

각 컬럼 별 결손 데이터 개수 확인

  titanic_df.isna( ).sum( )
  ----------------------------------------
  PassengerId      0
  Survived         0
  Pclass           0
  Name             0
  Sex              0
  Age            177
  SibSp            0
  Parch            0
  Ticket           0
  Fare             0
  Cabin          687
  Embarked         2
  dtype: int64

fillna()로 결손 데이터 대체하기

  titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
  titanic_df.head(3)

  titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
  titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
  titanic_df.isna().sum()
  ------------------------------------------------------
  PassengerId    0
  Survived       0
  Pclass         0
  Name           0
  Sex            0
  Age            0
  SibSp          0
  Parch          0
  Ticket         0
  Fare           0
  Cabin          0
  Embarked       0
  dtype: int64

9) Apply lambda 식으로 데이터 가공

lambda : 함수의 선언과 함수 내의 처리를 한 줄의 식으로 쉽게 변환하는 식

def get_square(a):
    return a**2
get_square(3)

->  lambda_square = lambda x : x ** 2
print('3의 제곱은:',lambda_square(3))
 

DataFrame의 lambda

# titanic_df에서 Name_len이라는 컬럼을 생성하고 Name 컬럼의 길이를 저장해라
titanic_df['Name_len']= titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name','Name_len']].head(3)

# titanic_df에 Child_Adult 컬럼을 생성하고 Age 컬럼의 값이 15보다 작거나 같으면 Child를, 크면 Adult를 저장해라 

titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else 'Adult' )
titanic_df[['Age','Child_Adult']].head(8)

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x<=15 else ('Adult' if x <= 60 else 'Elderly'))

titanic_df['Age_cat'].value_counts()

# 나이에 따라 세분화된 분류를 수행하는 함수 생성. 
def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

# lambda 식에 위에서 생성한 get_category( ) 함수를 반환값으로 지정. 
# get_category(X)는 입력값으로 ‘Age’ 컬럼 값을 받아서 해당하는 cat 반환
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age','Age_cat']].head()

Twitter Facebook LinkedIn

Woodi

[파이썬 머신러닝 완벽 가이드] 1장-3.판다스

3. 판다스

1) 파일을 DataFrame 객체로 로딩, 기본 API

2) DataFrame과 리스트, 딕셔너리, 넘파이 ndarray 상호 변환

3) DataFrame의 컬럼 데이터 세트 생성과 수정

4) DataFrame 데이터 삭제

5) Index 객체

6) 데이터 셀렉션 및 필터링

7) 정렬, Aggregation 함수, Group By 적용

8) 결손 데이터 처리하기

9) Apply lambda 식으로 데이터 가공

공유하기

참고

[파이썬 머신러닝 완벽 가이드] 2장-사이킷런으로 시작하는 머신러닝-2

[파이썬 머신러닝 완벽 가이드] 2장-사이킷런으로 시작하는 머신러닝-1

[파이썬 머신러닝 완벽 가이드] 1장-2.넘파이

[파이썬 머신러닝 완벽 가이드] 1장-1. 머신러닝 개념1