Pandas Notes (+NumPy)


김솔샤르 2022. 8. 6. 15:11

Reading a CSV file

import pandas as pd

train_df = pd.read_csv('taxi_train.zip')
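As a minimal sketch of the same call that runs without the taxi file, the snippet below parses CSV text from an in-memory buffer. The rows in `csv_text` and the name `sample_df` are made up for illustration. `pd.read_csv` accepts file-like objects as well as paths, and for a path ending in `.zip` it infers the compression from the extension, provided the archive contains a single file.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the taxi data
csv_text = """id,vendor_id,passenger_count,trip_duration
id1,1,1,455
id2,2,3,1200
id3,1,2,980
"""

# read_csv takes a path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```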

Inspecting the data

train_df.head() # Print the first few rows; pass a number to set how many

train_df.info() # Print metadata about the DataFrame

train_df.shape # Axis sizes as a (rows, columns) tuple; shape is an attribute, not a method
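A small sketch of these inspection calls on a hypothetical four-row frame (the values are invented). It shows that `head` takes a row count and that `shape` is read as an attribute and unpacks into two integers.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'vendor_id': [1, 2, 1, 2],
    'trip_duration': [455, 1200, 980, 300],
})

first_two = sample_df.head(2)      # only the first 2 rows
n_rows, n_cols = sample_df.shape   # tuple attribute: (rows, columns)
sample_df.info()                   # prints dtypes and non-null counts
```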

Column information

train_df.columns
train_df.dtypes

Creating a DataFrame from a Python dictionary

dic_df = pd.DataFrame({'column_name' : train_df.columns,
                       'column_type' : train_df.dtypes})

Resetting the index

dic_df = dic_df.reset_index(drop=True)
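The sketch below, on a hypothetical three-column frame, shows why the reset is useful here. Because `dtypes` is a Series indexed by column name, the new frame inherits those names as its index, and `reset_index(drop=True)` replaces them with 0, 1, 2, ... The names `sample_df` and `meta_df` are illustrative.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'vendor_id': [1, 2],
    'passenger_count': [1, 3],
    'trip_duration': [455, 1200],
})

meta_df = pd.DataFrame({'column_name': sample_df.columns,
                        'column_type': sample_df.dtypes})
index_before = list(meta_df.index)   # column names, taken from dtypes

meta_df = meta_df.reset_index(drop=True)
index_after = list(meta_df.index)    # plain positions 0..2
```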

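Boolean indexing (loc)

The boolean condition can be a NumPy array or a Python list as well as a boolean Series.
When only a single condition is given, it is applied along the rows: on a Series such as train_df.dtypes the result is a Series, and on a DataFrame it is a DataFrame containing only the matching rows.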

## Extract a single column -> the result is a Series
trip_duration_sr = train_df['trip_duration']

## Extract a single column -> the result is a DataFrame
trip_duration_df = train_df[['trip_duration']]

## Keep only the rows of dic_df whose column type is object (the result is a DataFrame)
object_sr = dic_df.loc[dic_df.column_type=='object']

## Keep only the entries whose column type is not object
object_non_sr = train_df.dtypes.loc[train_df.dtypes!='object']

## Keep only the entries whose type is int64 and whose column name is not trip_duration
int64_sr = train_df.dtypes.loc[(train_df.dtypes=='int64') & (train_df.columns != 'trip_duration')]
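A minimal sketch of combining two conditions with `&`, using a hypothetical frame. Each condition needs its own parentheses because `&` binds more tightly than `==` and `!=`. The second part shows a plain Python list used as a row mask on a DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'id': ['id1', 'id2', 'id3'],
    'vendor_id': [1, 2, 1],
    'trip_duration': [455, 1200, 980],
})

dtypes_sr = sample_df.dtypes
# Series mask combined with a NumPy array mask of the same length
mask = (dtypes_sr == 'int64') & (sample_df.columns != 'trip_duration')
int_cols = dtypes_sr.loc[mask]                     # Series

# A plain list of booleans works as a row mask too
picked_rows = sample_df.loc[[True, False, True]]   # DataFrame
```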

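Series → DataFrame

## Select the columns named in the Series index; the result is still a DataFrame
filtered_df = train_df.loc[:,int64_sr.index.values]

## Extract a single column as a 2-D NumPy array of shape (n, 1); use train_df[['trip_duration']] to get a DataFrame instead
filtered_df_2 = train_df['trip_duration'].values.reshape(-1,1)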
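To make the resulting types concrete, here is a short sketch on hypothetical data comparing four ways of pulling out one column. Only the double-bracket form and `to_frame()` return a DataFrame, while `.values.reshape(-1, 1)` returns a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'vendor_id': [1, 2, 1],
    'trip_duration': [455, 1200, 980],
})

as_series = sample_df['trip_duration']        # Series, shape (3,)
as_frame = sample_df[['trip_duration']]       # DataFrame, shape (3, 1)
also_frame = as_series.to_frame()             # DataFrame built from the Series
as_array = as_series.values.reshape(-1, 1)    # ndarray, shape (3, 1)
```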

Type conversion

The target types can be given as a dictionary mapping column names to types

import numpy as np

filtered_df = filtered_df.astype({'vendor_id':np.int32, 'passenger_count':np.int32})
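A quick sketch, on hypothetical data, confirming that only the columns named in the dictionary are converted and that `astype` returns a new frame rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'vendor_id': [1, 2],
    'passenger_count': [1, 3],
    'trip_duration': [455, 1200],
})

converted = sample_df.astype({'vendor_id': np.int32,
                              'passenger_count': np.int32})
print(converted.dtypes)   # trip_duration keeps its original dtype
```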

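Integer-location based indexing (iloc)

iloc selects by integer row position, not by index label, so it is easy to confuse with loc

train_df.iloc[:11, :] ## Rows at positions 0 through 10 (the end position 11 is excluded)

train_df.loc[:10, :] ## The same rows via loc, assuming the default integer index (the end label 10 is included)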
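The difference is easier to see on a frame whose labels are not 0, 1, 2, ... The sketch below uses a hypothetical index of 10, 20, 30, 40: `iloc` slices by position with an exclusive end, while `loc` slices by label with an inclusive end.

```python
import pandas as pd

# Hypothetical data with non-default index labels
sample_df = pd.DataFrame({'trip_duration': [455, 1200, 980, 300]},
                         index=[10, 20, 30, 40])

by_position = sample_df.iloc[:2]   # positions 0 and 1
by_label = sample_df.loc[:20]      # labels up to and including 20

# With the default RangeIndex the same slice bound selects different row counts
default_df = sample_df.reset_index(drop=True)
three_rows = default_df.iloc[:3]   # positions 0, 1, 2
four_rows = default_df.loc[:3]     # labels 0, 1, 2, 3
```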

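Counting data

train_df.count() ## Number of non-NaN values in each column

train_df.value_counts() ## How many times each distinct row occurs; on a single column, e.g. train_df['vendor_id'].value_counts(), it counts each distinct value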
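A short sketch contrasting the two calls on hypothetical data with one missing value: `count` reports non-NaN entries per column, and `value_counts` on a column reports the frequency of each distinct value.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; one passenger_count is missing
sample_df = pd.DataFrame({
    'vendor_id': [1, 2, 1, 1],
    'passenger_count': [1, np.nan, 2, 1],
})

non_null = sample_df.count()                           # per-column non-NaN counts
vendor_counts = sample_df['vendor_id'].value_counts()  # frequency of each vendor_id
```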

Deleting data

Both rows and columns can be dropped

## Dropping columns
df_column_removed_1 = train_df.drop('id', axis=1) ## Drop the id column
df_column_removed_2 = train_df.drop(['id', 'trip_duration'], axis=1) ## Drop several columns

## Dropping rows
target_indexes = train_df[train_df['trip_duration'] >= 1000].index.values
df_row_removed = train_df.drop(target_indexes, axis=0) ## Drop the target rows
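The same two operations can also be written with the `columns=` and `index=` keywords, which some readers find clearer than `axis`. This sketch uses hypothetical data and shows that `drop` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'id': ['id1', 'id2', 'id3'],
    'trip_duration': [455, 1200, 980],
})

without_id = sample_df.drop(columns='id')

# Index labels of the rows to remove
long_idx = sample_df[sample_df['trip_duration'] >= 1000].index
short_trips = sample_df.drop(index=long_idx)
```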

Sorting

sorted_df = train_df.sort_values(by=['pickup_datetime'], ascending=False)
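A minimal sketch of a descending sort on hypothetical timestamps. It assumes the column arrives as text, as it would from a CSV read without date parsing, and converts it with `pd.to_datetime` first so the sort is chronological. The original index labels travel with their rows.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55',
                        '2016-06-12 00:43:35',
                        '2016-01-19 11:35:24'],
    'trip_duration': [455, 663, 2124],
})

sample_df['pickup_datetime'] = pd.to_datetime(sample_df['pickup_datetime'])
newest_first = sample_df.sort_values(by=['pickup_datetime'], ascending=False)
```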

Group by

Builds a group object keyed on the given column

train_df.groupby(by='vendor_id')['trip_duration'].mean()

Multiple aggregations per group (agg)

train_df.groupby(by='vendor_id').agg(count=('passenger_count','count'), duration_mean=('trip_duration','mean'), duration_max=('trip_duration','max'))
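A small worked sketch of both group operations on hypothetical data. In the named-aggregation form, each keyword becomes an output column and its value is a `(source column, function)` pair.

```python
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({
    'vendor_id': [1, 2, 1, 2],
    'passenger_count': [1, 3, 2, 1],
    'trip_duration': [400, 1200, 600, 800],
})

mean_by_vendor = sample_df.groupby(by='vendor_id')['trip_duration'].mean()

summary = sample_df.groupby(by='vendor_id').agg(
    count=('passenger_count', 'count'),
    duration_mean=('trip_duration', 'mean'),
    duration_max=('trip_duration', 'max'),
)
print(summary)
```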

Checking for missing data

train_df.isna().sum()
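As a sketch of how this works: `isna()` produces a frame of booleans, and `sum()` adds them up column by column with True counted as 1, giving the number of missing values per column. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; two passenger_count values are missing
sample_df = pd.DataFrame({
    'vendor_id': [1, 2, 1],
    'passenger_count': [1, np.nan, np.nan],
})

missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```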

Transforming data with a lambda

def isOverLimit(x):
	result = ''
	if x >= 1000: result = 'YES'
	else: result = 'NO'
	return result

train_df['is_over_limit'] = train_df['trip_duration'].apply(lambda x : isOverLimit(x))
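For a simple two-way condition like this one, a vectorized alternative is `np.where`, which avoids calling a Python function once per row. The sketch below uses hypothetical data and is one possible variant, not a replacement for the apply-based approach; it checks that both forms give the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
sample_df = pd.DataFrame({'trip_duration': [455, 1200, 1000]})

# Vectorized: evaluates the condition on the whole column at once
sample_df['is_over_limit'] = np.where(sample_df['trip_duration'] >= 1000, 'YES', 'NO')

# Equivalent element-wise version with apply and a lambda
via_apply = sample_df['trip_duration'].apply(lambda x: 'YES' if x >= 1000 else 'NO')
```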

References

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
