4. Making Data That Is Easy to Analyze (2)

Problem Description


Data that is hard to analyze…?

Messy data is data whose layout is centered on one particular dataset rather than on the relationships within the data.

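Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.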

Messy data has the characteristics listed above.
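As a rough illustration of the second characteristic (multiple variables stored in one column), here is a minimal sketch. The table, the column names, and the "sex_age" encoding are all hypothetical and were made up for this example.

```python
import pandas as pd

# Hypothetical messy table: the "group" column packs two variables
# (sex and age band) into a single string such as "m_0-14".
messy = pd.DataFrame({
    "country": ["AD", "AD", "AE", "AE"],
    "group": ["m_0-14", "f_15-24", "m_0-14", "f_15-24"],
    "cases": [0, 1, 2, 4],
})

# Split the packed column at the first underscore: one column per variable.
parts = messy["group"].str.split("_", n=1, expand=True)
tidy = messy.assign(sex=parts[0], age=parts[1]).drop(columns="group")
tidy = tidy[["country", "sex", "age", "cases"]]

print(tidy)
```

Once sex and age live in their own columns, each can be filtered or grouped on directly instead of being parsed out of a string every time.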

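It usually comes about for the following reasons.

• The data was not created with analysis in mind.

• Summarized and aggregated data is intuitively easier to read (hence the preference for pivot tables).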

The following is an example of bad data: its column names are values rather than variables.

year	artist.inverted	track	time	genre	date.entered	date.peaked	x1st.week	x2nd.week	x3rd.week	...	x76th.week
2000	Destiny's Child	Independent Women Part I	3:38	Rock	2000-09-23	2000-11-18		63.0	49.0	...	NaN
2000	Santana	Maria, Maria	4:18	Rock	2000-02-12	2000-04-08		8.0	6.0	...	NaN
2000	Savage Garden	I Knew I Loved You	4:07	Rock	1999-10-23	2000-01-29		48.0	43.0	...	NaN
2000	Madonna	Music	3:45	Rock	2000-08-12	2000-09-16		23.0	18.0	...	NaN
2000	Aguilera, Christina	Come On Over Baby (All I Want Is You)	3:38	Rock	2000-08-05	2000-10-14		47.0	45.0	...	NaN
2000	Janet	Doesn't Really Matter	4:17	Rock	2000-06-17	2000-08-26		52.0	43.0	...	NaN
2000	Destiny's Child	Say My Name	4:31	Rock	1999-12-25	2000-03-18		83.0	44.0	...	NaN
2000	Iglesias, Enrique	Be With You	3:36	Latin	2000-04-01	2000-06-24		45.0	34.0	...	NaN
2000	Sisqo	Incomplete	3:52	Rock	2000-06-24	2000-08-12		66.0	61.0	...	NaN
2000	Lonestar	Amazed	4:25	Country	1999-06-05	2000-03-04		54.0	44.0	...	NaN

It looks nice, but it is hard to analyze.

To tidy it, use tools such as pandas melt.

import pandas as pd

# Assumes df already holds the wide Billboard table shown above.

# Melting: turn the x1st.week ... x76th.week columns into (week, rank) rows
id_vars = [
  "year",
  "artist.inverted",
  "track",
  "time",
  "genre",
  "date.entered",
  "date.peaked"
]
df = pd.melt(frame=df, id_vars=id_vars, var_name="week", value_name="rank")

# Formatting
# Use a regular expression to extract only the number 1 from x1st.week
df["week"] = df["week"].str.extract(r"(\d+)", expand=False).astype(int)

# Drop the rows we don't need (weeks with no rank), then cast rank to int.
# The cast has to come after dropna() because NaN cannot be converted to int.
df = df.dropna().astype({"rank": int})

# Create the "date" column: chart date = entry date + (week - 1) weeks
df["date"] = pd.to_datetime(df["date.entered"]) + pd.to_timedelta((df["week"] - 1) * 7, unit="D")

df = df[["year", "artist.inverted", "track", "time", "genre", "week", "rank", "date"]]
df = df.sort_values(ascending=True, by=["year", "artist.inverted", "track", "week", "rank"])

# Assigning the tidy dataset to a variable for future usage
billboard = df

df.head(10)
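To see what the melt step does without the full Billboard file, here is a minimal self-contained sketch on a tiny hand-made table. The two tracks and all of their values are hypothetical; only the column naming pattern (x1st.week, x2nd.week, ...) follows the table above.

```python
import pandas as pd

# Hypothetical two-track chart in wide form (all values are made up).
wide = pd.DataFrame({
    "track": ["Song A", "Song B"],
    "date.entered": ["2000-01-01", "2000-02-05"],
    "x1st.week": [90, 75],
    "x2nd.week": [80.0, None],   # Song B left the chart after week 1
    "x3rd.week": [70.0, None],
})

# One row per (track, week) instead of one column per week.
tidy = pd.melt(wide, id_vars=["track", "date.entered"],
               var_name="week", value_name="rank")

# "x2nd.week" -> 2
tidy["week"] = tidy["week"].str.extract(r"(\d+)", expand=False).astype(int)

# Weeks without a rank carry no observation, so drop them before casting.
tidy = tidy.dropna(subset=["rank"]).astype({"rank": int})

# Chart date = entry date + (week - 1) weeks.
tidy["date"] = pd.to_datetime(tidy["date.entered"]) + pd.to_timedelta(
    (tidy["week"] - 1) * 7, unit="D"
)

tidy = tidy.sort_values(["track", "week"]).reset_index(drop=True)
print(tidy)
```

The six cells of the wide table become four rows here, because the two empty weeks of Song B are dropped rather than kept as NaN.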

Here is the result.

year	artist.inverted	track	time	genre	week	rank	date
246	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	1	87
563	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2	82
880	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	3	72
1197	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	4	77
1514	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	5	87
1831	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	6	94
2148	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	7	99
287	2000	2Ge+her	The Hardest Part Of Breaking Up (Is Getting Ba...	3:15	R&B	1	91
604	2000	2Ge+her	The Hardest Part Of Breaking Up (Is Getting Ba...	3:15	R&B	2	87
921	2000	2Ge+her	The Hardest Part Of Breaking Up (Is Getting Ba...	3:15	R&B	3	92
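With one row per track and week, typical questions reduce to a single groupby. The sketch below rebuilds only the first three weeks of the two tracks shown in the result table (the truncated track title is kept as printed). The summary column names last_week and best_rank are chosen for this example.

```python
import pandas as pd

# First three weeks of the two tracks from the result table above.
tidy = pd.DataFrame({
    "artist.inverted": ["2 Pac"] * 3 + ["2Ge+her"] * 3,
    "track": ["Baby Don't Cry (Keep Ya Head Up II)"] * 3
             + ["The Hardest Part Of Breaking Up (Is Getting Ba..."] * 3,
    "week": [1, 2, 3, 1, 2, 3],
    "rank": [87, 82, 72, 91, 87, 92],
})

# Per track: the last week present in this sample and the best (lowest) rank.
summary = (
    tidy.groupby(["artist.inverted", "track"])
        .agg(last_week=("week", "max"), best_rank=("rank", "min"))
        .reset_index()
)
print(summary)
```

The same summary on the wide table would need a row-wise minimum across all 76 week columns, which is the kind of work the tidy layout avoids.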