kaggle datasets download kamilpytlak/personal-key-indicators-of-heart-disease ./data --unzip
5 Subset Data
By subsetting data, it is intended all those operations used to retrieve a specific part of a dataset. For example, we could be interested in getting the last three rows of a table or the columns whose name starts with a specific keyword.
For this chapter, we will consider the Personal key indicators of heart disease Kaggle dataset.
After loading the necessary packages and importing the data, we make use of the clean_names
function of the janitor
and pyjanitor
packages. This function can be used to make lowercase and snake case the dataset column names.
The pyjanitor
module is a porting of the janitor
R package. There are other useful functions in these packages!
library(dplyr)
library(readr)
library(tidyr)
<- read_csv("data/heart_2020_cleaned.csv", show_col_types = FALSE) |>
df ::clean_names() janitor
import pandas as pd
from janitor import clean_names
= pd.read_csv("data/heart_2020_cleaned.csv").clean_names(case_type='snake') df
5.1 Filter rows
5.1.1 Top
Get the first three rows of a dataset:
head(df, n = 3)
# A tibble: 3 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 16.6 Yes No No 3
2 No 20.3 No No Yes 0
3 No 26.6 Yes No No 20
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
# age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
# gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
# skin_cancer <chr>
Synonyms
|> slice_head(n = 3) df
=3) df.head(n
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
0 No 16.60 Yes ... Yes No Yes
1 No 20.34 No ... No No No
2 No 26.58 Yes ... Yes No No
[3 rows x 18 columns]
5.1.2 Bottom
Get the last five rows of a dataset:
tail(df, n = 5)
# A tibble: 5 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Yes 27.4 Yes No No 7
2 No 29.8 Yes No No 0
3 No 24.2 No No No 0
4 No 32.8 No No No 0
5 No 46.6 No No No 0
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
# age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
# gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
# skin_cancer <chr>
Synonyms
|> slice_tail(n = 5) df
=5) df.tail(n
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
319790 Yes 27.41 Yes ... Yes No No
319791 No 29.84 Yes ... Yes No No
319792 No 24.24 No ... No No No
319793 No 32.81 No ... No No No
319794 No 46.56 No ... No No No
[5 rows x 18 columns]
5.1.3 Slicing
Get rows number seven to eleven:
7:11, ] df[
# A tibble: 5 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 21.6 No No No 15
2 No 31.6 Yes No No 5
3 No 26.4 No No No 0
4 No 40.7 No No No 0
5 Yes 34.3 Yes No No 30
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
# age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
# gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
# skin_cancer <chr>
Synonyms
|> slice(7:11) df
6:11,] df.iloc[
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
6 No 21.63 No ... Yes No Yes
7 No 31.64 Yes ... Yes No No
8 No 26.45 No ... No Yes No
9 No 40.69 No ... No No No
10 Yes 34.30 Yes ... Yes No No
[5 rows x 18 columns]
5.1.4 By column values
Filter rows by column values:
$bmi < 20) & (df$smoking == "Yes"), ] df[(df
# A tibble: 6,061 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 16.6 Yes No No 3
2 No 19.0 Yes No No 0
3 No 19.5 Yes No No 0
4 No 19.3 Yes No No 0
5 No 18.8 Yes No Yes 0
6 No 18.8 Yes Yes Yes 30
7 No 17.6 Yes No No 2
8 No 19.1 Yes No No 0
9 No 14.7 Yes No No 0
10 No 19.0 Yes No No 10
# … with 6,051 more rows, and 12 more variables: mental_health <dbl>,
# diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
# diabetic <chr>, physical_activity <chr>, gen_health <chr>,
# sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>
Synonyms
|> filter(bmi < 20, smoking == "Yes") df
"bmi"] < 20) & (df["smoking"] == "Yes"),:] df.loc[(df[
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
0 No 16.60 Yes ... Yes No Yes
33 No 19.02 Yes ... No No No
99 No 19.47 Yes ... No No No
189 No 19.31 Yes ... No No No
250 No 18.84 Yes ... No No No
... ... ... ... ... ... ... ...
318688 No 19.20 Yes ... No No No
319455 No 19.77 Yes ... No No No
319602 No 17.94 Yes ... No No No
319678 No 18.42 Yes ... Yes No No
319712 No 17.38 Yes ... No No No
[6061 rows x 18 columns]
Synonyms
'bmi < 20 & smoking == "Yes"') df.query(
5.2 Select two or more columns
5.2.1 By position
Select three columns by their position:
1:3] df[
# A tibble: 319,795 × 3
heart_disease bmi smoking
<chr> <dbl> <chr>
1 No 16.6 Yes
2 No 20.3 No
3 No 26.6 Yes
4 No 24.2 No
5 No 23.7 No
6 Yes 28.9 Yes
7 No 21.6 No
8 No 31.6 Yes
9 No 26.4 No
10 No 40.7 No
# … with 319,785 more rows
Synonyms
|> select(1:3) df
0:3] df.iloc[:,
heart_disease bmi smoking
0 No 16.60 Yes
1 No 20.34 No
2 No 26.58 Yes
3 No 24.21 No
4 No 23.71 No
... ... ... ...
319790 Yes 27.41 Yes
319791 No 29.84 Yes
319792 No 24.24 No
319793 No 32.81 No
319794 No 46.56 No
[319795 rows x 3 columns]
5.2.2 By name
Select three columns by their name:
c("heart_disease", "bmi", "smoking")] df[
# A tibble: 319,795 × 3
heart_disease bmi smoking
<chr> <dbl> <chr>
1 No 16.6 Yes
2 No 20.3 No
3 No 26.6 Yes
4 No 24.2 No
5 No 23.7 No
6 Yes 28.9 Yes
7 No 21.6 No
8 No 31.6 Yes
9 No 26.4 No
10 No 40.7 No
# … with 319,785 more rows
Synonyms
|> select(heart_disease, bmi, smoking)
df |> select(heart_disease:smoking) df
"heart_disease", "bmi", "smoking"]] df.loc[:, [
heart_disease bmi smoking
0 No 16.60 Yes
1 No 20.34 No
2 No 26.58 Yes
3 No 24.21 No
4 No 23.71 No
... ... ... ...
319790 Yes 27.41 Yes
319791 No 29.84 Yes
319792 No 24.24 No
319793 No 32.81 No
319794 No 46.56 No
[319795 rows x 3 columns]
Synonyms
"heart_disease":"smoking"] df.loc[:,
5.2.3 By pattern
Select columns by a regular expression or pattern in their name:
|> select(ends_with("health")) df
# A tibble: 319,795 × 3
physical_health mental_health gen_health
<dbl> <dbl> <chr>
1 3 30 Very good
2 0 0 Very good
3 20 30 Fair
4 0 0 Good
5 28 0 Very good
6 6 0 Fair
7 15 0 Fair
8 5 0 Good
9 0 0 Fair
10 0 0 Good
# … with 319,785 more rows
filter(like="health", axis=1) df.
physical_health mental_health gen_health
0 3.0 30.0 Very good
1 0.0 0.0 Very good
2 20.0 30.0 Fair
3 0.0 0.0 Good
4 28.0 0.0 Very good
... ... ... ...
319790 7.0 0.0 Fair
319791 0.0 0.0 Very good
319792 0.0 0.0 Good
319793 0.0 0.0 Good
319794 0.0 0.0 Good
[319795 rows x 3 columns]
5.2.4 By type
Select columns based on their data type:
|> select(where(is.double)) df
# A tibble: 319,795 × 4
bmi physical_health mental_health sleep_time
<dbl> <dbl> <dbl> <dbl>
1 16.6 3 30 5
2 20.3 0 0 7
3 26.6 20 30 8
4 24.2 0 0 6
5 23.7 28 0 8
6 28.9 6 0 12
7 21.6 15 0 4
8 31.6 5 0 9
9 26.4 0 0 5
10 40.7 0 0 10
# … with 319,785 more rows
float) df.select_dtypes(
bmi physical_health mental_health sleep_time
0 16.60 3.0 30.0 5.0
1 20.34 0.0 0.0 7.0
2 26.58 20.0 30.0 8.0
3 24.21 0.0 0.0 6.0
4 23.71 28.0 0.0 8.0
... ... ... ... ...
319790 27.41 7.0 0.0 6.0
319791 29.84 0.0 0.0 5.0
319792 24.24 0.0 0.0 6.0
319793 32.81 0.0 0.0 12.0
319794 46.56 0.0 0.0 8.0
[319795 rows x 4 columns]
5.3 Remove columns
Remove one column:
|> select(-bmi) df
# A tibble: 319,795 × 17
heart_disease smoking alcohol_drinking stroke physical_health mental_health
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 No Yes No No 3 30
2 No No No Yes 0 0
3 No Yes No No 20 30
4 No No No No 0 0
5 No No No No 28 0
6 Yes Yes No No 6 0
7 No No No No 15 0
8 No Yes No No 5 0
9 No No No No 0 0
10 No No No No 0 0
# … with 319,785 more rows, and 11 more variables: diff_walking <chr>,
# sex <chr>, age_category <chr>, race <chr>, diabetic <chr>,
# physical_activity <chr>, gen_health <chr>, sleep_time <dbl>, asthma <chr>,
# kidney_disease <chr>, skin_cancer <chr>
"bmi", axis=1) df.drop(
heart_disease smoking alcohol_drinking ... asthma kidney_disease skin_cancer
0 No Yes No ... Yes No Yes
1 No No No ... No No No
2 No Yes No ... Yes No No
3 No No No ... No No Yes
4 No No No ... No No No
... ... ... ... ... ... ... ...
319790 Yes Yes No ... Yes No No
319791 No Yes No ... Yes No No
319792 No No No ... No No No
319793 No No No ... No No No
319794 No No No ... No No No
[319795 rows x 17 columns]
Remove more than one column:
|> select(-c(bmi, heart_disease)) df
# A tibble: 319,795 × 16
smoking alcohol_drinking stroke physical_health mental_health diff_walking
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Yes No No 3 30 No
2 No No Yes 0 0 No
3 Yes No No 20 30 No
4 No No No 0 0 No
5 No No No 28 0 Yes
6 Yes No No 6 0 Yes
7 No No No 15 0 No
8 Yes No No 5 0 Yes
9 No No No 0 0 No
10 No No No 0 0 Yes
# … with 319,785 more rows, and 10 more variables: sex <chr>,
# age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
# gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
# skin_cancer <chr>
"bmi", "heart_disease"], axis=1) df.drop([
smoking alcohol_drinking stroke ... asthma kidney_disease skin_cancer
0 Yes No No ... Yes No Yes
1 No No Yes ... No No No
2 Yes No No ... Yes No No
3 No No No ... No No Yes
4 No No No ... No No No
... ... ... ... ... ... ... ...
319790 Yes No No ... Yes No No
319791 Yes No No ... Yes No No
319792 No No No ... No No No
319793 No No No ... No No No
319794 No No No ... No No No
[319795 rows x 16 columns]
5.4 Sorting
5.4.1 Ascending
Sort a dataset by column values in ascending order:
|> arrange(sex) df
# A tibble: 319,795 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 16.6 Yes No No 3
2 No 20.3 No No Yes 0
3 No 24.2 No No No 0
4 No 23.7 No No No 28
5 Yes 28.9 Yes No No 6
6 No 21.6 No No No 15
7 No 31.6 Yes No No 5
8 No 26.4 No No No 0
9 No 28.7 Yes No No 0
10 No 28.2 No No No 7
# … with 319,785 more rows, and 12 more variables: mental_health <dbl>,
# diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
# diabetic <chr>, physical_activity <chr>, gen_health <chr>,
# sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>
"sex") df.sort_values(
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
0 No 16.60 Yes ... Yes No Yes
179722 No 29.29 No ... Yes No No
179725 Yes 36.58 Yes ... No No Yes
179728 No 24.96 No ... No No No
179730 No 25.06 Yes ... No No No
... ... ... ... ... ... ... ...
139706 No 42.91 Yes ... No No No
94088 Yes 38.39 Yes ... No No No
212147 No 37.59 No ... No No No
56348 No 32.89 Yes ... Yes No Yes
109843 No 37.03 Yes ... No No No
[319795 rows x 18 columns]
5.4.2 Descending
Sort a dataset by column values in descending order:
|> arrange(-bmi) df
# A tibble: 319,795 × 18
heart_disease bmi smoking alcohol_drinking stroke physical_health
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 94.8 No No No 0
2 No 94.7 No No No 4
3 No 94.0 Yes No No 20
4 No 93.9 Yes Yes No 30
5 No 92.5 Yes No No 7
6 No 91.8 No No No 0
7 No 91.6 Yes No No 0
8 No 91.6 No No No 0
9 No 88.6 No No No 30
10 No 88.2 No No No 0
# … with 319,785 more rows, and 12 more variables: mental_health <dbl>,
# diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
# diabetic <chr>, physical_activity <chr>, gen_health <chr>,
# sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>
"bmi", ascending=False) df.sort_values(
heart_disease bmi smoking ... asthma kidney_disease skin_cancer
126896 No 94.85 No ... No No No
242834 No 94.66 No ... No No No
104267 No 93.97 Yes ... No No No
249715 No 93.86 Yes ... Yes Yes No
156093 No 92.53 Yes ... Yes No No
... ... ... ... ... ... ... ...
81754 No 12.16 No ... No No No
51637 No 12.13 No ... No No No
113373 No 12.08 Yes ... No No No
69662 No 12.02 Yes ... No No No
205511 No 12.02 No ... No No No
[319795 rows x 18 columns]