5 Subset Data

By subsetting data, it is intended all those operations used to retrieve a specific part of a dataset. For example, we could be interested in getting the last three rows of a table or the columns whose name starts with a specific keyword.

For this chapter, we will consider the Personal key indicators of heart disease Kaggle dataset.

kaggle datasets download kamilpytlak/personal-key-indicators-of-heart-disease ./data --unzip

After loading the necessary packages and importing the data, we make use of the clean_names function of the janitor and pyjanitor packages. This function can be used to make lowercase and snake case the dataset column names.

The janitor package

The pyjanitor module is a porting of the janitor R package. There are other useful functions in these packages!

R
Python

library(dplyr)
library(readr)
library(tidyr)

df <- read_csv("data/heart_2020_cleaned.csv", show_col_types = FALSE) |> 
    janitor::clean_names()

import pandas as pd
from janitor import clean_names

df = pd.read_csv("data/heart_2020_cleaned.csv").clean_names(case_type='snake')

5.1 Filter rows

5.1.1 Top

Get the first three rows of a dataset:

R
Python

head(df, n = 3)

# A tibble: 3 × 18
  heart_disease   bmi smoking alcohol_drinking stroke physical_health
  <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
1 No             16.6 Yes     No               No                   3
2 No             20.3 No      No               Yes                  0
3 No             26.6 Yes     No               No                  20
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
#   age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
#   gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
#   skin_cancer <chr>

Synonyms

df |> slice_head(n = 3)

df.head(n=3)

  heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
0            No  16.60     Yes  ...    Yes             No          Yes
1            No  20.34      No  ...     No             No           No
2            No  26.58     Yes  ...    Yes             No           No

[3 rows x 18 columns]

5.1.2 Bottom

Get the last five rows of a dataset:

R
Python

tail(df, n = 5)

# A tibble: 5 × 18
  heart_disease   bmi smoking alcohol_drinking stroke physical_health
  <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
1 Yes            27.4 Yes     No               No                   7
2 No             29.8 Yes     No               No                   0
3 No             24.2 No      No               No                   0
4 No             32.8 No      No               No                   0
5 No             46.6 No      No               No                   0
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
#   age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
#   gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
#   skin_cancer <chr>

Synonyms

df |> slice_tail(n = 5)

df.tail(n=5)

       heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
319790           Yes  27.41     Yes  ...    Yes             No           No
319791            No  29.84     Yes  ...    Yes             No           No
319792            No  24.24      No  ...     No             No           No
319793            No  32.81      No  ...     No             No           No
319794            No  46.56      No  ...     No             No           No

[5 rows x 18 columns]

5.1.3 Slicing

Get rows number seven to eleven:

R
Python

df[7:11, ]

# A tibble: 5 × 18
  heart_disease   bmi smoking alcohol_drinking stroke physical_health
  <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
1 No             21.6 No      No               No                  15
2 No             31.6 Yes     No               No                   5
3 No             26.4 No      No               No                   0
4 No             40.7 No      No               No                   0
5 Yes            34.3 Yes     No               No                  30
# … with 12 more variables: mental_health <dbl>, diff_walking <chr>, sex <chr>,
#   age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
#   gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
#   skin_cancer <chr>

Synonyms

df |> slice(7:11)

df.iloc[6:11,]

   heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
6             No  21.63      No  ...    Yes             No          Yes
7             No  31.64     Yes  ...    Yes             No           No
8             No  26.45      No  ...     No            Yes           No
9             No  40.69      No  ...     No             No           No
10           Yes  34.30     Yes  ...    Yes             No           No

[5 rows x 18 columns]

5.1.4 By column values

Filter rows by column values:

R
Python

df[(df$bmi < 20) & (df$smoking == "Yes"), ]

# A tibble: 6,061 × 18
   heart_disease   bmi smoking alcohol_drinking stroke physical_health
   <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
 1 No             16.6 Yes     No               No                   3
 2 No             19.0 Yes     No               No                   0
 3 No             19.5 Yes     No               No                   0
 4 No             19.3 Yes     No               No                   0
 5 No             18.8 Yes     No               Yes                  0
 6 No             18.8 Yes     Yes              Yes                 30
 7 No             17.6 Yes     No               No                   2
 8 No             19.1 Yes     No               No                   0
 9 No             14.7 Yes     No               No                   0
10 No             19.0 Yes     No               No                  10
# … with 6,051 more rows, and 12 more variables: mental_health <dbl>,
#   diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
#   diabetic <chr>, physical_activity <chr>, gen_health <chr>,
#   sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>

Synonyms

df |> filter(bmi < 20, smoking == "Yes")

df.loc[(df["bmi"] < 20) & (df["smoking"] == "Yes"),:]

       heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
0                 No  16.60     Yes  ...    Yes             No          Yes
33                No  19.02     Yes  ...     No             No           No
99                No  19.47     Yes  ...     No             No           No
189               No  19.31     Yes  ...     No             No           No
250               No  18.84     Yes  ...     No             No           No
...              ...    ...     ...  ...    ...            ...          ...
318688            No  19.20     Yes  ...     No             No           No
319455            No  19.77     Yes  ...     No             No           No
319602            No  17.94     Yes  ...     No             No           No
319678            No  18.42     Yes  ...    Yes             No           No
319712            No  17.38     Yes  ...     No             No           No

[6061 rows x 18 columns]

Synonyms

df.query('bmi < 20 & smoking == "Yes"')

5.2 Select two or more columns

5.2.1 By position

Select three columns by their position:

R
Python

df[1:3]

# A tibble: 319,795 × 3
   heart_disease   bmi smoking
   <chr>         <dbl> <chr>  
 1 No             16.6 Yes    
 2 No             20.3 No     
 3 No             26.6 Yes    
 4 No             24.2 No     
 5 No             23.7 No     
 6 Yes            28.9 Yes    
 7 No             21.6 No     
 8 No             31.6 Yes    
 9 No             26.4 No     
10 No             40.7 No     
# … with 319,785 more rows

Synonyms

df |> select(1:3)

df.iloc[:,0:3]

       heart_disease    bmi smoking
0                 No  16.60     Yes
1                 No  20.34      No
2                 No  26.58     Yes
3                 No  24.21      No
4                 No  23.71      No
...              ...    ...     ...
319790           Yes  27.41     Yes
319791            No  29.84     Yes
319792            No  24.24      No
319793            No  32.81      No
319794            No  46.56      No

[319795 rows x 3 columns]

5.2.2 By name

Select three columns by their name:

R
Python

df[c("heart_disease", "bmi", "smoking")]

# A tibble: 319,795 × 3
   heart_disease   bmi smoking
   <chr>         <dbl> <chr>  
 1 No             16.6 Yes    
 2 No             20.3 No     
 3 No             26.6 Yes    
 4 No             24.2 No     
 5 No             23.7 No     
 6 Yes            28.9 Yes    
 7 No             21.6 No     
 8 No             31.6 Yes    
 9 No             26.4 No     
10 No             40.7 No     
# … with 319,785 more rows

Synonyms

df |> select(heart_disease, bmi, smoking)
df |> select(heart_disease:smoking)

df.loc[:, ["heart_disease", "bmi", "smoking"]]

       heart_disease    bmi smoking
0                 No  16.60     Yes
1                 No  20.34      No
2                 No  26.58     Yes
3                 No  24.21      No
4                 No  23.71      No
...              ...    ...     ...
319790           Yes  27.41     Yes
319791            No  29.84     Yes
319792            No  24.24      No
319793            No  32.81      No
319794            No  46.56      No

[319795 rows x 3 columns]

Synonyms

df.loc[:, "heart_disease":"smoking"]

5.2.3 By pattern

Select columns by a regular expression or pattern in their name:

R
Python

df |> select(ends_with("health"))

# A tibble: 319,795 × 3
   physical_health mental_health gen_health
             <dbl>         <dbl> <chr>     
 1               3            30 Very good 
 2               0             0 Very good 
 3              20            30 Fair      
 4               0             0 Good      
 5              28             0 Very good 
 6               6             0 Fair      
 7              15             0 Fair      
 8               5             0 Good      
 9               0             0 Fair      
10               0             0 Good      
# … with 319,785 more rows

df.filter(like="health", axis=1)

        physical_health  mental_health gen_health
0                   3.0           30.0  Very good
1                   0.0            0.0  Very good
2                  20.0           30.0       Fair
3                   0.0            0.0       Good
4                  28.0            0.0  Very good
...                 ...            ...        ...
319790              7.0            0.0       Fair
319791              0.0            0.0  Very good
319792              0.0            0.0       Good
319793              0.0            0.0       Good
319794              0.0            0.0       Good

[319795 rows x 3 columns]

5.2.4 By type

Select columns based on their data type:

R
Python

df |> select(where(is.double))

# A tibble: 319,795 × 4
     bmi physical_health mental_health sleep_time
   <dbl>           <dbl>         <dbl>      <dbl>
 1  16.6               3            30          5
 2  20.3               0             0          7
 3  26.6              20            30          8
 4  24.2               0             0          6
 5  23.7              28             0          8
 6  28.9               6             0         12
 7  21.6              15             0          4
 8  31.6               5             0          9
 9  26.4               0             0          5
10  40.7               0             0         10
# … with 319,785 more rows

df.select_dtypes(float)

          bmi  physical_health  mental_health  sleep_time
0       16.60              3.0           30.0         5.0
1       20.34              0.0            0.0         7.0
2       26.58             20.0           30.0         8.0
3       24.21              0.0            0.0         6.0
4       23.71             28.0            0.0         8.0
...       ...              ...            ...         ...
319790  27.41              7.0            0.0         6.0
319791  29.84              0.0            0.0         5.0
319792  24.24              0.0            0.0         6.0
319793  32.81              0.0            0.0        12.0
319794  46.56              0.0            0.0         8.0

[319795 rows x 4 columns]

5.3 Remove columns

Remove one column:

R
Python

df |> select(-bmi)

# A tibble: 319,795 × 17
   heart_disease smoking alcohol_drinking stroke physical_health mental_health
   <chr>         <chr>   <chr>            <chr>            <dbl>         <dbl>
 1 No            Yes     No               No                   3            30
 2 No            No      No               Yes                  0             0
 3 No            Yes     No               No                  20            30
 4 No            No      No               No                   0             0
 5 No            No      No               No                  28             0
 6 Yes           Yes     No               No                   6             0
 7 No            No      No               No                  15             0
 8 No            Yes     No               No                   5             0
 9 No            No      No               No                   0             0
10 No            No      No               No                   0             0
# … with 319,785 more rows, and 11 more variables: diff_walking <chr>,
#   sex <chr>, age_category <chr>, race <chr>, diabetic <chr>,
#   physical_activity <chr>, gen_health <chr>, sleep_time <dbl>, asthma <chr>,
#   kidney_disease <chr>, skin_cancer <chr>

df.drop("bmi", axis=1)

       heart_disease smoking alcohol_drinking  ... asthma  kidney_disease  skin_cancer
0                 No     Yes               No  ...    Yes              No          Yes
1                 No      No               No  ...     No              No           No
2                 No     Yes               No  ...    Yes              No           No
3                 No      No               No  ...     No              No          Yes
4                 No      No               No  ...     No              No           No
...              ...     ...              ...  ...    ...             ...          ...
319790           Yes     Yes               No  ...    Yes              No           No
319791            No     Yes               No  ...    Yes              No           No
319792            No      No               No  ...     No              No           No
319793            No      No               No  ...     No              No           No
319794            No      No               No  ...     No              No           No

[319795 rows x 17 columns]

Remove more than one column:

R
Python

df |> select(-c(bmi, heart_disease))

# A tibble: 319,795 × 16
   smoking alcohol_drinking stroke physical_health mental_health diff_walking
   <chr>   <chr>            <chr>            <dbl>         <dbl> <chr>       
 1 Yes     No               No                   3            30 No          
 2 No      No               Yes                  0             0 No          
 3 Yes     No               No                  20            30 No          
 4 No      No               No                   0             0 No          
 5 No      No               No                  28             0 Yes         
 6 Yes     No               No                   6             0 Yes         
 7 No      No               No                  15             0 No          
 8 Yes     No               No                   5             0 Yes         
 9 No      No               No                   0             0 No          
10 No      No               No                   0             0 Yes         
# … with 319,785 more rows, and 10 more variables: sex <chr>,
#   age_category <chr>, race <chr>, diabetic <chr>, physical_activity <chr>,
#   gen_health <chr>, sleep_time <dbl>, asthma <chr>, kidney_disease <chr>,
#   skin_cancer <chr>

df.drop(["bmi", "heart_disease"], axis=1)

       smoking alcohol_drinking stroke  ...  asthma  kidney_disease skin_cancer
0          Yes               No     No  ...     Yes              No         Yes
1           No               No    Yes  ...      No              No          No
2          Yes               No     No  ...     Yes              No          No
3           No               No     No  ...      No              No         Yes
4           No               No     No  ...      No              No          No
...        ...              ...    ...  ...     ...             ...         ...
319790     Yes               No     No  ...     Yes              No          No
319791     Yes               No     No  ...     Yes              No          No
319792      No               No     No  ...      No              No          No
319793      No               No     No  ...      No              No          No
319794      No               No     No  ...      No              No          No

[319795 rows x 16 columns]

5.4 Sorting

5.4.1 Ascending

Sort a dataset by column values in ascending order:

R
Python

df |> arrange(sex)

# A tibble: 319,795 × 18
   heart_disease   bmi smoking alcohol_drinking stroke physical_health
   <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
 1 No             16.6 Yes     No               No                   3
 2 No             20.3 No      No               Yes                  0
 3 No             24.2 No      No               No                   0
 4 No             23.7 No      No               No                  28
 5 Yes            28.9 Yes     No               No                   6
 6 No             21.6 No      No               No                  15
 7 No             31.6 Yes     No               No                   5
 8 No             26.4 No      No               No                   0
 9 No             28.7 Yes     No               No                   0
10 No             28.2 No      No               No                   7
# … with 319,785 more rows, and 12 more variables: mental_health <dbl>,
#   diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
#   diabetic <chr>, physical_activity <chr>, gen_health <chr>,
#   sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>

df.sort_values("sex")

       heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
0                 No  16.60     Yes  ...    Yes             No          Yes
179722            No  29.29      No  ...    Yes             No           No
179725           Yes  36.58     Yes  ...     No             No          Yes
179728            No  24.96      No  ...     No             No           No
179730            No  25.06     Yes  ...     No             No           No
...              ...    ...     ...  ...    ...            ...          ...
139706            No  42.91     Yes  ...     No             No           No
94088            Yes  38.39     Yes  ...     No             No           No
212147            No  37.59      No  ...     No             No           No
56348             No  32.89     Yes  ...    Yes             No          Yes
109843            No  37.03     Yes  ...     No             No           No

[319795 rows x 18 columns]

5.4.2 Descending

Sort a dataset by column values in descending order:

R
Python

df |> arrange(-bmi)

# A tibble: 319,795 × 18
   heart_disease   bmi smoking alcohol_drinking stroke physical_health
   <chr>         <dbl> <chr>   <chr>            <chr>            <dbl>
 1 No             94.8 No      No               No                   0
 2 No             94.7 No      No               No                   4
 3 No             94.0 Yes     No               No                  20
 4 No             93.9 Yes     Yes              No                  30
 5 No             92.5 Yes     No               No                   7
 6 No             91.8 No      No               No                   0
 7 No             91.6 Yes     No               No                   0
 8 No             91.6 No      No               No                   0
 9 No             88.6 No      No               No                  30
10 No             88.2 No      No               No                   0
# … with 319,785 more rows, and 12 more variables: mental_health <dbl>,
#   diff_walking <chr>, sex <chr>, age_category <chr>, race <chr>,
#   diabetic <chr>, physical_activity <chr>, gen_health <chr>,
#   sleep_time <dbl>, asthma <chr>, kidney_disease <chr>, skin_cancer <chr>

df.sort_values("bmi", ascending=False)

       heart_disease    bmi smoking  ... asthma kidney_disease  skin_cancer
126896            No  94.85      No  ...     No             No           No
242834            No  94.66      No  ...     No             No           No
104267            No  93.97     Yes  ...     No             No           No
249715            No  93.86     Yes  ...    Yes            Yes           No
156093            No  92.53     Yes  ...    Yes             No           No
...              ...    ...     ...  ...    ...            ...          ...
81754             No  12.16      No  ...     No             No           No
51637             No  12.13      No  ...     No             No           No
113373            No  12.08     Yes  ...     No             No           No
69662             No  12.02     Yes  ...     No             No           No
205511            No  12.02      No  ...     No             No           No

[319795 rows x 18 columns]