kaggle datasets download kamilpytlak/personal-key-indicators-of-heart-disease -p ./data --unzip
2 Structured Data
The most common tabular (or structured) file formats one may encounter in a data science project are:
- delimited files such as CSV (comma-separated values) and TSV (tab-separated values) files;
- Excel files.
We will consider the Personal Key Indicators of Heart Disease dataset from Kaggle.
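Assuming the Kaggle archive above has been downloaded and unzipped into ./data, a minimal pandas sketch for loading it could look as follows (the file name heart_2020_cleaned.csv is our assumption about the archive's contents and may differ):

```python
import pandas as pd

# Assumed location/name of the unzipped Kaggle file; adjust if the
# archive contains a differently named CSV.
heart = pd.read_csv("data/heart_2020_cleaned.csv")
heart.head()
```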
library(readr)
import pandas as pd
2.1 Delimited files
2.1.1 TSV
As an example, let’s consider the Animal Crossing dataset from week 19 of the 2020 TidyTuesday challenge.
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
df <- read_tsv(url, show_col_types = FALSE)
df
# A tibble: 107 × 4
grade publication text date
<dbl> <chr> <chr> <date>
1 100 Pocket Gamer UK Animal Crossing; New Horizons, much like i… 2020-03-16
2 100 Forbes Know that if you’re overwhelmed with the w… 2020-03-16
3 100 Telegraph With a game this broad and lengthy, there’… 2020-03-16
4 100 VG247 Animal Crossing: New Horizons is everythin… 2020-03-16
5 100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
6 100 Trusted Reviews Animal Crossing: New Horizons is the best … 2020-03-16
7 100 VGC Nintendo's comforting life sim is a tranqu… 2020-03-16
8 100 God is a Geek A beautiful, welcoming game that is everyt… 2020-03-16
9 100 Nintendo Life Animal Crossing: New Horizons takes Animal… 2020-03-16
10 100 Daily Star Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
df = pd.read_table(url)
df
     grade  ...        date
0 100 ... 2020-03-16
1 100 ... 2020-03-16
2 100 ... 2020-03-16
3 100 ... 2020-03-16
4 100 ... 2020-03-16
.. ... ... ...
102 90 ... 2020-04-16
103 90 ... 2020-04-17
104 95 ... 2020-04-22
105 90 ... 2020-05-01
106 80 ... 2020-05-01
[107 rows x 4 columns]
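Note that pandas keeps the date column as plain strings unless told otherwise (see the df.info() output in Section 2.3). A minimal variation of the call above, assuming the same URL, that parses it at read time:

```python
# Same read, but parse the ISO-formatted date column into datetimes
df_dates = pd.read_table(url, parse_dates=["date"])
df_dates.dtypes
```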
2.1.2 CSV
As an example, let’s consider the Indoor Pollution dataset from week 15 of the 2022 TidyTuesday challenge.
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"
read_csv(url, show_col_types = FALSE)
# A tibble: 8,010 × 4
Entity Code Year `Deaths - Cause: All causes - Risk: Household air p…`
<chr> <chr> <dbl> <dbl>
1 Afghanistan AFG 1990 19.6
2 Afghanistan AFG 1991 19.3
3 Afghanistan AFG 1992 19.5
4 Afghanistan AFG 1993 19.7
5 Afghanistan AFG 1994 19.4
6 Afghanistan AFG 1995 19.6
7 Afghanistan AFG 1996 19.8
8 Afghanistan AFG 1997 19.7
9 Afghanistan AFG 1998 19.0
10 Afghanistan AFG 1999 19.9
# … with 8,000 more rows
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"
pd.read_csv(url)
            Entity  ...  Deaths - Cause: All causes - Risk: Household air pollution from solid fuels - Sex: Both - Age: Age-standardized (Percent)
0 Afghanistan ... 19.623001
1 Afghanistan ... 19.335193
2 Afghanistan ... 19.508785
3 Afghanistan ... 19.677607
4 Afghanistan ... 19.432528
... ... ... ...
8005 Zimbabwe ... 8.603630
8006 Zimbabwe ... 8.656817
8007 Zimbabwe ... 8.690100
8008 Zimbabwe ... 8.736245
8009 Zimbabwe ... 8.658818
[8010 rows x 4 columns]
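The fourth column carries a very long name. A common follow-up step is to rename it right after reading; the short name deaths_pct below is our own choice:

```python
# Rename the long indicator column to something easier to type;
# "deaths_pct" is an arbitrary short name chosen here.
pollution = pd.read_csv(url)
long_name = pollution.columns[3]
pollution = pollution.rename(columns={long_name: "deaths_pct"})
pollution.head()
```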
2.1.3 General delimiter
Again, we can consider the Animal Crossing dataset from TidyTuesday and import the data with a general delimited-file reader (read_delim in readr, read_table in pandas), specifying the separator/delimiter explicitly.
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
read_delim(url, delim = "\t", show_col_types = FALSE)
# A tibble: 107 × 4
grade publication text date
<dbl> <chr> <chr> <date>
1 100 Pocket Gamer UK Animal Crossing; New Horizons, much like i… 2020-03-16
2 100 Forbes Know that if you’re overwhelmed with the w… 2020-03-16
3 100 Telegraph With a game this broad and lengthy, there’… 2020-03-16
4 100 VG247 Animal Crossing: New Horizons is everythin… 2020-03-16
5 100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
6 100 Trusted Reviews Animal Crossing: New Horizons is the best … 2020-03-16
7 100 VGC Nintendo's comforting life sim is a tranqu… 2020-03-16
8 100 God is a Geek A beautiful, welcoming game that is everyt… 2020-03-16
9 100 Nintendo Life Animal Crossing: New Horizons takes Animal… 2020-03-16
10 100 Daily Star Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
pd.read_table(url, sep="\t")
     grade  ...        date
0 100 ... 2020-03-16
1 100 ... 2020-03-16
2 100 ... 2020-03-16
3 100 ... 2020-03-16
4 100 ... 2020-03-16
.. ... ... ...
102 90 ... 2020-04-16
103 90 ... 2020-04-17
104 95 ... 2020-04-22
105 90 ... 2020-05-01
106 80 ... 2020-05-01
[107 rows x 4 columns]
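The same call handles any single-character delimiter. As a small self-contained sketch (the in-memory string below is made up for illustration), a semicolon-separated file only requires changing the separator:

```python
from io import StringIO

# A tiny made-up semicolon-delimited file, read from memory
raw = "grade;publication\n100;Forbes\n95;VG247\n"
pd.read_table(StringIO(raw), sep=";")
```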
2.2 Excel files
At the moment, readxl does not support reading files directly from URLs. Hence, we will consider an example dataset that ships with the readxl package.
library(readxl)
path_to_ds <- readxl_example("clippy.xlsx")
read_xlsx(path_to_ds)
# A tibble: 4 × 2
name value
<chr> <chr>
1 Name Clippy
2 Species paperclip
3 Approx date of death 39083
4 Weight in grams 0.9
pd.read_excel(r.path_to_ds)
                   name                value
0 Name Clippy
1 Species paperclip
2 Approx date of death 2007-01-01 00:00:00
3 Weight in grams 0.9
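Both readers also accept a sheet argument when a workbook contains several sheets (sheet in readxl, sheet_name in pandas). A sketch in pandas, reusing the r. convention above for accessing the R variable path_to_ds; selecting sheet 0 is equivalent to the default:

```python
# List the available sheet names, then read the first sheet explicitly
pd.ExcelFile(r.path_to_ds).sheet_names
pd.read_excel(r.path_to_ds, sheet_name=0)
```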
2.3 Summarize your data
The skimr package can be very useful for quickly summarizing your data!
skimr::skim(df)
| Name | df |
| Number of rows | 107 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| publication | 0 | 1 | 2 | 29 | 0 | 107 | 0 |
| text | 0 | 1 | 22 | 833 | 0 | 107 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-03-16 | 2020-05-01 | 2020-03-23 | 28 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| grade | 0 | 1 | 90.64 | 6.11 | 70 | 90 | 90 | 94 | 100 | ▁▂▁▇▃ |
summary(df)
     grade        publication            text                date           
Min. : 70.00 Length:107 Length:107 Min. :2020-03-16
1st Qu.: 90.00 Class :character Class :character 1st Qu.:2020-03-16
Median : 90.00 Mode :character Mode :character Median :2020-03-23
Mean : 90.64 Mean :2020-03-25
3rd Qu.: 94.00 3rd Qu.:2020-04-02
Max. :100.00 Max. :2020-05-01
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 grade 107 non-null int64
1 publication 107 non-null object
2 text 107 non-null object
3 date 107 non-null object
dtypes: int64(1), object(3)
memory usage: 3.5+ KB
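As the output above shows, date is stored as object (i.e. strings). If it was not parsed at read time, it can be converted afterwards; a quick sketch:

```python
# Convert the date strings to datetimes without modifying df in place
dates = pd.to_datetime(df["date"])
dates.min(), dates.max()
```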
df.describe()
            grade
count 107.000000
mean 90.635514
std 6.114308
min 70.000000
25% 90.000000
50% 90.000000
75% 94.000000
max 100.000000
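By default describe() only covers the numeric columns (here, grade). To include the character columns as well, one can pass include="all":

```python
# Also summarize non-numeric columns (count, unique, top, freq)
df.describe(include="all")
```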