kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease -p ./data --unzip
2 Structured Data
The most common tabular (or structured) file formats one may encounter in a data science project are:
- delimited files such as CSV (comma-separated values) and TSV (tab-separated values) files;
- Excel files.
We will consider the Personal Key Indicators of Heart Disease dataset from Kaggle.
library(readr)
import pandas as pd
2.1 Delimited files
2.1.1 TSV
As an example, let’s consider the Animal Crossing dataset from week 19 of the 2020 TidyTuesday challenge.
<- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
url
<- read_tsv(url, show_col_types = FALSE)
df df
# A tibble: 107 × 4
grade publication text date
<dbl> <chr> <chr> <date>
1 100 Pocket Gamer UK Animal Crossing; New Horizons, much like i… 2020-03-16
2 100 Forbes Know that if you’re overwhelmed with the w… 2020-03-16
3 100 Telegraph With a game this broad and lengthy, there’… 2020-03-16
4 100 VG247 Animal Crossing: New Horizons is everythin… 2020-03-16
5 100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
6 100 Trusted Reviews Animal Crossing: New Horizons is the best … 2020-03-16
7 100 VGC Nintendo's comforting life sim is a tranqu… 2020-03-16
8 100 God is a Geek A beautiful, welcoming game that is everyt… 2020-03-16
9 100 Nintendo Life Animal Crossing: New Horizons takes Animal… 2020-03-16
10 100 Daily Star Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows
= "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
url
= pd.read_table(url)
df df
grade ... date
0 100 ... 2020-03-16
1 100 ... 2020-03-16
2 100 ... 2020-03-16
3 100 ... 2020-03-16
4 100 ... 2020-03-16
.. ... ... ...
102 90 ... 2020-04-16
103 90 ... 2020-04-17
104 95 ... 2020-04-22
105 90 ... 2020-05-01
106 80 ... 2020-05-01
[107 rows x 4 columns]
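By default, read_tsv() and pd.read_table() guess the column types; here pandas keeps date as a plain object column (see df.info() further below), while readr parses it as a date. If the guessed types are not what you want, they can be declared explicitly. A minimal sketch in pandas, where df_typed is just a name chosen for this example:
# declare dtypes and date columns up front instead of relying on type inference
df_typed = pd.read_table(url, dtype={"grade": "int64"}, parse_dates=["date"])
df_typed.dtypes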
2.1.2 CSV
As an example, let’s consider the Indoor Pollution dataset from week 15 of the 2022 TidyTuesday challenge.
<- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"
url
read_csv(url, show_col_types = FALSE)
# A tibble: 8,010 × 4
Entity Code Year `Deaths - Cause: All causes - Risk: Household air p…`
<chr> <chr> <dbl> <dbl>
1 Afghanistan AFG 1990 19.6
2 Afghanistan AFG 1991 19.3
3 Afghanistan AFG 1992 19.5
4 Afghanistan AFG 1993 19.7
5 Afghanistan AFG 1994 19.4
6 Afghanistan AFG 1995 19.6
7 Afghanistan AFG 1996 19.8
8 Afghanistan AFG 1997 19.7
9 Afghanistan AFG 1998 19.0
10 Afghanistan AFG 1999 19.9
# … with 8,000 more rows
= "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"
url
pd.read_csv(url)
Entity ... Deaths - Cause: All causes - Risk: Household air pollution from solid fuels - Sex: Both - Age: Age-standardized (Percent)
0 Afghanistan ... 19.623001
1 Afghanistan ... 19.335193
2 Afghanistan ... 19.508785
3 Afghanistan ... 19.677607
4 Afghanistan ... 19.432528
... ... ... ...
8005 Zimbabwe ... 8.603630
8006 Zimbabwe ... 8.656817
8007 Zimbabwe ... 8.690100
8008 Zimbabwe ... 8.736245
8009 Zimbabwe ... 8.658818
[8010 rows x 4 columns]
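The fourth column carries a very long name (the full description of the death-rate indicator), which is awkward to type. A short alias makes downstream code easier to read; a minimal sketch, where df_pollution and death_rate are names chosen here rather than provided by the dataset (in R, dplyr::rename() does the same job):
df_pollution = pd.read_csv(url)
# give the long fourth column a short alias
df_pollution = df_pollution.rename(columns={df_pollution.columns[3]: "death_rate"})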
2.1.3 General delimiter
Again, we can consider the Animal Crossing dataset from TidyTuesday and import it with a general-purpose reader, read_delim() in R or pd.read_table() in Python, specifying the separator/delimiter explicitly.
<- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
url
read_delim(url, delim ="\t", show_col_types = FALSE)
# A tibble: 107 × 4
grade publication text date
<dbl> <chr> <chr> <date>
1 100 Pocket Gamer UK Animal Crossing; New Horizons, much like i… 2020-03-16
2 100 Forbes Know that if you’re overwhelmed with the w… 2020-03-16
3 100 Telegraph With a game this broad and lengthy, there’… 2020-03-16
4 100 VG247 Animal Crossing: New Horizons is everythin… 2020-03-16
5 100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
6 100 Trusted Reviews Animal Crossing: New Horizons is the best … 2020-03-16
7 100 VGC Nintendo's comforting life sim is a tranqu… 2020-03-16
8 100 God is a Geek A beautiful, welcoming game that is everyt… 2020-03-16
9 100 Nintendo Life Animal Crossing: New Horizons takes Animal… 2020-03-16
10 100 Daily Star Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows
= "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"
url
="\t") pd.read_table(url, sep
grade ... date
0 100 ... 2020-03-16
1 100 ... 2020-03-16
2 100 ... 2020-03-16
3 100 ... 2020-03-16
4 100 ... 2020-03-16
.. ... ... ...
102 90 ... 2020-04-16
103 90 ... 2020-04-17
104 95 ... 2020-04-22
105 90 ... 2020-05-01
106 80 ... 2020-05-01
[107 rows x 4 columns]
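The delim/sep argument accepts any single-character delimiter, which helps with, for instance, semicolon-separated files in which a comma serves as the decimal mark. A minimal sketch, assuming a hypothetical local file measurements.csv; on the R side, read_csv2() covers this semicolon/decimal-comma convention directly:
# hypothetical semicolon-delimited file using "," as the decimal mark
pd.read_csv("measurements.csv", sep=";", decimal=",")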
2.2 Excel files
At the moment, readxl does not support reading files directly from URLs. Hence, we will consider an example dataset that ships with the readxl package.
library(readxl)
<- readxl_example("clippy.xlsx")
path_to_ds read_xlsx(path_to_ds)
# A tibble: 4 × 2
name value
<chr> <chr>
1 Name Clippy
2 Species paperclip
3 Approx date of death 39083
4 Weight in grams 0.9
pd.read_excel(r.path_to_ds)
name value
0 Name Clippy
1 Species paperclip
2 Approx date of death 2007-01-01 00:00:00
3 Weight in grams 0.9
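Note the value column above: read_xlsx() imports this mixed-type column as character, so the date cell surfaces as the raw Excel serial number 39083, whereas pandas returns a timestamp for that cell. If you ever need to convert such a serial yourself, Windows Excel counts days from 1899-12-30; a minimal sketch:
# Excel (Windows) serial dates count days since 1899-12-30, so 39083 is 2007-01-01
pd.to_datetime(39083, unit="D", origin="1899-12-30")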
2.3 Summarize your data
The skimr package can be very useful for quickly summarizing your data!
skimr::skim(df)
Name | df |
Number of rows | 107 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 2 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
publication | 0 | 1 | 2 | 29 | 0 | 107 | 0 |
text | 0 | 1 | 22 | 833 | 0 | 107 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-03-16 | 2020-05-01 | 2020-03-23 | 28 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
grade | 0 | 1 | 90.64 | 6.11 | 70 | 90 | 90 | 94 | 100 | ▁▂▁▇▃ |
summary(df)
grade publication text date
Min. : 70.00 Length:107 Length:107 Min. :2020-03-16
1st Qu.: 90.00 Class :character Class :character 1st Qu.:2020-03-16
Median : 90.00 Mode :character Mode :character Median :2020-03-23
Mean : 90.64 Mean :2020-03-25
3rd Qu.: 94.00 3rd Qu.:2020-04-02
Max. :100.00 Max. :2020-05-01
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 grade 107 non-null int64
1 publication 107 non-null object
2 text 107 non-null object
3 date 107 non-null object
dtypes: int64(1), object(3)
memory usage: 3.5+ KB
df.describe()
grade
count 107.000000
mean 90.635514
std 6.114308
min 70.000000
25% 90.000000
50% 90.000000
75% 94.000000
max 100.000000
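As df.info() showed, pandas stored date as a plain object column, so df.describe() only summarizes grade. Parsing the dates first lets them be summarized too; a minimal sketch:
# parse the date column, then look at its range (compare with summary(df) above)
dates = pd.to_datetime(df["date"])
dates.min(), dates.max()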