2 Structured Data

The most common tabular (or structured) file formats one may encounter in a data science project are:

delimited files such as CSV (comma-separated values) and TSV (tab-separated values) files;
Excel files.

We will consider the Personal key indicators of heart disease from Kaggle.

kaggle datasets download kamilpytlak/personal-key-indicators-of-heart-disease ./data --unzip

R
Python

library(readr)

import pandas as pd

2.1 Delimited files

2.1.1 TSV

As an example, let’s consider the Animal Crossing dataset from week 19 of the 2020 TidyTuesday challenge.

R
Python

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"

df <- read_tsv(url, show_col_types = FALSE)
df

# A tibble: 107 × 4
   grade publication      text                                        date      
   <dbl> <chr>            <chr>                                       <date>    
 1   100 Pocket Gamer UK  Animal Crossing; New Horizons, much like i… 2020-03-16
 2   100 Forbes           Know that if you’re overwhelmed with the w… 2020-03-16
 3   100 Telegraph        With a game this broad and lengthy, there’… 2020-03-16
 4   100 VG247            Animal Crossing: New Horizons is everythin… 2020-03-16
 5   100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
 6   100 Trusted Reviews  Animal Crossing: New Horizons is the best … 2020-03-16
 7   100 VGC              Nintendo's comforting life sim is a tranqu… 2020-03-16
 8   100 God is a Geek    A beautiful, welcoming game that is everyt… 2020-03-16
 9   100 Nintendo Life    Animal Crossing: New Horizons takes Animal… 2020-03-16
10   100 Daily Star       Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"

df = pd.read_table(url)
df

     grade  ...        date
0      100  ...  2020-03-16
1      100  ...  2020-03-16
2      100  ...  2020-03-16
3      100  ...  2020-03-16
4      100  ...  2020-03-16
..     ...  ...         ...
102     90  ...  2020-04-16
103     90  ...  2020-04-17
104     95  ...  2020-04-22
105     90  ...  2020-05-01
106     80  ...  2020-05-01

[107 rows x 4 columns]

2.1.2 CSV

As an example, let’s consider the Indoor Pollution dataset from week 15 of the 2022 TidyTuesday challenge.

R
Python

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"

read_csv(url, show_col_types = FALSE)

# A tibble: 8,010 × 4
   Entity      Code   Year `Deaths - Cause: All causes - Risk: Household air p…`
   <chr>       <chr> <dbl>                                                 <dbl>
 1 Afghanistan AFG    1990                                                  19.6
 2 Afghanistan AFG    1991                                                  19.3
 3 Afghanistan AFG    1992                                                  19.5
 4 Afghanistan AFG    1993                                                  19.7
 5 Afghanistan AFG    1994                                                  19.4
 6 Afghanistan AFG    1995                                                  19.6
 7 Afghanistan AFG    1996                                                  19.8
 8 Afghanistan AFG    1997                                                  19.7
 9 Afghanistan AFG    1998                                                  19.0
10 Afghanistan AFG    1999                                                  19.9
# … with 8,000 more rows

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-04-12/indoor_pollution.csv"

pd.read_csv(url)

           Entity  ... Deaths - Cause: All causes - Risk: Household air pollution from solid fuels - Sex: Both - Age: Age-standardized (Percent)
0     Afghanistan  ...                                          19.623001                                                                       
1     Afghanistan  ...                                          19.335193                                                                       
2     Afghanistan  ...                                          19.508785                                                                       
3     Afghanistan  ...                                          19.677607                                                                       
4     Afghanistan  ...                                          19.432528                                                                       
...           ...  ...                                                ...                                                                       
8005     Zimbabwe  ...                                           8.603630                                                                       
8006     Zimbabwe  ...                                           8.656817                                                                       
8007     Zimbabwe  ...                                           8.690100                                                                       
8008     Zimbabwe  ...                                           8.736245                                                                       
8009     Zimbabwe  ...                                           8.658818                                                                       

[8010 rows x 4 columns]

2.1.3 General delimiter

Again, we can consider the Animal Crossing dataset from TidyTuesday and use the read_table function to import the data by specifying a separator/delimiter.

R
Python

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"

read_delim(url, delim ="\t", show_col_types = FALSE)

# A tibble: 107 × 4
   grade publication      text                                        date      
   <dbl> <chr>            <chr>                                       <date>    
 1   100 Pocket Gamer UK  Animal Crossing; New Horizons, much like i… 2020-03-16
 2   100 Forbes           Know that if you’re overwhelmed with the w… 2020-03-16
 3   100 Telegraph        With a game this broad and lengthy, there’… 2020-03-16
 4   100 VG247            Animal Crossing: New Horizons is everythin… 2020-03-16
 5   100 Nintendo Insider Above all else, Animal Crossing: New Horiz… 2020-03-16
 6   100 Trusted Reviews  Animal Crossing: New Horizons is the best … 2020-03-16
 7   100 VGC              Nintendo's comforting life sim is a tranqu… 2020-03-16
 8   100 God is a Geek    A beautiful, welcoming game that is everyt… 2020-03-16
 9   100 Nintendo Life    Animal Crossing: New Horizons takes Animal… 2020-03-16
10   100 Daily Star       Similar to how Breath of the Wild and Odys… 2020-03-16
# … with 97 more rows

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv"

pd.read_table(url, sep="\t")

     grade  ...        date
0      100  ...  2020-03-16
1      100  ...  2020-03-16
2      100  ...  2020-03-16
3      100  ...  2020-03-16
4      100  ...  2020-03-16
..     ...  ...         ...
102     90  ...  2020-04-16
103     90  ...  2020-04-17
104     95  ...  2020-04-22
105     90  ...  2020-05-01
106     80  ...  2020-05-01

[107 rows x 4 columns]

2.2 Excel files

At the moment, readxl does not support reading files from URLs. Hence, we will consider an example dataset which comes into the readxl package.

R
Python

library(readxl)

path_to_ds <- readxl_example("clippy.xlsx")
read_xlsx(path_to_ds)

# A tibble: 4 × 2
  name                 value    
  <chr>                <chr>    
1 Name                 Clippy   
2 Species              paperclip
3 Approx date of death 39083    
4 Weight in grams      0.9

pd.read_excel(r.path_to_ds)

                   name                value
0                  Name               Clippy
1               Species            paperclip
2  Approx date of death  2007-01-01 00:00:00
3       Weight in grams                  0.9

2.3 Summarize your data

The skimr package

The skimr package can be very useful to easily summarize your data!

R
Python

skimr::skim(df)

Data summary
Name	df
Number of rows	107
Number of columns	4
_______________________
Column type frequency:
character	2
Date	1
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
publication	0	1	2	29	0	107	0
text	0	1	22	833	0	107	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2020-03-16	2020-05-01	2020-03-23	28

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
grade	0	1	90.64	6.11	70	90	90	94	100	▁▂▁▇▃

summary(df)

     grade        publication            text                date           
 Min.   : 70.00   Length:107         Length:107         Min.   :2020-03-16  
 1st Qu.: 90.00   Class :character   Class :character   1st Qu.:2020-03-16  
 Median : 90.00   Mode  :character   Mode  :character   Median :2020-03-23  
 Mean   : 90.64                                         Mean   :2020-03-25  
 3rd Qu.: 94.00                                         3rd Qu.:2020-04-02  
 Max.   :100.00                                         Max.   :2020-05-01

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   grade        107 non-null    int64 
 1   publication  107 non-null    object
 2   text         107 non-null    object
 3   date         107 non-null    object
dtypes: int64(1), object(3)
memory usage: 3.5+ KB

df.describe()

            grade
count  107.000000
mean    90.635514
std      6.114308
min     70.000000
25%     90.000000
50%     90.000000
75%     94.000000
max    100.000000