3 Semi-structured Data

When we talk about semi-structured data we usually refer to data that is not represented as a table, but as a hierarchical nested structure of key-value pairs. We will see how to import two kinds of file:

JSON (JavaScript Object Notation);
XML (eXtensible Markup Language).

R
Python

library(jsonlite)
library(xml2)

import pandas as pd

3.1 JSON

We will use the Open Library RESTful API (see chapter XX to know more about APIs) in order to retrieve information about books written by one of my favorite authors, Anna Maria Ortese.

R
Python

json_data <- read_json("http://openlibrary.org/search.json?author=anna+maria+ortese",
                       simplifyVector = TRUE)

json_data = pd.read_json("http://openlibrary.org/search.json?author=anna+maria+ortese")

By default, jsonlite::read_json() will not simplify the JSON structure, whereas pd.read_json() will try to convert the JSON file into a table.

lapply(json_data, length) # in R

$numFound
[1] 1

$start
[1] 1

$numFoundExact
[1] 1

$docs
[1] 56

$num_found
[1] 1

$q
[1] 1

$offset
[1] 0

We are interested in retrieving information from four sub-fields coming from the docs field:

R
Python

json_data$docs |> 
    dplyr::select(author_name, title, publish_year, number_of_pages_median) |> 
    head(5)

        author_name                    title                       publish_year
1 Anna Maria Ortese Il mare non bagna Napoli 1955, 1979, 1953, 1994, 1975, 1967
2 Anna Maria Ortese                L' iguana                         1978, 1965
3 Anna Maria Ortese                  Romanzi                               2002
4 Anna Maria Ortese        Poveri e semplici                         1974, 1967
5 Anna Maria Ortese            Short Stories                               1994
  number_of_pages_median
1                    160
2                    204
3                     NA
4                    163
5                    191

(
    pd.json_normalize(json_data["docs"])
    .loc[:,["author_name", "title", "publish_year", "number_of_pages_median"]]
    .head(5)
)

           author_name  ... number_of_pages_median
0  [Anna Maria Ortese]  ...                  160.0
1  [Anna Maria Ortese]  ...                  204.0
2  [Anna Maria Ortese]  ...                    NaN
3  [Anna Maria Ortese]  ...                  163.0
4  [Anna Maria Ortese]  ...                  191.0

[5 rows x 4 columns]

Read more about how to subset data frames in Chapter 5

xml_data <- read_xml("https://www.w3schools.com/xml/simple.xml")

xml_data = pd.read_xml("https://www.w3schools.com/xml/simple.xml")

R
Python

xml_data |> 
    xml_children() |> 
    as_list() |> 
    dplyr::bind_rows() |> 
    tidyr::unnest(cols = c(name, price, description, calories))

# A tibble: 5 × 4
  name                        price description                         calories
  <chr>                       <chr> <chr>                               <chr>   
1 Belgian Waffles             $5.95 Two of our famous Belgian Waffles … 650     
2 Strawberry Belgian Waffles  $7.95 Light Belgian waffles covered with… 900     
3 Berry-Berry Belgian Waffles $8.95 Light Belgian waffles covered with… 900     
4 French Toast                $4.50 Thick slices made from our homemad… 600     
5 Homestyle Breakfast         $6.95 Two eggs, bacon or sausage, toast,… 950

xml_data

                          name  ... calories
0              Belgian Waffles  ...      650
1   Strawberry Belgian Waffles  ...      900
2  Berry-Berry Belgian Waffles  ...      900
3                 French Toast  ...      600
4          Homestyle Breakfast  ...      950

[5 rows x 4 columns]