library(jsonlite)
library(xml2)
3 Semi-structured Data
When we talk about semi-structured data we usually refer to data that is not represented as a table, but as a hierarchical nested structure of key-value pairs. We will see how to import two kinds of file:
- JSON (JavaScript Object Notation);
- XML (eXtensible Markup Language).
import pandas as pd
3.1 JSON
We will use the Open Library RESTful API (see chapter XX to know more about APIs) in order to retrieve information about books written by one of my favorite authors, Anna Maria Ortese.
<- read_json("http://openlibrary.org/search.json?author=anna+maria+ortese",
json_data simplifyVector = TRUE)
= pd.read_json("http://openlibrary.org/search.json?author=anna+maria+ortese") json_data
By default, jsonlite::read_json()
will not simplify the JSON structure, whereas pd.read_json()
will try to convert the JSON file into a table.
lapply(json_data, length) # in R
$numFound
[1] 1
$start
[1] 1
$numFoundExact
[1] 1
$docs
[1] 56
$num_found
[1] 1
$q
[1] 1
$offset
[1] 0
We are interested in retrieving information from four sub-fields coming from the docs
field:
$docs |>
json_data::select(author_name, title, publish_year, number_of_pages_median) |>
dplyrhead(5)
author_name title publish_year
1 Anna Maria Ortese Il mare non bagna Napoli 1955, 1979, 1953, 1994, 1975, 1967
2 Anna Maria Ortese L' iguana 1978, 1965
3 Anna Maria Ortese Romanzi 2002
4 Anna Maria Ortese Poveri e semplici 1974, 1967
5 Anna Maria Ortese Short Stories 1994
number_of_pages_median
1 160
2 204
3 NA
4 163
5 191
("docs"])
pd.json_normalize(json_data["author_name", "title", "publish_year", "number_of_pages_median"]]
.loc[:,[5)
.head( )
author_name ... number_of_pages_median
0 [Anna Maria Ortese] ... 160.0
1 [Anna Maria Ortese] ... 204.0
2 [Anna Maria Ortese] ... NaN
3 [Anna Maria Ortese] ... 163.0
4 [Anna Maria Ortese] ... 191.0
[5 rows x 4 columns]
Read more about how to subset data frames in Chapter 5
3.2 XML
<- read_xml("https://www.w3schools.com/xml/simple.xml") xml_data
= pd.read_xml("https://www.w3schools.com/xml/simple.xml") xml_data
|>
xml_data xml_children() |>
as_list() |>
::bind_rows() |>
dplyr::unnest(cols = c(name, price, description, calories)) tidyr
# A tibble: 5 × 4
name price description calories
<chr> <chr> <chr> <chr>
1 Belgian Waffles $5.95 Two of our famous Belgian Waffles … 650
2 Strawberry Belgian Waffles $7.95 Light Belgian waffles covered with… 900
3 Berry-Berry Belgian Waffles $8.95 Light Belgian waffles covered with… 900
4 French Toast $4.50 Thick slices made from our homemad… 600
5 Homestyle Breakfast $6.95 Two eggs, bacon or sausage, toast,… 950
xml_data
name ... calories
0 Belgian Waffles ... 650
1 Strawberry Belgian Waffles ... 900
2 Berry-Berry Belgian Waffles ... 900
3 French Toast ... 600
4 Homestyle Breakfast ... 950
[5 rows x 4 columns]