rvest. Be sure to check out the vignettes and other documentation.

It’s likely that a huge portion of the datasets you’ve worked with in your life thus far are very well structured and organized to the point of being tabular. That is, the dataset consists of a single table with rows and columns. If you’re not familiar with the concept of tidy data, I suggest you check out this tidyr vignette on what it means for a dataset to be “tidy.” Most of your datasets have likely even been tidy.
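As a small illustration of what “tidy” looks like in practice, here is a minimal sketch in base R. The cities and temperatures are made up purely for demonstration; the point is only the layout, with one row per observation and one column per variable.

```r
# Made-up values, for illustration only.
# Tidy layout: one row per observation (a city on a given day),
# one column per variable.
temps <- data.frame(
  city   = c("Oslo", "Oslo", "Lima", "Lima"),
  day    = c(1, 2, 1, 2),
  temp_c = c(3.5, 4.0, 21.0, 22.5)
)

str(temps)

# Because each variable lives in its own column, a grouped summary
# needs nothing more than a formula naming those columns.
mean_by_city <- aggregate(temp_c ~ city, data = temps, FUN = mean)
print(mean_by_city)
```

A layout like this is what most plotting and modeling functions in R expect as input.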
Almost everyone’s experiences beyond this probably involved taking multiple tidy datasets and combining/merging them in some way, but the overall dataset still fit into this tabular structure. These often came in the form of TXT, CSV, or TSV files.
Our world is rich with data and a great deal of it doesn’t come in nice, neat tables…BUT a lot of it is still extremely structured and well organized.
Web pages can be beautiful and rich! HTML provides a wonderful structure with which to house this beauty and richness. Unfortunately, this structure is not the tabular structure that we’re used to, but it’s still very accessible. In particular, people have developed tools in R for working with HTML. One very popular one is the rvest package. Take a few minutes to (re-)introduce yourself to web scraping with rvest and working with HTML in R.
I encourage you to follow along with the video, to the point of copying and running the same code that’s being demonstrated.
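If you’d like something small to run first, here is a minimal sketch of the basic rvest workflow. It is not the code from the video. The HTML is a made-up page supplied as a string so that the example runs offline, and the film names, times, and the `film` class are all hypothetical.

```r
library(rvest)

# A made-up page, supplied as a string so the example runs offline.
page <- read_html('
  <html><body>
    <h1>Showtimes</h1>
    <ul>
      <li class="film">Film A</li>
      <li class="film">Film B</li>
    </ul>
    <table>
      <tr><th>film</th><th>time</th></tr>
      <tr><td>Film A</td><td>19:00</td></tr>
      <tr><td>Film B</td><td>21:30</td></tr>
    </table>
  </body></html>
')

# Select nodes with a CSS selector, then extract their text.
films <- page %>%
  html_elements("li.film") %>%
  html_text2()

# An HTML <table> is the one part of a page that is already tabular,
# so it can be converted to a data frame directly.
showtimes <- page %>%
  html_element("table") %>%
  html_table()

print(films)
print(showtimes)
```

With a real site you would pass a URL to `read_html()` instead of a string, and most of the effort usually goes into finding selectors that pick out the nodes you want.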
Scraping the web is great, but wouldn’t it be even better if we could get data from lots of other sources that we didn’t have to clean and wrangle ourselves, as is so often required of data gathered from HTML?
You’ve likely already noticed, via your work with leaflet, another very popular data format called “JSON”. JSON is short for JavaScript Object Notation and is a syntax for storing and exchanging data…but data in a much broader sense than we’re used to thinking about.
In the middle of a statistics-related course, when I hear the word “data” my mind instinctively envisions something tabular with which we could do some sort of visualization or analysis. However, in the broadest sense, the word “data” really just describes pieces of information…which encompasses everything from the movie theater schedules you see when you look up showtimes on a website to what your cell phone uses as part of your “data plan.”
While the website for your favorite movie theater may not have had your visualization and analysis plans in mind, their data may still be of interest to us. Many companies and organizations have built tools that allow access to their data in a more streamlined way, and JSON has become a ubiquitous format for many of these data sources to use.
One of the reasons the JSON format is so important and useful is that it can accommodate very complex types and shapes of data. So while it may not always be workable into a tabular form, it still maintains a high degree of structure that we can exploit!
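To make that concrete, here is a minimal sketch of reading a small piece of JSON into R. I’m assuming the jsonlite package here (the text above hasn’t named a JSON parser), and the theater listing is invented for the example, so every key and value in it is hypothetical.

```r
library(jsonlite)

# Made-up JSON text; the keys and values are hypothetical.
txt <- '{
  "theater": "Example Cinema",
  "screens": 2,
  "films": [
    {"title": "Film A", "times": ["19:00", "21:30"]},
    {"title": "Film B", "times": ["20:15"]}
  ]
}'

listing <- fromJSON(txt)

# The top-level JSON object becomes a named list.
listing$theater
listing$screens

# An array of similar objects is simplified to a data frame, but the
# "times" arrays have different lengths and cannot be flattened into
# a plain column, so they are kept as a list column.
films <- listing$films
class(films)
films$times[[1]]
```

Notice that the result is part table and part nested list. That mix is typical of JSON: the structure is all there, but it’s up to us to decide which pieces to pull out into a tabular form.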