Learning and Teaching Web Scraping

A summary of a longer article on web scraping in data science education.

Categories: teaching, R, curriculum, manuscripts

Published: July 15, 2020

Background

Mine Çetinkaya-Rundel and I recently wrote a paper titled Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities. None of us has time to read every paper we want to read or watch every movie we want to watch, so I thought I would summarize the paper in this blog post.

We wrote the paper for instructors who want to teach web scraping; in a way, it is a guide to teaching web scraping. Our examples use R, but many points in the paper are language agnostic and apply even if you teach with Python or another tool. Beyond teaching, the paper has enough information to help a beginner learn web scraping.

Below I outline what is in the paper, so that you can come back to it when you find the time or need it for your own work. I list different reasons one might read the paper, along with the sections that serve each. Needless to say, to get the whole picture you should read the whole paper.

Summary

| Goal | Section |
|---|---|
| You want to learn the basic idea behind HTML & CSS | Section 2.1 (HTML & CSS) |
| You want to learn web scraping using R | Section 2 (Technical Tools) and Section 3 (Classroom Examples) |
| You want to know why one should learn and/or teach web scraping | Section 1 (Introduction) and Section 5 (Opportunities) |
| You are trying to learn web scraping but running into problems | Section 4 (Challenges) |
| You want to teach web scraping and need an example | Section 3 (Classroom Examples) |
| You want to teach web scraping, or already teach it, and want to consider the bigger picture | Section 4 (Challenges) and Section 5 (Opportunities) |


We do not focus only on the technical aspects of web scraping. Just because we can scrape data off the web does not mean that we should. We discuss the ethics of web scraping in Section 5 (Opportunities).

We also provide the following set of questions that we believe instructors (and novice scrapers) should consider before deciding which website to scrape:

  • Are the data from human subjects? If yes, is it ethical to scrape the data?
  • Does the website provide an API?
  • Does the website allow web scraping?
  • Are the data provided in an HTML table?
  • Are the CSS Selectors easy to select with SelectorGadget?
  • Is there non-numeric data? If yes, how easy is it to manipulate it?
  • Would the process of scraping involve iteration over multiple pages? If yes, how much data are you planning to scrape, all or a sample?
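Two of the questions above, whether the website allows web scraping and whether the data sit in an HTML table, can be checked programmatically. The paper's own examples are in R; the sketch below is not taken from the paper and uses only Python's standard library as a language-agnostic illustration. The robots.txt text, the page content, and the function names `allows_scraping` and `extract_table_rows` are all made up for this example. A robots.txt file is only one signal of what a site permits, so a site's terms of use may still need a human read.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allows_scraping(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row (<tr>) in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table_rows(html: str) -> list:
    """Return all table rows found in `html` as lists of cell strings."""
    extractor = TableExtractor()
    extractor.feed(html)
    return extractor.rows


# Hypothetical robots.txt and page content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

PAGE = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Example A</td><td>2019</td></tr>
  <tr><td>Example B</td><td>2020</td></tr>
</table>
"""

if __name__ == "__main__":
    print(allows_scraping(ROBOTS_TXT, "https://example.com/data/"))
    print(allows_scraping(ROBOTS_TXT, "https://example.com/private/x"))
    print(extract_table_rows(PAGE))
```

If `extract_table_rows` returns an empty list, the data are probably not in an HTML table, and CSS selectors (for example, ones found with SelectorGadget) would be the next thing to try.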

If you have the time and the interest, feel free to read the whole paper. We wrote it in R Markdown using the rticles package. In case it is useful, the R Markdown source of the paper and the R code it contains are available on GitHub.

To cite the paper: Mine Dogucu & Mine Çetinkaya-Rundel (2020): Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities, Journal of Statistics Education, DOI:10.1080/10691898.2020.1787116
