Administrative Information

This seminar will be offered again in summer term 2020.

  • Course day/time: Fri., 9:15-10:45 (building 22A, room 203)
  • Instructor: Uli Niemann
  • Course type: seminar
  • ECTS credits: 6
  • Audience: all FIN Master degree programs
  • Course language: English
  • Registration and application:
  • Prerequisites: see Prerequisites section
  • Technical requirements: bring your own laptop with R and RStudio installed on it
  • Grading: based on several deliverables in the context of a semester-long data science project → see Project page


The schedule for the final presentations now includes links to the project websites.

The schedule for the final presentations is now fixed; please see the tables below. There is one slight change to the presentation instructions: since each team has only 15 minutes to present its work, the project screencast should NOT be shown at the beginning of the talk. The project websites and corresponding screencasts will be linked from the course website so that everyone can watch the videos before the presentations.


  #  Time         Shortcode      Title
  1  09:00-09:15  Github-Mining  Clustering and Pattern Discovery of Github Project Repositories
  2  09:20-09:35  CuBA           Customer Behavioural Analytics in the Retail Sector
  3  09:40-09:55  TaPStOP        Analysis and Tag Prediction for Stack Overflow Posts
  4  10:00-10:15  CSV            Machine Learning for Automatic Vulnerability Detection in Cyber Security
  5  10:20-10:35  WHOCares       Identification of Predictors for Climate Change Effects
  6  10:40-10:55  StopMoVis      Streaming Topic Model and Visualization for Twitter


  #  Time         Shortcode   Title
  7  09:00-09:15  Eur-Lex     Multi-Label Classification of Legal Text Documents
  8  09:20-09:35  MineR       Analyzing Trajectory Data to Understand the Effects of Video Games on Human Short- and Long-Term Memory
  9  09:40-09:55  Foursquare  Exploring Location-Based Social Network Data from Foursquare
 10  10:00-10:15  RAAS        Predictive Policing for EDA on the Dallas Crime Dataset
 11  10:20-10:35  MaLGA       “Make the Liver Great Again” - Regression for High-Dimensional Longitudinal Cohort Data Study

The poll for the topics of weeks 11 and 12 is now closed. You have decided that we will cover the topics ‘Feature Engineering’ and ‘Interpretable Machine Learning’.

The project proposal review meetings, which take place on 19 and 20 November, are scheduled via Appointlet. The registration link has been sent by email.

The notifications regarding admission have been sent to all students who registered via LSF.

Due to the very high interest in Data Science with R, we unfortunately cannot admit all students who have registered for the seminar via LSF. We therefore ask you to submit a brief application letter (ca. 300 words) to uli.niemann[at] by October 8th, 2018. The letter should contain a) a statement of why the seminar would enhance your studies and b) a list of prerequisites and/or recommendations. On October 10th, 2018, we will select up to 30 students for admission from all applications.

Course Description

Data Science with R (DataSciR) is an applied course about learning from data to perform predictions and to obtain useful insights. In the seminar, we will use the statistical programming language R.

The skills needed to manage and analyze data will be taught and practiced on real-world applications and through a semester-long graded data science project.

Programming knowledge from other courses is helpful but not mandatory. However, you are expected to have a solid knowledge of fundamental data mining techniques, such as classification, regression and clustering.

After successful completion of this course, you will be able to proficiently perform the following tasks in R:

  • import and preprocess raw data
  • transform data for modelling
  • perform exploratory data analysis with summary statistics and visualization
  • understand, build and evaluate predictive models for classification and numeric prediction, including regression models, tree-based models, ensembles and boosted models
  • communicate and disseminate results and findings through reproducible documents, presentations, websites and interactive web applications

Tentative Syllabus


Prerequisites

There are no mandatory prerequisites for DataSciR. However, it is recommended that you have attended at least one of the following lectures (or a comparable one):

Also, you should have basic knowledge of programming and statistics. For example, you will learn the most important vector types and classes in R, but you will not learn what a vector or a class is in general. Likewise, you should know what terms such as mean, standard deviation, probability, hypothesis test and p-value mean.

Technical Requirements

It is recommended that you bring your laptop to each course meeting. Class meetings are a mix of lecture and short coding exercises. You will get the most out of the meetings if you have a laptop and can work on these exercises. Hence, you should set up your laptop by the end of the first week as described in the Software section.


Other Resources


Software

By the end of the first week, you should have installed the following software on your own laptop:

  1. R
  2. RStudio
  3. optional for Windows: Rtools

Also, please check whether you can successfully install packages. To do so, click on the Packages tab in the bottom-right pane in RStudio. Then, click on the Install button and specify an arbitrary package, e.g. dplyr. Finally, click on Install. Alternatively, you can install a package from the console with install.packages("dplyr"). If everything is set up correctly, no error messages should be displayed when you load the installed package with library(dplyr).
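
The same check can also be scripted. The following is only a minimal sketch: ensure_package() is a hypothetical helper name chosen for illustration and is not part of the course material. It installs a package only if it is missing and then reports whether the package can be loaded.

```r
# Hypothetical helper (an assumption, not part of the course material):
# install a package only if it is not yet available, then report
# whether its namespace can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Downloads from the configured CRAN mirror; needs a network connection.
    install.packages(pkg)
  }
  # TRUE if the package can be loaded, FALSE otherwise.
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Only run the check automatically in an interactive session (e.g. RStudio).
if (interactive()) {
  ensure_package("dplyr")
}
```

If the call returns TRUE, library(dplyr) should load the package without error messages.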

List of packages used on slides

Execute the following code chunk to install all packages that are used on the course slides (so that you don’t need to manually install each of them). Please note that the list will be updated during the semester.

pacman::p_load(char = c(