Data Science with `R` (DataSciR)

This seminar will be offered again in **summer term 2020**.

- Course day/time: Fri., 9:15-10:45 (building 22A, room 203)
- Instructor: Uli Niemann
- Course type: seminar
- ECTS credits: 6
- Audience: all FIN Master degree programs
- Course language: English
- Registration and application:
  - via LSF from 01.09.2018 until 08.10.2018
  - brief motivational letter
  - at the examination office as term paper (*Hausarbeit*) using this form until 19.11.2018
- Prerequisites: see Prerequisites section
- Technical requirements: bring your own laptop with `R` and RStudio installed on it
- Grading: based on several deliverables in the context of a semester-long data science project → see Project page

17.01.2019

The schedule for the final presentations includes links to the project websites now.

21.12.2018

The final schedule for the presentations is now fixed; please see the tables below. There is one slight change in the instructions for the presentations: since each team has just 15 minutes to present their work, the project screencast should *NOT* be shown at the beginning of a talk, simply due to lack of time. The project websites and corresponding screencasts will be linked from the course website, so that everyone can watch the videos before the presentations.

18.01.2019

| # | Start | End | Shortcode | Title |
|---|---|---|---|---|
| 1 | 09:00 | 09:15 | Github-Mining | Clustering and Pattern Discovery of Github Project Repositories |
| 2 | 09:20 | 09:35 | CuBA | Customer Behavioural Analytics in the Retail Sector |
| 3 | 09:40 | 09:55 | TaPStOP | Analysis and Tag Prediction for Stack Overflow Posts |
| 4 | 10:00 | 10:15 | CSV | Machine Learning for Automatic Vulnerability Detection in Cyber Security |
| 5 | 10:20 | 10:35 | WHOCares | Identification of Predictors for Climate Change Effects |
| 6 | 10:40 | 10:55 | StopMoVis | Streaming Topic Model and Visualization for Twitter |

25.01.2019

| # | Start | End | Shortcode | Title |
|---|---|---|---|---|
| 7 | 09:00 | 09:15 | Eur-Lex | Multi-Label Classification of Legal Text Documents |
| 8 | 09:20 | 09:35 | MineR | Analyzing Trajectory Data to Understand the Effects of Video Games on Human Short- and Long-Term Memory |
| 9 | 09:40 | 09:55 | Foursquare | Exploring Location-Based Social Network Data from Foursquare |
| 10 | 10:00 | 10:15 | RAAS | Predictive Policing for EDA on the Dallas Crime Dataset |
| 11 | 10:20 | 10:35 | MaLGA | “Make the Liver Great Again” - Regression for High-Dimensional Longitudinal Cohort Data Study |

02.11.2018

The poll for the topics of weeks 11 and 12 is now closed. You have decided that we will cover the topics ‘Feature Engineering’ and ‘Interpretable Machine Learning’.

30.10.2018

The project proposal review meetings which take place on 19./20.11. are scheduled via Appointlet. The link for registration has been sent by mail.

10.10.2018

The notifications regarding admission have been sent to all students who registered via LSF.

24.09.2018

Due to the very high interest in Data Science with R, we unfortunately cannot admit all students who have registered for the seminar via LSF. We therefore ask you to submit a brief application letter (ca. 300 words) to uli.niemann[at]ovgu.de by October 8th, 2018. This letter should contain a) a statement of why the seminar would enhance your studies and b) a list of prerequisites and/or recommendations. On October 10th, 2018, we will select up to 30 students for admission to the seminar from all applications.

Data Science with R (*DataSciR*) is an applied course about learning from data to perform predictions and to obtain useful insights. In the seminar, we will use the statistical programming language `R`.

Necessary skills to manage and analyze data will be taught and practiced on real-world applications and through a semester-long graded data science project.

Programming knowledge from other courses is helpful but not mandatory. However, you are expected to have a sound knowledge of fundamental data mining techniques, such as classification, regression and clustering.

After successful completion of this course, you will be able to proficiently perform the following tasks in `R`:

- import and preprocess raw data
- transform data for modelling
- perform exploratory data analysis with summary statistics and visualization
- understand, build and evaluate predictive models, including regression models, tree-based models, ensembles and boosted models
- communicate and disseminate results and findings through reproducible documents, presentations, websites and interactive web applications

There are no mandatory prerequisites for DataSciR. However, it is recommended that you have attended at least one of the following lectures (or a comparable one):

Also, you should have basic programming and statistics knowledge. For example, you will learn the most important vector types and classes in `R`, but you will not learn what a vector or a class *is* in general. Accordingly, you should know what terms such as mean, standard deviation, probability, hypothesis test and p-value mean.

It is recommended to bring your laptop to each course meeting. Class meetings are a mix of lecture and short coding exercises. You will get the most out of the meetings if you have a laptop and can work on these exercises. Hence, you should set up your laptop by the end of the first week as described in the Software section.

Data Mining / Statistical Analysis:

- Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2017.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson, 2005.
- Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.

`R`-specific:

- Hadley Wickham, and Garrett Grolemund. R for Data Science. O’Reilly, 2017.
- Max Kuhn. The `caret` package. Online documentation.
- Max Kuhn, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.
- Yihui Xie, J. J. Allaire, and Garrett Grolemund. R Markdown: The Definitive Guide. Chapman & Hall/CRC, 2018.
- Hadley Wickham. Advanced R. Chapman & Hall, 2018/2019.
- Hadley Wickham. ggplot2 - Elegant Graphics for Data Analysis. Springer, 2016.
- Max Kuhn, and Kjell Johnson. Feature Engineering and Selection: A Practical Approach for Predictive Models. Draft version.
- Bradley Boehmke. Hands-on Machine Learning with R. Draft version.

Other:

- Jeffrey Leek. Organizing Data Science Projects. Leanpub.
- Jenny Bryan, and others. Happy Git and GitHub for the useR. 2018.

- RStudio cheat sheets
- RStudio webinars
- DataCamp (online learning platform that offers (paid) interactive `R` courses)
- Quick-R (short tutorials on various topics, e.g. data import, statistics and graph generation)

By the end of the first week, you should have installed the following software on your own laptop:

Also, please check whether you can successfully install packages. To do so, click on the *Packages* tab in the bottom-right pane in RStudio. Then, click on the *Install* button and specify an arbitrary package, e.g. `dplyr`. Finally, click on *Install*. Alternatively, you can install a package from the console with `install.packages("dplyr")`. If everything is set up correctly, no error messages should be displayed when you load the installed package with `library(dplyr)`.
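The same check can also be scripted instead of loading packages one by one. The following is a minimal sketch and not part of the official course material; the helper name `is_loadable` is hypothetical. It only reports whether a package can be loaded and does not install anything.

```r
# Hypothetical helper (an assumption for illustration): reports whether a
# package is installed and loadable, without attaching it to the search path.
is_loadable <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Check several packages at once. "stats" ships with every R installation;
# "dplyr" is TRUE only if the installation described above succeeded.
sapply(c("stats", "dplyr"), is_loadable)
```

A `FALSE` entry suggests that the corresponding package should be installed again with `install.packages()`.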

Execute the following code chunk to install all packages that are used on the course slides (so that you don’t need to manually install each of them). Please note that the list will be updated during the semester.

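```r
# Install pacman itself only if it is not available yet
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
# p_load() installs every package that is missing and then loads all of them
pacman::p_load(char = c(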
"broom",
"caret",
"corrplot",
"corrr",
"cowplot",
"dbscan",
"deldir",
"discretization",
"e1071",
"extrafont",
"fansi",
"farver",
"fastAdaboost",
"feather",
"FSelector",
"gapminder",
"GGally",
"ggdendro",
"ggmap",
"ggrepel",
"ggridges",
"ggthemes",
"gplots",
"here",
"htmltab",
"iml",
"janitor",
"kableExtra",
"knitr",
"kohonen",
"latex2exp",
"lime",
"maps",
"mice",
"openintro",
"plotly",
"randomForest",
"ranger",
"rattle",
"RColorBrewer",
"rgdal",
"rpart",
"rpart.plot",
"seriation",
"skimr",
"tidytext",
"tidyverse",
"tsne",
"UsingR",
"xaringan",
"xgboost",
"viridisLite",
"wordcloud"
))
```
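After running the chunk above, it may be useful to see which packages, if any, are still missing. This is a minimal sketch under the assumption that comparing against `installed.packages()` is sufficient; the vector `course_packages` is a hypothetical, shortened stand-in for the full list above.

```r
# Hypothetical, shortened list for illustration; in practice, use the full
# character vector passed to pacman::p_load() above.
course_packages <- c("broom", "caret", "tidyverse")

# installed.packages() returns a matrix whose row names are package names.
missing_packages <- setdiff(course_packages, rownames(installed.packages()))

if (length(missing_packages) == 0) {
  message("All listed packages are installed.")
} else {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
}
```

Any package reported as not installed can then be installed individually with `install.packages()` to inspect its error message.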