Advanced Data Analysis (ADA) course (code 27084)
Lecturer: Dr. Paolo Coletti
Paolo.Coletti@unibz.it
Office E203. Office hours: www.paolocoletti.it/timetable
Website: www.paolocoletti.it/advanceddataanalysis
At least two days before the end of the course
Do a two-sided Wilcoxon test for paired data (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion). If you do not have two variables which use the same measure, use Bolner's dataset.
Do a one-sided Wilcoxon test for paired data, peeking at the sample average of differences and choosing the most promising side for H1 (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion). If you do not have two variables which use the same measure, use Bolner's dataset.
Open your old files and consider one test that you did in the past for each of these types:
 Student t test for a single variable
 Student t test for a variable by groups
 ANOVA
 Student t test for paired data
and check, using four different methods (one for each), the normality prerequisites. Write what your conclusion implies for the test.
Consider one of your Pearson's correlation tests and check the normality of the two variables involved, using a different graphical method for each one (QQ plot and histogram).
Word file is here
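As a rough illustration of what the paired Wilcoxon exercises above compute, here is a minimal sketch in plain Python (not the R commands used in class). It assumes a small sample, because the exact p-value is obtained by enumerating every possible assignment of signs; the function names and the sample numbers are invented for this example.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after, alternative="two-sided"):
    """Return (W+, exact p-value) for paired data; zero differences are dropped.

    alternative="greater" means H1: 'after' tends to be larger than 'before'.
    Only suitable for small samples (2**n sign patterns are enumerated).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under H0 every difference is positive or negative with probability 1/2.
    totals = [
        sum(r for r, s in zip(ranks, signs) if s)
        for signs in product((0, 1), repeat=len(ranks))
    ]
    centre = sum(ranks) / 2
    if alternative == "greater":
        hits = sum(t >= w_plus for t in totals)
    elif alternative == "less":
        hits = sum(t <= w_plus for t in totals)
    else:
        hits = sum(abs(t - centre) >= abs(w_plus - centre) for t in totals)
    return w_plus, hits / len(totals)


# Invented example data: the same five subjects measured twice.
before = [10, 12, 9, 14, 11]
after = [12, 15, 10, 18, 16]
print(wilcoxon_signed_rank(before, after))
print(wilcoxon_signed_rank(before, after, alternative="greater"))
```

With only five pairs, even the most extreme outcome cannot give a two-sided p-value below 2/32, which is one reason why a larger dataset gives more meaningful results.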
For 13 December 2019
Something I forgot from last time: Do a chi-square test using ONE non-binary nominal variable (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do an ANOVA test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a Kruskal-Wallis test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a two-sided Pearson correlation test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a one-sided Pearson correlation test, peeking at your sample Pearson correlation and choosing as H1 the most promising side (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a two-sided Spearman correlation test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a one-sided Spearman correlation test, peeking at your sample Spearman correlation and choosing as H1 the most promising side (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a two-sided Student's t test for paired data (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion). If you do not have two variables which use the same measure, use Bolner's dataset.
Do a one-sided Student's t test for paired data, peeking at the sample average of differences and choosing the most promising side for H1 (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion). If you do not have two variables which use the same measure, use Bolner's dataset.
Word file is here
For 9 December 2019 (for 13 December if you have to celebrate Saint Nicholas)
Do a chi-square test using two binary nominal variables (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a chi-square test using two nominal variables, one of which has several categories (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Recode one of the two variables (or both) to have fewer categories and avoid the problem of expected frequencies less than 5. Then redo the test.
Subset the sample to eliminate one or more categories, so as to have fewer categories and avoid the problem of expected frequencies less than 5. Remember to drop unused factor levels. Then redo the test.
Go back to your original dataset.
Do a two-sided Student's t test for two populations (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a one-sided Student's t test for two populations (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a Student's t test for two populations, one-sided on the other side (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a two-sided Mann-Whitney test for two populations (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a Mann-Whitney test for two populations, one-sided on the other side (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Word file is here
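The "expected frequencies less than 5" problem mentioned in the chi-square exercises above can be checked by hand. The following is a small sketch in plain Python (an illustration only, not the R procedure used in class); the function names and the two tables are invented for the example.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def cells_below(table, threshold=5):
    """Positions (row, column) whose expected frequency is below the threshold."""
    expected = expected_frequencies(table)
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < threshold
    ]


# Invented 2x2 contingency tables (rows and columns are categories, cells are counts).
balanced = [[10, 20], [30, 40]]
sparse = [[2, 8], [3, 87]]

print(expected_frequencies(balanced), chi_square_statistic(balanced))
print(cells_below(sparse))
```

When `cells_below` returns a non-empty list, recoding or subsetting to merge or remove the rare categories, as the exercises ask, is one way to get rid of the small expected counts.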
For 6 December 2019
Redo all the commands I wrote in class using different numbers, different mathematical formulas, different variable names. The script is here
Do a two-sided Student's t test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion) on a scale variable.
Do a one-sided Student's t test (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion) on a scale variable.
Redo the same one-sided Student's t test but on the opposite side (write everything: hypotheses, significance, sample statistic, reject/accept and conclusion).
Do a Sign test on a scale variable.
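For reference, the sign test in the last exercise only counts how many values fall above and below a hypothesized median and compares that count with a Binomial(n, 1/2) distribution. A minimal sketch in plain Python follows (illustrative only; the function name, the data and the hypothesized median are made up).

```python
from math import comb


def sign_test(values, median0, alternative="two-sided"):
    """Exact sign test of H0: population median == median0.

    Values equal to median0 are ignored. Returns (number above, p-value).
    alternative="greater" means H1: median > median0.
    """
    above = sum(v > median0 for v in values)
    below = sum(v < median0 for v in values)
    n = above + below

    def binomial_tail(k_from, k_to):
        # P(k_from <= X <= k_to) for X ~ Binomial(n, 1/2)
        return sum(comb(n, k) for k in range(k_from, k_to + 1)) / 2 ** n

    upper = binomial_tail(above, n)  # P(X >= above)
    lower = binomial_tail(0, above)  # P(X <= above)
    if alternative == "greater":
        p_value = upper
    elif alternative == "less":
        p_value = lower
    else:
        p_value = min(1.0, 2 * min(upper, lower))
    return above, p_value


# Invented scale variable and hypothesized median.
ages = [5, 6, 7, 8, 9, 10, 11, 12]
print(sign_test(ages, 4.5))
print(sign_test(ages, 4.5, alternative="greater"))
```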
For 3 December 2019
Subset your dataset using a condition on a nominal/ordinal variable with several categories, keeping at least two categories. Then drop unused factor levels.
Subset your dataset using a condition on a scale variable.
Recode a nominal/ordinal variable with several categories into one with few categories, possibly also using else.
Recode a scale variable into a nominal one and convert it into an ordered factor (do you still remember how to do it?).
Bin a scale variable into an ordinal one (you do not need to make an ordered factor).
Bin another scale variable using different binning and labelling methods.
Compute a new variable from two or more other variables using the graphical interface.
Compute a new variable from other variables without using the graphical interface.
This time instead of sending me the Word file, send me the script (you can save the script from the R commander interface or copy and paste everything into the email or wherever you want).
If you need the video recorded in class, it is this one.
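To make the ideas of recoding "with else" and of binning concrete, here is a tiny sketch in plain Python rather than R commander (an illustration under assumptions: the function names, labels and data are invented, and the bins here are equal-width and closed on the left, which is only one of several possible binning methods).

```python
def recode(values, mapping, other="other"):
    """Map listed categories to new ones; everything else goes to 'other' (the 'else' case)."""
    return [mapping.get(v, other) for v in values]


def bin_equal_width(values, labels):
    """Cut a scale variable into len(labels) equal-width bins and return the bin labels."""
    lowest, highest = min(values), max(values)
    n_bins = len(labels)
    width = (highest - lowest) / n_bins
    if width == 0:
        # All values identical: everything falls in the first bin.
        return [labels[0] for _ in values]
    binned = []
    for v in values:
        # The maximum would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - lowest) / width), n_bins - 1)
        binned.append(labels[index])
    return binned


# Invented data.
transport = ["bus", "car", "bike", "tram"]
print(recode(transport, {"bus": "public", "tram": "public"}, other="private"))

income = [0, 5, 10, 15, 20, 30]
print(bin_equal_width(income, ["low", "medium", "high"]))
```

Other binning choices, such as bins with equal numbers of cases, only change how `index` is computed; the labelling step stays the same.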
For 29 November 2019
Boxplots for a scale and a nominal variable with comments.
Histograms for a scale and a nominal variable with comments.
Pearson and Spearman correlation for two scale variables with comments.
Again, Pearson and Spearman correlation for two scale variables with comments.
Scatterplot for two scale variables with comments, changing the symbol to another bizarre one and using the log axis IF NECESSARY.
Contingency table for three nominal variables with row percentages with comments and then with column percentages with comments.
Scatterplot of two scale variables by a nominal variable with comments.
The Word file I built in class is this one.
If you need the video, it is this one from minute 18.
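The difference between the Pearson and Spearman correlations asked for above can be seen in a few lines: Spearman is simply Pearson computed on the ranks. The sketch below is plain Python for illustration only (function names and numbers are invented; both variables are assumed to have some variation, otherwise the denominator is zero).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

In this example the relation is perfectly increasing but curved, so Spearman is exactly 1 while Pearson is slightly below 1; this is the kind of observation that belongs in the comments.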
For 22 November 2019
Contingency table for two binary nominal variables with row percentages with comments and then with column percentages with comments.
Contingency table for a non-binary nominal variable and another nominal variable (your choice) with row percentages with comments and then with column percentages with comments.
Stacked bar chart for a binary nominal variable and another nominal variable (your choice) with comments.
Side-by-side bar chart for a non-binary nominal variable and another nominal variable (your choice) with comments.
Numerical summaries for a scale variable grouped by a non-binary nominal variable with comments.
The Word file I built in class is this one.
If you need the video, it is this one up to minute 18. The video I recorded in class on contingency tables is this one
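Row and column percentages answer different questions, which is why the exercises above ask for both. A short sketch in plain Python (illustrative only, with invented variable names and data) shows how each is derived from the same table of counts.

```python
from collections import Counter


def contingency_table(row_values, col_values):
    """Count the cases for every combination of the two nominal variables."""
    counts = Counter(zip(row_values, col_values))
    row_levels = sorted(set(row_values))
    col_levels = sorted(set(col_values))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def row_percentages(table):
    """Each cell as a percentage of its row total (every row sums to 100)."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def column_percentages(table):
    """Each cell as a percentage of its column total (every column sums to 100)."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100 * cell / total for cell, total in zip(row, col_totals)] for row in table]


# Invented data: ten cases with two binary nominal variables.
gender = ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"]
smoker = ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "no"]

rows, cols, table = contingency_table(gender, smoker)
print(rows, cols, table)
print(row_percentages(table))     # e.g. share of smokers within each gender
print(column_percentages(table))  # e.g. share of each gender among smokers
```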
For 14 November 2019
Numerical summaries for a scale variable with comments.
Frequencies for a binary nominal variable with comments.
Frequencies for a non-binary nominal variable (with no more than 10 categories!) with comments.
Bar graph for a nominal variable with comments.
Pie chart for a non-binary nominal variable (with no more than 10 categories!) using different colours chosen by you with comments.
Histogram for a scale variable with appropriate binning and, in case you need it, setting xlim appropriately. With comments obviously.
Boxplot for another scale variable with a colour and with comments.
Index plot for a scale variable.
The Word file I built in class is this one. The script with the commands I typed is this one.
If you need the video, it is this one.
For the potential students
If you are lazy, do not read below and just watch this video here
Starting from this academic year this course has become a free choice course for students of the Master in Entrepreneurship and Innovation as well as for the students of the Master in Accounting and Finance. Therefore, to better accommodate your needs, I will adjust the timetable depending on the students who plan to attend. So, if you choose this course and plan to attend it, either come to the first lesson or drop me an email.
The course in brief
This course gives you the skills to organize and analyse data in a rigorous and scientific way. You might have met this kind of skill in other courses, but probably either too theoretically or as a side topic. In this course you will learn the vital skill of taking raw data and extracting useful information from them. The whole course is hands-on, i.e. we will focus on examples, exercises and assignments rather than on theory. At the end of the course you will be able to gather unstructured data, put them into a usable format, analyse them to find relations and interesting features, and present your results.
If you plan to study regularly during the course, you can skip 94% of the exam study load. For the first part you are evaluated on a group assignment without the need for mandatory attendance, while for the second part you are evaluated on regular coursework and oral presentations of your home activities (mandatory attendance either in person or via Skype is required).
If you cannot study regularly during this course, videos of all my lessons are provided on YouTube. However, consider doing the assignment for the first part anyway, since it does not require attendance and you can skip 46% of the exam study load.
Course
ADA course syllabus A.Y. 2019/20
ADA course content A.Y. 2019/20
Prerequisites
In order to follow this course correctly, each student is required to have previous knowledge of these topics:
Course content
Organization
How to study for this course
This course is different from the majority of courses you are used to and therefore you need to adapt your study strategy.
If you plan to take the database assignment and the R coursework, then you will automatically study through practice and no special study strategy is required.
However, if you plan not to attend, take note that this course is much more technical than theoretical and it is strictly sequential. This means that you either attend all the lessons (or compensate for missing lessons by immediately watching the corresponding videos or reading the book) or it is really not worth coming sporadically: it is better to stay home and study everything from the videos. For the exam the main difference with respect to other courses is that you have to practise much more than study. The content of this course is easy and does not need extensive study; however, it is only with practice that you become skilled enough to know immediately what to do without wasting time.
Exam (if you plan to do the database assignment and the R coursework, you will do only the fourth part, worth 6%)
The exam is divided into four parts.
The first part (23% of the final grade) consists of written exercises on relational database architecture and lasts 40 minutes, but this may vary according to the complexity of the exercises. This part is completely on paper and "closed book": no paper or electronic help or tool is allowed.
The second part (23% of the final grade, file handling errors count negatively towards the final grade) consists of exercises on Access and lasts approximately 20 minutes depending on the length and complexity of the exercises. It is held in the computer room in turns of 25 students each. This part is totally open book: you may use any written or electronic document (including books, previous exams with solutions, your personal handwritten or electronic notes and my slides). You may not, however, use any communication program or device.
The third part (48% of the final grade, file handling errors count negatively towards the final grade) consists of statistical data analysis and graph creation with R and lasts approximately 40 minutes depending on the length and complexity of the exercises. It follows the same rules as the second part: it is held in the computer room in turns of 25 students each and is totally open book, so you may use any written or electronic document (including books, previous exams with solutions, your personal handwritten or electronic notes and my slides), but you may not use any communication program or device. In case there are more than 25 students, turns will appear on this website as soon as enrolment is closed; if you have specific timetable needs, please write an email to Dr. Coletti as soon as possible, before the timetable appears.
Important warning: the crucial point of the practical parts is time. If you never practise with the programs, you will still be able to do the exam but you will waste a lot of time looking for the commands or wondering what you should do, and you will not finish it in time. Only with exercises will you be fast enough to complete it in the indicated time. It is a good idea to practise against the clock on the previous exams that you find below.
The fourth part (6% of the final grade) consists of written open questions on cryptocurrencies and blockchain technology. For the students who already did this part in one of my other courses, the question is instead "What is a statistical test? Explain it using also a numerical example different from the one done in class". It lasts approximately 20 minutes. This part is completely on paper and "closed book": no paper or electronic help or tool is allowed.
The exam's grade is a weighted average of the four parts, with file handling errors not participating in the average but counting negatively. Active participation in class, in particular interventions to improve the course (spotting errors or giving very good suggestions), can slightly increase the final grade by up to 3/30. It is not necessary that all parts be sufficient to pass the exam: only the weighted average must be sufficient. As the exam is indivisible from an administrative point of view, it is not possible to split it or to save a part and redo another: all parts must be taken in the same session. Only the assignments last until the last session of the academic year.
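As a quick check of how the four weights above combine, here is a small sketch in plain Python. It assumes that each part is graded out of 30 and it deliberately ignores file handling penalties and participation bonuses, whose exact amounts are not stated here; the part names and the sample grades are invented.

```python
# Weights of the four exam parts as stated above (they sum to 100%).
WEIGHTS = {
    "databases": 0.23,         # first part: relational database architecture
    "access": 0.23,            # second part: Access exercises
    "r": 0.48,                 # third part: statistical analysis with R
    "cryptocurrencies": 0.06,  # fourth part: open questions
}


def weighted_grade(part_grades):
    """Weighted average of the four parts (assumed to be graded out of 30)."""
    return sum(WEIGHTS[part] * part_grades[part] for part in WEIGHTS)


# Invented grades: one weak part can be compensated by the others.
grades = {"databases": 20, "access": 24, "r": 28, "cryptocurrencies": 18}
print(round(weighted_grade(grades), 2))
```

Because the R part weighs almost half of the total, a good or bad result there moves the final grade far more than the fourth part does.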
Database assignment
All students have the possibility to do an assignment which will replace the database and Access parts of the exam following these rules.
0. The assignment can be done in groups of 1, 2, 3, 4 students, you decide group's composition.
1. Each group must submit to Dr. Paolo Coletti via email the group composition and a database proposal. The database proposal must contain the database description and a draft version of the schema, containing the main tables, relations and fields involved. The proposal is part of the assignment's evaluation. I expect your database to be different from the ones of previous years, the ones used in class and the many ones I gave in exams in these years. However, if you still want to choose a database similar to these, I want your structure to be different. Thus, looking at my solution and trying to twist my structure deliberately might be suicidal; it is much better not to look at my solution.
2. Dr. Paolo Coletti answers with the crucial corrections and suggestions.
3. The group builds a database in Access. The database must contain at least
Students in the group                          1 or 2      3           4
Total tables (junction and non-junction)       6           8           10
Junction tables                                2           2           3
Fields (not counting foreign keys)             20          25          30
Validation rules                               3           4           5
Table validation rules                         2           2           3
Different summary queries                      3           4           5
Different non-selection queries                3           4           5
Different queries with left/right join         2           3           4
2 different queries involving at least         3 tables    4 tables    5 tables
Forms with locking using at least 2 tables     2           3           4
Reports based on a query with grouping         2           3           4
In the previous queries, at least one must contain a condition, at least one must contain a formula, at least one must contain a function, at least one must ask the user for something, and at least one must be based on another query. If this is not the case, add other queries with these things.
Tables must use the Required and Index features wherever appropriate.
The database must be filled in with enough data to make queries, forms and reports return something meaningful.
4. The group writes the database documentation.
5. The database and documentation must be submitted to Dr. Paolo Coletti at least a couple of days before the last lessons (but also much earlier if you prefer). Database and documentation are part of assignment's evaluation.
6. The entire group must present the database before the end of the course. The presentation is rather brief and straightforward. Dr. Paolo Coletti chooses who presents which part and who answers each question. Presentation and answering are part of the assignment's evaluation.
7. During the presentation other students are kindly invited to present observations and questions. Do not worry about lowering your colleagues' evaluation by pointing out mistakes: any error that I have not spotted myself will not be considered in the evaluation. You receive a small increase in the exam's final grade for each competent and correct observation.
8. Assignment's grades are, unless a tragic presentation is made, the same for all the students of the group. Grade is based on: proposal, database, documentation, presentation, answering.
9. If you have an assignment's grade of at least 60% you automatically use this grade as database+Access grade and you do not receive those parts' papers during the exam. This grade lasts until the last session of current academic year. If you prefer NOT to use your assignment's grade for an exam's session, you MUST write an email to Dr. Paolo Coletti AT LEAST 7 days before, no exceptions.
What not to do:
 submit a proposal in a hurry just to do something: you are reducing your assignment's grade
 submit partially (only database for example): I do not make puzzles of different submissions, everything missing when I correct will be considered missing
 resubmit with modifications: they will not be accepted if I have already corrected
 plan to work on the last days: if something goes wrong, you might be unable to submit everything.
R coursework
This coursework is a continuous activity which requires regular attendance and constant commitment, following these rules:
0. It can be done only alone.
1. Each student must attend each lesson from October to January. You have only 1 possibility to skip a lesson, no exception. You can ask to attend the lesson via Skype (checking that the connection works is your responsibility).
2. As soon as we start the R part, each student must find a suitable dataset different from the one used in class and submit it to me before the next R lesson. The dataset must come from real data (you can add a few invented variables or a few invented cases, if there are not enough) and possibly not from a questionnaire on marketing, customer satisfaction or health condition. If it comes from a questionnaire it must be very, very different from the example I use in class, otherwise I shall reject your dataset. On the web there are really tons and tons of datasets either not from questionnaires or originating from very different surveys.
It must be an RData file with:
 no more than 20 variables
 at least 3 binary nominal variables, already set to Factor with appropriate labels
 at least 3 non-binary nominal variables, already set to Factor with appropriate labels
 at least 3 scale variables (no dates nor times please)
 at least 3 ordinal variables, already set to Ordered Factor with appropriate labels
 at least 50 cases, better at least 100 to get more meaningful results during your statistical analyses
 all missing cases handled appropriately
 a codebook in a separate file (Word, PDF or text).
The quality of this dataset and codebook is part of the evaluation.
3. After each lesson, you submit the same content of the lesson but applied to your dataset with the same comments and considerations I did in class. You have time until the next lesson starts.
4. At the beginning of each lesson, a random person is selected to present publicly the homework, but done on another student's dataset.
5. After this is done, if you want to go away and not attend my part of the lesson you may do it without penalties (you can watch the videos).
6. Your evaluation depends on the quality of your dataset and codebook, on the assignments you regularly submit and on the presentation of the previous lesson on another person's dataset.
7. If your evaluation is at least 60% you automatically use this grade as R grade and you do not receive that part's papers during the exam. This grade lasts until the last session of the current academic year. If you prefer NOT to use your coursework's grade for an exam session, you MUST write an email to Dr. Paolo Coletti AT LEAST 7 days before, no exceptions.
There is no 60% minimum per exercise at the exams, this rule applies only to assignment and coursework.
Differences from A.Y. 2018/19
If you attended the course last year there are no important differences.
Study resources
Topic  Lessons' slides  Videos as support to attendance  Books  Further readings

Precourse  Go down here 


Relational database architecture 
Relational databases  Go down here  Databases course book (this book includes the theoretical part, as well as the practical part covered by the videos) 
• Allen G. Taylor, Database Development For Dummies, For Dummies, 2000, ISBN 978 0764507526 
Access 
• Sams Teach Yourself Microsoft Office Access 2003 in 24 Hours, Alison Balter, ISBN 0672325454 

Cryptocurrencies and blockchain technology  Slides in PDF and in PPTX  Go down here 


Statistical analysis with R  Statistical Analysis with R  Go down here  Data analysis course book (chapters 1-6, 9-14) 
To begin: 
Files and programs used in class  Last updated 

MyFarm database used in Databases course book, library, studentsandexams, studentsandexams2 databases  
Northwind 2003 database for Access 2007 and 2010 and 2013 and 2016 

My dataset, your datasets and files for R distributed in class. Tiny URL: https://tinyurl.com/adazip  22 November 2017 
Rportable, portable and already configured  5 November 2017 
R Commander menus in Italian and English (for students with R commander in Italian)  18 April 2013 
Videos of lessons
precourse.avi 97 MB 
Precourse for Windows 10 on unibz network and file handling, first part.  
precourse 02.avi 74 MB 
Precourse for Windows 10 on unibz network and file handling, second part.  
precourse Mac 01 176 MB 
Precourse for Mac on unibz network and file handling, first part.  
precourse Mac 02 173 MB 
Precourse for Mac on unibz network and file handling, second part.  
databases01.avi 107 MB 
Single table database in normal form, primary key, information redundancy, empty fields, one-to-many and many-to-one relations, foreign key. 

databases02.avi 83 MB 
One-to-one relations, many-to-many relations, junction table, temporal databases. 

databases03.avi 122 MB 
Junction tables for more relations, details table, foreign keys with more relations, orphans and referential integrity, hierarchical structure, process structure. Suggestions for database design. 

access01.avi 63 MB 
Northwind database overview. Access overview, Saving operations. Tables, field types, primary key. Queries, query wizard, design view, sorting, criteria. 

access02.avi 65 MB 
Using other fields for criteria, asking for values, virtual fields, expression builder, functions DateDiff, DateAdd, Year, Between. 

access03.avi 133 MB 
Summary queries, examples, “where” and “group by” 

access04.avi 51 MB 
Non-selection queries. Left/right joins. Database documentation. 

A decentralised currency, basic cryptography, Bitcoin history and technology, blockchain technology, advantages and criticisms  
R01 
R overview: portable version, installing packages, loading packages, R commander, saving script, output, workspace, loading workspace.  
Installing R and R commander on Mac 
Mac users, get ready for this painful procedure!  
R02 
Questionnaires. Variables: scale, nominal, ordinal, Likert scale. Missing values: NA, NaN. Build, import and set up a dataset.  
R03 
Descriptive statistics for one nominal variable, for one scale variable. Graphs for one nominal variable: column plot, pie chart, radar graph, bar plot, line plot, area plot, 3D. R: color palette, bar plot, pie chart. Graphs for one scale variable: histogram, box plot, plot case by case. R: histogram, box plot, index plot.  
R04 
Descriptive statistics for two variables: contingency table, row and column percentage, statistics by groups, Pearson correlation, Spearman correlation. Graphs for two variables: clustered column plot, stacked column plot, 3D column plot, boxplots by groups, histograms by groups, scatterplot, mathematical graph. Three variables: surface plot, bubble chart, scatterplot by groups.  
R05 
Restrict data set, remove cases with missing values, binning, recode variables, massive recoding, compute new variables. Basic vector operations.  
R06 
Statistical tests: sample and population, hypotheses, significance. Student's t test for one variable. Chi-square test for a one-dimensional contingency table.  
R07 
Chi-square test for a two-dimensional contingency table, Student's t test for two populations, prerequisite, Mann-Whitney test, ANOVA, Kruskal-Wallis test, correlations' tests, when testing difference of two scale variables, Student's t test for two paired variables, Wilcoxon signed-rank test. Normality: histogram with normal curve, QQ plot, skewness and excess kurtosis, Shapiro-Wilk test.  
R08 
Additional video for A.Y. 2016/17. Checking normality prerequisite, one-sided tests, sign test, subsetting while removing categories.  
networkfolders.avi 24 MB 
This short video illustrates how to reach unibz network folder \\ubz01fst (which contains course_coletti and your own personal stuff) using VPN when you are connected from outside university or when you are connected using wifi. 
Exam
Before the exam:
Frequently Asked Questions
Q: I have a technical issue with my computer or need to install/configure something.
A: Considering that this year there are very few students, just come to one of my several office hours, also those of other subjects (http://www.paolocoletti.it/timetable), and we will fix it together.
Q: I may not enrol online for technical or administrative reasons or I forgot to enrol. Can I do the exam anyway?
A: No, I may not let non-enrolled students take part in the exam. Do not ask me to do illegal things! Ask the secretary whether there is something they can do.
Previous exams
Session  Notes  Exam link  Solution link  Video solution
July 2019 21
January 2019 20
January 2018 18
July 2017 17
February 2017 16
Autumn 2016 15
Summer 2016 14
Winter 2016 13
prototype for AY 2015/16 12
prototype for AY 2015/16 11
prototype for AY 2015/16 10
Old exams, they are good only for database design and, in part, for R
Autumn 2015 09  old exam
Summer 2015 08  old exam
Winter 2015 07  old exam
Autumn 2014 06  old exam
Winter 2014 05  old exam
Autumn 2013 04  old exam
Summer 2013 03  old exam
prototype 02  old exam
prototype 01  old exam
This page is maintained by Paolo Coletti.