ADVANCED STATISTICAL MODELLING FOR BIG DATA


0222400038
DEPARTMENT OF ECONOMICS AND STATISTICS
EQF7
STATISTICAL SCIENCES FOR FINANCE
2021/2022



COMPULSORY
YEAR OF COURSE 2
YEAR OF DIDACTIC SYSTEM 2014
FIRST SEMESTER
CFU    HOURS    ACTIVITY
10     60       LESSONS
Objectives
ACQUIRE (I) KNOWLEDGE OF THE ANALYSIS OF ADVANCED STATISTICAL MODELS USEFUL FOR UNDERSTANDING PROBLEMS AND IMPROVING DECISION-MAKING PROCESSES; (II) KNOWLEDGE OF ADVANCED STATISTICAL MODELS AND STATISTICAL LEARNING TOOLS USEFUL TO SUPPORT DECISIONS REGARDING PHENOMENA AND SYSTEMS WHERE LARGE AMOUNTS OF DATA, VARIABILITY AND UNCERTAINTY IMPLY A LEVEL OF COMPLEXITY UNMANAGEABLE WITH OTHER TECHNIQUES; (III) THE ABILITY TO ANALYZE AND INTERPRET DATA GENERATED BY COMPLEX PROCESSES, AND TO PRODUCE PREDICTIVE AND ANALYTICAL MODELS SUPPORTING THE CONTROL AND MANAGEMENT POLICIES OF A COMPANY, IN BOTH THE PUBLIC AND PRIVATE SECTORS. ALL STATISTICAL MODELS WILL BE PRESENTED AS BOTH PREDICTIVE AND ANALYTICAL TOOLS FOR GAINING A DEEP UNDERSTANDING OF PROBLEMS IN A GENERAL DECISION-MAKING PROCESS. IN PARTICULAR, STUDENTS WILL DEVELOP THE ABILITY TO SPECIFY, ESTIMATE AND VALIDATE A BROAD CLASS OF STATISTICAL MODELS APPLIED TO COMPLEX DATA STRUCTURES. SPECIFIC ATTENTION WILL BE GIVEN TO THE MODERN TOOLS AVAILABLE TO MANAGE AND ANALYZE BIG DATA AND TO THE STATISTICAL PROGRAMMING LANGUAGES AVAILABLE TO DEVELOP AND IMPLEMENT EFFECTIVE ANALYTICAL SOLUTIONS. SEVERAL CASE STUDIES WILL BE PRESENTED AND DISCUSSED TO DEVELOP STUDENTS' ABILITY TO APPLY THEIR KNOWLEDGE TO REAL PROBLEMS AND DATASETS.

KNOWLEDGE AND UNDERSTANDING
THE STUDENT WILL ACQUIRE KNOWLEDGE OF:
–THE MAIN ESTIMATION TECHNIQUES FOR LINEAR MODELS AND GENERALIZED LINEAR MODELS (GLM) FOR MASSIVE DATASETS (HIGH NUMBER OF OBSERVATIONS)
–THE MAIN PENALIZED ESTIMATION TECHNIQUES (RIDGE, LASSO AND ELASTIC NET) IN THE CONTEXT OF LINEAR MODELS AND GENERALIZED LINEAR MODELS (GLM) FOR HIGH-DIMENSIONAL DATASETS (HIGH NUMBER OF FEATURES)
–THE MAIN INFERENCE TECHNIQUES FOR SPARSE MODELS
–THE PACKAGES AVAILABLE IN THE R LANGUAGE FOR ESTIMATING LINEAR PREDICTIVE MODELS AND GLM IN THE PRESENCE OF BIG DATA
APPLYING KNOWLEDGE AND UNDERSTANDING
BASED ON THE KNOWLEDGE ACQUIRED, THE STUDENT WILL DEVELOP THE ABILITY TO:
–IMPLEMENT PREDICTIVE MODELS TO SUPPORT DECISIONS IN DIFFERENT AREAS
–USE THE R STATISTICAL LANGUAGE TO IMPLEMENT THE MODELS COVERED BY THE COURSE
–ANALYZE AND EVALUATE, AUTONOMOUSLY AND CRITICALLY, DOCUMENTS AND REPORTS BASED ON STATISTICAL MODELS FOR BIG DATA, MAKING CRITICAL JUDGMENTS ON HOW THE MODELS WERE SPECIFIED, ESTIMATED AND VALIDATED, ON THE INFERENCE TECHNIQUES AND PREDICTIVE MODELS USED, AND ON THE INTERNAL AND EXTERNAL VALIDITY OF THE CONCLUSIONS REACHED
–PRESENT THE RESULTS OBTAINED, BOTH ORALLY AND IN WRITING, CLEARLY, EFFECTIVELY AND IN APPROPRIATE LANGUAGE
STUDENTS WILL BE ENCOURAGED TO LEARN THE LOGICAL AND CONCEPTUAL STRUCTURE NECESSARY FOR THE DEVELOPMENT AND IMPLEMENTATION OF MODELS FOR BIG DATA, AND TO LINK THE SKILLS ACQUIRED WITH THOSE LEARNED IN RELATED COURSES.
Prerequisites
KNOWLEDGE OF MATRIX CALCULUS, BASIC PROGRAMMING, THE R STATISTICAL LANGUAGE, PROBABILITY AND STATISTICAL INFERENCE IS REQUIRED.
Contents
A SINGLE MODULE OF 60 HOURS (LM SCIENZE STATISTICHE PER LA FINANZA) OR 63 HOURS (LM DATA SCIENCE E GESTIONE DELL'INNOVAZIONE).
LINEAR PREDICTIVE MODELS. ESTIMATING LINEAR MODELS FOR MASSIVE DATASETS. ESTIMATING LINEAR MODELS ON DISTRIBUTED DATASETS. STATISTICAL MODEL ESTIMATION IN SPARK. ESTIMATING LINEAR MODELS FOR HIGH-DIMENSIONAL DATA. PENALIZED ESTIMATES. RIDGE AND LASSO REGRESSION FOR LINEAR MODELS. GENERALIZED LINEAR MODELS (GLM). GENERALIZATIONS OF THE LASSO. ELASTIC NET.
THE GROUP LASSO. THE FUSED LASSO. OPTIMIZATION METHODS FOR PENALIZED ESTIMATES. STATISTICAL INFERENCE: BOOTSTRAP, DEBIASED LASSO, POST-SELECTION INFERENCE. LINEAR MODELS AND GLM FOR BIG DATA IN R. PENALIZED ESTIMATES IN R. CASE STUDIES AND APPLICATIONS TO SIGNIFICANT PROBLEMS.
Teaching Methods
THE COURSE INCLUDES 60 HOURS (LM SCIENZE STATISTICHE PER LA FINANZA) OR 63 HOURS (LM DATA SCIENCE E GESTIONE DELL'INNOVAZIONE) OF CLASSROOM TEACHING. ALTHOUGH NOT MANDATORY, ATTENDANCE IS STRONGLY RECOMMENDED GIVEN THE NATURE OF THE COURSE.
THE LESSONS ADDRESS THEORETICAL ISSUES, CONSTANTLY SUPPORTED BY CASE STUDIES THAT CLARIFY HOW THE TECHNIQUES ARE IMPLEMENTED, THE CONTEXTS IN WHICH THE VARIOUS TOOLS ARE USED AND THE POSSIBLE INTERPRETATIONS OF THE RESULTS OBTAINED. THE EXERCISES THEREFORE FORM AN INTEGRAL PART OF THE SCHEDULED LESSONS.
Verification of learning
THE STUDENT WILL BE ASSESSED THROUGH A FINAL TEST HELD ON THE EXAM DATES SCHEDULED BY THE DEPARTMENT.
THE FINAL TEST CONSISTS OF A WRITTEN TEST (GRADED OUT OF THIRTY) AND AN ORAL TEST, TYPICALLY HELD IN THE DAYS IMMEDIATELY FOLLOWING. THE DATE OF THE WRITTEN TEST IS THE ONE SET IN THE DEPARTMENT CALENDAR; THE DAY OF THE ORAL TEST IS AGREED WITH THE STUDENTS AT THE END OF THE WRITTEN TEST.
THE WRITTEN TEST (ABOUT 2 HOURS) IS AIMED AT ASCERTAINING THE STUDENT'S ABILITY TO USE THE SOFTWARE TOOLS COVERED BY THE COURSE, TO APPLY THE EXPLORATORY AND INFERENTIAL STATISTICAL TECHNIQUES STUDIED, AND TO INTERPRET AND COMMENT ON THE STATISTICAL RESULTS OBTAINED. DURING THE WRITTEN TEST, THE STUDENT WILL RECEIVE AN EXAM PAPER AND WILL BE ASKED TO ANSWER 5 QUESTIONS (EACH WITH A MAXIMUM SCORE OF 6 POINTS) ON THE ENTIRE COURSE PROGRAM. THE ORAL TEST (ABOUT 30 MINUTES) CONSISTS OF AN INTERVIEW WITH QUESTIONS AND A DISCUSSION OF THE WRITTEN PAPER. THE FINAL MARK (MIN 18, MAX 30 WITH POSSIBLE HONORS) IS ASSIGNED BY EVALUATING THE RESULTS OF THE WRITTEN AND ORAL TESTS ON THE BASIS OF MASTERY OF THE COURSE CONTENT, APPROPRIATENESS OF THE DEFINITIONS AND THEORETICAL REFERENCES, CLARITY OF ARGUMENT AND COMMAND OF SPECIALIZED LANGUAGE.
THE EXAM DOES NOT INCLUDE INTERMEDIATE TESTS.
Texts
LECTURE NOTES, WEBSITES AND SUGGESTED PAPERS WILL BE MADE AVAILABLE BY THE INSTRUCTOR DURING SCHEDULED CLASSES.
–GENERALIZED LINEAR MODELS FOR INSURANCE DATA, PIET DE JONG, GILLIAN HELLER, CAMBRIDGE UNIVERSITY PRESS
–STATISTICAL LEARNING WITH SPARSITY, TREVOR HASTIE, ROBERT TIBSHIRANI, MARTIN WAINWRIGHT, CRC PRESS
More Information
THE INSTRUCTOR PROVIDES FURTHER EXPLANATIONS AND METHODOLOGICAL SUPPORT TO STUDENTS DURING OFFICE HOURS.
THE DAYS, TIMES AND PLACE OF THE OFFICE HOURS, AS WELL AS ANY CHANGES, ARE COMMUNICATED ON THE INSTRUCTOR'S WEB PAGE.
AN APPOINTMENT OUTSIDE THE SCHEDULED OFFICE HOURS CAN BE ARRANGED BY SENDING AN EMAIL TO THE INSTRUCTOR.