Data Collection and Data Analysis in Software Engineering

Schedule and location

Wednesday, November 21 - Thursday, November 22
New location: Koulutustila Seitsemäs Taivas, Yliopistonkatu 60 A, 7th floor, 33100 TAMPERE


Registration is open from September 1st to November 14th.


Full Professor Sandro Morasca, University of Insubria, Italy
Associate Professor Luigi Lavazza, University of Insubria, Italy



Assistant Professor (tenure track) Davide Taibi, Tampere University of Technology, Finland
Post-doc researcher Valentina Lenarduzzi, Tampere University of Technology, Finland




Software engineering is still a young research field, and errors still occur in the methods and techniques applied to empirically validate results. These errors are made not only by early-stage researchers; they also appear in papers published at the most important conferences and in the leading journals [1].

This seminar aims to provide comprehensive coverage of methods and techniques for collecting and analyzing data in software engineering. We will cover basic statistical methods, predictive models, and artificial intelligence approaches, including machine learning techniques. Students will learn how to apply these techniques to analyze different types of data collected during software development processes and how to draw conclusions from these data. The seminar will be conducted by three of the leading professors in the areas of empirical software engineering and machine learning analysis. Two hands-on sessions will help students directly apply the techniques explained during the course: one applies a set of statistical techniques (linear and logistic regression), and the other applies classification techniques (Random Forest) to a set of data mined from open source repositories.



The topics covered in the seminar include:

  • Data collection and packaging approaches in software engineering
    • Survey
    • Case study
  • Data analysis techniques in software engineering: which method should be applied in which context?
    • Statistical methods and measurement
    • Machine learning techniques
    • Pattern recognition
  • Drawing conclusions from results
    • How to interpret the results obtained from the techniques applied

Detailed Program

Day 1 (21.11.2018)

9:30 – 9:45 Introduction (Davide Taibi / Valentina Lenarduzzi)

9:45 - 12:30 Data Collection in Software Engineering (Luigi Lavazza)

-Metrics definition based on a given goal
-The GQM (Goal/Question/Metric) approach to identifying a set of measures to be collected

12:30 - 13:30 Lunch

13:30 - 16:30 Data management in R (Luigi Lavazza)
-Importing data from files or from databases into the R environment
-Main data structures supported by R (data frames, matrices, lists, etc.)
-Visual exploration of collected data with R

16:30 – 16:45 Summary of the first day (Davide Taibi)


Day 2 (22.11.2018)

9:30 – 9:45 Introduction (Davide Taibi / Valentina Lenarduzzi) 

9:45 - 12:30 Data Analysis Techniques in Software Engineering (Sandro Morasca)
-Data analysis techniques (preconditions, outcomes, and usefulness)
-Measurement scales to show how data of different kinds should be properly analyzed
-Descriptive and association statistics

12:30 - 13:30 Lunch

13:30 - 16:30 Prediction Model Building and Validation (Sandro Morasca)
-Statistical bases of predictive model building
-Basic, traditional and innovative techniques for the prediction of continuous and discrete variables
-Model validation

16:30 – 16:45 Summary of the second day (Davide Taibi)



Credit points

Doctoral students participating in the seminar can obtain 2 credit points. This requires attending on both days and completing the assignments.

Registration fee

This seminar is free of charge for the staff of member organizations and their PhD students. For others, the participation fee is 400 €. The participation fee includes access to the event and the event materials. Lunch and dinner are not included.