7COM1073 Foundations of Data Science Assignment Help


7COM1073 Foundations of Data Science Assignment - University of Hertfordshire, UK

Assessment - Data Classification

Complete this assessment in Python (version 3 or above). You may use functions from the following packages: NumPy, pandas, Matplotlib, seaborn and scikit-learn (sklearn).

1. Information on the Data

Fozziwig's Software Developers have contracted you to explore the possibility of an automated software defect prediction system. They want to know if developing such a system would be cost-effective, based on the predictive accuracy that you can achieve with a sample of their data. Static code metrics are measurements of software features. They can be used to quantify various software properties which may potentially relate to defect-proneness, and thus to code quality. Examples of such properties and how they are often measured include: size, via lines of code (LOC) counts; readability, via operand and operator counts (as proposed by [1]); and complexity, via linearly independent path counts (this relates to the control flow graph of a program, and was proposed by [2]).

The data that you have been given contains the static code metrics for each of the functions which comprise a software system. This system was developed by Fozziwig's Software Developers several years ago. As well as the metrics for each function, it has also been recorded whether or not a fault was experienced in each function. This data came from the software testers who examined the system before it was publicly released.

You have been given two labelled data files: a training data set (trainingSet.csv) and a testing data set (testingSet.csv). Each data set contains 13 features, each one a software metric. The class label is in the last column of each file: a value of '+1' means 'defective' (the software module contained a fault), while a value of '-1' means 'non-defective'. Note that this is a simplification of the real world, as neither fault quantity nor fault severity has been taken into account.

Part A - Data pre-processing and data exploration

1) Use Pandas to load both trainingSet.csv and testingSet.csv.

2) Using Python, find the number of patterns (data points) in each class for both loaded data sets.

3) Choose an attribute and generate a boxplot for the two classes in the training set.

4) Draw one scatter plot of one feature against another, using the training set. The choice of the two features is yours.

5) Divide the original training set into a smaller training set (II) and a validation set. In this task, you need to use 55% of total training data points as the validation set.
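
The Part A steps above can be sketched as follows. This is a minimal illustration, not a model answer: it builds a synthetic stand-in for trainingSet.csv (the column names `metric_1` … `metric_13` and `label` are assumptions, since the real header row is not shown), and the stratified split with a fixed `random_state` is one reasonable choice rather than a requirement of the brief.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Synthetic stand-in for trainingSet.csv (assumed layout: 13 metrics, label last).
# With the real file you would use: train_df = pd.read_csv("trainingSet.csv")
rng = np.random.default_rng(0)
n = 200
labels = rng.choice([-1, 1], size=n, p=[0.8, 0.2])
features = rng.normal(size=(n, 13)) + 0.5 * (labels[:, None] == 1)
train_df = pd.DataFrame(features, columns=[f"metric_{i + 1}" for i in range(13)])
train_df["label"] = labels

# A2: number of patterns in each class.
class_counts = train_df["label"].value_counts()
print(class_counts)

# A3 and A4: boxplot of one attribute per class, and one feature against another.
fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
train_df.boxplot(column="metric_1", by="label", ax=ax_box)
ax_scatter.scatter(train_df["metric_1"], train_df["metric_2"],
                   c=train_df["label"], cmap="coolwarm", s=10)
ax_scatter.set_xlabel("metric_1")
ax_scatter.set_ylabel("metric_2")

# A5: hold out 55% of the training points as a validation set.
# Stratifying keeps the class ratio similar in both parts (a design choice).
X = train_df.iloc[:, :-1]
y = train_df.iloc[:, -1]
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, test_size=0.55, stratify=y, random_state=0
)
print(len(X_train2), len(X_val))
```

Because scikit-learn rounds the validation size up, the split may differ from an exact 55% by one data point.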


Part B - Do a principal component analysis

1) Perform PCA on the original training data set.

2) Plot a scree plot to report variances captured by each principal component.

3) Project the test set on the same PCA space produced by the original training dataset.

4) Plot two subplots in one figure: one for the training data in the PC1 and PC2 projection space and label the data in the picture according to its class; the other one for the test data in the same PCA space and label the data in the picture according to its class.
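
One possible shape for Part B is sketched below on synthetic data. The key point it illustrates is that PCA is fitted on the training set only and the test set is then projected with the same fitted object. Standardising the features first is an assumption made here (the brief does not ask for it), and the helper `make_data` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

def make_data(n):
    """Hypothetical stand-in for a loaded CSV: 13 features and +1/-1 labels."""
    y = rng.choice([-1, 1], size=n, p=[0.8, 0.2])
    X = rng.normal(size=(n, 13)) + 0.8 * (y[:, None] == 1)
    return X, y

X_train, y_train = make_data(300)
X_test, y_test = make_data(100)

# Fit the scaler and PCA on the training data only (scaling is an assumption).
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))

# Project both sets into the PCA space learned from the training set.
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# Scree plot: fraction of variance captured by each principal component.
fig_scree, ax_scree = plt.subplots()
components = np.arange(1, pca.n_components_ + 1)
ax_scree.plot(components, pca.explained_variance_ratio_, marker="o")
ax_scree.set_xlabel("Principal component")
ax_scree.set_ylabel("Explained variance ratio")

# Two subplots: training and test data in the PC1/PC2 plane, coloured by class.
fig_proj, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, Z, y, title in [(axes[0], Z_train, y_train, "Training set"),
                        (axes[1], Z_test, y_test, "Test set")]:
    for cls, colour in [(-1, "tab:blue"), (1, "tab:red")]:
        mask = y == cls
        ax.scatter(Z[mask, 0], Z[mask, 1], s=10, c=colour, label=f"class {cls:+d}")
    ax.set_title(title)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
```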

Part C - Do a classification using the Naïve Bayes Classification model

Train the model using the original training set and report the performance on the test set including accuracy rate.
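
A minimal sketch of this step is shown below. It assumes the Gaussian variant of Naïve Bayes (`GaussianNB`), which suits continuous metrics but is not named in the brief, and it runs on synthetic data produced by the hypothetical helper `make_data`. Because the classes are likely to be imbalanced, the confusion matrix is reported alongside accuracy.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)

def make_data(n):
    """Hypothetical stand-in for a loaded CSV: 13 features and +1/-1 labels."""
    y = rng.choice([-1, 1], size=n, p=[0.8, 0.2])
    X = rng.normal(size=(n, 13)) + 0.8 * (y[:, None] == 1)
    return X, y

X_train, y_train = make_data(300)
X_test, y_test = make_data(100)

# Train on the original training set, evaluate on the untouched test set.
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes and columns are predicted classes, ordered [-1, +1].
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])

print(f"Test accuracy: {accuracy:.3f}")
print(cm)
```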


Part D - Investigate how the number of features in the training dataset affects the model performance on the validation set

1) Use the training set (II) to train 13 Naïve Bayes Classification models, with 13 different feature sets. That is: the first one is to use the 1st feature only; the second one is to use the 1st and the 2nd features; the third one is to use the 1st, 2nd, and 3rd features, the fourth one is to use the first 4 features, and so on.

Measure the accuracy rate on both the training set and the validation set. Report the results by plotting them in a figure: that is, a plot of the accuracy rate against the number of features used in each model. There should be two curves in this figure: one for the training set (II); the other one for the validation set.

2) Report the number of features you consider best for this work and explain why you chose it. Write your answer in your Jupyter notebook.

3) Use the selected number of features to train the model and report the performance on the test set.
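
The Part D loop could look roughly like the sketch below, again on synthetic data with `GaussianNB` as an assumed model. Picking the feature count with the highest validation accuracy (smallest count on ties) is one simple selection rule, not the only defensible one; the test set is used only once, after the choice is made.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)

def make_data(n):
    """Hypothetical stand-in for real data: only the first 5 features carry signal."""
    y = rng.choice([-1, 1], size=n, p=[0.7, 0.3])
    X = rng.normal(size=(n, 13))
    X[:, :5] += 1.0 * (y[:, None] == 1)
    return X, y

X_train2, y_train2 = make_data(150)   # training set (II)
X_val, y_val = make_data(180)         # validation set
X_test, y_test = make_data(100)       # test set

train_acc, val_acc = [], []
feature_counts = list(range(1, 14))
for k in feature_counts:
    # Model k uses the first k features only.
    model = GaussianNB().fit(X_train2[:, :k], y_train2)
    train_acc.append(model.score(X_train2[:, :k], y_train2))
    val_acc.append(model.score(X_val[:, :k], y_val))

# Two curves: accuracy against the number of features used.
fig, ax = plt.subplots()
ax.plot(feature_counts, train_acc, marker="o", label="Training set (II)")
ax.plot(feature_counts, val_acc, marker="s", label="Validation set")
ax.set_xlabel("Number of features")
ax.set_ylabel("Accuracy rate")
ax.legend()

# np.argmax returns the first maximum, so ties favour fewer features.
best_k = feature_counts[int(np.argmax(val_acc))]

# Retrain with the chosen feature count and report test performance once.
final_model = GaussianNB().fit(X_train2[:, :best_k], y_train2)
test_acc = final_model.score(X_test[:, :best_k], y_test)
print(f"Best number of features: {best_k}, test accuracy: {test_acc:.3f}")
```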

Part E - Summarize your findings and write your conclusions, using critical thinking, in no more than 100 words. Write them in your Jupyter notebook.

Note - Submit a .ipynb file to show your completed Python code.
