Foundations of Data Science Assignment Help

7COM1073 Foundations of Data Science Assignment Help - Need the Best Foundations of Data Science Assignment Help? Approach Qualified UK Tutors and Secure Good Grades!

7COM1073 Foundations of Data Science Assignment - University of Hertfordshire, UK

Assessment - Data Classification

The programming language you should use to complete this assessment is Python (version 3 or above). You can use functions from the following packages: NumPy, Pandas, Matplotlib, Seaborn, and Sklearn.

1. Information on the Data

Fozziwig's Software Developers have contracted you to explore the possibility of an automated software defect prediction system. They want to know if developing such a system would be cost-effective, based on the predictive accuracy that you can achieve with a sample of their data. Static code metrics are measurements of software features. They can be used to quantify various software properties which may potentially relate to defect-proneness, and thus to code quality. Examples of such properties and how they are often measured include: size, via lines of code (LOC) counts; readability, via operand and operator counts (as proposed by [1]); and complexity, via linearly independent path counts (this relates to the control flow graph of a program, and was proposed by [2]).

The data that you have been given contains the static code metrics for each of the functions which comprise a software system. This system was developed by Fozziwig's Software Developers several years ago. As well as the metrics for each function, it has also been recorded whether or not a fault was experienced in each function. This data came from the software testers who examined the system before it was publicly released.

You have been given two labelled data files, a training data set (trainingSet.csv) and a testing data set (testingSet.csv). Each data set contains 13 features (each one a software metric). Class labels are shown in the last column of each file: a value of '+1' means 'defective' (the software module contained a defect (fault)), while a value of '-1' means 'non-defective'. Note that this is clearly a simplification of the real world, as neither fault quantity nor fault severity has been taken into account.

Part A - Data pre-processing and data exploration

Answer: Data pre-processing and data exploration are two crucial phases of the data analysis pipeline. Pre-processing involves cleaning, transforming, and preparing raw data for analysis, which includes resolving missing values, handling outliers, and converting data types; this stage ensures that the data is accurate, complete, and appropriately formatted. Data exploration, by contrast, applies statistical summaries and visualizations to understand the underlying relationships, patterns, and distributions in the data; it helps identify correlations, trends, and anomalies, and suggests hypotheses for further investigation. Together, the two stages provide a strong basis for analysis: by carefully preparing and examining the data, researchers can pinpoint possible quality issues early and draw more reliable, meaningful conclusions. A short code sketch covering the five tasks below follows the task list.

1) Use Pandas to load both trainingSet.csv and testingSet.csv.

2) Find the number of patterns in each class for both loaded data sets using Python.

3) Choose an attribute and generate a boxplot for the two classes in the training set.

4) Show one scatter plot, that is, plot one feature against another. Which two features to use is your choice. You need to use the training set.

5) Divide the original training set into a smaller training set (II) and a validation set. In this task, you need to use 55% of total training data points as the validation set.
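
A minimal sketch of these five steps might look like the following. The file names come from the brief; the chosen columns and the random_state value are purely illustrative, and the label column is assumed to be the last column of each CSV, as the brief states.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# 1) Load both data sets
train_df = pd.read_csv('trainingSet.csv')
test_df = pd.read_csv('testingSet.csv')

# 2) Number of patterns in each class (labels +1 / -1 in the last column)
label_col = train_df.columns[-1]
print(train_df[label_col].value_counts())
print(test_df[label_col].value_counts())

# 3) Boxplot of one chosen attribute for the two classes (training set)
feature = train_df.columns[0]   # an arbitrary choice of attribute
train_df.boxplot(column=feature, by=label_col)
plt.show()

# 4) Scatter plot of one feature against another (training set)
train_df.plot.scatter(x=train_df.columns[0], y=train_df.columns[1])
plt.show()

# 5) Split the original training set: 45% training set (II), 55% validation
X = train_df.drop(columns=label_col).to_numpy()
y = train_df[label_col].to_numpy()
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, test_size=0.55, random_state=0, stratify=y)
```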

Part B - Do a principal component analysis

Answer: Principal Component Analysis (PCA) is a statistical method used to reduce the dimensionality of a large dataset while retaining most of the information. It involves transforming the original variables into a new set of uncorrelated variables called principal components, which are ordered by the amount of variance they explain. To perform PCA, I would:
1. Standardize the data to have zero mean and unit variance.
2. Compute the covariance matrix of the standardized data.
3. Calculate the eigenvectors and eigenvalues of the covariance matrix.
4. Select the top k eigenvectors corresponding to the largest eigenvalues, where k is the desired number of principal components.
5. Project the original data onto the selected eigenvectors to obtain the principal components.
The resulting principal components can be used for:
- Data visualization
- Noise reduction
- Feature extraction
- Dimensionality reduction
By applying PCA, you can simplify complex datasets, identify patterns, and gain insights into the underlying structure of the data.
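
As a concrete illustration of steps 1-5 above, here is a minimal NumPy sketch (pca_project is a hypothetical helper name; in practice the tasks below use sklearn's PCA, which implements the decomposition for you but centres rather than standardizes, so scale the data first if you want unit variance):

```python
import numpy as np

def pca_project(X, k):
    """Project data matrix X (n_samples x n_features) onto its top-k PCs."""
    # 1. Standardize: zero mean, unit variance per feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    C = np.cov(Xs, rowvar=False)
    # 3. Eigen-decomposition (eigh, since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Sort eigenvectors by descending eigenvalue and keep the top k
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]
    # 5. Project the standardized data onto the selected eigenvectors
    return Xs @ W, eigvals[order]
```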

1) Perform a PCA analysis on the original training data set.

2) Plot a scree plot to report the variance captured by each principal component.

3) Project the test set onto the same PCA space produced from the original training data set.

4) Plot two subplots in one figure: one showing the training data in the PC1-PC2 projection space, with each point labelled according to its class; the other showing the test data in the same PCA space, again labelled by class.
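
One possible sklearn-based sketch of these four tasks, continuing from the Part A sketch (train_df, test_df, and label_col carry over; standardizing before PCA is a common but optional choice):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_train = train_df.drop(columns=label_col).to_numpy()
y_train = train_df[label_col].to_numpy()
X_test = test_df.drop(columns=label_col).to_numpy()
y_test = test_df[label_col].to_numpy()

# 1) Fit PCA on the (standardized) original training set
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))

# 2) Scree plot: variance captured by each principal component
plt.bar(range(1, pca.n_components_ + 1), pca.explained_variance_)
plt.xlabel('Principal component')
plt.ylabel('Variance explained')
plt.title('Scree plot')
plt.show()

# 3) Project both sets onto the PCA space learned from the training data
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# 4) Training and test data in the PC1/PC2 space, labelled by class
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, Z, y, title in [(axes[0], Z_train, y_train, 'Training data'),
                        (axes[1], Z_test, y_test, 'Test data')]:
    for label in (1, -1):
        mask = y == label
        ax.scatter(Z[mask, 0], Z[mask, 1], label=f'class {label:+d}')
    ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
    ax.set_title(title); ax.legend()
plt.show()
```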

Part C - Do a classification using the Naïve Bayes Classification model

Answer:

To perform classification using the Naïve Bayes Classification model, you would follow these steps:
1. Prepare the data: Split the dataset into features (X) and target variable (y). Ensure the data is cleaned and preprocessed.
2. Choose the Naïve Bayes variant: Select the appropriate Naïve Bayes classifier based on the data type:
- Multinomial Naïve Bayes for discrete count features (e.g., word counts)
- Gaussian Naïve Bayes for continuous features
- Bernoulli Naïve Bayes for binary features
3. Train the model: Fit the Naïve Bayes classifier to the training data (X_train, y_train).
4. Make predictions: Use the trained model to predict the target variable for the test data (X_test).
5. Evaluate the model: Assess the performance using metrics like accuracy, precision, recall, F1-score, and confusion matrix.
The Naïve Bayes Classification model works by:
- Assuming the features are conditionally independent given the class
- Calculating the probability of each class given the features
- Selecting the class with the highest probability as the prediction
By following these steps, you can effectively use Naïve Bayes Classification to predict categorical outcomes in various applications, such as text classification, sentiment analysis, and medical diagnosis.

Train the model using the original training set and report its performance on the test set, including the accuracy rate.
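
A hedged sketch of this task, assuming Gaussian Naïve Bayes (a natural fit here, since the static code metrics are continuous) and reusing X_train, y_train, X_test, and y_test from the Part B sketch:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train on the original training set, then predict on the test set
nb = GaussianNB().fit(X_train, y_train)
y_pred = nb.predict(X_test)

# Report performance, including the accuracy rate
print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```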

Part D - Investigate how the number of features in the training dataset affects the model performance on the validation set

Answer: Examining how the feature count affects model performance reveals an important trade-off. Performance on the validation set typically improves at first as features are added to the training data, because the model can capture more complex patterns and relationships. Beyond a certain point, however, additional features encourage overfitting: the model becomes overly specialized to the training set and generalizes poorly to the unseen data in the validation set. Conversely, too few features cause underfitting, where the model is too simple to capture important relationships. There is therefore an optimal feature count at which the model peaks on the validation set, balancing complexity against generalizability; this is precisely why feature selection and dimensionality reduction matter. A code sketch of this experiment follows the task list below.

1) Use the training set (II) to train 13 Naïve Bayes Classification models, with 13 different feature sets. That is: the first one is to use the 1st feature only; the second one is to use the 1st and the 2nd features; the third one is to use the 1st, 2nd, and 3rd features, the fourth one is to use the first 4 features, and so on.

Measure the accuracy rate on both the training set (II) and the validation set. Report the results in a figure: a plot of the accuracy rate against the number of features used in each model. There should be two curves in this figure: one for the training set (II) and one for the validation set.

2) Report the best number of features you would like to use in this work and explain why you chose it. Write this down in your Jupyter notebook.

3) Use the selected number of features to train the model and report the performance on the test set.
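
A sketch of this experiment, reusing X_train2, y_train2, X_val, and y_val from the Part A split and X_test, y_test from Part B. Picking best_k as the feature count with the highest validation accuracy is just one defensible rule; justify your own choice in the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1) Train 13 models on nested feature subsets of training set (II)
n_features = X_train2.shape[1]  # 13
train_acc, val_acc = [], []
for k in range(1, n_features + 1):
    model = GaussianNB().fit(X_train2[:, :k], y_train2)  # first k features only
    train_acc.append(accuracy_score(y_train2, model.predict(X_train2[:, :k])))
    val_acc.append(accuracy_score(y_val, model.predict(X_val[:, :k])))

# Accuracy against number of features: two curves in one figure
plt.plot(range(1, n_features + 1), train_acc, marker='o', label='training set (II)')
plt.plot(range(1, n_features + 1), val_acc, marker='s', label='validation set')
plt.xlabel('Number of features'); plt.ylabel('Accuracy rate'); plt.legend()
plt.show()

# 3) Retrain with the selected number of features and report test accuracy
best_k = int(np.argmax(val_acc)) + 1  # placeholder rule: best validation accuracy
final = GaussianNB().fit(X_train2[:, :best_k], y_train2)
print('Test accuracy:', accuracy_score(y_test, final.predict(X_test[:, :best_k])))
```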

Part E - Summarize your findings, write your conclusions using critical thinking (no more than 100 words), and write them down in your Jupyter notebook.

Note - Submit a .ipynb file to show your completed Python code.
