Data Analysis For Business Plan Sample

Harnessing Insights: Data Analysis for Business Planning


Introduction to Data Analysis for Business

In the business world, “exploratory data analysis” is crucial because it helps uncover patterns, connections, and insights in data, allowing businesses to decide confidently, create winning plans, and gain an advantage over rivals. Strong data analysis and machine learning skills are therefore increasingly important in the current data-driven corporate environment. This study covers basic statistics and exploratory data analysis with Python and Excel, using business case studies and real-world datasets, and introduces several machine learning methods, including regression, classification, and unsupervised learning. It thereby develops the knowledge and abilities needed to perform data analysis, create machine learning models, and assess their effectiveness.


EDA using Excel

Exploratory data analysis (EDA) using Excel and basic statistics is a crucial first step in data analysis. EDA seeks to understand the structure of the data and to spot anomalies, patterns, and trends. Overall, EDA with Excel and basic statistics provides insight into the data that can aid decision-making and point out topics for further research (Rahmany et al. 2020).

Figure 1: Displaying the dataset (Source: Excel)

Excel is a widely used data analysis application that offers a number of statistical features beneficial for EDA. In this study, the movie dataset “unit 8 movies.csv” is analysed with these tools to make charts and graphs, generate descriptive statistics, and summarise the data. EDA using Excel and basic statistics can also help pinpoint data-quality problems, such as missing or incorrect values, and direct further investigation.

Figure 2: Calculating the average score (Source: Excel)

The figure above shows typical values for the audience rating, “Rotten Tomatoes” rating, profitability, and worldwide gross. These values can be generated with Excel functions such as “AVERAGE”, which returns the mean of the supplied data, and Excel can also produce graphs and charts to visualise them (Khadka, 2019).

Figure 3: Calculating the minimum, median, and maximum scores (Source: Excel)

Figure 4: Bar plot implementation (Source: Excel)

To compare the audience and Rotten Tomatoes scores graphically, a bar plot can be created in Excel. Excel’s chart function visualises the data with the “audience rating” and “Rotten Tomatoes” rating on the X-axis and the score on the Y-axis, so any differences or similarities between the two scores are visible in the resulting bar chart.

To build a histogram for EDA, organise the data into intervals or bins, create a frequency table, and then use Excel’s charting capabilities to generate the histogram. Adjusting the bin size and labels improves the visualisation and gives insight into the distribution of the data.

Using Excel for “exploratory data analysis (EDA)”, a vital phase in the data analysis process, has several significant advantages. First, Excel has a familiar, user-friendly interface that makes it accessible to a wide range of users, including those without in-depth programming knowledge. Users can quickly import and manipulate data, carry out simple computations, and create visualisations such as charts and graphs to understand the distribution and trends of the data. Excel’s adaptability also enables interactive and iterative examination: users can adjust parameters, modify charts and visualisations, and immediately see how their changes affect the outcomes, deepening their understanding of the data. Its capacity to manage large datasets and carry out simple statistical computations further increases its utility for EDA. Finally, Excel’s EDA features provide a foundation for more complex studies: by seeing probable patterns, correlations, and distributions in the data, analysts are better able to choose and apply more sophisticated statistical approaches or machine learning algorithms.

EDA using Python

Exploratory data analysis (EDA), especially in Python, is a crucial stage in machine learning: it entails examining the data to understand it before applying algorithms or creating models. While EDA is a vital component of data analysis, it is also important to take into account the ethical issues that arise when EDA is used in conjunction with machine learning. Here are some moral questions to consider:

Data Privacy and Consent: EDA often requires access to personal or sensitive information. It is essential to confirm that data used for analysis is gathered and utilised with appropriate consent, complying with privacy laws and policies. To safeguard people’s privacy, care must be taken to anonymise or de-identify the data.

Fairness and Bias: EDA may reveal racial, gender, or other biases in the data. To maintain fairness in decision-making processes, it is crucial to critically analyse and correct these biases, and to identify and mitigate any potentially discriminatory effects of the data or models.

Data Integrity and Quality: EDA is predicated on the premise that the data is reliable and representative. However, data may contain inaccuracies, omissions, or inconsistencies that can result in biased analysis, so it is crucial to validate and clean the data to ensure its integrity and dependability.

Transparency and Explainability: EDA must be carried out in a way that allows others to replicate and verify the analysis. Machine learning algorithms and models should be comprehensible, including details of how they reach their judgements. This openness fosters confidence and promotes accountability.

Informed Decision-Making: EDA should support informed decisions. When interpreting results, distorting or over-interpreting the findings must be avoided, and decision-makers must be aware of the analysis’s constraints and assumptions.

Data Security: Since EDA involves accessing and modifying data, strong security precautions are needed to prevent unauthorised access, privacy breaches, or data leaks. Throughout the EDA process, safeguards must be in place to guarantee the safety and confidentiality of the information.

Data Ownership and Intellectual Property: It is crucial to respect the rights of the data’s owners as well as any related intellectual property rights. The restrictions set out by data licences, copyrights, and usage agreements should be followed when doing EDA.

Responsible Data Dissemination: EDA results should be disseminated responsibly, taking into account the possible effects on people, communities, or organisations. Sensitive information should be kept private to prevent injury, prejudice, or the spread of false information.

Ethical Frameworks and Guidelines: Adhering to well-known ethical principles, such as the Fair Information Practice Principles (FIPPs), or guidelines such as ethical standards for trustworthy AI, can help direct ethical decision-making throughout the EDA process.

The code sample discussed below shows how several libraries are used to build machine learning models and gauge their effectiveness. The following explanation details the function and purpose of each library, paying particular attention to numpy, pandas, matplotlib, seaborn, and scikit-learn.

Numpy is loaded first to enable quick numerical calculations and array manipulation; with support for multi-dimensional arrays and mathematical operations, it offers a strong basis for scientific computing. Pandas is loaded for data analysis and manipulation; it provides robust data structures, such as DataFrames, that make managing structured data simple. Matplotlib’s pyplot module is then imported as plt, a flexible charting package that makes it possible to create visualisations such as line graphs, scatter plots, and histograms. These visualisations make it easier to understand data distributions, correlations, and trends.

Seaborn, a higher-level visualisation library built on the foundation of matplotlib, is imported as sns; it offers aesthetically pleasing defaults and practical features for statistical visualisations, which is helpful for examining links between elements in the data. The train_test_split function from scikit-learn’s model_selection submodule is loaded to divide the dataset into training and testing groups; it randomly splits the data so that the model can be trained on one portion and then tested on a separate, independent subset to see how well it generalises. The DecisionTreeClassifier class from scikit-learn’s tree module implements the decision tree technique for classification jobs; decision trees are potent supervised learners that divide the feature space according to feature values, enabling accurate prediction of the target variable. Finally, the precision_score function from scikit-learn’s metrics submodule offers an assessment metric for classification models: precision measures how well the model predicts positive samples.
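
A minimal sketch of the import block described above might look as follows; the original notebook code is not reproduced here, so this reconstruction is illustrative rather than definitive:

    # Illustrative reconstruction of the imports described in the text
    import numpy as np               # fast numerical arrays and mathematics
    import pandas as pd              # DataFrames for structured data handling
    import matplotlib.pyplot as plt  # core plotting: line graphs, scatter plots, histograms
    import seaborn as sns            # statistical plots built on matplotlib

    from sklearn.model_selection import train_test_split  # split data into train/test sets
    from sklearn.tree import DecisionTreeClassifier       # decision tree classification
    from sklearn.metrics import precision_score           # precision evaluation metric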

Taken together, the code demonstrates how key libraries for data manipulation, visualisation, model training, and performance evaluation can be integrated into a machine learning workflow. These libraries enable academics and industry professionals to analyse and model large datasets efficiently, advancing research in a variety of fields.

The figure above shows visualisations produced while exploring the data in a Jupyter Notebook with tools such as Matplotlib, Seaborn, and Plotly, including a bar plot, histogram, pie chart, and scatter plot (Orji et al. 2022).
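
Since the figure itself is not reproduced here, the following sketch suggests how such plots could be generated with Matplotlib and Seaborn; the column names (“Genre”, “Lead Studio”, “Audience score %”, “Rotten Tomatoes %”) are assumptions based on the dataset description below:

    # Hypothetical EDA plots for the movie dataset; column names are assumed
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    movies = pd.read_csv("unit 8 movies.csv")

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    # Bar plot: number of films per genre
    movies["Genre"].value_counts().plot(kind="bar", ax=axes[0, 0], title="Films per genre")

    # Histogram: distribution of audience scores
    axes[0, 1].hist(movies["Audience score %"].dropna(), bins=10)
    axes[0, 1].set_title("Audience score distribution")

    # Pie chart: share of films by lead studio
    movies["Lead Studio"].value_counts().plot(kind="pie", ax=axes[1, 0], title="Lead studio share")

    # Scatter plot: audience score against Rotten Tomatoes score
    sns.scatterplot(data=movies, x="Audience score %", y="Rotten Tomatoes %", ax=axes[1, 1])

    plt.tight_layout()
    plt.show()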

The dataset has been prepared for the training and evaluation of a machine learning model in the specified environment. Its attributes include “Film”, “Genre”, “Audience score %”, “Lead Studio”, “Profitability”, “Rotten Tomatoes %”, “Year”, and an identifier field, capturing facets of each movie such as genre, audience score, studio, revenue, ratings, and year of release. The goal is to forecast the financial success of films from these characteristics. To do this, the data is split into a training set (x_train, y_train) and a testing set (x_test, y_test) with a test size of 0.20; the machine learning model is trained on the training set, and its performance is assessed on the testing set (Fruhwirth et al. 2020).
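
A minimal sketch of this split, assuming the feature and target columns named above (the exact feature selection used in the original notebook is not shown):

    # Sketch of the 80/20 train/test split described in the text
    import pandas as pd
    from sklearn.model_selection import train_test_split

    movies = pd.read_csv("unit 8 movies.csv").dropna()

    features = ["Audience score %", "Rotten Tomatoes %", "Year"]  # assumed numeric features
    x = movies[features]
    y = movies["Profitability"]   # assumed target representing financial success

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.20, random_state=42   # hold out 20% of rows for evaluation
    )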

The figure shows a linear regression model implemented in Python, where the “MSE” score is 14.60.
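
Continuing from the split above, a hedged sketch of the linear regression step follows; the MSE of 14.60 and the R-squared of 0.06 are the values reported in the study, not the output of this snippet:

    # Illustrative linear regression fit and error measurement
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    lin_reg = LinearRegression()
    lin_reg.fit(x_train, y_train)   # ordinary least squares on the training set

    y_pred = lin_reg.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)   # reported as 14.60 in the study
    r2 = r2_score(y_test, y_pred)              # reported as 0.06 in the study
    print(f"MSE: {mse:.2f}, R-squared: {r2:.2f}")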

The K-nearest neighbours (KNN) technique is used for classification in the next piece of code. The KNeighborsClassifier class is imported from the sklearn.neighbors package, and an instance of it is created and assigned to the ‘knn’ variable. This classifier is commonly employed for classification tasks based on the KNN algorithm (Adžić et al. 2021).

An error is observed reporting that an unexpected input type was given; it advises that the return statement should be changed to self._fit(X, y), suggesting that the formats of the input arrays X and y are the problem. The n_neighbors=7 parameter specifies how many neighbours the KNeighborsClassifier takes into account when generating predictions; this parameter determines the granularity of the algorithm’s decision boundaries.

The accuracy score for the training data is determined using the ‘knn’ object, which holds the KNN classifier. The training data (‘x_train’) and associated target values (‘y_train’) are passed to the ‘score’ method, which evaluates the KNN model’s predictions against the training data and returns an accuracy score of approximately 0.3710 (Istanti et al. 2020).
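
A sketch of the KNN step is shown below. Classification requires a categorical target, which the text does not specify, so a hypothetical binary “profitable” label is assumed here:

    # Hypothetical KNN classification; the binary label below is an assumption
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    y_class = (movies["Profitability"] > 1.0).astype(int)   # assumed class label
    x_train, x_test, y_train, y_test = train_test_split(
        movies[features], y_class, test_size=0.20, random_state=42
    )

    knn = KNeighborsClassifier(n_neighbors=7)   # consider the 7 nearest neighbours
    knn.fit(x_train, y_train)

    # Accuracy on the training data; the study reports roughly 0.3710
    print(knn.score(x_train, y_train))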

In the provided code snippet, a decision tree classifier is employed for classification tasks. The ‘tree’ module from the ‘sklearn’ library, which contains various tree-based algorithms, is imported, and an instance of the DecisionTreeClassifier class is created and assigned to the variable ‘DecisionTree’. This class represents the decision tree classifier algorithm.

The next step trains the decision tree classifier using the ‘fit’ method, with the training data ‘x_train’ (input features) and ‘y_train’ (target labels) provided as arguments. This process allows the decision tree classifier to learn from the training data and build a tree-based model. Overall, this code sets up and trains a decision tree classifier model for subsequent classification tasks (Normalini et al. 2019).
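
A minimal sketch of this training step, reusing the split from the KNN snippet above:

    # Decision tree classifier, as described in the text
    from sklearn import tree

    DecisionTree = tree.DecisionTreeClassifier()
    DecisionTree.fit(x_train, y_train)   # learn axis-aligned splits from the training data

    # Accuracy on the held-out test set; the study reports 1.0
    print(DecisionTree.score(x_test, y_test))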

Model evaluation

In this study, the “mean square error (MSE)” for linear regression was 14.60, suggesting a rather high error rate, and the “coefficient of determination (R-squared)” was 0.06, indicating that the model described only a small part of the variability in the data. With an accuracy score of 0.3709, the kNN model correctly predicted 37.1% of the test data.

The decision tree classifier correctly identified every data point in the test set, receiving a perfect accuracy score of 1.0 (a result so high that overfitting or data leakage may also deserve scrutiny). The logistic regression model’s accuracy score of 0.3064 indicates that it correctly predicted 30.6% of the test data. Overall, the evaluation of both regression and classification models depends on the particular evaluation measures employed; in general, better-performing models have lower error rates and higher accuracy scores (Peng et al. 2021).

The evaluation of the results details the performance of the linear regression, kNN, decision tree, and logistic regression models. The “mean square error” of 14.60 in the “linear regression” model indicates a reasonably high error rate: a greater “MSE” indicates a considerable departure between the model’s predictions and the observed data. Furthermore, the “coefficient of determination”, the “R-squared” value of 0.06, shows that the model reflects only a small part of the variability contained in the data; low “R-squared” values show that the model’s explanatory capacity is constrained and that it fails to explain a significant amount of the variance. As a result, the “linear regression” model appears to perform poorly in this case.

The "kNN"modelhasan accuracy score of "0.3709"which indicates that it accurately forecast "37.1%"of the test data. Although this score might appear to be quite low, it's vital to take the background and the problem's complexity into account. A 37.1% accuracy score can still be significant and helpful, given the application in question and the type of data. It's important to keep in mind that it can be difficult to assess the suitability of this score for accuracy in the absence of more details about the particular problem and area.

Ethical Issues

Business analytics is a rapidly expanding field that many different sectors use to support data-driven decision-making. Case studies in a variety of business analytics areas, including customer segmentation, fraud detection, and supply chain optimisation, have been developed using real-world information. But the growing use of data in business also prompts moral questions about bias, transparency, and privacy; to keep the use of data responsible and equitable, ethical issues in data science must be properly taken into account.

Conclusion

In the corporate environment, exploratory data analysis is crucial to gaining understanding and making wise decisions, so strong data analysis and machine learning abilities are essential given the growing usage of data in the business sector. This study covered fundamental statistics, exploratory data analysis, regression, classification, and unsupervised learning, utilising business case studies and real-world datasets with Python and Excel.

References

  • Adžić, S. and Al-Mansour, J., 2021. Business analysis in the times of COVID-19: Empirical testing of the contemporary academic findings. Management Science Letters, 11(1), pp.1-10.
  • Fruhwirth, M., Rachinger, M. and Prlja, E., 2020. Discovering business models of data marketplaces.
  • Istanti, E., Sanusi, R. and Daengs, G.S., 2020. Impacts of price, promotion and go food consumer satisfaction in faculty of economic and business students of Bhayangkara University Surabaya. Ekspektra: Jurnal Bisnis dan Manajemen, 4(02), pp.104-120.
  • Khadka, B., 2019. Data analysis theory and practice: Case: Python and Excel tools.
  • Normalini, M.K., Ramayah, T. and Shabbir, M.S., 2019. Investigating the impact of security factors in e-business and internet banking usage intention among Malaysians. Industrial Engineering & Management Systems, 18(3), pp.501-510.
  • Orji, U.E., Ukwandu, E., Obianuju, E.A., Ezema, M.E., Ugwuishiwu, C.H. and Egbugha, M.C., 2022, November. Visual exploratory data analysis of the Covid-19 pandemic in Nigeria: Two years after the outbreak. In 2022 5th Information Technology for Education and Development (ITED) (pp. 1-6). IEEE.
  • Peng, J., Wu, W., Lockhart, B., Bian, S., Yan, J.N., Xu, L., Chi, Z., Rzeszotarski, J.M. and Wang, J., 2021, June. DataPrep.EDA: Task-centric exploratory data analysis for statistical modeling in Python. In Proceedings of the 2021 International Conference on Management of Data (pp. 2271-2280).
  • Rahmany, M., Zin, A.M. and Sundararajan, E.A., 2020. Comparing tools provided by Python and R for exploratory data analysis. IJISCS (International Journal of Information System and Computer Science), 4(3), pp.131-142.