Assignment – 771766 Fundamentals Of Data Science Project
Introduction
For this project, a mock census of a fictional town is provided as census.csv. The main goal of a census is to provide the government with accurate population information that supports better planning, policy creation, and allocation of funds. For a team of local government officials who must evaluate the information and decide how to invest in a vacant piece of land, this data plays a critical role in investment and construction decisions. To gain insight and make sound judgements, it is therefore necessary to clean and evaluate the census data. The study involves replacing missing values, filtering and grouping the data, and producing bar graphs and histograms.
The tool used
The code is written in Python, and the pandas, NumPy, and Matplotlib libraries are used for data processing and visualization. The code was likely tested and executed in Jupyter notebooks or a similar IDE.
Data cleaning
Data cleaning is the process of detecting and correcting errors, inconsistencies, and missing values before the data is analysed. Missing values are a frequent problem in datasets and can arise from several causes, including incorrect data entry, equipment malfunction, or survey non-response (Ridzuan and Zainon, 2019). Missing data can harm the analysis by reducing the sample size, introducing bias, and lowering statistical power. Data cleaning entails locating missing values and replacing them with suitable values, such as means, medians, or other imputed values. The characteristics of the data and the underlying relationships between variables must be considered carefully when replacing missing values. The figure above shows that the categorical values in some of the columns have also been replaced by integers with the help of the code.
The aim of data cleaning is to make sure the data is correct, complete, and pertinent to the research question (Hosseinzadeh et al. 2021). After a thorough clean-up, not a single NA, None, or NaN value can be found in the given dataset. Therefore, it can be said that the data clean-up process was successful.
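The replacement and verification steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the toy DataFrame, the column names "Age" and "Religion", and the choice of the median for numeric columns and the most frequent value for categorical columns are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for census.csv; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [34, np.nan, 7, 61, 45],
    "Religion": ["Christian", None, "None", "Christian", None],
})


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and categorical gaps with the mode."""
    cleaned = frame.copy()
    for column in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[column]):
            cleaned[column] = cleaned[column].fillna(cleaned[column].median())
        else:
            most_frequent = cleaned[column].mode().iloc[0]
            cleaned[column] = cleaned[column].fillna(most_frequent)
    return cleaned


cleaned = fill_missing(df)

# Confirm that no NA, None, or NaN values remain anywhere in the table.
remaining_missing = int(cleaned.isna().sum().sum())
print(remaining_missing)
```

Note that the string "None" (no religion) is an ordinary category here, whereas the Python value None marks a genuinely missing entry; only the latter is filled.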
TASK A
(i) High-density housing
The code above uses the Matplotlib module to create a single figure with two subplots, one for men and one for women. The age intervals used for the histogram are specified by the bins variable and are 5 years wide. The hist() method is called twice, once for males and once for females, each time on the dataset filtered by gender (Raschka et al. 2020). The histograms are coloured blue for males and red for females, with a transparency level of 0.5, so that overlapping bars remain easy to tell apart. Finally, before displaying the histograms, the code sets the title and axis labels for each subplot as well as a title for the whole figure.
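A plot of the kind described above could be produced along the following lines. This is a sketch under stated assumptions: the toy data, the column names "Age" and "Gender", and the category labels "Male" and "Female" are illustrative and may differ from the real census file.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data; the "Age" and "Gender" column names are assumptions.
df = pd.DataFrame({
    "Age": [3, 22, 27, 31, 38, 44, 19, 26, 35, 47, 52, 68],
    "Gender": ["Male"] * 6 + ["Female"] * 6,
})

# Bin edges 0, 5, ..., 100, giving 5-year age intervals.
bins = np.arange(0, 105, 5)

fig, (ax_male, ax_female) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# One hist() call per gender, each on a slice filtered by gender.
male_ages = df.loc[df["Gender"] == "Male", "Age"]
female_ages = df.loc[df["Gender"] == "Female", "Age"]
male_counts, _, _ = ax_male.hist(male_ages, bins=bins, color="blue", alpha=0.5)
female_counts, _, _ = ax_female.hist(female_ages, bins=bins, color="red", alpha=0.5)

# Titles and axis labels for each subplot, plus a title for the whole figure.
ax_male.set_title("Male")
ax_female.set_title("Female")
for ax in (ax_male, ax_female):
    ax.set_xlabel("Age")
ax_male.set_ylabel("Number of people")
fig.suptitle("Age distribution by gender")

# plt.show() would display the figure in an interactive session.
```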
The result suggests that most males are aged between 25 and 45, whereas most females are aged between 18 and 50. Age and housing density are closely related, since older people tend to prefer less densely populated places because of factors such as noise, traffic, and access to facilities (Duranton and Puga, 2020). Younger people, on the other hand, can be more tolerant of higher-density living because of affordability, convenience, and social opportunities. Based on the age distribution of the male and female groups, high-density housing can therefore be a sound decision for the local authority.
(ii) Low-density housing
The code above examines the age distribution in the census data by creating a histogram. Using the Matplotlib module, it creates a single plot with age on the x-axis and the number of people falling inside each age range on the y-axis. The graph suggests that the elderly population is very small compared with the other age groups. Therefore, low-density housing may not be a viable option (Aleström et al. 2020).
(iii) Train station
This code generates a simple bar plot showing the proportion of male and female university students in the dataset. A single plot with two bars, one for each proportion, is made using the Matplotlib library (Saabith et al. 2021). Since there is no university in the town, it was assumed that everyone in the student category is a commuter. On that assumption, the number of commuters of either gender is very low compared with the total population of the town. Thus, building a train station for only 1500 commuters may not be an ideal option for the authority.
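The proportion plot described above could be computed as in the following sketch. The toy data, the column names "Gender" and "Occupation", and the label "University Student" are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names and the occupation label are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Female"],
    "Occupation": ["University Student", "University Student", "Teacher",
                   "Plumber", "University Student", "University Student"],
})

# Keep only university students, who are assumed to commute out of town.
students = df[df["Occupation"] == "University Student"]

# Share of each gender among the students, ordered alphabetically.
proportions = students["Gender"].value_counts(normalize=True).sort_index()

fig, ax = plt.subplots()
ax.bar(list(proportions.index), proportions.values, color=["red", "blue"])
ax.set_xlabel("Gender")
ax.set_ylabel("Proportion of university students")
ax.set_title("University students by gender")

# Absolute number of assumed commuters, for comparison with the population.
num_commuters = len(students)
print(num_commuters, len(df))
```

Comparing num_commuters with len(df) gives the share of the whole population that the station would serve, which is the figure the argument above relies on.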
(iv) Religious building
The figure above shows the religious affiliation of the population by faith. The code uses the value_counts() method to count the number of adherents of each religion in the DataFrame and stores the results in a new variable named religion_counts.
The code then uses the plot() function with the option kind='bar' to construct a bar plot of the religion_counts data. The x and y axes are labelled, and a plot title is added, using the xlabel(), ylabel(), and title() methods (Stančin and Jović, 2019). The resulting visualization displays the distribution of religious affiliations in the census data. The data suggests a trend toward secularism. The 'None' category should be read as the secular people who reside in the city, and it currently outnumbers the religious people. On the basis of this visualization, there is little need for a religious building.
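The counting and plotting steps described above can be illustrated with a short sketch. The toy data and the column name "Religion" are assumptions; the variable name religion_counts follows the description in the text.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the "Religion" column name and its values are assumptions.
df = pd.DataFrame({
    "Religion": ["None", "Christian", "None", "Muslim", "None", "Christian"],
})

# Count the adherents of each religion, largest group first.
religion_counts = df["Religion"].value_counts()

# Draw the counts as a bar chart and label it.
ax = religion_counts.plot(kind="bar")
plt.xlabel("Religion")
plt.ylabel("Number of residents")
plt.title("Religious affiliation")

# Largest category in the toy data.
largest_group = religion_counts.idxmax()
print(largest_group)

# plt.show() would display the figure in an interactive session.
```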
(v) Emergency medical building
Two major factors have been considered to determine whether the city needs an emergency medical building. First, the code above shows the infirmity count. It converts the infirmity counts from a pandas DataFrame into a bar chart by calling the plot() method on the infirmity_counts object with the kind='bar' option (Larson, 2019). The visualization shows that the number of disabled people in the census data is quite low.
Second, a further visualization has been produced for females aged 18 to 30. The code filters the original DataFrame to contain only females between the ages of 18 and 30, counts them, and displays the count as a bar chart. There are more than 750 females in the city in this age range, which is treated here as the range in which pregnancy is most likely. Therefore, it can be said that there is a high need for emergency facilities for pregnant women in the city.
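The filtering and counting step for females aged 18 to 30 could look like the following sketch. The toy data and the column names "Age" and "Gender" are assumptions, and the age bounds are treated as inclusive.

```python
import pandas as pd

# Toy data; the "Age" and "Gender" column names are assumptions.
df = pd.DataFrame({
    "Age": [17, 18, 24, 30, 31, 25, 29],
    "Gender": ["Female", "Female", "Female", "Female", "Female", "Male", "Female"],
})

# between() is inclusive on both ends by default, so 18 and 30 are kept.
in_age_range = df["Age"].between(18, 30)
is_female = df["Gender"] == "Female"

females_18_30 = df[is_female & in_age_range]
female_count = len(females_18_30)
print(female_count)
```

The resulting count is a single number, so it can be passed to a bar chart as one bar, as the text describes.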
Investment areas | Whether there is high demand or need from the data analysis | Final decision
High-density housing | Medium | Not needed
Low-density housing | Low | Not needed
Train station | Low | Not needed
Religious building | Low | Not needed
Emergency medical building | High | Highly needed
Table 1: Summary of the data output
(Source: Learner)
The table above summarizes the final output of the data analysis and shows that there is a need for an emergency medical building in the city. Considering the pregnancies likely in the population in the future, the unoccupied plot of land should therefore be used for a medical building.
TASK B
(i) Employment and training
The following figure shows the unemployment rate by age. The rate reaches up to 0.15 for ages 60 to 80 and peaks at about 0.25 at age 100.
This code investigates patterns in unemployment and determines whether some ages have higher jobless rates than others. It adds a new column called "Unemployed" whose binary value is 1 if the occupation equals 1073 and 0 otherwise; that is, an individual whose occupation code is 1073 is assumed to be unemployed. The code then groups the data by age and uses the "Unemployed" column to calculate the average unemployment rate for each age. Matplotlib is then used to plot the unemployment rate as a function of age (Dhruv et al. 2021). The resulting graph illustrates how the unemployment rate changes with age: a higher rate at certain ages implies that those ages are more susceptible to joblessness than others. In this visualization, the unemployment rate peaks at age 100, but in realistic terms the rate is about average for the age range of 20 to 63.
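The flag-and-group approach described above can be sketched as follows. The occupation code 1073 is taken from the text; the toy data, the other occupation codes, and the column names "Age" and "Occupation" are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

UNEMPLOYED_CODE = 1073  # integer occupation code treated as unemployed in the text

# Toy data; column names and the other occupation codes are assumptions.
df = pd.DataFrame({
    "Age": [25, 25, 40, 40, 40, 40, 62, 62],
    "Occupation": [1073, 12, 7, 1073, 88, 301, 1073, 1073],
})

# Binary flag: 1 if the occupation code marks the person as unemployed, else 0.
df["Unemployed"] = (df["Occupation"] == UNEMPLOYED_CODE).astype(int)

# The mean of a 0/1 flag within each age is the unemployment rate for that age.
rate_by_age = df.groupby("Age")["Unemployed"].mean()

fig, ax = plt.subplots()
ax.plot(rate_by_age.index, rate_by_age.values, marker="o")
ax.set_xlabel("Age")
ax.set_ylabel("Unemployment rate")
ax.set_title("Unemployment rate by age")
```

A rate computed for an age with very few residents can be misleading, because one or two people determine the whole value. This may explain an isolated peak at an extreme age such as 100, and checking the group size with df.groupby("Age").size() alongside the rate helps to judge it.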
(ii) Old age care
This code generates a histogram of the age column in a pandas DataFrame. A histogram is a type of graph that shows the distribution of a continuous variable, such as age, by organizing the data into intervals known as bins and counting the number of values that fall in each bin.
To build the histogram, the code makes use of the Matplotlib library. The first argument of plt.hist() is the column of data to be plotted, in this case the "Age" column of the DataFrame df. The second argument, bins=20, defines how many bins are used. The more bins there are, the more detailed the histogram becomes.
After producing the histogram, the code labels the x and y axes using plt.xlabel() and plt.ylabel(). The x-axis is labelled "Age", while the y-axis, which gives the number of entries in each bin, is labelled "Frequency" (Kross and Guo, 2019).
Finally, the code displays the histogram using plt.show(). The resulting image shows the overall distribution of age in the data, with the x-axis displaying age ranges and the y-axis displaying the frequency of ages in each bin.
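The histogram steps described above correspond to a sketch like the one below. The toy ages are assumptions; the "Age" column, bins=20, and the axis labels follow the description in the text.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the values are assumptions for illustration.
df = pd.DataFrame({
    "Age": [1, 8, 15, 22, 29, 33, 36, 41, 47, 52, 58, 66, 73, 80],
})

# plt.hist() returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = plt.hist(df["Age"], bins=20)

plt.xlabel("Age")
plt.ylabel("Frequency")

# plt.show() would display the figure in an interactive session.
```

With bins=20 the age range is split into 20 equal-width intervals, so bin_edges holds 21 values and the counts always sum to the number of rows.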
This histogram is helpful for understanding the distribution of age in a dataset, spotting patterns or outliers, and learning about the age demographics of a group or sample. It can also be used to compare age distributions across other datasets or groups.
In this visualization, the frequency increases between the ages of 20 and 48 and decreases thereafter. It follows that the number of active people is high up to the age of 60, after which the number of people of retirement age steadily decreases.
(iii) Increase spending
This code generates a histogram of the age column in a pandas DataFrame. A histogram is a type of graph that shows the distribution of a continuous variable, such as age, by organizing the data into intervals known as bins and counting the number of values that fall in each bin (Ostrowski and Menyhárt, 2020).
To build the histogram, the code makes use of the Matplotlib library. The first argument of plt.hist() is the column of data to be plotted, in this case the "Age" column of the DataFrame df. The second argument, bins=20, defines how many bins are used. The more bins there are, the more detailed the histogram becomes.
The "Age" column of the filtered DataFrame df_filtered is then plotted using the plt.hist() function from Matplotlib. The edgecolor option is set to "black" to draw a black border around each bar, and the bins option is set to 10 to split the data into 10 equal-width bins. After constructing the plot, the code uses plt.xlabel(), plt.ylabel(), and plt.title() to add the x-axis label, y-axis label, and plot title, respectively. Finally, the code displays the plot using plt.show(). The resulting plot shows the distribution of ages within the range of 6 to 18. It can also be used to compare age distributions across other datasets or groups (Christensen et al. 2021).
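The filtered histogram for school-age residents could be produced roughly as follows. The toy ages and the plot labels are assumptions; the name df_filtered, bins=10, and edgecolor="black" follow the description in the text, and the 6 to 18 bounds are treated as inclusive.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the values are assumptions for illustration.
df = pd.DataFrame({"Age": [4, 6, 9, 12, 12, 15, 18, 19, 35]})

# Keep only residents of school age, 6 to 18 inclusive.
df_filtered = df[(df["Age"] >= 6) & (df["Age"] <= 18)]

# Ten equal-width bins with a black outline around each bar.
counts, bin_edges, _ = plt.hist(df_filtered["Age"], bins=10, edgecolor="black")

plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age distribution, ages 6 to 18")

# plt.show() would display the figure in an interactive session.
```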
The visualization shows that the number of students aged 6 to 18 is high, with counts in the range of 100 to 200. The recommendation is therefore that investment in schooling for this age range should be high.
(iv) General infrastructure
This code counts the number of unique house numbers in a DataFrame and displays the total as a bar graph. The code first uses the "House Number" column of the DataFrame df to get the set of distinct house numbers. The number of distinct house numbers is then calculated using the len() function, and the result is saved in the variable num_houses (Lortie, 2022).
The code then generates a bar chart using Matplotlib's plt.bar() function. "House Numbers" is the label on the x-axis, and the num_houses variable gives the height of the bar on the y-axis. To draw a black border around the bar, the edgecolor option is set to "black".
After constructing the plot, the code uses the plt.xlabel(), plt.ylabel(), and plt.title() functions to add the x-axis label, y-axis label, and plot title, respectively. Finally, the code displays the plot using plt.show(). The resulting bar chart displays the total number of distinct house numbers in the DataFrame df. This can help in understanding the variety of house numbers within a sample or population, spotting patterns or outliers in their distribution, and comparing house numbers across groups or datasets.
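The counting step described above can be sketched as follows. The "House Number" column and the num_houses variable follow the text; the toy data and the "Street" column are assumptions, as is the extra household-size calculation, which is not described in the text and is included only to show how a per-household figure might be derived.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the "Street" column and all values are assumptions.
df = pd.DataFrame({
    "Street": ["Mill Lane"] * 4 + ["Oak Road"] * 3,
    "House Number": [1, 1, 2, 2, 1, 1, 1],
})

# Distinct house numbers, as described in the text.
unique_house_numbers = set(df["House Number"])
num_houses = len(unique_house_numbers)

fig, ax = plt.subplots()
ax.bar(["House Numbers"], [num_houses], edgecolor="black")
ax.set_xlabel("House Numbers")
ax.set_ylabel("Number of unique values")
ax.set_title("Unique house numbers")

# Hypothetical extension: the same house number can occur on several streets,
# so a household is identified here by the (Street, House Number) pair.
household_sizes = df.groupby(["Street", "House Number"]).size()
mean_household_size = household_sizes.mean()
print(num_houses, len(household_sizes))
```

In the toy data there are two distinct house numbers but three households, which shows why counting house numbers alone can understate the number of dwellings.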
From this visualization, it can be said that the average number of family members per family is about 4.
Investment areas | Whether there is high demand or need from the data analysis | Final decision
Increase spending for schooling | Medium | Not needed
Old age care | Low | Not needed
Employment and training | High | Not needed
General infrastructure | Low | Highly needed
Table 2: Summary of the output
(Source: Learner)
From the analysis of the four cases, investment in case (i), employment and training, and case (iii), increased spending on schooling, should take priority, because unemployment is high and the number of students is also high (van de Schoot et al. 2021).
Conclusion
With the help of Python modules and libraries, the analysis of census.csv has identified two specific investment areas that the local authority should consider.
References
Aleström, P., D’Angelo, L., Midtlyng, P.J., Schorderet, D.F., Schulte-Merker, S., Sohm, F. and Warner, S., 2020. Zebrafish: Housing and husbandry recommendations. Laboratory animals, 54(3), pp.213-224.
Christensen, M., Yunker, L.P., Adedeji, F., Häse, F., Roch, L.M., Gensch, T., dos Passos Gomes, G., Zepel, T., Sigman, M.S., Aspuru-Guzik, A. and Hein, J.E., 2021. Data-science-driven autonomous process optimization. Communications Chemistry, 4(1), p.112.
Dhruv, A.J., Patel, R. and Doshi, N., 2021. Python: the most advanced programming language for computer science applications. In Proceedings of the international conference on culture heritage, education, sustainable tourism, and innovation technologies (CESIT 2020) (pp. 292-299).
Duranton, G. and Puga, D., 2020. The economics of urban density. Journal of economic perspectives, 34(3), pp.3-26.
Hosseinzadeh, M., Azhir, E., Ahmed, O.H., Ghafour, M.Y., Ahmed, S.H., Rahmani, A.M. and Vo, B., 2021. Data cleansing mechanisms and approaches for big data analytics: a systematic study. Journal of Ambient Intelligence and Humanized Computing, pp.1-13.
Kross, S. and Guo, P.J., 2019, May. Practitioners teaching data science in industry and academia: Expectations, workflows, and challenges. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1-14).
Larson, D., 2019. Best Practices in Accelerating the Data Science Process in Python. In Introduction to Data Science and Machine Learning. IntechOpen.
Lortie, C.J., 2022. Python and R for the Modern Data Scientist. Journal of Statistical Software, 103, pp.1-4.
Ostrowski, J.G. and Menyhárt, J., 2020. Statistical analysis of machinery variance by python. Acta Polytechnica Hungarica, 17(5).
Raschka, S., Patterson, J. and Nolet, C., 2020. Machine learning in python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information, 11(4), p.193.
Ridzuan, F. and Zainon, W.M.N.W., 2019. A review on data cleansing methods for big data. Procedia Computer Science, 161, pp.731-738.
Saabith, S., Vinothraj, T. and Fareez, M., 2021. A review on Python libraries and Ides for Data Science. Int. J. Res. Eng. Sci., 9(11), pp.36-53.
Stančin, I. and Jović, A., 2019, May. An overview and comparison of free Python libraries for data mining and big data analysis. In 2019 42nd International convention on information and communication technology, electronics and microelectronics (MIPRO) (pp. 977-982). IEEE.
van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M.G., Vannucci, M., Gelman, A., Veen, D., Willemsen, J. and Yau, C., 2021. Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1), p.1.