
Machine Learning: Algorithms, Real-World Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349 Chattogram, Bangladesh

In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world holds a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To analyze these data intelligently and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), and particularly machine learning (ML), is key. Various types of machine learning algorithms exist in the area, such as supervised, unsupervised, semi-supervised, and reinforcement learning. Besides, deep learning, which is part of a broader family of machine learning methods, can analyze data intelligently on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.

Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world holds a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. These data can be structured, semi-structured, or unstructured, discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”, and their volume is increasing day by day. The insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven automated and intelligent cybersecurity system [ 105 ]; the relevant mobile data can be used to build personalized context-aware smart mobile applications [ 103 ], and so on. Thus, data management tools and techniques that can extract insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Artificial intelligence (AI), and particularly machine learning (ML), have grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed, and is generally regarded as one of the most popular recent technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”. The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score, within the range of 0 (minimum) to 100 (maximum), is shown on the y-axis. According to Fig. 1, the popularity values for these learning types were low in 2015 and have been increasing day by day. These statistics motivate us to study machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

Fig. 1 The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. Besides, deep learning, which originated from the artificial neural network and is known as part of a wider family of machine learning approaches, can be used to intelligently analyze data [ 96 ]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms in the same category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “Applications of Machine Learning”.

Based on the importance and potential of “Machine Learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for academics and industry professionals who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

  • To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.
  • To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.
  • To discuss the applicability of machine learning-based solutions in various real-world application domains.
  • To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [ 103 , 105 ]. Data can take various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. In addition, “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

  • Structured: Structured data have a well-defined structure, conform to a data model following a standard order, are highly organized, and are easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemes, such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc., are examples of structured data.
  • Unstructured: Unstructured data, on the other hand, have no pre-defined format or organization, making them much more difficult to capture, process, and analyze; they mostly consist of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.
  • Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.
  • Metadata: Metadata are not a normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or document something relative to an organization’s data properties, whereas metadata describe the relevant data information, giving it more significance for data users. Basic examples of a document’s metadata are the author, the file size, the date the document was generated, keywords describing the document, etc.
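To make these distinctions concrete, the short Python sketch below contrasts a structured record, a semi-structured JSON document, and a metadata record; all field names and values are purely illustrative, not taken from any dataset mentioned here.

```python
import json

# Structured: a fixed-schema record, as it might appear as a relational table row
structured_row = ("Alice", "2021-05-01", "42 Main St")

# Semi-structured: JSON has organizational properties (keys, nesting) but no rigid schema
semi_structured = json.loads('{"name": "Alice", "orders": [{"id": 1, "total": 9.99}]}')

# Metadata: "data about data", describing a document rather than its content
metadata = {"author": "Alice", "file_size_kb": 120, "created": "2021-05-01",
            "keywords": ["invoice", "2021"]}

# Nested fields in semi-structured data can be reached without a fixed schema
print(semi_structured["orders"][0]["total"])  # 9.99
```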

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], Bot-IoT [ 59 ], etc.; smartphone datasets such as phone call logs [ 84 , 101 ], SMS logs [ 29 ], mobile application usage logs [ 117 , 137 ], mobile phone notification logs [ 73 ], etc.; IoT data [ 16 , 57 , 62 ]; agriculture and e-commerce data [ 120 , 138 ]; health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], COVID-19 [ 43 , 74 ], etc.; and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine learning algorithms are mainly divided into four categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [ 75 ], as shown in Fig. 2. In the following, we briefly discuss each type of learning technique along with the scope of its applicability to solve real-world problems.

Fig. 2 Various types of machine learning techniques

  • Supervised: Supervised learning is typically the machine learning task of learning a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification”, which separates the data, and “regression”, which fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.
  • Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. It is widely used for extracting generative features, identifying meaningful trends and structures, grouping results, and for exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.
  • Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data can be rare in several contexts while unlabeled data are plentiful, which is where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than could be produced using the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.
  • Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., an environment-driven approach . This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from the environment to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.
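The reward-driven loop behind reinforcement learning can be illustrated with a minimal epsilon-greedy two-armed bandit in Python. The reward probabilities, exploration rate, and step count below are illustrative assumptions, not values from the text.

```python
import random

random.seed(0)  # fixed seed for a reproducible run

# A two-armed bandit: the agent learns which action yields more reward on average.
true_means = [0.3, 0.7]   # hidden reward probabilities of each arm (illustrative)
estimates = [0.0, 0.0]    # the agent's running estimate of each arm's value
counts = [0, 0]

for step in range(1000):
    # epsilon-greedy: mostly exploit the best current estimate, sometimes explore
    if random.random() < 0.1:
        action = random.randrange(2)                       # explore
    else:
        action = max((0, 1), key=lambda a: estimates[a])   # exploit
    reward = 1.0 if random.random() < true_means[action] else 0.0
    counts[action] += 1
    # incremental running mean of observed rewards for the chosen arm
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max((0, 1), key=lambda a: estimates[a]))  # the learned best action
```

The incremental-mean update is the simplest value estimate; full reinforcement learning methods extend this idea to states, transitions, and delayed rewards.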

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Table 1 Various types of machine learning techniques with examples

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms, including classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, and deep learning methods. The general structure of a machine learning-based predictive model is shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated for new test data in phase 2.

Fig. 3 A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning; it refers to a predictive modeling problem where a class label is predicted for a given example [ 41 ]. Mathematically, it learns a mapping function ( f ) from input variables ( X ) to output variables ( Y ), which serve as targets, labels, or categories. To predict the class of given data points, classification can be carried out on structured or unstructured data. For example, spam detection, with the classes “spam” and “not spam”, in email service providers is a classification problem. In the following, we summarize the common types of classification problems.

  • Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.
  • Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. The multiclass classification does not have the principle of normal and abnormal outcomes, unlike binary classification tasks. Instead, within a range of specified classes, examples are classified as belonging to one. For example, it can be a multiclass classification task to classify various types of network attacks in the NSL-KDD [ 119 ] dataset, where the attack categories are classified into four class labels, such as DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.
  • Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class at each hierarchical level, e.g., multi-level text classification. For instance, a Google News article can be presented under the categories of a “city name”, “technology”, “latest news”, etc. Multi-label classification involves advanced machine learning algorithms that support predicting multiple mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].
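The three problem types differ mainly in how their targets are represented. A small Python sketch, using illustrative labels drawn from the examples above:

```python
# Binary: each example carries exactly one of two labels
binary_labels = ["spam", "not spam", "spam"]

# Multiclass: one of several mutually exclusive labels per example
# (attack categories as in the NSL-KDD example)
multiclass_labels = ["DoS", "U2R", "R2L", "Probing"]

# Multi-label: each example may carry several non-exclusive labels at once,
# encoded here as a binary indicator vector over the category set
categories = ["city name", "technology", "latest news"]
article_labels = [1, 1, 0]  # a news article tagged with the first two categories

assigned = [c for c, flag in zip(categories, article_labels) if flag]
print(assigned)  # ['city name', 'technology']
```

The indicator-vector encoding is what allows an example to belong to any subset of the categories, which is exactly what distinguishes multi-label from multiclass tasks.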

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

  • Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. Its key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [ 82 ]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
  • Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class conditional densities to data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a reduction of dimensionality that minimizes the complexity of the model or reduces the resulting model’s computational costs. The standard LDA model usually suits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.
  • Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [ 64 ]. Logistic regression typically uses a logistic function, also referred to as the sigmoid function, defined mathematically in Eq. (1), to estimate the probabilities. It works well when the dataset can be separated linearly, but it may overfit high-dimensional datasets. Regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification. g(z) = 1 / (1 + exp(−z)). (1)
  • K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n -dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is to choose the optimal number of neighbors to be considered. KNN can be used both for classification as well as regression.
  • Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is the support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or a set of hyper-planes. Intuitively, the hyper-plane that has the greatest distance from the nearest training data points of any class achieves a strong separation since, in general, the larger the margin, the lower the classifier’s generalization error. SVM is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as kernels. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in the SVM classifier [ 82 ]. However, when the dataset contains more noise, such as overlapping target classes, SVM does not perform well.
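As a concrete illustration of the neighbor-vote idea behind KNN described above, the following minimal Python sketch classifies a query point by majority vote over Euclidean distance. The toy points are illustrative; a practical implementation would use optimized spatial indexes rather than a linear scan.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

print(knn_predict(train, (0.5, 0.5)))  # A: all three nearest neighbors are class A
print(knn_predict(train, (5.5, 5.5)))  # B
```

Note that the model is just the stored training data, which is why KNN is called "lazy learning": all work happens at prediction time, and the choice of k directly controls the vote.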

Fig. 4 An example of a decision tree structure

Fig. 5 An example of a random forest structure considering multiple decision trees

  • Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve weak classifiers by learning from their errors. It was developed by Yoav Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier of high accuracy by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [ 82 ], on binary classification problems; however, it is sensitive to noisy data and outliers.
  • Extreme gradient boosting (XGBoost): Gradient boosting, like the random forests [ 19 ] above, is an ensemble learning algorithm that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and uses advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
  • Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word ‘stochastic’ refers to random probability. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function, calculating a variable’s degree of change in response to another variable’s changes. Mathematically, gradient descent operates on a convex function, using the partial derivatives of the function with respect to its input parameters. Let α be the learning rate and J_i the cost of the i-th training example; then Eq. (4) represents the stochastic gradient descent weight update at the j-th iteration. In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations. w_j := w_j − α ∂J_i/∂w_j. (4)
  • Rule-based classification : The term rule-based classification can refer to any classification scheme that uses IF-THEN rules for class prediction. Several classification algorithms, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ], have the ability to generate rules. Among these techniques, the decision tree is one of the most common rule-based classification algorithms because it has several advantages, such as being easy to interpret; the ability to handle high-dimensional data; simplicity and speed; good accuracy; and the capability to produce rules that are clear and understandable to humans [ 127 , 128 ]. Decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.
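The per-example SGD update in Eq. (4) can be sketched for a single-weight linear model. The squared-error cost and learning rate below are illustrative choices, not specified in the text.

```python
def sgd_step(w, x, y, alpha=0.1):
    """One stochastic gradient descent update, w_j := w_j - alpha * dJ_i/dw_j,
    for a single-weight linear model with squared-error cost J_i = (w*x - y)^2 / 2."""
    grad = (w * x - y) * x  # partial derivative of J_i with respect to w
    return w - alpha * grad

# Fit w toward the true slope 2.0 on noiseless pairs (x, 2x),
# updating after every single example rather than the full batch
w = 0.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for _ in range(50):
    for x, y in data:
        w = sgd_step(w, x, y)

print(round(w, 3))  # 2.0
```

The one-example-at-a-time update is exactly what makes the method "stochastic": each step follows a noisy estimate of the full gradient, trading per-step accuracy for cheap iterations.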

Regression Analysis

Regression analysis includes several machine learning methods that allow predicting a continuous outcome variable ( y ) based on the values of one or more predictor variables ( x ) [ 41 ]. The most significant distinction between classification and regression is that classification predicts discrete class labels, while regression predicts a continuous quantity. Figure 6 shows an example of how classification differs from regression. Some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, etc., which are explained briefly in the following.

  • Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable ( Y ) and one or more independent variables ( X ), known as the regression line, using the best-fit straight line [ 41 ]. It is defined by the following equations: y = a + bx + e, (5) y = a + b_1x_1 + b_2x_2 + ⋯ + b_nx_n + e, (6) where a is the intercept, b is the slope of the line, and e is the error term. These equations can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable y as a linear function [ 41 ], defined in Eq. (6), whereas simple linear regression has only one independent variable, defined in Eq. (5).
  • Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear but is modeled as an n-th degree polynomial in x [ 82 ]. The equation for polynomial regression is derived from the linear regression (polynomial regression of degree 1) equation and is defined as follows: y = b_0 + b_1x + b_2x^2 + b_3x^3 + ⋯ + b_nx^n + e. (7) Here, y is the predicted/target output, b_0, b_1, ..., b_n are the regression coefficients, and x is the independent/input variable. In simple words, if the data are not distributed linearly but instead follow an n-th degree polynomial, then we use polynomial regression to get the desired output.
  • LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [ 82 ], which applies shrinkage by penalizing the “absolute value of the magnitude of coefficients” (L1 penalty). As a result, LASSO tends to shrink coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L2 regularization [ 82 ], which penalizes the “squared magnitude of coefficients” (L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, while ridge regression is useful when a dataset has “multicollinearity”, i.e., predictors that are correlated with other predictors.
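The best-fit line of Eq. (5) has a closed-form least-squares solution, sketched below in plain Python on illustrative data.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of a (intercept) and b (slope) in y = a + b*x + e."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y divided by the variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # the fitted line passes through the mean point
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 1 + 2x, so the fit recovers a = 1, b = 2
a, b = fit_simple_linear(xs, ys)
print(a, b)  # 1.0 2.0
```

On noisy data the same formula returns the line minimizing the sum of squared errors e; multiple linear regression (Eq. 6) generalizes this to a matrix solution over several predictors.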

Figure 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

  • Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-medoids [ 80 ], CLARA [ 55 ], etc.
  • Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density, isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. Typical density-based clustering algorithms are DBSCAN [ 32 ], OPTICS [ 12 ], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Figure 7: A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique

  • Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.
  • Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.
  • Constraint-based methods: Constrained-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.
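The density-based idea above can be sketched in a few lines of Python. The toy implementation below follows the spirit of DBSCAN — core points, border points, and noise — but is a simplified illustration, not the full algorithm of [ 32 ]; all names, data, and parameter values are chosen only for the example.

```python
# Toy density-based clustering: a point is a "core" point if at least
# min_pts points lie within eps of it, and clusters grow by chaining
# core points; points reachable from no core point are labeled noise.

def region(points, i, eps):
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                # may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point of this cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nb = region(points, j, eps)
            if len(nb) >= min_pts:           # j is also a core point: expand
                queue.extend(nb)
        cluster += 1
    return labels

pts = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (5, 5.2), (10, 0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
# two dense groups get distinct labels; the isolated point (10, 0) is noise
```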

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when the clusters in a dataset are well separated from each other. In this algorithm, the data points are allocated to clusters in such a way that the sum of the squared distances between the data points and the centroids is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest centroid while keeping the within-cluster variation as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
  • Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of application domains. Mean Shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.
  • DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering, which is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters that is used in model building. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers.
  • GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
  • Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
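As an illustration of the partitioning idea behind k-means, the following is a minimal pure-Python sketch for one-dimensional data. The initial centroids are fixed here for reproducibility, whereas real implementations choose them randomly — which is exactly why k-means results can be inconsistent between runs, as noted above.

```python
# Minimal 1-D k-means sketch: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points, repeating
# until the assignments (and hence the centroids) stop changing.

def kmeans_1d(points, centroids, max_iter=100):
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:       # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.4, 8.6]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
# centroids converge near 1.0 and 9.0, the means of the two groups
```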

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lower computational cost, and avoidance of overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the two is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand-new ones [ 98 ]. In the following, we briefly discuss these techniques.

  • Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, as well as increase the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.
  • Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new features from the existing ones and then discarding the originals. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space by creating brand-new components from the existing features in a dataset [ 98 ].
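The PCA idea can be sketched for the two-dimensional case, where the covariance matrix is 2×2 and its eigen-decomposition can be written in closed form. This is a didactic toy only; practical PCA implementations work in many dimensions, typically via singular value decomposition.

```python
# Minimal 2-D PCA sketch: center the data, form the 2x2 covariance
# matrix, and take the eigenvector of its largest eigenvalue as the
# first principal component (PC1). Assumes the off-diagonal covariance
# is nonzero (true for the toy data below).
from math import sqrt

def pca_2d(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    # covariance matrix [[cxx, cxy], [cxy, cyy]]
    cxx = sum(x * x for x in xs) / (n - 1)
    cyy = sum(y * y for y in ys) / (n - 1)
    cxy = sum(x * y for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix in closed form
    mean_diag = (cxx + cyy) / 2
    half_gap = sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    lam1 = mean_diag + half_gap              # variance captured by PC1
    # eigenvector for lam1 (valid when cxy != 0)
    vx, vy = cxy, lam1 - cxx
    norm = sqrt(vx * vx + vy * vy)
    return (vx / norm, vy / norm), lam1

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.0)]
pc1, var1 = pca_2d(points)
# the points lie close to the line y = x, so PC1 points near (0.707, 0.707)
```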

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • Variance threshold: A simple, basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the input features ( X ), not the outputs ( y ), and can, therefore, be used for unsupervised learning.
  • Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between features in a dataset. The resulting value lies in [ −1 , 1 ], where −1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If X and Y denote two random variables, then the correlation coefficient between X and Y is defined as [ 41 ]

r(X, Y) = ∑_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / [ √(∑_{i=1}^{n} (X_i − X̄)²) √(∑_{i=1}^{n} (Y_i − Ȳ)²) ]. (8)
  • ANOVA: Analysis of variance (ANOVA) is a statistical tool used to verify whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, as well as the variables’ normal distribution. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting ‘ANOVA F-value’ [ 82 ] of this test can be used, so that certain features independent of the target variable can be omitted.
  • Chi square: The chi-square ( χ² ) statistic [ 82 ] estimates the difference between the observed and expected frequencies of a series of events or variables. The value of χ² depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square test is commonly used for testing relationships between categorical variables. If O_i represents an observed value and E_i represents an expected value, then

χ² = ∑_{i=1}^{n} (O_i − E_i)² / E_i. (9)
  • Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [ 82 ] repeatedly fits the model and removes the weakest feature until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.
  • Model-based selection: To reduce the dimensionality of the data, linear models penalized with L 1 regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized LASSO regression method is often used in machine learning to select a subset of variables. The Extra-Trees classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.
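The scores behind several of the selection methods above (the variance used by a variance threshold, Pearson’s r from Eq. 8, and the chi-square statistic from Eq. 9) are straightforward to compute directly. The following pure-Python sketch is illustrative, with toy data; it is not library code.

```python
# Direct implementations of three feature-selection scores:
# variance, Pearson's correlation coefficient (Eq. 8), and chi-square (Eq. 9).
from math import sqrt

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sqrt(sum((x - mx) ** 2 for x in xs))
           * sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den

def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

constant = [3, 3, 3, 3]        # variance 0: dropped by a variance threshold
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]               # y = 2x, perfectly correlated with x
r = pearson_r(x, y)            # r = 1.0
stat = chi_square([10, 20, 30], [20, 20, 20])  # (100 + 0 + 100) / 20 = 10.0
```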

Figure 8: An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in different dimension space

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use its two parameters, ‘support’ and ‘confidence’, which were introduced in [ 7 ].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

  • AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.
  • Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach, where it generates the candidate itemsets. To reduce the search space, Apriori uses the property “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it receives unexpected results as it combines both the support and confidence. Apriori [ 8 ] is the most widely applicable technique in mining association rules.
  • ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.
  • FP-Growth: Another common association rule learning technique based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ] is Frequent Pattern Growth, known as FP-Growth. The key difference with Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets; on the other hand, the FP-growth algorithm [ 42 ] prevents candidate generation and thus produces a tree by the successful strategy of ‘divide and conquer’ approach. Due to its sophistication, however, FP-Tree is challenging to use in an interactive mining environment [ 133 ]. Thus, the FP-Tree would not fit into memory for massive data sets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining) proposed by Das et al. [ 26 ] but faces a related FP-tree issue [ 133 ].
  • ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper (Sarker et al. [ 104 ]), that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down manner and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.
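The candidate-generation idea of Apriori can be sketched compactly: count itemset support level by level and keep only the frequent itemsets, relying on the property that every subset of a frequent itemset must itself be frequent. The toy Python implementation below (illustrative, not the full algorithm of [ 8 ]) mines an example in the spirit of the earlier “laptop and anti-virus” rule; the item names and minimum support are invented for the example.

```python
# Toy Apriori-style frequent-itemset mining: generate candidates level
# by level, count their support, and prune infrequent itemsets before
# joining survivors into the next level's candidates.
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}                              # itemset -> support
    current = [frozenset([i]) for i in items]  # level-1 candidates
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        keys = list(survivors)
        # join frequent k-itemsets into (k+1)-candidates
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"laptop", "antivirus"}, {"laptop", "antivirus", "mouse"},
                 {"laptop", "mouse"}, {"antivirus"}]]
freq = apriori(transactions, min_support=0.5)
# e.g. support({laptop}) = 0.75 and support({laptop, antivirus}) = 0.5,
# so the rule laptop -> antivirus has confidence 0.5 / 0.75 ≈ 0.67
```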

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment, using feedback from its own actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning is defined as a Markov decision process (MDP) [ 86 ], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, reward, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero and AlphaGo [ 113 ] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, Deep Q-Network, Monte Carlo control, SARSA (state–action–reward–state–action), etc., are some examples of model-free algorithms [ 52 ]. The use of a model of the environment, which is required for model-based RL but not for model-free RL, is the key difference between the two. In the following, we discuss the popular RL algorithms.

  • Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and drawing samples from probability distributions are the three problem classes where Monte Carlo techniques are most commonly used.
  • Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of behaviors that tell an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected rewards for a given behavior in a given state.
  • Deep Q-learning: Q-learning works well when the setting is reasonably simple; however, when the number of states and actions becomes large, deep learning can be used as a function approximator. The basic working step in deep Q-learning [ 52 ] is that the current state is fed into a neural network, which returns the Q-values of all possible actions as output.
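The Q-learning update rule, Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)], can be demonstrated on a tiny deterministic “corridor” environment. Everything below — the environment, hyperparameters, and function names — is invented for illustration; it is a sketch of tabular Q-learning, not a reference implementation.

```python
# Tabular Q-learning on a 5-state corridor: the agent starts at state 0,
# actions move it left or right, and reaching state 4 (terminal) pays
# reward 1. An epsilon-greedy policy balances exploration and exploitation.
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                           # left, right

def step(state, action):
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:           # explore
                a = rng.randrange(2)
            else:                            # exploit current estimates
                a = max(0, 1, key=lambda i: q[s][i])
            nxt, r, done = step(s, ACTIONS[a])
            target = r + (0.0 if done else gamma * max(q[nxt]))
            q[s][a] += alpha * (target - q[s][a])
            s = nxt
    return q

q = q_learning()
policy = [max(0, 1, key=lambda a: q[s][a]) for s in range(GOAL)]
# the greedy policy moves right (action index 1) in every non-terminal state
```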

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning compared to machine learning with an increasing amount of data. However, the performance may vary depending on the data characteristics and experimental setup.

Figure 9: Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN, or ConvNet), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Figure 10: A structure of an artificial neural network model with multiple processing layers

Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

  • LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback links. LSTM networks are well suited for analyzing and learning sequential data, such as classifying, processing, and making predictions based on time-series data, which differentiates them from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time series or sentences, and is commonly applied in time-series analysis, natural language processing, speech recognition, etc.
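A single LSTM cell step can be written out explicitly to show the gate computations that give LSTM its memory: input, forget, and output gates plus a candidate state. The scalar-weight version below is a didactic sketch (real layers use weight matrices and vectors), and the weight values are arbitrary.

```python
# One forward step of a standard LSTM cell with scalar weights:
# input gate i, forget gate f, output gate o, candidate g,
# cell state c_t = f*c_prev + i*g, hidden state h_t = o*tanh(c_t).
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step; w holds (w_x, w_h, bias) per gate: i, f, o, g."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])
    g = tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])
    c = f * c_prev + i * g          # memory: keep part of old state, add new
    h = o * tanh(c)                 # exposed hidden state
    return h, c

w = {k: (0.5, 0.5, 0.0) for k in "ifog"}   # arbitrary illustrative weights
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:                 # process a short sequence
    h, c = lstm_step(x, h, c, w)
```

Because the cell state c is carried across steps, information from early inputs can influence later outputs, which is the feedback behavior described above.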

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist in the area for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique that is widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks, such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, typically the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is provided in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

  • Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict an unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms, such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ], are commonly used in this area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.
  • Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ]; it is typically the practice of protecting networks, systems, hardware, and data from digital attacks [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where bad neighborhoods are online, keep people safe while browsing, and secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, we can say that the various learning techniques discussed in Sect. Machine Learning Tasks and Algorithms can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
  • Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to people’s current needs.
  • Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO 2 pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system through predicting future traffic is important, which is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize the issues [ 17 , 30 , 31 ]. For example, based on the travel history and trend of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending their customers to take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
  • Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. In the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate their mortality rate, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, predict the COVID-19 outbreak, and support disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions so that the required arrangements can be made. Deep learning also provides exciting solutions to problems in medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.
  • E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.
  • NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.
  • Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. For instance, labeling an x-ray as cancerous or not, recognizing characters or detecting faces in an image, and generating tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ] is also very popular; it typically uses sound and linguistic models and underlies assistants such as Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in the area.
  • Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to enhance their decisions to adopt sustainable agriculture practices utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
  • User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify behaviors accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making systems that intelligently assist mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking time-series data into account [ 102 ]. To predict future events in various contexts, classification methods can be used [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.
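The product-recommendation task described above can be illustrated with a minimal item-scoring sketch based on user-user cosine similarity; the purchase matrix, the user indices, and the `recommend` helper below are invented for illustration and do not come from any real e-commerce system.

```python
import numpy as np

# Toy user-item purchase matrix (rows: users, columns: products).
# 1 = purchased, 0 = not purchased. All values are illustrative.
ratings = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two purchase vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

def recommend(user_idx, k=2):
    """Score unpurchased items by similarity-weighted purchases of other users."""
    target = ratings[user_idx]
    sims = np.array([cosine_sim(target, other) for other in ratings])
    sims[user_idx] = 0.0           # exclude the user themselves
    scores = sims @ ratings        # weighted vote for each item
    scores[target > 0] = -np.inf   # hide items already purchased
    return np.argsort(scores)[::-1][:k]

print(recommend(0))
```

Real recommenders add rating values, implicit feedback, and matrix factorization, but the similarity-weighted voting above is the core idea of collaborative filtering.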
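Likewise, the sentiment analysis (polarity classification) task can be sketched with a TF-IDF text representation and logistic regression in scikit-learn; the six toy reviews and their labels below are invented, and a real system would be trained on a much larger labeled corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; labels are the polarity classes.
reviews = [
    "great product, very happy with it",
    "excellent service and fast delivery",
    "terrible quality, very disappointed",
    "awful experience, would not recommend",
    "love this brand, works perfectly",
    "broken on arrival, total waste of money",
]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

# TF-IDF turns each text into a weighted term vector; logistic regression
# then learns a linear decision boundary between the polarity classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["great excellent love happy"]))
```

The same pipeline extends naturally to the finer-grained emotion labels mentioned above by supplying more than two classes at training time.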
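Finally, the clustering of users' diverse behavioral activities mentioned for context-aware smartphone applications can be sketched with k-means over hourly usage profiles; the synthetic data-generating assumptions (one morning-active and one evening-active user group) are invented purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic per-user phone-usage counts over 24 hours: one group is active
# in the morning half of the day, the other in the evening half.
morning = rng.poisson(lam=np.r_[np.full(12, 5.0), np.full(12, 0.5)], size=(20, 24))
evening = rng.poisson(lam=np.r_[np.full(12, 0.5), np.full(12, 5.0)], size=(20, 24))
usage = np.vstack([morning, evening])

# K-means groups users with similar daily activity profiles; the resulting
# cluster labels could then drive context-aware behavior in an app.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(usage)
print(km.labels_)
```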

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in a relevant domain, such as cybersecurity, IoT, healthcare, or agriculture discussed in Sect. “ Applications of Machine Learning ”, is not straightforward, although the current cyberspace enables the production of a huge amount of data with very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing those data well is important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The machine learning algorithms discussed in Sect. “ Machine Learning Tasks and Algorithms ” are highly sensitive to the quality and availability of the training data, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to effectively use the learning algorithms in the associated application domain.
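As a minimal illustration of the pre-processing challenges described above (missing values, outliers, and meaningless data), the following pandas sketch cleans a toy table; the sensor columns, the simple IQR outlier rule, and median imputation are assumptions chosen for illustration, not a prescription.

```python
import numpy as np
import pandas as pd

# Toy sensor readings exhibiting the problems named above: missing values,
# an obvious outlier, and an uninformative constant column.
df = pd.DataFrame({
    "temperature": [21.5, 22.0, np.nan, 21.8, 500.0, 22.3],
    "humidity":    [40.0, np.nan, 42.0, 41.5, 43.0, 40.5],
    "sensor_id":   [7, 7, 7, 7, 7, 7],  # constant column: carries no signal
})

# 1. Flag outliers with a simple IQR rule and treat them as missing.
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["temperature"] < q1 - 1.5 * iqr) | (df["temperature"] > q3 + 1.5 * iqr)
df.loc[mask, "temperature"] = np.nan

# 2. Impute remaining gaps with the column median.
for col in ["temperature", "humidity"]:
    df[col] = df[col].fillna(df[col].median())

# 3. Drop constant columns, which cannot help a learning algorithm.
df = df.loc[:, df.nunique() > 1]

print(df)
```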

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “ Machine Learning Tasks and Algorithms ”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging. The reason is that the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting a wrong learning algorithm would produce unexpected outcomes, leading to wasted effort as well as reduced effectiveness and accuracy of the model. In terms of model building, the techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “ Applications of Machine Learning ”. However, hybrid learning models, e.g., ensembles of methods, modification or enhancement of the existing learning techniques, or the design of new learning methods could be potential future work in the area.
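The algorithm-selection problem discussed above is often approached empirically, e.g., by comparing candidate models under cross-validation; the sketch below uses scikit-learn on a synthetic dataset, and the particular candidate set is only an example of the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a domain dataset (e.g., intrusion records).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation scores each candidate on held-out folds, so the
# choice reflects generalization rather than fit to the training data.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Because the ranking depends on the data characteristics, the same loop run on a different dataset may favor a different candidate, which is exactly the point made above.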

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are poorly suited to learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. The sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges that were identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for both academia and industry professionals, as well as for decision-makers, from a technical point of view.

Declaration

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

  • PMID: 33778771
  • PMCID: PMC7983091
  • DOI: 10.1007/s42979-021-00592-x


Keywords: Artificial intelligence; Data science; Data-driven decision-making; Deep learning; Intelligent applications; Machine learning; Predictive analytics.

© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021.


  • Open access
  • Published: 09 September 2021

Advancing agricultural research using machine learning algorithms

  • Spyridon Mourtzinis 1 ,
  • Paul D. Esker 2 ,
  • James E. Specht 3 &
  • Shawn P. Conley 4  

Scientific Reports volume  11 , Article number:  17879 ( 2021 )


Subjects: Agroecology

Rising global population and climate change realities dictate that agricultural productivity must be accelerated. Results from current traditional research approaches are difficult to extrapolate to all possible fields because they are dependent on specific soil types, weather conditions, and background management combinations that are not applicable nor translatable to all farms. A method that accurately evaluates the effectiveness of infinite cropping system interactions (involving multiple management practices) to increase maize and soybean yield across the US does not exist. Here, we utilize extensive databases and artificial intelligence algorithms and show that complex interactions, which cannot be evaluated in replicated trials, are associated with large crop yield variability and thus, potential for substantial yield increases. Our approach can accelerate agricultural research, identify sustainable practices, and help overcome future food demands.

Introduction

Increasing food demand will challenge the agricultural sector globally over the next decades 1 . A sustainable solution to this challenge is to increase crop yield without massive cropland area expansion. This can be achieved by identifying and adopting best management practices. To do so requires a more detailed understanding of how crop yield is impacted by climate change 2 , 3 and growing-season weather variability 4 . Even with that knowledge, prediction is challenging because various factors interact with each other. For example, variability in soil type can interact with weather conditions and mitigate or aggravate climate-related impacts on crop yield 5 , 6 . Additionally, seed genetics (G) and crop management decisions (M), interact with the effect of environment (E: soil and in-season weather conditions), thereby resulting in a near infinite number of combinations of G × E × M that can impact crop yield.

Substantial variability in crop yield arises from the wide range of optimal to sub-optimal management observed in soybean farmers’ fields 7 , 8 . Reducing the frequency of lowest vs. highest yields has been proposed as an effective means to increase food production in existing crop land 9 . In that regard, replicated field experiments have been used to identify best management practices for several decades. Most commonly, the effectiveness of up to three management factors and their interactions are evaluated in a single location due to practical constraints (e.g., cost, logistics). By holding the background management constant, causal relationships are identified, and the effectiveness of the examined management practice(s) is assessed. It is assumed that background management practices are optimal or at least relevant to what most farmers use in the region, which in fact may not be realistic for many farmers.

Multi-year-site performance trials that account for large environmental and background management variability are another common practice in agricultural research. Such trials usually estimate an average effect across environments and background cropping systems. Inevitably, the measured yield response magnitude and sign may not apply to all farms in the examined region. Other research approaches involve analysis of producer self-reported data 7 , 8 , which can capture yield trends attributable to producer management choices across large regions, but such studies lack sufficient power relative to establishing causality and evaluating complex high-order G × E × M interactions.

Process-based models have been extensively used to evaluate the effect of weather 10 and management 11 , 12 on crop yield. However, to obtain accurate estimates, the models require extensive calibration, which is not a trivial task due to the large number of parameters. Specifically, it has been shown that management is an important source of uncertainty in process-based models, which can lead to a substantial and varying degree of bias in yield estimates across the US, even when using harmonized parameters 13 .

Given all the well-known deficiencies of current agricultural research methods, we argue that a method that allows environment-specific identification of unique cropping systems with the greatest yield potential is essential to meet future food demand. Here, by utilizing maize and soybean yield and management data from publicly available performance tests, plus associated weather data, and by leveraging the power of machine learning (ML) algorithms, we developed a method that can evaluate myriads of potential crop management systems and thereby identify those with the greatest yield potential in specific environments across the US.

Results and discussion

Two databases including yield, management, and weather data for maize (n = 17,013) and soybean (n = 24,848), involving US crop performance trials conducted in 28 states from 2016 to 2018 for maize and from 2014 to 2018 for soybean, were developed (Fig.  1 ). Crop yield and management data were obtained from publicly available variety performance trials, which are typically performed yearly in several locations across each state ( see methods for more information ). Final databases were separated into training (80% of database) and testing (20% of database) datasets using stratified sampling by year, use of irrigation, and soil type. For each crop, an extreme gradient boosting (XGBoost, see methods for more information ) algorithm was developed to estimate yield based on soil type and weather conditions (E), seed traits (G), and management practices (M) (see variables listed in Tables S1 and S2 for maize and soybean, respectively, and data science workflow in Fig. S1 ).
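The model-building step described above can be sketched in miniature; the code below substitutes scikit-learn's `GradientBoostingRegressor` for XGBoost and trains on fully synthetic E/G/M features whose names, units, and yield effects are invented for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins for environment and management features (all invented).
precip = rng.uniform(300, 900, n)       # season precipitation, mm
sowing_doy = rng.uniform(110, 160, n)   # sowing day of year
seed_rate = rng.uniform(60, 90, n)      # thousand seeds/ha
n_fert = rng.uniform(0, 250, n)         # kg N/ha

# Invented yield response: precipitation plateau, sowing-date penalty,
# N response, plus noise standing in for unmodeled G x E x M effects.
yield_kg = (9000
            + 8.0 * np.minimum(precip, 650) - 4.0 * np.maximum(precip - 650, 0)
            - 30.0 * (sowing_doy - 120)
            + 6.0 * n_fert
            + rng.normal(0, 400, n))

X = np.column_stack([precip, sowing_doy, seed_rate, n_fert])
X_train, X_test, y_train, y_test = train_test_split(
    X, yield_kg, test_size=0.2, random_state=0)

# Gradient-boosted trees fit additive and interaction structure without
# specifying it in advance, the property the study relies on.
model = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                  learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE as % of mean yield: {100 * mae / y_test.mean():.1f}%")
```

Reporting MAE as a percentage of the dataset's mean yield mirrors how the paper summarizes accuracy in the next paragraph.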

figure 1

Locations where maize and soybean trials were performed during the examined period. The map was developed in ArcGIS Pro 2.8.0 ( https://www.esri.com ).

The developed algorithms exhibited a high degree of accuracy when estimating yield in independent datasets (test dataset not used for model calibration) (Fig.  2 ). For maize, the root mean square error (RMSE) and mean absolute error (MAE) were a respective 4.7 and 3.6% of the dataset average yield (13,340 kg/ha). For soybean, the respective RMSE and MAE were 6.4 and 4.9% of the dataset average yield (4153 kg/ha). As is evident in the graphs (Fig.  2 ), estimated yields exhibited a high degree of correlation with actual yields for both algorithms in the independent datasets. For maize and soybean, a respective 72.3 and 60% of cases in the test dataset deviated less than 5% from actual yields. Maximum deviation for maize and soybean reached 43 and 70%, respectively. Data points with deviations greater than 15% from actual yield were 1.5% in the maize and 3.6% in the soybean databases. These results suggest that the developed algorithms can accurately estimate maize and soybean yields utilizing database-generated information involving reported environmental, seed genetic, and crop management variables.

figure 2

Actual versus algorithm-derived maize (left) and soybean (right) yield in test datasets. Black solid line indicates y = x, red short-dashed lines, black dashed lines, and red long-dashed lines indicate ± 5, 10, and 15% deviation from the y = x line. RMSE, root mean square error; MAE, mean absolute error; r 2 , coefficient of determination; n = number of observations. Each observation corresponds to a yield of an individual cropping system in a specific environment (location-year).

In contrast to statistical models, ML algorithms can be complex, and the effect of single independent variables may not be obvious. However, accumulated local effects (ALE) plots 14 can aid the understanding and visualization of important and possibly correlated features in ML algorithms. For both crops, indicatively important variables included sowing date, seeding rate, nitrogen fertilizer (for maize), row spacing (for soybean), and June to September cumulative precipitation (Fig.  3 ). Across the entire region and for both crops, the algorithm-derived trends suggest that above average yields occur with late April to early May sowing dates but sharply decrease thereafter. Similar responses have been observed in many regional studies across the US for both maize 15 , 16 , 17 , 18 and soybean 19 . Similarly, simulated yield curves due to increasing seeding rate are in close agreement with previous maize 20 , 21 and soybean 22 studies. The maize algorithm has captured the increasing yield due to increasing N fertilizer rate. The soybean algorithm suggests that narrower row spacing resulted in above average yield compared to wider spacing. Such a response has been observed in many regions across the US 23 . Season cumulative precipitation between 400 and 700 mm resulted in above average yields for both crops.
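The idea behind a first-order ALE curve can be sketched in a few lines: bin the feature at empirical quantiles, average the model's prediction change across each bin, and accumulate. The toy model, the correlated synthetic features, and the simple mean-centering (a full implementation weights the centering by bin counts) are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: prediction depends nonlinearly on feature 0, and the two
# features are correlated, which is exactly the case ALE is designed for.
x0 = rng.normal(size=1000)
x1 = 0.8 * x0 + 0.2 * rng.normal(size=1000)
X = np.column_stack([x0, x1])

def predict(X):
    """Stand-in for a fitted ML model's prediction function."""
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

def ale_1d(predict, X, feature, n_bins=10):
    """Simplified first-order accumulated local effects for one feature."""
    x = X[:, feature]
    # Bin edges at empirical quantiles so each bin holds similar mass.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    effects = np.zeros(n_bins)
    for b in range(n_bins):
        members = X[idx == b]
        if len(members) == 0:
            continue
        lo, hi = members.copy(), members.copy()
        lo[:, feature] = edges[b]
        hi[:, feature] = edges[b + 1]
        # Local effect: average prediction change across the bin, holding
        # the other (possibly correlated) features at their observed values.
        effects[b] = (predict(hi) - predict(lo)).mean()
    ale = np.cumsum(effects)
    return ale - ale.mean()  # center so the curve shows deviation from average

print(ale_1d(predict, X, feature=0))
```

Plotting the returned values against the bin edges yields a curve analogous to the panels in Fig. 3.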

figure 3

Accumulated local effect plots for maize sowing date ( A ), seeding rate ( B ), Nitrogen fertilizer rate ( C ), and cumulative precipitation between June and September (mm) ( D ), and soybean sowing date ( E ), seeding rate ( F ), row spacing ( G ), and cumulative precipitation between June and September (mm) ( H ).

The responses in the ALE plots (Fig.  3 ) suggest that these algorithms have captured the general expected average responses for important single features. Nevertheless, our databases include hundreds of locations with diverse environments across the US and site-specific crop responses which may vary due to components of the G × E × M interaction. We argue that, instead of examining a single or low-order management interactions, site-specific evaluation of complex high order interactions (a.k.a. cropping systems) can reveal yield differences that current research approaches cannot fully explore and quantify. For example, sowing date exerts a well-known impact on maize and soybean yield. For each crop separately, by creating a hypothetical cropping system (a single combination of all management and traits in Tables S1 and S2 ) in a randomly chosen field in south central Wisconsin (latitude = 43.34, longitude = -89.38), and by applying the developed algorithms, we can generate estimates of maize and soybean yield. For that specific field and cropping system (out of the vast number of management combinations a farmer can choose from), maize yield with May 1st sowing was 711 kg/ha greater (6% increase) than June sowing (Fig.  4 A). By creating scenarios with 256 background cropping system choices (Table S3 ), the resultant algorithm-derived yield estimate difference for the same sowing date contrast (averaged across varying cropping systems) was smaller but still positive (3% increase), although the range of possible yield differences was wider (Fig.  4 B). However, when comparing, instead of averaging, the estimated yield potential among the simulated cropping systems, a 2903 kg/ha yield difference (25% difference) was observed (Fig.  4 C). Interestingly, when focusing on the early sown fields that were expected to exhibit the greatest yield, the same yield difference was observed (Fig.  4 D). This result shows that sub-optimal background management can mitigate the beneficial effect of early sowing (Table S4 ).

figure 4

Maize yield difference (in kg/ha and percentage) due to sowing date (May 1st vs. June 1st) for a single identical background cropping system ( A ), maize yield difference due to sowing date when averaged across 256 cropping systems (3 years × 256 cropping systems = 768 year-specific yields) ( B ), maize yield variability in each of the 256 cropping systems ( C ), and maize yield variability in each of the 128 cropping systems with early sowing ( D ). Soybean yield difference due to sowing date (May 1st vs. June 1st) for a single identical background cropping system ( E ), soybean yield difference due to sowing date when averaged across 128 cropping systems (5 years × 128 cropping systems = 640 year-specific yields) ( F ), soybean yield in each of the 128 cropping systems ( G ), and soybean yield variability in each of the 64 cropping systems with early sowing ( H ). Within each panel, the horizontal red and grey lines indicate the boxplots with maximum and minimum yield, respectively. In the left four panels, boxes delimit first and third quartiles; solid lines inside boxes indicate medians and green triangles indicate means. Upper and lower whiskers extend to maximum and minimum yields. Each maize and soybean cropping system is a respective 8-way and 7-way interaction of management practices in a randomly chosen field in Wisconsin, USA (Tables S3 and S5 , respectively).

In the case of soybean, a May 1st sowing resulted in greater yield (588 kg/ha; a 14% increase) than a June 1st in the single background cropping system (Fig.  4 E). The result was consistent when yield differences due to sowing date were averaged across 128 background cropping system choices (Table S5 ) (Fig.  4 F). Similar to what was observed in maize, among all cropping systems, yield varied by 1704 kg/ha (44% difference) (Fig.  4 G). When focusing only on the early sown fields, a 1181 kg/ha yield difference (27% yield increase) was observed (Fig.  4 H). In agreement with maize, this result highlights the importance of accounting for sub-optimal background management which can mitigate the beneficial effect of early sowing (Table S6 ).
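The cropping-system comparisons above amount to enumerating a management menu and ranking model-predicted yields. In the sketch below, `itertools.product` generates the combinations; the management levels and `toy_yield_model`, with its coefficients, are invented stand-ins for the trained algorithm and the real factor lists in Tables S3 and S5.

```python
import itertools

# Hypothetical management menu for one field (names and levels invented).
menu = {
    "sowing": ["May 1", "June 1"],
    "seed_rate": [60, 80],     # thousand seeds/ha
    "row_spacing": [38, 76],   # cm
    "n_fert": [0, 180],        # kg N/ha
}

def toy_yield_model(system):
    """Stand-in for the trained algorithm: returns kg/ha for one system."""
    y = 10000.0
    y += 600 if system["sowing"] == "May 1" else 0
    y += 3.0 * (system["seed_rate"] - 60)
    y += 150 if system["row_spacing"] == 38 else 0
    y += 2.5 * system["n_fert"]
    return y

# Enumerate every combination (the "cropping systems") and rank by yield.
systems = [dict(zip(menu, combo))
           for combo in itertools.product(*menu.values())]
ranked = sorted(systems, key=toy_yield_model, reverse=True)

best, worst = ranked[0], ranked[-1]
print(best, toy_yield_model(best))
print(worst, toy_yield_model(worst))
```

With a real fitted model in place of `toy_yield_model`, the spread between the best and worst systems corresponds to the yield ranges shown in Fig. 4C,D,G,H.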

We note here that the ability of farmers to change management practices can be limited due to equipment constraints (e.g., changing planter unit row width) or simply impossible (e.g., changing the previous year’s crop). Thus, recommended management practices that were evaluated in studies that used specific background management may not be applicable in some instances. The benefits of the foregoing approach, which involves extensive up-to-date agronomic datasets and high-level computational programming, can have important and immediate implications for future agricultural trials. Our approach allows for more precise examination of complex management interactions in specific environments (soil type and growing season weather) across the US (region covered in Fig.  1 ). The ability to extract single management practice information (even across cropping systems) is also possible by utilizing ALE plots, or by calculating the frequency at which a given level/rate of a management practice appeared among the highest yielding cropping systems (Tables S4 and S6 ).

Among all available 30-d weather variables, many were strongly correlated in both crop databases (Figs. S2 and S3 for maize and soybean, respectively). Models using all 30-d interval variables with r < 0.7 (Tables S8 and S9 ) showed minimal to no performance gain compared to the final, more parsimonious models that included season-long weather variables (Fig. S4 ). Thus, we consider the period lengths we chose to represent well the approximate successive 60-d pre-sowing, 120-d in-season, and 60-d post-harvest segments of the growing season in the US (Fig. S7 ). Season-long weather conditions have been used in previous studies 13 , 24 , and it has been shown that the choice of growing season does not affect climate-related effects on crop yield 25 , 26 .

As an additional sensitivity analysis, we developed ALE plots for the algorithms using the aforementioned 30-d weather variables (Fig. S8 ). For major management practices, there were no differences in simulated responses between the algorithms that used multiple 30-d weather variables and the final chosen algorithms that used longer intervals (Fig.  3 ). Repeating the analysis for the same hypothetical cropping system in the same Wisconsin location using the algorithms developed with the 30-d weather conditions, the observed trends were consistent with those of the season-long weather algorithms, although the simulated yields were numerically different (Fig. S9 ). Nevertheless, across all representations of weather conditions (algorithms with 30-d intervals and season-long), the levels/rates of management practices in the 5% highest and lowest yielding maize and 5% highest yielding soybean cropping systems with early sowing dates were identical, apart from manure use in maize. Based on these results, we consider the algorithm-derived yield estimates robust to different representations of seasonal weather variability.

It appears that several different cropping systems can result in similarly high yield for both crops (Fig. 4C,D,G,H). This is in agreement with other agricultural decision-support tools 27. Moreover, it is common for neighboring farms to attain similar crop yield despite using different cropping systems, suggesting that a single optimal solution does not necessarily exist and that different combinations of management practices, when they interact with the environment, can still result in similarly high yields. Since the effect of environment is ever-changing, the high level of complexity of synergies between G × E × M suggests that long-term optimization of a single management factor may not be possible 28, which further highlights the importance of accounting for the effect of the entire cropping system at the field level.

The approach we present here should not be considered a crop yield forecasting exercise. There have been several attempts to forecast crop yields using deep neural network methods (e.g., 29, 30). In contrast, the algorithms we present here can generate hypothetical experimental data that can be used to rapidly examine G × E × M interactions for both maize and soybean across the US. Of the millions of possible G × E × M combinations, our ML algorithms can identify hidden complex patterns between G × E × M combinations for yield optimization that may be non-obvious but, once identified, are worthy of field-test confirmation. Farmers can use the algorithms to gain insights about optimum management interactions in their location-specific environment (known soil type × expected weather conditions), and to identify farm factors that may be too costly to alter without an a priori reason (generated by the model) for doing so. Researchers can compare expected yield across thousands of hypothetical cropping systems and use the results as a guide to design more efficient future field-based crop management practice evaluation experiments.
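For illustration only, this screening idea can be sketched as a brute-force enumeration: cross hypothetical management levels into candidate cropping systems, score each with a trained model, and keep the top 5%. All factor names and levels below are invented for the example, and `model` and `encode` stand in for a fitted regressor and its feature encoding (neither is defined here):

```python
from itertools import product

import pandas as pd

# Hypothetical factor levels; the real databases span many more practices (Tables S1-S2).
factors = {
    "sowing_doy": [110, 120, 130, 140],
    "seeding_rate": [60_000, 75_000, 90_000],
    "row_spacing_cm": [38, 76],
    "previous_crop": ["maize", "soybean"],
}

# Every combination of levels becomes one hypothetical cropping system.
systems = pd.DataFrame(list(product(*factors.values())), columns=list(factors))

# With a fitted regressor, the highest yielding 5% could then be extracted:
# systems["yield_pred"] = model.predict(encode(systems))
# top5 = systems.nlargest(max(1, int(len(systems) * 0.05)), "yield_pred")
```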

We note that this approach should not be considered a substitute for replicated trials. To the contrary, replicated field trials performed by universities are continually needed, as they serve as an excellent source of high-quality, unbiased data which can be used to train even more comprehensive algorithms. The major issue with current performance trial data is that a great amount of management information is not reported. Usually, only information relevant to the management factors examined in each trial is reported, which inevitably results in missing values (Tables S1 and S2), or even in the absence of important variables (e.g., number and dates of split fertilizer applications). As we have highlighted here, high-order and complex background management interactions should not be considered irrelevant.

Conclusions

Agricultural experiments repeated every year in hundreds of locations across the US generate a vast amount of crop yield and management datasets which are useful for broad inferences (average effect of a management practice across a range of environments). Such datasets have, to date, remained disconnected from each other, and are difficult to combine, standardize, and properly analyze. In the presented work, we overcame these issues by developing large databases and by leveraging the power of ML algorithms. We argue that our algorithms can advance agricultural research and aid in revealing a currently hidden yield potential in each individual farm across the US.

Methods

Crop yield and management data were obtained from publicly available variety performance trials, which are typically performed yearly in several locations across each state 31. Recorded, trial-specific management practices for maize included use of irrigation, tillage practice, seeding rate, row spacing, sowing date, previous crop, fertilizer (N, P, and K), use of manure, cultivar maturity, insecticide traits, and use of seed treatments (Table S1). For soybean, use of irrigation, foliar fungicide, tillage practice, seeding rate, row spacing, sowing date, previous crop, and cultivar maturity were recorded (Table S2).

Since data were collected from different states and years, it was assumed that reported management practices (general categories) were consistent across all locations. Additionally, the type and application method of fertilizer were rarely reported. Similarly, there was a lack of information on the active ingredients and rates of seed treatments and foliar-applied products. We acknowledge that this lack of information, as noted in the discussion section, is a limitation of our databases, and our assumption that different management practices are reported consistently across states may have contributed to the observed unexplained variability.

For both databases, data entry was performed manually. Additionally, for both crops, soil type was recorded and weather data (Table S7) were retrieved from the DAYMET 32 database for each year and set of coordinates. DAYMET daily data are reasonably accurate when means or totals are calculated over extended periods 33. Therefore, means and sums for three periods (days of year 90–150, 151–270, and 271–330) (Tables S1 and S2) and for 30-d periods (Tables S8 and S9) were calculated. The different sets of weather variables were used in different models to assess their impact on model accuracy.
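A minimal sketch of this aggregation step, assuming a DAYMET-style daily table with hypothetical column names (`yday` for day of year, `tmax`, `prcp`):

```python
import pandas as pd

def seasonal_summaries(daily: pd.DataFrame) -> dict:
    """Means and sums over the three fixed periods used in the paper
    (days of year 90-150, 151-270, and 271-330)."""
    periods = {
        "pre_sowing": (90, 150),
        "in_season": (151, 270),
        "post_harvest": (271, 330),
    }
    out = {}
    for name, (start, end) in periods.items():
        window = daily[daily["yday"].between(start, end)]  # inclusive bounds
        out[f"tmax_mean_{name}"] = window["tmax"].mean()   # temperatures: period mean
        out[f"prcp_sum_{name}"] = window["prcp"].sum()     # precipitation: period total
    return out
```

The same helper, applied to 30-day windows, would produce the alternative variable sets of Tables S8 and S9.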

The exact coordinates for each site were not reported in the trial reports. Therefore, approximate coordinates, based on the nearest reported city, were used for each unreported site. When unmanageable production adversities were reported (e.g., hail, damage due to deer, etc.), the associated data were not used. Missing values were present in almost all management-related variables in both databases (Tables S1 and S2). Because the data were derived from designed experiments, management levels were not responses to external factors (e.g., weather conditions) but rather researchers' decisions to answer specific research questions (e.g., crop yield response to different sowing dates or maturity ratings); therefore, no missing-data imputation was performed.

The first step before data analysis was to examine correlations among the weather variables. Due to their strong collinearity (Figs. S3 and S4 for maize and soybean, respectively), only those with Pearson r < 0.7 were retained for subsequent analyses. The final maize database included seven weather variables (Table S1) and the final soybean database included eight weather variables (Table S2). Categorical variables were one-hot encoded, and the databases were then split into training (80%) and testing (20%) datasets. To ensure adequate representation of growing environments in both the training and testing portions of the data, stratified sampling was performed by year, use of irrigation, and soil type. For each crop, an extreme gradient boosting (XGBoost) algorithm 34 was trained to predict final yield as a response to the aforementioned weather and management variables listed in Tables S1 and S2. The hyperparameters were optimized using the training dataset and included number of estimators, tree depth, number of leaves, minimum sum of instance weight in a node, learning rate, subsample percentage, column sample by tree and by level, and the gamma, alpha, and lambda parameters. To tune the hyperparameters efficiently, Bayesian optimization was performed using "hyperopt" in Python 3.6.9 with tenfold cross-validation. The combination of hyperparameters that resulted in the lowest root mean square error (RMSE) in the tenfold cross-validation was chosen as the final model, which was further evaluated on the test portion of the data (Fig. 2 in main document).

Accumulated local effects (ALE) plots 14, which are robust to correlation among independent variables, were developed for indicative and important variables using 1000 Monte Carlo simulations. These plots are useful for visualizing how individual features influence the predictions of the developed "black-box" algorithms. To evaluate the "what if" scenarios, the final algorithms were applied to hypothetical cropping systems in a randomly chosen field in south-central Wisconsin (latitude = 43.34, longitude = −89.38) using weather conditions in 2016–2018 for maize and 2014–2018 for soybean. Boxplots were used to visually evaluate the results.
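For intuition, a bare-bones first-order ALE estimate (following Apley & Zhu 14) can be computed with NumPy alone; this sketch omits the Monte Carlo replicates and plotting that library implementations provide:

```python
import numpy as np

def ale_1d(predict, X, feature, bins=10):
    """First-order accumulated local effects for one numeric feature."""
    x = X[:, feature]
    # Interval edges at empirical quantiles so each bin holds similar mass
    edges = np.unique(np.quantile(x, np.linspace(0, 1, bins + 1)))
    idx = np.digitize(x, edges[1:-1])              # bin index of every observation
    effects = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, feature] = edges[k]                  # move points to the bin's left edge
        hi[:, feature] = edges[k + 1]              # ... and to its right edge
        effects[k] = (predict(hi) - predict(lo)).mean()  # mean local effect in the bin
    ale = np.concatenate([[0.0], np.cumsum(effects)])    # accumulate across bins
    return edges, ale - ale.mean()                 # center, as in ALE plots
```

For a purely linear predictor, the curve recovers the feature's slope, which is a quick sanity check for the estimator.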

Data and code availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Godfray, H. C. J. et al. Food security: The challenge of feeding 9 billion people. Science 327, 812–818 (2010).
2. Schlenker, W. & Lobell, D. B. Robust negative impacts of climate change on African agriculture. Environ. Res. Lett. 5, 014010 (2010).
3. Mourtzinis, S. et al. Climate-induced reduction in US-wide soybean yields underpinned by region- and in-season specific responses. Nat. Plants 1, 14026 (2015).
4. Hoffman, L. A., Kemanian, A. R. & Forest, C. E. The response of maize, sorghum, and soybean yield to growing-phase climate revealed with machine learning. Environ. Res. Lett. 15, 094013 (2020).
5. Folberth, C. et al. Uncertainty in soil data can outweigh climate impact signals in global crop yield simulations. Nat. Commun. https://doi.org/10.1038/ncomms11872 (2016).
6. Makinen, H., Kaseva, J., Virkajarvi, P. & Kahiluoto, H. Shifts in soil–climate combination deserve attention. Agric. For. Meteorol. 234, 236–246 (2017).
7. Rattalino Edreira, J. I. et al. Assessing causes of yield gaps in agricultural areas with diversity in climate and soils. Agric. For. Meteorol. 247, 170–180 (2017).
8. Mourtzinis, S. et al. Sifting and winnowing: Analysis of farmer field data for soybean in the US North-Central region. Field Crops Res. 221, 130–141 (2018).
9. Pradhan, P., Lüdeke, M. K. B., Reusser, D. E. & Kropp, J. P. Food self-sufficiency across scales: How local can we go?. Environ. Sci. Technol. 48, 9463–9470 (2014).
10. Frieler, K. et al. Understanding the weather signal in national crop-yield variability. Earths Fut. 5, 605–616 (2017).
11. Puntel, L. A. et al. Modeling long-term corn yield response to nitrogen rate and crop rotation. Front. Plant Sci. 7, 1630 (2016).
12. Rong, J. et al. Exploring management strategies to improve maize yield and nitrogen use efficiency in northeast China using the DNDC and DSSAT models. Comput. Electron. Agric. 166, 104988 (2019).
13. Leng, G. & Hall, J. W. Predicting spatial and temporal variability in crop yields: An inter-comparison of machine learning, regression and process-based models. Environ. Res. Lett. 15, 044027 (2020).
14. Apley, D. W. & Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. arXiv:1612.08468v2 (2016).
15. Swanson, S. P. & Wilhelm, W. W. Planting date and residue rate effects on growth, partitioning, and yield of corn. Agron. J. 88, 205–210 (1996).
16. Wiatrak, P. J. & Wright, D. Corn hybrids for late planting in the Southeast. Agron. J. 96, 1118–1124 (2004).
17. Bruns, H. A. & Abbas, H. K. Planting date effects on Bt and non-Bt corn in the mid-south USA. Agron. J. 98, 100–106 (2006).
18. Long, N. V., Assefa, Y., Schwalbert, R. & Ciampitti, I. A. Maize yield and planting date relationship: A synthesis-analysis for US high-yielding contest-winner and field research data. Front. Plant Sci. 8, 2106 (2017).
19. Mourtzinis, S., Specht, J. E. & Conley, S. P. Defining optimal soybean sowing dates across the US. Sci. Rep. 9, 2800 (2019).
20. Assefa, Y. et al. Yield responses to planting density for US modern corn hybrids: A synthesis-analysis. Crop Sci. 56, 2802–2817 (2016).
21. Light, M. A., Lenssen, A. W. & Elmore, R. W. Corn (Zea mays L.) seeding rate optimization in Iowa, USA. Precis. Agric. 18, 452–469 (2016).
22. Gaspar, A. et al. Defining optimal soybean seeding rates and associated risk across North America. Agron. J. 1–12 (2020).
23. Andrade, J. et al. Assessing the influence of row spacing on soybean yield using experimental and producer survey data. Field Crops Res. 230, 98–106 (2019).
24. Lobell, D. B. et al. The critical role of extreme heat for maize production in the United States. Nat. Clim. Change 3, 497–501 (2013).
25. Lobell, D. B. & Field, C. B. Global scale climate–crop yield relationships and the impacts of recent warming. Environ. Res. Lett. 2, 014002 (2007).
26. Schlenker, W. & Roberts, M. J. Nonlinear temperature effects indicate severe damages to US crop yields under climate change. Proc. Natl. Acad. Sci. 106, 15594–15598 (2009).
27. Hochman, Z. et al. Re-inventing model-based decision support with Australian dryland farmers. 4. Yield Prophet(R) helps farmers monitor and manage crops in a variable climate. Crop Pasture Sci. 60, 1057–1070 (2009).
28. Sadras, V. O. & Denison, R. F. Neither crop genetics nor crop management can be optimized. Field Crops Res. 189, 75–83 (2016).
29. Khaki, S. & Wang, L. Crop yield prediction using deep neural networks. Front. Plant Sci. 10, 621 (2019).
30. Khaki, S., Wang, L. & Archontoulis, S. V. A CNN-RNN framework for crop yield prediction. Front. Plant Sci. 10, 1750 (2020).
31. Websites for each state-specific university variety trial can be found in Table S10 in supplementary material.
32. Thornton, P. E. et al. Daymet: Daily surface weather data on a 1-km grid for North America, Version 3. ORNL DAAC, Oak Ridge, Tennessee, USA. https://doi.org/10.3334/ORNLDAAC/1328 (2016).
33. Mourtzinis, S., Rattalino Edreira, J. I., Conley, S. P. & Grassini, P. From grid to field: Assessing quality of gridded weather data for agricultural applications. Eur. J. Agron. 82, 163–172 (2017).
34. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. arXiv:1603.02754v3 (2016).


Acknowledgements

The authors thank Adam Roth and multiple students for their help in database development and John Gaska for constructing Fig. 1 . This research was funded in part by the Wisconsin Soybean Marketing Board, The North Central Soybean Research Program (S.P. Conley), and the USDA National Institute of Food and Federal Appropriations under Project PEN04660 and Accession number 1016474 (P.D. Esker).

Author information

Authors and Affiliations

Agstat Consulting, Athens, Greece

Spyridon Mourtzinis

Department of Plant Pathology and Environmental Microbiology, Pennsylvania State University, State College, PA, 16801, USA

Paul D. Esker

Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, 68583-0915, USA

James E. Specht

Department of Agronomy, University of Wisconsin-Madison, Madison, WI, 53706, USA

Shawn P. Conley


Contributions

S.M. conceived the idea, analyzed the data, and wrote the paper. P.D.E and J.E.S. contributed to idea development, reviewed results, and provided revisions for improvement of the manuscript. S.P.C. contributed to the data set and idea development, reviewed results, and commented on the manuscript.

Corresponding author

Correspondence to Spyridon Mourtzinis .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Mourtzinis, S., Esker, P.D., Specht, J.E. et al. Advancing agricultural research using machine learning algorithms. Sci Rep 11 , 17879 (2021). https://doi.org/10.1038/s41598-021-97380-7


Received : 17 January 2021

Accepted : 25 August 2021

Published : 09 September 2021

DOI : https://doi.org/10.1038/s41598-021-97380-7


This article is cited by

  • Attri, I., Awasthi, L. K. & Sharma, T. P. Machine learning in agriculture: A review of crop management applications. Multimedia Tools and Applications (2023).
  • Gholamrezaei, A. & Kheiri, K. A robust and resilience machine learning for forecasting agri-food production. Scientific Reports (2022).


Quantum Physics

Title: Systematic Literature Review: Quantum Machine Learning and Its Applications

Abstract: Quantum computing is the process of performing calculations using quantum mechanics. This field studies the quantum behavior of certain subatomic particles for subsequent use in performing calculations, as well as for large-scale information processing. These capabilities can give quantum computers an advantage in terms of computational time and cost over classical computers. Nowadays, there are scientific challenges that are impossible to address with classical computation due to computational complexity or the time the calculation would take, and quantum computation is one of the possible answers. However, current quantum devices do not yet have the necessary qubits and are not fault-tolerant enough to achieve these goals. Nonetheless, there are other fields, like machine learning or chemistry, where quantum computation could be useful with current quantum devices. This manuscript aims to present a Systematic Literature Review of the papers published between 2017 and 2023 to identify, analyze, and classify the different algorithms used in quantum machine learning and their applications. Consequently, this study identified 94 articles that used quantum machine learning techniques and algorithms. The main types of algorithms found are quantum implementations of classical machine learning algorithms, such as support vector machines or the k-nearest neighbor model, and of classical deep learning algorithms, like quantum neural networks. Many articles try to solve problems currently addressed by classical machine learning, but using quantum devices and algorithms. Even though results are promising, quantum machine learning is far from achieving its full potential. Improvements in quantum hardware are required, since existing quantum computers lack the quality, speed, and scale needed for quantum computing to achieve its full potential.


Predicting building energy consumption in urban neighborhoods using machine learning algorithms

  • Research article
  • Open access
  • Published: 16 February 2024
  • Volume 2, article number 6 (2024)


  • Qingrui Jiang 1 , 2 ,
  • Chenyu Huang 1 ,
  • Zhiqiang Wu 1 , 2 ,
  • Jiawei Yao 1 ,
  • Jinyu Wang 1 ,
  • Xiaochang Liu 1 &
  • Renlu Qiao 1 , 2  


Assessing building energy consumption in urban neighborhoods at the early stages of urban planning assists decision-makers in developing detailed urban renewal plans and sustainable development strategies. At the city level, the use of physical simulation-based urban building energy modeling (UBEM) is too costly, and data-driven approaches are often hampered by a lack of available building energy monitoring data. This paper combines a simulation-based approach with a data-driven approach, using UBEM to provide a dataset for machine learning and deploying the trained model for large-scale urban building energy consumption prediction. Firstly, we collected 18,789 neighborhoods containing 248,938 buildings in the Shanghai central area, of which 2,702 neighborhoods were used for UBEM. Simultaneously, building functions were defined by POI data and land use data. We used 14 impact factors related to land use and building morphology to define each neighborhood. Next, we compared the performance of six ensemble learning methods in modeling the impact factors against building energy consumption and used SHAP to explain the best model; we also filtered out the features that contributed most to the model output to reduce model complexity. Finally, the balanced regressor that had the best prediction accuracy with the minimum number of features was used to predict the remaining urban neighborhoods in the Shanghai central area. The results show that XGBoost achieves the best performance. The balanced regressor, constructed with the 9 most contributing features, predicted building rooftop photovoltaic potential, total load, cooling load, and heating load with test-set accuracies of 0.956, 0.674, 0.608, and 0.762, respectively. Our method offers an 85.5% time advantage over traditional methods, with a maximum error of only 22.75%.


1 Introduction

During urbanization, we face the pressing challenge of climate change. To meet the goals of the Paris Agreement, the scientific community has united in efforts to limit global warming to 1.5 degrees Celsius (Mishra et al., 2022; Morfeldt & Johansson, 2022; Slameršak et al., 2022). The building and construction sector stands as one of the largest consumers of energy in the world, accounting for 25–40% of global CO2 emissions (Pomponi & Moncaster, 2017). In China, the construction sector is among the top three energy-consuming sectors, representing about 21.9% of carbon emissions from energy-related sectors (You et al., 2023). China has set forth ambitious carbon neutrality goals, aiming for carbon peaking by 2030 and carbon neutrality by 2060. Thus, curbing carbon emissions from the building sector is critical to China's strategy. Recent studies have raised concerns regarding the carbon neutrality of China's construction sector (Camarasa et al., 2022), suggesting that the construction sector might need more intensive efforts to align with carbon neutrality compared to others.

To achieve the goal of decarbonizing the building sector, a range of strategies are essential, including the construction of zero-carbon buildings, retrofitting existing energy-intensive buildings, developing new low-carbon building technologies, and promoting renewable energy sources in cities. Conducting carbon emission assessments at the decision-making stage of urban planning and urban regeneration is crucial (Dahlström et al., 2022; Heidelberger & Rakha, 2022). This approach aids decision-makers in developing detailed urban renewal plans and sustainable development strategies. However, current city-level assessments of building energy consumption present challenges. On the one hand, the top-down approach, which depends on monitoring and statistics (Abbasabadi & Ashayeri, 2019; Wu et al., 2022), often lacks the necessary data in smaller cities or cities with insufficient economic development. On the other hand, the resources and time required for assessments that rely on bottom-up approaches with energy consumption simulation engines often prove prohibitive in the early stages of planning (W. Wang et al., 2021). While scholars have recently turned to artificial intelligence and machine learning to predict building energy consumption, most studies focus on predicting the dynamic loads of individual buildings (L. Zhang et al., 2021). However, for planning designers and policymakers, energy use intensity, rather than dynamic loads, is the primary evaluation indicator. Moreover, these models often demand detailed inputs to ensure prediction accuracy (Parhizkar et al., 2021), posing challenges for data collection in the early stages of urban planning.

This paper employs a combination of a physical simulation engine and data-driven techniques to predict city-level building energy consumption in a bottom-up manner. We divided urban neighborhoods into simulation and prediction datasets, performed urban building energy simulations on a small sample of simulation datasets, and trained machine learning models. The trained models were then deployed to the prediction set to generalize to the full urban neighborhoods. Through interpretability analysis, we identified and retained the features that contribute most to the model output, thereby reducing the model's complexity. The organization of this paper is as follows: Sect. 2 presents the related work; Sect. 3 describes the main methods, including data collection, impact factor calculation, urban building energy simulation, and interpretable machine learning modeling; Sect. 5 provides a discussion on the findings; and conclusions are presented in Sect. 6 .

2 Related works

2.1 Methods for estimating energy consumption in urban buildings

The estimation methods for urban building energy consumption can be categorized into two main types: top-down and bottom-up approaches (Ma et al., 2017 ; Reinhart & Cerezo Davila, 2016 ). The top-down approach focuses less on the energy use of each end building but treats urban building energy consumption on a macro level. Top-down approaches rely on historical or statistical urban building energy consumption data, often correlating them to the level of economic, demographic, and technological development. This approach equips urban decision-makers with long-term or cross-city energy knowledge (Gan et al., 2022 ; Huo et al., 2022 ; Sun et al., 2022 ). Though top-down methods can provide a rapid assessment of large-scale building energy demand, they are ineffective in cities and regions that lack data. Furthermore, this approach often employs the grid as the smallest cell (Shi et al., 2019 ; J. Wang et al., 2022a , 2022b , 2022c ; Y. Zhang et al., 2022a , 2022b ), resulting in a misalignment between study outcomes and policy implementation boundaries.

In contrast to top-down approaches, bottom-up approaches emphasize the energy use of individual building or building complexes and can be categorized into physical simulation-based approaches and data-driven approaches. The physical simulation-based approach has a long-standing history. For a single building entity, building energy simulation models the thermodynamic energy processes by abstracting the building geometry into a network of connected nodes. Heat balance equations are then formulated and solved for each node, based on the provided non-geometric building parameters (Nutkiewicz et al., 2021 ). However, accurately modeling the energy consumption of individual buildings is resource-intensive and time-consuming due to the extensive number of nodes and associated equations. With the advent of artificial intelligence techniques, data-driven approaches have become a focal point in building energy consumption research (Bourdeau et al., 2019 ). Machine learning and deep learning techniques discern hidden patterns from vast energy consumption datasets, creating a predictive 'black box' for building energy consumption. This method significantly streamlines the building energy consumption assessment process. Yet, much of the current research is centered on the building O&M process, notably optimizing HVAC systems by predicting dynamic building loads (Ahmad et al., 2016 ; Zhu et al., 2022 ). Some studies have highlighted the use of data-driven methods for predicting building energy use intensity early in the design phase (M. Wang et al., 2022a , 2022b , 2022c ). However, the limited data on energy use intensity of buildings, compared to time series data, poses challenges in training robust and generalizable models (Fan et al., 2022 ).

2.2 Urban building energy consumption modeling

The urban neighborhood serves as the basic unit of urban planning (H. Zhang et al., 2022a, 2022b), and conducting building energy simulations for these neighborhoods offers valuable insights for urban planning and architectural design. While urban building energy simulation is gaining traction, it remains a nascent field. Urban building energy modeling encompasses the computational modeling and simulation of a group of buildings within an urban context. This approach accounts for not only the dynamics of individual buildings but, more crucially, the interactions between them (Buckley et al., 2021; T. Hong et al., 2020a, 2020b). Zhou et al. (2022) utilized UBEM to simulate the energy use intensity of 9,000 residential buildings in Dublin, aiming to support the energy renovation process in the European housing sector. However, the heightened physical complexity of building energy simulations at this scale, compared to modeling individual buildings, renders the calculations notably less efficient.

Distinct from the energy simulation of individual buildings, UBEM presents the challenge of sourcing input data. Securing accurate and comprehensive input parameters, such as geometric parameters (building geometry, window-to-wall ratio, number of floors, etc.) and non-geometric parameters (energy use patterns and HVAC systems), often proves difficult (C. Wang et al., 2022a, 2022b, 2022c). However, leveraging data collection methods from other disciplines offers potential solutions. Mapping platforms, notably OpenStreetMap, can supply the building footprint data essential for UBEM (Chen & Hong, 2018; Schiefelbein et al., 2019). Cell phone data help characterize building occupancy (Barbour et al., 2019; Pang et al., 2018), a key determinant of energy use. Given the challenges in accessing cell phone data, point-of-interest data can also serve to identify building functions, subsequently informing UBEM about building usage (C. Wang et al., n.d., 2020). These data sourcing strategies have led to UBEM often being integrated with GIS (Ali et al., 2020a, 2020b; Groppi et al., 2018). Of particular note recently is the growing interest in urban distributed photovoltaic power generation. Scholars combine GIS with UBEM to evaluate the PV potential of buildings (Boccalatte et al., 2022; Montealegre et al., 2022), paving the way for sustainable urban development.

2.3 Data-driven building energy prediction

While we have highlighted the advancements in UBEM, the significant consumption of computational resources remains a major challenge, particularly for city-level energy consumption assessments, where UBEM becomes almost impractical. To address this, scholars have turned to data-driven approaches. Existing studies can be broadly grouped into two categories: the first employs data-driven tools for building energy consumption and built environment assessment, aiming to expedite the design process of sustainable urban neighborhoods (Huang et al., 2022; Nutkiewicz et al., 2018; W. Wang et al., 2021); the second utilizes data-driven methods to identify building energy consumption across expansive urban neighborhoods, offering insights for energy retrofitting (Ali et al., 2020a, 2020b; Ye et al., 2021). The research presented in this paper aligns with the latter category.

Similar to UBEM based on physical simulation, early data-driven approaches often necessitated a plethora of input parameters to ensure model prediction accuracy. However, recent advancements have seen interpretable analysis employed to identify the features with the greatest impact on building energy use, thereby reducing the required feature inputs for data-driven models (Seo et al., 2022; L. Zhang, 2021). Moreover, the adoption of interpretable methods has proven to enhance the generalization capability of these models (Jin et al., 2022; Manfren et al., 2022). Research has consistently demonstrated building function and morphology as pivotal factors influencing building energy consumption (Abbasabadi et al., 2019). Quan and Li (2021) proposed a multi-scale data-driven energy use modeling framework, comparing various machine learning algorithms and emphasizing the pronounced impact of building size and height on building EUI. In our study, we utilized land use data sourced from POI to determine building energy use and employed building data to compute building morphology factors. Subsequently, we developed a machine learning model to predict building energy use in urban neighborhoods, integrating land use and building morphology as inputs. To streamline the feature set, we employed interpretable analysis to discern the most influential features for the model's predictive objective and devised the balanced regressor, which optimizes prediction accuracy while minimizing input features.

3.1 Research workflow

In this study, we leveraged data from UBEM simulation to train machine learning models, subsequently employing them to predict urban building energy consumption over an expansive area. The research process is segmented into four stages. First, we collected land use and building morphology data for modeling urban building energy consumption. Concurrently, we processed these datasets to derive 14 impactful characteristics that define urban neighborhoods. Next, we randomly selected urban neighborhoods within our study domain, subjecting the sampled data to simulations for both urban building energy consumption and rooftop photovoltaic power generation. In the third stage, we partitioned the simulated samples into training and test sets, executed machine learning modeling, and evaluated the performance of various machine learning models. We then employed the SHAP value to interpret the optimal model. Lastly, we applied the trained machine learning models to predict building energy consumption for the unsimulated urban neighborhoods within our study domain. We also contrasted the data distributions of the impact factors between the simulated and predicted samples. The study's workflow is depicted in Fig. 1 .

figure 1

Research workflow

3.2 Data collection

Shanghai (120°52′ E-122°12′ E, 30°40′ N-31°53′ N) is located on the west coast of the Pacific Ocean and has a subtropical monsoon climate with abundant light and rainfall (H. Zhang et al., 2022a , 2022b ). Recognized as one of the world's preeminent mega-cities, Shanghai stands as a beacon of urbanization in China (Cao et al., 2021 ). This rapid urbanization has resulted in a pronounced heat island effect (Yang et al., 2022 ), subsequently driving up building energy consumption (Y. Hong et al., 2020a , 2020b ). The central area of Shanghai comprises Huangpu, Hongkou, Jing'an, Xuhui, Changning, Yangpu, and Putuo districts. This region is renowned for its thriving economy, showcasing top-tier commercial, entertainment, and culinary establishments, alongside state-of-the-art infrastructure. Moreover, the central area presents a diverse architectural timeline, with a notable disparity in building ages. Many of its older structures necessitate heightened energy consumption to sustain a comfortable indoor climate. Given its high energy consumption profile and architectural diversity, we opted for the central area of Shanghai as our study domain (see Fig. 2 ).

figure 2

Study area a China, b Shanghai, c Shanghai central area

The land use data (LU) for this study were sourced from point-of-interest (POI) data and calculations based on urban planning unit data. The POI data, provided by OpenStreetMap (data source: https://www.openstreetmap.org/ ), were categorized into 18 types (e.g., restaurants, shopping malls, schools, etc.) leveraging the semantic phrases they contained. These categories were then converted into proportions representing building functions. The urban planning unit data offered geometric boundaries as well as land use types for each urban neighborhood (data source: https://www.shanghai.gov.cn/nw42806/ ). Building morphology data (BM) were derived from calculations based on urban building data (data source: https://lbsyun.baidu.com/ ).

We screened the urban neighborhoods in the Shanghai central area. The criteria for this screening were: 1) excluding lands devoid of buildings, such as landscapes, water bodies, and open spaces, and 2) eliminating lands with an area smaller than 12,000 m², falling in the lower quartile. This screening aimed to minimize errors in urban building energy simulation and enhance the stability of the machine learning model. Following this process, we secured a total of 18,689 samples. These samples encompassed only five land use categories: LU-1: urban residential land; LU-2: industrial and mining storage land; LU-3: public infrastructure land; LU-4: public building land; and LU-5: rural settlement land. We partitioned the complete sample set into simulated and predicted datasets (see Table 1 ). The simulated dataset served the dual purpose of urban building energy consumption modeling and machine learning modeling. The models, once trained, were then applied to the prediction dataset, enabling us to determine the urban building energy consumption across the entire sample in the Shanghai central area.

3.3 Calculation of impact factors

We selected fourteen impact factors as key characteristics for predicting urban building energy consumption. These data were classified into two categories: LU and BM. LU includes 7 factors: land use type (LUT), the proportion of restaurant buildings (REST), the proportion of medical buildings (HOSP), the proportion of educational buildings (SCH), the proportion of commercial buildings (MALL), the proportion of residential buildings (RES), and the proportion of office buildings (OFC). BM includes 7 factors: Site Area (SA), Number of Buildings (NoB), Building Coverage Ratio (BCR), Floor Area Ratio (FAR), Average Building Height (HAVE), Building Height Standard Deviation (HSTD), and Building Shape Coefficient (BSC) (see Fig. 3 ).

figure 3

The variables in this study, a land use type, b different functions in LUTs, c Site Area, d Building Coverage Ratio, e Number of Buildings, f Floor Area Ratio, g Average Building Height, h Building Height Standard Deviation, i Building Shape Coefficient

Among the 7 impact factors of LU, LUT was directly sourced from urban planning unit data, while the remaining impact factors were derived from POI calculations. The approach involved counting the number of POIs within each urban neighborhood using a spatial join in GIS. Subsequently, we determined the proportion of different POI classifications, which were then translated into the proportion of building functions. The 7 impact factors of BM were calculated using GIS; the calculation formulas are presented in Table 4 in Appendix .

To facilitate visualization, we introduced a cofactor: the functional mix degree (FMD). This cofactor utilizes Shannon's information entropy to represent the degree of mixing of different building functions in a plot. It is formulated as follows (Eq. ( 1 )):

\(\mathrm{FMD}=-\sum_{i=1}^{n}{P}_{i}\,\mathrm{ln}\,{P}_{i}\)

where n is the number of building functions within an urban neighborhood and \({P}_{i}\) is the proportion of the i th function within the urban neighborhood.
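As a minimal illustration, the FMD can be computed directly from the function proportions. This sketch assumes the standard Shannon-entropy form with the natural logarithm (the log base is an assumption, since the text does not state it):

```python
import math

def functional_mix_degree(proportions):
    """Shannon entropy of building-function proportions.

    Functions with zero proportion contribute nothing to the sum,
    following the convention 0 * ln(0) = 0."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

# A neighborhood split evenly between two functions has FMD = ln 2
fmd_mixed = functional_mix_degree([0.5, 0.5])
# A single-function neighborhood has FMD = 0
fmd_pure = functional_mix_degree([1.0])
```

Higher FMD values thus indicate a more even mix of building functions within a neighborhood.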

3.4 Urban building energy simulation

UBEM is a bottom-up, physics-based approach that calculates building energy consumption, accounting for the energy used for heating, air conditioning, ventilation, lighting, and equipment, as well as heat transfer through the envelope during building operation. We utilized the Dragonfly plug-in of the Rhino/Grasshopper platform for urban building energy simulation, which requires around 3,000 parameters for a single simulation. Dragonfly simplifies EnergyPlus model input by pre-setting many parameters using ASHRAE's standard values. Additionally, Dragonfly offers a visual programming interface for urban building energy simulations. In this study, Dragonfly was used to batch-call EnergyPlus, enabling us to model urban building energy consumption for all samples of the simulation dataset.

The function of a building not only determines its configuration but also its energy use. In this research, we determined building functions in each urban neighborhood using six categories of building function proportions (REST, HOSP, SCH, MALL, RES and OFC) derived from POI data. For each simulated neighborhood, building functions were randomly assigned based on their respective proportions. The urban building energy modeling process accounted for the impact of building shading on energy consumption within a 50 m radius. To expedite the simulation process, each unique room (or standard floor) was simulated once, and the results were then aggregated. The simulation spanned a full year with an hourly time step. The weather data was sourced from epw files specific to Shanghai (data source: https://www.ladybug.tools/epwmap/ ). The final UBEM outputs comprised the total load (TL), cooling load (CL), and heating load (HL) for each neighborhood. Detailed parameter settings pertaining to the building functions can be found in Table 5 in Appendix .
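The random assignment of building functions by proportion can be sketched as follows; the function shares and neighborhood size are hypothetical, and `numpy.random.Generator.choice` stands in for whatever sampler the actual workflow uses:

```python
import numpy as np

rng = np.random.default_rng(42)

def assign_functions(n_buildings, proportions):
    """Draw a function label for each building in a neighborhood
    according to POI-derived proportions (expected to sum to ~1)."""
    labels = list(proportions.keys())
    p = np.array([proportions[k] for k in labels], dtype=float)
    p /= p.sum()  # guard against small rounding drift in the proportions
    return list(rng.choice(labels, size=n_buildings, p=p))

# Hypothetical neighborhood: 60% residential, 30% office, 10% restaurant
funcs = assign_functions(10, {"RES": 0.6, "OFC": 0.3, "REST": 0.1})
```

Each simulated neighborhood then receives a concrete function label per building, consistent in expectation with the POI-derived shares.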

Furthermore, this study also simulated the solar power potential of building rooftops, given the rapid development of distributed photovoltaics in Shanghai. The yearly solar irradiance received by building rooftops was calculated using the Ladybug plug-in for Rhino/Grasshopper. Ladybug employs RADIANCE to run global and diffuse radiation simulations and is widely validated for its accuracy and efficiency in solar irradiance simulation studies (Li et al., 2022). The weather file was the epw file for Shanghai, and the calculation accounted for shading from surrounding buildings with an accuracy of 1 m. The results of the irradiance calculation were multiplied by the attenuation coefficient of the PV panels to derive the solar power potential (RPV) of the building roofs. In this work, the attenuation coefficient was set to 0.2.

The entire process of urban building energy simulation is illustrated in Fig. 4 . The outputs of the urban building energy simulation comprise RPV, TL, CL, and HL. The 14 impact factors were combined with the simulation outputs to create the dataset. A Pearson correlation analysis explored the relationship between the impact factors and the simulation outputs. Subsequently, min-max normalization was applied to the dataset in preparation for machine learning model training.
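The correlation and normalization steps might look like the following pandas sketch; the column names and values are illustrative stand-ins, not the paper's data:

```python
import pandas as pd

# Illustrative rows: two impact factors and one simulation output
df = pd.DataFrame({
    "SA":   [12000.0, 30000.0, 45000.0, 60000.0],  # site area
    "REST": [0.0, 0.1, 0.3, 0.5],                  # restaurant share
    "TL":   [150.0, 260.0, 420.0, 610.0],          # total load
})

# Pearson correlation of every column with the simulation output
pcc = df.corr(method="pearson")["TL"]

# Min-max normalization to [0, 1] before model training
normed = (df - df.min()) / (df.max() - df.min())
```

Scaling all inputs to a common [0, 1] range keeps features with large magnitudes (such as site area) from dominating distance- or gradient-based learners.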

figure 4

Urban building energy simulation process, a random setting according to the proportion of building functions, b generating EnergyPlus model, c calculating of building load, d calculating of building rooftop PV potential

3.5 Explainable machine learning modeling

3.5.1 Ensemble learning method

We employed machine learning to model the nonlinear relationship between the 14 impact factors and the outputs of urban building energy modeling. Ensemble learning is a machine learning paradigm in which multiple weak learners are combined to achieve better predictive performance than could be obtained from any of the constituent learners alone. The efficacy of ensemble learning methods in predicting building energy consumption has been well established. In this study, we focused on two prominent ensemble learning methods: the Bagging method and the Boosting method (see Fig. 5 ). The Bagging method trains weak learners in parallel using subsets of the data and aggregates their predictions through a deterministic averaging process. In contrast, the Boosting method trains weak learners sequentially. During this process, Boosting iteratively fits a weak learner, incorporates it into the ensemble model, and then "updates" the training dataset to emphasize the strengths and weaknesses of the current ensemble model when fitting the subsequent base model. While the primary objective of the Bagging approach is to produce an ensemble model with reduced variance (enhancing stability), the Boosting approach aims to yield a model with diminished bias (increasing accuracy).

figure 5

Ensemble learning method, a Bagging method, b Boosting method

In this paper, we evaluated six ensemble models. For the Bagging method, we considered Bagging Regression, Extra Tree, and Random Forest; for the Boosting method, we looked at Gradient Boosting, AdaBoost, and XGBoost. We adopted the hold-out method to partition the simulation dataset, allocating 70% for training and 30% for testing. Training was conducted using the Scikit-learn machine learning library. Model performance was assessed using the coefficient of determination (R²) and mean squared error (MSE). The selection of the optimal model was based on both model comparison and hyperparameter optimization.
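A sketch of such a comparison loop using Scikit-learn's built-in ensembles follows; synthetic data stands in for the 14-feature simulation dataset, and XGBoost is omitted because it lives in the separate `xgboost` package:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 14-feature simulation dataset
X, y = make_regression(n_samples=500, n_features=14, noise=10.0, random_state=0)
# Hold-out split: 70% training, 30% testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Bagging": BaggingRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = {"R2": r2_score(y_te, pred),
                    "MSE": mean_squared_error(y_te, pred)}
```

In practice each candidate would also undergo hyperparameter optimization (e.g. with `GridSearchCV`) before the final comparison.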

3.5.2 Model explanation method

We used the SHAP (Shapley Additive exPlanations) library to explain the performance of the best-trained model, aiming to discern the contribution of 14 features to the model. SHAP is a model interpretation method developed from cooperative game theory, which calculates the marginal contribution of features to the model output by computing the Shapley value. SHAP constructs an additive interpretation model where all features are treated as "contributors". For each prediction sample, the model produces a prediction value, and the Shapley value is the value assigned to each feature in that sample, representing the feature contribution or feature importance. We used SHAP to determine a feature importance ranking for the best model.
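To make the Shapley value concrete, here is a brute-force computation over feature coalitions for a toy model. This is our own illustrative sketch of the underlying game-theoretic quantity, not the `shap` library's API; the library computes the same values far more efficiently for tree ensembles:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background, n_features):
    """Exact Shapley values by enumerating feature coalitions.

    v(S) evaluates the model with features in S taken from the sample x
    and the rest replaced by a background vector (e.g. the dataset mean)."""
    def v(S):
        z = background.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n_features - k - 1) / factorial(n_features)
                phi[i] += weight * (v(S + (i,)) - v(S))
    return phi

# Toy linear model f(x) = w.x with a zero background: the Shapley
# value of feature i reduces to w_i * x_i
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, 2.0])
phi = shapley_values(lambda z: float(w @ z), x, np.zeros(3), 3)
```

Averaging the absolute values of these per-sample contributions over a dataset yields the kind of global feature-importance ranking reported later in the paper.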

3.6 Model generalization

To improve the generalizability of the machine learning models, we aimed to simplify the model inputs while preserving prediction accuracy. Reducing the number of feature inputs simplifies the model and also reduces the complexity of data collection. However, a decrease in the number of features may compromise the model's accuracy. Thus, we sought to develop a model that strikes a balance between the number of features and model accuracy, which we termed "The Balanced Regressor".

Based on the ranking of the feature contributions, we identified the most influential impact factors on RPV, TL, CL, and HL. We then examined the effect of varying the number of feature inputs on the model accuracy. We established five different feature selection methods: 1) 14 features: using all features for training; 2) 13 features: using all features except for LUT for training; 3) 9 features: selecting the top 9 features for RPV, TL, CL, and HL respectively for training; 4) 5 features: selecting the top 5 features for RPV, TL, CL and HL respectively for training; 5) 3 features: selecting the top 3 features for RPV, TL, CL, and HL, respectively. Finally, the balanced regressor, which offers the best accuracy, was employed to predict the building energy consumption in the remaining urban neighborhoods (prediction dataset) in the Shanghai central area.
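The selection loop can be sketched as follows; Gradient Boosting's impurity-based importances stand in for the paper's SHAP ranking, and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=14, n_informative=5,
                       noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Rank features once on the full model (stand-in for the SHAP ranking)
full = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)
order = np.argsort(full.feature_importances_)[::-1]

# Retrain on progressively smaller top-k feature subsets
results = {}
for k in (14, 9, 5, 3):
    cols = order[:k]
    model = GradientBoostingRegressor(random_state=1).fit(X_tr[:, cols], y_tr)
    results[k] = r2_score(y_te, model.predict(X_te[:, cols]))

# The balanced regressor is the feature count with the best test R²
best_k = max(results, key=results.get)
```

The same loop would be run once per prediction target (RPV, TL, CL, HL), since each target has its own feature ranking.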

To ensure reliable model generalization, it is essential to test whether the input features of the training data and the generalized data share the same distribution. Only when the feature distributions of the two sets align, or are closely similar, can the machine learning model be reliably deployed.
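One way to formalize this check is a two-sample Kolmogorov–Smirnov test per feature. The paper compares the distributions visually; the KS test below is our suggested stand-in, with hypothetical samples of a single feature:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical samples of one input feature (say, FAR) drawn from
# the training set and the generalization set
train_far = rng.normal(loc=2.0, scale=0.5, size=1000)
pred_far = rng.normal(loc=2.0, scale=0.5, size=5000)

stat, p_value = ks_2samp(train_far, pred_far)
# A large p-value fails to reject equality of the two distributions,
# supporting deployment of the trained model on the generalization set
distributions_align = p_value > 0.05
```

Running this per feature gives a quantitative complement to the visual comparison of training and generalization sets.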

4 Results and discussion

4.1 Description of simulation results

As depicted in Fig. 6 , the simulation results for RPV, TL, CL, and HL are presented. The overall distribution of RPV, TL, CL, and HL appears balanced across different LUTs. This suggests the feasibility of employing machine learning models with consistent weights to predict across various LUTs. The RPV for the majority of urban neighborhoods falls between 0 and 10,000,000 kWh, TL ranges from 0 to 800 kWh/m², CL from 0 to 200 kWh/m², and HL from 0 to 100 kWh/m². The most pronounced variation in HL is observed across different LUTs, as illustrated in Fig. 6 (d). The median HL of LU-1 and LU-5, predominantly residential buildings, is considerably higher than that of LU-2 to LU-4, which are primarily public buildings. This discrepancy can be attributed to Shanghai's climate, characterized by hot summers and cold winters. The city lacks centralized heating during winter, yet most residential buildings employ heating equipment, leading to a surge in HL.

figure 6

UBEM output results of urban neighborhoods on different LUTs, a RPV on different LUTs, b TL on different LUTs, c CL on different LUTs, and d HL on different LUTs

Figure 7 illustrates the Pearson correlations for the variables within the simulated dataset. The correlation matrix for the complete dataset reveals a pronounced correlation between RPV and SA (0.75) and NoB (0.73) (see Fig. 7 (a)). There exists a notable positive correlation of TL (0.66), CL (0.70), and HL (0.43) with REST. This indicates that the prevalence of restaurants significantly influences the building load within urban neighborhoods. This observation is further corroborated by Table 5 in Appendix , which indicates that restaurants have greater equipment power, gas power, and hot water usage compared to other building functions. Both RES and OFC exhibit significant negative correlations with TL, CL, and HL, suggesting that neighborhoods with more residential and office buildings tend to have lower total energy consumption. BSC displays a notable correlation with HL (0.38), aligning with conclusions from previous research. Moreover, the observed covariance between certain impact factors suggests potential feature redundancy.

figure 7

Pearson Correlation Coefficient (PCC) between UBEM output and impact factors, a PCC matrix of total simulated data, b PCC matrix of simulated data on LU-1, c PCC matrix of simulated data on LU-2, d PCC matrix of simulated data on LU-3, e PCC matrix of simulated data on LU-4, f PCC matrix of simulated data on LU-5

Figure 7 (b-f) shows the Pearson coefficients of the variables in the sub-datasets corresponding to different LUTs. Generally, they mirror the patterns seen in Fig. 7 (a), albeit with variations in correlation strength. In the correlation matrices for LU-1 and LU-5 (see Fig. 7 (b, f)), RES demonstrates negative correlations with TL, CL, and HL. However, this trend diminishes in the correlation matrices for LU-2 to LU-4 (see Fig. 7 (c-e)). In contrast, OFC exhibits significant correlations with TL, CL, and HL solely in LU-4.

figure 8

Regression performance of XGBoost on different LUTs, a trained model of predicting RPV, b trained model of predicting TL, c trained model of predicting CL, d trained model of predicting HL

While the correlation matrix offers insights into the significance of the impact factors, the correlation between the majority of these factors and the UBEM simulation output is negligible. This might be attributed to the Pearson method's inability to capture the nonlinear interactions inherent in real-world data combined with the physics-based simulation process. Consequently, there's a compelling need to further investigate the contribution of impact factors to the UBEM simulation output using interpretable machine learning.

4.2 Result of machine learning modeling

4.2.1 Performance of ensemble models

We evaluated six ensemble models, leading to a total of 24 machine learning training sessions for predicting RPV, TL, CL, and HL. The detailed results are presented in Tables 6, 7, 8, 9 in Appendix . A negative R² value indicates that the model's fit is worse than a simple mean model, highlighting its unsuitability for the given data. The results from the training and test sets were used jointly to evaluate the performance of the models. Overall, the three Boosting algorithms outperformed the Bagging method in this study, with XGBoost achieving the best performance in predicting the four simulated outputs. The R² of RPV is the highest, reaching 0.987 and 0.914 for the training and test sets, respectively (Table 6 in Appendix ), which is significantly higher than the prediction accuracy for TL, CL, and HL. This might be attributed to the low complexity of the RPV calculation, which solely involves irradiance calculation and is influenced only by BM. In contrast, energy consumption simulation involves highly complex physical models and is affected by both LU and BM. The R² of XGBoost on the test set for predicting TL (0.674) was slightly lower than for CL (0.685) and HL (0.749) (see Tables 6, 7, 8, 9 in Appendix ), potentially due to the increased uncertainty in components of TL other than CL and HL, such as equipment load. The impact of equipment use on neighborhood energy consumption has been discussed above, and this also highlights the necessity of predicting energy consumption by subsection.

As the correlation analysis reveals, the impact factors have varying effects on the four UBEM simulation outputs across different LUTs. We therefore trained the best-performing model, XGBoost, on the LU-1 to LU-5 sub-datasets. The model performance is given in Table 2 . The results indicate that the model achieves the best performance on the LU-1 sub-dataset. The R² of the training set for predicting RPV is 0.910 (Table 2 ), slightly lower than the R² of the training set on the full dataset (Table 6 in Appendix ). When trained on the LU-1 subset, the accuracy of the models predicting TL, CL, and HL markedly outperforms those trained on the complete dataset. Specifically, for TL, the test set R² is 0.726 on the LU-1 subset, as opposed to 0.674 on the full dataset; for CL, it is 0.751 on the LU-1 subset versus 0.685 on the full dataset; and for HL, it is 0.792 on the LU-1 subset compared to 0.749 on the full dataset. This suggests that RPV's prediction is more influenced by data volume and is less sensitive to LUT variations than TL, CL, and HL. Notably, TL, CL, and HL predictions exhibit pronounced accuracy disparities across different LUTs, with enhanced accuracy particularly in LU-1 and LU-5. This could be attributed to LU-1 and LU-5 being predominantly residential, leading to more consistent building energy consumption. In contrast, LU-2 to LU-4, which comprise more public buildings, display greater energy consumption variability across different building functions.

4.2.2 Explainable analysis of the best model

Figure 9 illustrates the global feature importance of the optimal model. Here, the global importance of each feature is determined by the average absolute value of that feature's SHAP value across all samples. Figure 9 (a-d) highlights the varying contributions of each feature to the XGBoost algorithm for different prediction objectives. For RPV, BM impact factors, including SA, BCR, NoB, play a pivotal role in prediction outcomes, whereas LU impact factors exert minimal influence on the model output (Fig. 9 (a)). For TL and CL, REST emerges as the most influential contributor to the model output (Fig. 9 (b-c)), aligning with the insights from the Pearson correlation matrix. The SHAP analysis further elucidates the BM influence on UBEM, contrasting with the correlation analysis. For instance, BSC significantly affects TL, CL, and HL (ranking in the top three), while HAVE notably influences both TL and HL (also ranking in the top three).

figure 9

Feature importance of best model, a  feature importance for prediction RPV, b  feature importance for prediction TL, c  feature importance for prediction CL, d  feature importance for prediction HL

Figure 10 displays the Shapley value for each feature across all samples, highlighting the significance of each feature and how the magnitude of the feature value influences the model. The feature ranking further underscores the contribution of these features to the model, often referred to as feature importance. Each dot represents a sample, with the color denoting the magnitude of the feature value: red signifies a higher feature value, while blue indicates a lower one. These color variations help elucidate how shifts in feature values impact the model's output. Moreover, broader regions signify a clustering of numerous samples.

figure 10

SHAP summary of the best model, a  SHAP summary of prediction RPV, b  SHAP summary of prediction TL, c  SHAP summary of prediction CL, d  SHAP summary of prediction HL

Regarding RPV (Fig. 10 (a)), samples with larger SA values (represented by red dots), exert a pronounced positive influence on the model's output. Conversely, when the SA value is minimal (blue dots), its impact on the model is relatively muted. Additionally, BCR values exhibit a balanced effect on the model: larger BCR values amplify the positive effect on the model's output, while smaller BCR values enhance its negative effect. For TL and CL (Fig. 10 (b-c)), samples with a higher REST value predominantly boost the model's output. Yet, a majority of samples possess modest REST values (cluster of blue dots), consistently exerting a negative influence on the model's output. For HL (Fig. 10 (d)), samples with a larger BSC value significantly influence the model's output, whereas those with a smaller BSC value have a diminished impact. Furthermore, samples with elevated HAVE values have a restrained positive effect on the model, while those with reduced HAVE values considerably dampen the model's output.

Drawing from the results of the best model's interpretive analysis, we identified and utilized the most contributive features for RPV, TL, CL, and HL. We then trained the XGBoost model, aiming to achieve the "balanced regressor": a model that maximizes prediction accuracy using the fewest features. Table 3 presents the R² for both the training and test sets. The results indicate that the XGBoost model, when trained using 9 features, delivers optimal performance and is thus designated as the "balanced regressor". Notably, it surpasses the accuracy of the best model (trained using 14 features) in predicting RPV, CL, and HL for the test set. This underscores the presence of redundant features in the initial dataset for various prediction objectives. Consequently, we employed the balanced regressor for model generalization.

4.3 Results of model generalization

4.3.1 Results of the same distribution test

For effective model generalization in machine learning, it's imperative to ensure that the input features of both the training data and the generalized data share the same distribution. We analyzed the data distributions of all input features of the balanced regressor, as derived in the previous section. In Fig. 11 , we compare the distributions of various indicators for both the training and generalized sets. The features REST, HOSP, SCH, MALL, RES, and OFC are aggregated into a single indicator represented as FMD, while the distribution comparisons for the remaining BM impact indicators are also presented. The results highlight only minor discrepancies in data distribution. The primary distinction lies in the volume of data; however, the domain of the training data fully encompasses the generalization set. This indicates that the model trained on the training set can be seamlessly applied to the generalization set.

figure 11

Same distribution test

4.4 Energy prediction results for Shanghai central area

We deployed the balanced regressor on the generalization set for prediction, aiming to swiftly estimate the spatial distribution of building energy consumption and PV generation potential in the Shanghai central area. Figure 12 shows all the input features of both the simulated dataset and the generalization set. Figure 13 presents the predictions for RPV, TL, CL, and HL in the Shanghai central area. The results offer valuable insights into urban decarbonization. For instance, distributed PV development projects in the city can be prioritized in the hotspots shown in Fig. 13 (a); the hotspots in Fig. 13 (b-d) identify energy-intensive urban neighborhoods that require immediate low-carbon retrofitting. The overlapping hotspots in Fig. 13 (a) and (b-d) suggest that a large amount of PV energy can be consumed locally, forming a foundation for the development of PV infrastructure, including energy storage stations. In the early stages of urban planning, projections of energy consumption for buildings in expansive urban neighborhoods can be visualized against local baselines of energy consumption or carbon emissions, ensuring the continued relevance of this methodology across different cities.

figure 12

Impact factors per urban neighborhoods for model generalization, a  LU, b  FMD, representing 6 building function ratios, c  SA, d  NoB, e  BCR, f  FAR, g  HAVE, h  HSTD, i  BSC

figure 13

Urban energy use per urban neighborhoods prediction of Shanghai central area, a  RPV prediction of Shanghai central area, b  TL prediction of Shanghai central area, c  CL prediction of Shanghai central area, d  HL prediction of Shanghai central area

5 Discussion

In this study, we introduce a method that integrates physics-based approaches with data-driven techniques to employ machine learning for predicting energy consumption across large-scale urban neighborhoods. Our proposed method offers a substantial time benefit compared to the traditional UBEM approach. In this study, simulating RPV, TL, CL, and HL for a single neighborhood takes approximately 5 min (using an Intel 13th-generation i9 and an RTX 2080). With a simulation database comprising 2,702 samples, the total time amounts to roughly 225 h, or 14,400 core hours (utilizing 64 cores). The time taken for model training and generalization is minimal. The simulated data account for 14.5% of all urban neighborhoods in the Shanghai central area, meaning the application of machine learning results in a time saving of 85.5%. Moreover, the interpretative outcomes enable the identification of the optimal prediction with the fewest features. The balanced regressor predicts RPV, TL, CL, and HL with test set accuracies of 0.956, 0.674, 0.608, and 0.762, yielding an average test set accuracy of 0.7725. This implies that our energy consumption assessment method has a maximum error margin of 22.75%. Future endeavors may further reduce this error by incorporating more simulation data and refining the model. Semi-supervised learning and few-shot learning may offer avenues for further enhancing workflow efficiency and model accuracy in future investigations.

To ensure clarity, it is imperative to elucidate the reliability and applicability of our model. The framework proposed in this study is apt for both planned and unplanned design communities, primarily because the predictors for energy consumption encompass architectural functions and morphological features. These relationships, rooted in thermodynamics, are embedded within the UBEM. Machine learning, with its prowess in fitting non-linear relationships, explicitly manifests these associations. When extending the application across regions, it becomes essential to rigorously assess the alignment between the distributions of the training and generalization sets, stemming from the inherent assumption of independent and identically distributed samples in machine learning algorithms. This signifies that if the architectural functions and morphology of the prediction region deviate significantly from the training set, the predictions might falter. At the algorithmic level, transfer learning could potentially mitigate the accuracy losses due to distribution disparities. Further enhancements can be introduced at the data level by augmenting the training samples. Energy consumption habits at end-use terminals vary across regions, and this variation might be reflected in the parameter settings of the UBEM during the preparation of the training set. For cross-regional applications, settings should be aligned with the local energy consumption simulation standards. Moreover, climatic factors play a pivotal role in energy consumption simulations; hence, when applying this method in diverse regions, it's crucial to incorporate local meteorological data.

Our methodology offers a viable approach to estimating building energy consumption at the urban scale, especially when data availability is limited. Within the scope of this study, the balanced regressor used nine indicators for modeling; in underdeveloped regions, even fewer indicators may be available. Our method can be effectively integrated with workflows that use remote sensing and deep learning to identify building footprints, enabling estimation of building energy consumption from a minimal set of architectural features and thereby supporting sustainable energy development in less developed areas.
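To make the minimal-feature scenario concrete: once a footprint polygon and storey count are available (e.g., extracted from remote sensing imagery), a first-order consumption figure follows from gross floor area and an assumed energy use intensity. The sketch below is illustrative only; the default EUI is a placeholder, not a value derived from this study:

```python
def footprint_area(vertices):
    """Shoelace formula for a building footprint given as (x, y) vertices in metres."""
    n = len(vertices)
    doubled = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(doubled) / 2.0

def annual_consumption_kwh(vertices, storeys, eui_kwh_per_m2=85.0):
    """Gross floor area (footprint x storeys) times an assumed energy use
    intensity. The default EUI is an illustrative placeholder."""
    return footprint_area(vertices) * storeys * eui_kwh_per_m2

footprint = [(0, 0), (20, 0), (20, 30), (0, 30)]     # a 20 m x 30 m (600 m2) footprint
print(annual_consumption_kwh(footprint, storeys=6))  # 600 * 6 * 85.0 = 306000.0 kWh/yr
```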

6 Conclusion

The aim of this paper is to predict urban building energy consumption in the Shanghai central area and to establish a robust method for predicting building energy consumption at the city scale. We amassed a total of 18,689 urban neighborhoods, designating 14.5% as the simulation dataset and the remaining 85.5% as the prediction dataset. The simulation dataset served for urban building energy modeling and machine learning model training, while the prediction dataset was reserved for generalization of the machine learning models. The urban building energy consumption simulations were executed in batches using Dragonfly. We compiled 14 factors describing the land use and building morphology of urban neighborhoods as input features for machine learning and compared six prevalent ensemble learning algorithms. The optimal model was analyzed with SHAP to derive a feature importance ranking of the model output, and the balanced regressor was then defined as the model achieving optimal performance with the fewest input features. Applied to the prediction dataset, this balanced regressor enabled rapid estimation of building energy consumption in the Shanghai central area.
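The core of that workflow, fit a boosting ensemble, rank features by importance, then keep only the most influential ones, can be sketched in miniature. The toy regressor below boosts depth-one trees (stumps) on squared loss and counts how often each feature is chosen for a split, a simplified analogue of XGBoost's "weight" importance (the SHAP values used in this study play the same ranking role); all names and data here are illustrative, not the study's implementation:

```python
import random

def fit_stump(X, residuals):
    """Best single split (feature, threshold, leaf means) by squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:]

def boost(X, y, rounds=40, lr=0.3):
    """Gradient boosting on squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        j, t, lm, rm = fit_stump(X, [yi - pi for yi, pi in zip(y, pred)])
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if row[j] <= t else rm) for row, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)

def split_importance(model, n_features):
    """How often each feature is used for a split (XGBoost-style 'weight')."""
    counts = [0] * n_features
    for j, *_ in model[2]:
        counts[j] += 1
    return counts

random.seed(1)
# Toy data: the target depends strongly on feature 0, weakly on feature 1.
X = [[random.random(), random.random()] for _ in range(200)]
y = [3.0 * x0 + 0.1 * x1 + random.gauss(0, 0.05) for x0, x1 in X]

model = boost(X, y)
print(split_importance(model, 2))  # feature 0 dominates the ranking
```

Pruning the feature set to the top-ranked indicators and refitting is then the "balanced regressor" step: the smallest input set that preserves test accuracy.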

The findings indicate that the Boosting ensemble learning model, specifically XGBoost, delivers superior performance, with test set accuracies of 0.914, 0.674, 0.685, and 0.749 for predicting RPV, TL, CL, and HL, respectively. The feature importance ranking varied notably across prediction objectives. The balanced regressor, using the nine most influential features to predict RPV, TL, CL, and HL, achieves test set accuracies of 0.956, 0.674, 0.608, and 0.762, for an average test set accuracy of 0.7725. Compared with traditional approaches, our methodology offers an 85.5% time advantage with a maximum error of just 22.75%.

The present study has two primary limitations. First, the current urban energy simulation engine does not account for the effects of green spaces and water bodies, which may cause our model to underestimate the influence of the urban microclimate on building energy consumption in the Shanghai central area. Future work could couple the energy consumption simulation engine with tools that model hydrodynamics, mean radiant temperature, and the heat island effect. Second, the accuracy of our machine learning model needs improvement. Future work will explore the trade-off between simulation time and model prediction accuracy. Furthermore, while deep learning models might substantially boost accuracy, they require more stringent generalization assessments to prevent overfitting.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The authors gratefully acknowledge the contributions of the editors and peer reviewers for their valuable feedback and suggestions. Their insights have been crucial in refining and strengthening the manuscript.

Funding

Research on Multi-Modal Scenario Intelligent Simulation Information Platform for Sustainable Urban Planning and Construction under the National Key Research and Development Programme of the 14th Five-Year Plan (2022YFC3800205).

The National Natural Science Foundation of China under Grant NO. 52278041 and the Fundamental Research Funds for the Central Universities.

The International Knowledge Centre for Engineering Sciences and Technology (IKCEST) under the Auspices of UNESCO, Beijing 100088, China.

Author information

Authors and affiliations.

College of Architecture and Urban Planning, Tongji University, 1239 Siping Road, Shanghai, People’s Republic of China

Qingrui Jiang, Chenyu Huang, Zhiqiang Wu, Jiawei Yao, Jinyu Wang, Xiaochang Liu & Renlu Qiao

Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, 1239 Siping Road, Shanghai, People’s Republic of China

Qingrui Jiang, Zhiqiang Wu & Renlu Qiao


Contributions

Qingrui Jiang: Conceptualization, Visualization, Methodology, and Writing. Chenyu Huang: Conceptualization, Supervision, Writing, and Methodology. Zhiqiang Wu: Conceptualization, Investigation, Funding Acquisition and Supervision. Jiawei Yao: Conceptualization, Investigation, Funding Acquisition and Supervision. Jinyu Wang: Data curation. Xiaochang Liu: Review and editing. Renlu Qiao: Review and editing. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Zhiqiang Wu or Jiawei Yao .

Ethics declarations

Competing interests.

In the interest of transparency, we disclose that Zhiqiang Wu is a co-author of this paper and also serves as the editor of Frontiers of Urban and Rural Planning. To ensure the integrity of the review process, Zhiqiang Wu will recuse themselves from any involvement in the editorial decision-making for this submission. An alternative editor has been designated to handle the peer review process for this paper. The journal's commitment to editorial independence and ethical standards will be upheld throughout the review process.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Jiang, Q., Huang, C., Wu, Z. et al. Predicting building energy consumption in urban neighborhoods using machine learning algorithms. FURP 2 , 6 (2024). https://doi.org/10.1007/s44243-024-00032-3

Download citation

Received : 24 May 2023

Revised : 11 January 2024

Accepted : 12 January 2024

Published : 16 February 2024

DOI : https://doi.org/10.1007/s44243-024-00032-3


Keywords

  • Urban building energy modeling (UBEM)
  • Interpretable machine learning
  • Ensemble learning
  • Shanghai central area
  • Energy consumption
