A Novel Hyperparameter Optimization Approach for Supervised Classification: Phase Prediction of Multi-Principal Element Alloys

ANN-Based Classification

To establish a framework for phase prediction, a supervised classification approach is formulated in which each alloy composition is characterized by ten distinct input features and the phase is the output class. The ANN architecture consists of consecutive input, hidden, and output layers. An initial network is proposed, consisting of two hidden layers with dropout regularization. Dropout is a regularization method used to prevent overfitting and can be implemented without significant computational overhead [18]. Each hidden layer employs the ReLU activation function [39]. Cross-entropy is chosen as the loss function for the network and is defined as

$$\begin{aligned} \text {CE} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{k=1}^{K} T_{ik} \ln Y_{ik} \end{aligned}$$

(1)

where N is the number of observations, K is the number of classes, \(T_{ik}\) is the target value, and \(Y_{ik}\) is the predicted value of the network.
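For illustration, Eq. (1) can be computed directly; the following minimal NumPy sketch assumes one-hot encoded targets and predicted class probabilities (the function name and the small clipping constant are illustrative, not part of the study):

```python
import numpy as np

def cross_entropy(T, Y, eps=1e-12):
    """Cross-entropy loss of Eq. (1).

    T : (N, K) one-hot target matrix
    Y : (N, K) predicted class probabilities
    """
    N = T.shape[0]
    Y = np.clip(Y, eps, 1.0)          # avoid log(0)
    return -np.sum(T * np.log(Y)) / N
```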

The training of an ANN is an iterative procedure in which the network's performance is improved by tuning its parameters, namely the weights and biases, which enter a series of linear operations between consecutive layers. Each such operation can be defined as

$$\begin{aligned} x_{j}^{l} = f \left( \sum _{i} w_{ij}^{l} x_{i}^{l-1} + b_{j}^{l} \right) \end{aligned}$$

(2)

where \(x_{j}^{l}\) is the output of the \(j^{th}\) neuron in the \(l^{th}\) layer, \(x_{i}^{l-1}\) is the input from the \(i^{th}\) neuron in the previous layer, whereas \(w_{ij}^{l}\) and \(b_{j}^{l}\) are the weights and biases of the \(j^{th}\) node in the \(l^{th}\) layer, and f is a non-linear activation function. In this study, the ANN architecture consists of an input layer for the alloy features, two hidden layers with an equal number of nodes in each layer, and an output layer for the alloy phase.

A key point to note is that our network is not fully connected due to the presence of dropout regularization, which randomly removes connections between nodes with a fixed probability \(p_{d}\). This is essential for overcoming overfitting, a phenomenon in which a model fits too closely to its training data and consequently loses its ability to generalize to testing data. An added benefit of dropout is that it makes the ANN more lightweight and reduces computational complexity by removing certain node connections.
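For illustration only, the described architecture can be sketched in PyTorch; the hidden-layer width, dropout probability, and number of phase classes below are placeholders rather than the values used in the study:

```python
import torch.nn as nn

class PhaseClassifier(nn.Module):
    """Sketch of the described ANN: ten input features, two equally sized
    hidden layers with ReLU and dropout, and one output node per phase class."""

    def __init__(self, n_features=10, n_hidden=32, n_classes=3, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),   # Eq. (2): weights and biases, then activation
            nn.ReLU(),
            nn.Dropout(p_drop),                # randomly zeroes activations with probability p_drop
            nn.Linear(n_hidden, n_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(n_hidden, n_classes),    # raw class scores
        )

    def forward(self, x):
        return self.net(x)

# nn.CrossEntropyLoss combines softmax with the loss of Eq. (1)
loss_fn = nn.CrossEntropyLoss()
```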

Genetic Algorithm

The GA is an optimization algorithm inspired by the processes of natural selection and evolution. It solves optimization and search problems by allowing a population of candidate solutions to evolve over successive generations. GA is particularly useful for problems involving large and complex search spaces, making it an ideal candidate for this study.

Before the GA is applied, the data must be encoded so that each individual solution is represented as a chromosome, with each gene within a chromosome representing a decision variable.

The improvement of solutions in GA depends on two key operators: crossover and mutation. Crossover generates new solutions by combining solutions from the previous generation, while mutation randomly alters a decision variable within a particular solution. The GA proceeds as follows:

1) Initialization of the population to represent the search space;

2) Evaluation of fitness through an objective function;

3) Selection of suitable individuals for reproduction;

4) Crossover through combination of individuals;

5) Mutation by randomly altering an offspring's genetic code.

The above process is iterative, with each iteration referred to as a generation. In every generation, evaluation, selection, crossover, and mutation take place. The process continues until a termination criterion is met, which may be a maximum number of generations or a minimum tolerance in the fitness value. The GA process is illustrated as a flowchart in Fig. 2. The performance of GA can be sensitive to the choice of parameters such as population size, crossover rate, mutation rate, and selection criteria. In this study, GA is employed for hyperparameter optimization, incorporating a novel strategy for handling variable-sized chromosomes.

Fig. 2 Flowchart of the genetic algorithm. \(p_{m}\) is the mutation probability, and r is a uniformly distributed random number between 0 and 1

Encoding data within the framework of genes is crucial for the operation of GA. The vast majority of GA studies employ binary encoding, where data are represented as 0s and 1s. However, because the present problem involves both real-numbered and categorical data, it is more appropriate to use real encoding. Each chromosome is composed of N genes, where N is the number of hyperparameters in the problem.

In this study, single-point crossover and single-point mutation are employed, with a fixed mutation probability of \(p_{m}=0.1\). Selection is based on the fitness of each individual, and instead of replacing all members of the previous generation, a few of the best individuals from the current generation are retained and passed on to the next. This method, known as elitism, ensures that the best solutions found so far are preserved across generations, preventing the loss of valuable genetic information through random variation. Elitism is particularly useful when the fitness landscape is dynamic or when the algorithm is prone to premature convergence, a situation in which the population converges to suboptimal solutions too early. By preserving the best individuals, elitism helps guide the GA toward better solutions over time. A comprehensive depiction of the GA employed in this study is presented as pseudocode in Algorithm 1.

Algorithm 1 The procedure of the genetic algorithm
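Since Algorithm 1 is provided as a figure, a minimal Python sketch of one possible reading of the loop is given below, combining elitism, fitness-based selection, single-point crossover, and mutation with probability \(p_{m}\); the tournament-style selection, default settings, and function names are assumptions rather than the exact implementation.

```python
import random

def genetic_algorithm(init_individual, fitness, crossover, mutate,
                      pop_size=50, n_generations=100, p_m=0.1, n_elite=2):
    """Generic GA loop with elitism (sketch); lower fitness values are better."""
    population = [init_individual() for _ in range(pop_size)]
    for _ in range(n_generations):
        population.sort(key=fitness)                  # evaluate and rank individuals
        next_gen = population[:n_elite]               # elitism: carry the best forward
        while len(next_gen) < pop_size:
            # Selection: the fitter of two random individuals (tournament of size 2)
            p1 = min(random.sample(population, 2), key=fitness)
            p2 = min(random.sample(population, 2), key=fitness)
            child = crossover(p1, p2)                 # combine parents
            if random.random() < p_m:                 # mutate with probability p_m
                child = mutate(child)
            next_gen.append(child)
        population = next_gen
    return min(population, key=fitness)
```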

Dynamic Length Chromosome Operators

Canonical GA and its respective operators are designed based on a fixed number of decision variables, with each variable representing a gene in a fixed-length chromosome and multiple chromosomes forming a population. A significant limitation of this approach is that GA is not inherently suited to handle variable-length data, whether it involves a variable number of genes or a variable length of a particular gene. To address this deficiency, several studies have been conducted to develop approaches for variable-length chromosome GA by modifying the crossover and mutation operators to suit the specific problem. These approaches have been applied to various domains, such as topology design [23, 24] and hyperparameter optimization [58].

Fig. 3 Depiction of chromosomes: a shows a standard chromosome with fixed-length genes; b shows a chromosome with a single variable-length gene consisting of values \(b_{1}, b_{2}, \ldots , b_{n}\), where n is an integer that can vary between a lower and an upper bound

GA strategies exist for both fixed and variable-length chromosomes. However, a gap in the current research lies in the ability to handle variable-length genes, meaning the capability to manage chromosomes containing both single and multiple values within individual genes. An illustration of this phenomenon is shown in Fig. 3, where a standard chromosome is contrasted with a chromosome containing variable-length genes.

By examining Fig. 3, it can be discerned that there is a need for modified crossover and mutation operators capable of handling chromosomes containing both single-valued and multi-valued genes, where the length of multi-valued genes can vary. To address this, a robust framework of genetic operators has been developed, enabling effective crossover and mutation operations for the specific problem at hand.

To simplify the problem, two different crossover and mutation operators are defined: one for single-length genes and another for variable-length genes. The genetic operators for single-length genes function as standard crossover and mutation, where parents are segmented and joined together, with random mutations occurring at a single point. For the variable-length operators, only the gene with variable array length is handled. Consequently, both single-length and variable-length operators function simultaneously for each individual in the population.

The variable-length crossover operates by selecting the variable-length gene from two parents, ensuring that the length of parent 1 is less than the length of parent 2, i.e., \(len(P_1)<len(P_2)\). Next, the length of the offspring is drawn from the range \([len(P_1), len(P_2)]\), ensuring that the offspring's length always lies between the lengths of the two parents. The crossover point c is then selected within the range \([1, len(P_1)]\), so that the point lies within the bounds of \(P_1\). The first part of offspring \(O_1\) is initialized with the values of \(P_1\) up to the crossover point c, such that \(O_1 = P_1(1, \ldots , c)\). The remainder of \(O_1\) is filled by checking each member of \(P_2\): if the member is not already present in \(O_1\), it is added; otherwise, it is rejected. This avoids repetitions and ensures that the characteristics of both parents are represented in the resulting offspring. The pseudocode for this process is shown in Algorithm 2.

Algorithm 2 The procedure of variable-length crossover
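Because Algorithm 2 is likewise provided as a figure, the sketch below follows the textual description above (order the parents by length, draw an offspring length between the two, copy \(P_1\) up to the crossover point, then fill from \(P_2\) while skipping duplicates); names and edge-case handling are assumptions.

```python
import random

def variable_length_crossover(p1, p2):
    """Crossover for a single variable-length gene (sketch)."""
    # Order parents so that len(p1) <= len(p2)
    if len(p1) > len(p2):
        p1, p2 = p2, p1
    # Offspring length lies between the parents' lengths
    child_len = random.randint(len(p1), len(p2))
    # Crossover point within the bounds of the shorter parent
    c = random.randint(1, len(p1))
    child = list(p1[:c])
    # Fill the remainder from p2, skipping values already present
    for value in p2:
        if len(child) >= child_len:
            break
        if value not in child:
            child.append(value)
    return child
```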

For the variable-length mutation, the same strategy is applied as used for the traveling salesman problem. A random permutation of length l is generated, where l is a randomly selected length within the bounds of the decision variable. Thus, the mutation process involves sampling from a distribution of random permutations of the variable-length decision variable.
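A minimal sketch of this mutation, assuming the variable-length gene draws its values from a finite pool of candidate values (the `candidates` argument and bounds are placeholders):

```python
import random

def variable_length_mutation(candidates, min_len, max_len):
    """Replace the variable-length gene with a random permutation of length l,
    where l is drawn within the bounds of the decision variable."""
    l = random.randint(min_len, max_len)
    return random.sample(candidates, l)  # l distinct values in random order

# Example: a gene of 5 to 10 distinct feature indices out of ten
# gene = variable_length_mutation(list(range(10)), 5, 10)
```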

Hyperparameter Optimization

In an ANN, numerous tunable parameters, referred to as hyperparameters, influence its performance and control various aspects of the learning process. These parameters are often selected on an ad hoc basis, which is an ineffective practice that limits the capabilities of the ANN. To address this, hyperparameter optimization is employed to select the optimal parameters using an optimization technique that aims to maximize the network’s performance, specifically by improving testing accuracy.

In this study, hyperparameter optimization is formulated as a minimization problem, where the goal is to identify the set of hyperparameters for a given model that returns the best performance when measured on a test set. This process can be mathematically represented as

$$\begin{aligned} x^* = \text {arg}\,\min _{x \in \Theta } f(x) \end{aligned}$$

(3)

where f(x) is the objective function to be minimized; \(x^*\) is the set of hyperparameters that yield the lowest value of the objective function. Here, x can take any value within the domain \(\Theta \), which consists of all possible combinations of hyperparameters. In this study, the objective function is the root mean squared error (RMSE), evaluated on the test set.
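Conceptually, each candidate hyperparameter set x is scored by training the network and measuring RMSE on the test set. The sketch below illustrates one way to wrap this as the objective f(x); the `train_fn` argument is a hypothetical helper that trains the ANN with the given hyperparameters and returns a prediction function.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error used as the objective on the test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def make_objective(train_fn, X_test, y_test):
    """Wrap model training into the objective f(x) of Eq. (3).

    train_fn(hyperparams) is a placeholder: it should train the ANN with the
    given hyperparameters and return a function that predicts on new data.
    """
    def objective(hyperparams):
        predict = train_fn(hyperparams)
        return rmse(y_test, predict(X_test))
    return objective
```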

The canonical ANN hyperparameters considered in this study, along with their ranges, are listed in Table 1. These parameters have been widely utilized in the literature across various types of optimization techniques, datasets, and problems involving both classification and regression [2, 12, 35].

In this study, two new concepts are incorporated as hyperparameters within the optimization framework: outlier detection and feature subset selection. These topics are discussed in more detail in the following subsection.

Outlier Detection

Outlier detection is the process of identifying data points or observations that deviate significantly from the majority of the data in a dataset. These outliers are typically observations that are unusual compared to the rest of the data and may indicate anomalies or errors. A well-known statistical approach for identifying outliers in a distribution is the boxplot method [13]. A boxplot provides a visual summary of key statistics of a sample dataset, including the 25th and 75th percentiles, the maximum and minimum values, and the median of the distribution. In the context of boxplots, outliers are defined with respect to the interquartile range (IQR), the difference between the 75th and 25th percentiles, multiplied by a constant \(k_{o}\): a data point is treated as an outlier if it lies outside the range \([Q_{1} - k_{o}\,\text {IQR},\; Q_{3} + k_{o}\,\text {IQR}]\), where \(Q_{1}\) and \(Q_{3}\) denote the 25th and 75th percentiles. The constant essentially determines the range beyond which any data point is considered an outlier. A larger value of the constant results in fewer data points being classified as outliers, and vice versa. Therefore, the choice of \(k_{o}\) is crucial, as it allows the maximum number of outliers to be eliminated without adversely affecting the information contained within each distribution.

The trade-off between \(k_{o}\) and the number of observations considered by the model is illustrated in Fig. 4, where the value of \(k_{o}\) controls the number of outliers identified by the boxplot method. A low \(k_{o}\) value is clearly undesirable, as it flags large portions of the data as outliers and eliminates them from the dataset. A moderate to high \(k_{o}\) value is needed for efficient identification and removal of outliers. Therefore, \(k_{o}\) is treated as a hyperparameter of our model, and its value is determined through hyperparameter optimization.
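A minimal NumPy sketch of this boxplot rule for a single feature, treating \(k_{o}\) as an input (per-feature application and dataset handling are left out for brevity):

```python
import numpy as np

def iqr_outlier_mask(x, k_o):
    """Return a boolean mask that is True for non-outlier observations.

    A point is an outlier if it lies outside [Q1 - k_o*IQR, Q3 + k_o*IQR].
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k_o * iqr, q3 + k_o * iqr
    return (x >= lower) & (x <= upper)

# Example: a larger k_o keeps more observations
# x_clean = x[iqr_outlier_mask(x, k_o=1.5)]
```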

Fig. 4 Variation of the number of outliers with \(k_{o}\)

Table 1 List of hyperparameters and their respective ranges

Table 2 Descriptive statistics of the features in our dataset

Feature Subset Selection

Feature subset selection is the process of choosing a subset of relevant features (variables, predictors) from a larger set to build a model. The objective is to enhance the model’s performance by reducing overfitting, simplifying the model, and potentially improving interpretability. In many real-world datasets, numerous features may be present, but not all contribute equally to the predictive capability of the model. Some features may be redundant or irrelevant, adding noise to the model or leading to overfitting. Feature subset selection methods aim to identify and select the most informative and relevant features, while discarding the less useful ones. This process is crucial for building efficient and effective machine learning models, particularly when dealing with high-dimensional data.

Many feature subset selection methods have been utilized in the past, such as correlation-based filtering [63], wrapper methods [48], and embedded methods like LASSO [38]. Each method has its own merits and demerits. However, feature subset selection has not yet been incorporated within the framework of hyperparameter optimization. Integrating feature selection with hyperparameter optimization allows the optimum features and suitable network hyperparameters to be selected simultaneously, avoiding extra computational overhead and steering the process toward maximizing testing accuracy.

A comprehensive list of the hyperparameters considered in this study, along with their respective ranges, is shown in Table 1. The range of the feature subset indicates the lowest and highest number of features that can be considered by the algorithm, which in this case are 5 and 10, respectively.
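For illustration, a feature-subset gene of variable length (between 5 and 10 of the ten features) can be sampled and applied as follows; the use of column indices and a NumPy feature matrix is an assumption of this sketch:

```python
import random
import numpy as np

N_FEATURES = 10                  # total number of alloy features
MIN_SUBSET, MAX_SUBSET = 5, 10   # allowed subset sizes (Table 1)

def random_feature_subset():
    """Sample a variable-length gene: a subset of feature indices."""
    size = random.randint(MIN_SUBSET, MAX_SUBSET)
    return random.sample(range(N_FEATURES), size)

def apply_subset(X, subset):
    """Keep only the columns selected by the feature-subset gene."""
    return np.asarray(X)[:, subset]
```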
