
10 Python One-Liners That Simplify Feature Engineering
Image by Editor | Midjourney
Feature engineering is a critical step in most data analysis workflows, especially when building machine learning models. It involves creating new features from existing raw data attributes to extract deeper analytical insights and enhance model performance. To help streamline and optimize your feature engineering and data preparation workflow, this article introduces 10 one-liners (single-line code snippets that complete a meaningful task effectively and concisely): 10 practical one-liners for performing feature engineering in a variety of situations and on a variety of data types, all in a simplified way.
Before you start, you will need to import a few key Python libraries and modules. We will also use two datasets from Scikit-learn's datasets module: the wine dataset and the Boston housing dataset.
from sklearn.datasets import load_wine, fetch_openml
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, KBinsDiscretizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

# Load both datasets into pandas DataFrames
wine = load_wine(as_frame=True)
df_wine = wine.frame
boston = fetch_openml(name="boston", version=1, as_frame=True)
df_boston = boston.frame
Note that the two datasets are loaded into two pandas DataFrames named df_wine and df_boston, respectively.
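If you want to confirm that both datasets loaded correctly, here is a quick optional sanity check (the expected shapes below assume the standard versions of these datasets):

# Optional sanity check: each DataFrame holds 13 features plus its target column
print(df_wine.shape)    # expected: (178, 14)
print(df_boston.shape)  # expected: (506, 14)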
1. Standardizing Numerical Features (Z-Score Scaling)
Standardization is a common approach to scaling numerical features whose values span different ranges or magnitudes, possibly with some moderate outliers. It transforms an attribute's values to follow a standard normal distribution with a mean of 0 and a standard deviation of 1. Scikit-learn's StandardScaler class provides a seamless implementation of this method: all you need to do is call its fit_transform method, passing the DataFrame of features that require standardization:
df_wine_std = pd.DataFrame(StandardScaler().fit_transform(df_wine.drop('target', axis=1)), columns=df_wine.columns[:-1])
The resulting standardized features will now have small values centered around 0, some positive and some negative. This is completely normal even if your original feature values were all positive, because standardization not only scales the data but also centers it around the mean of the original attribute.
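To convince yourself, here is a minimal verification sketch (note that pandas' std() uses a slightly different denominator than StandardScaler, so values land near 1 rather than exactly 1):

# Verify standardization: per-feature means ~0 and standard deviations ~1
print(df_wine_std.mean().round(2))
print(df_wine_std.std().round(2))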
2. Min-Max Scaling
When values in a feature vary greatly across individual instances (e.g., the number of students in each classroom of a high school), min-max scaling is an appropriate way to scale the data: it normalizes the feature's values into the unit interval [0, 1] by applying the formula x' = (x - min) / (max - min) to each value x, where min and max are the minimum and maximum values of the feature x belongs to. Scikit-learn offers a class analogous to the one used for standardization:
df_boston_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df_boston.drop('MEDV', axis=1)), columns=df_boston.columns[:-1])
In the example above, we use the Boston housing dataset and scale all features except MEDV (the median house value), which is intended as the target variable for machine learning tasks such as regression and has therefore been removed before scaling.
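Assuming the transformation ran cleanly, every scaled feature should now span the unit interval; a quick check:

# Verify min-max scaling: global minimum and maximum across all features
print(df_boston_scaled.min().min(), df_boston_scaled.max().max())  # 0.0 1.0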
3. Adding Polynomial Features
Adding polynomial features is useful when the data is not strictly linear but exhibits nonlinear relationships. The process boils down to adding new features that result from raising the original features to a power, as well as from interactions between them. This example uses PolynomialFeatures to create, from the two features describing the alcohol and malic acid properties of wine, new features that are the squares of the original two (degree=2), plus another feature that captures the interaction between them by applying the product operator:
df_interactions = pd.DataFrame(PolynomialFeatures(degree=2, include_bias=False).fit_transform(df_wine[['alcohol', 'malic_acid']]))
As a result, three new features are created on top of the original two: alcohol^2, malic_acid^2, and the product alcohol * malic_acid.
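The one-liner above leaves the resulting DataFrame with default integer column names. If you want readable names, PolynomialFeatures exposes get_feature_names_out (available in recent scikit-learn versions); a sketch:

# Rebuild the DataFrame with descriptive polynomial feature names
poly = PolynomialFeatures(degree=2, include_bias=False)
df_interactions = pd.DataFrame(
    poly.fit_transform(df_wine[['alcohol', 'malic_acid']]),
    columns=poly.get_feature_names_out(['alcohol', 'malic_acid'])
)
print(df_interactions.columns.tolist())
# ['alcohol', 'malic_acid', 'alcohol^2', 'alcohol malic_acid', 'malic_acid^2']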
4. One-Hot Encoding Categorical Variables
One-hot encoding takes a categorical variable with m possible values, or categories, and creates m numerical (or rather, binary) features, each indicating the presence or absence of one category in a data instance with the values 1 and 0, respectively. Thanks to pandas' get_dummies function, this process couldn't be easier. In the following example, we assume that the CHAS attribute should be treated as categorical and apply the function to one-hot encode it:
df_boston_ohe = pd.get_dummies(df_boston.astype({'CHAS': 'category'}), columns=['CHAS'])
Since this feature initially takes two possible values, two new binary features are built from it. In many data analysis and machine learning processes, one-hot encoding is an important step when models cannot handle categorical features directly and require them to be encoded numerically.
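You can list the newly created binary columns to confirm the encoding (the exact suffixes depend on the category labels stored in your copy of the dataset):

# Inspect the one-hot columns derived from CHAS
print([col for col in df_boston_ohe.columns if col.startswith('CHAS')])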
5. Discretizing Continuous Variables
Discretizing a continuous numerical variable into several sub-intervals or bins is a regular step in analysis workflows, for instance, to obtain plots such as histograms or bar charts that look less overwhelming while still capturing the "big picture". This example one-liner shows how to discretize the alcohol attribute in the wine dataset into four quantile-based bins, labeled 0 to 3:
df_wine['alcohol_bin'] = pd.qcut(df_wine['alcohol'], q=4, labels=False)
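Since pd.qcut builds quantile-based bins, each bin should hold roughly a quarter of the 178 wines; a quick check:

# Count how many wines fall into each of the four bins
print(df_wine['alcohol_bin'].value_counts().sort_index())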
6. Log-Transforming Skewed Features
If one of your numerical features is right-skewed (positively skewed), that is, it visually exhibits a long tail to the right due to a few values much larger than the rest, a logarithmic transformation helps squeeze it into a shape better suited for further analysis. NumPy's log1p performs this transformation; just pass it the feature in the DataFrame that needs converting. The result is stored in a newly created dataset feature:
df_wine['log_malic'] = np.log1p(df_wine['malic_acid'])
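To quantify the effect, you can compare the skewness of the feature before and after the transformation with pandas' built-in skew(); a minimal sketch:

# Compare skewness: the log-transformed version should sit closer to 0
print(df_wine['malic_acid'].skew())  # clearly positive (right-skewed)
print(df_wine['log_malic'].skew())   # noticeably closer to 0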
7. Creating Ratios Between Two Features
One of the most direct yet common feature engineering steps in data analysis and preprocessing is creating a new feature as the ratio (division) between two semantically related ones. For example, given the alcohol and malic acid levels of wine samples, we may be interested in a new attribute describing the ratio between these two chemical properties, as shown below:
df_wine['alcohol_malic_ratio'] = df_wine['alcohol'] / df_wine['malic_acid']
Thanks to the power of pandas, the division operation that yields the new feature is performed element-wise for every instance (row) in the dataset, without any loops.
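The malic_acid feature happens to be strictly positive in this dataset, but when building ratio features in general it is worth guarding against division by zero. One defensive sketch, turning zero denominators into NaN rather than infinity:

# Defensive variant: zeros in the denominator become NaN instead of inf
df_wine['alcohol_malic_ratio'] = df_wine['alcohol'] / df_wine['malic_acid'].replace(0, np.nan)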
8. Removing Low-Variance Features
Often, some features may show so little variability across their values that they not only contribute little to analysis or to machine learning models trained on the data, but may even make results worse. Identifying and removing such low-variance features is therefore not a bad idea. This one-liner shows how to use Scikit-learn's VarianceThreshold class to automatically delete features whose variance falls below a threshold. Try adjusting the threshold to see how it makes the resulting feature removal more or less aggressive:
df_boston_high_var = pd.DataFrame(VarianceThreshold(threshold=0.1).fit_transform(df_boston.drop('MEDV', axis=1)))
Note: the MEDV attribute is removed manually because it is the dataset's target variable, independent of any other features being dropped for falling below the variance threshold.
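Because fit_transform returns a bare NumPy array, the surviving column names are lost. The selector's get_support() method recovers which features were kept; a sketch (assuming all remaining columns are numeric):

# Recover the names of the features that passed the variance threshold
selector = VarianceThreshold(threshold=0.1)
features = df_boston.drop('MEDV', axis=1)
selector.fit(features)
print(features.columns[selector.get_support()].tolist())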
9. Multiplicative Interaction
Suppose our client is a wine producer in Lanzarote (Spain) that, for marketing purposes, wants a quality score combining information about a wine's alcohol content and color intensity into a single value. This can be done through feature engineering: simply take the features involved in calculating the new score for each wine and apply the math the client wants it to reflect, for instance, the product of these two features:
df_wine['wine_quality'] = df_wine['alcohol'] * df_wine['color_intensity']
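A raw product like this can be hard to read in a marketing context. As an optional follow-up (our own assumption, not part of the client's brief), you could rescale the score to a friendlier 0-100 range; the column name wine_quality_0_100 is purely illustrative:

# Optional (hypothetical): rescale the combined score to 0-100 for readability
q = df_wine['wine_quality']
df_wine['wine_quality_0_100'] = 100 * (q - q.min()) / (q.max() - q.min())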
10. Flagging Outliers
Although outliers are removed from the dataset in most data analysis scenarios, it can sometimes be interesting to keep track of them once identified. Why not do this by creating a new feature that indicates whether a data instance is an outlier?
df_boston['tax_outlier'] = ((df_boston['TAX'] < df_boston['TAX'].quantile(0.25) - 1.5 * (df_boston['TAX'].quantile(0.75) - df_boston['TAX'].quantile(0.25))) | (df_boston['TAX'] > df_boston['TAX'].quantile(0.75) + 1.5 * (df_boston['TAX'].quantile(0.75) - df_boston['TAX'].quantile(0.25)))).astype(int)
This one-liner manually applies the interquartile range (IQR) method to flag possible outliers in the TAX attribute, which is why it is considerably longer than the previous examples. Depending on the dataset and the target feature you analyze, there may be no outliers at all, in which case the newly added feature takes the value 0 for every instance in the dataset.
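If the one-liner's length hurts readability, an equivalent multi-line version with named intermediate variables is easier to maintain; a sketch:

# Equivalent, more readable IQR-based outlier flag
q1, q3 = df_boston['TAX'].quantile(0.25), df_boston['TAX'].quantile(0.75)
iqr = q3 - q1
df_boston['tax_outlier'] = ((df_boston['TAX'] < q1 - 1.5 * iqr) |
                            (df_boston['TAX'] > q3 + 1.5 * iqr)).astype(int)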
Conclusion
This article offered a glimpse of ten effective Python one-liners that, once you are familiar with them, let you efficiently perform a variety of feature engineering steps, getting your data into good shape for further analysis or for training machine learning models.