Several simple data dimensionality reduction methods and demos
PCA - Principal Component Analysis
PCA is an unsupervised linear dimensionality reduction method: it projects the data onto the directions of greatest variance and drops the components that contribute little variance.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)  # standardize the features (zero mean, unit variance)
# whiten=True rescales each principal component to unit variance, so the components have comparable scale
# when n_components is a float between 0 and 1, it is the proportion of variance you want to retain
pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)
print("Original number of features:", features.shape[1])  # shape[1] is the number of columns, i.e. features
print("Reduced number of features:", features_pca.shape[1])
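A quick way to sanity-check the float `n_components` setting is `explained_variance_ratio_`: PCA keeps the smallest number of components whose cumulative variance reaches the requested proportion. A minimal sketch, reusing the code above (the printed check is an illustrative addition, not part of the original recipe):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)

pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)

# Sum of per-component variance ratios should meet the requested 0.99.
retained = pca.explained_variance_ratio_.sum()
print("Components kept:", pca.n_components_)
print("Variance retained: %.4f" % retained)
```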
KPCA - Linearly Inseparable Data
Data is linearly inseparable when no straight line (or hyperplane) can divide the two classes. KPCA (kernel PCA) maps the data into a higher-dimensional space where it becomes linearly separable, reducing dimensionality at the same time.
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles
# generate a two-dimensional dataset: a large circle containing a small circle, which is linearly inseparable
# the smaller the noise, the more concentrated the points; factor is the ratio of the inner circle's radius to the outer circle's
features, _= make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)
# kernel="rbf": Gaussian kernel; other options are "linear", "poly", "sigmoid", "cosine", "precomputed"
# gamma is the kernel coefficient for the rbf, poly and sigmoid kernels
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kpca.shape[1])
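The "linearly separable" claim can be made concrete by fitting a linear classifier on the raw coordinates versus on the single KPCA component. The `LogisticRegression` comparison below is an illustrative addition, not part of the original recipe; a sketch under the same data settings as above:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

features, target = make_circles(n_samples=1000, random_state=1,
                                noise=0.1, factor=0.1)

kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

# A linear model on the raw 2-D circles stays near chance level,
# because no straight line separates concentric rings...
acc_raw = LogisticRegression().fit(features, target).score(features, target)
# ...while the same linear model on the 1-D KPCA projection does much better.
acc_kpca = LogisticRegression().fit(features_kpca, target).score(features_kpca, target)
print("accuracy on raw features: %.2f" % acc_raw)
print("accuracy on KPCA feature: %.2f" % acc_kpca)
```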
LDA - Maximizing Class Separability
Unlike PCA, LDA is supervised: it uses the class labels to project the data onto the axes that maximize separation between classes, rather than the axes of maximum variance.
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = datasets.load_iris()  # the iris dataset is a common benchmark for classification experiments
features = iris.data
target = iris.target
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)
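Because LDA discriminates between classes, it can produce at most n_classes - 1 components, so for the 3-class iris data `n_components` is capped at 2. A minimal check of the projection above (the `explained_variance_ratio_` print is an illustrative addition):

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(iris.data, iris.target).transform(iris.data)

print("Original number of features:", iris.data.shape[1])
print("Reduced number of features:", features_lda.shape[1])
# Fraction of between-class variance captured by the kept discriminant axis.
print("Explained variance ratio:", lda.explained_variance_ratio_)
```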
NMF - Non-negative Matrix Factorization
NMF is an unsupervised linear dimensionality reduction method that factorizes a non-negative matrix V (n samples x d features) into two non-negative matrices, V ≈ WH, where W is n x r and H is r x d; r is the reduced number of features.
from sklearn import datasets
from sklearn.decomposition import NMF
digits = datasets.load_digits()
features = digits.data
nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)
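The two factors can be inspected directly: `fit_transform` returns W and `components_` holds H, and both stay non-negative. A sketch of that check (the `max_iter` bump is an illustrative addition to give the solver room to converge; `reconstruction_err_` is the Frobenius-norm error of V ≈ WH):

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import NMF

digits = datasets.load_digits()
V = digits.data                      # pixel intensities are non-negative, as NMF requires

nmf = NMF(n_components=10, random_state=1, max_iter=500)
W = nmf.fit_transform(V)             # (n_samples, 10)
H = nmf.components_                  # (10, n_features)

print("W shape:", W.shape, "H shape:", H.shape)
print("Reconstruction error:", nmf.reconstruction_err_)
```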
TSVD - Sparse Data
TruncatedSVD works directly on sparse feature matrices without centering them, which makes it the natural choice when the data is sparse and PCA (which densifies the data by centering) is impractical.
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
features_sparse = csr_matrix(features)
tsvd = TruncatedSVD(n_components=10)
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
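As with PCA, `explained_variance_ratio_` can guide the choice of `n_components`; here it is printed as an illustrative addition to the recipe above:

```python
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
features_sparse = csr_matrix(features)  # TruncatedSVD accepts sparse input directly

tsvd = TruncatedSVD(n_components=10)
features_sparse_tsvd = tsvd.fit_transform(features_sparse)

print("Original number of features:", features_sparse.shape[1])
print("Reduced number of features:", features_sparse_tsvd.shape[1])
print("Variance retained:", tsvd.explained_variance_ratio_.sum())
```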