
C-MAPSS Dataset Preprocessing Code


Data preprocessing code (written in Python).

The code comes from the public code released by the authors of "Variational encoding approach for interpretable assessment of remaining useful life estimation". I have modified it and cannot guarantee it is entirely correct, so please use it with care.
github: https://github.com/NahuelCostaCortez/RemainingUseful-Life-Estimation-Variational

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler
import pandas as pd

def add_remaining_useful_life(df):
    # Get the total number of cycles for each unit
    grouped_by_unit = df.groupby(by="unit_nr")
    max_cycle = grouped_by_unit["time_cycles"].max()
    
    # Merge the max cycle back into the original frame
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), left_on='unit_nr', right_index=True)
    
    # Calculate remaining useful life for each row
    remaining_useful_life = result_frame["max_cycle"] - result_frame["time_cycles"]
    result_frame["RUL"] = remaining_useful_life
    
    # drop max_cycle as it's no longer needed
    result_frame = result_frame.drop("max_cycle", axis=1)
    return result_frame

def add_operating_condition(df):
    df_op_cond = df.copy()
    
    df_op_cond['setting_1'] = abs(df_op_cond['setting_1'].round())
    df_op_cond['setting_2'] = abs(df_op_cond['setting_2'].round(decimals=2))
    
    # converting the settings to strings and concatenating them turns the operating condition into a categorical variable
    df_op_cond['op_cond'] = df_op_cond['setting_1'].astype(str) + '_' + \
                        df_op_cond['setting_2'].astype(str) + '_' + \
                        df_op_cond['setting_3'].astype(str)
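    # FD001/FD003 run under a single operating condition and FD002/FD004 under six,
    # so this yields one or six distinct op_cond values respectively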
    
    return df_op_cond

def condition_scaler(df_train, df_test, sensor_names):
    # apply operating condition specific scaling
    scaler = StandardScaler()
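    # note: the scaler is re-fitted for each training condition in turn; any test
    # rows whose op_cond never appears in the training data are left unscaled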
    for condition in df_train['op_cond'].unique():
        scaler.fit(df_train.loc[df_train['op_cond']==condition, sensor_names])
        df_train.loc[df_train['op_cond']==condition, sensor_names] = scaler.transform(df_train.loc[df_train['op_cond']==condition, sensor_names])
        df_test.loc[df_test['op_cond']==condition, sensor_names] = scaler.transform(df_test.loc[df_test['op_cond']==condition, sensor_names])
    return df_train, df_test

def exponential_smoothing(df, sensors, n_samples, alpha=0.4):
    df = df.copy()
    # first, take the exponential weighted mean
    df[sensors] = df.groupby('unit_nr')[sensors].apply(lambda x: x.ewm(alpha=alpha).mean()).reset_index(level=0, drop=True)
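    # (alpha closer to 1 means less smoothing, i.e. more weight on the newest sample)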
    
    # second, drop first n_samples of each unit_nr to reduce filter delay
    def create_mask(data, samples):
        result = np.ones_like(data)
        result[0:samples] = 0
        return result
    
    mask = df.groupby('unit_nr')['unit_nr'].transform(create_mask, samples=n_samples).astype(bool)
    df = df[mask]
    
    return df

def gen_train_data(df, sequence_length, columns):
    data = df[columns].values
    num_elements = data.shape[0]

    # -1 and +1 because of Python indexing
    for start, stop in zip(range(0, num_elements-(sequence_length-1)), range(sequence_length, num_elements+1)):
        yield data[start:stop, :]
        
def gen_data_wrapper(df, sequence_length, columns, unit_nrs=np.array([])):
    if unit_nrs.size <= 0:
        unit_nrs = df['unit_nr'].unique()
        
    data_gen = (list(gen_train_data(df[df['unit_nr']==unit_nr], sequence_length, columns))
               for unit_nr in unit_nrs)
    data_array = np.concatenate(list(data_gen)).astype(np.float32)
    return data_array

def gen_labels(df, sequence_length, label):
    data_matrix = df[label].values
    num_elements = data_matrix.shape[0]

    # -1 because we want to predict the RUL of the last row in the sequence, not the next row
    return data_matrix[sequence_length-1:num_elements, :]  

def gen_label_wrapper(df, sequence_length, label, unit_nrs=np.array([])):
    if unit_nrs.size <= 0:
        unit_nrs = df['unit_nr'].unique()
        
    label_gen = [gen_labels(df[df['unit_nr']==unit_nr], sequence_length, label) 
                for unit_nr in unit_nrs]
    label_array = np.concatenate(label_gen).astype(np.float32)
    return label_array

def gen_test_data(df, sequence_length, columns, mask_value):
    if df.shape[0] < sequence_length:
        data_matrix = np.full(shape=(sequence_length, len(columns)), fill_value=mask_value) # pad
        idx = data_matrix.shape[0] - df.shape[0]
        data_matrix[idx:,:] = df[columns].values  # fill with available data
    else:
        data_matrix = df[columns].values
        
    # yield only the last possible sequence
    stop = data_matrix.shape[0]
    start = stop - sequence_length
    yield data_matrix[start:stop, :]
        
	
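# get_data ties the whole pipeline together. A parameter summary, inferred from the
# code below rather than documented in the original post:
#   dataset         - C-MAPSS subset name, e.g. 'FD001' ... 'FD004'
#   sensors         - list of sensor columns to keep, e.g. ['s_2', 's_3', ...]
#   sequence_length - length of the sliding windows fed to the model
#   alpha           - smoothing factor for the exponential smoothing step
#   threshold       - cap for the piece-wise linear RUL target (125 is a common choice)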
def get_data(dataset, sensors, sequence_length, alpha, threshold):
    # files
    dir_path = './data/'
    train_file = 'train_'+dataset+'.txt'
    test_file = 'test_'+dataset+'.txt'
    # columns
    index_names = ['unit_nr', 'time_cycles']
    setting_names = ['setting_1', 'setting_2', 'setting_3']
    sensor_names = ['s_{}'.format(i+1) for i in range(0, 21)]
    col_names = index_names + setting_names + sensor_names
    # data readout
    train = pd.read_csv((dir_path+train_file), sep=r'\s+', header=None,
                        names=col_names)
    test = pd.read_csv((dir_path+test_file), sep=r'\s+', header=None,
                       names=col_names)
    y_test = pd.read_csv((dir_path+'RUL_'+dataset+'.txt'), sep=r'\s+', header=None,
                         names=['RemainingUsefulLife'])

    # create RUL values according to the piece-wise target function
    train = add_remaining_useful_life(train)
    train['RUL'] = train['RUL'].clip(upper=threshold)
    y_test['RemainingUsefulLife'] = y_test['RemainingUsefulLife'].clip(upper=threshold)

    # remove unused sensors
    drop_sensors = [element for element in sensor_names if element not in sensors]

    # scale with respect to the operating condition
    X_train_pre = add_operating_condition(train.drop(drop_sensors, axis=1))
    X_test_pre = add_operating_condition(test.drop(drop_sensors, axis=1))
    X_train_pre, X_test_pre = condition_scaler(X_train_pre, X_test_pre, sensors)

    # exponential smoothing
    X_train_pre = exponential_smoothing(X_train_pre, sensors, 0, alpha)
    X_test_pre = exponential_smoothing(X_test_pre, sensors, 0, alpha)

    # train-val split, grouped by unit so that all rows of a unit land on one side
    gss = GroupShuffleSplit(n_splits=1, train_size=0.80, random_state=42)
    # this loop runs exactly once (n_splits=1); train_unit and val_unit are arrays
    # of positional indices into the unique unit numbers, not unit numbers themselves
    for train_unit, val_unit in gss.split(X_train_pre['unit_nr'].unique(), groups=X_train_pre['unit_nr'].unique()):
        train_unit = X_train_pre['unit_nr'].unique()[train_unit]  # map indices back to unit numbers (units start at 1)
        val_unit = X_train_pre['unit_nr'].unique()[val_unit]

        x_train = gen_data_wrapper(X_train_pre, sequence_length, sensors, train_unit)
        y_train = gen_label_wrapper(X_train_pre, sequence_length, ['RUL'], train_unit)

        x_val = gen_data_wrapper(X_train_pre, sequence_length, sensors, val_unit)
        y_val = gen_label_wrapper(X_train_pre, sequence_length, ['RUL'], val_unit)

    # create sequences for test
    test_gen = (list(gen_test_data(X_test_pre[X_test_pre['unit_nr']==unit_nr], sequence_length, sensors, -99.))
                for unit_nr in X_test_pre['unit_nr'].unique())
    x_test = np.concatenate(list(test_gen)).astype(np.float32)

    return x_train, y_train, x_val, y_val, x_test, y_test['RemainingUsefulLife']
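
For reference, here is a minimal usage sketch; it is my addition, not part of the original script. It assumes the FD001 files (train_FD001.txt, test_FD001.txt, RUL_FD001.txt) are present under ./data/, and the 14-sensor subset, window length 30, alpha=0.1 and RUL cap 125 shown here are merely common choices from the C-MAPSS literature, not values fixed by the code:

sensors = ['s_2', 's_3', 's_4', 's_7', 's_8', 's_9', 's_11', 's_12',
           's_13', 's_14', 's_15', 's_17', 's_20', 's_21']
x_train, y_train, x_val, y_val, x_test, y_test = get_data(
    'FD001', sensors, sequence_length=30, alpha=0.1, threshold=125)
# x_train has shape (n_windows, sequence_length, len(sensors));
# x_test holds one padded window per test unit, y_test the true RUL per unit
print(x_train.shape, x_val.shape, x_test.shape)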


From: https://www.cnblogs.com/huxiaohu52/p/17448396.html
