
tflearn: dataset too large to fit in memory? Use image_preloader or an HDF5 dataset to deal with that issue


tflearn: dataset too large to fit in memory?

Hi, all!
I'm trying to train a deep net on a big dataset that doesn't fit into memory.
Is there any way to use generators to read batches into memory at every training step?
I'm looking for behaviour similar to Keras's fit_generator method.

I know that in pure TensorFlow the following snippet can be wrapped in a for loop to train on several batches:


batch_gen = generator(data)    # generator yielding (inputs, labels) batches
for step in range(num_steps):  # num_steps: however many batches to train on
    batch = next(batch_gen)    # read just one batch into memory
    sess.run([optm, loss, ...], feed_dict={X: batch[0], y: batch[1]})


 


 

aymericdamien (Owner) commented on 11 Jan 2017

That is a good idea! While this gets implemented, you can use image_preloader or an HDF5 dataset to deal with that issue.

 

Image PreLoader

tflearn.data_utils.image_preloader (target_path, image_shape, mode='file', normalize=True, grayscale=False, categorical_labels=True, files_extension=None, filter_channel=False)

Create a Python array (Preloader) that loads images on the fly (from disk or a URL). There are two ways to provide image samples, 'folder' or 'file'; see the specifications below.

'folder' mode: load images from disk, given a root folder. This folder should be arranged as follows:

ROOT_FOLDER
  -> SUBFOLDER_0 (CLASS 0)
     -> CLASS0_IMG1.jpg
     -> CLASS0_IMG2.jpg
     -> ...
  -> SUBFOLDER_1 (CLASS 1)
     -> CLASS1_IMG1.jpg
     -> ...
  -> ...

Note that if sub-folders are not integers from 0 to n_classes, an id will be assigned to each sub-folder following alphabetical order.

'file' mode: a plain text file listing every image path and class id. This file should be formatted as follows:

/path/to/img1 class_id
/path/to/img2 class_id
/path/to/img3 class_id
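If your images already sit in the class-per-subfolder layout shown above, a small script can generate that list file. A minimal sketch, assuming a dataset root named ROOT_FOLDER and an output file my_dataset.txt (both placeholder names); ids follow the same alphabetical convention as 'folder' mode:

import os

root = 'ROOT_FOLDER'  # placeholder: your dataset root
with open('my_dataset.txt', 'w') as f:
    # sorted() mirrors the alphabetical class-id assignment of 'folder' mode
    for class_id, sub in enumerate(sorted(os.listdir(root))):
        sub_dir = os.path.join(root, sub)
        for name in sorted(os.listdir(sub_dir)):
            f.write('%s %d\n' % (os.path.join(sub_dir, name), class_id))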

Note that loading images on the fly and converting them is time-inefficient, so you can instead use build_hdf5_image_dataset to build an HDF5 dataset that enables fast retrieval (this function takes similar arguments); see the sketch below.
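A minimal sketch of that route, reusing the 'file'-mode list from the example below; the output_path name my_dataset.h5 is illustrative:

from tflearn.data_utils import build_hdf5_image_dataset
import h5py

# One-off conversion: resize and store every listed image in an HDF5 file
build_hdf5_image_dataset('my_dataset.txt', image_shape=(128, 128),
                         mode='file', output_path='my_dataset.h5',
                         categorical_labels=True, normalize=True)

# h5py datasets are sliced lazily from disk, so the images never
# need to fit in memory all at once
h5f = h5py.File('my_dataset.h5', 'r')
X, Y = h5f['X'], h5f['Y']

X and Y can then be passed to model.fit just like the preloader arrays.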

Examples

# Load path/class_id image file:
dataset_file = 'my_dataset.txt'

# Build the preloader array, resize images to 128x128
from tflearn.data_utils import image_preloader
X, Y = image_preloader(dataset_file, image_shape=(128, 128),
                       mode='file', categorical_labels=True,
                       normalize=True)

# Build neural network and train
network = ...
model = DNN(network, ...)
model.fit(X, Y)

Arguments

  • target_path: str. Path of the root folder or of the images plain text file.
  • image_shape: tuple (height, width). The image shape. Images that don't match that shape will be resized.
  • mode: str in ['file', 'folder']. The data source mode. 'folder' accepts a root folder with each of its sub-folders representing a class of images to classify. 'file' accepts a single plain text file that lists every image path with its class id. Default: 'folder'.
  • categorical_labels: bool. If True, labels are converted to binary (one-hot) vectors.
  • normalize: bool. If True, normalize all pictures by dividing every image array by 255.
  • grayscale: bool. If True, images are converted to grayscale.
  • files_extension: list of str. A list of allowed image file extensions, for example ['.jpg', '.jpeg', '.png']. If None, all files are allowed.
  • filter_channel: bool. If True, images whose channel count is not 3 are filtered out.

Returns

(X, Y): with X the images array and Y the labels array.
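For completeness, 'folder' mode is the same call pointed at a root directory; a sketch assuming a ROOT_FOLDER arranged as described above:

from tflearn.data_utils import image_preloader

# class ids are assigned per sub-folder, in alphabetical order
X, Y = image_preloader('ROOT_FOLDER', image_shape=(128, 128),
                       mode='folder', categorical_labels=True,
                       normalize=True)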

Reference: https://github.com/tflearn/tflearn/issues/555

 

I tried the preloader, but it seems to have bugs. Code below:

from __future__ import division, print_function, absolute_import

import tflearn
from tflearn.data_utils import image_preloader

n = 5
train_dataset_file = '/home/lfwin/imagenet-data/raw-data/train_10c'
test_dataset_file = '/home/lfwin/imagenet-data/raw-data/validation_10c/'

X, Y = image_preloader(train_dataset_file, image_shape=(299, 299, 3), mode='folder',
                       categorical_labels=True, normalize=True)
testX, testY = image_preloader(test_dataset_file, image_shape=(299, 299, 3), mode='folder',
                               categorical_labels=True, normalize=True)

net = tflearn.input_data(shape=[None, 299, 299, 3])
net = tflearn.conv_2d(net, 16, 3, regularizer='L2', weight_decay=0.0001)
net = tflearn.residual_block(net, n, 16)
net = tflearn.residual_block(net, 1, 32, downsample=True)
net = tflearn.residual_block(net, n-1, 32)
net = tflearn.residual_block(net, 1, 64, downsample=True)
net = tflearn.residual_block(net, n-1, 64, downsample=True)
net = tflearn.batch_normalization(net)
net = tflearn.activation(net, 'relu')
net = tflearn.global_avg_pool(net)
net = tflearn.fully_connected(net, 20, activation='softmax')

mom = tflearn.Momentum(0.1, lr_decay=0.1, decay_step=32000, staircase=True)
net = tflearn.regression(net, optimizer=mom, loss='categorical_crossentropy')

model = tflearn.DNN(net, checkpoint_path='model_resnet_cifar10',
                    max_checkpoints=10, tensorboard_verbose=0,
                    clip_gradients=0.)
model.fit(X, Y, n_epoch=1, validation_set=(testX, testY),
          snapshot_epoch=False, snapshot_step=500,
          show_metric=True, batch_size=16, shuffle=True,
          run_id='resnet_imagenet')

During debugging, the following bugs appeared:

Momentum | epoch: 000 | loss: -717286.50000 - acc: -30021.9395 -- iter: 00032/26000

...run_id=run_id)
  File "/home/lfwin/hello/tflearn/tflearn/helpers/trainer.py", line 289, in fit
    show_metric)
  File "/home/lfwin/hello/tflearn/tflearn/helpers/trainer.py", line 706, in _train
    feed_batch)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 619, in _run
    np_val = np.array(subfeed_val, dtype=subfeed_dtype)
ValueError: could not broadcast input array from shape (299,299,3) into shape (299,299)

Two problems:

1. epoch is always 000, and loss is NaN after the 1st step.
2. broadcasting error from shape (299,299,3) into shape (299,299).

lfwin referenced this issue on 9 Aug 2016: #262 (Open)

 

pankap commented on 15 Oct 2016

I thought posting this might help you: I came across the same ValueError: could not broadcast input array from shape (x,x,3) into shape (x,x) when I tried to load the Caltech 101 images using build_image_dataset_from_dir (specifically, at arrs[i] = np.array(arr) in the shuffle method). I identified the root cause to be some 8-bit grayscale JPG files in the dataset. Converting the files from grayscale to 24-bit RGB, using an external utility that I wrote, solved the issue. I am not sure whether an in-memory conversion to RGB using PIL would create the proper 3-byte JPEG format.
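For what it's worth, the conversion pankap describes can be sketched with PIL/Pillow as a preprocessing pass that rewrites offending files in place; dataset_root and the extension filter here are illustrative, not part of the original comment:

import os
from PIL import Image

dataset_root = '/path/to/caltech101'  # illustrative path
for dirpath, _, filenames in os.walk(dataset_root):
    for name in filenames:
        if not name.lower().endswith(('.jpg', '.jpeg')):
            continue
        path = os.path.join(dirpath, name)
        img = Image.open(path)
        if img.mode != 'RGB':              # e.g. 8-bit grayscale ('L')
            img.convert('RGB').save(path)  # rewrite as 24-bit RGB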

 


 

0fork commented on 24 May 2017

I ran into this too and solved it by using tflearn.reshape(net, new_shape=[-1, 300, 300, 1]) after input_data. My problem was that grayscale=True with image_preloader produced a (300, 300) sample shape, so conv_2d wasn't my friend anymore, and I didn't find any way to use a normal np.reshape with an image_preloader instance. Now everything gets jammed nicely into the right shape.
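Put together, that workaround looks roughly like this; a sketch assuming a 'file'-mode list my_dataset.txt of 300x300 grayscale images, with the rest of the network elided:

import tflearn
from tflearn.data_utils import image_preloader

# grayscale=True yields (300, 300) samples with no channel axis
X, Y = image_preloader('my_dataset.txt', image_shape=(300, 300),
                       mode='file', grayscale=True,
                       categorical_labels=True, normalize=True)

net = tflearn.input_data(shape=[None, 300, 300])         # matches the samples
net = tflearn.reshape(net, new_shape=[-1, 300, 300, 1])  # add the channel axis conv_2d expects
net = tflearn.conv_2d(net, 16, 3, activation='relu')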

From: https://blog.51cto.com/u_11908275/6953693
