# Python Data Reader Design Doc
At training and testing time, PaddlePaddle programs need to read data. To ease users' work of writing data-reading code, we define the following:

- A *reader* is a function that reads data (from a file, network, random number generator, etc.) and yields data items.
- A *reader creator* is a function that returns a reader function.
- A *reader decorator* is a function that accepts one or more readers and returns a reader.

We also provide frequently used reader creators and reader decorators.
## Data Reader Interface
Indeed, a data reader doesn't have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in `for x in iterable`):

```python
iterable = data_reader()
```

Elements produced from the iterable should be a single entry of data, not a mini batch. An entry of data could be a single item or a tuple of items. Items should be of a supported type (e.g., a numpy 1D array of float32, an int, a list of ints).
An example implementation of a single-item data reader creator:

```python
import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader
```
An example implementation of a multiple-item data reader creator:

```python
import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader
```
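For illustration, a reader produced by one of these creators can be consumed like any other iterable. Here is a minimal sketch (the `itertools.islice` call is only there to limit the infinite stream; it is not part of the interface):

```python
import itertools

reader = reader_creator_random_image_and_label(28, 28, label=0)

# Take the first 3 entries from the (infinite) reader.
for image, label in itertools.islice(reader(), 3):
    print(image.shape, label)
```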
## Usage
The data reader, the mapping from item(s) read to data layers, the batch size, and the total number of passes will be passed into `paddle.train`:

```python
# two data layers are created:
image_layer = paddle.layer.data("image", ...)
label_layer = paddle.layer.data("label", ...)
# ...
paddle.train(paddle.dataset.mnist, {"image":0, "label":1}, 128, 10, ...)
```
## Data Reader Decorator
A *data reader decorator* takes one or more data readers and returns a new data reader. It is similar to a Python decorator, but it does not use the `@` syntax.

Since we have a strict interface for data readers (no parameters, yield a single data entry), data readers can be used flexibly via data reader decorators. Following are a few examples:
### Prefetch Data
Since reading data may take time and training can not proceed without data, it is generally a good idea to prefetch data.

Use `paddle.reader.buffered` to prefetch data:

```python
buffered_reader = paddle.reader.buffered(paddle.dataset.mnist, 100)
```

`buffered_reader` will try to buffer (prefetch) `100` data entries.
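For intuition, a decorator like this could be implemented with a background thread that fills a bounded queue. The following is a minimal sketch under that assumption; it is not the actual `paddle.reader.buffered` implementation:

```python
import queue
import threading

def buffered(reader, size):
    # A sketch of a buffering decorator: a background thread eagerly pulls
    # entries from the wrapped reader into a bounded queue of `size` entries.
    end = object()  # sentinel marking the end of the stream

    def data_reader():
        q = queue.Queue(maxsize=size)

        def producer():
            for item in reader():
                q.put(item)  # blocks when the buffer is full
            q.put(end)

        t = threading.Thread(target=producer)
        t.daemon = True
        t.start()

        while True:
            item = q.get()
            if item is end:
                break
            yield item

    return data_reader
```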
### Compose Multiple Data Readers
For example, we may want to use a source of real images (reusing the MNIST dataset) and a source of random images as inputs for a Generative Adversarial Network. We can do:
```python
import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

def reader_creator_bool(t):
    def reader():
        while True:
            yield t
    return reader

true_reader = reader_creator_bool(True)
false_reader = reader_creator_bool(False)

reader = paddle.reader.compose(paddle.dataset.mnist, reader_creator_random_image(20, 20), true_reader, false_reader)
# Index 1 is skipped because paddle.dataset.mnist produces two items per data entry,
# and we don't care about the second item at this time.
paddle.train(reader, {"true_image":0, "fake_image": 2, "true_label": 3, "false_label": 4}, ...)
```
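For intuition, a `compose`-style decorator could draw one entry from each reader in lockstep and concatenate their items. The sketch below assumes each constituent reader yields either a single item or a tuple of items; it is not the actual `paddle.reader.compose` implementation:

```python
def compose(*readers):
    # Treat a tuple as multiple items and anything else as a single item.
    def flatten(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        # zip stops as soon as the shortest (e.g., finite) reader is exhausted.
        for entries in zip(*[r() for r in readers]):
            yield sum((flatten(e) for e in entries), ())
    return reader
```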
### Shuffle
Given a shuffle buffer size `n`, `paddle.reader.shuffle` will return a data reader that buffers `n` data entries and shuffles them before a data entry is read.

Example:

```python
reader = paddle.reader.shuffle(paddle.dataset.mnist, 512)
```
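A minimal sketch of such a buffered-shuffle decorator (again an illustration, not the actual Paddle implementation):

```python
import random

def shuffle(reader, buf_size):
    def data_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            # Once the buffer is full, shuffle it and drain it.
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for b in buf:
                    yield b
                buf = []
        # Shuffle and drain whatever is left at the end of the stream.
        random.shuffle(buf)
        for b in buf:
            yield b
    return data_reader
```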
## Q & A
### Why return only a single entry, but not a mini batch?
If a mini batch were returned, the data reader would need to take care of the batch size. But batch size is a concept of training; it makes more sense for the user to specify batch size as a parameter of `train`.

Practically, always returning a single entry makes reusing existing data readers much easier (e.g., if an existing reader returned 3 entries at a time instead of a single entry, the training code would be more complex, because it would need to handle cases like a batch size of 2).
### Why use a dictionary but not a list to provide mapping?
We decided to use a dictionary (`{"image":0, "label":1}`) instead of a list (`["image", "label"]`) because the user can easily reuse an item (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or skip an item (e.g., using `{"image_a":0, "label":2}`).
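For illustration, the trainer could use this mapping to route the items of each entry to named data layers, roughly like the following (a hypothetical sketch, not Paddle's actual code):

```python
entry = ("image-bytes", 7)  # one entry yielded by a reader
mapping = {"image_a": 0, "image_b": 0, "label": 1}

# Each layer name picks the item at its mapped index; indices may repeat,
# which is how the same item can be reused by several layers.
feed = {name: entry[idx] for name, idx in mapping.items()}
# feed == {"image_a": "image-bytes", "image_b": "image-bytes", "label": 7}
```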
### How to create a custom data reader creator
```python
import numpy

def image_reader_creator(image_path, label_path, n):
    def reader():
        # Open in binary mode, since numpy.fromfile reads raw bytes.
        f = open(image_path, "rb")
        l = open(label_path, "rb")
        images = numpy.fromfile(
            f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
        images = images / 255.0 * 2.0 - 1.0
        labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
        for i in range(n):
            yield images[i, :], labels[i]  # a single entry of data is created each time
        f.close()
        l.close()
    return reader

# image_reader_creator creates a reader:
reader = image_reader_creator("/path/to/image_file", "/path/to/label_file", 1024)
paddle.train(reader, {"image":0, "label":1}, ...)
```
### How is `paddle.train` implemented?

An example implementation of `paddle.train` could be:
```python
import itertools

def make_minibatch(reader, minibatch_size):
    def ret():
        r = reader()
        # Group entries into lists of up to `minibatch_size`; the final
        # mini batch may be smaller if the reader is exhausted.
        while True:
            buf = list(itertools.islice(r, minibatch_size))
            if not buf:
                break
            yield buf
    return ret

def train(reader, mapping, batch_size, total_pass):
    for pass_idx in range(total_pass):
        for mini_batch in make_minibatch(reader, batch_size)():  # this loop will never end in online learning.
            do_forward_backward(mini_batch, mapping)
```