21/10: structured data in numpy

NumPy supports structured arrays, which are the nearest thing to R's data.frame class. Data are organized into fields and records. Each field (column) has a name and a data type, and each record (row) has a value for every field. Columns are indexed by name, and rows are indexed by integers. Recarray objects can be generated from nested Python iterables using numpy.rec.fromrecords:


>>> import numpy
>>> D = [('fair',6.0,1), ('good',12,2)]
>>> D = numpy.rec.fromrecords(D, names='quality,price,size')
>>> D

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
>>> D['quality']

rec.array(['fair', 'good'],
dtype='|S4')

>>> D[0]

('fair', 6.0, 1)


Note that the 'price' field has a float data type: one of the records has a float value, so the whole field is promoted to the most general data type. For more precise control over field data types, fromrecords() takes a formats argument, which is a comma-delimited list of format strings. For instance, to force 'price' to be an integer, call D = numpy.rec.fromrecords(D, names='quality,price,size', formats='S4,i4,i4').
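
As a small sketch of what the formats argument changes (assuming Python 3 and a recent NumPy, where an inferred string field comes out as unicode rather than '|S4'; the names records, inferred and forced are only illustrative):

```python
import numpy as np

records = [('fair', 6.0, 1), ('good', 12, 2)]

# No formats given: each field's type is inferred from the data,
# so 'price' is promoted to a float field.
inferred = np.rec.fromrecords(records, names='quality,price,size')

# formats given: a 4-byte string field followed by two 32-bit integer fields.
forced = np.rec.fromrecords(records, names='quality,price,size',
                            formats='S4,i4,i4')

print(inferred.dtype)
print(forced.dtype)
print(forced['price'])
```

Comparing the two printed dtypes shows which fields were inferred and which were pinned by the format strings.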

For reading and writing recarrays, use matplotlib.mlab.rec2csv() and matplotlib.mlab.csv2rec(). The output format of each field can be specified with a dictionary (the formatd argument). Both functions take a number of arguments that control how the data are read or written (e.g. the delimiter, or whether the first row is a list of field names), most of which are documented. The rec2csv() function always writes the field names as a header row. To avoid this behavior, or to avoid a dependency on matplotlib, use numpy.savetxt() and numpy.loadtxt():


>>> from matplotlib import mlab

>>> formatd = {'quality' : mlab.FormatString(), 'price' : mlab.FormatFloat(2),}
>>> mlab.rec2csv(D, 'test.csv', formatd=formatd)
>>> mlab.csv2rec('test.csv')

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
>>> numpy.savetxt('test.csv', D, delimiter=',', fmt=('%s','%3.2f','%d'))
>>> numpy.loadtxt('test.csv', delimiter=',', dtype={'names': ('quality','price','size'), 'formats' : ('S4', 'f8', 'i4')})

array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
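
A matplotlib-free round trip can also keep a header row and let NumPy guess the column types. The sketch below is one possible way to do it, not the only one. It assumes Python 3 and a reasonably recent NumPy, uses a unicode 'U4' field so that '%s' writes plain text, and relies on the header and comments arguments of numpy.savetxt() together with numpy.genfromtxt(). The temporary file and the variable names are illustrative.

```python
import os
import tempfile

import numpy as np

# Structured array with a unicode field so '%s' writes 'fair', not "b'fair'".
data = np.array([('fair', 6.0, 1), ('good', 12.0, 2)],
                dtype=[('quality', 'U4'), ('price', 'f8'), ('size', 'i4')])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.csv')

    # header= writes the field names as the first line; comments='' drops
    # the '# ' prefix that savetxt would otherwise put in front of it.
    np.savetxt(path, data, delimiter=',', fmt=('%s', '%.2f', '%d'),
               header=','.join(data.dtype.names), comments='')

    # names=True takes the field names from the header line;
    # dtype=None asks genfromtxt to infer a type for each column.
    loaded = np.genfromtxt(path, delimiter=',', names=True,
                           dtype=None, encoding='utf-8')

print(loaded.dtype.names)
print(loaded['price'])
```

Because the types are inferred, the integer column comes back with the platform's default integer size rather than necessarily 'i4'. Pass an explicit dtype, as in the loadtxt() call above, when the exact field types matter.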

Comments

ep wrote:

Thank you so very much. It was impossible to find what the standard was...

Is mlab the fastest read-in function around?
21/07 23:29:59

wrote:

I haven't benchmarked it or anything, but I imagine it's not the fastest out there because it has to figure out what kinds of values are in each column. If you're dealing with massive quantities of data you're probably better off writing a custom function.
27/07 12:29:48