August 11, 2017

Pandas DataFrame: Sequential Operation vs. Matrix Operation

Recently, I have been working with Pandas on a data science project. The dataframe I have contains time (in minutes) and weight data. At the start of the project, I wanted to process the data day by day, draw a histogram of the weights for each day, and find the peak values of each histogram.

At first, the obvious approach is a sequential one: use a for loop to go through the data day by day, and process each day's data in turn.
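
As a rough sketch of that sequential approach, the loop could look something like the following. The column names timestamp and value, and the synthetic data, are assumptions made for illustration only; they are not the project's real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data: one reading per minute for 3 days.
rng = np.random.default_rng(0)
n_minutes = 3 * 24 * 60
df = pd.DataFrame({
    'timestamp': pd.date_range('2017-08-01', periods=n_minutes, freq='min'),
    'value': rng.normal(loc=50.0, scale=5.0, size=n_minutes),
})

daily_histograms = {}
# Sequential version: visit each calendar day and process its rows in turn.
days = df['timestamp'].dt.normalize()
for day in days.unique():
    rows = df[days == day]
    # 20 equal-width bins over that day's values.
    counts, edges = np.histogram(rows['value'], bins=20)
    daily_histograms[pd.Timestamp(day)] = (counts, edges)
```

Every day needs its own filtering step inside the loop, which is the bookkeeping that the resample-based version avoids.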

However, after I had finished the code, a friend suggested that I try a different way of solving the problem: using matrix operations. The basic idea is that we can split the current matrix into sub-matrices, and then carry out the data processing within each sub-matrix.

The main function I use is resample in Pandas. Operations become much easier with this function. How does resample work? Here is an easy example from the documentation:

>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df.resample('3T', on='time').sum()
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9

In the above program, we set the rule parameter to 3T and pass on=‘time’, so that the ‘time’ column is used for resampling instead of the index. The program resamples the dataframe into three-minute bins, and computes the sum of the data in each bin. For a more complex, user-defined function, we need to use the ‘apply’ function.
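
To illustrate the user-defined case, here is a minimal sketch that applies a custom function to each bin. The value column, the sample numbers, and the value_range helper are all made up for this example. It uses 'min' for minute frequency, which means the same as 'T' and is the spelling newer pandas releases prefer.

```python
import pandas as pd

# Nine one-minute readings (illustrative values only).
df = pd.DataFrame({
    'time': pd.date_range('2000-01-01', periods=9, freq='min'),
    'value': [1, 5, 2, 7, 3, 4, 0, 9, 6],
})

def value_range(group):
    # Custom per-bin statistic: spread between the largest and smallest value.
    return group.max() - group.min()

# Resample on the 'time' column into three-minute bins, then run the
# custom function on the 'value' readings that fall into each bin.
spread = df.resample('3min', on='time')['value'].apply(value_range)
```

Selecting a single column before calling apply means the function receives one Series per bin, which keeps the custom function simple.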

I refactored my previous program from sequential operations to matrix operations, which cut the code by more than half and made the logic clearer.

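import numpy as np
import pandas as pd
from scipy import signal

def read_sensor(path):
    # Parse 'timestamp' as datetimes so that resample(on='timestamp') can use it.
    return pd.read_csv(path, index_col=0, parse_dates=['timestamp'])

def get_single(day):
    # Called once per day with that day's rows; skip days without readings.
    if day.empty:
        return 0
    print(str(day.iloc[0].timestamp)[0:10])
    # Split the day's values into 20 equal-width bins and count each bin.
    categories = pd.cut(day.value, 20)
    histogram = day.value.groupby(categories, observed=False).count().to_frame('gcount')
    # Use the midpoint of each bin as its representative value.
    histogram['meanvalue'] = histogram.index.categories.mid
    # Locate peaks in the bin counts with a wavelet-based peak finder.
    peakind = np.asarray(signal.find_peaks_cwt(histogram.gcount.values, np.arange(1, 10)), dtype=int)
    # Look up the bin values for the detected peaks (offset by one bin).
    print(histogram.meanvalue.iloc[peakind - 1])
    print('---')
    return 1

sensor = read_sensor('test.csv')
sensor.resample('1D', on='timestamp').apply(get_single)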