August 11, 2017

# Pandas Dataframe: Sequential Operation vs. Matrix Operation

Recently, I have been working with Pandas on a data science project. The dataframe I have contains time (in minutes) and weight data. At the start of the project, I wanted to process the data day by day, draw a histogram of weight for each day, and find the peak values of each histogram.

At first, a sequential solution comes to mind easily: use a `for` loop to go through the data day by day and process each day in turn.
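As a rough illustration of the sequential approach, here is a minimal sketch. The sample data and the helper `busiest_bin_center` are made up for this example, and the column names `timestamp` and `value` are assumptions. It only picks the fullest histogram bin per day, which is simpler than real peak finding.

``````python
import numpy as np
import pandas as pd

# Hypothetical sample data: one reading per minute over two days.
rng = np.random.RandomState(0)
n_rows = 2 * 24 * 60
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-08-01", periods=n_rows, freq="min"),
    "value": rng.normal(loc=50.0, scale=5.0, size=n_rows),
})


def busiest_bin_center(values, bins=20):
    """Return the center of the histogram bin that holds the most values."""
    counts, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[np.argmax(counts)]


# Sequential version: visit the calendar days one at a time.
days = df["timestamp"].dt.normalize()  # midnight of each row's day
results = {}
for day in days.unique():
    day_values = df.loc[days == day, "value"]
    results[pd.Timestamp(day)] = busiest_bin_center(day_values)
``````

Each pass through the loop filters the whole dataframe again, which is part of what makes this version feel clumsy.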

However, after I had finished the code, a friend suggested I try a different approach: matrix operations. The basic idea is that we can obtain new matrices by taking sub-matrices of the current matrix, and then process the data within each sub-matrix.
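To make the sub-matrix idea concrete, here is a small sketch that uses `groupby` to split a dataframe into one sub-dataframe per day. The data and the column names `timestamp` and `value` are assumptions for illustration only.

``````python
import pandas as pd

# Hypothetical data: six readings spread over two days.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-01 09:00", "2017-08-01 12:00", "2017-08-01 18:00",
        "2017-08-02 08:00", "2017-08-02 13:00", "2017-08-02 20:00",
    ]),
    "value": [61.0, 62.5, 61.8, 70.2, 69.9, 70.4],
})

# Group rows by calendar day; each group is a sub-dataframe of df.
by_day = df.groupby(df["timestamp"].dt.normalize())

# The sub-dataframes can be inspected directly...
sub_frames = {day: sub for day, sub in by_day}

# ...or an operation can be applied to all of them in one call.
daily_mean = by_day["value"].mean()
``````

The point is that the splitting is done once by the library, and the per-day work is expressed as a single operation on the groups.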

The main function I use is `resample` in Pandas. Operations become much easier with this function. How does `resample` work? The documentation gives an easy example:

``````
>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df.resample('3T', on='time').sum()
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9
``````

In the above program, we set the `rule` parameter to `3T` (three minutes, since `T` stands for minutes) and pass `on='time'` so that resampling is based on the `time` column instead of the index. The program groups the dataframe into three-minute bins and computes the sum of each column within each bin. For a more complex user-defined function, we need to use `apply`.
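As a small sketch of `resample` combined with `apply`, the example below computes a user-defined statistic for each day. The data, the column names `timestamp` and `value`, and the function `value_range` are all assumptions made up for this illustration.

``````python
import pandas as pd

# Hypothetical data: one reading every 6 hours over three days.
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-08-01", periods=12,
                               freq=pd.Timedelta(hours=6)),
    "value": [1.0, 4.0, 2.0, 3.0, 10.0, 12.0, 11.0, 15.0, 5.0, 5.0, 7.0, 9.0],
})


def value_range(values):
    """User-defined aggregation: largest reading minus smallest reading."""
    return values.max() - values.min()


# Resample into daily bins on the timestamp column, then apply the
# custom function to the values that fall into each bin.
daily_range = df.resample("D", on="timestamp")["value"].apply(value_range)
``````

Here `daily_range` is a Series indexed by day, with one result of `value_range` per daily bin.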

I refactored my previous program from sequential operations to matrix operations, which cut the code by more than half and made the logic clearer.

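``````
def read_sensor(i):

    def get_single(array_like):
        # Print the date (YYYY-MM-DD) of the first row in this day's data
        print(str(array_like.iloc[0].timestamp)[0:10])
        # Split the day's values into 20 equal-width bins
        categories, bins = pd.cut(array_like.value, 20, retbins=True)
        # Count the values in each bin to build the histogram
        histogram = array_like.value.groupby(categories).apply(lambda g: {'gcount': g.count()}).unstack()
        # Use each bin's midpoint as its representative value
        histogram['meanvalue'] = (histogram.index.categories.left + histogram.index.categories.right) / 2
        # print(histogram)
        from scipy import signal
        # Locate histogram peaks with a continuous wavelet transform
        peakind = signal.find_peaks_cwt(histogram.gcount, np.arange(1, 10))
        print(histogram.meanvalue.iloc[peakind - 1])
        print('---')
        return 1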