Recently, I have been working with Pandas on a data science project. The dataframe I have contains timestamps (in minutes) and weights. At the start of the project, I want to process the data day by day, draw a histogram of the weights for each day, and find the peak values of each histogram.
At first, it’s easy to think of a sequential solution: use a for loop to go through the data day by day and process each day's data in turn.
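As a rough sketch of what that sequential approach can look like, here is a small loop over made-up data. The column names `timestamp` and `value`, the synthetic readings, and the choice of "midpoint of the fullest bin" as a stand-in for the peak are all assumptions for illustration, not the project's real data or peak detection.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one weight reading per minute for three days.
rng = np.random.default_rng(0)
n_rows = 3 * 24 * 60
df = pd.DataFrame({
    "timestamp": pd.date_range("2000-01-01", periods=n_rows, freq="min"),
    "value": rng.normal(loc=70.0, scale=5.0, size=n_rows),
})

daily_peaks = {}
# Sequential approach: visit one calendar day at a time.
day_of_row = df["timestamp"].dt.normalize()
for day in day_of_row.unique():
    day_rows = df[day_of_row == day]
    # 20-bin histogram of that day's weights.
    counts, edges = np.histogram(day_rows["value"], bins=20)
    # Simplification: treat the midpoint of the fullest bin as the peak.
    top = counts.argmax()
    daily_peaks[pd.Timestamp(day)] = (edges[top] + edges[top + 1]) / 2

for day, peak in daily_peaks.items():
    print(day.date(), round(peak, 2))
```

The loop works, but the filtering and bookkeeping for each day are written by hand, which is the part that the matrix-style approach removes.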
However, after I finished the code, a friend of mine suggested I try a different approach: matrix operations. The basic idea is that we can get new matrices by taking sub-matrices of the current matrix, so the processing for each day happens within the sub-matrix for that day.
The main function I use is resample in Pandas. Operations become much easier with this function. How does resample work? The documentation gives an easy example:
>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df.resample('3T', on='time').sum()
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9
In the program above, we set the rule parameter to 3T and use on='time' to choose the column that defines the time bins; sum() is then applied to the remaining columns. The program resamples the dataframe into three-minute bins (T stands for minutes) and computes the sum of the data in each bin. For a more complex user-defined function, we can use apply instead of sum.
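To make the difference between a built-in aggregation and apply concrete, here is a minimal sketch on a tiny made-up dataframe. The `weight` column and the `value_range` helper are hypothetical names chosen for the example. It uses the alias "3min" rather than "3T", since newer pandas releases deprecate the "T" spelling; both mean three-minute bins.

```python
import pandas as pd

# Nine one-minute rows with weights 0..8 (made-up data).
df = pd.DataFrame({
    "time": pd.date_range("2000-01-01", periods=9, freq="min"),
    "weight": range(9),
})

def value_range(column):
    # Hypothetical user-defined function: spread of the values in one bin.
    return column.max() - column.min()

# Built-in aggregation: sum of the weights in each three-minute bin.
sums = df.resample("3min", on="time")["weight"].sum()

# User-defined aggregation: apply() calls value_range once per bin.
spread = df.resample("3min", on="time")["weight"].apply(value_range)

print(sums.tolist())    # [3, 12, 21]
print(spread.tolist())  # [2, 2, 2]
```

Each bin holds three consecutive rows, so the sums are 0+1+2, 3+4+5 and 6+7+8, and the spread inside every bin is 2.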
I refactored my previous program from sequential operations to matrix operations, which cut the code by more than half and made the logic clearer.
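import numpy as np
import pandas as pd
from scipy import signal

def read_sensor(i):
    # Load the readings and parse 'timestamp' so resample can bin on it
    data = pd.read_csv('test.csv', index_col=0, parse_dates=['timestamp'])
    return data

def get_single(array_like):
    # Print the date (YYYY-MM-DD) of the day being processed
    print(str(array_like.iloc[0].timestamp)[0:10])
    # Cut the day's values into 20 equal-width bins and count each bin
    categories, bins = pd.cut(array_like.value, 20, retbins=True)
    histogram = array_like.value.groupby(categories, observed=False).count().to_frame('gcount')
    # Midpoint of each bin
    histogram['meanvalue'] = (histogram.index.categories.left + histogram.index.categories.right) / 2
    # Locate peaks in the bin counts with a wavelet-based peak finder
    peakind = np.asarray(signal.find_peaks_cwt(histogram.gcount.values, np.arange(1, 10)))
    print(histogram.meanvalue.iloc[peakind - 1])
    print('---')
    return 1

sensor = read_sensor(sensor_name)
sensor.resample('1D', on='timestamp').apply(get_single)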
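Since the program above depends on test.csv, here is a self-contained sketch of the same idea on synthetic data. The two-cluster readings and the `count_peaks` helper are assumptions made for the example, and it returns only the number of detected peaks per day (a simplification) so that each day reduces to a single value.

```python
import numpy as np
import pandas as pd
from scipy import signal

def count_peaks(values, bins=20):
    # Histogram one day's values, then count the peaks that the
    # wavelet-based detector finds in the bin counts.
    counts, _ = np.histogram(values, bins=bins)
    peak_idx = signal.find_peaks_cwt(counts, np.arange(1, 10))
    return len(peak_idx)

# Hypothetical readings: two days of per-minute weights drawn from
# two clusters (around 60 and around 80), shuffled together.
rng = np.random.default_rng(1)
n = 2 * 24 * 60
values = rng.permutation(
    np.concatenate([rng.normal(60, 2, n // 2), rng.normal(80, 2, n // 2)])
)
series = pd.Series(
    values,
    index=pd.date_range("2000-01-01", periods=n, freq="min"),
)

# One call replaces the explicit loop: each day's values go to count_peaks.
peaks_per_day = series.resample("1D").apply(count_peaks)
print(peaks_per_day)
```

How many peaks are reported depends on the widths passed to find_peaks_cwt and on the number of bins, so those two settings are worth tuning against real data.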