While `pd.apply()` works for small datasets, like the example from the docs:

```python
df['ADJUSTED'] = df.apply(lambda x: cpi.inflate(x.MEDIAN_HOUSEHOLD_INCOME, x.YEAR), axis=1)
```

it quickly falls apart when inflating long series, because it inflates each value one at a time instead of taking advantage of `numpy` and `pandas` vectorization.
`cpi` can already handle `numpy` arrays and has both `pandas` and `numpy` as dependencies (100,000,000 rows in less than 2 seconds, pretty cool).
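That speed claim is easy to sanity-check: the inflation arithmetic itself is a single vectorized expression. A minimal sketch with made-up index values (not real CPI data), using a smaller array for a quick run:

```python
import numpy as np

# Made-up index values -- not real CPI data; n shrunk for a quick run.
n = 1_000_000
values = np.random.uniform(10_000, 100_000, n)   # incomes to adjust
source_index = np.random.uniform(150, 250, n)    # index at each value's date
target_index = 245.12                            # index of the target period

# The same math cpi.inflate() does per value, applied to the whole array at once.
adjusted = values * target_index / source_index
```

The multiplication and division broadcast over the whole array in compiled code, which is why the row count barely matters; the expensive part is getting `source_index` filled in, which is what the rest of this issue is about.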
The problem:
`cpi` takes `year_or_month` as either an `int` or a `date` object and retrieves the corresponding `source_index` from `cpi.db`. As far as I understand, this lookup would need to be done for every item in the array, so it would still be very time-consuming for very large datasets.
The solution:
I still don't have a solid solution, but one way to approach this could be:
- receive a `numpy` array of dates for `year_or_month`
  - clean it so they all have 01 as the day of the month
- grab the unique values in this array of dates
  - even if you have 100,000,000 rows, you definitely don't have 100,000,000 different year-month combinations
  - BLS' data goes back to 1913 (1913 through 2017 is 105 years, 105 * 12 = 1260 months, plus 10 months of 2018 as of now, so about 1270 unique values at most)
- create a `numpy` array of those values matching their date (or a `dict()` to later use `.map()` on the dates array)
- map the `source_index` values to the array of dates
  - look up the CPI value for each of those unique dates and map it back to the original `numpy` array of dates
- `cpi.inflate()` already just multiplies `(value * target_index) / float(source_index)`; `numpy` will take care of the rest
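The steps above could be sketched roughly like this. `get_index` here is a hypothetical stand-in for whatever single-date lookup `cpi` does against its database; everything else is plain `numpy`:

```python
import numpy as np

def vectorized_inflate(values, dates, target_index, get_index):
    # `get_index` is a hypothetical callable: numpy datetime64[M] -> CPI index.
    # Clean the dates so they all fall on the first of the month.
    months = np.asarray(dates, dtype='datetime64[M]')
    # Grab the unique year-month combinations (a few hundred at most)
    # and remember where each original date belongs.
    unique_months, inverse = np.unique(months, return_inverse=True)
    # Look up the CPI index once per unique month...
    unique_index = np.array([get_index(m) for m in unique_months], dtype=float)
    # ...and broadcast it back onto the full-length array of dates.
    source_index = unique_index[inverse]
    # cpi.inflate() already just multiplies; numpy takes care of the rest.
    return np.asarray(values, dtype=float) * target_index / source_index
```

With this shape, the database is hit once per unique month instead of once per row, e.g. `vectorized_inflate(incomes, income_dates, 251.1, get_index)` for 100,000,000 rows would do at most ~1,270 lookups.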
Even though most likely one would be inflating values to one specific year or month, this method could be applied to both `year_or_month` and `to`, inflating a series of values from a series of dates to a different series of dates.
The use:
The particular use I came up with was normalizing different types of income from public-use microdata. For example, say I go to IPUMS and grab ACS data from 2000-2016 for incomes (earned wages, household income, farm income, social security, etc.). There are only 17 distinct years, but if I use `pd.apply()` it goes row by row and would simply never end.
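With so few distinct years, the same idea works directly on a DataFrame: compute one inflation factor per unique year and broadcast it with `.map()`. A hedged sketch with made-up column names and a stubbed-out factor (`adjust_factor` stands in for what `cpi.inflate(1, year)` would return, the multiplier bringing one dollar of that year up to the target year):

```python
import numpy as np
import pandas as pd

# Made-up microdata: a million rows, but only 17 distinct years.
df = pd.DataFrame({
    'YEAR': np.random.choice(range(2000, 2017), size=1_000_000),
    'HHINCOME': np.random.uniform(5_000, 150_000, size=1_000_000),
})

# One lookup per unique year -- 17 calls instead of 1,000,000.
# Stub: pretend each year inflates by a flat 2% per year to 2016.
adjust_factor = {year: 1.0 + 0.02 * (2016 - year) for year in df['YEAR'].unique()}

# Broadcast the per-year factors back onto every row at once.
df['ADJUSTED'] = df['HHINCOME'] * df['YEAR'].map(adjust_factor)
```

This finishes in a fraction of a second on the whole frame, versus the per-row `apply` that never ends.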
I don't have experience with sqlite, so I couldn't put together a proof of concept, but I hope this explanation is helpful.