First of all, thank you very much for this package. It is very convenient to be able to debug the code for a Delta Live Tables pipeline without having to run the entire thing. However, I am currently running into a problem that I find hard to resolve.
@dlt.table(name = "customer_order_silver_v2")
def capping_unitPrice_Qt():
    df = dlt.read("customer_order_silver")
    boundary_unit = [0,0]
    boundary_qty = [0,0]
    boundary_unit = df.select(col("UnitPrice")).approxQuantile('UnitPrice',[0.05,0.95], 0.25)
    boundary_qty = df.select(col("Quantity")).approxQuantile('Quantity',[0.05,0.95], 0.25)
    print(boundary_unit)
    print(boundary_unit[0])
    print(boundary_unit[1])
    df = df.withColumn('UnitPrice', F.when(col('UnitPrice') > boundary_unit[1], boundary_unit[1])
                                     .when(col('UnitPrice') < boundary_unit[0], boundary_unit[0])
                                     .otherwise(col('UnitPrice')))
    df = df.withColumn('Quantity', F.when(col('Quantity') > boundary_qty[1], boundary_qty[1])
                                    .when(col('Quantity') < boundary_qty[0], boundary_qty[0])
                                    .otherwise(col('Quantity')))
    return df
When I run the code for this DLT, approxQuantile() does not seem to work. This is what I get after running it:
# this way of writing might be too complex. An alternative solution is to write the DLT as a general function and then pass it as a function.
@dlt.create_table(name = "customer_order_silver_v2")
@dltwithdebug(globals())
# @dlt.table(name = "customer_order_silver_v2")
def capping_unitPrice_Qt():
    df = dlt.read("wtchk_customer_order_filtered")
    boundary_unit = [0,0]
    boundary_qty = [0,0]
    boundary_unit = df.select(col("UnitPrice")).approxQuantile('UnitPrice',[0.05,0.95], 0.25)
    boundary_qty = df.select(col("Quantity")).approxQuantile('Quantity',[0.05,0.95], 0.25)
    print(boundary_unit)
    print(boundary_unit[0])
    print(boundary_unit[1])
    df = df.withColumn('UnitPrice', F.when(col('UnitPrice') > boundary_unit[1], boundary_unit[1])
                                     .when(col('UnitPrice') < boundary_unit[0], boundary_unit[0])
                                     .otherwise(col('UnitPrice')))
    df = df.withColumn('Quantity', F.when(col('Quantity') > boundary_qty[1], boundary_qty[1])
                                    .when(col('Quantity') < boundary_qty[0], boundary_qty[0])
                                    .otherwise(col('Quantity')))
    return df

showoutput(capping_unitPrice_Qt)
This version, with @dltwithdebug, runs and produces the table as well as the values that I need:
I really cannot work out what is wrong with the way this is written. I would appreciate any input or advice. Thank you very much!
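
P.S. In case it helps to see what the when/otherwise chains are meant to do, here is a minimal plain-Python sketch of the capping logic. It is only an illustration: `cap_to_bounds` is a hypothetical helper, and the prices and bounds are made-up values standing in for the real columns and the approxQuantile() results. It is not the pipeline code itself.

```python
def cap_to_bounds(values, lower, upper):
    """Clamp each value into [lower, upper], mirroring the when/otherwise chain."""
    capped = []
    for v in values:
        if v > upper:        # same role as .when(col > boundary[1], boundary[1])
            capped.append(upper)
        elif v < lower:      # same role as .when(col < boundary[0], boundary[0])
            capped.append(lower)
        else:                # same role as .otherwise(col)
            capped.append(v)
    return capped


# Made-up unit prices and bounds, for illustration only.
unit_prices = [0.5, 2.0, 3.5, 120.0]
lower, upper = 1.0, 10.0
capped_prices = cap_to_bounds(unit_prices, lower, upper)
print(capped_prices)
```

Values above the upper bound are replaced by the upper bound, values below the lower bound by the lower bound, and everything else is left unchanged.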