
Comments (48)

tryx78 avatar tryx78 commented on September 3, 2024

It works for me now with @Grant-CP's repo, thanks.
I'm using an M3 Pro 14-inch with 18 GB of RAM.


Grant-CP avatar Grant-CP commented on September 3, 2024

@tryx78 Glad it's working for you! Can you tell me how long it takes? Just the overall time for prompt evaluation and how many frames you set in the video import node.

Also, can you confirm that it runs just fine without launching ComfyUI as PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py to set the proper fallback? I believe that was necessary on my M1, but I forgot to write it in the repo originally.
@melMass I'm not sure if setting an environment variable like this fits into your patch solution.
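For illustration, a tiny wrapper script that sets the variable from Python before launching ComfyUI might look like the sketch below (purely hypothetical; it assumes the file sits next to ComfyUI's main.py, and the variable generally needs to be set before torch initializes the MPS backend):

import os
import runpy

# Set the CPU fallback before anything imports torch and initializes MPS.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Run ComfyUI's entry point as if it were invoked as `python main.py`.
runpy.run_path("main.py", run_name="__main__")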

@melMass Thanks. Good idea with the patch. One change I didn't make is that the code in that file is still using CUDA as the execution provider for the onnx networks. I think it will just fall back to CPU, but it would be nice not to have the error message any more. I know onnxruntime also has the CoreML execution provider, which I believe is fairly new but would be great to use if it works.
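As a rough illustration only (the model path and provider priority below are placeholders, not the repo's code), provider selection with a CPU fallback could look something like:

import onnxruntime

# Prefer CoreML when the onnxruntime build has it, otherwise CUDA, otherwise CPU,
# instead of hardcoding CUDAExecutionProvider and triggering the warning.
available = onnxruntime.get_available_providers()
priority = ["CoreMLExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in priority if p in available]

session = onnxruntime.InferenceSession("models/landmark.onnx", providers=providers)
print("using providers:", session.get_providers())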


tryx78 avatar tryx78 commented on September 3, 2024

@Grant-CP Now I get this error:
Error occurred when executing LivePortraitProcess: The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

How do I fix this?


tryx78 avatar tryx78 commented on September 3, 2024

@Grant-CP
48 frames in 38.80 seconds with PYTORCH_ENABLE_MPS_FALLBACK=1
48 frames in 53.46 seconds without PYTORCH_ENABLE_MPS_FALLBACK=1


Grant-CP avatar Grant-CP commented on September 3, 2024

@tryx78 Thanks so much! I added the Pytorch fallback to the README of my repo.

Good to see that your M3 is so much faster. Let me know if you run into any other issues!


kijai avatar kijai commented on September 3, 2024

I changed all the hardcoded CUDA stuff to use the comfy device detection, and put that one tensor operation in a try/except block since it probably fails on MPS, but I can't test whether that's enough.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Great, I've looked through the changes and I will test them out later.

In line 62 of liveportrait/modules/dense_motion.py, if you want to change the code for the assertion error, I believe the error is: AttributeError: module 'torch.mps' has no attribute 'FloatTensor'.

It's a stupid error that's been in PyTorch for multiple years. Presumably the error will change from AttributeError to something more reasonable in a future version of PyTorch.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I had a chance to test it and I get a torch autocast error. This might be a bug on comfy's end as it seems to me like their get_autocast_device should handle mps not being supported. See the error here:

Traceback (most recent call last):
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/nodes.py", line 283, in process
    cropped_frames, full_frame = pipeline.execute(img, driving_images_np)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_pipeline.py", line 53, in execute
    x_s_info = self.live_portrait_wrapper.get_kp_info(I_s)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_wrapper.py", line 95, in get_kp_info
    with torch.autocast(device_type=get_autocast_device(self.device_id), dtype=torch.float16, enabled=self.cfg.flag_use_half_precision):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'

In my repo I was only able to get it working by disabling autocast entirely (not even setting device to cpu works). For example:

with torch.no_grad():
    # HACK
    # with torch.autocast(device_type='cpu', dtype=torch.float16, enabled=self.cfg.flag_use_half_precision):
    #     feature_3d = self.appearance_feature_extractor(x)
    feature_3d = self.appearance_feature_extractor(x)

Again, I'm not sure if there's a way to fix this with comfy's autocast manager. I won't have time to look into that today.
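To make the idea concrete, here is a minimal sketch of skipping autocast on devices that don't support it (maybe_autocast is a hypothetical helper, not comfy's get_autocast_device):

from contextlib import nullcontext

import torch

def maybe_autocast(device_type: str, enabled: bool, dtype=torch.float16):
    # Return an autocast context on cuda/cpu and a no-op context otherwise;
    # MPS currently rejects torch.autocast, so it falls through to nullcontext().
    if enabled and device_type in ("cuda", "cpu"):
        return torch.autocast(device_type=device_type, dtype=dtype)
    return nullcontext()

# usage sketch:
# with torch.no_grad(), maybe_autocast(device_type, enabled=self.cfg.flag_use_half_precision):
#     feature_3d = self.appearance_feature_extractor(x)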


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's autocast error report above.)

I've done it like that before too, yeah. We can just use the model manager's MPS detection to skip the whole autocast conditionally.


kijai avatar kijai commented on September 3, 2024

@Grant-CP Can you try now? Skipping the whole autocast based on the dtype.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I'm getting the same error; a truncated traceback is below. I don't believe I've set any half_precision flags, so I assume that comes from elsewhere in the code. I am using the fp16 models though, as I assume most people would.

Do you think try: ... except RuntimeError: would be bad? Another option would be to use torch.backends.mps.is_available(), which should return True only on Macs. I'm not sure if there's another issue you are trying to solve by moving the half_precision flag out of the autocast() call, though.

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_pipeline.py", line 53, in execute
    x_s_info = self.live_portrait_wrapper.get_kp_info(I_s)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_wrapper.py", line 94, in get_kp_info
    with torch.autocast(get_autocast_device(self.device_id), dtype=torch.float16) if self.cfg.flag_use_half_precision else nullcontext():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's truncated autocast traceback above.)

This is with fp32 selected though? I should probably automate that.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai My apologies, I had the model loader pipeline set to fp16. So the idea of this code is to force MPS users to use fp32 for these particular models, because MPS doesn't support autocast. My mistake! I think some other parts of ComfyUI will print a message like "mixed precision is not supported on this device, reverting to full precision". If you wanted to be nice to Mac users, you could check cfg.flag_use_half_precision and, if it's true on MPS, throw a descriptive error message.
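Something like the guard below is what I mean (just a sketch; cfg stands for this repo's inference config with its flag_use_half_precision attribute, and the exact wording is up to you):

import torch

def check_mps_precision(cfg) -> None:
    # Fail early with a readable message instead of the autocast RuntimeError.
    if torch.backends.mps.is_available() and cfg.flag_use_half_precision:
        raise ValueError(
            "fp16/autocast is not supported on MPS. "
            "Please select fp32 in the model loader."
        )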

It will be interesting to test whether setting it to fp32 will hurt performance. I'm not sure what the default was in the original repo, but I assume it was fp16 since I was also running into an error on this line originally?

Anyways, we are down to the last error, which is:

  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/modules/dense_motion.py", line 81, in forward
    deformed_feature = self.create_deformed_feature(feature, sparse_motion)  # (bs, 1+num_kp, c=4, d=16, h=64, w=64)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/modules/dense_motion.py", line 50, in create_deformed_feature
    sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/nn/functional.py", line 4353, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

I solved this one by running my comfyui with PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py. One user near the top of this thread sounded like they maybe got it to work more slowly without this flag on their M3 macbook? They weren't totally clear.

An option might be to move both feature_repeat and sparse_motion to cpu before this operation? I assume that's what the fallback flag does. Unfortunately the pytorch function grid_sample looks like just a wrapper for non-python code.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I can confirm that the node works with the env flag set, so your node is just as functional. I do think it is marginally slower, maybe because of the fp32 setting on the model loader? For example, 57 seconds originally vs 60 seconds now for 32 frames.

I think it would be good to skip the autocast context on MPS, rather than relying on MPS users loading everything in fp32. I'm going to check and see if I can skip the fallback call, though.


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's comment above.)

I added an 'auto' option for the precision choice in the loader; all it does is change the flag. With fp32 the autocast context should be skipped.


kijai avatar kijai commented on September 3, 2024

I have always been confused about whether MPS supports fp16 at all. From what I understand it should, and it's just torch autocast that doesn't? If we can't use autocast, it would probably take more work overall to manually cast things.

Also, I think fp32 is just the default in this code until the autocasts anyway. I'm not really an expert in all that.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I'm pretty sure MPS supports float16. See the following code:

a = torch.Tensor([[1., 1.]]).to('mps')
a = a.to(torch.float16)
print(a.dtype)   # torch.float16
a = a.to(torch.bfloat16)
print(a.dtype)   # torch.bfloat16
print(a.device)  # mps:0

I'm pretty sure some stuff was being done in float16 before since your node is slightly slower which I confirmed at a few more sizes. It could be related to other parts of the code maybe though.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I got it working without the flag. Again it's a little bit slower than running my repo with the fallback flag, but I suspect it is because of the precision. For example, originally 600 frames was 650 seconds, now it is 692 seconds. Not the worst drop though. See the following code:

In dense_motion.py line 50

def create_deformed_feature(self, feature, sparse_motions):
    bs, _, d, h, w = feature.shape
    feature_repeat = feature.unsqueeze(1).unsqueeze(1).repeat(1, self.num_kp+1, 1, 1, 1, 1, 1)  # (bs, num_kp+1, 1, c, d, h, w)
    feature_repeat = feature_repeat.view(bs * (self.num_kp+1), -1, d, h, w)                     # (bs*(num_kp+1), c, d, h, w)
    sparse_motions = sparse_motions.view((bs * (self.num_kp+1), d, h, w, -1))                   # (bs*(num_kp+1), d, h, w, 3)
    #HACK
    if torch.backends.mps.is_available():
        print('converting mps tensors to cpu for grid_sample')
        feature_repeat = feature_repeat.to('cpu')
        sparse_motions = sparse_motions.to('cpu')
        sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False).to('mps')
    else:
    #HACK END
        sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False)
    sparse_deformed = sparse_deformed.view((bs, self.num_kp+1, -1, d, h, w))
    return sparse_deformed

In util.py line 157

def forward(self, x):
        out = self.conv(x)
        out = self.norm(out)
        out = F.relu(out)
        #HACK
        try:
            out = self.pool(out)
        except NotImplementedError:
            out = self.pool(out.to('cpu')).to('mps')
        #HACK End
        return out

In warping_network.py line 12, then line 49

#HACK
import torch.backends.mps as mps

#line 49
def deform_input(self, inp, deformation):
        #HACK
        if mps.is_available():
            return F.grid_sample(inp.to('cpu'), deformation.to('cpu'), align_corners=False).to('mps')
        #HACK END
        return F.grid_sample(inp, deformation, align_corners=False)

I assume replacing the mps.is_available() call with a comfyui alternative would be good. From brief testing this seems about the same speed as the pytorch fallback flag?

A better code option might be to create a wrapper for F.grid_sample that moves the inputs to cpu when on mps and then returns an mps tensor, or to define an mps_grid_sample higher up that does the same and import and use it in the two places where we need it. The other unsupported operation we need is nn.AvgPool3d, which is pretty silly not to be supported. I don't think I'll have time in the next few weeks to rewrite the code to not need the 3d pool, and I don't understand the grid_sample usage well enough to rewrite it.
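A minimal sketch of that wrapper idea (the name mps_grid_sample is mine, not code from the repo):

import torch
import torch.nn.functional as F

def mps_grid_sample(input, grid, align_corners=False):
    # aten::grid_sampler_3d has no MPS kernel, so when the tensors live on MPS
    # we sample on the CPU and move the result back to the original device.
    if input.device.type == "mps":
        return F.grid_sample(
            input.to("cpu"), grid.to("cpu"), align_corners=align_corners
        ).to(input.device)
    return F.grid_sample(input, grid, align_corners=align_corners)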


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai If there's a good way to set that fallback environment variable just for the execution of this node, then that's probably a better way to make this future-proof. My manual swapping of tensors doesn't seem to be faster (or much slower, surprisingly).

Another, better way to write my code would be to have all three blocks use the except NotImplementedError pattern. I think I like that the best, and it will set us up well for the future.


cchance27 avatar cchance27 commented on September 3, 2024

Silly question, but given the small size of the insightface and LivePortrait models, I'm wondering if anyone has tried converting them over to CoreML, since the ANE would likely be the fastest way to run things, no?


Grant-CP avatar Grant-CP commented on September 3, 2024

@cchance27 Can you point to any other places where you've seen that done? I have no idea where to start, but I had thought the same thing. CoreML requires using Swift right?


x4080 avatar x4080 commented on September 3, 2024

@Grant-CP No, Python can call C code for CoreML, I think.

Edit: using coremltools


Grant-CP avatar Grant-CP commented on September 3, 2024

@x4080 Thanks I read through that a bit. It looks like converting from onnx to CoreML is a little annoying at the moment but definitely very possible.

I also see in deepinsight/insightface#2238 that insightface onnx seems to be supported with CoreML as the execution provider on recent versions of the onnx runtime. I'll probably try that as a first option. I doubt it's as fast as full conversion and compilation, but I bet it's way better than CPUExecutionProvider.


Grant-CP avatar Grant-CP commented on September 3, 2024

Looks like the CoreMLExecutionProvider can work for some parts, but not for the main FaceAnalysisDIY call. I get the error below. It sounds to me like CoreML expects statically sized tensors at each step of the way. Where the CoreML runtime is getting its idea of what size this tensor should be, I have no idea. I tried adjusting the size of the input image, the input video, and the number of video frames, and none of them changed the size of this tensor. I also made sure I had the latest onnx and onnxruntime.

So the moral of the story is that I'm not going to work on CoreML for now. While the cropper can work on CoreML, the main meat of the process cannot. I think the main changes that we will make with @kijai are to try to support fp16 on Mac, and to make inference work without the fallback flag; both have implementations described in this thread.

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/utils/face_analysis_diy.py", line 47, in get
    bboxes, kpss = self.det_model.detect(img_bgr, max_num=max_num, metric='default')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py", line 224, in detect
    scores_list, bboxes_list, kpss_list = self.forward(det_img, self.det_thresh)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py", line 152, in forward
    net_outs = self.session.run(self.output_names, {self.input_name : blob})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running CoreML_1312385619456144913_6 node. Name:'CoreMLExecutionProvider_CoreML_1312385619456144913_6_6' Status Message: Exception: /Users/runner/work/1/s/onnxruntime/core/providers/coreml/model/model.mm:71 InlinedVector<int64_t> (anonymous namespace)::GetStaticOutputShape(gsl::span<const int64_t>, gsl::span<const int64_t>, const logging::Logger &) inferred_shape.size() == coreml_static_shape.size() was false. CoreML static output shape ({1,1,1,2048,1}) and inferred shape ({3200,1}) have different ranks.


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's CoreML error report above.)

So the onnx option should include CoreML then? The version currently in the dev branch separates the process into the cropper and the rest anyway.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai My mistake, the cropper is the one with FaceAnalysisDIY, and that fails with CoreML. It's the LandmarkRunner that can handle CoreML, although there's no speedup, and it might just be falling back to CPU anyway. So I would not suggest making CoreML an option at this time. It might be reducing heat production on my MacBook, but that's hard for me to measure, and there's no speedup.

If I set the onnx provider to CoreML (screenshot below) then that is when I get the error. I can hardcode the LandmarkRunner to CoreML (regardless of the node's provider choice) and that works, but either choosing CoreML in the node or hardcoding FaceAnalysisDIY to use CoreML results in the same error message above.

[screenshot: the node's onnx provider set to CoreML]
class LandmarkRunner(object):
    """landmark runner"""
    def __init__(self, **kwargs):
        ckpt_path = kwargs.get('ckpt_path')
        onnx_provider = kwargs.get('onnx_provider', 'cuda')  # default to cuda
        device_id = kwargs.get('device_id', 0)
        self.dsize = kwargs.get('dsize', 224)
        self.timer = Timer()

        #HACK
        self.session = onnxruntime.InferenceSession(
                ckpt_path, providers=[
                    ('CoreMLExecutionProvider', {'device_id': device_id})
                ]
            )

        # if onnx_provider.lower() == 'cuda':
        #     self.session = onnxruntime.InferenceSession(


Grant-CP avatar Grant-CP commented on September 3, 2024

Here's the piece of the cropper that errors. If I select "CoreML" in the node or if I hard code it (commented line) I believe the same exact code gets run.

self.face_analysis_wrapper = FaceAnalysisDIY(
            name='buffalo_l',
            root=os.path.join(folder_paths.models_dir, 'insightface'),
            #HACK
            #providers = ['CoreMLExecutionProvider']
            providers=[provider + 'ExecutionProvider',]
        )


cchance27 avatar cchance27 commented on September 3, 2024

Sorry, I haven't been around to respond since I opened the ANE/CoreML rabbit hole :)

If you want to monitor ANE and GPU usage, you can use asitop while the run is happening to see what it's executing on during inference.

I get the error below. It sounds to me like CoreML expects statically sized tensors at each step of the way.

Yes, CoreML is normally statically sized. CoreML models can be compiled with a set of supported tensor sizes, or compiled to support a ranged set, but I imagine that might not work using onnx without a pre-conversion.
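For reference, a hedged coremltools sketch of a ranged input shape (the toy model and size bounds below are placeholders; converting the actual LivePortrait/insightface networks is untested here):

import coremltools as ct
import torch

# Stand-in model just to show the conversion call; not one of the real networks.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()
traced = torch.jit.trace(model, torch.rand(1, 3, 512, 512))

# RangeDim lets the compiled Core ML model accept a range of spatial sizes
# instead of a single static shape.
flexible_input = ct.TensorType(
    name="image",
    shape=(1, 3, ct.RangeDim(256, 1024), ct.RangeDim(256, 1024)),
)
mlmodel = ct.convert(traced, inputs=[flexible_input])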


Grant-CP avatar Grant-CP commented on September 3, 2024

@cchance27 Thanks for letting me know about asitop. It looks like it might not totally support Sonoma, but I'll give it a try.

I wasn't able to find info about the interaction between static sizes and onnxruntime. My instinct would be to look at the onnx graph/metadata itself, edit it, and see how the error message changes.

I also imagine that onnxruntime doesn't get to use the ANE, since it sounds like CoreML programs have to be specifically compiled with support for it?


Creative-comfyUI avatar Creative-comfyUI commented on September 3, 2024

Is there a solution for the bug "The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device" without affecting the speed? (Mac M2) Thanks.


Grant-CP avatar Grant-CP commented on September 3, 2024

@Creative-comfyUI I have a pull request pending here: #59, which avoids the need for the torch fallback flag. As for not affecting speed, I'm not sure that operation is really the bottleneck anyway, and I think it's only called about 20 times. So you can use my fork (not the one which says "-mps", but the actual fork from the pull request) to get the fix.

I think that @kijai has been spending time working on other nodes and on the dev branch for now. No idea when that pull request will be merged, but they said they were interested in merging it.


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's comment above.)

I'll merge the dev branch as soon as I dare; it's kind of an overhaul of everything by now, so I'd appreciate it if you could test it with MPS now. I'm mostly worried about whether Kornia works on MPS; it allows doing the final composition on the GPU, so it's faster overall.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Ok, I figured it was something like that. I tried the dev branch probably about a week and a half ago, and it worked on MPS with some of the same changes. I swapped back because the nodes and workflow were different.

@kijai Will using the most obvious workflow test all the things you want to test? Like, is Kornia integrated into the main inference node? Maybe there's a new example workflow in the dev branch anyway, so I'll check that when I test.


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's comment above.)

The example workflow currently in develop will do. I gained a couple of it/s when using Kornia and haven't run into any issues, so I decided to keep it, as kornia is also part of ComfyUI's default requirements now.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Here's some results from my testing. I'm using the vid2vid workflow with d6 as driving and d3 as target. 32 frames.

First off, fp16 is not supported because of autocast not being supported. As we discussed earlier on the main branch, I think fp16 does work as long as you skip the autocast context completely on MPS:

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_wrapper.py", line 71, in get_kp_info
    with torch.autocast(get_autocast_device(self.device_id), dtype=torch.float16) if self.cfg.flag_use_half_precision else nullcontext():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'

All further stuff is done in fp32. At this point, I can quickly get to "animating", then we get to the same issues that my pull request solves:

  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/modules/dense_motion.py", line 50, in create_deformed_feature
    sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/nn/functional.py", line 4353, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Going forward, I'm running comfyui with PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py.

I get 1.10s/it in the animating phase. It takes a little longer (74s total vs about 65 before), but presumably that's vid2vid being slow as opposed to anything else. Unfortunately, I get a blank output (the left half of the video is the driving video, the right half is a black square), likely related to the error below. I just updated ComfyUI this morning, so I don't think it's a versioning issue:

/Users/grant/Documents/Repos/ComfyUI/comfy/utils.py:548: RuntimeWarning: invalid value encountered in cast
  images = [Image.fromarray(np.clip(255. * image.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8)) for image in samples]

Also worth noting that if I had used the requirements.txt file, onnxruntime-gpu would have given me an error on Mac.

@kijai I can make a kind of weird video that is correctly animated (face-wise) using the "cropped-images" output of the LivePortraitProcess node, so however the full_image is being created is the problem. Below is the video I get by combining the cropped images. I feel like we are pretty close: fix this issue, then redo my pull request to solve the fallback issue, and it should work?

LivePortrait_00002.mp4


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai It's not a dim mismatch issue or something like that, because I tried again with d6.mp4 as both driving and target, and it still gave me a (slightly different) error and black squares as the "full_images" output:

/Users/grant/Documents/Repos/ComfyUI/nodes.py:1437: RuntimeWarning: invalid value encountered in cast
  img = Image.fromarray(np.clip(i, 0, 255).astype(np.uint8))


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai One last comment. I tried again with different target and driving videos, and the whole process took about 50 seconds. So the dev branch actually seems to be quite a bit faster, even in fp32, than any other version I've tried so far. Usually model loading is quite fast on MacBooks, but there must have been some initial load time or something on my first run.

When I Google the error I run into onnx/onnx#4774, which talks about the same error message for converting to int64 on Mac. Is it possible that somewhere a number is using float64 or something? I know that MPS doesn't support float64, but I would expect a more aggressive error message. Maybe ComfyUI is catching the real error message?


kijai avatar kijai commented on September 3, 2024

The invalid value error is due to something generating NaNs, which is probably exactly the Kornia or torch operations being done. It's probably worth running that part on the CPU to see if it's the cause.

I don't think anything is fp16 until the autocasts; if you skip them, everything should remain fp32, as nothing in the code does any explicit casting.
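For illustration, the warning reproduces in plain numpy whenever a NaN reaches the uint8 cast (a standalone sketch, not ComfyUI code):

import numpy as np

image = np.array([[0.5, np.nan]])

# np.clip leaves NaN untouched, and casting NaN to uint8 is undefined, which
# triggers "RuntimeWarning: invalid value encountered in cast" and usually
# shows up as black pixels in the saved frames.
out = np.clip(255. * image, 0, 255).astype(np.uint8)
print(out)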


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Last update for now. Confirming that even making sure I installed the current ComfyUI base repo requirements (Downloading kornia-0.7.3-py2.py3-none-any.whl (833 kB)) does not fix the issue. I still get the same flavor of error:

/Users/grant/Documents/Repos/ComfyUI/comfy/utils.py:548: RuntimeWarning: invalid value encountered in cast
  images = [Image.fromarray(np.clip(255. * image.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8)) for image in samples]

Looks like the error can occur in a number of different places depending on input sizes and other factors. This most recent one is inside the lanczos function in utils.py, whereas the other was in the save_image function in nodes.py. I'm pretty sure the offending call is .astype(np.uint8), but why it's complaining I don't know.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai I think the tensor is already moved to cpu before the casting to uint8 happens. See for example:

def lanczos(samples, width, height):
    images = [Image.fromarray(np.clip(255. * image.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8)) for image in samples]
    images = [image.resize((width, height), resample=Image.Resampling.LANCZOS) for image in images]

I think I may be confused about what you are asking, though. I've already set "CPU" as my onnx device. Is there a particular place in the LivePortraitProcess code where you think I should try moving things to cpu?


kijai avatar kijai commented on September 3, 2024

Doesn't lanczos generally work on Macs? It has to be done on the CPU, and I just defaulted to it as it's the best quality. You can easily change it by changing this bit of the code (nodes.py) to something like "bilinear":

[screenshot: the resize method argument in nodes.py]

With Kornia, it should be possible to run it on the CPU by changing the device in the _transform_img_kornia function.


Creative-comfyUI avatar Creative-comfyUI commented on September 3, 2024

(Quoting @Grant-CP's and @kijai's comments above.)

I will be happy to run tests for you; tell me what I have to do. I am not sure about MPS, and I am wondering why the Mac Metal acceleration didn't do the job. I am not sure that using PYTORCH_ENABLE_MPS_FALLBACK=1 will work better.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Yes, Lanczos works just fine on Mac (I tested earlier today before reporting the errors). That's why I'm surprised I'm getting an error out of base ComfyUI; I might have to raise an issue there once we figure this out. I'll try messing with the Kornia calls and see if it can work.

@Creative-comfyUI For you, the easiest thing to do is to run PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py instead of just python main.py when starting ComfyUI, and to make sure you use fp32 for the models. If you want to dive deeper into the weeds, do the same thing (set the fallback flag) but use the develop branch of this repo. It comes with a vid2vid example, which is what we are currently working on.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Perfect suggestion. Running kornia on CPU doesn't seem to take any time? Maybe it will matter for bigger images and bigger frame counts. Here is a simple fix that worked for me:

    """
    dsize: Target shape (width, height).
    """
    #HACK
    # device = mm.get_torch_device()
    device = torch.device('cpu')
    # Convert dsize to tensor shape (H, W)
    _dsize = torch.tensor([dsize[1], dsize[0]])  # Kornia expects (H, W)

@kijai What would you like from me, then? A pull request onto the dev branch which implements an MPS check in the Kornia function, along with try/except blocks like in my previous pull request to the main branch? Separate ones? Or do you just want to edit and make commits yourself?

An alternative could be to try and figure out exactly where in the kornia transform function we get an error and try to implement a try/except block in there.


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's comment above.)

I'll handle it, thank you for testing. I might still separate the whole last phase into another node, to allow some options for the composition and to make it optional for performance gains in situations where the composited image isn't even wanted.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Great! The results and speed have come a long way.

The only other error I ran into was when I had an image as the target (not a video). Let's say I have 128 frames of video and 32 frames of image; once I get to processing the 33rd frame I get:

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_pipeline.py", line 155, in execute
    R_new = driving_rot_list_smooth[i]

The same sort of error occurs with a single unbatched image, of course, but then you don't have to wait for 32 frames to be processed before hitting the error. It might be convenient to have a check at the beginning of the LivePortraitProcess node that both image batch inputs have the same batch dimension. Two other reasonable choices could be cropping the larger batch down to the smaller of the two batch sizes, or growing the smaller batch to the larger of the two batch sizes. The second would make inputting a single image very easy.
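A minimal sketch of the second option, growing the smaller batch by repeating its last frame (match_batch is a hypothetical helper, not code in the repo):

import torch

def match_batch(source: torch.Tensor, driving: torch.Tensor):
    # Repeat the last frame of the smaller image batch (dim 0) so both inputs
    # end up with the same number of frames.
    n = max(source.shape[0], driving.shape[0])

    def grow(t: torch.Tensor) -> torch.Tensor:
        if t.shape[0] >= n:
            return t
        pad = t[-1:].repeat(n - t.shape[0], *([1] * (t.dim() - 1)))
        return torch.cat([t, pad], dim=0)

    return grow(source), grow(driving)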


kijai avatar kijai commented on September 3, 2024

(Quoting @Grant-CP's comment above.)

Yeah the new smoothing stuff doesn't take into account mismatching frame counts.

I pushed your changes and the Kornia CPU fallback to develop now.


Grant-CP avatar Grant-CP commented on September 3, 2024

@kijai Works great without any fallback environment variable. The only thing preventing it from working out of the box for mac users is the onnxruntime-gpu line in the requirements file.

A few other minor comments about the included example workflow.

  • I would use the Video Info (Loaded) node to get the frame rate from the driving video and connect that to the last video combine node.
  • I would connect the audio from the driving video loader to the last video combine node.
  • To help test using a single image as target, I might include a Load Image and RepeatImageBatch node, where the "amount" on the repeater node is converted to input and connected to the "Get Image Size and Count" node for the driving image. I might bypass those nodes, or just leave them with no connected output node.

Just for logging purposes, I get a warning the first time I run it. This warning has mystified me from the beginning of working on this project, since it implies that some torch operations automatically have a cpu fallback even without the environment variable being set.

Animating...:   0%|                                                                                                    | 0/32 [00:00<?, ?it/s]
/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/nn/functional.py:4032: UserWarning: The operator 'aten::upsample_nearest3d.vec' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)

